我编写了一个应用程序,它解析imdb (从它们的公共ftp文件)数据,并将其存储到我选择的数据库中。
我的两个数据库表有超过100万条记录(2.316.952和1.392.866)。我的应用程序的目的是在每次用户单击按钮时选择1部电影(其中的数据驻留在两个表中)。所以我的问题是,什么是存储我的数据最有效的方法?
我试过PostgreSQL,我对此很满意。我唯一的问题是很难找到安装了PostgreSQL的廉价主机。
我试过MySQL,但出于某种原因,它比Postgres慢得多。如果Postgres select将花费3.5s,MySQL将超过10s。有什么好主意吗?MySQL使用了innodb引擎。
总结一下:有谁能想到更好的存储数据的方法吗?为什么MySQL慢一点?
编辑:
这有点问题,我使用hibernate/JPA,使用动态查询(hibernate api生成所有SQL)。
至于表格:
CREATE TABLE "movies" (
"id" INTEGER NOT NULL,
"name" TEXT NULL DEFAULT NULL,
"rank" DOUBLE PRECISION NOT NULL,
"releaseyear" INTEGER NULL DEFAULT NULL,
"summary" TEXT NULL DEFAULT NULL,
"votes" BIGINT NOT NULL,
PRIMARY KEY ("id")
)
CREATE TABLE "movies_genres" (
"moviejpa_id" INTEGER NOT NULL,
"genres_id" INTEGER NOT NULL
)
CREATE TABLE "genres" (
"id" INTEGER NOT NULL,
"name" VARCHAR(255) NULL DEFAULT NULL,
PRIMARY KEY ("id")
)表也由hibernate/jpa创建。
数据示例:



编辑(管理打印hibernate生成的查询):
select
moviejpa0_.id as col_0_0_
from
movies moviejpa0_
inner join
movies_genres genres1_
on moviejpa0_.id=genres1_.MovieJpa_id
inner join
genres genrejpa2_
on genres1_.genres_id=genrejpa2_.id
where
(
genrejpa2_.id in (
? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ?
)
)
and (
moviejpa0_.rank between 0.0 and 10.0
)
and (
moviejpa0_.votes between 0 and 1814484
)
select
moviejpa0_.id as id1_4_0_,
moviejpa0_.name as name2_4_0_,
moviejpa0_.rank as rank3_4_0_,
moviejpa0_.releaseYear as releaseY4_4_0_,
moviejpa0_.summary as summary5_4_0_,
moviejpa0_.votes as votes6_4_0_,
genres1_.MovieJpa_id as MovieJpa1_5_1_,
genrejpa2_.id as genres_i2_5_1_,
genrejpa2_.id as id1_3_2_,
genrejpa2_.name as name2_3_2_
from
movies moviejpa0_
left outer join
movies_genres genres1_
on moviejpa0_.id=genres1_.MovieJpa_id
left outer join
genres genrejpa2_
on genres1_.genres_id=genrejpa2_.id
where
moviejpa0_.id=?编辑:
Postgre计划:(我已经有一段时间没有直接使用数据库了,我使用了explain SQL,不知道这是否仍然正确)
第一个SQL:
Nested Loop Left Join (cost=0.43..39238.32 rows=2 width=1108)
Join Filter: (genres1_.genres_id = genrejpa2_.id)
-> Nested Loop Left Join (cost=0.43..39222.37 rows=2 width=588)
-> Materialize (cost=0.00..12.10 rows=140 width=520)
Join Filter: (moviejpa0_.id = genres1_.moviejpa_id)
-> Seq Scan on movies_genres genres1_ (cost=0.00..39213.90 rows=2 width=8)
-> Seq Scan on genres genrejpa2_ (cost=0.00..11.40 rows=140 width=520)
-> Index Scan using movies_pkey on movies moviejpa0_ (cost=0.43..8.45 rows=1 width=580)
Index Cond: (id = 916424)
Filter: (moviejpa_id = 916424)第二个SQL:
Hash Join (cost=84978.62..150810.17 rows=579193 width=4)
Hash Cond: (genres1_.moviejpa_id = moviejpa0_.id)
-> Hash Join (cost=17.96..47920.43 rows=579238 width=4)
Hash Cond: (genres1_.genres_id = genrejpa2_.id)
-> Seq Scan on movies_genres genres1_ (cost=0.00..33421.52 rows=2316952 width=8)
-> Hash (cost=17.53..17.53 rows=35 width=4)
-> Seq Scan on genres genrejpa2_ (cost=0.00..17.53 rows=35 width=4)
Filter: (id = ANY ('{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34}'::integer[]))
-> Hash (cost=62116.86..62116.86 rows=1392384 width=4)
-> Seq Scan on movies moviejpa0_ (cost=0.00..62116.86 rows=1392384 width=4)
Filter: ((rank >= '0'::double precision) AND (rank <= '10'::double precision) AND (votes >= 0) AND (votes <= 1814484))MySQL计划:
第一个sql:
<row>
<field name="id">1</field>
<field name="select_type">SIMPLE</field>
<field name="table">moviejpa0_</field>
<field name="partitions" xsi:nil="true" />
<field name="type">const</field>
<field name="possible_keys">PRIMARY</field>
<field name="key">PRIMARY</field>
<field name="key_len">4</field>
<field name="ref">const</field>
<field name="rows">1</field>
<field name="filtered">100.00</field>
<field name="Extra" xsi:nil="true" />
</row>
<row>
<field name="id">1</field>
<field name="select_type">SIMPLE</field>
<field name="table">genres1_</field>
<field name="partitions" xsi:nil="true" />
<field name="type">ref</field>
<field name="possible_keys">FK1oevm2ns4a61icpilhyq1dwr7</field>
<field name="key">FK1oevm2ns4a61icpilhyq1dwr7</field>
<field name="key_len">4</field>
<field name="ref">const</field>
<field name="rows">1</field>
<field name="filtered">100.00</field>
<field name="Extra" xsi:nil="true" />
</row>
<row>
<field name="id">1</field>
<field name="select_type">SIMPLE</field>
<field name="table">genrejpa2_</field>
<field name="partitions" xsi:nil="true" />
<field name="type">eq_ref</field>
<field name="possible_keys">PRIMARY</field>
<field name="key">PRIMARY</field>
<field name="key_len">4</field>
<field name="ref">test.genres1_.genres_id</field>
<field name="rows">1</field>
<field name="filtered">100.00</field>
<field name="Extra" xsi:nil="true" />
</row>第二个sql:
<field name="id">1</field>
<field name="select_type">SIMPLE</field>
<field name="table">moviejpa0_</field>
<field name="partitions" xsi:nil="true" />
<field name="type">ALL</field>
<field name="possible_keys">PRIMARY</field>
<field name="key" xsi:nil="true" />
<field name="key_len" xsi:nil="true" />
<field name="ref" xsi:nil="true" />
<field name="rows">1321403</field>
<field name="filtered">1.23</field>
<field name="Extra">Using where</field>
</row>
<row>
<field name="id">1</field>
<field name="select_type">SIMPLE</field>
<field name="table">genres1_</field>
<field name="partitions" xsi:nil="true" />
<field name="type">ref</field>
<field name="possible_keys">FKabwobqnegu888274nercpwc9p,FK1oevm2ns4a61icpilhyq1dwr7</field>
<field name="key">FK1oevm2ns4a61icpilhyq1dwr7</field>
<field name="key_len">4</field>
<field name="ref">test.moviejpa0_.id</field>
<field name="rows">1</field>
<field name="filtered">100.00</field>
<field name="Extra">Using where</field>
</row>
<row>
<field name="id">1</field>
<field name="select_type">SIMPLE</field>
<field name="table">genrejpa2_</field>
<field name="partitions" xsi:nil="true" />
<field name="type">eq_ref</field>
<field name="possible_keys">PRIMARY</field>
<field name="key">PRIMARY</field>
<field name="key_len">4</field>
<field name="ref">test.genres1_.genres_id</field>
<field name="rows">1</field>
<field name="filtered">100.00</field>
<field name="Extra">Using index</field>
</row>编辑: 2017-06-13设法找出,这个查询占用了大部分时间:
select
moviejpa0_.id as col_0_0_
from
movies moviejpa0_
inner join
movies_genres genres1_
on moviejpa0_.id=genres1_.MovieJpa_id
inner join
genres genrejpa2_
on genres1_.genres_id=genrejpa2_.id
where
(
genrejpa2_.id in (
? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ? , ?
)
)
and (
moviejpa0_.rank between 0.0 and 10.0
)
and (
moviejpa0_.votes between 0 and 1814484
)
CREATE TABLE "movies_genres" (
"genres_id" INTEGER NOT NULL,
"moviejpa_id" INTEGER NOT NULL,
INDEX "" ("moviejpa_id"),
UNIQUE INDEX "UNIQUE" ("genres_id", "moviejpa_id")
)解释:
QUERY PLAN
-
Hash Join (cost=85187.47..150712.33 rows=576153 width=4)
Hash Cond: (genres1_.genres_id = moviejpa0_.id)
-> Hash Join (cost=17.96..47666.67 rows=576165 width=4)
Hash Cond: (genres1_.moviejpa_id = genrejpa2_.id)
-> Seq Scan on movies_genres genres1_ (cost=0.00..33244.59 rows=2304659 width=8)
-> Hash (cost=17.53..17.53 rows=35 width=4)
-> Seq Scan on genres genrejpa2_ (cost=0.00..17.53 rows=35 width=4)
Filter: (id = ANY ('{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34}'::integer[]))
-> Hash (cost=62271.76..62271.76 rows=1395660 width=4)
-> Seq Scan on movies moviejpa0_ (cost=0.00..62271.76 rows=1395660 width=4)
Filter: ((rank >= '0'::double precision) AND (rank <= '10'::double precision) AND (votes >= 0) AND (votes <= 1816967)) 解释分析:
QUERY PLAN
-
Hash Join (cost=85187.47..150712.33 rows=576153 width=4) (actual time=619.514..2039.717 rows=2304659 loops=1)
Hash Cond: (genres1_.genres_id = moviejpa0_.id)
-> Hash Join (cost=17.96..47666.67 rows=576165 width=4) (actual time=0.099..424.101 rows=2304659 loops=1)
Hash Cond: (genres1_.moviejpa_id = genrejpa2_.id)
-> Seq Scan on movies_genres genres1_ (cost=0.00..33244.59 rows=2304659 width=8) (actual time=0.025..108.536 rows=2304659 loops=1)
-> Hash (cost=17.53..17.53 rows=35 width=4) (actual time=0.038..0.038 rows=35 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on genres genrejpa2_ (cost=0.00..17.53 rows=35 width=4) (actual time=0.013..0.027 rows=35 loops=1)
Filter: (id = ANY ('{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34}'::integer[]))
-> Hash (cost=62271.76..62271.76 rows=1395660 width=4) (actual time=615.854..615.854 rows=1396310 loops=1)
Buckets: 131072 Batches: 32 Memory Usage: 2562kB
-> Seq Scan on movies moviejpa0_ (cost=0.00..62271.76 rows=1395660 width=4) (actual time=0.065..436.606 rows=1396310 loops=1)
Filter: ((rank >= '0'::double precision) AND (rank <= '10'::double precision) AND (votes >= 0) AND (votes <= 1816967))
Planning time: 0.964 ms
Execution time: 2071.123 ms EXPLAIN
QUERY PLAN
-
Nested Loop Left Join (cost=12.92..33.08 rows=2 width=1112)
Join Filter: (moviejpa0_.id = genres1_.genres_id)
-> Index Scan using movies_pkey on movies moviejpa0_ (cost=0.43..8.45 rows=1 width=584)
Index Cond: (id = 481380)
-> Hash Right Join (cost=12.49..24.61 rows=2 width=524)
Hash Cond: (genrejpa2_.id = genres1_.moviejpa_id)
-> Seq Scan on genres genrejpa2_ (cost=0.00..11.40 rows=140 width=520)
-> Hash (cost=12.46..12.46 rows=2 width=8)
-> Index Only Scan using movies_genres_pkey on movies_genres genres1_ (cost=0.43..12.46 rows=2 width=8)
Index Cond: (genres_id = 481380)
EXPLAIN analyze
QUERY PLAN
-
Nested Loop Left Join (cost=12.92..33.08 rows=2 width=1112) (actual time=0.883..0.910 rows=3 loops=1)
Join Filter: (moviejpa0_.id = genres1_.genres_id)
-> Index Scan using movies_pkey on movies moviejpa0_ (cost=0.43..8.45 rows=1 width=584) (actual time=0.219..0.221 rows=1 loops=1)
Index Cond: (id = 481380)
-> Hash Right Join (cost=12.49..24.61 rows=2 width=524) (actual time=0.655..0.679 rows=3 loops=1)
Hash Cond: (genrejpa2_.id = genres1_.moviejpa_id)
-> Seq Scan on genres genrejpa2_ (cost=0.00..11.40 rows=140 width=520) (actual time=0.028..0.034 rows=35 loops=1)
-> Hash (cost=12.46..12.46 rows=2 width=8) (actual time=0.589..0.589 rows=3 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Only Scan using movies_genres_pkey on movies_genres genres1_ (cost=0.43..12.46 rows=2 width=8) (actual time=0.557..0.563 rows=3 loops=1)
Index Cond: (genres_id = 481380)
Heap Fetches: 3
Planning time: 0.689 ms
Execution time: 1.026 ms 发布于 2017-06-06 04:27:36
使用PostgreSQL,
rank可以是real类型。NULL DEFAULT NULL。这是默认的。PRIMARY KEY (genres_id, moviejpa_id);text而不是varchar(255)。你可以像这样快速修复你现在的设置,
CREATE INDEX ON movies_genres(MovieJpa_id);
ALTER TABLE genres ADD PRIMARY KEY (id);
ALTER TABLE movies_genres ADD PRIMARY KEY (genres_id, moviejpa_id);
ANALYZE movies_genres;
ANALYZE genres;另外,还应该添加show work mem的结果;
发布于 2017-06-06 01:26:28
甚至3.5秒也是太多了。你需要索引。
这是很多:电影和体裁之间的许多映射,对吗?
CREATE TABLE "movies_genres" (
"moviejpa_id" INTEGER NOT NULL,
"genres_id" INTEGER NOT NULL
)它应该是(至少对于MySQL):
CREATE TABLE `movies_genres` (
`moviejpa_id` INT UNSIGNED NOT NULL,
`genres_id` SMALLINT UNSIGNED NOT NULL,
PRIMARY KEY(moviejpa_id, genres_id),
INDEX(genres_id, moviejpa_id)
) ENGINE=InnoDB;有关它的更多讨论,请参见这。
请为EXPLAIN SELECT ...尝试提供MySQL。
https://dba.stackexchange.com/questions/175485
复制相似问题