This is a very generic question that does not take the database backend into account. Firing 40 or 1000 machines at a database backend that cannot handle the load will gain you nothing. The problem is too broad to answer in a specific way. You should first get in touch with people inside your organization who have enough DB-level skills, and then come back with a more specific question.

Loading CSV data into a database is slow because it needs to read, split and validate the data.

So what you should try is this:

<ol>
<li>Set up a local database on each computer. This gets rid of the network latency.</li>
<li>Load a different part of the data on each computer. Try to give each computer a chunk of the same size. If that isn't easy for some reason, give each computer, say, 10,000 rows. When a computer is done, give it the next chunk.</li>
<li>Dump the data with the DB tools.</li>
<li>Load all dumps into a single DB.</li>
</ol>
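The "give each computer 10,000 rows, then the next chunk" idea from step 2 can be sketched roughly as follows. This is only an illustration: the function name `row_chunks` is hypothetical, and how the chunks actually reach the other machines is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def row_chunks(rows: Iterable[str], chunk_size: int = 10_000) -> Iterator[List[str]]:
    """Yield successive lists of at most chunk_size rows (hypothetical helper)."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return  # input exhausted
        yield chunk


# 25 fake rows handed out 10 at a time give chunks of 10, 10 and 5 rows.
sizes = [len(chunk) for chunk in row_chunks((f"row{i}" for i in range(25)), chunk_size=10)]
```

Because the rows are consumed lazily, the huge CSV never has to fit in memory; each machine simply asks for the next chunk when it has finished the previous one.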

Make sure that your loader tool can import data into a table that already contains data. If it can't, check your DB documentation for "remote table". Many databases let you make a table from another DB server visible locally.

That allows you to run commands like <code>insert into TABLE (....) select .... from REMOTE_SERVER.TABLE</code>.

If you need primary keys (and you should), you will also have the problem of assigning PKs during the import into the local DBs. I suggest adding the PKs to the CSV file.

[EDIT] After checking your edits, here is what you should try:

<ol>
<li>Write a small program that extracts the unique values in the first and second columns of the CSV file. That could be a simple script like:

<pre><code> cut -d";" -f1 | sort -u | nawk ' { print FNR";"$0 }'
</code></pre>

This is a pretty cheap process (a couple of minutes even for huge files). It gives you ID-value files.</li>
<li>Write a program which reads the new ID-value files, caches them in memory and then reads the huge CSV files and replaces the values with the IDs.

If the ID-value files are too big, just do this step for the small files and load the huge ones into all 40 per-machine DBs.</li>
<li>Split the huge file into 40 chunks and load one chunk on each machine.

If you had huge ID-value files, you can use the tables created on each machine to replace all the values that remained.</li>
<li>Use backup/restore or remote tables to merge the results.

Or, even better, keep the data on the 40 machines and use algorithms from parallel computing to split the work and merge the results. That's how Google can create search results from billions of web pages in a few milliseconds.</li>
</ol>

See <a href="http://java.dzone.com/announcements/forkjoin-slow-motion" rel="nofollow">here for an introduction</a>.
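Steps 1 and 2 of the edited list above can be sketched in a few lines, using the sample rows from the question. This is a minimal sketch, not the definitive implementation: the function names are hypothetical, IDs are assigned in Python's default string order (which may differ from what `sort -u` produces under your locale), and the whole input is assumed to fit in memory.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple


def build_dimension(values: Iterable[str]) -> Dict[str, int]:
    """Map each unique value to a 1-based ID, in sorted order (hypothetical helper)."""
    return {value: i for i, value in enumerate(sorted(set(values)), start=1)}


def to_fact_rows(csv_text: str) -> Tuple[Dict[str, int], Dict[str, int], List[Tuple[int, int]]]:
    """Build both dimension maps, then replace the values in every row with their IDs."""
    rows = [row for row in csv.reader(io.StringIO(csv_text), delimiter=";") if row]
    dim1 = build_dimension(row[0] for row in rows)
    dim2 = build_dimension(row[1] for row in rows)
    facts = [(dim1[row[0]], dim2[row[1]]) for row in rows]
    return dim1, dim2, facts


SAMPLE = '''"avalue";"anothervalue"
"bvalue";"evenanothervalue"
"avalue";"evenanothervalue"
"avalue";"evenanothervalue"
"bvalue";"evenanothervalue"
"avalue";"anothervalue"
'''

dim1, dim2, facts = to_fact_rows(SAMPLE)
```

For the sample, `facts` comes out as the same six ID pairs shown in the question's fact table. Once the ID maps are fixed, the replacement step is independent per row, so it can run on each of the 40 chunks separately.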

Assuming N computers, X files of about 50 GB each, and a goal of having one database containing everything at the end.

Question: It takes 15 hours now. Do you know which part of the process is taking the longest? (Reading data, cleansing data, saving read data in tables, indexing… you are inserting data into unindexed tables and indexing after, right?)

To split this job up amongst the N computers, I’d do something like (and this is a back-of-the-envelope design):

<ul>
<li>Have a “central” or master database. Use this to manage the overall process, and to hold the final complete warehouse.</li>
<li>It contains lists of all X files and all N-1 (not counting itself) “worker” databases.</li>
<li>Each worker database is somehow linked to the master database (just how depends on the RDBMS, which you have not specified).</li>
<li>When up and running, a "ready" worker database polls the master database for a file to process. The master database doles out files to worker systems, ensuring that no file gets processed by more than one worker at a time. (It has to track success/failure of loading a given file, watch for timeouts (worker failed), and manage retries.)</li>
<li>Each worker database has a local instance of the star schema. When assigned a file, it empties the schema and loads the data from that one file. (For scalability, might it be worth loading a few files at a time?) “First stage” data cleansing is done here for the data contained within that file (or files).</li>
<li>When loaded, the master database is updated with a “ready” flag for that worker, and the worker goes into waiting mode.</li>
<li>The master database has its own to-do list of worker databases that have finished loading data. It processes each waiting worker set in turn; when a worker set has been processed, the worker is set back to “check if there’s another file to process” mode.</li>
<li>At start of process, the star schema in the master database is cleared. The first set loaded can probably just be copied over verbatim.</li>
<li>For second set and up, have to read and “merge” data – toss out redundant entries, merge data via conformed dimensions, etc. Business rules that apply to all the data, not just one set at a time, must be done now as well. This would be “second stage” data cleansing.</li>
<li>Again, repeat the above step for each worker database, until all files have been uploaded.</li>
</ul>
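The file hand-out step in the design above (one file per worker at a time, with timeouts and retries) could look something like the sketch below. It is an in-memory illustration under stated assumptions: the class `FileDispenser` and its method names are hypothetical, and in a real setup this state would live in a table in the master database rather than in a Python object.

```python
import time
from typing import Dict, List, Optional, Tuple


class FileDispenser:
    """Hands each file to at most one worker at a time; re-queues files whose lease expired."""

    def __init__(self, files: List[str], lease_seconds: float = 3600.0) -> None:
        self.pending: List[str] = list(files)
        self.in_progress: Dict[str, Tuple[str, float]] = {}  # file -> (worker, lease deadline)
        self.done: List[str] = []
        self.lease_seconds = lease_seconds

    def next_file(self, worker: str, now: Optional[float] = None) -> Optional[str]:
        """Called when a worker polls for work; returns None if nothing is available."""
        now = time.monotonic() if now is None else now
        self._requeue_expired(now)
        if not self.pending:
            return None
        name = self.pending.pop(0)
        self.in_progress[name] = (worker, now + self.lease_seconds)
        return name

    def mark_done(self, worker: str, name: str) -> bool:
        """Record a successful load; rejects reports from a worker that lost its lease."""
        owner = self.in_progress.get(name)
        if owner is None or owner[0] != worker:
            return False
        del self.in_progress[name]
        self.done.append(name)
        return True

    def _requeue_expired(self, now: float) -> None:
        """Treat a worker as failed once its lease deadline has passed and retry the file."""
        for name, (_, deadline) in list(self.in_progress.items()):
            if deadline <= now:
                del self.in_progress[name]
                self.pending.append(name)
```

The lease is what covers the "worker failed" case: a file that is never reported as done simply becomes available again after `lease_seconds`, and a late report from the original worker is ignored.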

Advantages:

<ul>
<li>Reading/converting data from files into databases and doing “first stage” cleansing gets scaled out across N computers.</li>
<li>Ideally, little work (“second stage”, merging datasets) is left for the master database</li>
</ul>

Limitations:

<ul>
<li>Lots of data is first read into worker database, and then read again (albeit in DBMS-native format) across the network</li>
<li>Master database is a possible chokepoint. Everything has to go through here.</li>
</ul>

Shortcuts:

<ul>
<li>It seems likely that when a worker “checks in” for a new file, it can refresh a local store of data already loaded in the master and add data cleansing considerations based on this to its “first stage” work (i.e. it knows code 5484J has already been loaded, so it can filter it out and not pass it back to the master database).</li>
<li>SQL Server table partitioning or similar physical implementation tricks of other RDBMSs could probably be used to good effect.</li>
<li>Other shortcuts are likely, but it totally depends upon the business rules being implemented.</li>
</ul>

Unfortunately, without further information or understanding of the system and data involved, one can’t tell if this process would end up being faster or slower than the “do it all on one box” solution. At the end of the day it depends a lot on your data: does it submit to “divide and conquer” techniques, or must it all be run through a single processing instance?

The simplest thing is to make one computer responsible for handing out new dimension item IDs. You can have one such computer for each dimension. If the dimension-handling computers are on the same network, you can have them broadcast the IDs. That should be fast enough.
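As a rough illustration of a single authority handing out IDs for one dimension, the sketch below keeps the mapping in one place and serializes access with a lock. The class name `DimensionIdAllocator` is hypothetical, and the network part (how importers reach it, or how IDs are broadcast) is deliberately left out.

```python
import threading
from typing import Dict


class DimensionIdAllocator:
    """Single source of truth for the IDs of one dimension (hypothetical sketch)."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._lock = threading.Lock()

    def get_id(self, value: str) -> int:
        """Return the existing ID for value, or assign the next sequential one."""
        with self._lock:
            if value not in self._ids:
                self._ids[value] = len(self._ids) + 1
            return self._ids[value]


allocator = DimensionIdAllocator()
first = allocator.get_id("avalue")
second = allocator.get_id("bvalue")
again = allocator.get_id("avalue")
```

Since every importer asks the same allocator, a value gets the same ID no matter which machine sees it first.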

What database did you plan on using with a 23-dimensional star schema? Importing might not be the only performance bottleneck. You might want to do this in a distributed main-memory system. That avoids a lot of the materialization issues.

You should investigate whether there are highly correlated dimensions.

In general, with a 23-dimensional star schema with large dimensions, a standard relational database (SQL Server, PostgreSQL, MySQL) is going to perform extremely badly on data warehouse questions. To avoid full table scans, relational databases use materialized views. With 23 dimensions you cannot afford enough of them. A distributed main-memory database might be able to do full table scans fast enough (in 2004 I did about 8 million rows/sec/thread on a 3 GHz Pentium 4 in Delphi). Vertica might be another option.

Another question: how large is the file when you zip it? That provides a good first order estimate of the amount of normalization you can do.

[edit] I've taken a look at your other questions. This does not look like a good match for PostgreSQL (or MySQL or SQL Server). How long are you willing to wait for query results?

On another note, you could use the Windows Hyper-V Cloud Computing add-on for Windows Server: http://www.microsoft.com/virtualization/en/us/private-cloud.aspx

You could consider using a 64-bit hash function to produce a <code>bigint</code> ID for each string, instead of using sequential IDs.

With 64-bit hash codes, you can store 2^(32 - 7) or over 30 million items in your hash table before there is a 0.0031% chance of a collision.
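The figure above can be checked with the usual birthday-bound approximation, p ≈ n(n - 1) / 2 / 2^bits, which holds while p is small. The function name below is hypothetical, and the calculation assumes the hash behaves like a uniform random function.

```python
def collision_probability(n_items: int, hash_bits: int = 64) -> float:
    """Approximate chance of at least one collision among n_items uniform hashes."""
    return n_items * (n_items - 1) / 2 / 2 ** hash_bits


n = 2 ** (32 - 7)  # 33,554,432 items
p = collision_probability(n)
summary = f"{n:,} items -> {p:.4%} chance of a collision"
```

With n = 2^25 the result is just under 2^-15, which formats as 0.0031%, matching the number quoted above.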

This would allow you to have identical IDs on all nodes, with no communication whatsoever between servers between the 'dispatch' and the 'merge' phases.
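A minimal sketch of such a hash-based ID, using 64-bit FNV-1a (one of the hash functions linked below) and folding the unsigned result into the signed range of a typical <code>bigint</code> column. The function names are hypothetical, and the choice of FNV-1a and UTF-8 encoding is an assumption for illustration; any well-distributed 64-bit hash would serve.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(text: str) -> int:
    """64-bit FNV-1a over the UTF-8 bytes of text; returns an unsigned 64-bit value."""
    h = FNV64_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def to_signed_bigint(h: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer."""
    return h - (1 << 64) if h >= (1 << 63) else h


def dimension_id(name: str) -> int:
    """Deterministic ID that every node computes identically without coordination."""
    return to_signed_bigint(fnv1a_64(name))
```

Every node that calls `dimension_id("avalue")` gets the same number, so fact rows produced on different machines can be merged without remapping. A collision check during the merge is still worthwhile, since the probability is small but not zero.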

You could even increase the number of bits to further lower the chance of a collision; however, the resulting hash would then no longer fit in a 64-bit integer database field.

See:

<a href="http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash" rel="noreferrer">http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash</a>

<a href="http://code.google.com/p/smhasher/wiki/MurmurHash" rel="noreferrer">http://code.google.com/p/smhasher/wiki/MurmurHash</a>

<a href="http://www.partow.net/programming/hashfunctions/index.html" rel="noreferrer">http://www.partow.net/programming/hashfunctions/index.html</a>

Rohita,

I'd suggest you eliminate a lot of the work from the load by summarizing the data FIRST, outside of the database. I work in a Solaris Unix environment. I'd be leaning towards a Korn shell script, which <code>cut</code>s the file up into more manageable chunks, then farms those chunks out equally to my two OTHER servers. I'd process the chunks using a nawk script (nawk has an efficient hash table, which it calls "associative arrays") to calculate the distinct values (the dimension tables) and the fact table. Just associate each new name seen with an incrementor for that dimension, then write the fact.

If you do this through named pipes you can push the data, process it remotely, and read it back 'on the fly' while the "host" computer sits there loading it straight into tables.

Remember, no matter WHAT you do with 200,000,000 rows of data (how many gigabytes is it?), it's going to take some time. Sounds like you're in for some fun. It's interesting to read how other people propose to tackle this problem... The old adage "there's more than one way to do it!" has never been so true. Good luck!

Cheers. Keith.

It seems that your implementation is very inefficient, as it loads at less than 1 MB/sec (50 GB / 15 hrs).

A proper implementation on a modern single server (2x Xeon 5690 CPUs + enough RAM for ALL dimensions loaded in hash tables + 8 GB) should give you at least 10 times better speed, i.e. at least 10 MB/sec.
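The arithmetic behind these figures, spelled out as a small sketch. It assumes decimal units (1 GB = 1000 MB) and takes the 50 GB total from the estimate above; the function names are hypothetical.

```python
def throughput_mb_per_s(total_gb: float, hours: float) -> float:
    """Average load speed in MB/s, with 1 GB = 1000 MB."""
    return total_gb * 1000 / (hours * 3600)


def hours_needed(total_gb: float, mb_per_s: float) -> float:
    """Time to load total_gb at a sustained rate of mb_per_s."""
    return total_gb * 1000 / mb_per_s / 3600


current_rate = throughput_mb_per_s(50, 15)  # roughly 0.93 MB/s
target_hours = hours_needed(50, 10)         # roughly 1.4 hours at 10 MB/s
```

So a tenfold speedup on a single machine would bring the 15-hour import down to under an hour and a half, before any work is distributed.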

We have flat files (CSV) with >200,000,000 rows, which we import into a star schema with 23 dimension tables. The biggest dimension table has 3 million rows. At the moment we run the import process on a single computer and it takes around 15 hours. As this is too long, we want to use something like 40 computers to do the importing.

My question

How can we efficiently use the 40 computers to do the importing? The main worry is that a lot of time will be spent replicating the dimension tables across all the nodes, as they need to be identical on all nodes. This could mean that if we used 1000 servers to do the importing in the future, it might actually be slower than using a single one, due to the extensive network communication and coordination between the servers.

Does anyone have a suggestion?

EDIT:

The following is a simplification of the CSV files: 

<pre><code>"avalue";"anothervalue"
"bvalue";"evenanothervalue"
"avalue";"evenanothervalue"
"avalue";"evenanothervalue" 
"bvalue";"evenanothervalue"
"avalue";"anothervalue"
</code></pre>

After importing, the tables look like this:

dimension_table1

<pre><code>id name
1 "avalue"
2 "bvalue"
</code></pre>

dimension_table2

<pre><code>id name
1 "anothervalue"
2 "evenanothervalue"
</code></pre>

Fact table

<pre><code> dimension_table1_ID dimension_table2_ID
 1 1
 2 2
 1 2
 1 2 
 2 2
 1 1
</code></pre>

How to efficiently utilize 10+ computers to import data
