You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.

How? Oh, what language are you using?

I had this as an interview question once.
Here is an algorithm that is O(N):
Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number value. Once an insert collides with an equal value already in the table, you have found your duplicate.
<h3>Author Edit:</h3>
Below, @Phimuemue makes the excellent point that 4-byte integers have a fixed bound before a collision is guaranteed: 2^32 distinct values, or approx. 4 GB at one byte per value. In light of the conversation accompanying this answer, this dramatically reduces the worst-case memory consumption of this algorithm.
Furthermore, using the bit array as described below can reduce memory consumption to 1/8th of that, 512 MB. On many machines, this computation is now possible without considering either a persistent hash or the less-performant sort-first strategy.
Note that longer integers or double-precision numbers are less suitable for the bit array strategy.
<h3>Phimuemue Edit:</h3>
Of course, one needs a somewhat &quot;special&quot; hash table:
Take a hash table consisting of 2^32 bits. Since the question asks about 4-byte integers, there are at most 2^32 distinct values, i.e. one bit for each number. 2^32 bits = 512 MB.
So now one just has to determine the location of the corresponding bit in the table and set it. If one encounters a bit which is already set, the number has already occurred in the sequence.

You have to read each number and store it in a hashmap (or hash set). If a number occurs again, the insert finds it already present and discards it, which tells you it is a repeat.

If the possible range of numbers in the file is not too large, you can use a bit array to record which numbers in the range have appeared.

If the range of the numbers is small enough, you can use a bit field to record whether each number is present; initialize it with a single scan through the file. This takes one bit per possible number.

With a large range (like int), you need to read through the file for every lookup. The file layout may allow more efficient lookups (e.g. binary search if the file is a sorted array).

If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.
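A minimal sketch of that brute-force scan, assuming the file holds whitespace-separated unsigned integers and is opened as a seekable stream. The name `find_dup_bruteforce` is made up for illustration, and `ftell` returns a `long`, which may not cover a file of this size on every platform.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 and stores the first value that reappears later in *dup,
   or 0 if no value repeats (or if seeking fails).
   Only two numbers are held in memory at any time. */
int find_dup_bruteforce(FILE *f, unsigned *dup)
{
    unsigned a, b;
    long resume;

    assert(f != NULL && dup != NULL);
    rewind(f);
    while (fscanf(f, "%u", &a) == 1) {
        resume = ftell(f);            /* where the outer scan continues */
        if (resume < 0) return 0;
        while (fscanf(f, "%u", &b) == 1) {
            if (a == b) { *dup = a; return 1; }
        }
        clearerr(f);                  /* the inner scan ran into end of file */
        if (fseek(f, resume, SEEK_SET) != 0) return 0;
    }
    return 0;
}
```

This is quadratic in the number of values, which is exactly the "incredible amount of time" trade-off described above.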

Read the file once and create a hashtable storing the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item itself, for example its least significant bits, let's say 20 bits (about 1M buckets).

After the first pass, all items that have counter > 1 may point to a duplicated item, or be a false positive. Rescan the file, consider only items that may lead to a duplicate (looking up each item in table one), build a new hashtable using real values as keys now and storing the count again.

After the second pass, items with count > 1 in the second table are your duplicates.

This is still O(n), just twice as slow as a single pass.
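Here is a rough sketch of the two passes, assuming whitespace-separated unsigned 32-bit values in a rewindable file. The function name and constants are made up for illustration, and for brevity the second "real values as keys" table is swapped for a sorted candidate array, which is not what the answer literally describes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BUCKET_BITS 20u
#define BUCKETS (1u << BUCKET_BITS)     /* about 1M buckets */
#define MASK (BUCKETS - 1u)

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Returns 1 and stores one duplicated value in *dup, 0 if there is
   none, or -1 if memory runs out. */
int find_dup_two_pass(FILE *f, uint32_t *dup)
{
    uint8_t *count;
    uint32_t *cand = NULL;
    size_t n = 0, cap = 0, i;
    unsigned x;
    int result = 0;

    assert(f != NULL && dup != NULL);
    count = calloc(BUCKETS, 1);
    if (!count) return -1;

    /* Pass 1: count hits per bucket (low 20 bits), saturating at 2. */
    rewind(f);
    while (fscanf(f, "%u", &x) == 1) {
        uint8_t *c = &count[x & MASK];
        if (*c < 2) ++*c;
    }

    /* Pass 2: keep only values whose bucket was hit more than once. */
    rewind(f);
    while (fscanf(f, "%u", &x) == 1) {
        if (count[x & MASK] < 2) continue;
        if (n == cap) {
            size_t ncap = cap ? cap * 2 : 1024;
            uint32_t *tmp = realloc(cand, ncap * sizeof *tmp);
            if (!tmp) { result = -1; goto done; }
            cand = tmp;
            cap = ncap;
        }
        cand[n++] = x;
    }

    /* Exact check on the candidates: equal values end up adjacent. */
    if (n > 1) qsort(cand, n, sizeof *cand, cmp_u32);
    for (i = 1; i < n; i++) {
        if (cand[i] == cand[i - 1]) { *dup = cand[i]; result = 1; break; }
    }
done:
    free(cand);
    free(count);
    return result;
}
```

The first table costs 1 MB here; the candidate array is as large as the number of values that survive the filter.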

I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4,294,967,296 bits. You start by setting all bits to 0, and every number in the file sets a specific bit. If the bit is already set, then you've found a number that has occurred before. Do you also need to know how often they occur?

Still, it would need at least 536,870,912 bytes (512 MB). That's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of ways to solve it like this.

The important question is whether you want to solve this problem efficiently, or whether you want to solve it accurately.

If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of a very grimy and unstable solution, there is no hope of solving this without storing a significant number of the values.

Instead, turn to probabilistic solutions, which have been used in almost any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).

A possible solution, which can be made to find exact results: use a sufficiently high-resolution <a href="http://en.wikipedia.org/wiki/Bloom_filter" rel="nofollow noreferrer">Bloom filter</a>. Either use the filter to decide whether an element has already been seen, or, if you want perfect accuracy, use the filter to, eh, filter out the elements which you can't possibly have seen and then, on a second pass, determine the elements you actually see twice (using a standard hash table, as kbrimington suggested).
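For what it's worth, a Bloom filter of this kind might look roughly like the sketch below. All names (`bloom`, `bloom_test_and_set`, and so on) are made up for illustration, and the mixing function is just one arbitrary choice of 64-bit hash; any well-distributed hash should do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *bits;      /* packed bit array */
    uint64_t nbits;     /* filter size in bits */
    unsigned nhash;     /* number of probes per key */
} bloom;

/* An arbitrary 64-bit mixing function (assumption: any good hash works). */
static uint64_t mix64(uint64_t h)
{
    h ^= h >> 33; h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33; h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return h;
}

/* Returns 0 on success, -1 if the bit array cannot be allocated. */
int bloom_init(bloom *b, uint64_t nbits, unsigned nhash)
{
    assert(b != NULL && nbits > 0 && nhash > 0);
    b->bits = calloc((size_t)((nbits + 7) / 8), 1);
    b->nbits = nbits;
    b->nhash = nhash;
    return b->bits ? 0 : -1;
}

/* Marks key as seen. Returns 1 if every probed bit was already set
   ("possibly seen before"), 0 if the key is definitely new. */
int bloom_test_and_set(bloom *b, uint32_t key)
{
    uint64_t h1 = mix64((uint64_t)key + 0x9e3779b97f4a7c15ULL);
    uint64_t h2 = mix64(h1) | 1;            /* odd step for double hashing */
    int seen = 1;
    unsigned i;

    for (i = 0; i < b->nhash; i++) {
        uint64_t bit = (h1 + i * h2) % b->nbits;
        uint8_t mask = (uint8_t)(1u << (bit % 8));
        if (!(b->bits[bit / 8] & mask)) seen = 0;
        b->bits[bit / 8] |= mask;
    }
    return seen;
}

void bloom_free(bloom *b)
{
    free(b->bits);
    b->bits = NULL;
}
```

A return value of 1 can be a false positive, which is why the exact variant needs the second pass: only the values flagged here have to go into the real hash table.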

And if your problem is slightly different (for instance, you know that at least 0.001% of the elements appear twice, and you would like to find out approximately how many there are, or you would like a random sample of such elements), then a whole score of probabilistic streaming algorithms, in the vein of <a href="http://portal.acm.org/citation.cfm?id=5215" rel="nofollow noreferrer">Flajolet &amp; Martin</a>, Alon et al., exist and are very interesting (not to mention highly efficient).

How about:

<ol>
<li>Sort the input using some algorithm which needs only a portion of the input in RAM at a time. Examples are <a href="http://en.wikipedia.org/wiki/External_sorting" rel="nofollow noreferrer">here</a>.</li>
<li>Look for duplicates in the output of the 1st step: to detect repetitions, you need space for just 2 input elements in RAM at a time.</li>
</ol>
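The second step could be sketched like this, assuming the external sort has already written whitespace-separated unsigned integers in ascending order (the sort itself is not shown, and `find_dup_in_sorted` is a made-up name):

```c
#include <assert.h>
#include <stdio.h>

/* Scans an already sorted stream from its current position.
   Returns 1 and stores the first repeated value in *dup, or 0 if
   every value is distinct. Only two values are held at a time. */
int find_dup_in_sorted(FILE *sorted, unsigned *dup)
{
    unsigned prev, cur;

    assert(sorted != NULL && dup != NULL);
    if (fscanf(sorted, "%u", &prev) != 1) return 0;   /* empty input */
    while (fscanf(sorted, "%u", &cur) == 1) {
        if (cur == prev) { *dup = cur; return 1; }
        prev = cur;
    }
    return 0;
}
```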

Finding duplicates

Note that a 32-bit integer means you're going to have a large number of duplicates, since a 32-bit int can only represent about 4.3 billion different numbers and you have "10 billion".
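To put numbers on that pigeonhole argument, taking "10 billion" as exactly 10^10 purely for illustration:

```latex
\[
2^{32} = 4\,294\,967\,296 < 10^{10},
\qquad
10^{10} - 2^{32} = 5\,705\,032\,704 .
\]
```

So under that assumption, at least 5,705,032,704 of the values read must repeat a value that appeared earlier in the file.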

If you were to use a tightly packed set, you could record, for every possible value, whether it has been seen, in 512 MB, which easily fits into RAM on current machines. As a start, this lets you tell pretty easily whether a number is duplicated or not.

Counting Duplicates

If you need to know how many times a number is duplicated, you're looking at a hashmap that contains only the duplicates (using the first 500 MB of RAM to tell efficiently whether a number should be in the map or not). In the worst case, with a large spread of values, you're not going to be able to fit that into RAM.

Another approach, if the duplicates are spread fairly evenly across the values, is to use a tightly packed array with 2-8 bits per value, taking about 1-4 GB of RAM; at 8 bits per value this lets you count up to 255 occurrences of each number.
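As a sketch of the 2-bit end of that range: four counters are packed into each byte and each counter saturates at 3. The names (`counters2` and friends) are made up for illustration. For the full 32-bit range one would ask for 2^32 counters, about 1 GB.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Packed array of 2-bit counters, four per byte. */
typedef struct {
    uint8_t *cells;
    uint64_t n;         /* number of counters */
} counters2;

/* Returns 0 on success, -1 if allocation fails. */
int counters2_init(counters2 *c, uint64_t n)
{
    assert(c != NULL && n > 0);
    c->cells = calloc((size_t)((n + 3) / 4), 1);
    c->n = n;
    return c->cells ? 0 : -1;
}

/* Current count for value k (0..3). */
unsigned counters2_get(const counters2 *c, uint64_t k)
{
    assert(k < c->n);
    return (c->cells[k / 4] >> (2 * (k % 4))) & 3u;
}

/* Increments the counter for k, saturating at 3; returns the new count. */
unsigned counters2_inc(counters2 *c, uint64_t k)
{
    unsigned v = counters2_get(c, k);
    if (v < 3) {
        /* The field is below its maximum, so adding 1 cannot carry
           into the neighbouring counter. */
        c->cells[k / 4] = (uint8_t)(c->cells[k / 4] + (1u << (2 * (k % 4))));
        v++;
    }
    return v;
}

void counters2_free(counters2 *c)
{
    free(c->cells);
    c->cells = NULL;
}
```

With 8 bits per value the packing disappears (one byte per counter) and the ceiling becomes 255, at four times the memory.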

It's going to be a hack, but it's doable.

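<pre><code>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
/* Macro is overly general but I left it 'cos it's convenient */
#define BITOP(a,b,op) \
 ((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1&lt;&lt;((size_t)(b)%(8*sizeof *(a))))
int main(void)
{
 unsigned x=0;
 int found=0;
 /* One bit per possible unsigned value; calloc zeroes the bitmap (malloc would not). */
 size_t *seen = calloc((size_t)1&lt;&lt;(8*sizeof(unsigned)-3), 1);
 if (!seen) { perror("calloc"); return 1; }
 /* Stop at the first value whose bit is already set. */
 while (scanf("%u", &amp;x)&gt;0) {
  if (BITOP(seen,x,&amp;)) { found=1; break; }
  BITOP(seen,x,|=);
 }
 /* Checking the bit again here would misreport the last value at end of input. */
 if (found) printf("duplicate is %u\n", x);
 else printf("no duplicate\n");
 free(seen);
 return 0;
}
</code></pre>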

This is a simple problem that can be solved very easily (several lines of code) and very quickly (several minutes of execution) with the right tools.
My personal approach would be to use MapReduce:
<a href="http://labs.google.com/papers/mapreduce-osdi04.pdf" rel="nofollow noreferrer">MapReduce: Simplified Data Processing on Large Clusters</a>

I'm sorry for not going into more detail, but once you are familiar with the concept of MapReduce, it becomes very clear how to approach the solution.
Basically, we are going to implement two simple functions:

<ol>
<li>Map(key, value)</li>
<li>Reduce(key, values[])</li>
</ol>

So all in all:

<ul>
<li>Open the file and iterate through the data.</li>
<li>For each number -> Map(number, line_index).</li>
<li>In the reduce step we get the number as the key and its positions in the file as the values, so the number of values is the total number of occurrences.</li>
<li>So in Reduce(key, values[]), if the number of values > 1, then it is a duplicate number.</li>
<li>Print the duplicates: number, line_index1, line_index2, ...</li>
</ul>

Again, this approach can result in very fast execution, depending on how your MapReduce framework is set up. It is highly scalable and very reliable, and there are many different implementations of MapReduce in many languages.

Several top companies offer ready-built cloud computing environments, such as Google, Microsoft Azure, Amazon AWS, ...
Or you can build your own and set up a cluster with any provider offering virtual computing environments, paying very low costs by the hour.

Good luck :)

<ul>
<li>Another, simpler approach could be to use Bloom filters.
AdamT</li>
</ul>
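To make the Map and Reduce roles concrete, here is a toy single-process simulation in C. It is not the API of any real MapReduce framework: the function names are invented, the input is an in-memory array, and the "shuffle" between the two steps is simply a sort by key.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* One (key, value) pair emitted by the map step. */
typedef struct { unsigned number; size_t line_index; } pair;

/* Map: emit (number, line_index) for every input number. */
static void map_step(const unsigned *input, size_t n, pair *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[i].number = input[i];
        out[i].line_index = i + 1;          /* 1-based line numbers */
    }
}

/* Shuffle: order pairs by key so that equal numbers become adjacent. */
static int by_number(const void *a, const void *b)
{
    const pair *p = a, *q = b;
    if (p->number != q->number)
        return (p->number > q->number) - (p->number < q->number);
    return (p->line_index > q->line_index) - (p->line_index < q->line_index);
}

/* Reduce: a key with more than one value is a duplicate; print it.
   Returns 1 for a duplicate key, 0 otherwise. */
static int reduce_step(unsigned key, const pair *values, size_t count)
{
    size_t i;
    if (count < 2) return 0;
    printf("%u:", key);
    for (i = 0; i < count; i++)
        printf(" %lu", (unsigned long)values[i].line_index);
    printf("\n");
    return 1;
}

/* Runs map, shuffle and reduce in one process. Returns the number of
   duplicated keys, or -1 if memory cannot be allocated. */
long count_duplicated_keys(const unsigned *input, size_t n)
{
    pair *pairs;
    size_t i = 0, j;
    long dups = 0;

    assert(input != NULL || n == 0);
    if (n == 0) return 0;
    pairs = malloc(n * sizeof *pairs);
    if (!pairs) return -1;

    map_step(input, n, pairs);
    qsort(pairs, n, sizeof *pairs, by_number);
    while (i < n) {
        /* Find the end of the group of pairs sharing this key. */
        for (j = i; j < n && pairs[j].number == pairs[i].number; j++)
            ;
        dups += reduce_step(pairs[i].number, pairs + i, j - i);
        i = j;
    }
    free(pairs);
    return dups;
}
```

In a real cluster the map and reduce calls would run on different machines and the framework would handle the grouping by key; the sketch only shows what each step is responsible for.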

Implement a BitArray such that the ith byte of the array corresponds to the numbers 8*i to 8*i + 7, i.e. the first bit of the ith byte is 1 if we have already seen 8*i, the second bit of the ith byte is 1 if we have already seen 8*i + 1, and so on.

Initialize this bit array with size Integer.Max/8 + 1 bytes, and whenever you see a number k, check bit k%8 of the byte at index k/8: if this bit is already 1, you have seen this number already; otherwise set it to 1.
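The indexing described above (byte k/8, bit k%8) could be sketched as follows. The type and function names are made up for illustration, and the size is passed in as a bit count so the same code works for a small test range or the full 32-bit range.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *bytes;     /* byte i holds the bits for numbers 8*i .. 8*i+7 */
    uint64_t nbits;     /* how many distinct numbers can be tracked */
} bitarray;

/* Returns 0 on success, -1 if allocation fails. All bits start at 0. */
int bitarray_init(bitarray *a, uint64_t nbits)
{
    assert(a != NULL && nbits > 0);
    a->bytes = calloc((size_t)((nbits + 7) / 8), 1);
    a->nbits = nbits;
    return a->bytes ? 0 : -1;
}

/* Sets bit k%8 of byte k/8. Returns 1 if the bit was already set,
   meaning k has been seen before, and 0 otherwise. */
int bitarray_test_and_set(bitarray *a, uint64_t k)
{
    uint8_t mask;

    assert(k < a->nbits);
    mask = (uint8_t)(1u << (k % 8));
    if (a->bytes[k / 8] & mask) return 1;
    a->bytes[k / 8] |= mask;
    return 0;
}

void bitarray_free(bitarray *a)
{
    free(a->bytes);
    a->bytes = NULL;
}
```

For every 32-bit value one would call `bitarray_init(&a, (uint64_t)1 << 32)`, which asks for 512 MB.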

Say I have 10 billion numbers stored in a file. How would I find a number that has already appeared previously?

Well, I can't just load billions of numbers into an array in one go and then run a simple nested loop to check whether a number has appeared previously.

How would you approach this problem?

Thanks in advance :)

Finding a number appearing again among numbers stored in a file
