I've had nice results with <a href="http://www.cse.yorku.ca/~oz/hash.html"><code>djb2</code></a> by Dan Bernstein.

<pre><code>unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash &lt;&lt; 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
</code></pre>
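To connect this to the question's <code>int table[10000]</code>, here is a minimal sketch of turning the djb2 value into a slot index. The names <code>djb2</code>, <code>bucket_index</code> and <code>TABLE_SIZE</code> are mine, chosen for illustration, and reducing with a plain modulo is an assumption about how the table is indexed.

```c
#include <stddef.h>

#define TABLE_SIZE 10000u /* assumed: matches int table[10000] from the question */

/* Same algorithm as above (hash * 33 + c), renamed to avoid clashing with the variable name. */
unsigned long djb2(const unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++) != 0)
        hash = hash * 33 + (unsigned long)c;

    return hash;
}

/* Hypothetical helper: reduce the hash to a valid index into the table. */
size_t bucket_index(const char *word)
{
    return (size_t)(djb2((const unsigned char *)word) % TABLE_SIZE);
}
```

Two different words can still land on the same index, so the table needs some collision handling (probing to the next free slot, or chaining) on top of this.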

<a href="http://en.wikipedia.org/wiki/Jenkins_hash_function" rel="noreferrer">Wikipedia shows</a> a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.

<pre><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* unsigned char avoids sign extension for bytes &gt;= 0x80 */
uint32_t jenkins_one_at_a_time_hash(const unsigned char *key, size_t len)
{
    uint32_t hash = 0;
    size_t i;

    for (i = 0; i &lt; len; ++i)
    {
        hash += key[i];
        hash += (hash &lt;&lt; 10);
        hash ^= (hash &gt;&gt; 6);
    }
    hash += (hash &lt;&lt; 3);
    hash ^= (hash &gt;&gt; 11);
    hash += (hash &lt;&lt; 15);
    return hash;
}
</code></pre>

There are a number of existing hashtable implementations for C, from the POSIX hcreate/hdestroy/hsearch functions (declared in <code>&lt;search.h&gt;</code>), to those in the <a href="http://apr.apache.org/">APR</a> and <a href="http://developer.gnome.org/glib/">glib</a>, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hashtable or hash function; they've been optimized heavily for common use-cases.

If your dataset is static, however, your best solution is probably to use a <a href="http://en.wikipedia.org/wiki/Perfect_hash">perfect hash</a>. <a href="http://www.gnu.org/s/gperf/">gperf</a> will generate a perfect hash for you for a given dataset.
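For the hcreate/hsearch route, a minimal sketch of storing a word together with its file position might look like this. The helper names <code>dict_put</code> and <code>dict_get</code> are hypothetical, and packing the offset into the <code>data</code> pointer is just one convenient assumption; it relies on the table having been created with <code>hcreate</code> first.

```c
#include <search.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: map word -> file offset. Returns 0 on success, -1 if the table is full.
 * hsearch stores the key pointer without copying it, so the string must outlive the table.
 * If the key is already present, ENTER returns the existing entry and leaves its data unchanged. */
int dict_put(const char *word, long offset)
{
    ENTRY e;
    e.key = (char *)word;
    e.data = (void *)(intptr_t)offset;
    return hsearch(e, ENTER) != NULL ? 0 : -1;
}

/* Hypothetical helper: returns the stored offset, or -1 if the word is not in the table. */
long dict_get(const char *word)
{
    ENTRY e, *found;
    e.key = (char *)word;
    e.data = NULL;
    found = hsearch(e, FIND);
    return found != NULL ? (long)(intptr_t)found->data : -1;
}
```

Typical use is <code>hcreate(n)</code> once with the expected number of entries, then <code>dict_put</code>/<code>dict_get</code>, then <code>hdestroy()</code>. Note that this interface manages a single global table per process.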

djb2 has 317 collisions for <a href="https://github.com/dwyl/english-words/blob/master/words.txt" rel="noreferrer">this 466k English dictionary</a>, while MurmurHash has none for 64-bit hashes and 21 for 32-bit hashes (around 25 is to be expected for 466k random 32-bit hashes).
My recommendation is to use <a href="https://en.wikipedia.org/wiki/MurmurHash" rel="noreferrer">MurmurHash</a> if it is available; it is very fast because it takes in several bytes at a time. But if you need a simple and short hash function to copy and paste into your project, I'd recommend Murmur's one-byte-at-a-time version:

<pre><code>#include &lt;stdint.h&gt;

/* Written as plain C (initialisation with =, static inline) so it can be pasted into a C project. */
static inline uint32_t MurmurOAAT32(const char *key)
{
    uint32_t h = 3323198485ul;
    for (; *key; ++key) {
        h ^= *key;
        h *= 0x5bd1e995;
        h ^= h &gt;&gt; 15;
    }
    return h;
}

static inline uint64_t MurmurOAAT64(const char *key)
{
    uint64_t h = 525201411107845655ull;
    for (; *key; ++key) {
        h ^= *key;
        h *= 0x5bd1e9955bd1e995;
        h ^= h &gt;&gt; 47;
    }
    return h;
}
</code></pre>

The optimal size of a hash table is, in short, as large as possible while still fitting into memory. Because we don't usually know or want to look up how much memory we have available, and it might even change, a good rule of thumb is a table size of roughly 2x the expected number of elements to be stored. Allocating much more than that will make your hash table faster, but with rapidly diminishing returns; making it smaller than that will make it much slower, because collisions pile up quickly as the table fills. This is because there is a non-linear <a href="https://1ykos.github.io/patchmap/#Performance%20comparison" rel="noreferrer">trade-off between space and time complexity</a> for hash tables, with an optimal load factor of 2-sqrt(2) = 0.58... apparently.
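As a rough sketch of the 2x rule of thumb: the function below picks a slot count for an expected number of elements. Rounding up to a power of two is my own assumption (it lets you reduce a hash with a bit mask instead of a modulo), and <code>table_slots_for</code> is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical helper: smallest power of two that is at least 2 * expected.
 * Assumes expected is small enough that 2 * expected does not overflow size_t. */
size_t table_slots_for(size_t expected)
{
    size_t slots = 1;

    while (slots < 2 * expected)
        slots <<= 1;

    return slots;
}
```

For the 8000 words in the question this gives 16384 slots, i.e. a load factor just under 0.5, and an index can then be computed as <code>hash &amp; (slots - 1)</code>.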

First, is 40 collisions for 130 words hashed to 0..99 bad? You can't expect perfect hashing unless you take steps specifically to get it. Most of the time, an ordinary hash function won't produce fewer collisions than a random generator would.
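To put a number on the random baseline, here is a back-of-the-envelope estimate. It assumes an ideal hash that places each of n keys uniformly at random into m slots, and it counts a collision whenever a key lands in a slot that is already occupied; whether the question counted collisions the same way is an assumption.

```latex
\begin{align*}
\mathbb{E}[\text{occupied slots}] &= m\left(1 - \left(1 - \tfrac{1}{m}\right)^{n}\right) \\
\mathbb{E}[\text{collisions}] &= n - m\left(1 - \left(1 - \tfrac{1}{m}\right)^{n}\right) \\
n = 130,\ m = 100:\quad \mathbb{E}[\text{collisions}] &\approx 130 - 100\,(1 - 0.271) \approx 57
\end{align*}
```

Under those assumptions, 130 words in 100 slots cannot have fewer than 30 collisions, and a random assignment would give about 57 on average.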

A hash function with a good reputation is <a href="http://code.google.com/p/smhasher/wiki/MurmurHash3" rel="nofollow">MurmurHash3</a>.

Finally, regarding the size of the hash table, it really depends on what kind of hash table you have in mind, in particular whether buckets are extensible or hold a single slot. If buckets are extensible, there is again a choice: you pick the average bucket length that fits your memory/speed constraints.

I tried these hash functions and got the following results. I have about 960^3 entries, each 64 bytes long (64 chars in different order), hashed to 32-bit values. The code is from <a href="http://burtleburtle.net/bob/hash/doobs.html" rel="nofollow noreferrer">here</a>.

<pre class="lang-none prettyprint-override"><code>Hash function    | collision rate | how many minutes to finish
==============================================================
MurmurHash3      |           6.?% |                      4m15s
Jenkins One..    |           6.1% |                      6m54s
Bob, 1st in link |          6.16% |                      5m34s
SuperFastHash    |            10% |                      4m58s
bernstein        |            20% |       14s only finish 1/20
one_at_a_time    |          6.16% |                       7m5s
crc              |          6.16% |                      7m56s
</code></pre>

One strange thing is that almost all the hash functions have a collision rate of about 6% on my data.
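For anyone who wants to repeat this kind of measurement, one way to count collisions is to collect all hash values, sort them, and count the repeats. This is only a sketch of the idea, not the code used for the table above; <code>count_collisions</code> and <code>cmp_u32</code> are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint32_t values (avoids overflow from subtracting). */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sorts the hashes in place and returns how many values repeat an earlier one. */
size_t count_collisions(uint32_t *hashes, size_t n)
{
    size_t collisions = 0;
    size_t i;

    qsort(hashes, n, sizeof hashes[0], cmp_u32);
    for (i = 1; i < n; ++i)
        if (hashes[i] == hashes[i - 1])
            ++collisions;

    return collisions;
}
```

Dividing the result by the number of entries gives a collision rate comparable across hash functions.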

One thing I've used with good results is the following (I don't know if it's been mentioned already, because I can't remember its name).

You precompute a table T with a random number for each character in your key's alphabet [0,255]. You hash your key 'k0 k1 k2 ... kN' by taking T[k0] xor T[k1] xor ... xor T[kN]. You can easily show that this is as random as your random number generator, and it's computationally very cheap. If you really run into a very bad instance with lots of collisions, you can just repeat the whole thing using a fresh batch of random numbers.
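A minimal sketch of the scheme as described, in C. The xorshift32 generator used to fill the table is my own choice for a self-contained example (any random source would do), and the names <code>table_hash_init</code> and <code>table_hash</code> are hypothetical.

```c
#include <stdint.h>

static uint32_t T[256]; /* one random word per possible byte value */

/* xorshift32: tiny PRNG used only to fill the table; state must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill T from a seed; call again with a new seed for a fresh batch of numbers. */
void table_hash_init(uint32_t seed)
{
    uint32_t state = seed != 0 ? seed : 1u;
    int i;

    for (i = 0; i < 256; ++i)
        T[i] = xorshift32(&state);
}

/* XOR together the table entries of every byte of a NUL-terminated key. */
uint32_t table_hash(const char *key)
{
    const unsigned char *p = (const unsigned char *)key;
    uint32_t h = 0;

    for (; *p != 0; ++p)
        h ^= T[*p];

    return h;
}
```

One thing to keep in mind: because XOR ignores order, this exact form gives anagrams the same hash ("ab" and "ba"), and a character that appears twice cancels itself out, so it is worth checking against your actual word list.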

I'm working on a hash table in C and I'm testing hash functions for strings.

The first function I tried adds up the ASCII codes and takes the result modulo 100 (% 100), but I got poor results on the first data test: 40 collisions for 130 words.

The final input data will contain 8,000 words (it's a dictionary stored in a file). The hash table is declared as int table[10000] and contains the position of each word in a txt file.

My first question is: which is the best algorithm for hashing strings? And how do I determine the size of the hash table?

Thanks in advance!

:-)

hash function for string
