I would probably use something like a B+tree for this: <a href="https://en.wikipedia.org/wiki/B%2B_tree" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/B%2B_tree</a>

Since memory efficiency is important to you, when a leaf block fills up you should, where possible, redistribute keys among several blocks instead of splitting it right away, so that blocks are always at least 85% full. The block size should be large enough that the overhead from internal nodes is only a few percent.
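To get a feel for what the 85% target implies, here is a rough back-of-the-envelope sketch in C. It assumes equal-capacity blocks holding equal-sized entries and a perfectly even redistribution, so that spreading k full blocks over k + 1 blocks leaves each one k/(k + 1) full. The function name `blocks_to_redistribute` is made up for illustration.

```c
#include <assert.h>

/* Smallest k such that spreading the contents of k full blocks over k + 1
 * blocks leaves every block at least min_fill_pct percent full.
 * Assumes equal-sized entries and a perfectly even redistribution. */
unsigned blocks_to_redistribute(unsigned min_fill_pct)
{
    assert(min_fill_pct < 100);   /* 100% cannot be guaranteed after a split */
    unsigned k = 1;
    while (k * 100 < min_fill_pct * (k + 1))
        k++;
    return k;
}
```

Under those assumptions, `blocks_to_redistribute(85)` returns 6 (6/7 is about 85.7%), while the classic one-into-two split only guarantees 50%.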

You can also optimize storage in the leaf blocks, since most of the keys in a block will have a long common prefix that you can figure out from the blocks in the higher levels. You can therefore remove all the copies of the common prefix from the keys in the leaf blocks, and your 400MB of key-value pairs will take substantially less than 400MB of RAM. This will complicate the insert process somewhat.
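As a minimal sketch of the prefix idea, assume the keys in a leaf are NUL-terminated and kept in `strcmp` order. The common prefix of the whole block is then just the common prefix of its first and last keys. The function names are hypothetical, and for simplicity this version counts one stored copy of the prefix per leaf rather than deriving it from the parent blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest prefix shared by every key in a sorted leaf block.
 * Because the keys are sorted, comparing the first and last key is enough. */
size_t common_prefix_len(const char *const *keys, size_t n)
{
    assert(n > 0);
    const char *first = keys[0];
    const char *last = keys[n - 1];
    size_t i = 0;
    while (first[i] != '\0' && first[i] == last[i])
        i++;
    return i;
}

/* Bytes needed to store the block when the prefix is kept once and each key
 * contributes only its suffix plus a NUL terminator. */
size_t compressed_size(const char *const *keys, size_t n)
{
    size_t plen = common_prefix_len(keys, n);
    size_t total = plen + 1;                 /* the prefix, stored once */
    for (size_t i = 0; i < n; i++)
        total += strlen(keys[i]) - plen + 1; /* suffix + terminator */
    return total;
}
```

For the three keys "abcxyzqaa", "abcxyzqgx" and "abcxyzzzz", the shared prefix is "abcxyz" (6 bytes), and the block needs 19 bytes instead of the 30 bytes taken by the full NUL-terminated keys.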

There are other things you can do to compress this structure further, but that gets complicated fast and it doesn't sound like you need it.

I would implement this as a hash table for lookup, and a separate <a href="https://en.wikipedia.org/wiki/Inverted_index" rel="nofollow noreferrer">inverted index</a> for your iteration. I think trying to turn those separate key segments into integers, as you asked in <a href="https://stackoverflow.com/q/50676031/56778">Ways to convert special-purpoes-strings to Integers</a>, is a bunch of unnecessary work.

There are plenty of good hash table implementations for C already available, so I won't go into that.

To create the inverted index for iteration, create N hash tables, where N is the number of key segments. Then, for each key, break it into its individual segments and add an entry for that value into the appropriate hash table. So if you have the key "abcxyzqgx", where:

<pre><code>k1 = abc
k2 = xyz
k3 = qgx
</code></pre>

Then in the k1 hash table you add an entry "abc=abcxyzqgx". In the k2 hash table you add an entry "xyz=abcxyzqgx". In the k3 hash table you add "qgx=abcxyzqgx". (The values, of course, wouldn't be the string keys themselves, but rather references to the string keys. Otherwise you'd have O(nk) 256-character strings.)

When you're done, your hash tables each have as keys the unique segment values, and the values are lists of keys in which those segments exist.

When you want to find all of the keys that have k1=abc and k3=qgx, you query the k1 hash table for the list of keys that contain abc, query the k3 hash table for the list of keys that contain qgx. Then you do an intersection of those two lists to obtain the result.
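The intersection step might look like the following sketch. It assumes each list is stored as an ascending array of key IDs (for example, indexes into an array of key strings), which is an assumption made for illustration and not something the scheme requires. `intersect_sorted` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Intersect two ascending lists of key IDs. Matches are written to out,
 * which must have room for min(na, nb) entries; returns the match count. */
size_t intersect_sorted(const unsigned *a, size_t na,
                        const unsigned *b, size_t nb,
                        unsigned *out)
{
    assert(out != NULL);
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            i++;                 /* a[i] cannot appear in b any more */
        } else if (a[i] > b[j]) {
            j++;                 /* b[j] cannot appear in a any more */
        } else {
            out[n++] = a[i];     /* present in both lists */
            i++;
            j++;
        }
    }
    return n;
}
```

If the k1 list for "abc" holds IDs {1, 4, 7, 9} and the k3 list for "qgx" holds {2, 4, 9, 12}, the result is {4, 9}. Each list is walked once, so the work is linear in the combined list lengths.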

Building the individual hash tables is a one-time cost of O(nk), where n is the total number of keys, and k is the number of key segments. Memory requirement, also, is O(nk). Granted, that's a bit expensive, but you're only talking about 1.6 million keys, total.

The cost of an iteration query is O(m*x), where m is the average number of keys referenced by an individual key segment, and x is the number of key segments in the query.

An obvious optimization is to put an LRU cache in front of this lookup, so that frequent queries are served from the cache.

Another possible optimization is to create additional indexes that combine key segments. For example, if queries frequently ask for both k1 and k2, and the number of possible combinations is reasonably small, then it makes sense to have a combined k1k2 cache. So if somebody searches for k1=abc and k2=xyz, you have a k1k2 cache that contains "abcxyz=[list of keys]".

I have to maintain an in-memory data structure of key-value pairs. I have the following constraints:

<ol>
<li>Both keys and values are text strings, of length 256 and 1024
respectively. A key generally looks like k1k2k3k4k5, each k(i) being a 4-8 byte string in itself.</li>
<li>As far as possible, the in-memory data structure should use contiguous memory. I have 400 MB worth of key-value pairs and am allowed 120% of that as allocation. (The additional 20% is for metadata, only if needed.)</li>
<li>The data structure will support the following operations:</li>
<li>Add [Infrequent Operation]: Typical signature looks like <code>void add_kv(void *ds, char *key, char *value);</code></li>
<li>Delete [Infrequent Operation]: Typical signature looks like <code>void del_kv(void *ds, char *key);</code></li>
<li>LookUp [MOST FREQUENT OPERATION]: Typical signature looks like <code>char *lookup(void *ds, char *key);</code></li>
<li>Iterate [MOST FREQUENT OPERATION]: This operation is prefix based. It allocates an iterator, i.e. it scans the whole data structure and returns a list of the key-value pairs that match prefix_key (e.g. "k1k2k3.*", with k(i) defined as above). Every iteration step then walks this iterator (list). Freeing the iterator frees the list. Typically, expect an iterator to return about 100 KB worth of key-value pairs from the 400 MB data structure (100 KB : 400 MB :: 1 : 4000). Typical signature looks like <code>void *iterate(void *ds, char *prefix_key);</code></li>
<li>Bullet 6 and Bullet 7 are the most frequent operations and need to be optimized for.</li>
</ol>

My question is: what is the best-suited data structure for the above constraints?

I have considered a hash table. Add/delete/lookup could be done in O(1), as I have sufficient memory, but it is not optimal for iteration. A hash of hashes (hash on k1, then on k2, then on k3...) or an array of hashes could be done, but that violates Bullet 2. What other options do I have?

Best suited data-structure for prefix based searches
