[奇怪但有用的数据结构]Trie前缀树

原创

杜逸先

发布于 2018-06-08 14:52:07

1.9K0

发布于 2018-06-08 14:52:07

想象一个这样的情景，有一个很大的字典包含了很多的单词，需要判断一个新单词是否在字典中，最直接也最快的办法就是使用哈希表了。现在添加一个条件，判断字典里是否存在单词以新单词为前缀，这时候哈希表就不合适了，因为存在单词在字典中但其前缀不在字典中的情况，例如[‘apple’, 'application','append']这个字典并不包含他们的公共前缀'app'，'ap'和‘a’。所以我们需要一种新的数据结构和算法来处理这类问题。

很显然如果我们知道哪些单词是以字母'a'开头的，就可以很方便的判断是否有单词是以'ap'为前缀的，从而不必理会以其他字母开头的单词。再进一步，我们可以从'ap'开头的单词中找是否有单词是以‘app’为前缀的，从'ab'开头的单词中找是否有单词一'aba'为前缀。于是这样的树形结构就构造出来了。

我们可以很容易地看出来这棵树中包含四个单词abc, apple, bad和bat。也可以轻松判断出存在单词以'app'为前缀，而没有'ad'开头的单词。

Trie前缀树

这样的树形结构就是前缀树(Trie)，也叫单词查找树，典型应用是用于统计和排序大量的字符串（但不仅限于字符串），所以经常被搜索引擎系统用于文本词频统计。它的优点是：最大限度地减少无谓的字符串比较，查询效率比哈希表高。它的基本性质如下：

根节点不包含字符，除根节点外每一个节点都只包含一个字符。
从根节点到某一节点，路径上经过的字符连接起来，为该节点对应的字符串。
每个节点的所有子节点包含的字符都不相同。

构造前缀树

首先我们要定义一下节点的数据结构。每一个节点可以跟着若干个子节点，因为都是字符，所以可以用哈希表来保存子节点。并且要有一个标记来表示这个字符是不是一个单词的最后一个字符，不然以上图为例，无法知晓单词'app'是否在这个字典中。这是我设计的Node类（__repr__方法用于清晰的展示节点的结构）。

class Node:
    def __init__(self):
        self.is_last = False
        self.children = {}
    def __repr__(self):
        return 'Node({},[{}])'.format(self.is_last, 
            ','.join('{}:{}'.format(x, repr(self.children[x])) for x in self.children))

接下来就是构造前缀树了，很自然的会有一个Node类型的根节点root:

class Trie:
    def __init__(self):
        self.root = Node()

然后是在合适的位置插入新单词（去掉已有前缀的部分）：

class Trie:
    def __init__(self):
        self.root = Node()
        
    def generate_node(self, word:str) -> Node:
        node = Node()
        if len(word) > 0:
            node.children[word[0]] = self.generate_node(word[1:])
        else:
            node.is_last = True
        return node

对于后续的操作（insert, search, startsWith）来说，寻找前缀是第一步,我这里使用的是一个get_node_rest方法，获得一个单词在前缀树中最长前缀的尾节点和剩余的字符。

class Trie:
    def __init__(self):
        self.root = Node()
        
    def generate_node(self, word:str) -> Node:
        #...
        
    
    def get_node_rest(self, root:Node, word:str)-> tuple:
        if not word:
            return root, ''
        elif word[0] in root.children:
            return self.get_node_rest(root.children[word[0]], word[1:])
        else:
            return root, word

获取了一个单词的node和rest,插入操作就是在node节点插入rest部分的字符，判断是否是前缀(startsWith)就是判断rest是否为空，如果node的is_last也同时为真的话就是存在这个单词了(search)。完整的代码如下：

#前缀树
class Node:
    def __init__(self):
        self.is_last = False
        self.children = {}
    def __repr__(self):
        return 'Node({},[{}])'.format(self.is_last,
            ','.join('{}:{}'.format(x, repr(self.children[x])) for x in self.children))

        

class Trie:
    def __init__(self):
        self.root = Node()
        
    def generate_node(self, word:str) -> Node:
        node = Node()
        if len(word) > 0:
            node.children[word[0]] = self.generate_node(word[1:])
        else:
            node.is_last = True
        return node
        
    def get_node_rest(self, root:Node, word:str)-> tuple:
        if not word:
            return root, ''
        elif word[0] in root.children:
            return self.get_node_rest(root.children[word[0]], word[1:])
        else:
            return root, word
    
    def insert(self, word):
        node, rest = self.get_node_rest(self.root, word)
        if rest:
            node.children[rest[0]] = self.generate_node(rest[1:])
        else:
            node.is_last = True
    
    def search(self, word):
        node, rest = self.get_node_rest(self.root, word)
        return rest == '' and node.is_last
    
    def startsWith(self, word):
        _, rest = self.get_node_rest(self.root, word)
        return len(rest) == 0

我们可以看一下Trie类的使用。