Can't Have Both Word Analogy and Global Word Co-occurrence Information? GloVe Implemented in PaddlePaddle Says You Can

User 1386409

Published 2020-08-11 16:36:19

Included in the column: PaddlePaddle

Official website:

https://nlp.stanford.edu/projects/glove/

Paper:

https://nlp.stanford.edu/pubs/glove.pdf

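The GloVe model reproduced on the PaddlePaddle core framework performs well. On the training text given in the paper, which contains 17M words after preprocessing, training takes about 1000 s. That is fast enough for word-vector training and makes the model suitable for large-scale text data. For model details and code, see: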

https://github.com/fiyen/PaddlePaddle-GloVe

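The GloVe paper on word-vector training appeared shortly after the Word2Vec tool. There are two classic ways to represent the feature space of text:

Latent semantic analysis: factorizes the word-pair co-occurrence matrix to extract latent information from the text;
Word2Vec: trains a vector for each word by maximizing the co-occurrence probability of word sequences within a fixed-length window.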

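Each approach has strengths and weaknesses. Latent semantic analysis works with global word co-occurrence information and so makes good use of the statistics of the text, but it performs poorly on tasks such as word analogy and does not yield an optimal vector-space structure. Word2Vec does very well on word analogy and similar tasks, but because it is trained within local windows it does not make good use of global word co-occurrence information.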

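The GloVe paper combines the strengths of both and proposes a weighted least-squares model based on global word co-occurrence information. Because word-pair statistics heavily compress the text (repeated information is merged), the model adds a fairly cheap preprocessing stage that collects co-occurring word pairs and their co-occurrence frequencies, which greatly reduces training time relative to Word2Vec.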

https://aistudio.baidu.com/aistudio/projectdetail/487604

For the detailed code of the PaddlePaddle GloVe reproduction, see:

https://aistudio.baidu.com/aistudio/projectdetail/628391
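
For reference, the weighted least-squares model mentioned above has the following objective in the GloVe paper. Here X_ij is the co-occurrence value of words i and j, V is the vocabulary size, w and b are word vectors and biases, and f is the weighting function controlled by x_max and alpha. The tilde symbols are the paper's separate context vectors and biases; the code in this article looks both words up in a single embedding table, so read them as the paper's notation rather than as a description of this implementation.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha}, & x < x_{\max} \\
1, & \text{otherwise}
\end{cases}
```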

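Introduction to the Word Co-occurrence Matrix

1. How to handle the word co-occurrence matrix

All words in the document are numbered by frequency, starting from 1, so that a more frequent word gets a smaller number; this number is called the rank below. Word pairs are then collected by scanning the document line by line. The CoOccur class stores the word pairs that are kept in memory. It has three methods, pair, check and get_pairs, which are introduced one by one below.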

class CoOccur:
    """
    store word pairs in memory
    form: {w1: {w2: frequency(w1, w2)}, ...} in which the rank of w1 < rank of w2
    """
    def __init__(self):
        self.cooccur = {}
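
The article does not show how the rank is computed. As a minimal sketch, assuming whitespace-tokenized lines, it could be built with a Counter as below. The helper name build_rank and the alphabetical tie-break are illustrative assumptions, not taken from the repository.

```python
from collections import Counter


def build_rank(lines):
    """Map each word to its frequency rank (1 = most frequent). Hypothetical helper."""
    counts = Counter(word for line in lines for word in line.split())
    # Sort by descending count; ties are broken alphabetically so the result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ordered, start=1)}


rank = build_rank(["the cat sat", "the cat ran", "the dog"])
print(rank["the"], rank["cat"])  # 1 2
```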

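Start with pair, which receives a word pair and the distance between the two words. A pair is kept in memory when it is frequent enough in the text. The criterion is max_product: the pair qualifies if the product of the two words' ranks is less than max_product (the smaller a word's rank, the more frequent it is). For such a pair, pair stores its information in the cooccur dictionary of the CoOccur class, which has the form {w1: {w2: frequency(w1, w2)}, ...}. Note that the word with the smaller rank must come first and the word with the larger rank second (the reverse convention works just as well), so that (w1, w2) and (w2, w1) are not treated as two different pairs.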

def pair(self, w1, w2, dis):
    if w1 in self.cooccur.keys():
        if w2 in self.cooccur[w1].keys():
            self.cooccur[w1][w2] += 1.0 / dis
        else:
            self.cooccur[w1][w2] = 1.0 / dis
    else:
        self.cooccur[w1] = {}
        self.cooccur[w1][w2] = 1.0 / dis
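
The pair method above expects its caller to pass the two words already ordered by rank. A small standalone sketch of that ordering step plus the 1/dis accumulation might look like this. The function add_pair and the rank dictionary are assumptions made for illustration.

```python
def add_pair(store, rank, w1, w2, dis):
    """Accumulate 1/dis for a pair, keyed with the lower-rank (more frequent) word first."""
    if rank[w1] > rank[w2]:
        w1, w2 = w2, w1  # canonical order, so (w1, w2) and (w2, w1) share one entry
    inner = store.setdefault(w1, {})
    inner[w2] = inner.get(w2, 0.0) + 1.0 / dis


rank = {"the": 1, "cat": 2}
store = {}
add_pair(store, rank, "the", "cat", 1)  # adjacent words contribute 1.0
add_pair(store, rank, "cat", "the", 2)  # same pair seen reversed, two words apart: 0.5
print(store)  # {'the': {'cat': 1.5}}
```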

check looks up a word pair. Given the pair (w1, w2), it returns the pair's co-occurrence value. The pair must again follow the rank ordering.

def check(self, w1, w2):
    if w1 in self.cooccur.keys():
        if w2 in self.cooccur[w1].keys():
            return self.cooccur[w1][w2]
        else:
            return 0
    else:
        return 0

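get_pairs returns all word pairs held in memory as a generator. Note that this function is used during training to enumerate the pairs, so it does not need to return their co-occurrence values.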

def get_pairs(self):
    """
    get all word pairs like [(w1, w2), (w3, w4)...]
    :return:
    """
    # return iterator
    return ((w1, w2) for w1 in self.cooccur.keys() for w2 in self.cooccur[w1].keys())

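At this point some readers may ask: won't this approach blow up memory? An n×n matrix takes a lot of memory once n is large enough (the example given is n=500, at roughly 600M), and memory consumption grows quadratically with n.

A careful read of the author's source code shows that memory is handled in a simple, blunt way. A max_product threshold is set (the same max_product mentioned above). When the product of the ranks of a word pair is less than max_product, the pair and its co-occurrence value are kept in memory. A buffer is also allocated: pairs above the threshold are recorded in the buffer together with their co-occurrence values, and when the buffer is full its pairs are sorted by co-occurrence value and written to a file.

Note that each time the buffer is written out, and once more at the very end, the pairs in the buffer must be matched against those already saved to files, and pairs that were already saved must be merged.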
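
The routing rule can be sketched in a few lines. The function name route_pair is hypothetical; the default threshold of 1e8 is the max_product default used by the GloVe class in this article.

```python
def route_pair(rank1, rank2, max_product=1e8):
    """Return 'memory' for frequent pairs and 'buffer' for the sparse tail."""
    return "memory" if rank1 * rank2 < max_product else "buffer"


print(route_pair(10, 2000))      # memory (product 2e4)
print(route_pair(50000, 40000))  # buffer (product 2e9)
```

Because the test is on the product of ranks, a very frequent word stays in memory with almost any partner, while two rare words always go to the buffer.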

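The Buffer class defined below does exactly this: it temporarily holds the out-of-range word pairs and handles saving and reloading them once the buffer overflows. The buffer acts as a temporary warehouse. Pairs that do not qualify for in-memory storage are held there, and once enough have accumulated they are moved to disk and a fresh buffer is started.

As shown below, size is the number of word pairs the buffer can hold; when that number is exceeded the pairs are moved to disk and the buffer is emptied. cache_path sets the location of the temporary files.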

class Buffer:
    """
    overflow buffer of the co-occurrence matrix. Save buffer in the cache file if buffer
    is full, label the buffer number and load buffer, lookup word pairs if needed. The
    word pairs will be sorted by their production of frequency rank and stored in cache file.
    cache_size:
    """
    def __init__(self, size=1e6, cache_path='cache'):
        self.size = size
        self.cooccur = {}
        self.cache_path = cache_path
        self.count = 0
        self.num_saved = 0

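Like CoOccur, Buffer stores word-pair information, so it also provides pair, check and get_pairs. In addition, Buffer has update_file, which merges the data in the buffer with the temporary files so that pairs already saved to a temporary file are not counted twice.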

def update_file(self):
    """
    update the saved cache files to avoid the duplicate word pairs.
    :return:
    """
    if self.num_saved > 0:
        for i in range(1, self.num_saved + 1):
            new_f = open(self.cache_path+'/buffer2bin_' + str(i) + 'tem.bin', 'w')
            with open(self.cache_path+'/buffer2bin_' + str(i) + '.bin', 'r') as f:
                for line in f:
                    w_1, w_2, freq = line.split()
                    freq = float(freq)
                    if w_1 in self.cooccur:
                        if w_2 in self.cooccur[w_1]:
                            new_freq = freq + self.cooccur[w_1].pop(w_2)
                            new_f.write(w_1 + ' ' + w_2 + ' ' + str(new_freq) + '\n')
                            if not self.cooccur[w_1]:
                                self.cooccur.pop(w_1)
                        else:
                            new_f.write(w_1 + ' ' + w_2 + ' ' + str(freq) + '\n')
                    else:
                        new_f.write(w_1 + ' ' + w_2 + ' ' + str(freq) + '\n')
            new_f.close()
            os.remove(self.cache_path+'/buffer2bin_' + str(i) + '.bin')
            os.rename(self.cache_path+'/buffer2bin_' + str(i) + 'tem.bin', self.cache_path+'/buffer2bin_' + str(i) + '.bin')

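Note that pair needs to call update_file to refresh the temporary files. Why? A pair that has been moved to disk can still show up again later in the scan. When it does, the buffer starts counting it from zero, so without merging with the temporary files the same pair would appear several times, each with a different co-occurrence value.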

def pair(self, w1, w2, dis):
    if w1 in self.cooccur.keys():
        if w2 in self.cooccur[w1].keys():
            self.cooccur[w1][w2] += 1.0 / dis
        else:
            self.cooccur[w1][w2] = 1.0 / dis
            self.count += 1
    else:
        self.cooccur[w1] = {}
        self.cooccur[w1][w2] = 1.0 / dis
        self.count += 1

    if self.count >= self.size:
        self.update_file()
        pairs = [(w_1, w_2, self.cooccur[w_1][w_2]) for w_1 in self.cooccur.keys() for w_2 in self.cooccur[w_1].keys()]
        pairs = sorted(pairs, key=lambda x: x[2], reverse=True)
        self.num_saved += 1
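        with open(self.cache_path+'/buffer2bin_' + str(self.num_saved) + '.bin', 'w') as f:
            for pair in pairs:
                f.write(str(pair[0]) + ' ' + str(pair[1]) + ' ' + str(pair[2]) + '\n')
        # empty the buffer after flushing it to disk, otherwise the same pairs would be written again
        self.cooccur = {}
        self.count = 0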
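
To see the merge logic without touching the disk, here is a simplified sketch that treats a saved file as a list of "w1 w2 freq" lines. The function merge_counts is an illustrative stand-in for update_file, not the repository's code.

```python
def merge_counts(saved_lines, buffer):
    """Fold buffer counts into saved 'w1 w2 freq' lines; matched pairs leave the buffer."""
    merged = []
    for line in saved_lines:
        w1, w2, freq = line.split()
        freq = float(freq)
        if w1 in buffer and w2 in buffer[w1]:
            freq += buffer[w1].pop(w2)  # add the newer count and drop it from the buffer
            if not buffer[w1]:
                del buffer[w1]
        merged.append("{} {} {}".format(w1, w2, freq))
    return merged


saved = ["a b 1.0", "a c 0.5"]
buffer = {"a": {"b": 0.25}, "d": {"e": 2.0}}
merged = merge_counts(saved, buffer)
print(merged)  # ['a b 1.25', 'a c 0.5']
print(buffer)  # {'d': {'e': 2.0}}
```

Only the pairs left in the buffer after the merge are genuinely new, so only those need to go into the next cache file.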

When get_pairs is called, the pairs stored temporarily on disk must be returned as well, so get_pairs is extended as follows:

def get_pairs(self):
    """
    get all word pairs like [(w1, w2), (w3, w4)...]
    :return:
    """
    if self.num_saved > 0:
        for i in range(1, self.num_saved + 1):
            f = open(self.cache_path+'/buffer2bin_' + str(i) + '.bin', 'r')
            for line in f:
                w_1, w_2, freq = line.split()
                yield w_1, w_2
            f.close()
    for w_1 in (i for i in self.cooccur.keys()):
        for w_2 in (j for j in self.cooccur[w_1].keys()):
            yield w_1, w_2

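The variable num_saved controls whether temporary files are read. It records how many temporary files have been saved so far, and besides controlling the reads it also serves as the file-name suffix that tells the saved files apart.

For the same reason, looking up the co-occurrence value of a word pair also has to search the pairs already moved to temporary files, that is, scan those files.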

def check(self, w1, w2):
    flag = 0
    while True:
        if w1 in self.cooccur.keys() and flag == 0:
            if w2 in self.cooccur[w1].keys():
                return self.cooccur[w1][w2]
            else:
                flag = 1
                continue
        elif self.num_saved > 0:
            for i in range(1, self.num_saved+1):
                f = open(self.cache_path+'/buffer2bin_' + str(i) + '.bin', 'r')
                for line in f:
                    w_1, w_2, freq = line.split()
                    freq = float(freq)
                    if w_1 == w1 and w_2 == w2:
                        f.close()
                        return freq
                f.close()
            return 0
        else:
            return 0

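2. How to count word pairs

When counting word pairs, the same pair within the window distance must not be counted repeatedly as the window slides. A simple strategy handles this; call it backward traversal. Take a simple example. Suppose the text is "1 2 3 4 5 6 7 8". Fix the center word from left to right in turn (1, 2, 3, 4, 5, 6, 7, 8). Then, starting from the center word, walk back over the preceding words for one window length. With center word 6 and window length 5, the co-occurring pairs and distances are ((6,5),1), ((6,4),2), ((6,3),3), ((6,2),4), ((6,1),5). Why does this avoid counting a pair twice? The center word moves forward through the text while context words are only taken from behind it, so every pair is recorded exactly once, at the moment its later word is the center.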

for num, words in enumerate(text):
    for index, w in enumerate(words):
        if self.words_counter[w] > self.min_count:
            pre = max(0, index - self.window)
            length = len(words[pre:index])
            if length > 0:
                for i, w_ in enumerate(words[pre:index]):
                    if self.words_counter[w_] > self.min_count:
                        ind_1 = self.vocab_index[w]
                        ind_2 = self.vocab_index[w_]
                        dis = length - i
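
The same enumeration can be written as a small standalone generator, which makes the "6 with window 5" example easy to check. This is a sketch: the name backward_pairs is an assumption, and the min_count filtering from the listing above is left out.

```python
def backward_pairs(words, window):
    """Yield ((center, context), distance), looking only at words before the center."""
    for index, center in enumerate(words):
        start = max(0, index - window)
        for offset, context in enumerate(words[start:index]):
            # same distance as `length - i` in the listing above
            yield (center, context), index - start - offset


text = "1 2 3 4 5 6 7 8".split()
pairs_for_6 = [p for p in backward_pairs(text, 5) if p[0][0] == "6"]
print(pairs_for_6)
# [(('6', '1'), 5), (('6', '2'), 4), (('6', '3'), 3), (('6', '4'), 2), (('6', '5'), 1)]
```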

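Training

1. Defining the GloVe tool

fluid.Embedding is used to define the word-vector parameters and the bias parameters. The bias is also trainable, and because it has to be looked up during training it is defined as an Embedding layer too, with dimension 1.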

self.embedding = fluid.Embedding(size=[len(self.vocab_index), self.dimension],
                                         param_attr=fluid.ParamAttr(name='embedding',
                                                                    initializer=fluid.initializer.UniformInitializer(
                                                                        low=-self.init_scale, high=self.init_scale
                                                                    )))
self.bias = fluid.Embedding(size=[len(self.vocab_index), 1],
                            param_attr=fluid.ParamAttr(name='bias',
                                                        initializer=fluid.initializer.ConstantInitializer(0.0)))

The following uses dynamic-graph mode as the example. For details on both static-graph and dynamic-graph modes, see:

https://aistudio.baidu.com/aistudio/projectdetail/628391

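The parameters are as follows:

dimension: dimensionality of the word vectors.
min_count: words with frequency at or below min_count are ignored.
window: window length for word co-occurrence.
learning_rate: learning rate for training.
x_max, alpha: parameters of the weighting function.
max_product: avoid changing this value lightly; it affects memory consumption and computation speed. See the word co-occurrence matrix section for its role.
overflow_buffer_size: avoid changing this value lightly; it determines the buffer size and also affects memory consumption and computation speed.
use_gpu: whether to train on the GPU. It has no effect in dynamic-graph mode, because fluid.dygraph.guard() has to be called before the model is instantiated.
init_scale: initialization range of the word vectors.
verbose: show information (1) or not (0).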
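
The training code calls a weighting function W that is not shown in the article. Assuming it is the weighting function from the GloVe paper, controlled by x_max and alpha, a minimal sketch would be the following. The standalone name weight is illustrative.

```python
def weight(x, x_max=100, alpha=0.75):
    """GloVe weighting function: (x / x_max) ** alpha below x_max, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


print(weight(100))   # 1.0
print(weight(1000))  # 1.0
print(weight(6.25))  # about 0.125, since (1/16) ** 0.75 = 1/8
```

The cap keeps very frequent pairs from dominating the loss, while the power below x_max down-weights rare, noisy pairs.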

class GloVe(fluid.dygraph.Layer):
    """
    dimension: the dimensionality of word embedding.
    min_count: the words with frequency lower than min_count will be neglected.
    window: the window of word co-occurrence, words that co-occur in distance of more than window words will not be counted.
    learning_rate:
    x_max: 1/2 of weight function "W", to weigh the loss of two co-occurred words.
    alpha: 2/2 of weight function "W".
    max_product: Do not easily change this parameter. To limit the product of word rank because the words co-occurrence
    matrix become sparse in the right-bottom, i.e., the word pairs that with very large production of their frequency rank.
    overflow_buffer_size: Do not easily change this parameter. To provide a buffer for the word pairs with production
    exceeding max_product, if the buffer size exceeds overflow_buffer_size, save it as cache file.

    NOTE: the input must be string elements, except type (like int) may cause unpredictable problems.
    """
    def __init__(self, dimension=100, min_count=5, window=15, learning_rate=0.05,
                 x_max=100, alpha=3/4, max_product=1e8, overflow_buffer_size=1e6,
                 use_gpu=True, init_scale=0.1, verbose=1):
        super(GloVe, self).__init__()
        self.dimension = dimension
        self.min_count = min_count
        self.window = window
        self.learning_rate = learning_rate
        self.x_max = x_max
        self.alpha = alpha
        self.init_scale = init_scale
        self.max_product = max_product
        self.overflow_buffer_size = overflow_buffer_size
        self.verbose = verbose
        self.place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace()
        self.built_opt = False  # whether the optimizer has been created
        self.emb_numpy = None

The forward pass of the GloVe tool is defined as follows:

def forward(self, w_freq, freq, w1, w2):
    """
    core of dygraph logits
    """
    bias_1 = self.bias(w1)
    bias_1 = fluid.layers.reshape(bias_1, shape=(-1, 1))
    bias_2 = self.bias(w2)
    bias_2 = fluid.layers.reshape(bias_2, shape=(-1, 1))
    emb_1 = self.embedding(w1)
    emb_2 = self.embedding(w2)
    mul = fluid.layers.elementwise_mul(emb_1, emb_2)
    mul = fluid.layers.reduce_sum(mul, dim=1)
    diff = mul + bias_1 + bias_2 - fluid.layers.log(freq)
    loss = diff * diff * w_freq
    loss = fluid.layers.reduce_mean(loss)
    return loss

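One more remark. Because the model's loss is not a cross-entropy loss, the time-consuming softmax computation is avoided; the loss is computed directly from the interaction of the two words' vectors. This is a clever design: the expensive work is moved into preprocessing, and with no softmax in training there is no need for hierarchical softmax (Huffman tree) or negative sampling, which greatly speeds up computation.

The forward pass takes four inputs: w1 and w2, the two columns obtained by splitting the word pairs, followed by the co-occurrence frequency freq and its corresponding weight w_freq (the weighting function applied to freq).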
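
To make the arithmetic of the forward pass concrete, here is a NumPy sketch of the same loss on toy values. It is not the PaddlePaddle code: the function glove_loss and the hand-picked arrays are assumptions, and all per-pair quantities are kept one-dimensional.

```python
import numpy as np


def glove_loss(emb, bias, w1, w2, freq, w_freq):
    """Mean weighted squared error over a batch of word pairs (one shared embedding table)."""
    dot = np.sum(emb[w1] * emb[w2], axis=1)          # shape (batch,)
    diff = dot + bias[w1] + bias[w2] - np.log(freq)  # shape (batch,)
    return float(np.mean(w_freq * diff ** 2))


emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy vectors, not trained
bias = np.zeros(3)
w1 = np.array([0, 1])            # first word index of each pair
w2 = np.array([2, 2])            # second word index of each pair
freq = np.array([1.0, 1.0])      # log(freq) is 0 for both pairs
w_freq = np.array([1.0, 0.5])    # per-pair weights
loss = glove_loss(emb, bias, w1, w2, freq, w_freq)
print(loss)  # 0.75
```

Both dot products equal 1 and the log term is 0, so each squared error is 1 and the loss is the mean of the weights.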

The GloVe tool calls _glove to run the forward pass and back-propagate the loss, and returns the current loss:

def _glove(self, _pairs):
    """
    to update the embedding and bias.
    :param _pairs:
    :return:
    """
    freq = []
    w1 = []
    w2 = []
    w_freq = []
    for index, (w1_, w2_) in enumerate(_pairs):
        ind_1 = self.vocab_index[w1_]
        ind_2 = self.vocab_index[w2_]
        val = self.check(w1_, w2_)
        w_freq.append(self.W(val))
        freq.append(val)
        w1.append(ind_1)
        w2.append(ind_2)
    freq = np.asarray(freq, dtype='float32')
    w_freq = np.asarray(w_freq, dtype='float32')
    w1 = np.asarray(w1, dtype='int64')
    w2 = np.asarray(w2, dtype='int64')
    freq = fluid.dygraph.to_variable(freq)
    w_freq = fluid.dygraph.to_variable(w_freq)
    w1 = fluid.dygraph.to_variable(w1)
    w2 = fluid.dygraph.to_variable(w2)
    loss = self.forward(w_freq, freq, w1, w2)
    if not self.built_opt:
        self.opt = fluid.optimizer.Adam(learning_rate=self.learning_rate, parameter_list=self.parameters())
        self.built_opt = True
    loss.backward()
    self.opt.minimize(loss)
    return loss.numpy()[0]

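The GloVe tool takes text data through fit_train, trains on it, and produces the trained vectors. Once a GloVe instance has been created, calling fit_train is all that is needed to train. The parameters are:

text: the text to train on, in the form [text1, text2, ...] or [[word11, word12, ...], [word21, word22, ...], ...]. That is, text is a list holding either raw text sequences or already-tokenized sequences of words in their original order.
epochs: number of training epochs.
batch_size: number of word pairs trained at once in each step.
verbose_int: int controlling how often training information is printed; the smaller the value, the more frequent the output.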

def fit_train(self, text, epochs=1, batch_size=4000, verbose_int=10):
    """
    Fit the text data, get the vocabulary of the whole text. Then randomly initialize the embeddings of every word.
    and train text. the fit and train operation are called simultaneous because the fit will get the co-occurrence
    matrix and the matrix only fit the text. when the text changes, the co-occurrence matrix will also changes and
    it will need fit again.
    The embeddings will be stored in a dict formed: {word: embedding, ...}
    :param text: the collects of text, or split words from text. form: [text1, text2, ...] or [[word11, word12, ...],
    [word21, word22, ...], ...]
    :param epochs: training epochs
    :param batch_size: the number of word pairs trained in one step.
    :param verbose_int: the interval of printing information while training.
    :return:
    """
    self._fit(text)
    start = time.time()
    total_step = 0
    total_pairs = 0
    for pairs in self.get_pairs(n=batch_size):
        total_step += 1
        total_pairs += len(pairs)
    for epoch in range(epochs):
        epoch_start = time.time()
        step = 0
        len_pairs = 0
        ave_loss = 0.0
        for pairs in self.get_pairs(n=batch_size):
            len_ = len(pairs)
            step += 1
            np.random.shuffle(pairs)
            loss = self._glove(pairs)
            ave_loss = (loss * len_ + ave_loss * len_pairs) / (len_pairs + len_)
            len_pairs += len_
            if self.verbose:
                if step % verbose_int == 0:
                    print("{}/{} epochs - {}/{} pairs - ETA {:.0f}s - loss: {:.4f} ...".format(str(epoch+1).rjust(len(str(epochs))),
                            epochs, str(len_pairs).rjust(len(str(total_pairs))), total_pairs, (time.time() - epoch_start) / step * (total_step - step),
                            loss))
        if self.verbose:
            print("{}/{} epochs - cost time {:.0f}s - ETA {:.0f}s - loss: {:.4f} ...".format(str(epoch+1).rjust(len(str(epochs))),
                         epochs, time.time() - start, (time.time() - start) / (epoch+1) * (epochs - epoch - 1), ave_loss))
    if self.verbose:
        print("training complete, cost time {:.0f}s.".format(time.time() - start))
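
fit_train relies on self.get_pairs(n=batch_size), which the article does not show. As a sketch under that caveat, grouping a pair generator into lists of at most n items could be done as below, followed by the same running-average update that fit_train uses for ave_loss. The helper name batched and the toy numbers are assumptions.

```python
from itertools import islice


def batched(pairs, n):
    """Group an iterable of word pairs into lists of at most n items. Hypothetical helper."""
    iterator = iter(pairs)
    while True:
        batch = list(islice(iterator, n))
        if not batch:
            return
        yield batch


pairs = (("w%d" % i, "w%d" % (i + 1)) for i in range(5))
sizes = [len(b) for b in batched(pairs, 2)]
print(sizes)  # [2, 2, 1]

# Running average weighted by batch size, as in fit_train.
ave_loss, seen = 0.0, 0
for loss, size in [(1.0, 2), (4.0, 2)]:
    ave_loss = (loss * size + ave_loss * seen) / (seen + size)
    seen += size
print(ave_loss)  # 2.5
```

Yielding lists rather than generators matters here, because fit_train shuffles each batch in place with np.random.shuffle.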

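Reproduction Results

The quality of trained word vectors can be shown quite directly. We test with a few common applications: word similarity, similar-word lookup, and word analogy.

Word similarity: the cosine similarity between the vectors of two words. It ranges from -1 to 1, and a larger value means the words are more similar.
Similar words: the cosine similarity between the target word and every word in the vocabulary is computed, and the words with the highest similarity are returned.
Word analogy: has the form "A to B as C to D", meaning the relation between A and B is like the relation between C and D, for example 'man' to 'king' as 'woman' to 'queen'.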
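
The three tests can be tried on hand-made vectors before using trained ones. The sketch below uses made-up 2-D vectors, not trained embeddings, and excludes the three input words from the analogy candidates, which is a common convention and an assumption here.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors: axis 0 stands for "royalty", axis 1 for "gender".
emb = {
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
}


def analogy(a, b, c):
    """a is to b as c is to ?  ->  word nearest to emb[b] - emb[a] + emb[c]."""
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


print(cosine(emb["man"], emb["woman"]))  # -1.0
print(analogy("man", "king", "woman"))   # queen
```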

The tests use the GloVeEval tool, which bundles the three functions above as get_similarity, get_similar_word and word_analogy. The implementation is shown below.

class GloVeEval:
    def __init__(self, model):
        self.model = model
        self.emb_numpy = None

    def word_analogy(self, word1, word2, word3, words_list=None, verbose=1):
        """
        word1 is to word2 as word3 is to ? (? in the words_list)
        emb_target = emb_word2 - emb_word1 + emb_word3
        :param verbose: whether or not to show the target(?).
        :param words_list: provide words_list to choose target(?). if is None, the words_list is the vocabulary
        :param word1:
        :param word2:
        :param word3:
        :return:
        """
        target = self.get_embedding(word2) - self.get_embedding(word1) + self.get_embedding(word3)
        target = self.get_similar_word(target, 1, words_list=words_list, verbose=0)
        if verbose:
            print("{} is to {} as {} is to {}".format(word1, word2, word3, target))
        return target

    def get_similarity(self, word1, word2):
        """
        get the cosine similarity of two words
        :param word1:
        :param word2:
        :return:
        """
        emb_1 = self.get_embedding(word1)
        emb_2 = self.get_embedding(word2)
        return np.dot(emb_1, emb_2) / np.sqrt(np.dot(emb_1, emb_1) * np.dot(emb_2, emb_2) + 1e-9)

    def get_similar_word(self, word, k, words_list=None, verbose=1):
        """
        get the top_k most similar words of word in the words_list.
        :param words_list: provide words_list to choose the k most similar words. if is None, the words_list is the vocabulary
        :param verbose: whether (1) or not (0) to print the k words
        :param word: string or embedding
        :param k:
        :return:
        """
        if words_list is None:
            vocab_emb = self.get_vocab_emb()
        else:
            if k > len(words_list):
                raise ValueError(
                    'Not enough words to choose {} most similar words. {} > the length of words_list'.format(k, k))
            vocab_emb = {w: self.get_embedding(w) for w in words_list}
        if isinstance(word, str):
            emb = self.get_embedding(word)
        else:
            emb = word
        word2emb_list = [w for w in vocab_emb.items()]
        vocab_emb = np.array([x[1] for x in word2emb_list])
        cos = np.dot(vocab_emb, emb) / np.sqrt(np.sum(vocab_emb * vocab_emb, axis=1) * np.sum(emb * emb) + 1e-9)
        flat = cos.flatten()
        indices = np.argpartition(flat, -k)[-k:]
        indices = indices[np.argsort(-flat[indices])]
        k_words = [word2emb_list[i][0] for i in indices]
        if verbose:
            print('The {} most similar words to {} are(is) {}.'.format(k, word, str(k_words)))
        return k_words

    def get_embedding(self, word):
        """
        get the embedding of word
        :param word:
        :return:
        """
        if self.emb_numpy is None:
            self.emb_numpy = self.model.embedding.parameters()[0].numpy()
        emb = self.emb_numpy
        if word in self.model.vocab_index.keys():
            index = self.model.vocab_index[word]
        else:
            raise KeyError("Can't find word '%s' in dictionary." % word)
        return emb[index]

    def get_vocab_emb(self):
        """
        get the embedding of the vocabulary.
        :return: a dict with words as the keys and embeddings as the values.
        """
        return {w: self.get_embedding(w) for w in self.model.vocab_index.keys()}

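The reproduction results are as follows. They show that the trained word-vector space reflects the relationships between words reasonably well.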