What is the UNK token in vector representation of words?

Stack Overflow user

Asked on 2017-08-17 12:38:28

1 answer · 13.5K views · 0 followers · 12 votes

# Step 2: Build the dictionary and replace rare words with UNK token.
import collections

vocabulary_size = 50000


def build_dataset(words, n_words):
  """Process raw inputs into a dataset."""
  # Entry 0 is reserved for 'UNK'; -1 is a placeholder count, filled in below.
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(n_words - 1))
  # Map each kept word to an integer id; 'UNK' gets id 0.
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  # Replace the placeholder with the real number of out-of-vocabulary words.
  count[0][1] = unk_count
  reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reversed_dictionary

# `vocabulary` is assumed to be the list of word tokens loaded in an earlier step (not shown here).
data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                            vocabulary_size)

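I'm studying the basic example of vector representation of words with TensorFlow.

Step 2 of this example is titled "Build the dictionary and replace rare words with UNK token", but nothing before it defines what "UNK" refers to.

To make the question specific:

0) What does UNK generally refer to in NLP?

1) What does count = [['UNK', -1]] mean? I know the brackets [] denote a list in Python, but why is 'UNK' paired with -1?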

tensorflow

1 answer

Stack Overflow user

Posted on 2019-08-03 12:18:13

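As already mentioned in the comments, when you see an UNK token in tokenization and NLP, it most likely stands for an unknown word.

For example, suppose you want to predict a missing word in a sentence. How would you feed your data to the model? You definitely need a token that shows where the missing word is. So if "house" is the missing word, the sentence looks like this after tokenization: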

'my house is big' -> ['my', 'UNK', 'is', 'big']
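In the code from the question, the replacement is driven by vocabulary membership: any word that is not among the kept words is mapped to the UNK index. A minimal sketch of that idea follows. The helper name `replace_unknown` and the toy vocabulary are assumptions made for illustration and do not appear in the original tutorial.

```python
def replace_unknown(tokens, vocabulary, unk_token="UNK"):
    """Return a copy of tokens with out-of-vocabulary words replaced by unk_token."""
    return [tok if tok in vocabulary else unk_token for tok in tokens]


# Hypothetical toy vocabulary that happens not to contain "house".
vocabulary = {"my", "is", "big"}
tokens = "my house is big".split()

print(replace_unknown(tokens, vocabulary))  # ['my', 'UNK', 'is', 'big']
```

With a real corpus the vocabulary would typically be the most frequent words, so rare words are the ones that end up as UNK.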

PS: count = [['UNK', -1]] initializes count in the form [['word', number_of_occurrences]], as Ivan already said.
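The -1 appears to be only a placeholder: the question's code later overwrites it with the real number of unknown words via count[0][1] = unk_count. The sketch below traces this on a small made-up word list. The sample words and the reduced n_words value are assumptions chosen for illustration.

```python
import collections

# Made-up corpus, used only for illustration.
words = ["the", "cat", "the", "dog", "the", "cat", "bird"]
n_words = 3  # keep 'UNK' plus the 2 most frequent words

# 'UNK' is stored as a list so its count can be updated in place later.
count = [["UNK", -1]]
count.extend(collections.Counter(words).most_common(n_words - 1))

# Assign ids in order of appearance in count, so 'UNK' gets id 0.
dictionary = {word: index for index, (word, _) in enumerate(count)}

# Every word outside the dictionary counts as one occurrence of 'UNK'.
unk_count = sum(1 for word in words if word not in dictionary)
count[0][1] = unk_count

print(count)  # [['UNK', 2], ('the', 3), ('cat', 2)]
```

Here "dog" and "bird" fall outside the two most frequent words, so the placeholder -1 is replaced by 2.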

Votes: 7

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/45735357
