Q: Associating text vectors with their dictionary keys

Stack Overflow user

Asked 2019-09-22 19:49:41

1 answer · 1.1K views · 0 followers · 0 votes

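I retrieve texts from a sqlite3 database. I want to compare the texts for similarity by first turning them into vectors with CountVectorizer. I also have a dictionary that stores each text under its messageID (the dictionary key). How can I associate each text vector with its messageID? For example, given an array of vectors like this: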

    [[1 1 0 1 1 0 1]
     [0 1 1 1 1 0 1]
     [0 1 0 1 1 1 1]]

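I want to know that messageID = 0 has the vector [1 1 0 1 1 0 1]. The vector length and the array both grow with every new message.

I tried passing the dictionary to CountVectorizer, and I tried evaluating just one message at a time, but neither worked.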

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity as cosineSimilarity


def getVectorsAndFeatures(strs):
    text = [t for t in strs]
    vectorizer = CountVectorizer(text)
    vectorizer.fit(text)
    vectors = vectorizer.transform(text).toarray()
    features = vectorizer.get_feature_names()
    return vectors, features


text = ['This is the first sentence', 'This is the second sentence',
        'This is the third sentence']
messageDict = {0: 'This is the first sentence', 1: 'This is the second sentence', 2: 'This is the third sentence'}

vectors, features = getVectorsAndFeatures(text)

python

scikit-learn

countvectorizer

1 Answer

Stack Overflow user

Accepted answer

Answered 2019-09-24 23:54:25

Following your example, you have a mapping between message IDs and sentences:

>>> text = ['This is the first sentence', 'This is the second sentence',
...         'This is the third sentence']
>>> message_map = dict(zip(range(len(text)), text))
>>> message_map
{0: 'This is the first sentence', 1: 'This is the second sentence', 2: 'This is the third sentence'}

Next, you use CountVectorizer to count how many times each text feature occurs in each sentence. You can run the same analysis you already have:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer()
>>> # Learn the vocabulary and return the document-term matrix
>>> vectors = vectorizer.fit_transform(message_map.values()).toarray()
>>> vectors
array([[1, 1, 0, 1, 1, 0, 1],
       [0, 1, 1, 1, 1, 0, 1],
       [0, 1, 0, 1, 1, 1, 1]], dtype=int64)
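>>> # Get the feature (word) that each count column refers to.
>>> # get_feature_names() was removed in scikit-learn 1.2; newer versions
>>> # use get_feature_names_out(), which returns a NumPy array instead.
>>> features = vectorizer.get_feature_names()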
>>> features
['first', 'is', 'second', 'sentence', 'the', 'third', 'this']

The fit_transform() documentation says:

fit_transform(self, raw_documents, y=None)
Parameters: raw_documents : iterable. An iterable which yields either str, unicode or file objects.
Returns: X : array, [n_samples, n_features]. Document-term matrix.

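This means each vector corresponds to one sentence of the input text (i.e. message_map.values()), in the same order. If you want to map each ID to its vector, you can do this (note that order is preserved, because message_map.keys() and message_map.values() iterate in the same order):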

>>> vector_map = dict(zip(message_map.keys(), vectors.tolist()))
>>> vector_map
{0: [1, 1, 0, 1, 1, 0, 1], 1: [0, 1, 1, 1, 1, 0, 1], 2: [0, 1, 0, 1, 1, 1, 1]}
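
Since the goal in the question is to compare messages for similarity, the same ID-to-row alignment can be carried over to a similarity matrix. The following is a minimal sketch, not the asker's actual setup: the names message_ids and similarity_by_id are made up for illustration, and cosine similarity is assumed as the metric only because the question imports it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

message_map = {
    0: 'This is the first sentence',
    1: 'This is the second sentence',
    2: 'This is the third sentence',
}

# Freeze the key order once so that matrix rows and message IDs stay aligned.
message_ids = list(message_map)
vectors = CountVectorizer().fit_transform([message_map[i] for i in message_ids])

# similarity[r, c] compares message_ids[r] with message_ids[c].
similarity = cosine_similarity(vectors)

# Re-key the matrix by (messageID, messageID) pairs.
# similarity_by_id is a hypothetical structure used only for this sketch.
similarity_by_id = {
    (a, b): float(similarity[r, c])
    for r, a in enumerate(message_ids)
    for c, b in enumerate(message_ids)
}
```

With the three example sentences, messages 0 and 1 share four of their five words, so similarity_by_id[(0, 1)] comes out at 0.8.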

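I believe what you are asking is how to build a corpus once and then convert new sentences into feature-count vectors. Note, however, that any new word that is not in the original corpus will be ignored, as this example shows: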

from sklearn.feature_extraction.text import CountVectorizer

corpus= ['This is the first sentence', 'This is the second sentence']
vectorizer = CountVectorizer() 
vectorizer.fit(corpus)

message_map = {0:'This is the first sentence', 1:'This is the second sentence', 2:'This is the third sentence'}

vector_map = { k: vectorizer.transform([v]).toarray().tolist()[0] for k, v in message_map.items()}

What you get is:

>>> vector_map
{0: [1, 1, 0, 1, 1, 1], 1: [0, 1, 1, 1, 1, 1], 2: [0, 1, 0, 1, 1, 1]}

Note that you now have one feature fewer than before, because the word third is no longer part of the features.

>>> vectorizer.get_feature_names()
['first', 'is', 'second', 'sentence', 'the', 'this']

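This can be a minor problem when computing the similarity between vectors, because you are discarding words that could help tell the vectors apart.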
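
One way around the dropped words, if the message set is small enough to reprocess, is to refit the vectorizer on all messages whenever a new one arrives. This is only a sketch under that assumption: build_vector_map is a hypothetical helper, not part of scikit-learn, and refitting changes the length of every existing vector, so previously stored vectors have to be replaced.

```python
from sklearn.feature_extraction.text import CountVectorizer


def build_vector_map(message_map):
    """Refit on all messages; return ({messageID: count vector}, feature list)."""
    ids = list(message_map)
    vectorizer = CountVectorizer()
    matrix = vectorizer.fit_transform([message_map[i] for i in ids]).toarray()
    # vocabulary_ maps each term to its column index; sort the terms by column.
    features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return dict(zip(ids, matrix.tolist())), features


message_map = {0: 'This is the first sentence', 1: 'This is the second sentence'}
vector_map, features = build_vector_map(message_map)  # 6 features

# A new message arrives: rebuild everything so that 'third' becomes a feature.
message_map[2] = 'This is the third sentence'
vector_map, features = build_vector_map(message_map)  # 7 features
```

After the second call, third is part of the feature list again and every vector has seven entries, matching the first example in this answer.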

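Alternatively, you could use an English dictionary, or a subset of one, as the corpus and fit the vectorizer on that. The resulting vectors will be much sparser, though, which can again cause problems when comparing vectors. How much that matters depends on the method you use to compute the distance between vectors.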
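
The fixed-dictionary idea can be sketched with CountVectorizer's vocabulary parameter, which pins the columns up front so that vectors keep the same length as messages are added. The word list below is a made-up stand-in for a real dictionary, chosen only to keep the output readable.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical fixed word list standing in for an English dictionary subset.
# Columns follow the order of this list.
vocabulary = ['first', 'fourth', 'is', 'second', 'sentence', 'the', 'third', 'this']

# With a fixed vocabulary there is nothing to learn, so transform() can be
# called without a prior fit().
vectorizer = CountVectorizer(vocabulary=vocabulary)

message_map = {
    0: 'This is the first sentence',
    1: 'This is the second sentence',
    2: 'This is the third sentence',
}

vector_map = {
    k: vectorizer.transform([v]).toarray().tolist()[0]
    for k, v in message_map.items()
}
```

Every vector here has eight entries, and the fourth column stays at zero for all three messages, which is the sparsity the paragraph above warns about.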

0 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/58052873
