
Pandas: computing the average similarity across all categories

Stack Overflow user
Asked on 2021-01-21 10:52:14

I have a DataFrame like the one below, but much larger:

import pandas as pd

data = {'First':  ['First value','Third value','Second value','First value','Third value','Second value'],
        'Second': ['the old man is here','the young girl is there', 'the old woman is here','the  young boy is there','the young girl is here','the old girl is here']}

df = pd.DataFrame (data, columns = ['First','Second'])

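I computed the similarity between every possible pair of groups, grouping by the first column as shown below (with help from other Stack Overflow answers for this part):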

import nltk
from itertools import combinations

# Jaccard similarity (as a percentage) between two token lists
def similarity_measure(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)

    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)

    return float(len(intersection)) / len(union) * 100

# join all sentences of each 'First' group and tokenize them
# (word_tokenize requires NLTK's tokenizer data to be installed)
data_similarity = df.groupby('First')['Second'].apply(lambda x: nltk.tokenize.word_tokenize(' '.join(x)))
data_similarity = data_similarity.reset_index()

# print the similarity for each pair of groups in the dataset
for i, j in combinations(range(len(data_similarity)), 2):
    print(f"similarity between {data_similarity.iloc[i, 0]} and {data_similarity.iloc[j, 0]} intents is: {similarity_measure(data_similarity.iloc[i, 1], data_similarity.iloc[j, 1])}")

What I want as output is the average over all pairs. For example, if the code above produced the following output:

similarity between first value and second value is 60
similarity between first value and third value is 50 
similarity between second value and third value is 55
similarity between second value and first value is 60
similarity between third value and first value is 50
similarity between third value and second value is 55

then I would like the average for the first value over all of its pairings, and likewise for the second value and the third value:

first value average across all possible values is 55
second value average across all possible values is 57.5
third value average across all possible values is  52.5
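
To make the arithmetic concrete, here is a minimal sketch that reproduces the averages above from the example pair scores. The scores are copied from the illustrative output; the names `pairs`, `pair_df`, and `averages` are placeholders chosen for this example, not part of the original code.

```python
import pandas as pd

# Pairwise scores copied from the example output above.
# The measure is symmetric, so each unordered pair is listed once.
pairs = [
    ("first value", "second value", 60.0),
    ("first value", "third value", 50.0),
    ("second value", "third value", 55.0),
]

# Mirror each pair so every label appears in column "A" once per partner.
rows = pairs + [(b, a, s) for a, b, s in pairs]
pair_df = pd.DataFrame(rows, columns=["A", "B", "Similarity"])

# Average over all partners of each label.
averages = pair_df.groupby("A")["Similarity"].mean()
print(averages)
```

Each label's average is simply the mean of its scores against every other label, e.g. (60 + 50) / 2 = 55 for the first value.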

1 Answer

Stack Overflow user

Accepted answer

Posted on 2021-01-21 11:55:07

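Edit: based on your comment, here is what you can do.

  1. First compute the data_similarity table, which combines the tokens from the different sentences of each group.
  2. Compute the pairwise similarity tuples between the groups,
  3. Put them into a DataFrame, then group by the first label of each pair and take the mean.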

import nltk
import pandas as pd
from itertools import product

# Jaccard similarity (as a percentage) between two token lists
def similarity_measure(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)

    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)

    return float(len(intersection)) / len(union) * 100

# join all sentences of each 'First' group and tokenize them
data_similarity = df.groupby('First')['Second'].apply(lambda x: nltk.tokenize.word_tokenize(' '.join(x)))
data_similarity = data_similarity.reset_index()

# (label A, label B, similarity) for every ordered pair of distinct groups
all_pairs = [(i, l, similarity_measure(j, m)) for (i, j), (l, m) in
             product(zip(data_similarity['First'], data_similarity['Second']), repeat=2) if i != l]

# average each group's similarity over all of its partners
pair_similarity = pd.DataFrame(all_pairs, columns=['A', 'B', 'Similarity'])
group_similarity = pair_similarity.groupby(['A'])['Similarity'].mean().reset_index()
print(group_similarity)
              A  Similarity
0   First value   47.777778
1  Second value   45.000000
2   Third value   52.777778
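
If NLTK or its tokenizer data is not available, the same steps can be sketched with plain whitespace splitting. This is only an illustration under that assumption: `str.split` stands in for `nltk.tokenize.word_tokenize`, and `jaccard_percent` and `tokens_by_group` are hypothetical names introduced here. It also scores each unordered pair once with `combinations` and mirrors the result, since the measure is symmetric.

```python
from itertools import combinations

import pandas as pd

data = {
    "First": ["First value", "Third value", "Second value",
              "First value", "Third value", "Second value"],
    "Second": ["the old man is here", "the young girl is there",
               "the old woman is here", "the  young boy is there",
               "the young girl is here", "the old girl is here"],
}
df = pd.DataFrame(data)


def jaccard_percent(tokens_a, tokens_b):
    """Jaccard similarity of two token collections, as a percentage."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:  # both inputs empty: avoid division by zero
        return 0.0
    return len(set_a & set_b) / len(union) * 100


# Assumption: whitespace splitting replaces nltk.tokenize.word_tokenize.
tokens_by_group = df.groupby("First")["Second"].apply(
    lambda s: " ".join(s).split()
)

rows = []
for (a, tok_a), (b, tok_b) in combinations(tokens_by_group.items(), 2):
    score = jaccard_percent(tok_a, tok_b)
    rows.append((a, b, score))  # pair as seen from a
    rows.append((b, a, score))  # same score as seen from b

group_similarity = (
    pd.DataFrame(rows, columns=["A", "B", "Similarity"])
    .groupby("A")["Similarity"]
    .mean()
)
print(group_similarity)
```

On this sample data the sentences contain no punctuation, so whitespace splitting yields the same tokens and the averages match the output above; on real text the two tokenizers can differ.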
Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/65825834
