I have a data file that looks like the one below, only larger:
import pandas as pd
data = {'First': ['First value','Third value','Second value','First value','Third value','Second value'],
'Second': ['the old man is here','the young girl is there', 'the old woman is here','the young boy is there','the young girl is here','the old girl is here']}
df = pd.DataFrame(data, columns=['First', 'Second'])
Based on the first column below, I calculated the similarity between every possible pair of groups (I got help with this part from other answers on Stack Overflow):
import nltk
from itertools import combinations

# function to calculate the similarity between each pair of documents
def similarity_measure(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    return float(len(intersection)) / len(union) * 100

# getting the tokenized text alongside the intents
data_similarity = df.groupby('First')['Second'].apply(lambda x: nltk.tokenize.word_tokenize(' '.join(x)))
data_similarity = data_similarity.reset_index()

# returning the similarity measures for each pair in the dataset
for val in list(combinations(range(len(data_similarity)), 2)):
    print(f"similarity between {data_similarity.iloc[val[0],0]} and {data_similarity.iloc[val[1],0]} intents is: "
          f"{similarity_measure(data_similarity.iloc[val[0],1], data_similarity.iloc[val[1],1])}")
As the output, what I want is the average across all pairs. So, for example, if the code above produced the following output:
similarity between first value and second value is 60
similarity between first value and third value is 50
similarity between second value and third value is 55
similarity between second value and first value is 60
similarity between third value and first value is 50
similarity between third value and second value is 55
I would like the average of first value against any combination, of second value against any combination, and of third value against any combination, like this:
first value average across all possible values is 55
second value average across all possible values is 57.5
third value average across all possible values is 52.5
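For illustration, a minimal sketch of the aggregation I am asking for, using the hypothetical pair values above (not computed from the real data):
from collections import defaultdict

# hypothetical pairwise similarities taken from the example output above
pairs = {
    ('first value', 'second value'): 60,
    ('first value', 'third value'): 50,
    ('second value', 'third value'): 55,
}

# collect, for each group, its similarity against every other group
per_group = defaultdict(list)
for (a, b), sim in pairs.items():
    per_group[a].append(sim)
    per_group[b].append(sim)

for group, sims in sorted(per_group.items()):
    print(f"{group} average across all possible values is {sum(sims) / len(sims)}")
# first value  -> (60 + 50) / 2 = 55.0
# second value -> (60 + 55) / 2 = 57.5
# third value  -> (50 + 55) / 2 = 52.5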
Answer (posted 2021-01-21 11:55:07):
Edit: based on your comment, here is what you can do. First build the data_similarity table, which combines the tokens from the different sentences of each group, then compute the similarity between every pair of groups and average it per group:
import nltk
from itertools import combinations, product
# function to calculate the similarity between each pair of documents
def similarity_measure(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    return float(len(intersection)) / len(union) * 100

# getting the tokenized text alongside the intents
data_similarity = df.groupby('First')['Second'].apply(lambda x: nltk.tokenize.word_tokenize(' '.join(x)))
data_similarity = data_similarity.reset_index()

# every ordered pair of distinct groups together with their similarity
all_pairs = [(i, l, similarity_measure(j, m)) for (i, j), (l, m) in
             product(zip(data_similarity['First'], data_similarity['Second']), repeat=2) if i != l]
pair_similarity = pd.DataFrame(all_pairs, columns=['A','B','Similarity'])
group_similarity = pair_similarity.groupby(['A'])['Similarity'].mean().reset_index()
print(group_similarity)
A Similarity
0 First value 47.777778
1 Second value 45.000000
2 Third value 52.777778
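A note on the design (my reading, not stated explicitly in the answer): product(..., repeat=2) with the i != l filter enumerates every unordered pair twice, once in each direction. Since the similarity measure is symmetric, this does not change the per-group mean, and it guarantees every group appears in column A. An equivalent sketch using combinations instead, assuming the data_similarity frame and similarity_measure function defined above:
rows = []
for (i, j), (l, m) in combinations(zip(data_similarity['First'], data_similarity['Second']), 2):
    sim = similarity_measure(j, m)
    rows.append((i, l, sim))
    rows.append((l, i, sim))  # similarity is symmetric, so reuse the same value in both directions

group_similarity = (pd.DataFrame(rows, columns=['A', 'B', 'Similarity'])
                    .groupby('A')['Similarity'].mean().reset_index())
print(group_similarity)  # should match the table above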
https://stackoverflow.com/questions/65825834