
Q: How do I get the best merge of SymSpell word segmentations from many languages in Python?

Stack Overflow user

Asked 2021-12-31 17:40:22

Answers: 1, Views: 761, Followers: 0, Votes: 0

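The following code uses SymSpell in Python; see the symspellpy documentation on word segmentation.

It uses the "de-100k.txt" and "en-80k.txt" frequency dictionaries from a GitHub repo, which you need to save in your working directory. If you do not want to use any SymSpell logic, you do not need to install and run this script to answer the question: just take the word segmentation output for the two languages and continue from there.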

import pkg_resources
from symspellpy.symspellpy import SymSpell

input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme"

# German:
# Set max_dictionary_edit_distance to 0 to avoid spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "de-100k.txt"
)
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")

# English:
# Reset the sym_spell object
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "en-80k.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")

Out:

sonnen empfindlichkeit s uno i l farb palette sun creme, 8, -61.741842760725255
sonnen empfindlichkeit sun oil farb palette sun creme, 6, -45.923471400632884

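The aim is to find the most relevant words by some logic: most frequent noun neighbors and/or word frequency, longest word, and so on. The choice of logic is up to you.

In this example with two languages, the two outputs need to be compared so that only the best segments are kept and the rest are dropped, without cutting through words. In the result, every letter is used once and only once.

If there are spaces between words in input_term, those words must not be joined into a new segment. For example, if your input contains 'cr eme' with a wrong space, it still must not be allowed to become 'creme'. A space in the input is more likely to be right than the errors that arise from joining adjacent letters.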
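One way to enforce the space rule is to segment every space-separated chunk on its own, so that no word can ever span an original space. The sketch below is only an illustration: `segment_keeping_spaces`, `toy_segment`, and `VOCAB` are hypothetical names, and the greedy longest-match segmenter is a stand-in for whatever segmenter is actually used.

```python
from typing import Callable, List

# Made-up vocabulary for the demo; a real one would come from a frequency dictionary.
VOCAB = {"farb", "palette", "sun", "creme", "cr", "eme"}


def toy_segment(chunk: str) -> List[str]:
    """Stand-in segmenter: greedy longest match, single letters as fallback."""
    words, start = [], 0
    while start < len(chunk):
        for end in range(len(chunk), start, -1):
            if chunk[start:end] in VOCAB or end == start + 1:
                words.append(chunk[start:end])
                start = end
                break
    return words


def segment_keeping_spaces(
    text: str, segment: Callable[[str], List[str]]
) -> List[List[str]]:
    """Segment each space-separated chunk separately; chunks are never joined."""
    return [segment(chunk) for chunk in text.split()]


print(segment_keeping_spaces("farbpalettesuncreme", toy_segment))
print(segment_keeping_spaces("cr eme", toy_segment))
```

Because 'cr' and 'eme' are handed to the segmenter separately, 'creme' cannot appear in the second result even though it is in the vocabulary.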

array('sonnen', 'empfindlichkeit', 'sun', 'oil', 'farb', 'palette', 'sun', 'creme')
array(['DE'], ['DE'], ['EN'], ['EN'], ['DE'], ['DE', 'EN'], ['EN'], ['DE', 'EN'])

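The 'DE/EN' tag is just an optional idea to show that a word exists in both German and English; in this example you could also choose 'EN' instead of 'DE'. The language tags are a bonus, and you can answer without them.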
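As a rough illustration of how such tags could be produced once the words are chosen, the sketch below looks each word up in one vocabulary set per language. The vocabularies are tiny and made up for this example; `VOCABS` and `tag_languages` are hypothetical names.

```python
from typing import Dict, List, Set

# Made-up vocabularies; real ones would be read from the frequency dictionaries.
VOCABS: Dict[str, Set[str]] = {
    "DE": {"sonnen", "empfindlichkeit", "farb", "palette", "creme"},
    "EN": {"sun", "oil", "palette", "creme"},
}


def tag_languages(words: List[str], vocabs: Dict[str, Set[str]]) -> List[List[str]]:
    """For every word, list the languages whose vocabulary contains it."""
    return [
        [lang for lang, vocab in vocabs.items() if word in vocab]
        for word in words
    ]


words = ["sonnen", "empfindlichkeit", "sun", "oil", "farb", "palette", "sun", "creme"]
for word, langs in zip(words, tag_languages(words, VOCABS)):
    print(word, langs)
```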

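There is probably a fast solution that uses numpy arrays and/or dictionaries instead of lists or DataFrames, but choose whatever you like.

How can I use many languages in SymSpell word segmentation and combine them into one chosen merge? The aim is a sentence of words built from all the letters, using each letter once and keeping all original spaces.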

nlp

text-segmentation

symspell

python

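1 Answer

Stack Overflow user

Accepted answer

Answered 2021-12-31 23:03:23

The SymSpell way

This is the recommended way. I only found it after I had already done it the manual way. You can easily apply the frequency logic that is used for one language to two languages: just load two or more languages into the same sym_spell object!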

import pkg_resources
from symspellpy.symspellpy import SymSpell

input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme"

# Set max_dictionary_edit_distance to 0 to avoid spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "de-100k.txt"
)

# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")

# DO NOT reset the sym_spell object at this line so that
# English is added to the German frequency dictionary
# NOT: #reset the sym_spell object
# NOT: #sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "en-80k.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")

Out:

sonnen empfindlichkeit s uno i l farb palette sun creme, 8, -61.741842760725255
sonnen empfindlichkeit sun oil farb palette sun creme, 6, -45.923471400632884
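Why a combined dictionary helps can be shown without SymSpell. The sketch below is not SymSpell's algorithm; it is a plain unigram dynamic-programming segmenter, and all counts are made up for illustration (they are not taken from de-100k.txt or en-80k.txt). Summing the two toy dictionaries with `Counter` addition is an assumption of this sketch.

```python
import math
from collections import Counter
from typing import List

# Made-up counts, for illustration only.
DE = Counter({"sonnen": 50, "empfindlichkeit": 40, "s": 5, "uno": 5, "i": 5,
              "l": 5, "farb": 30, "palette": 20, "sun": 5, "creme": 20})
EN = Counter({"sun": 60, "oil": 60, "palette": 10, "creme": 5})


def segment(text: str, counts: Counter) -> List[str]:
    """Most probable split of text into known words ([] if no split exists)."""
    total = sum(counts.values())
    max_len = max(map(len, counts))
    # best[i] holds (log probability, words) for the best split of text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in counts and best[start][0] > -math.inf:
                score = best[start][0] + math.log(counts[word] / total)
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    return best[len(text)][1]


def segment_phrase(text: str, counts: Counter) -> str:
    """Segment each space-separated chunk on its own and rejoin with spaces."""
    return " ".join(" ".join(segment(chunk, counts)) for chunk in text.split())


input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme"
print(segment_phrase(input_term, DE))       # German only
print(segment_phrase(input_term, DE + EN))  # German and English combined
```

With these toy counts the German-only run can only cover "sunoil" as "s uno i l", while the combined dictionary also knows "oil" and therefore prefers "sun oil".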

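The manual way

In this manual approach, the logic is: the longer word of the two languages wins, and the winner's language tag is logged. If both words have the same length, both languages are logged.

As in the question, input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme" was segmented with a freshly reset SymSpell object for each language, which gave s1 for German and s2 for English.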

import numpy as np

s1 = 'sonnen empfindlichkeit s uno i l farb palette sun creme'
s2 = 'son ne ne mp find li ch k e it sun oil far b palette sun creme'

num_letters = len(s1.replace(' ',''))
list_w1 = s1.split()
list_w2 = s2.split()
list_w1_len = [len(x) for x in list_w1]
list_w2_len = [len(x) for x in list_w2]

# Each tuple: (word, length, word index, language,
#              start offset counting spaces, start offset ignoring spaces)
lst_de = [
    (x[0], x[1], x[2], 'de', x[3], x[4])
    for x in zip(
        list_w1, list_w1_len, range(len(list_w1)),
        np.cumsum([0] + [len(x)+1 for x in list_w1][:-1]),
        np.cumsum([0] + [len(x) for x in list_w1][:-1]),
    )
]
lst_en = [
    (x[0], x[1], x[2], 'en', x[3], x[4])
    for x in zip(
        list_w2, list_w2_len, range(len(list_w2)),
        np.cumsum([0] + [len(x)+1 for x in list_w2][:-1]),
        np.cumsum([0] + [len(x) for x in list_w2][:-1]),
    )
]

idx_word_de = 0
idx_word_en = 0
lst_words = []
idx_letter = 0

# stop at num_letters-1, else you check the last word
# also on the last idx_letter and get it twice
while idx_letter <= num_letters-1:
#     print(lst_de[idx_word_de][5], idx_letter)
    while(lst_de[idx_word_de][5]<idx_letter):
        idx_word_de +=1
    while(lst_en[idx_word_en][5]<idx_letter):
        idx_word_en +=1

    if lst_de[idx_word_de][1]>lst_en[idx_word_en][1]:
        lst_word_stats = lst_de[idx_word_de]
        str_word = lst_word_stats[0]
#         print('de:', lst_de[idx_word_de])
        idx_letter += len(str_word) #lst_de[idx_word_de][0])
    elif lst_de[idx_word_de][1]==lst_en[idx_word_en][1]:
        lst_word_stats = (lst_de[idx_word_de][0], lst_de[idx_word_de][1], (lst_de[idx_word_de][2], lst_en[idx_word_en][2]), (lst_de[idx_word_de][3], lst_en[idx_word_en][3]), (lst_de[idx_word_de][4], lst_en[idx_word_en][4]), lst_de[idx_word_de][5])
        str_word = lst_word_stats[0]
#         print('de:', lst_de[idx_word_de], 'en:', lst_en[idx_word_en])
        idx_letter += len(str_word) #lst_de[idx_word_de][0])        
    else:
        lst_word_stats = lst_en[idx_word_en]
        str_word = lst_word_stats[0]
#         print('en:', lst_en[idx_word_en][0])
        idx_letter += len(str_word)
    lst_words.append(lst_word_stats)

Out (lst_words):

[('sonnen', 6, 0, 'de', 0, 0),
 ('empfindlichkeit', 15, 1, 'de', 7, 6),
 ('sun', 3, 10, 'en', 31, 21),
 ('oil', 3, 11, 'en', 35, 24),
 ('farb', 4, 6, 'de', 33, 27),
 ('palette', 7, (7, 14), ('de', 'en'), (38, 45), 31),
 ('sun', 3, (8, 15), ('de', 'en'), (46, 53), 38),
 ('creme', 5, (9, 16), ('de', 'en'), (50, 57), 41)]

Legend of the output:

chosen word | len | word_idx_of_lang | lang | letter_idx_lang_with_spaces | letter_idx_no_spaces
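The longest-word-wins rule can also be sketched more compactly by indexing every word by its start offset in the space-free input. This is a variant, not the code above: it only compares words that start at exactly the current position, it keeps fewer fields per word, and it ignores the original spaces just as s1 and s2 do. `Pick`, `starts`, and `merge_longest` are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Tuple


class Pick(NamedTuple):
    word: str
    langs: Tuple[str, ...]
    offset: int  # start offset in the input with spaces removed


def starts(segmentation: str) -> Dict[int, str]:
    """Map each word's start offset (spaces ignored) to the word."""
    table, offset = {}, 0
    for word in segmentation.split():
        table[offset] = word
        offset += len(word)
    return table


def merge_longest(segmentations: Dict[str, str]) -> List[Pick]:
    """At each position keep the longest word; on a tie log every language."""
    tables = {lang: starts(seg) for lang, seg in segmentations.items()}
    total = len("".join(next(iter(segmentations.values())).split()))
    picks, pos = [], 0
    while pos < total:
        # Words of each language that start exactly at the current position.
        candidates = {lang: t[pos] for lang, t in tables.items() if pos in t}
        longest = max(len(word) for word in candidates.values())
        langs = tuple(l for l, word in candidates.items() if len(word) == longest)
        picks.append(Pick(candidates[langs[0]], langs, pos))
        pos += longest
    return picks


s1 = 'sonnen empfindlichkeit s uno i l farb palette sun creme'
s2 = 'son ne ne mp find li ch k e it sun oil far b palette sun creme'
picks = merge_longest({'de': s1, 'en': s2})
for pick in picks:
    print(pick)
```

The sketch assumes that all segmentations cover the same letters; the language of the last chosen word always has a word starting at the next position, so the candidate set is never empty.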

Votes: 1

Original page content provided by Stack Overflow. Translation supported by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/70544499
