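In text analysis (see "Quick Start to Text Data Mining") we often run into several surface forms that mean the same thing. For example, BigApple/NewYork/NN
can all refer to New York City, so to count how often New York City is mentioned we have to count each of the three forms separately and add the counts together.
flashtext handles this kind of problem well, and it is very fast. To gauge cleaning speed, we can compare it with regular expressions.
With regular expressions, cleaning time grows roughly linearly as the number of keywords increases, whereas flashtext's cleaning time stays almost constant.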
pip3 install flashtext
https://flashtext.readthedocs.io/en/latest/
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('Big Apple')
keyword_processor.add_keyword('Bay Area')
keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
keywords_found
Run
['Big Apple', 'Bay Area']
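Both big apple and new york refer to New York City, so we first clean the data by mapping every variant to a single canonical term, and then extract keywords. This is done by passing a second argument (the clean name) to the add_keyword method.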
from flashtext import KeywordProcessor
kw_processor = KeywordProcessor()
# Register the keywords to recognize; the optional second argument is the clean name to return
kw_processor.add_keyword('Big Apple', 'New York')
kw_processor.add_keyword('Bay Area')
# Extract keywords from the text
kws_found = kw_processor.extract_keywords('I love Big Apple and Bay Area.')
kws_found
Run
['New York', 'Bay Area']
If there are many synonyms, a dictionary can define the mapping. The method to use is add_keywords_from_dict.
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_dict = {"java": ["java_2e", "java programing"],
"product management": ["PM", "product manager"]}
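# Add the mapping from a dictionary: {clean name: [keywords]}
keyword_processor.add_keywords_from_dict(keyword_dict)
# Add keywords from a list; each keyword is its own clean name
keyword_processor.add_keywords_from_list(["java", "python"])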
keyword_processor.extract_keywords('I am a product manager for a java_2e and python platform')
Run
['product management', 'java', 'python']
Sometimes we add a keyword by mistake and want to remove it again. The methods for this are remove_keyword / remove_keywords_from_dict / remove_keywords_from_list.
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_dict = {"java": ["java_2e", "java programing"],
                "product management": ["PM", "product manager"]}
keyword_processor.add_keywords_from_dict(keyword_dict)
print(keyword_processor.extract_keywords('I am a product manager for a java_2e platform'))
keyword_processor.remove_keyword('java_2e')
keyword_processor.remove_keywords_from_dict({"product management": ["PM"]})
keyword_processor.remove_keywords_from_list(["java programing"])
keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
Run
['product management', 'java']
['product management']
Check how many keywords have been registered
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_dict = {"java": ["java_2e", "java programing"],
                "product management": ["PM", "product manager"]}
keyword_processor.add_keywords_from_dict(keyword_dict)
print(len(keyword_processor))
Run
4
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('j2ee', 'Java')
print('j2ee' in keyword_processor)
Run
True
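The order of the arguments passed to add_keyword() matters: the first is the keyword to match in the text and the second is the clean name it maps to, so get_keyword only finds the first.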
print(keyword_processor.get_keyword('j2ee'))
print(keyword_processor.get_keyword('Java'))
Run
Java
None
Keywords can also be set and looked up with dictionary-style indexing, which is straightforward.
keyword_processor['colour'] = 'color'
print(keyword_processor['colour'])
Run
color
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('j2ee', 'Java')
keyword_processor.add_keyword('colour', 'color')
keyword_processor.get_all_keywords()
Run
{'j2ee': 'Java', 'colour': 'color'}
from flashtext import KeywordProcessor
kw_processor2 = KeywordProcessor()
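# Register the keywords to recognize
# Note the argument order: the keyword to match first, then its replacement
kw_processor2.add_keyword('New Delhi', 'NCR region')
kw_processor2.add_keyword('Big Apple','New York')
# Replace keywords in the text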
new_sentence = kw_processor2.replace_keywords('I love Big Apple and new delhi.')
new_sentence
Run
'I love New York and NCR region.'
flashtext can also report the start and end index of each keyword it finds.
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.', span_info=True)
keywords_found
Run
[('New York', 7, 16), ('Bay Area', 21, 29)]
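Keywords can also carry extra information, such as a time or a location, by mapping each keyword to a tuple instead of a plain string. These features may not work as well for Chinese text, but they work fine for English.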
from flashtext import KeywordProcessor
kp = KeywordProcessor()
kp.add_keyword('Taj Mahal', ('Monument', 'Taj Mahal'))
kp.add_keyword('Delhi', ('Location', 'Delhi'))
kp.extract_keywords('Taj Mahal is in Delhi.')
Run
[('Monument', 'Taj Mahal'), ('Location', 'Delhi')]