Q: Classifying words in a paragraph into groups and assigning them weights based on the order listed

Stack Overflow user

Asked 2017-03-01 16:39:59

Answers: 1 · Views: 224 · Followers: 0 · Votes: 0

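I have a paragraph that contains the names of up to 18 different industries. The names are separated by semicolons. The order in which they occur also matters for determining their magnitude, so it must be assigned to each name as a weight. The list can be divided into three broad categories:

1. Industries reporting growth.
2. Industries reporting contraction.
3. Industries reporting no change.

Of the 18 manufacturing industries, 12 reported growth in January in the following order: Plastics & Rubber Products; Miscellaneous Manufacturing; Apparel, Leather & Allied Products; Paper Products; Chemical Products; Transportation Equipment; Food, Beverage & Tobacco Products; Machinery; Petroleum & Coal Products; Primary Metals; Fabricated Metal Products; and Computer & Electronic Products. The five industries reporting contraction in January are: Nonmetallic Mineral Products; Wood Products; Furniture & Related Products; Electrical Equipment, Appliances & Components; and Printing & Related Support Activities.

The paragraph above is a sample. What is the best way to split the text into the 3 categories (2 in this example) and assign values according to the order listed? There is a pattern in the text: each list of names starts after a ":" and ends with a ".". Sometimes the industries reporting contraction are listed first, followed by those reporting growth. How can I handle that while automating the process?

The values assigned depend on the number of industries in each category. Industries reporting growth get positive values that count down to 1, industries reporting no change get 0 as the default value, and industries reporting contraction get negative values that run to -1. The categories are then put together and sorted in decreasing order to give a single list (+ve, 0, -ve). I am still at an early stage of learning to program, so please bear with me. Even a suggestion for a strategy would help me a great deal.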
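The scoring rule described above can be sketched on its own, before any text parsing is involved. This is a minimal sketch, not a definitive implementation: the function name `assign_weights` is hypothetical, "Textile Mills" is a made-up no-change entry used only for illustration, and giving the first-listed contracting industry the most negative value is an assumption about the intended rule.

```python
def assign_weights(growth, no_change, contraction):
    """Score industries by category and by position within each category.

    Assumption: the first growth industry gets the largest positive value,
    and the first contraction industry gets the most negative value.
    """
    n_growth, n_contraction = len(growth), len(contraction)
    scored = [(name, n_growth - i) for i, name in enumerate(growth)]
    scored += [(name, 0) for name in no_change]
    scored += [(name, -(n_contraction - i)) for i, name in enumerate(contraction)]
    # Sort from most positive to most negative. sorted() is stable, so
    # entries with equal values keep the order in which they were listed.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = assign_weights(
    ["Plastics & Rubber Products", "Machinery"],
    ["Textile Mills"],  # hypothetical no-change industry
    ["Nonmetallic Mineral Products", "Wood Products"],
)
print(ranked)
```

With these inputs the growth entries score 2 and 1, the no-change entry scores 0, and the contraction entries score -2 and -1 before sorting.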

python

string

python-2.7

text-mining

1 Answer

Stack Overflow user

Accepted answer

Posted 2017-03-01 18:23:40

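The code below works for the example you gave, but I can't guarantee it will work on every example (especially since you didn't give an example containing any "no change" industries). The main idea is to use regular expressions (import re) to look specifically for the terms "growth", "no change" and "contraction", and then pull out the list of companies that follows each one. Next, each of the three categories goes through a list comprehension that attaches the relevant score, so that every list entry becomes a (company, value) tuple. Finally, the three categories are combined into one list, sorted by value (index 1 of each tuple), and printed. Note that this will not work if the exact word "growth" is not used, for instance if a different form of the word appears in its place.

Code: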

import re

sample = 'Of the 18 manufacturing industries, 12 reported growth in January in the following order: Plastics & Rubber Products; Miscellaneous Manufacturing; Apparel, Leather & Allied Products; Paper Products; Chemical Products; Transportation Equipment; Food, Beverage & Tobacco Products; Machinery; Petroleum & Coal Products; Primary Metals; Fabricated Metal Products; and Computer & Electronic Products. The five industries reporting contraction in January are: Nonmetallic Mineral Products; Wood Products; Furniture & Related Products; Electrical Equipment, Appliances & Components; and Printing & Related Support Activities.'

#Find the growth industries
growth_pattern = r'growth.*?:(.*?)\.'
growths = re.findall(growth_pattern,sample)
growths = growths[0].strip().split(';') if len(growths) == 1 else []

#Find the no change industries
nochange_pattern = r'no change.*?:(.*?)\.'
nochanges = re.findall(nochange_pattern,sample)
nochanges = nochanges[0].strip().split(';') if len(nochanges) == 1 else []

#Find the contraction industries
contraction_pattern = r'contraction.*?:(.*?)\.'
contractions = re.findall(contraction_pattern,sample)
contractions = contractions[0].strip().split(';') if len(contractions) == 1 else []

#Give numbers to each of the industries
#Strip only a leading 'and ' so names that merely contain 'and ' are left intact
growths = [(re.sub(r'^and\s+','',g.strip()),len(growths)-i) for i,g in enumerate(growths)]
nochanges = [(re.sub(r'^and\s+','',nc.strip()),0) for nc in nochanges]
contractions = [(re.sub(r'^and\s+','',c.strip()),-(len(contractions)-i)) for i,c in enumerate(contractions)]

#Print them out to check (commented out for now)
#print('growths:'+str(growths))
#print('nochanges:'+str(nochanges))
#print('contractions:'+str(contractions))

#Combine them all together, sort by value, and print out
all_together = growths+nochanges+contractions
all_together = sorted(all_together,key=lambda x: -x[1])
print(all_together)

Output:

[('Plastics & Rubber Products', 12), ('Miscellaneous Manufacturing', 11), ('Apparel, Leather & Allied Products', 10), ('Paper Products', 9), ('Chemical Products', 8), ('Transportation Equipment', 7), ('Food, Beverage & Tobacco Products', 6), ('Machinery', 5), ('Petroleum & Coal Products', 4), ('Primary Metals', 3), ('Fabricated Metal Products', 2), ('Computer & Electronic Products', 1), ('Printing & Related Support Activities', -1), ('Electrical Equipment, Appliances & Components', -2), ('Furniture & Related Products', -3), ('Wood Products', -4), ('Nonmetallic Mineral Products', -5)]
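The question also asks about paragraphs where the contraction list comes before the growth list, and the answer notes that an inexact keyword breaks the match. Below is a hedged sketch of one way to loosen both constraints. The keyword table, the function name `parse_industries`, and the reversed sample sentence are all illustrative assumptions, not part of the original answer. Each category is searched for independently, so the order of the sentences does not matter.

```python
import re

# Hypothetical keyword table: each category maps to a regex fragment that
# also accepts simple variants such as "growing" or "contracted".
CATEGORY_KEYWORDS = [
    ("growth", r"grow\w*"),
    ("no change", r"no change|unchanged"),
    ("contraction", r"contract\w*"),
]


def parse_industries(text):
    """Return {category: [industry names in the order listed]}."""
    result = {}
    for category, keyword in CATEGORY_KEYWORDS:
        # Keyword, then the rest of that sentence up to its colon, then the
        # list itself up to a period followed by whitespace or end of text.
        pattern = r"\b(?:%s)[^:.]*:(.*?)\.(?:\s|$)" % keyword
        match = re.search(pattern, text, flags=re.IGNORECASE | re.DOTALL)
        names = []
        if match:
            for item in match.group(1).split(";"):
                # Drop only a leading "and " from the final list item.
                name = re.sub(r"^and\s+", "", item.strip())
                if name:
                    names.append(name)
        result[category] = names
    return result


# Made-up paragraph with the contraction list first (illustration only).
reversed_text = (
    "The two industries reporting contraction in January are: Wood Products; "
    "and Furniture & Related Products. Of the 18 manufacturing industries, "
    "three reported growth in the following order: Machinery; Paper Products; "
    "and Primary Metals."
)
groups = parse_industries(reversed_text)
print(groups["growth"])
print(groups["contraction"])
```

The sketch still assumes that no industry name contains a period followed by a space; a name like that would end the list early.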

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/42537246
