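I have a paragraph containing the names of up to 18 different industries. The names are separated by semicolons. The order in which they occur also matters for determining their magnitude, so it has to be assigned to each name as a weight. The list can be divided into three broad categories:
Of the 18 manufacturing industries, 12 reported growth in January in the following order: Plastics & Rubber Products; Miscellaneous Manufacturing; Apparel, Leather & Allied Products; Paper Products; Chemical Products; Transportation Equipment; Food, Beverage & Tobacco Products; Machinery; Petroleum & Coal Products; Primary Metals; Fabricated Metal Products; and Computer & Electronic Products. The five industries reporting contraction in January are: Nonmetallic Mineral Products; Wood Products; Furniture & Related Products; Electrical Equipment, Appliances & Components; and Printing & Related Support Activities.
The paragraph above is a sample. What is the best way to split the text into 3 categories (2 in this example) and assign values according to the listed order? There is a pattern in the text: the names start after ":" and end with ".". Sometimes the industries reporting contraction are listed first, followed by those reporting growth. How can I handle this while automating the process?
The values assigned depend on the count of industries in each category. Industries reporting growth get positive values descending to 1, industries with no change get 0 as the default value, and contracting industries get negative values ending at -1. The categories are then combined and sorted in descending order to give one list (+ve, 0, -ve). I am still at an early stage of learning to program, so please bear with me. Even a suggested strategy for the problem would help me a great deal.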
Posted on 2017-03-01 18:23:40
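The code below works for the example you gave, but I can't guarantee it will work on all examples (especially since you didn't give one with any "no change" industries). The main idea is to use regular expressions (import re) to look specifically for the terms "growth", "no change", and "contraction", and then pull out the list of names for each. Next, each of the three categories goes through a list comprehension to get the associated score, so that each list entry becomes a tuple of (company, value). Finally, the three categories are combined into one list, sorted by value (index 1 of each tuple), and printed. Note that this won't work if the exact word "growth" is not used, for example if a different form of the word appears in its place.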
Code:
import re
sample = 'Of the 18 manufacturing industries, 12 reported growth in January in the following order: Plastics & Rubber Products; Miscellaneous Manufacturing; Apparel, Leather & Allied Products; Paper Products; Chemical Products; Transportation Equipment; Food, Beverage & Tobacco Products; Machinery; Petroleum & Coal Products; Primary Metals; Fabricated Metal Products; and Computer & Electronic Products. The five industries reporting contraction in January are: Nonmetallic Mineral Products; Wood Products; Furniture & Related Products; Electrical Equipment, Appliances & Components; and Printing & Related Support Activities.'
#Find the growth industries
growth_pattern = r'growth.*?:(.*?)\.'
growths = re.findall(growth_pattern,sample)
growths = growths[0].strip().split(';') if len(growths) == 1 else []
#Find the no change industries
nochange_pattern = r'no change.*?:(.*?)\.'
nochanges = re.findall(nochange_pattern,sample)
nochanges = nochanges[0].strip().split(';') if len(nochanges) == 1 else []
#Find the contraction industries
contraction_pattern = r'contraction.*?:(.*?)\.'
contractions = re.findall(contraction_pattern,sample)
contractions = contractions[0].strip().split(';') if len(contractions) == 1 else []
#Give numbers to each of the industries
growths = [(g.strip().replace('and ',''),len(growths)-i) for i,g in enumerate(growths)]
nochanges = [(nc.strip().replace('and ',''),0) for i,nc in enumerate(nochanges)]
contractions = [(c.strip().replace('and ',''),-(len(contractions)-i)) for i,c in enumerate(contractions)]
#Print them out to check (commented out for now)
#print('growths:'+str(growths))
#print('nochanges:'+str(nochanges))
#print('contractions:'+str(contractions))
#Combine them all together, sort by value, and print out
all_together = growths+nochanges+contractions
all_together = sorted(all_together,key=lambda x: -x[1])
print(all_together)
Output:
[('Plastics & Rubber Products', 12), ('Miscellaneous Manufacturing', 11), ('Apparel, Leather & Allied Products', 10), ('Paper Products', 9), ('Chemical Products', 8), ('Transportation Equipment', 7), ('Food, Beverage & Tobacco Products', 6), ('Machinery', 5), ('Petroleum & Coal Products', 4), ('Primary Metals', 3), ('Fabricated Metal Products', 2), ('Computer & Electronic Products', 1), ('Printing & Related Support Activities', -1), ('Electrical Equipment, Appliances & Components', -2), ('Furniture & Related Products', -3), ('Wood Products', -4), ('Nonmetallic Mineral Products', -5)]
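On the question of the contraction list sometimes coming first: each pattern searches for its own keyword independently, so the order of the sentences should not matter. Here is a minimal sketch of the same idea wrapped in a function and run on a reversed text. The names rank_industries and extract and the short reversed sample are made up for illustration. It differs from the code above in two ways: it takes the first match for each keyword instead of requiring exactly one, and it strips only a leading "and " so that names containing "and" are left intact.

```python
import re

def rank_industries(text):
    """Return (name, score) pairs sorted from highest to lowest score."""
    def extract(keyword):
        # Capture everything between the first ':' after the keyword and the next '.'
        match = re.search(keyword + r'.*?:(.*?)\.', text,
                          flags=re.IGNORECASE | re.DOTALL)
        if match is None:
            return []
        names = [n.strip() for n in match.group(1).split(';')]
        # Remove only a leading "and " from each non-empty name
        return [re.sub(r'^and\s+', '', n) for n in names if n]

    growths = extract('growth')
    nochanges = extract('no change')
    contractions = extract('contraction')

    # Same scoring as above: growth counts down to 1, no change is 0,
    # contraction counts from -len(contractions) to -1
    scored = [(g, len(growths) - i) for i, g in enumerate(growths)]
    scored += [(n, 0) for n in nochanges]
    scored += [(c, -(len(contractions) - i)) for i, c in enumerate(contractions)]
    return sorted(scored, key=lambda pair: -pair[1])

# Hypothetical text (not from the original report) with contraction listed first
reversed_sample = ('The two industries reporting contraction in January are: '
                   'Wood Products; and Furniture & Related Products. '
                   'Three industries reported growth in the following order: '
                   'Machinery; Paper Products; and Primary Metals.')
result = rank_industries(reversed_sample)
print(result)
```

This still relies on the exact keywords and on each list ending at the first period, so an industry name that itself contains a period would cut the list short.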
https://stackoverflow.com/questions/42537246