I am counting the companies in a JSON file that contains not only exact duplicates but also similar entries, such as:
FooTechDepartment
FooFinaceDepartment
BarCompany
BarCompanySalesDepartment
I first use set() to remove the exact duplicates:
import json

with open(json_file_name) as f_in:
    companies_raw_data = json.load(f_in)
# remove exact duplicate companies
companies = set(companies_raw_data)
companies = sorted(list(companies))
The resulting companies:
In [212]: len(companies)
Out[212]: 472
In [227]: companies[40:50]
Out[227]:
['SpeedyCloud研发中心',
'SpeedyCloud研发部',
'The ONE',
'The ONE产品研发',
'The ONE产品研发部',
'TransferEasy',
'VIPKID',
'Weego Travel技术',
'ZingFront智线',
'ZingFront智线技术部']
My idea is to process the names by length, from 1 up to max_length.
0. Set up a counter to collect the companies
counter = {}
1. Find the one_letter_companies and remove them
In[228]: one_letter_companies = [c for c in companies if len(c) == 1]
In[229]: len(one_letter_companies)
Out[229]: 0
2. Find the two_letters_companies and remove them after collecting them
In[230]: two_letters_companies = [c for c in companies if len(c) == 2]
In[231]: len(two_letters_companies)
Out[231]: 16
Add them to the counter, together with every name that starts with them:
In[238]: for company in two_letters_companies:
    ...:     value = [c for c in companies if c.startswith(company)]
    ...:     counter[company] = value
    ...:     for v in value:
    ...:         companies.remove(v)
It shows:
In[239]: counter
Out[239]:
{'互拍': ['互拍'],
'博飞': ['博飞'],
'城宿': ['城宿'],
'小米': ['小米', '小米小米安全', '小米小米电视'],
'币信': ['币信', '币信开发部'],
'库神': ['库神', '库神技术部'],
'微创': ['微创', '微创ITO', '微创ITO事业部', '微创微创赴微软', '微创赴微软小冰'],
'掌控': ['掌控', '掌控移动研发'],
'汇游': ['汇游'],
'百度': ['百度', '百度百度度秘事业部', '百度视频'],
'知乎': ['知乎', '知乎商业广告事业部', '知乎工程效率组', '知乎知识市场', '知乎社区平台部'],
'知藏': ['知藏'],
'纽曼': ['纽曼'],
'维朗': ['维朗'],
'艺恩': ['艺恩'],
'贝壳': ['贝壳']}
The complete code:
counter = {}
while companies:
    # separate the one_letter_companies
    one_letter_companies = [c for c in companies if len(c) == 1]
    if one_letter_companies:
        counter["one_letter_companies"] = one_letter_companies
        for c in one_letter_companies:
            companies.remove(c)
    # handle the companies whose names have more than 1 letter
    # find the max_length
    max_len = max([len(c) for c in companies]) + 1
    for i in range(2, max_len):
        n_letters_companies = [c for c in companies if len(c) == i]
        if n_letters_companies:
            for company in n_letters_companies:
                value = [c for c in companies if c.startswith(company)]
                counter[company] = value
                # delete the found companies from the companies list
                for v in value:
                    companies.remove(v)
It outputs:
In [259]: len(counter)
Out[259]: 391 #vs 472 in the set()
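I am learning algorithms and am also eager to dig deeper into Python.
Could you give any hints on solving this problem with a suitable algorithm or Python library?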
Answered on 2018-08-24 12:37:15
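This is how I would solve it. The key part is sorting the list by length, so that a department always comes after its company and every company is already in the results before any of its departments is encountered. For each name, we then look through the companies found so far to see whether the name starts with one of them. If it does, it is a department and we append it to that company's entry; if it does not, it is a company and we add it as a new key.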
import json

with open(json_file_name) as f_in:
    companies_raw_data = json.load(f_in)
# shortest names first, so a company precedes its departments
companies = sorted(set(companies_raw_data), key=len)
results = {}
for company in companies:
    for key in results:
        if company.startswith(key):  # is a department
            results[key].append(company)
            break
    else:  # no break -- is not a department
        results[company] = []
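To make the behaviour concrete, here is a minimal, self-contained sketch of the same loop wrapped in a function and run on a few names taken from the question. The function name `group_departments` and the `sample` list are only illustrative and do not appear in the original code; the JSON loading is left out so the block runs on its own.

```python
def group_departments(names):
    """Group each name under the already-seen company it starts with."""
    # Shortest names first, so a company is seen before its departments.
    ordered = sorted(set(names), key=len)
    results = {}
    for name in ordered:
        for key in results:
            if name.startswith(key):  # name extends a known company
                results[key].append(name)
                break
        else:  # no break: no known company is a prefix, so add a new key
            results[name] = []
    return results


# Illustrative sample built from names shown in the question
# (one exact duplicate included on purpose).
sample = [
    "The ONE", "The ONE产品研发", "The ONE产品研发部",
    "ZingFront智线", "ZingFront智线技术部", "VIPKID", "The ONE",
]
grouped = group_departments(sample)
print(grouped)
print(len(grouped))  # number of distinct companies
```

Note that, unlike the `counter` built in the question, each list here holds only the departments and not the company name itself, so `len(results)` counts companies while the lists count their departments.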
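It would probably be more efficient, though less obvious, to replace the inner loop over results with dictionary lookups on the prefixes of each name: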
results = {}
for company in companies:
    for i in range(len(company) - 1, 0, -1):
        key = company[:i]  # substring
        if key in results:
            results[key].append(company)
            break
    else:  # no break -- is not a department
        results[company] = []
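The reason this variant may scale better is that the work per name depends on the length of the name rather than on how many companies have been found so far: it does at most `len(company) - 1` dictionary lookups instead of one `startswith` call per existing key. The sketch below wraps the prefix-lookup loop in a function and runs it on names from the question; `group_by_prefix_lookup` and `sample` are hypothetical names chosen for this example.

```python
def group_by_prefix_lookup(names):
    """Group names by looking up each proper prefix as a dictionary key."""
    # Shortest names first, so a company is a key before its departments arrive.
    ordered = sorted(set(names), key=len)
    results = {}
    for name in ordered:
        # Try every proper prefix of the name, longest first.
        for i in range(len(name) - 1, 0, -1):
            key = name[:i]
            if key in results:  # average O(1) dict lookup
                results[key].append(name)
                break
        else:  # no break: no prefix is a known company
            results[name] = []
    return results


# Illustrative sample built from names shown in the question.
sample = ["小米", "小米小米安全", "小米小米电视", "币信", "币信开发部", "贝壳"]
by_prefix = group_by_prefix_lookup(sample)
for company, departments in by_prefix.items():
    print(company, sorted(departments))
```

Departments of equal length may be appended in either order, because iteration over a set has no guaranteed order, which is why the lists are sorted before printing.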
https://stackoverflow.com/questions/52004386