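Question titled "dataset python": I have a dataset listing different programming languages, and I want to find the 10 most used programming languages in it using Python.
Dataset: https://drive.google.com/file/d/1nJLDFSdIbkNxcqY7NBtJZfcgLW1wpsUZ/view?usp=sharing
Posted on 2022-09-11 20:39:27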
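On StackOverflow you should not link to external sources; include the relevant data in the question itself. You should also trim the data: if it is long, keep it as short as possible while still illustrating your problem.
Finally, on StackOverflow we don't ask simple "how do I do x" questions. You first have to try to solve the problem yourself; then, if you don't understand why your code doesn't work, post the smallest possible example and we will tell you where the bug is. You have to show your effort before we can help you fix it. The aim is for you to learn to write code yourself, not just copy ready-made solutions.
Since you haven't shown any effort, I won't give you a complete solution, but I will give you something to start with that points you in the right direction. You should first analyze my code to understand how it works; only then will you be able to finish it.
When you run this code, it prints the relevant words. You should probably process them further, removing the irrelevant characters people typed and normalizing letter case, and then you can count the occurrences of each word.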
#!/usr/bin/python3
LANGUAGE_COLUMN_INDEX = 8
with open('Salary.csv') as fp:
    # skip over first line
    fp.readline()
    for line in fp:
        for word in line.split(',')[LANGUAGE_COLUMN_INDEX].split('/'):
            print(word.strip())
Posted on 2022-09-11 20:41:00
If you use pandas and each cell in the column holds a single value, you can use .value_counts() followed by [:10].
Example for the column "Your main technology / programming language":
import pandas as pd
df = pd.read_csv('IT Salary Survey EU 2020.csv')
result = df["Your main technology / programming language"].value_counts()[:10]
print(result)
Result:
Java 184
Python 164
PHP 56
C++ 38
JavaScript 34
Javascript 31
C# 30
python 26
Scala 24
Swift 24
Name: Your main technology / programming language, dtype: int64
Edit: it seems all the text needs to be converted to lowercase so that Python and python are counted together.
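A minimal sketch of that lowercase step, assuming a small made-up Series in place of the real survey column (the values below are hypothetical, not taken from the dataset). It also strips surrounding whitespace, which is an extra assumption beyond what the edit mentions:

```python
import pandas as pd

# Hypothetical sample standing in for the survey column; not the real data.
langs = pd.Series(["Python", "python", "Python ", "Java", "java", "PHP", None])

# Trim whitespace and lowercase so spelling variants collapse into one label.
normalized = langs.str.strip().str.lower()

# value_counts() ignores missing values by default and sorts by count, descending.
top = normalized.value_counts().head(10)
print(top)
```

With the real file, the same chain could be applied to df["Your main technology / programming language"] before calling .value_counts().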
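If each cell holds several elements separated by commas, you need to split the text (which creates lists), use explode() to move the list elements into separate rows, and then count the elements with the same .value_counts() and [:10].
Example for the column "Other technologies/programming languages you use often":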
import pandas as pd
df = pd.read_csv('IT Salary Survey EU 2020.csv')
data = df["Other technologies/programming languages you use often"]
data = data.str.split(',') # convert strings into list of values
data = data.explode() # convert list into separated rows
data = data.str.strip() # remove spaces
data = data.str.lower() # convert all to lower (to count together `Python` and `python`)
result = data.value_counts()[:10]
print(result)
Result:
docker 525
sql 480
python 409
aws 401
javascript / typescript 381
kubernetes 296
java / scala 229
google cloud 150
kotlin 120
go 113
Name: Other technologies/programming languages you use often, dtype: int64
You can also write everything on one line.
result = df["Other technologies/programming languages you use often"].str.split(',').explode().str.strip().str.lower().value_counts()[:10]
Posted on 2022-09-11 21:19:47
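If you don't want to use pandas, here is a pure Python version. It uses Counter from the collections module, which is part of the standard Python distribution. The third value in each output row is the share of entries for that language relative to the total number of entries, as a percentage truncated to an integer: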
prgrLang = []  # will collect the entries in the table
with open("IT_Salary_Survey_EU_2020.csv") as f:
    Lines = f.readlines()
for line in Lines[1:]:  # [1:] skips the header line
    prgrLang.append(line.split(',')[8])  # column 9 in CSV
noOfEntries = len(Lines)-1
from collections import Counter
objCounter = Counter(prgrLang)  # counts same values in a list
lstResults = sorted( [ [value, key]
    for key, value in objCounter.items() ], reverse=True )
lstResPerc = [ [key, value, int(100*value/noOfEntries)]
    for value, key in lstResults ]
print( *lstResPerc, sep='\n' )
The code above produces the following output:
['Java', 182, 14]
['Python', 163, 13]
['', 121, 9]
['PHP', 55, 4]
['C++', 38, 3]
['JavaScript', 34, 2]
['Javascript', 31, 2]
['C#', 29, 2]
['python', 26, 2]
['Swift', 23, 1]
['Scala', 23, 1]
['Go', 23, 1]
['Python ', 21, 1]
['Kotlin', 21, 1]
['Ruby', 19, 1]
['TypeScript', 14, 1]
['SQL', 14, 1]
['.NET', 13, 1]
['Middle', 12, 0]
['JS', 11, 0]
['C', 10, 0]
['iOS', 9, 0]
['R', 9, 0]
['php', 8, 0]
['java', 8, 0]
['Typescript', 8, 0]
['Android', 8, 0]
['javascript', 7, 0]
['Kubernetes', 7, 0]
['.net', 7, 0]
['"Python', 7, 0]
['"Java', 7, 0]
['Senior', 6, 0]
['Javascript / Typescript', 6, 0]
['Php', 5, 0]
['JavaScript ', 5, 0]
['Elixir', 5, 0]
['ABAP', 5, 0]
['QA', 4, 0]
['AWS', 4, 0]
['c++', 3, 0]
['Ruby on Rails', 3, 0]
['React', 3, 0]
['Node.js', 3, 0]
['Golang', 3, 0]
['Embedded', 3, 0]
['Cloud', 3, 0]
['5"', 3, 0]
['.Net', 3, 0]
['"TypeScript', 3, 0]
['"Swift', 3, 0]
['"C', 3, 0]
['"Angular', 3, 0]
['yaml', 2, 0]
['kotlin', 2, 0]
['js', 2, 0]
['go', 2, 0]
['Sql ', 2, 0]
['Spark', 2, 0]
['PHP ', 2, 0]
['NodeJS', 2, 0]
['JavaScript/Typescript', 2, 0]
['JavaScript / TypeScript', 2, 0]
['Java/Kotlin', 2, 0]
['Java ', 2, 0]
['Frontend', 2, 0]
['Figma', 2, 0]
['C/C++', 2, 0]
['Bash', 2, 0]
['Angular', 2, 0]
['"Scala', 2, 0]
['"JS', 2, 0]
['"C++', 2, 0]
['"C#', 2, 0]
['С#', 1, 0]
['Офмф', 1, 0]
['typescript', 1, 0]
['swift', 1, 0]
['sql', 1, 0]
['spark', 1, 0]
['several', 1, 0]
['scala', 1, 0]
['ruby on rails', 1, 0]
['python ', 1, 0]
['pythin', 1, 0]
['nothing', 1, 0]
['none', 1, 0]
['n/a', 1, 0]
['k8s', 1, 0]
['julia', 1, 0]
['jenkins bash', 1, 0]
['java/scala/go/clouds/devops', 1, 0]
['golang', 1, 0]
['embedded', 1, 0]
['consumer analysis', 1, 0]
['c/c++', 1, 0]
['c#', 1, 0]
['android', 1, 0]
['Web developer', 1, 0]
['VHDL', 1, 0]
['UML', 1, 0]
['Typescript / Angular', 1, 0]
['Typescript ', 1, 0]
['TypeScript/Angular', 1, 0]
['Test Management ', 1, 0]
['Terraform ', 1, 0]
['Terraform', 1, 0]
['TS', 1, 0]
['T-SQL', 1, 0]
['Swift/Kotlin', 1, 0]
['Sql', 1, 0]
['Spring', 1, 0]
['Scala / Python', 1, 0]
['Salesforce ', 1, 0]
['SWIFT', 1, 0]
['SRE', 1, 0]
['SAP BW / ABAP', 1, 0]
['SAP ABAP', 1, 0]
['SAP / ABAP', 1, 0]
['SAP', 1, 0]
['React.js / TypeScript', 1, 0]
['React JS', 1, 0]
['React / JavaScript', 1, 0]
['React ', 1, 0]
['Qml', 1, 0]
['Qlik', 1, 0]
['Python/SQL', 1, 0]
['Python/NLP', 1, 0]
['Python / JavaScript (React)', 1, 0]
['Python + SQL', 1, 0]
['Python (Django)', 1, 0]
['Pyrhon', 1, 0]
['PowerShell', 1, 0]
['Power BI', 1, 0]
['Perl', 1, 0]
['Pegasystems platform ', 1, 0]
['PM tools', 1, 0]
['PL/SQL', 1, 0]
['PHP/MySQL', 1, 0]
['Oracle', 1, 0]
['Objective-C', 1, 0]
['NodsJs', 1, 0]
['Nodejs', 1, 0]
['NodeJS/TS', 1, 0]
['Node', 1, 0]
['Network Automation', 1, 0]
['Network', 1, 0]
['Ml/Python', 1, 0]
['Management', 1, 0]
['Magento', 1, 0]
['ML', 1, 0]
['Linux Kernel', 1, 0]
['Linux', 1, 0]
['Kubrrnetes', 1, 0]
['Kotlin/PHP', 1, 0]
['Kotlin ', 1, 0]
['Js', 1, 0]
['Jira', 1, 0]
['Javascript/Typescript', 1, 0]
['Javascript ', 1, 0]
['JavaScript/TypeScript', 1, 0]
['JavaScript/ES6', 1, 0]
['JavaScript / typescript', 1, 0]
['Java/Scala', 1, 0]
['Java/Groovy', 1, 0]
['Java/C++', 1, 0]
['Java Backend', 1, 0]
['Java / Scala', 1, 0]
['Java & Distributed Systems Stuff', 1, 0]
['JavScript', 1, 0]
['JAVA', 1, 0]
['Haskell', 1, 0]
['Hardware', 1, 0]
['Google Cloud Platform', 1, 0]
['Golang ', 1, 0]
['Go/Python', 1, 0]
['GCP', 1, 0]
['FBD', 1, 0]
['Erlang', 1, 0]
['Embedded C++', 1, 0]
['DevOps', 1, 0]
['DWH', 1, 0]
['DC Management', 1, 0]
['Cobol', 1, 0]
['Clojure', 1, 0]
['Charles', 1, 0]
['C-Level', 1, 0]
['C++/c', 1, 0]
['C++/C#', 1, 0]
['C#/.NET', 1, 0]
['C# .NET', 1, 0]
['Business Development Manager Operation ', 1, 0]
['Blockchain', 1, 0]
['Azure', 1, 0]
['Aws Hadoop Postgre Typescript', 1, 0]
['Autonomous Driving', 1, 0]
['Atlassian JIRA', 1, 0]
['Apotheker', 1, 0]
['Apache Spark', 1, 0]
['Android/Kotlin', 1, 0]
['Agile', 1, 0]
['AI', 1, 0]
['4', 1, 0]
['--', 1, 0]
['-', 1, 0]
['"python', 1, 0]
['"php', 1, 0]
['"networking', 1, 0]
['"VB', 1, 0]
['"Typescript', 1, 0]
['"Terraform', 1, 0]
['"Sql', 1, 0]
['"Spark', 1, 0]
['"Sketch', 1, 0]
['"SAS', 1, 0]
['"Qlik BI Tool', 1, 0]
['"Pascal', 1, 0]
['"PS', 1, 0]
['"NodeJS', 1, 0]
['"NLP', 1, 0]
['"Linux/UNIX', 1, 0]
['"Kubernetes', 1, 0]
['"Kuberenetes', 1, 0]
['"Kotlin', 1, 0]
['"Js', 1, 0]
['"Javascript', 1, 0]
['"JavaScript', 1, 0]
['"Grails', 1, 0]
['"Go', 1, 0]
['"Frontend: react', 1, 0]
['"Computer Networking', 1, 0]
['"BI', 1, 0]
['"Azure', 1, 0]
['"AWS', 1, 0]
['".net', 1, 0]
['".Net', 1, 0]
[' there are no ranges in the firm "', 1, 0]
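[' but as a lab scientist)"', 1, 0]
As the output above shows, the ranking is distorted by the many different ways the entries were typed and needs some fine-tuning. Here is an attempt at fine-tuning the results: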
fineTuning = True
prgrLang = []
with open("IT_Salary_Survey_EU_2020.csv") as f:
    Lines = f.readlines()
for line in Lines[1:]:
    entry = line.split(',')[8]  # 9th column in CSV
    if fineTuning:
        entry = entry.replace('"',' ').replace("'"," ").replace('/',' ')
        # runs before lower(), so only the all-lowercase 'on rails' is removed here
        entry = entry.replace('on rails',' ')
        entry = str.lower(entry)
        entry = entry.split()
    if isinstance(entry, str): Entry = [entry]
    else: Entry = entry
    Entry = [ str.strip(entry) for entry in Entry if entry]  # skip ''
    if isinstance(Entry, str): Entry = [Entry]
    prgrLang.extend(Entry)
noOfEntries = len(Lines)-1
from collections import Counter
objCounter = Counter(prgrLang)
lstResults = sorted( [ [value, key]
    for key, value in objCounter.items() ], reverse=True )
lstResPerc = [ [key, value, int(100*value/noOfEntries)]
    for value, key in lstResults ]
print( *lstResPerc, sep='\n' )
This changes the ranking to:
['python', 227, 18]
['java', 209, 16]
['javascript', 96, 7]
['php', 73, 5]
['c++', 50, 3]
['typescript', 45, 3]
['c#', 35, 2]
['scala', 30, 2]
['kotlin', 30, 2]
['swift', 29, 2]
['go', 28, 2]
['.net', 27, 2]
['ruby', 23, 1]
['sql', 22, 1]
['js', 18, 1]
['c', 17, 1]
['middle', 12, 0]
['android', 10, 0]
['r', 9, 0]
['ios', 9, 0]
['kubernetes', 8, 0]
['abap', 8, 0]
['react', 7, 0]
['angular', 7, 0]
['senior', 6, 0]
['aws', 6, 0]
['spark', 5, 0]
['nodejs', 5, 0]
['golang', 5, 0]
['embedded', 5, 0]
['elixir', 5, 0]
['sap', 4, 0]
['qa', 4, 0]
['cloud', 4, 0]
['terraform', 3, 0]
['rails', 3, 0]
['on', 3, 0]
['node.js', 3, 0]
['management', 3, 0]
['linux', 3, 0]
['bi', 3, 0]
['bash', 3, 0]
['5', 3, 0]
['yaml', 2, 0]
['ts', 2, 0]
['qlik', 2, 0]
['platform', 2, 0]
['nlp', 2, 0]
['networking', 2, 0]
['network', 2, 0]
['ml', 2, 0]
['jira', 2, 0]
['frontend', 2, 0]
['figma', 2, 0]
['devops', 2, 0]
['azure', 2, 0]
['a', 2, 0]
['с#', 1, 0]
['офмф', 1, 0]
['web', 1, 0]
['vhdl', 1, 0]
['vb', 1, 0]
['unix', 1, 0]
['uml', 1, 0]
['tools', 1, 0]
['tool', 1, 0]
['there', 1, 0]
['the', 1, 0]
['test', 1, 0]
['t-sql', 1, 0]
['systems', 1, 0]
['stuff', 1, 0]
['sre', 1, 0]
['spring', 1, 0]
['sketch', 1, 0]
['several', 1, 0]
['scientist)', 1, 0]
['sas', 1, 0]
['salesforce', 1, 0]
['react.js', 1, 0]
['ranges', 1, 0]
['qml', 1, 0]
['pythin', 1, 0]
['pyrhon', 1, 0]
['ps', 1, 0]
['powershell', 1, 0]
['power', 1, 0]
['postgre', 1, 0]
['pm', 1, 0]
['pl', 1, 0]
['perl', 1, 0]
['pegasystems', 1, 0]
['pascal', 1, 0]
['oracle', 1, 0]
['operation', 1, 0]
['objective-c', 1, 0]
['nothing', 1, 0]
['none', 1, 0]
['nodsjs', 1, 0]
['node', 1, 0]
['no', 1, 0]
['n', 1, 0]
['mysql', 1, 0]
['manager', 1, 0]
['magento', 1, 0]
['lab', 1, 0]
['kubrrnetes', 1, 0]
['kuberenetes', 1, 0]
['kernel', 1, 0]
['k8s', 1, 0]
['julia', 1, 0]
['jenkins', 1, 0]
['javscript', 1, 0]
['in', 1, 0]
['haskell', 1, 0]
['hardware', 1, 0]
['hadoop', 1, 0]
['groovy', 1, 0]
['grails', 1, 0]
['google', 1, 0]
['gcp', 1, 0]
['frontend:', 1, 0]
['firm', 1, 0]
['fbd', 1, 0]
['es6', 1, 0]
['erlang', 1, 0]
['dwh', 1, 0]
['driving', 1, 0]
['distributed', 1, 0]
['development', 1, 0]
['developer', 1, 0]
['dc', 1, 0]
['consumer', 1, 0]
['computer', 1, 0]
['cobol', 1, 0]
['clouds', 1, 0]
['clojure', 1, 0]
['charles', 1, 0]
['c-level', 1, 0]
['bw', 1, 0]
['but', 1, 0]
['business', 1, 0]
['blockchain', 1, 0]
['backend', 1, 0]
['autonomous', 1, 0]
['automation', 1, 0]
['atlassian', 1, 0]
['as', 1, 0]
['are', 1, 0]
['apotheker', 1, 0]
['apache', 1, 0]
['analysis', 1, 0]
['ai', 1, 0]
['agile', 1, 0]
['4', 1, 0]
['--', 1, 0]
['-', 1, 0]
['+', 1, 0]
['(react)', 1, 0]
['(django)', 1, 0]
['&', 1, 0]
https://stackoverflow.com/questions/73682019
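Entries that start with a stray quote in the pure Python listings above, such as '"Python', appear because line.split(',') cuts through quoted CSV fields that themselves contain commas. A hedged sketch of one way around that is to let the standard csv module parse the rows and to merge a few spelling variants through an alias table. The sample rows, the ALIASES mapping and the count_languages helper are hypothetical illustrations, not part of the original answers, and splitting multi-value cells on ',' and '/' is an assumption that differs from the whitespace split used above:

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real file has more columns,
# with the main language in column index 8.
SAMPLE = '''name,language
a,"Python, SQL"
b,python
c,JS
d,"Java, Kotlin"
e,Javascript
f,
'''

# Hypothetical alias table; extend it after inspecting the real counts.
ALIASES = {"js": "javascript", "golang": "go"}

def count_languages(lines, column):
    """Count normalized language names found in one CSV column."""
    counts = Counter()
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) <= column:
            continue  # short or empty row
        # csv.reader has already removed the quotes and kept the field whole.
        for part in row[column].replace("/", ",").split(","):
            name = part.strip().lower()
            if name:  # skip empty answers
                counts[ALIASES.get(name, name)] += 1
    return counts

counts = count_languages(io.StringIO(SAMPLE), column=1)
print(counts.most_common(10))
```

For the real file, the same helper could be called as count_languages(f, column=8) inside a with open("IT_Salary_Survey_EU_2020.csv", newline="") as f: block. Note that a respondent who lists several languages is counted once per language in this sketch.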