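I am doing feature extraction with value_counts to show the most frequently repeated strings, but I want to pick out one specific word, assign the value 1 wherever that word occurs, and fill everything else with 0. What I do now is search the strings for the word by hand, map each matching string to 1, and then fill the remaining NaN values with 0.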
print(train.key_skills.value_counts(), '\n')
train['key_skills'] = train['key_skills'].map({
'Linear Regression, Insurance Analytics, Business Analysis..':1,
'Linear Regression, Insurance Analytics, Business Analysis...':1,
'Analytics, SAS, banking, insurance, Analytics Head':1,
'NoSQL, Spark, Mapreduce, SQL, Cassandra, Data Science, SCALA, Big Data...':1,
'NoSQL, Spark, Mapreduce, SQL, Cassandra, Data Science, SCALA, Big Data...':1,
'Excel, SQL, Data Analysis, Segmentation, SAS, Data Mining, SPSS...':1,
'Linear Regression, Business Analysis, Model Development, Segmentation, Base...':1,
'Data analysis, SQL, Consulting, Data management, SPSS, FMCG, Analytical...':1,
'Data Analytics, Business Intelligence, Communication Protocols...':1,
'r, advanced analytics, segmentation, sas, machine learning...':1,
'Data Analytics, Data Science, Predictive Modeling, Project Management...':1,
'NLP, Neural Networks, Machine Learning, Data Mining...':1,
'Text Mining, Hive, NoSQL, Python, R, SQL, Data Analysis, Machine Learning...':1,
'Data Science, R, Machine Learning, Linear Regression, Cluster Analysis...':1,
'Retail Analytics, Analytics, clustering, segmentation, ranking, correlation...':1,
'Linear Regression, SAS, Data Analytics, Correlation, Statistics, analytic...':1,
'Analytics, Machine Learning, TensorFlow, Pytorch, python libraries...':1,
'Data Analytics, SQL, Statistics, R, Econometrics, Data Mining...':1,
'Quant Analytics, Analytics, Data Analysis, Sentiment Analysis...':1,
'machine learning, text mining, r, python, neural networks, sql, sas...':1,
'Predictive Modeling, Logistic Regression, R, SAS, Predictive Analytics...':1,
'Business Analyst, Data Analytics, R, Python, MATLAB, SQL, Machine Learning,...':1,
'Business Analyst, Data Analytics, R, Python, MATLAB, SQL, Machine Learning,...':1,
'Retail Analytics, Business Analysis, Excel, SAS, Data Analytics, VBA...':1,
'Deep Learning, R, Machine Learning, Python, Stakeholder Management...':1,
'Hadoop, Java, Data Science, Cloudera, Spark, Hive, Impala, Presales...':1,
'SQL, Javascript, Automation, Python, Ruby, Analytics, Machine learning...':1,
'machine learning, team leading, Analytics, Natural Language Processing...':1,
'Analytics, Data Science, Program Delivery, Solutioning, Presales, Proposals...':1,
'NLP, SAS, User Stories, Agile Development, Machine Learning, Test Scenarios...':1,
'Analytics, Head - Analytics, data analytics, Data Science, business process...':1,
'Java, SCALA, Spring, Python, Solr, Redis, Machine Learning, Algorithms, Web...':1,
'Deep Learning, NLP, Spark, Information Retrieval, Java, Python...':1,
'SCALA, Machine Learning, Java, Python, SQL, R, Pig, Data Mining, Perl...':1
})
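What I want is code that maps the term "data scientist", anywhere it appears in a string, to 1, and puts 0 wherever it does not appear.
Answered 2019-07-28 11:51:47
You don't need to build the mapping by hand; just combine str.contains with np.where.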
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['train_skills'] = [
'Linear Regression, Insurance Analytics, Business Analysis..',
'Linear Regression, Insurance Analytics, Business Analysis...',
'Analytics, SAS, banking, insurance, Analytics Head',
'NoSQL, Spark, Mapreduce, SQL, Cassandra, Data Science, SCALA, Big Data...',
'NoSQL, Spark, Mapreduce, SQL, Cassandra, Data Science, SCALA, Big Data...',
np.nan]
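###### THE LINE OF CODE YOU NEED ######
# na=False treats missing values as "no match"; without it, NaN is truthy in np.where and becomes 1
df['train_skills'] = np.where(df.train_skills.str.contains('Data Science', na=False), 1, 0)
print(df)
Output:
train_skills
0 0
1 0
2 0
3 1
4 1
5 0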
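The question's own strings mix "Data Science" and "data analytics" style casing, so a case-insensitive match may be closer to what is wanted. The following is a minimal sketch under that assumption, not the answer's original code: the frame name `train` and column `key_skills` come from the question, while the column name `has_data_science` and the sample rows chosen here are only illustrative.

```python
import numpy as np
import pandas as pd

# A few sample rows taken from the question, plus one missing value.
train = pd.DataFrame({
    "key_skills": [
        "NoSQL, Spark, Mapreduce, SQL, Cassandra, Data Science, SCALA, Big Data...",
        "machine learning, team leading, Analytics, Natural Language Processing...",
        "Analytics, Head - Analytics, data analytics, Data Science, business process...",
        np.nan,
    ]
})

# case=False ignores letter case, regex=False treats the term as plain text,
# and na=False turns missing values into False so they end up as 0.
train["has_data_science"] = (
    train["key_skills"]
    .str.contains("data science", case=False, regex=False, na=False)
    .astype(int)
)

print(train["has_data_science"].tolist())
```

Writing the flag to a new column keeps the original skill strings available, so the same pattern can be repeated for other terms.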
https://stackoverflow.com/questions/57240171