Creating dummy and categorical variables from specific words in a text column of a Python DataFrame

Stack Overflow user
Asked 2019-08-19 00:39:24
1 answer · 484 views · 0 followers · 0 votes

I am trying to use Python to generate dummy and categorical variables from a text column in a DataFrame. Suppose a DataFrame named "Cars_listing" has a text column "Cars_notes":

- "This Audi has ABS braking, leather interior and bucket seats..."
- "The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."
- "Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."
- "This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."
- "The Renault Le Car has been sitting in the garage, a little rust..."
- "The Kia Sorento for sale has a CD player, new tires..."
- "Red Dodge Viper convertible for sale, ceramic brakes, low miles..."

How can I create these new variables:

- car_type: American [Ford] (1), European [Audi, Renault] (2), Asian [Toyota, Kia] (3)
- ABS_brakes: description includes 'ABS brak' (1), or not (0)
- imperfection: description includes 'rust' or 'scratches' (1) or not (0)
- sporty: description includes 'convertible' (1) or not (0) 

I first tried re.search() (rather than re.match()), for example:

sporty = re.search("convertible",'Cars_notes')

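I have just started learning text manipulation and NLP in Python. I have searched here and in other sources (Data Camp, Udemy, Google), but I have not found anything that explains how to manipulate text to create categorical or dummy variables like these. Any help would be greatly appreciated. Thanks!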


1 Answer

Stack Overflow user
Accepted answer
Posted 2019-08-19 04:13:12

Here is my take on this.

Since you are dealing with text, pandas.Series.str.contains should be enough (there is no need for re.search).
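To illustrate the point, here is a minimal sketch on a made-up two-row frame (the `notes` frame and its rows are hypothetical, not the asker's data). The call `re.search("convertible", 'Cars_notes')` searches the literal string `'Cars_notes'` rather than the column, so it finds nothing, whereas `Series.str.contains` tests every row and returns a boolean Series.

```python
import re
import pandas as pd

# Hypothetical two-row frame, for illustration only
notes = pd.DataFrame(
    {'Cars_notes': ['Red convertible, low miles', 'Pickup truck with 4x4']}
)

# The pattern is searched in the literal text 'Cars_notes', so there is no match
literal_result = re.search('convertible', 'Cars_notes')

# Vectorized search over the column: one boolean per row.
# na=False treats missing notes as "no match" instead of propagating NaN.
has_convertible = notes['Cars_notes'].str.contains('convertible', na=False)

# Cast the booleans to a 0/1 dummy variable
sporty = has_convertible.astype(int)
```

The pattern passed to `str.contains` is treated as a regular expression by default, which is why alternations such as `'rust|scratches'` work without importing `re`.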

np.where and np.select are very useful for assigning new variables based on conditions.
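As a rough sketch of the difference between the two (the Series `s` and its contents are made up for illustration): `np.where` handles one condition with two outcomes, while `np.select` walks through several conditions in order and falls back to a default.

```python
import numpy as np
import pandas as pd

# Hypothetical notes, for illustration only
s = pd.Series(['Ford pickup, a little rust', 'Audi sedan', 'Kia hatchback'])

# np.where: a single condition, 1 where it holds and 0 elsewhere
has_rust = np.where(s.str.contains('rust'), 1, 0)

# np.select: several conditions checked in order; the first match wins,
# and rows matching none of them receive `default`
region = np.select(
    condlist=[s.str.contains('Ford'), s.str.contains('Audi')],
    choicelist=[1, 2],
    default=0,
)
```

Because the first matching condition wins, the order of `condlist` matters whenever a note could mention more than one make.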

import pandas as pd
import numpy as np

Cars_listing = pd.DataFrame({
    'Cars_notes': 
    ['"This Audi has ABS braking, leather interior and bucket seats..."',
    '"The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."',
    '"Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."',
    '"This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."',
    '"The Renault Le Car has been sitting in the garage, a little rust..."',
    '"The Kia Sorento for sale has a CD player, new tires..."',
    '"Red Dodge Viper convertible for sale, ceramic brakes, low miles..."']
})


# 1. car_type
Cars_listing['car_type'] = np.select(
    condlist=[ # note you could use the case-insensitive search with `case=False`
        Cars_listing['Cars_notes'].str.contains('ford', case=False),
        Cars_listing['Cars_notes'].str.contains('audi|renault', case=False),
        Cars_listing['Cars_notes'].str.contains('Toyota|Kia')
    ],
    choicelist=[1, 2, 3], # category codes: 1 = American, 2 = European, 3 = Asian
    default=0 # you could set it to `np.nan` etc
)

# 2. ABS_brakes
Cars_listing['ABS_brakes'] = np.where(# where(condition, [x, y])
    Cars_listing['Cars_notes'].str.contains('ABS brak'), 1, 0)

# 3. imperfection
Cars_listing['imperfection'] = np.where(
    Cars_listing['Cars_notes'].str.contains('rust|scratches'), 1, 0)

# 4. sporty
Cars_listing['sporty'] = np.where(
    Cars_listing['Cars_notes'].str.contains('convertible'), 1, 0)

Output:

    Cars_notes              car_type    ABS_brakes  imperfection    sporty
0   """This Audi has ..."   2           1           0               0
1   """The Ford F150 ..."   1           0           0               0
2   """Our Nissan Sen..."   0           1           0               0
3   """This Toyota Co..."   3           0           1               0
4   """The Renault Le..."   2           0           1               0
5   """The Kia Sorent..."   3           0           0               0
6   """Red Dodge Vipe..."   0           0           0               1
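If labeled categories or one-hot dummy columns are preferred over integer codes, one possible follow-up is sketched below. The `car_type` values are copied from the table above; the label names, including `'Other'` for code 0, are assumptions added for illustration and are not part of the original answer.

```python
import pandas as pd

# car_type codes as shown in the output table (0 = no listed make matched)
car_type = pd.Series([2, 1, 0, 3, 2, 3, 0], name='car_type')

# Hypothetical label names for the codes
labels = {0: 'Other', 1: 'American', 2: 'European', 3: 'Asian'}

# Map the codes to labels and store them as a pandas categorical
car_region = car_type.map(labels).astype('category')

# One 0/1 indicator column per category
dummies = pd.get_dummies(car_region, prefix='car_type', dtype=int)
```

`pd.get_dummies` produces one column per category, which is often what modeling libraries expect; the single integer-coded column is more compact but implies an ordering that these categories do not have.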
Votes: 1
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/57546489
