我正在尝试使用Python从dataframe中的文本列生成虚拟变量和分类变量。想象一下,在名为“Cars_listing”的数据帧中有一个文本列“Cars_notes”:
- "This Audi has ABS braking, leather interior and bucket seats..."
- "The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."
- "Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."
- "This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."
- "The Renault Le Car has been sitting in the garage, a little rust..."
- "The Kia Sorento for sale has a CD player, new tires..."
- "Red Dodge Viper convertible for sale, ceramic brakes, low miles..."
如何创建新变量:
- car_type: American [Ford] (1), European [Audi, Renault] (2), Asian [Toyota, Kia] (3)
- ABS_brakes: description includes 'ABS brak' (1), or not (0)
- imperfection: description includes 'rust' or 'scratches' (1) or not (0)
- sporty: description includes 'convertible' (1) or not (0)
我首先尝试了re.search() (而不是re.match()),例如:
sporty = re.search("convertible",'Cars_notes')
我刚刚开始学习Python文本操作和NLP。我已经在这里搜索了信息以及其他来源(Data Camp,Udemy,Google搜索),但我还没有找到一些东西来解释如何操作文本来创建这样的分类变量或虚拟变量。我们将非常感谢您的帮助。谢谢!
发布于 2019-08-18 20:13:12
这是我对此的看法。
因为您正在处理文本,所以pandas.Series.str.contains
应该足够了(不需要使用re.search
。
在根据条件分配新变量时,np.where
和np.select
非常有用。
import pandas as pd
import numpy as np
Cars_listing = pd.DataFrame({
'Cars_notes':
['"This Audi has ABS braking, leather interior and bucket seats..."',
'"The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."',
'"Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."',
'"This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."',
'"The Renault Le Car has been sitting in the garage, a little rust..."',
'"The Kia Sorento for sale has a CD player, new tires..."',
'"Red Dodge Viper convertible for sale, ceramic brakes, low miles..."']
})
# 1. car_type
Cars_listing['car_type'] = np.select(
condlist=[ # note you could use the case-insensitive search with `case=False`
Cars_listing['Cars_notes'].str.contains('ford', case=False),
Cars_listing['Cars_notes'].str.contains('audi|renault', case=False),
Cars_listing['Cars_notes'].str.contains('Toyota|Kia')
],
choicelist=[1, 2, 3], # dummy variables
default=0 # you could set it to `np.nan` etc
)
# 2. ABS_brakes
Cars_listing['ABS_brakes'] = np.where(# where(condition, [x, y])
Cars_listing['Cars_notes'].str.contains('ABS brak'), 1, 0)
# 3. imperfection
Cars_listing['imperfection'] = np.where(
Cars_listing['Cars_notes'].str.contains('rust|scratches'), 1, 0)
# 4. sporty
Cars_listing['sporty'] = np.where(
Cars_listing['Cars_notes'].str.contains('convertible'), 1, 0)
Cars_notes car_type ABS_brakes imperfection sporty
0 """This Audi has ..." 2 1 0 0
1 """The Ford F150 ..." 1 0 0 0
2 """Our Nissan Sen..." 0 1 0 0
3 """This Toyota Co..." 3 0 1 0
4 """The Renault Le..." 2 0 1 0
5 """The Kia Sorent..." 3 0 0 0
6 """Red Dodge Vipe..." 0 0 0 1
https://stackoverflow.com/questions/57546489
复制