Q: Creating dummy and categorical variables from specific words in a text column of a Python dataframe

Stack Overflow user

Asked 2019-08-18 16:39:24

Answers: 1, Views: 484, Followers: 0, Votes: 0

I am trying to use Python to generate dummy and categorical variables from a text column in a dataframe. Imagine a dataframe named "Cars_listing" with a text column "Cars_notes":

- "This Audi has ABS braking, leather interior and bucket seats..."
- "The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."
- "Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."
- "This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."
- "The Renault Le Car has been sitting in the garage, a little rust..."
- "The Kia Sorento for sale has a CD player, new tires..."
- "Red Dodge Viper convertible for sale, ceramic brakes, low miles..."

How can I create these new variables:

- car_type: American [Ford] (1), European [Audi, Renault] (2), Asian [Toyota, Kia] (3)
- ABS_brakes: description includes 'ABS brak' (1), or not (0)
- imperfection: description includes 'rust' or 'scratches' (1) or not (0)
- sporty: description includes 'convertible' (1) or not (0)

I first tried re.search() (rather than re.match()), for example:

sporty = re.search("convertible",'Cars_notes')

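I am just starting to learn Python text manipulation and NLP. I have searched here and in other sources (Data Camp, Udemy, Google), but I have not found anything that explains how to manipulate text to create categorical or dummy variables like these. Any help would be greatly appreciated. Thanks!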

python

string

pandas

variables

text

1 Answer

Stack Overflow user

Accepted answer

Posted 2019-08-18 20:13:12

Here is my take on it.

Since you are working with text, pandas.Series.str.contains should be enough (there is no need to use re.search).
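
As a small illustration of that point, `str.contains` returns one boolean per row and treats the pattern as a regular expression by default. The three-row `notes` Series below is made up for the example and is not the asker's dataframe; `na=False` is an optional argument, assumed here so that missing values count as non-matches.

```python
import pandas as pd

# Hypothetical sample data, not the asker's dataframe
notes = pd.Series([
    "Red Dodge Viper convertible for sale",
    "The Kia Sorento for sale has a CD player",
    None,
])

# One boolean per row; na=False treats the missing value as "no match"
is_sporty = notes.str.contains("convertible", na=False)

# '|' acts as regex alternation; case=False ignores letter case
is_asian = notes.str.contains("toyota|kia", case=False, na=False)

print(is_sporty.tolist())  # [True, False, False]
print(is_asian.tolist())   # [False, True, False]
```

By contrast, `re.search("convertible", 'Cars_notes')` searches the literal string `'Cars_notes'` and not the column, so it never looks at the listings.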

np.where and np.select are very useful when assigning new variables based on conditions.

import pandas as pd
import numpy as np

Cars_listing = pd.DataFrame({
    'Cars_notes': 
    ['"This Audi has ABS braking, leather interior and bucket seats..."',
    '"The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."',
    '"Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."',
    '"This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."',
    '"The Renault Le Car has been sitting in the garage, a little rust..."',
    '"The Kia Sorento for sale has a CD player, new tires..."',
    '"Red Dodge Viper convertible for sale, ceramic brakes, low miles..."']
})


# 1. car_type
Cars_listing['car_type'] = np.select(
    condlist=[ # note you could use the case-insensitive search with `case=False`
        Cars_listing['Cars_notes'].str.contains('ford', case=False),
        Cars_listing['Cars_notes'].str.contains('audi|renault', case=False),
        Cars_listing['Cars_notes'].str.contains('Toyota|Kia')
    ],
    choicelist=[1, 2, 3], # dummy variables
    default=0 # you could set it to `np.nan` etc
)

# 2. ABS_brakes
Cars_listing['ABS_brakes'] = np.where(# where(condition, [x, y])
    Cars_listing['Cars_notes'].str.contains('ABS brak'), 1, 0)

# 3. imperfection
Cars_listing['imperfection'] = np.where(
    Cars_listing['Cars_notes'].str.contains('rust|scratches'), 1, 0)

# 4. sporty
Cars_listing['sporty'] = np.where(
    Cars_listing['Cars_notes'].str.contains('convertible'), 1, 0)

    Cars_notes              car_type    ABS_brakes  imperfection    sporty
0   """This Audi has ..."   2           1           0               0
1   """The Ford F150 ..."   1           0           0               0
2   """Our Nissan Sen..."   0           1           0               0
3   """This Toyota Co..."   3           0           1               0
4   """The Renault Le..."   2           0           1               0
5   """The Kia Sorent..."   3           0           0               0
6   """Red Dodge Vipe..."   0           0           0               1
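
A possible variation, offered as a sketch and not as part of the accepted answer: because `str.contains` already yields booleans, a 0/1 dummy can also be obtained with `.astype(int)` instead of `np.where`. The shortened `demo` dataframe is invented for illustration.

```python
import pandas as pd

# Hypothetical, shortened copy of three listings
demo = pd.DataFrame({
    "Cars_notes": [
        "This Toyota Corolla is a gem, with new tires, a few scratches...",
        "The Renault Le Car has been sitting in the garage, a little rust...",
        "The Kia Sorento for sale has a CD player, new tires...",
    ]
})

# Boolean mask converted directly to 1/0
demo["imperfection"] = (
    demo["Cars_notes"].str.contains("rust|scratches").astype(int)
)

print(demo["imperfection"].tolist())  # [1, 1, 0]
```

`np.where` remains the more flexible choice when the two output values are not simply 1 and 0.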