
Feature Engineering 2. Categorical Encodings

Michael阿明
Published 2020-07-13 14:38:34

Intermediate Machine Learning already introduced Label Encoding and One-Hot Encoding. Next we look at count encoding, target encoding, and singular value decomposition (SVD).

In the previous article, LabelEncoder() gave Validation AUC score: 0.7467

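from sklearn.preprocessing import LabelEncoder

# Label encoding: map each category to an integer, column by column
# (`ks` is the DataFrame used in the previous article)
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)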

1. Count Encoding

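  • Count encoding replaces each categorical value with the number of times it appears in the data. For example, if CN appears 100 times in a feature, every CN is replaced with the number 100.
  • With category_encoders.CountEncoder(), the final score is Validation AUC score: 0.7486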
import category_encoders as ce
cat_features = ['category', 'currency', 'country']
count_enc = ce.CountEncoder()
count_encoded = count_enc.fit_transform(ks[cat_features])

data = baseline_data.join(count_encoded.add_suffix("_count"))

# Train a model on the baseline data plus the count-encoded features
train, valid, test = get_data_splits(data)
bst = train_model(train, valid)
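
To make the idea concrete, here is a minimal sketch of count encoding done by hand with pandas. The toy DataFrame and the column name country_count are made up for illustration; this shows the idea only and is not necessarily how CountEncoder is implemented internally.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({"country": ["CN", "US", "CN", "GB", "CN", "US"]})

# Count how often each category appears in the column
counts = df["country"].value_counts()

# Replace each value with the count of its category
df["country_count"] = df["country"].map(counts)

print(df)
```

Every CN row receives the same number, so categories that happen to occur equally often become indistinguishable after encoding.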

2. Target Encoding

  • With category_encoders.TargetEncoder(), the final score is Validation AUC score: 0.7491

Target encoding replaces a categorical value with the average value of the target for that value of the feature. For example, given the country value "CA", you'd calculate the average outcome for all the rows with country == 'CA', around 0.28. This is often blended with the target probability over the entire dataset to reduce the variance of values with few occurrences. A category seen only a handful of times has a noisy average, so pulling it toward the overall average keeps a few rows from producing an extreme encoding.

This technique uses the targets to create new features, so including the validation or test data in the target encodings would be a form of target leakage. Instead, you should learn the target encodings from the training dataset only and apply them to the other datasets.

import category_encoders as ce
cat_features = ['category', 'currency', 'country']

# Create the encoder itself
target_enc = ce.TargetEncoder(cols=cat_features)

train, valid, _ = get_data_splits(data)

# Fit the encoder using the categorical features and target
target_enc.fit(train[cat_features], train['outcome'])

# Transform the features, rename the columns with _target suffix, and join to dataframe
train = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

train.head()
bst = train_model(train, valid)
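
The blending and the train-only fit described above can be sketched by hand. This is a minimal illustration that assumes simple additive smoothing with a hypothetical weight m; category_encoders.TargetEncoder has its own weighting scheme, so its numbers will generally differ. The toy data is invented.

```python
import pandas as pd

# Toy data, invented for illustration only
train = pd.DataFrame({
    "country": ["CA", "CA", "CA", "US", "US", "GB"],
    "outcome": [1, 0, 0, 1, 1, 0],
})
valid = pd.DataFrame({"country": ["CA", "GB", "FR"]})

m = 2.0                          # hypothetical smoothing weight (an assumption)
prior = train["outcome"].mean()  # overall target rate, from training rows only

# Per-category target sum and row count, learned from the training set only
stats = train.groupby("country")["outcome"].agg(["sum", "count"])

# Blend each category mean with the overall mean; rare categories lean on the prior
encoding = (stats["sum"] + m * prior) / (stats["count"] + m)

# Apply the same mapping to both sets; unseen categories fall back to the prior
train["country_target"] = train["country"].map(encoding)
valid["country_target"] = valid["country"].map(encoding).fillna(prior)

print(encoding)
print(valid)
```

Because the mapping is computed from train alone, no validation label influences the encoded values, which is the leakage rule stated above. Falling back to the prior for an unseen category such as FR is one reasonable choice, not the only one.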

3. CatBoost Encoding

  • With category_encoders.CatBoostEncoder(), the final score is Validation AUC score: 0.7492

This is similar to target encoding in that it's based on the target probability for a given value. However, with CatBoost, for each row, the target probability is calculated only from the rows before it.

cat_features = ['category', 'currency', 'country']
target_enc = ce.CatBoostEncoder(cols=cat_features)

train, valid, _ = get_data_splits(data)
target_enc.fit(train[cat_features], train['outcome'])

train = train.join(target_enc.transform(train[cat_features]).add_suffix('_cb'))
valid = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_cb'))

bst = train_model(train, valid)
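
The "only from the rows before it" rule can be sketched with cumulative sums. This is a simplified illustration under stated assumptions: the rows are taken in their given order, and a hypothetical prior weight a smooths the first occurrences. The actual CatBoostEncoder may differ in details such as the prior and how it treats data at transform time.

```python
import pandas as pd

# Toy data, invented for illustration only; row order matters here
train = pd.DataFrame({
    "country": ["CA", "US", "CA", "CA", "US"],
    "outcome": [1, 0, 0, 1, 1],
})

a = 1.0                          # hypothetical prior weight (an assumption)
prior = train["outcome"].mean()  # overall target rate

grouped = train.groupby("country")["outcome"]

# Target sum over earlier rows of the same category (current row excluded)
prev_sum = grouped.cumsum() - train["outcome"]

# Number of earlier rows of the same category
prev_count = grouped.cumcount()

# Each row is encoded from its predecessors only, smoothed toward the prior
train["country_cb"] = (prev_sum + a * prior) / (prev_count + a)

print(train)
```

The first row of each category has no predecessors and so receives the prior itself. Since a row's own label never enters its encoding, this form of target leakage is avoided within the training set.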

Originally published: 2020/05/20 (author's personal site/blog)
