sklearn Machine Learning Pipeline Template

Michael阿明
Published 2021-02-19 10:35:22
Included in the column: Michael阿明学习之路

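Contents

    • 1. Import the packages
    • 2. Read the data
    • 3. Separate numeric and text features
    • 4. Data processing Pipeline
    • 5. Try different models
    • 6. Parameter search
    • 7. Feature selection by importance
    • 8. The final complete Pipeline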

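This post builds a machine learning workflow with sklearn's Pipeline. The example is the [Kesci] newcomer competition on employee satisfaction prediction. Reference: [Hands On ML] 2. A complete machine learning project (California housing price prediction).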

1. Import the packages

import numpy as np
import pandas as pd
# IPython/Jupyter magic; remove this line when running as a plain script
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

2. Read the data

data = pd.read_csv("../competition/Employee_Satisfaction/train.csv")
test = pd.read_csv("../competition/Employee_Satisfaction/test.csv")
data.columns
Output:
Index(['id', 'last_evaluation', 'number_project', 'average_monthly_hours',
       'time_spend_company', 'Work_accident', 'package',
       'promotion_last_5years', 'division', 'salary', 'satisfaction_level'],
      dtype='object')
  • Separate the training features from the label
y = data['satisfaction_level']
X = data.drop(['satisfaction_level'], axis=1)

3. Separate numeric and text features

def num_cat_splitor(X):
    # Columns with dtype 'object' are treated as text (categorical) features
    s = (X.dtypes == 'object')
    object_cols = list(s[s].index)
    # object_cols # ['package', 'division', 'salary']
    # Keep the original column order; a set difference gives no guaranteed order
    num_cols = [col for col in X.columns if col not in object_cols]
    # num_cols (note that 'id' is numeric, so it ends up here too)
    # ['id', 'last_evaluation', 'number_project', 'average_monthly_hours',
    #  'time_spend_company', 'Work_accident', 'promotion_last_5years']
    return num_cols, object_cols
num_cols, object_cols = num_cat_splitor(X)
# print(num_cols)
# print(object_cols)
# X[object_cols].values
  • Feature selector (picks the named columns and returns their values)
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
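
As a quick check of what the selector returns, here is a minimal sketch on a made-up DataFrame. The toy values and the variable names `toy` and `selected` are assumptions for illustration, not the competition data; the class is repeated so the snippet runs on its own.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the named DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.attribute_names].values


# Hypothetical toy data: two numeric columns and one text column
toy = pd.DataFrame({
    "last_evaluation": [0.5, 0.9],
    "number_project": [3, 5],
    "salary": ["low", "high"],
})

selected = DataFrameSelector(["last_evaluation", "number_project"]).fit_transform(toy)
print(type(selected).__name__, selected.shape)
```

Because the selector hands a plain array to the next step, the later steps (imputer, scaler, encoder) no longer see column names.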

4. Data processing Pipeline

  • Numeric features
num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_cols)),
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])
  • Text features
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(object_cols)),
        # scikit-learn >= 1.2 names this argument sparse_output;
        # older releases use sparse=False instead
        ('cat_encoder', OneHotEncoder(sparse_output=False)),
    ])
  • Combine the numeric and text features
full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
X_prepared = full_pipeline.fit_transform(X)
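
The same numeric/text split can also be expressed with `ColumnTransformer` (available since scikit-learn 0.20), which routes columns by name without a custom selector class. This is a swapped-in alternative, not what this post uses. The sketch below runs on a made-up DataFrame; the toy values and the names `toy`, `toy_num_cols`, `toy_cat_cols` and `preprocess` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing numeric value
toy = pd.DataFrame({
    "number_project": [2, 5, 3, 4],
    "average_monthly_hours": [150.0, np.nan, 210.0, 180.0],
    "salary": ["low", "high", "medium", "low"],
})
toy_num_cols = ["number_project", "average_monthly_hours"]
toy_cat_cols = ["salary"]

# Same steps as num_pipeline, minus the selector
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, toy_num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), toy_cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_prepared = preprocess.fit_transform(toy)
print(toy_prepared.shape)  # 2 numeric columns + 3 one-hot columns
```

`handle_unknown="ignore"` is an extra choice made here so that categories seen only in the test set do not raise an error at prediction time.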

5. Try different models

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_scores = cross_val_score(forest_reg,X_prepared,y,
                               scoring='neg_mean_squared_error',cv=3)
forest_rmse_scores = np.sqrt(-forest_scores)
print(forest_rmse_scores)
print(forest_rmse_scores.mean())
print(forest_rmse_scores.std())

Other models can be tried in the same way.
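
One way to compare candidates is to loop over several regressors with the same `cross_val_score` call. The sketch below is only an illustration: it uses synthetic data from `make_regression` as a stand-in for `X_prepared` and `y`, and the two models chosen here are assumptions, not models this post evaluated.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X_prepared and y so that the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=0.1, random_state=42)

candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=42),
}

rmse_by_model = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=3)
    # Scores are negative MSE, so flip the sign before taking the root
    rmse_by_model[name] = np.sqrt(-scores).mean()
    print(name, rmse_by_model[name])
```

With the real data, replace `X_demo` and `y_demo` by `X_prepared` and `y`.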

6. Parameter search

param_grid = [
    {'n_estimators' : [3,10,30,50,80],'max_features':[2,4,6,8]},
    {'bootstrap':[False], 'n_estimators' : [3,10],'max_features':[2,3,4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                          scoring='neg_mean_squared_error')
grid_search.fit(X_prepared,y)
  • Best parameters
grid_search.best_params_
  • Best model
grid_search.best_estimator_
  • Search results
cv_result = grid_search.cv_results_
for mean_score, params in zip(cv_result['mean_test_score'], cv_result['params']):
    print(np.sqrt(-mean_score), params)
Output:
0.2129252723367584 {'max_features': 2, 'n_estimators': 3}
0.19276874697889504 {'max_features': 2, 'n_estimators': 10}
0.1865548358477794 {'max_features': 2, 'n_estimators': 30}
.......
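
With many parameter combinations it can help to sort the results by RMSE rather than read them in grid order. The sketch below assumes a dictionary shaped like `grid_search.cv_results_` with only the two keys used above; the scores in `cv_result_demo` are made-up illustrative numbers, not results from this competition.

```python
import numpy as np

# Hand-written stand-in for grid_search.cv_results_ (illustrative values only)
cv_result_demo = {
    "mean_test_score": np.array([-0.045, -0.037, -0.035]),
    "params": [
        {"max_features": 2, "n_estimators": 3},
        {"max_features": 2, "n_estimators": 10},
        {"max_features": 2, "n_estimators": 30},
    ],
}

# mean_test_score holds negative MSE, so negate before taking the square root
rmse = np.sqrt(-cv_result_demo["mean_test_score"])

# Smallest RMSE first
order = np.argsort(rmse)
ranked = [(float(rmse[i]), cv_result_demo["params"][i]) for i in order]

for score, params in ranked:
    print(round(score, 4), params)
```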

7. Feature selection by importance

feature_importances = grid_search.best_estimator_.feature_importances_
  • Select the k most important features
k = 3
def indices_of_top_k(arr, k):
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k
    def fit(self, X, y=None):
        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
        return self
    def transform(self, X):
        return X[:, self.feature_indices_]
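
To see what `indices_of_top_k` returns, here is a small sketch with made-up importance values. `importances_demo` and `X_demo` are hypothetical and only serve to show the indexing; the function is repeated so the snippet runs on its own.

```python
import numpy as np


def indices_of_top_k(arr, k):
    # argpartition moves the k largest values to the last k positions
    # (in no particular order); np.sort then restores column order
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])


# Hypothetical importances for five features
importances_demo = [0.05, 0.40, 0.10, 0.30, 0.15]
top = indices_of_top_k(importances_demo, 3)
print(top)  # [1 3 4]

# Keeping only those columns, as TopFeatureSelector.transform does
X_demo = np.arange(10).reshape(2, 5)
X_top = X_demo[:, top]
print(X_top)  # [[1 3 4]
              #  [6 8 9]]
```

The returned indices are sorted by position, not by importance, so the selected columns keep their original left-to-right order.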

8. The final complete Pipeline

prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importances, k)),
    ('forest_reg', RandomForestRegressor())
])
  • Parameter search
# Keys follow the <step>__<parameter> convention, so they must match the step names above
param_grid = [{
    'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
    'feature_selection__k': list(range(5, len(feature_importances) + 1)),
    'forest_reg__n_estimators' : [200,250,300,310,330],
    'forest_reg__max_features':[2,4,6,8]
}]

grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=10,
                                scoring='neg_mean_squared_error', verbose=2, n_jobs=-1)
  • Training
grid_search_prep.fit(X,y)
grid_search_prep.best_params_
final_model = grid_search_prep.best_estimator_
  • Prediction
y_pred_test = final_model.predict(test)
result = pd.DataFrame()
result['id'] = test['id']
result['satisfaction_level'] = y_pred_test
result.to_csv('rf_ML_pipeline.csv',index=False)

The above is only a rough overall framework, and many details are left out. Feedback is welcome!

This post was shared from the author's personal site/blog through the Tencent Cloud self-media syndication program.
Originally published: 2020/07/29
