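# Analyzing K-Pop Using Machine Learning (Part 3): Model Building

## Subsetting the data frame and converting categorical variables to dummy variables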

```python
# keep only the columns used for modeling
df_model = df[['popl_by_co_yn', 'reason', 'yr_listened',
               'gender_pref', 'daily_music_hr', 'watch_MV_yn', 'daily_MV_hr',
               'obsessed_yn', 'news_medium', 'pursuit', 'time_cons_yn', 'life_chg',
               'pos_eff', 'yr_merch_spent', 'money_src', 'concert_yn', 'crazy_ev',
               'age', 'country', 'job', 'gender', 'num_gr_like', 'bts_vs_others']]
```
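The snippet above shows only the subsetting step. One common way to do the dummy-variable conversion named in the heading is `pandas.get_dummies`; the sketch below assumes that approach and uses a toy frame whose values are made up (only the column names come from the article).

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the article.
toy = pd.DataFrame({
    'daily_music_hr': [1.0, 3.5, 2.0],
    'gender': ['F', 'M', 'F'],
    'concert_yn': ['yes', 'no', 'no'],
})

# One-hot encode the text columns; numeric columns pass through unchanged.
toy_dum = pd.get_dummies(toy)
print(sorted(toy_dum.columns))
```

Each category value becomes its own indicator column (for example `gender_F` and `gender_M`), which is the form scikit-learn estimators expect.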

## Train/test split

Set `X` to every variable other than `daily_music_hr`, and set `y` to `daily_music_hr`, the target. Then hold out 20% of the data as the test set and train on the remaining 80%.
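A minimal sketch of this split, assuming scikit-learn's `train_test_split`. The frame `df_dum`, its values, and `random_state=1` are placeholders chosen for illustration, not taken from the article.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dummy-encoded frame (hypothetical values).
rng = np.random.default_rng(0)
df_dum = pd.DataFrame({
    'daily_music_hr': rng.uniform(0, 8, size=50),
    'age': rng.integers(12, 40, size=50),
    'concert_yn_yes': rng.integers(0, 2, size=50),
})

# Features are everything except the target column.
X = df_dum.drop('daily_music_hr', axis=1)
y = df_dum['daily_music_hr']

# 80% train, 20% test; random_state is an assumption made for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
print(X_train.shape, X_test.shape)
```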

## Multiple linear regression

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# initialize the linear regression model
lm = LinearRegression()
# train the model
lm.fit(X_train, y_train)
# perform prediction on the test data
y_pred = lm.predict(X_test)
# performance metrics
print('Coefficients:', lm.coef_)
print('Intercept:', lm.intercept_)
print('Mean absolute error (MAE): %.2f' % mean_absolute_error(y_test, y_pred))
```

## Lasso regression

Lasso achieved an MAE of 1.58.
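For readers who want to see the mechanics, here is a minimal sketch of fitting a Lasso model and computing its MAE with scikit-learn. It runs on synthetic data, and `alpha=1.0` is an assumed value, so the resulting MAE is not comparable to the 1.58 reported above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the survey features (assumption).
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# alpha controls the strength of the L1 penalty; 1.0 is an assumed value.
lm_las = Lasso(alpha=1.0)
lm_las.fit(X_train, y_train)

las_mae = mean_absolute_error(y_test, lm_las.predict(X_test))
# The L1 penalty can shrink some coefficients to exactly zero.
n_zero = int(np.sum(lm_las.coef_ == 0))
print('MAE: %.2f, zeroed coefficients: %d' % (las_mae, n_zero))
```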

## Random forest regression

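```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# base estimator; assumed definition, the original snippet used `rf` without defining it
rf = RandomForestRegressor()
# hyperparameter grid; scikit-learn < 1.0 names the criteria 'mse'/'mae'
# and accepts max_features='auto' (all features) in place of 1.0
params = {'n_estimators': range(10, 100, 10),
          'criterion': ('squared_error', 'absolute_error'),
          'max_features': (1.0, 'sqrt', 'log2')}
# 10-fold cross-validated grid search scored by negative MAE
gs_rf = GridSearchCV(rf, params,
                     scoring='neg_mean_absolute_error', cv=10)
gs_rf.fit(X_train, y_train)
```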
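Once a grid search has been fitted, the winning settings and their cross-validated score can be read off the search object. The sketch below uses a deliberately small grid, 3-fold cross-validation, and synthetic data so that it runs quickly; all of those choices are assumptions for illustration and differ from the grid above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data and a reduced grid (both assumptions) to keep the run short.
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)
small_grid = {'n_estimators': [10, 20], 'max_features': ['sqrt', 1.0]}

gs = GridSearchCV(RandomForestRegressor(random_state=0), small_grid,
                  scoring='neg_mean_absolute_error', cv=3)
gs.fit(X, y)

# The scorer returns negative MAE (higher is better), so flip the sign.
best_cv_mae = -gs.best_score_
print('Best parameters:', gs.best_params_)
print('Best cross-validated MAE: %.2f' % best_cv_mae)
```

`best_estimator_` is refit on all the data passed to `fit` using the best parameters, which is why it can be used directly for prediction afterwards.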

## XGBoost

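```python
from xgboost import XGBRegressor

# base estimator; assumed definition, the original snippet used `xgb` without defining it
xgb = XGBRegressor()
# hyperparameter grid to search
params = {'min_child_weight': [3, 5],
          'gamma': [0.5, 1],
          'subsample': [0.8, 1.0],
          'colsample_bytree': [0.6, 0.8],
          'max_depth': [1, 2]}
# 10-fold cross-validated grid search scored by negative MAE
gs_xgb = GridSearchCV(xgb, params,
                      scoring='neg_mean_absolute_error', cv=10)
gs_xgb.fit(X_train, y_train)
```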

## Comparing the performance of all models

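```python
# predict on the held-out test set with each fitted model
lm_pred = lm.predict(X_test)
lm_las_pred = lm_las.predict(X_test)
lm_rid_pred = lm_rid.predict(X_test)
# for the grid searches, use the best estimator found
rf_pred = gs_rf.best_estimator_.predict(X_test)
xgb_pred = gs_xgb.best_estimator_.predict(X_test)
```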
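With predictions in hand, one way to compare models is to compute the test-set MAE for each and rank them. The sketch below does this for three linear models on synthetic data; the data, the `alpha` values, and the dictionary layout are assumptions for illustration and do not reproduce the article's results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the survey features (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Hypothetical model lineup; alpha values are assumed defaults.
models = {
    'linear': LinearRegression(),
    'lasso': Lasso(alpha=1.0),
    'ridge': Ridge(alpha=1.0),
}

# Fit each model and record its MAE on the same held-out test set.
maes = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    maes[name] = mean_absolute_error(y_test, model.predict(X_test))

# Lower MAE is better, so sort ascending.
for name, mae in sorted(maes.items(), key=lambda kv: kv[1]):
    print('%s: MAE = %.2f' % (name, mae))
best = min(maes, key=maes.get)
```

Scoring every model against the same test split keeps the comparison fair, since differences in MAE then reflect the models and not the data they were evaluated on.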

Jaemin Lee specializes in data analysis and data science and is a recent data science graduate.

https://towardsdatascience.com/analyzing-k-pop-using-machine-learning-part-3-model-building-c19149964a22

• This article is an exclusive contribution to InfoQ China.
• First published at https://www.infoq.cn/article/RpO2oAjua52z4LMycCJZ