A First Look at Kaggle: House Price Prediction, Model Building

Author: zhuanxu
Published 2018-08-23 13:07:36
Collected in the column: 进击的程序猿

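Overview

The data for this article comes from Kaggle's House Prices: Advanced Regression Techniques competition.

This article continues from "A First Look at Kaggle: House Price Prediction, Data Analysis" and covers the modeling part.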

import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
from scipy.stats import skew
from scipy.stats import norm
import matplotlib
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# import warnings
# warnings.filterwarnings('ignore')

%config InlineBackend.figure_format = 'retina' #set 'png' here when working on notebook
%matplotlib inline
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")

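Feature engineering

The features are processed here according to the analysis in "A First Look at Kaggle: House Price Prediction, Data Analysis".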

all_df = pd.concat((train_df.loc[:,'MSSubClass':'SaleCondition'], test_df.loc[:,'MSSubClass':'SaleCondition']), axis=0,ignore_index=True)
all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)
quantitative = [f for f in all_df.columns if all_df.dtypes[f] != 'object']
qualitative = [f for f in all_df.columns if all_df.dtypes[f] == 'object']

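Handling missing data

Columns with more than one missing value are dropped outright. Columns with exactly one missing value are kept and filled in later.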

missing = all_df.isnull().sum()
missing.sort_values(inplace=True,ascending=False)
missing = missing[missing > 0]
# Dealing with missing data: drop every column that has more than one missing value
all_df = all_df.drop(columns=missing[missing > 1].index)
# Columns with exactly one missing value are filled with the column mean later on
all_df.isnull().sum()[all_df.isnull().sum()>0]
Exterior1st    1
Exterior2nd    1
BsmtFinSF1     1
BsmtFinSF2     1
BsmtUnfSF      1
TotalBsmtSF    1
Electrical     1
KitchenQual    1
GarageCars     1
GarageArea     1
SaleType       1
dtype: int64
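
As a rough illustration of the drop-then-fill rule, here is a minimal sketch on a made-up frame (the values and the three-column layout are assumptions for illustration, not the competition data). Columns with more than one missing value are dropped, and the remaining single gaps in numeric columns are filled with the column mean.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for all_df (values are made up for illustration).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, np.nan, 60.0],   # 2 missing -> dropped
    "GarageArea":  [548.0, 460.0, np.nan, 642.0],  # 1 missing -> kept, filled
    "LotArea":     [8450, 9600, 11250, 9550],      # complete
})

missing = toy.isnull().sum()
to_drop = missing[missing > 1].index   # columns with more than one NaN
toy = toy.drop(columns=to_drop)

# Fill the remaining single gaps with the column mean (numeric columns only).
toy = toy.fillna(toy.mean(numeric_only=True))
print(toy)
```

A mean fill like this only touches numeric columns. A categorical column with a gap has no mean, so its missing entry has to be handled by the encoding step instead.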

Log transforms

We apply log1p to the following features: GrLivArea, 1stFlrSF, 2ndFlrSF, TotalBsmtSF, LotArea, KitchenAbvGr and GarageArea.

logfeatures = ['GrLivArea','1stFlrSF','2ndFlrSF','TotalBsmtSF','LotArea','KitchenAbvGr','GarageArea']
for logfeature in logfeatures:
    all_df[logfeature] = np.log1p(all_df[logfeature].values)
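
A note on why log1p is a convenient choice: log1p(x) computes log(1 + x), so a zero area maps to zero instead of minus infinity, and np.expm1 inverts it. The sketch below uses synthetic right-skewed values (an assumption for illustration, not the real features) to show the effect on skewness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-in for an area feature (not the real data).
area = pd.Series(rng.lognormal(mean=7.0, sigma=0.6, size=500))

logged = np.log1p(area)

print("skew before:", area.skew())
print("skew after: ", logged.skew())

# Zero stays zero, and expm1 undoes the transform.
print(np.log1p(0.0))
print(np.allclose(np.expm1(logged), area))
```

Because log1p(0) is 0, a test such as `x > 0` marks the same rows before and after the transform.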

Boolean variables

all_df['HasBasement'] = all_df['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
all_df['HasGarage'] = all_df['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
all_df['Has2ndFloor'] = all_df['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
all_df['HasWoodDeck'] = all_df['WoodDeckSF'].apply(lambda x: 1 if x > 0 else 0)
all_df['HasPorch'] = all_df['OpenPorchSF'].apply(lambda x: 1 if x > 0 else 0)
all_df['HasPool'] = all_df['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
all_df['IsNew'] = all_df['YearBuilt'].apply(lambda x: 1 if x > 2000 else 0)
quantitative = [f for f in all_df.columns if all_df.dtypes[f] != 'object']
qualitative = [f for f in all_df.columns if all_df.dtypes[f] == 'object']
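
The flag columns above are built with `apply` and a lambda. A vectorized comparison gives the same values; here is a minimal sketch on a made-up column (the values are assumptions), which also shows that a missing value ends up as 0 either way.

```python
import numpy as np
import pandas as pd

# Made-up pool areas, including one missing value.
toy = pd.DataFrame({"PoolArea": [0, 512, 0, np.nan]})

# Row-by-row version, as used in the article.
via_apply = toy["PoolArea"].apply(lambda x: 1 if x > 0 else 0)

# Vectorized version: NaN > 0 evaluates to False, so it also becomes 0.
via_compare = (toy["PoolArea"] > 0).astype(int)

print(via_apply.tolist())
print(via_compare.tolist())
```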

Encoding the qualitative variables

all_dummy_df = pd.get_dummies(all_df)
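
To see what `pd.get_dummies` does with the qualitative columns, here is a small sketch on a made-up two-column frame (values chosen for illustration). Object columns are expanded into one indicator column per category, numeric columns pass through unchanged, and with the default `dummy_na=False` a missing category simply gets zeros in every indicator column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", np.nan],   # one missing category
    "GrLivArea":   [1710, 1262, 1786],     # numeric, left as-is
})

encoded = pd.get_dummies(toy)
print(encoded)
```

This would explain why only numeric columns still contain NaN after encoding: the missing categorical entries have already been turned into all-zero indicator rows.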

Standardizing the numeric variables

all_dummy_df.isnull().sum().sum()
6
mean_cols = all_dummy_df.mean()
all_dummy_df = all_dummy_df.fillna(mean_cols)
all_dummy_df.isnull().sum().sum()
0
X = all_dummy_df[quantitative]
std = StandardScaler()
s = std.fit_transform(X)
all_dummy_df[quantitative] = s
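# all_df was concatenated with ignore_index=True, so split by position rather than by the original index labels
n_train = len(train_df)
dummy_train_df = all_dummy_df.iloc[:n_train]
dummy_test_df = all_dummy_df.iloc[n_train:]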
y_train = np.log(train_df.SalePrice)
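
After `pd.concat(..., ignore_index=True)` the combined frame is re-indexed from 0, so the original test index no longer identifies the test rows. A small sketch with made-up frames (the names `train`, `test` and `both` are illustrative only) shows the difference between label-based and position-based splitting:

```python
import pandas as pd

train = pd.DataFrame({"x": [1, 2, 3]})   # 3 "training" rows
test = pd.DataFrame({"x": [10, 20]})     # 2 "test" rows

# Combined frame with a fresh index 0..4.
both = pd.concat((train, test), axis=0, ignore_index=True)

# Label-based lookup with the original test index (0, 1) returns TRAINING rows.
wrong_test = both.loc[test.index]

# Position-based slicing recovers the intended partition.
n_train = len(train)
train_part = both.iloc[:n_train]
test_part = both.iloc[n_train:]

print(wrong_test["x"].tolist())
print(test_part["x"].tolist())
```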

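Model prediction

Here we first fit several models and then combine their predictions in an ensemble. (Strictly speaking, the combination used below is a linear blend of the models' predictions rather than bagging.)

Ridge regression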

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
y_train.values
array([ 12.24769432,  12.10901093,  12.31716669, ...,  12.49312952,
        11.86446223,  11.90158345])
def rmse_cv(model):
    rmse= np.sqrt(-cross_val_score(model, dummy_train_df, y_train.values, scoring="neg_mean_squared_error", cv = 5))
    return(rmse)
alphas = np.logspace(-3, 2, 50)
cv_ridge = []
coefs = []
for alpha in alphas:
    model = Ridge(alpha = alpha)
    model.fit(dummy_train_df,y_train)
    cv_ridge.append(rmse_cv(model).mean())
    coefs.append(model.coef_)
import matplotlib.pyplot as plt
%matplotlib inline
cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot(title = "Validation - Just Do It")
plt.xlabel("alpha")
plt.ylabel("rmse")
# plt.plot(alphas, cv_ridge)
# plt.title("Alpha vs CV Error")
<matplotlib.text.Text at 0x118dd0ef0>

output_30_1.png
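
The curve is easier to use once the best alpha is read off programmatically. The following is a minimal sketch on synthetic data from `make_regression` (the data, the 20-point grid and the variable names are assumptions, not the article's setup): it stores the cross-validated RMSE per alpha in a Series and picks the alpha with the lowest error via `idxmin()`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the housing features.
X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)

alphas = np.logspace(-3, 2, 20)
scores = {}
for alpha in alphas:
    # cross_val_score returns negative MSE per fold; flip the sign and take the root.
    mse = -cross_val_score(Ridge(alpha=alpha), X, y,
                           scoring="neg_mean_squared_error", cv=5)
    scores[alpha] = np.sqrt(mse).mean()

cv_rmse = pd.Series(scores)
best_alpha = cv_rmse.idxmin()   # index label (the alpha) with the lowest mean RMSE
print(best_alpha, cv_rmse.min())
```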

# Ridge trace plot
# matplotlib.rcParams['figure.figsize'] = (12.0, 12.0)
ax = plt.gca()

# ax.set_color_cycle(['b', 'r', 'g', 'c', 'k', 'y', 'm'])

ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()

output_31_0.png

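The ridge trace plot is rather embarrassing: with this many features, nothing can really be read from it.

Lasso

Lasso addresses the too-many-features problem above by keeping only a subset of important features.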

from sklearn.linear_model import Lasso,LassoCV
# alphas = np.logspace(-3, 2, 50)
# alphas = [1, 0.1, 0.001, 0.0005]
alphas = np.logspace(-4, -2, 100)
cv_lasso = []
coefs = []
for alpha in alphas:
    model = Lasso(alpha = alpha,max_iter=5000)
    model.fit(dummy_train_df,y_train)
    cv_lasso.append(rmse_cv(model).mean())
    coefs.append(model.coef_)
cv_lasso = pd.Series(cv_lasso, index = alphas)
cv_lasso.plot(title = "Validation - Just Do It")
plt.xlabel("alpha")
plt.ylabel("rmse")
# plt.plot(alphas, cv_ridge)
# plt.title("Alpha vs CV Error")
<matplotlib.text.Text at 0x118bca940>

output_36_1.png

# idxmin() returns the alpha (index label) with the lowest CV error; on recent pandas, argmin() returns an integer position instead
print(cv_lasso.min(), cv_lasso.idxmin())
0.128843680722 0.000585702081806
model = Lasso(alpha = 0.00058,max_iter=5000)
model.fit(dummy_train_df,y_train)
Lasso(alpha=0.00058, copy_X=True, fit_intercept=True, max_iter=5000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
coef = pd.Series(model.coef_, index = dummy_train_df.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +  str(sum(coef == 0)) + " variables")
Lasso picked 84 variables and eliminated the other 142 variables
imp_coef = pd.concat([coef.sort_values().head(10),
                     coef.sort_values().tail(10)])
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")
<matplotlib.text.Text at 0x11aa1dbe0>

output_42_1.png
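
The feature-selection effect can be reproduced in isolation. Here is a minimal sketch on synthetic data (50 features of which only 5 are informative; the data and the alpha values are assumptions for illustration) that compares how many coefficients Lasso and Ridge set exactly to zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, only 5 of which actually drive the target (synthetic data).
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0, max_iter=5000).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# The L1 penalty drives many coefficients to exactly zero; the L2 penalty only shrinks them.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso zeroed:", n_zero_lasso, "Ridge zeroed:", n_zero_ridge)
```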

Elastic Net

Elastic Net combines the Lasso and Ridge penalties: it eases Lasso's unstable behavior on collinear features while still selecting variables well.

from sklearn.linear_model import ElasticNet,ElasticNetCV
elastic = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], 
                                    alphas=[0.001, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75], cv=5,max_iter=5000)
elastic.fit(dummy_train_df, y_train)
ElasticNetCV(alphas=[0.001, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75],
       copy_X=True, cv=5, eps=0.001, fit_intercept=True,
       l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1], max_iter=5000,
       n_alphas=100, n_jobs=1, normalize=False, positive=False,
       precompute='auto', random_state=None, selection='cyclic',
       tol=0.0001, verbose=0)
rmse_cv(elastic).mean()
0.12908591441325348
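
The `l1_ratio` values searched above control the mix of the two penalties. As a small sketch on synthetic data (the data and the alpha of 0.5 are assumptions), `l1_ratio=1.0` reduces Elastic Net to Lasso, while a smaller ratio adds an L2 term and changes the coefficients:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

# scikit-learn's Elastic Net penalty is
#   alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2
lasso = Lasso(alpha=0.5, max_iter=5000).fit(X, y)
enet_as_lasso = ElasticNet(alpha=0.5, l1_ratio=1.0, max_iter=5000).fit(X, y)
enet_mixed = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=5000).fit(X, y)

same_as_lasso = np.allclose(lasso.coef_, enet_as_lasso.coef_)
mixed_differs = not np.allclose(lasso.coef_, enet_mixed.coef_)
print(same_as_lasso, mixed_differs)
```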

Feature set 2

Embarrassingly, the features extracted this way do not give very good results, so from here on we process the features the way https://www.kaggle.com/opanichev/ensemble-of-4-models-with-cv-lb-0-11489 does.

import utils  # local helper module; its source is not included in this article
train_df_munged,label_df,test_df_munged = utils.feature_engineering()
Training set size: (1456, 111)
Test set size: (1459, 111)
Features engineering..
0:00:14.427659
test_df = pd.read_csv('../input/test.csv')
from sklearn.metrics import mean_squared_error,make_scorer
from sklearn.model_selection import cross_val_score
# Define our own scoring function (RMSE after mapping log prices back with exp)
def my_custom_loss_func(ground_truth, predictions):
    return np.sqrt(mean_squared_error(np.exp(ground_truth), np.exp(predictions)))

my_loss_func  = make_scorer(my_custom_loss_func, greater_is_better=False)
def rmse_cv2(model):
    rmse= np.sqrt(-cross_val_score(model, train_df_munged, label_df.SalePrice, scoring='neg_mean_squared_error', cv = 5))
    return(rmse)
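
The custom scorer defined above is not passed to `cross_val_score` in the cells that follow, which use `'neg_mean_squared_error'` instead. For reference, this is a minimal sketch of how such a scorer could be plugged in, using synthetic log prices (the data, the Ridge model and the names `rmse_on_price_scale` and `price_rmse` are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

def rmse_on_price_scale(y_true_log, y_pred_log):
    # Targets are log prices, so undo the log before measuring the error.
    return np.sqrt(mean_squared_error(np.exp(y_true_log), np.exp(y_pred_log)))

# greater_is_better=False makes the scorer return the NEGATED loss.
price_rmse = make_scorer(rmse_on_price_scale, greater_is_better=False)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
log_price = (12.0 + X @ np.array([0.3, -0.2, 0.1, 0.0, 0.05])
             + rng.normal(scale=0.1, size=200))

scores = cross_val_score(Ridge(alpha=1.0), X, log_price, scoring=price_rmse, cv=5)
rmse_per_fold = -scores   # flip the sign back to get a positive RMSE
print(rmse_per_fold.mean())
```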

L2: Ridge regression

from sklearn.linear_model import RidgeCV,Ridge
alphas = np.logspace(-3, 2, 100)
model_ridge = RidgeCV(alphas=alphas).fit(train_df_munged, label_df.SalePrice)
# Run prediction on training set to get a rough idea of how well it does.
pred_Y_ridge = model_ridge.predict(train_df_munged)
print("Ridge score on training set: ", model_ridge.score(train_df_munged,label_df.SalePrice))
Ridge score on training set:  0.940191172098
print("cross_validation: ",rmse_cv2(model_ridge).mean())
cross_validation:  0.111384227695

Lasso

from sklearn.linear_model import Lasso,LassoCV
model_lasso = LassoCV(eps=0.0001,max_iter=20000).fit(train_df_munged, label_df.SalePrice)
# Run prediction on training set to get a rough idea of how well it does.
pred_Y_lasso = model_lasso.predict(train_df_munged)
print("Lasso score on training set: ", model_lasso.score(train_df_munged,label_df.SalePrice))
Lasso score on training set:  0.940560493411
print("cross_validation: ",rmse_cv2(model_lasso).mean())
cross_validation:  0.11036670335

Elastic Net

from sklearn.linear_model import ElasticNet,ElasticNetCV
model_elastic = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], 
                                    alphas=[0.001, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75], cv=5,max_iter=10000)
model_elastic.fit(train_df_munged, label_df.SalePrice)
代码语言:javascript
复制
ElasticNetCV(alphas=[0.001, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75],
       copy_X=True, cv=5, eps=0.001, fit_intercept=True,
       l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1], max_iter=10000,
       n_alphas=100, n_jobs=1, normalize=False, positive=False,
       precompute='auto', random_state=None, selection='cyclic',
       tol=0.0001, verbose=0)
# Run prediction on training set to get a rough idea of how well it does.
pred_Y_elastic = model_elastic.predict(train_df_munged)
print("Elastic score on training set: ", model_elastic.score(train_df_munged,label_df.SalePrice))
Elastic score on training set:  0.940707195529
print("cross_validation: ",rmse_cv2(model_elastic).mean())
cross_validation:  0.109106832215

XGBoost

See: https://www.kaggle.com/aharless/amit-choudhary-s-kernel-notebook-ified

Note: how to tune XGBoost is still missing here.

# XGBoost -- I did some "manual" cross-validation here but should really find
# these hyperparameters using CV. ;-)

import xgboost as xgb

model_xgb = xgb.XGBRegressor(
                 colsample_bytree=0.2,
                 gamma=0.0,
                 learning_rate=0.05,
                 max_depth=6,
                 min_child_weight=1.5,
                 n_estimators=7200,                                                                  
                 reg_alpha=0.9,
                 reg_lambda=0.6,
                 subsample=0.2,
                 random_state=42,  # named seed in older xgboost releases
                 verbosity=0)      # replaces the older silent=1 flag

model_xgb.fit(train_df_munged, label_df.SalePrice)

# Run prediction on training set to get a rough idea of how well it does.
pred_Y_xgb = model_xgb.predict(train_df_munged)
print("XGBoost score on training set: ", model_xgb.score(train_df_munged,label_df.SalePrice)) # overfitting
XGBoost score on training set:  0.990853904354
print("cross_validation: ",rmse_cv2(model_xgb).mean())
cross_validation:  0.11857237109
print("score: ",mean_squared_error(model_xgb.predict(train_df_munged),label_df.SalePrice))
score:  0.0014338471114
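
Since the tuning step is left open, here is one possible way to search hyperparameters with cross-validation. This is only a sketch under stated assumptions: it uses scikit-learn's `GradientBoostingRegressor` as a stand-in so that the example runs without xgboost, synthetic data from `make_regression`, and a deliberately tiny grid. `XGBRegressor` follows the scikit-learn estimator interface, so it should be usable in `GridSearchCV` in much the same way.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the engineered housing features.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# A deliberately small grid; a real search would cover more values.
param_grid = {
    "max_depth": [2, 4],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, subsample=0.8, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)

# best_score_ is the mean negative MSE of the best setting; convert it to RMSE.
best_rmse = (-search.best_score_) ** 0.5
print(search.best_params_, best_rmse)
```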

Ensemble

from sklearn.linear_model import LinearRegression
# Create linear regression object
regr = LinearRegression()
train_x = np.concatenate(
    (pred_Y_lasso[np.newaxis, :].T,pred_Y_ridge[np.newaxis, :].T,
     pred_Y_elastic[np.newaxis, :].T,pred_Y_xgb[np.newaxis, :].T), axis=1)
regr.fit(train_x,label_df.SalePrice)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
regr.coef_
array([ 2.28665601, -0.15426296, -2.43483763,  1.30394217])
print("Ensemble score on training set: ", regr.score(train_x,label_df.SalePrice)) # overfitting
Ensemble score on training set:  0.993716162184

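Embarrassingly, the ensemble step did not help at all. Note that the blending regression is fit on the base models' predictions for the very rows they were trained on, so its training score says little about how it generalizes.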

print("score: ",mean_squared_error(regr.predict(train_x),label_df.SalePrice))
score:  0.000985126664884
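
One common alternative is to fit the blending model on out-of-fold predictions, so that each base prediction comes from a model that never saw that row. The following is a minimal sketch of that idea, not the article's method: the data are synthetic, only two base models are used, and the names `base_models`, `oof` and `meta` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

base_models = [Ridge(alpha=1.0), Lasso(alpha=0.1, max_iter=5000)]

# Each column holds one model's predictions for rows it did NOT train on.
oof = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base_models])

# The blending regression is then fit on these out-of-fold predictions.
meta = LinearRegression().fit(oof, y)
print(meta.coef_)
```

For test-time predictions, the base models would be refit on the full training set and their test predictions passed through `meta`. scikit-learn also ships `StackingRegressor`, which wraps a similar pattern.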

Submission

model_lasso.predict(test_df_munged)[np.newaxis, :].T
array([ 11.67407587,  11.95939264,  12.11110308, ...,  12.01706033,
        11.70077616,  12.29221647])
test_x = np.concatenate(
(model_lasso.predict(test_df_munged)[np.newaxis, :].T,model_ridge.predict(test_df_munged)[np.newaxis, :].T,
                           model_elastic.predict(test_df_munged)[np.newaxis, :].T, model_xgb.predict(test_df_munged)[np.newaxis, :].T)
        ,axis=1)
y_final = regr.predict(test_x)
y_final
array([ 11.83896506,  11.95544055,  12.08303061, ...,  12.02530217,
        11.71776755,  12.16714229])
submission_df = pd.DataFrame(data= {'Id' : test_df.Id, 'SalePrice': np.exp(y_final)})
submission_df.to_csv("bag-4.csv",index=False) # do not write the index
Originally published 2017.06.22 on the author's personal site/blog.
