One Kaggle problem a week, learning machine learning as we go!
Today we walk through an approach to House Prices: Advanced Regression Techniques (a house-price prediction competition):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, skew
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
%matplotlib inline
Loading the data and removing the Id column:
train_path = "http://kaggle.shikanon.com/house-prices-advanced-regression-techniques/train.csv"
test_path = "http://kaggle.shikanon.com/house-prices-advanced-regression-techniques/test.csv"
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
train_df.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
Before we analyze anything, we should understand what each field means:
train_df.columns
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
'SaleCondition', 'SalePrice'],
dtype='object')
#Saving Ids
train_ID = train_df['Id']
test_ID = test_df['Id']
#Dropping Ids
train_df.drop("Id", axis = 1, inplace = True)
test_df.drop("Id", axis = 1, inplace = True)
Common sense says the variable most related to house price is the living area, GrLivArea. Let's look at the relationship between GrLivArea and SalePrice:
fig, ax = plt.subplots()
ax.scatter(x = train_df['GrLivArea'], y = train_df['SalePrice'], c = "skyblue")
plt.ylabel('SalePrice', fontsize=8)
plt.xlabel('GrLivArea', fontsize=8)
plt.show()
We can see a few points that deviate sharply: GrLivArea has a handful of points above 4000, two of which sold for under 200000. These points are rare (well under 3% of the data), so we treat everything above 4000 as an outlier and drop it (whether dropping actually helps is something we could verify with a few more experiments). The kernel authors on Kaggle also recommend removing them :)
train_df.drop(train_df[train_df['GrLivArea'] > 4000].index, inplace=True)  # drop the GrLivArea outliers
fig, ax = plt.subplots()
ax.scatter(x = train_df['GrLivArea'], y = train_df['SalePrice'], c = "skyblue")
plt.ylabel('SalePrice', fontsize=8)
plt.xlabel('GrLivArea', fontsize=8)
plt.show()
With SalePrice we can examine its relationship to each variable, as well as the relationships among the variables themselves.
Observing the data distribution. In machine learning, understanding the data is essential: it shapes how we construct features and build models. For skewed distributions in particular, we will need to apply some transformations.
# Descriptive statistics
train_df['SalePrice'].describe()
count 1456.000000
mean 180151.233516
std 76696.592530
min 34900.000000
25% 129900.000000
50% 163000.000000
75% 214000.000000
max 625000.000000
Name: SalePrice, dtype: float64
# Plot the distribution
sns.distplot(train_df['SalePrice'],
kde_kws={"color": "coral", "lw": 1, "label": "KDE"},
hist_kws={"histtype": "stepfilled", "linewidth": 3, "alpha": 1, "color": "skyblue"});
A Q-Q plot (Quantile-Quantile plot) is a probability plot that compares the quantiles of the observed data against those of a theoretical distribution, usually the normal, to check how similar the two distributions are. A P-P plot is the analogous scatter built from the cumulative probabilities of the data against the theoretical cumulative probabilities; the two serve essentially the same purpose.
# Plot the Q-Q chart (stats.probplot draws a Q-Q plot against the normal distribution)
fig = plt.figure()
res = stats.probplot(train_df['SalePrice'], dist="norm", plot=plt)
plt.show()
The red line is the normal distribution and the blue points are our data. Both the head and the tail deviate badly from normal, so we try transforming the data; common choices are the exponential, the logarithm, and power functions.
# Log transform
train_df['SalePrice_Log'] = np.log(train_df['SalePrice'])
sns.distplot(train_df['SalePrice_Log'],
kde_kws={"color": "coral", "lw": 1, "label": "KDE"},
hist_kws={"histtype": "stepfilled", "linewidth": 3, "alpha": 1, "color": "skyblue"});
# Skewness and kurtosis
print("Skewness: %f" % train_df['SalePrice_Log'].skew())
print("Kurtosis: %f" % train_df['SalePrice_Log'].kurt())
fig = plt.figure()
res = stats.probplot(train_df['SalePrice_Log'], plot=plt)
plt.show()
Skewness: 0.065449
Kurtosis: 0.666438
# Exponential transform (divide by the mean first so exp() stays in a reasonable range)
train_df['SalePrice_Exp'] = np.exp(train_df['SalePrice']/train_df['SalePrice'].mean())
sns.distplot(train_df['SalePrice_Exp'],
kde_kws={"color": "coral", "lw": 1, "label": "KDE"},
hist_kws={"histtype": "stepfilled", "linewidth": 3, "alpha": 1, "color": "skyblue"});
# Skewness and kurtosis
print("Skewness: %f" % train_df['SalePrice_Exp'].skew())
print("Kurtosis: %f" % train_df['SalePrice_Exp'].kurt())
fig = plt.figure()
res = stats.probplot(train_df['SalePrice_Exp'], plot=plt)
plt.show()
Skewness: 6.060076
Kurtosis: 56.822460
# Power transform (square root)
train_df['SalePrice_Square'] = train_df['SalePrice']**0.5
sns.distplot(train_df['SalePrice_Square'],
kde_kws={"color": "coral", "lw": 1, "label": "KDE"},
hist_kws={"histtype": "stepfilled", "linewidth": 3, "alpha": 1, "color": "skyblue"});
# Skewness and kurtosis
print("Skewness: %f" % train_df['SalePrice_Square'].skew())
print("Kurtosis: %f" % train_df['SalePrice_Square'].kurt())
fig = plt.figure()
res = stats.probplot(train_df['SalePrice_Square'], plot=plt)
plt.show()
Skewness: 0.810797
Kurtosis: 1.245798
Comparing the three fits, the log transform matches best. But plain log(x) goes negative for x < 1, which clashes with intuition for prices, so we should use log(1+x), i.e. log1p, which stays valid for any x >= 0. There is also a numerical subtlety: when x is very small, e.g. 10^{-16}, the sum 1 + x rounds to 1 in floating point, so computing log(x+1) directly returns 0, whereas log1p evaluates it accurately.
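A quick numeric check of that precision point (x here is just an illustrative value):
x = 1e-16
print(np.log(1 + x))   # 0.0, because 1 + x rounds to exactly 1.0 in float64
print(np.log1p(x))     # 1e-16 to machine precision; log1p avoids the rounding loss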
# log1p transform
train_df['SalePrice_Log1p'] = np.log1p(train_df['SalePrice'])
sns.distplot(train_df['SalePrice_Log1p'],
kde_kws={"color": "coral", "lw": 1, "label": "KDE"},
hist_kws={"histtype": "stepfilled", "linewidth": 3, "alpha": 1, "color": "skyblue"});
# Skewness and kurtosis
print("Skewness: %f" % train_df['SalePrice_Log1p'].skew())
print("Kurtosis: %f" % train_df['SalePrice_Log1p'].kurt())
fig = plt.figure()
res = stats.probplot(train_df['SalePrice_Log1p'], plot=plt)
plt.show()
Skewness: 0.065460
Kurtosis: 0.666423
del train_df['SalePrice_Square']
del train_df["SalePrice_Exp"]
del train_df['SalePrice_Log']
del train_df["SalePrice"]
Combine the training and test data so we can do feature analysis on both at once:
size_train_df = train_df.shape[0]
size_test_df = test_df.shape[0]
target_variable = train_df['SalePrice_Log1p'].values
data = pd.concat((train_df, test_df),sort=False).reset_index(drop=True)
data.drop(['SalePrice_Log1p'], axis=1, inplace=True)
Missing values are a major part of real-world data analysis; production data always contains plenty of them, and handling them well is a key and critical step. Common treatments include: dropping the affected rows or columns, filling with a constant (such as 0 or a placeholder category), filling with the mean/median/mode, or imputing with a model.
data.count().sort_values().head(20) # count reveals which columns have missing values
PoolQC 8
MiscFeature 105
Alley 198
Fence 570
FireplaceQu 1495
LotFrontage 2429
GarageCond 2756
GarageYrBlt 2756
GarageFinish 2756
GarageQual 2756
GarageType 2758
BsmtExposure 2833
BsmtCond 2833
BsmtQual 2834
BsmtFinType2 2835
BsmtFinType1 2836
MasVnrType 2891
MasVnrArea 2892
MSZoning 2911
BsmtFullBath 2913
dtype: int64
data_na = (data.isnull().sum() / len(data)) * 100
data_na.drop(data_na[data_na==0].index,inplace=True)
data_na = data_na.sort_values(ascending=False)
f, ax = plt.subplots(figsize=(10, 10))
plt.xticks(rotation='90')
sns.barplot(x=data_na.index, y=data_na)
plt.xlabel('Features', fontsize=10)
plt.ylabel('Percent of missing values', fontsize=10)
plt.title('Percent missing data by feature', fontsize=10)
Text(0.5, 1.0, 'Percent missing data by feature')
var = 'Utilities'
train_var_count = train_df[var].value_counts()
fig = sns.barplot(x=train_var_count.index, y=train_var_count)
plt.xticks(rotation=90);
var = 'MSZoning'
train_var_count = train_df[var].value_counts()
fig = sns.barplot(x=train_var_count.index, y=train_var_count)
plt.xticks(rotation=90);
# Fill with the string 'None'
features_fill_na_none = ['PoolQC','MiscFeature','Alley','Fence','FireplaceQu',
'GarageQual','GarageCond','GarageFinish','GarageType',
'BsmtExposure','BsmtCond','BsmtQual','BsmtFinType1','BsmtFinType2',
'MasVnrType']
# Fill with 0
features_fill_na_0 = ['GarageYrBlt', 'GarageArea', 'GarageCars', 'MasVnrArea',
'BsmtFullBath','BsmtHalfBath', 'BsmtFinSF1', 'BsmtFinSF2',
'BsmtUnfSF', 'TotalBsmtSF']
# Fill with the mode
features_fill_na_mode = ["Functional", "MSZoning", "SaleType", "Electrical",
"KitchenQual", "Exterior2nd", "Exterior1st"]
for feature_none in features_fill_na_none:
    data[feature_none].fillna('None', inplace=True)
for feature_0 in features_fill_na_0:
    data[feature_0].fillna(0, inplace=True)
for feature_mode in features_fill_na_mode:
    mode_value = data[feature_mode].mode()[0]
    data[feature_mode] = data[feature_mode].fillna(mode_value)  # fill each column with its own mode, not one shared value
# Fill LotFrontage with the median of its neighborhood
data["LotFrontage"] = data.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
# A field like Utilities has only two values, one of which dominates almost completely; it carries no information and should be dropped
data.drop(['Utilities'], axis=1,inplace=True)
data_na = (data.isnull().sum() / len(data)) * 100
data_na.drop(data_na[data_na==0].index,inplace=True)
data_na = data_na.sort_values(ascending=False)
data_na # data_na is now empty
Series([], dtype: float64)
Type conversion: some fields are really categorical but stored as numbers, e.g. a survey that encodes male as 1 and female as 0. Inferring the type straight from the dataframe would get this wrong, so we convert such fields to text based on what they actually represent.
#MSSubClass=The building class
data['MSSubClass'] = data['MSSubClass'].apply(str)
#Changing OverallCond into a categorical variable
data['OverallCond'] = data['OverallCond'].astype(str)
#Year and month sold are transformed into categorical features.
data['YrSold'] = data['YrSold'].astype(str)
data['MoSold'] = data['MoSold'].astype(str)
A correlation matrix shows us at a glance which variables are related to each other and which are not.
# Correlation matrix
corrmat = train_df.corr()
corrmat
| | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice_Log1p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MSSubClass | 1.000000 | -0.408655 | -0.142192 | 0.032416 | -0.059277 | 0.027689 | 0.040459 | 0.022800 | -0.075268 | -0.065598 | ... | -0.012853 | -0.006687 | -0.011966 | -0.043802 | -0.025979 | 0.007957 | -0.007666 | -0.013512 | -0.021330 | -0.075083 |
LotFrontage | -0.408655 | 1.000000 | 0.387570 | 0.225994 | -0.055673 | 0.113913 | 0.079785 | 0.162026 | 0.133704 | 0.056971 | ... | 0.077154 | 0.116131 | 0.017010 | 0.075575 | 0.047832 | 0.075160 | 0.005636 | 0.028042 | 0.010598 | 0.363286 |
LotArea | -0.142192 | 0.387570 | 1.000000 | 0.088719 | -0.002832 | 0.006590 | 0.006930 | 0.081170 | 0.173426 | 0.114691 | ... | 0.167040 | 0.061679 | -0.016108 | 0.021505 | 0.045620 | 0.033875 | 0.039192 | 0.007188 | -0.013014 | 0.258945 |
OverallQual | 0.032416 | 0.225994 | 0.088719 | 1.000000 | -0.090692 | 0.571712 | 0.550971 | 0.400028 | 0.213079 | -0.057520 | ... | 0.232819 | 0.297803 | -0.112407 | 0.031621 | 0.067732 | 0.018121 | -0.031068 | 0.076414 | -0.024321 | 0.819240 |
OverallCond | -0.059277 | -0.055673 | -0.002832 | -0.090692 | 1.000000 | -0.375691 | 0.074703 | -0.130053 | -0.042542 | 0.040015 | ... | -0.003063 | -0.029649 | 0.070103 | 0.025419 | 0.054617 | 0.008079 | 0.068729 | -0.003135 | 0.043755 | -0.036843 |
YearBuilt | 0.027689 | 0.113913 | 0.006590 | 0.571712 | -0.375691 | 1.000000 | 0.591906 | 0.314066 | 0.248272 | -0.048393 | ... | 0.222690 | 0.183905 | -0.386904 | 0.031717 | -0.049703 | -0.014373 | -0.034193 | 0.013881 | -0.012593 | 0.588977 |
YearRemodAdd | 0.040459 | 0.079785 | 0.006930 | 0.550971 | 0.074703 | 0.591906 | 1.000000 | 0.176067 | 0.121690 | -0.067188 | ... | 0.204020 | 0.222649 | -0.193348 | 0.045596 | -0.038176 | -0.009490 | -0.010100 | 0.022629 | 0.036597 | 0.568986 |
MasVnrArea | 0.022800 | 0.162026 | 0.081170 | 0.400028 | -0.130053 | 0.314066 | 0.176067 | 1.000000 | 0.235557 | -0.071337 | ... | 0.149491 | 0.106073 | -0.109453 | 0.020270 | 0.065296 | -0.016025 | -0.029691 | 0.004019 | -0.004880 | 0.430073 |
BsmtFinSF1 | -0.075268 | 0.133704 | 0.173426 | 0.213079 | -0.042542 | 0.248272 | 0.121690 | 0.235557 | 1.000000 | -0.048738 | ... | 0.201462 | 0.071851 | -0.103053 | 0.029879 | 0.070026 | 0.016380 | 0.005149 | -0.001773 | 0.018506 | 0.382710 |
BsmtFinSF2 | -0.065598 | 0.056971 | 0.114691 | -0.057520 | 0.040015 | -0.048393 | -0.067188 | -0.071337 | -0.048738 | 1.000000 | ... | 0.069028 | 0.005083 | 0.036269 | -0.030090 | 0.088676 | 0.053178 | 0.004871 | -0.015726 | 0.031384 | 0.006420 |
BsmtUnfSF | -0.140890 | 0.142075 | -0.003774 | 0.310164 | -0.137267 | 0.148810 | 0.180972 | 0.111684 | -0.526140 | -0.209286 | ... | -0.006876 | 0.129148 | -0.002336 | 0.020843 | -0.012435 | -0.031244 | -0.023802 | 0.035456 | -0.040834 | 0.223248 |
TotalBsmtSF | -0.255441 | 0.313392 | 0.221940 | 0.532666 | -0.176000 | 0.399867 | 0.294866 | 0.337869 | 0.460324 | 0.116478 | ... | 0.229984 | 0.215559 | -0.095872 | 0.041762 | 0.094511 | 0.004418 | -0.018253 | 0.030026 | -0.012192 | 0.641553 |
1stFlrSF | -0.265001 | 0.398400 | 0.267644 | 0.462042 | -0.145613 | 0.279929 | 0.238304 | 0.316507 | 0.386453 | 0.106134 | ... | 0.230865 | 0.179049 | -0.063074 | 0.060553 | 0.097093 | 0.032088 | -0.020801 | 0.045081 | -0.010014 | 0.613742 |
2ndFlrSF | 0.311294 | 0.049902 | 0.037277 | 0.279745 | 0.031297 | 0.002953 | 0.136103 | 0.156796 | -0.183358 | -0.098241 | ... | 0.083670 | 0.198407 | 0.065690 | -0.023740 | 0.043308 | 0.038375 | 0.017111 | 0.039163 | -0.024874 | 0.306605 |
LowQualFinSF | 0.046499 | 0.042589 | 0.005675 | -0.029826 | 0.025406 | -0.183720 | -0.062215 | -0.069533 | -0.066611 | 0.014714 | ... | -0.025114 | 0.019339 | 0.060975 | -0.004334 | 0.026713 | 0.073070 | -0.003822 | -0.022441 | -0.029074 | -0.037698 |
GrLivArea | 0.077956 | 0.341427 | 0.231887 | 0.583519 | -0.078567 | 0.192645 | 0.289264 | 0.363472 | 0.121479 | -0.004995 | ... | 0.241827 | 0.307325 | 0.016148 | 0.023967 | 0.112408 | 0.064346 | -0.000974 | 0.065328 | -0.031898 | 0.718844 |
BsmtFullBath | 0.003282 | 0.074686 | 0.147595 | 0.104092 | -0.053107 | 0.185009 | 0.116765 | 0.080342 | 0.661933 | 0.160254 | ... | 0.174636 | 0.056251 | -0.049034 | 0.000249 | 0.024075 | 0.037039 | -0.022877 | -0.023770 | 0.067665 | 0.238851 |
BsmtHalfBath | -0.002509 | -0.009424 | 0.047391 | -0.047172 | 0.117207 | -0.039945 | -0.013297 | 0.012157 | 0.068869 | 0.071986 | ... | 0.034626 | -0.024385 | -0.007803 | 0.035565 | 0.032902 | 0.027886 | -0.007211 | 0.038478 | -0.045303 | -0.014974 |
FullBath | 0.132131 | 0.187048 | 0.117336 | 0.543791 | -0.194167 | 0.466710 | 0.438212 | 0.266317 | 0.037159 | -0.075291 | ... | 0.182131 | 0.252911 | -0.113812 | 0.036304 | -0.006557 | 0.021509 | -0.013872 | 0.058197 | -0.016574 | 0.590919 |
HalfBath | 0.177476 | 0.037031 | 0.005981 | 0.267431 | -0.059927 | 0.240144 | 0.181136 | 0.195264 | -0.014508 | -0.031243 | ... | 0.104538 | 0.194921 | -0.094317 | -0.004590 | 0.073497 | 0.001010 | 0.001589 | -0.007127 | -0.008853 | 0.311191 |
BedroomAbvGr | -0.023627 | 0.269945 | 0.118960 | 0.096848 | 0.013249 | -0.072623 | -0.041919 | 0.099176 | -0.121893 | -0.015134 | ... | 0.044039 | 0.093803 | 0.042402 | -0.024263 | 0.044941 | 0.064118 | 0.007965 | 0.048477 | -0.034849 | 0.204117 |
KitchenAbvGr | 0.281783 | -0.002959 | -0.016565 | -0.184281 | -0.087204 | -0.174481 | -0.149288 | -0.036558 | -0.082722 | -0.040926 | ... | -0.089670 | -0.069738 | 0.037113 | -0.024670 | -0.051779 | -0.012306 | 0.062294 | 0.026340 | 0.031454 | -0.147891 |
TotRmsAbvGrd | 0.040247 | 0.330249 | 0.173629 | 0.415834 | -0.055766 | 0.089207 | 0.187520 | 0.265334 | 0.001877 | -0.033490 | ... | 0.159720 | 0.219969 | 0.006790 | -0.005908 | 0.061924 | 0.041588 | 0.025640 | 0.041966 | -0.032190 | 0.533446 |
Fireplaces | -0.046377 | 0.238140 | 0.259701 | 0.387425 | -0.022277 | 0.143162 | 0.108732 | 0.236906 | 0.236219 | 0.049027 | ... | 0.194972 | 0.160647 | -0.022885 | 0.012042 | 0.187656 | 0.051221 | 0.001943 | 0.053947 | -0.022567 | 0.487126 |
GarageYrBlt | 0.084926 | 0.058055 | -0.033111 | 0.547320 | -0.323836 | 0.825192 | 0.641445 | 0.249917 | 0.146532 | -0.087353 | ... | 0.222580 | 0.224239 | -0.296517 | 0.023897 | -0.074779 | -0.035670 | -0.032231 | 0.006725 | -0.000030 | 0.544005 |
GarageCars | -0.040490 | 0.289064 | 0.150977 | 0.598739 | -0.185494 | 0.536749 | 0.419573 | 0.362122 | 0.224043 | -0.037331 | ... | 0.223010 | 0.209762 | -0.150590 | 0.036290 | 0.051622 | 0.003360 | -0.042886 | 0.041608 | -0.037179 | 0.680408 |
GarageArea | -0.100145 | 0.318820 | 0.162183 | 0.554905 | -0.150679 | 0.477311 | 0.369590 | 0.361810 | 0.268651 | -0.016485 | ... | 0.219967 | 0.228089 | -0.120615 | 0.036213 | 0.053732 | 0.011637 | -0.027088 | 0.034602 | -0.025870 | 0.655212 |
WoodDeckSF | -0.012853 | 0.077154 | 0.167040 | 0.232819 | -0.003063 | 0.222690 | 0.204020 | 0.149491 | 0.201462 | 0.069028 | ... | 1.000000 | 0.053498 | -0.125151 | -0.032472 | -0.073489 | 0.068309 | -0.009287 | 0.024595 | 0.023860 | 0.330573 |
OpenPorchSF | -0.006687 | 0.116131 | 0.061679 | 0.297803 | -0.029649 | 0.183905 | 0.222649 | 0.106073 | 0.071851 | 0.005083 | ... | 0.053498 | 1.000000 | -0.092094 | -0.005148 | 0.077261 | 0.030360 | -0.018277 | 0.072515 | -0.056326 | 0.327038 |
EnclosedPorch | -0.011966 | 0.017010 | -0.016108 | -0.112407 | 0.070103 | -0.386904 | -0.193348 | -0.109453 | -0.103053 | 0.036269 | ... | -0.125151 | -0.092094 | 1.000000 | -0.037427 | -0.083154 | 0.068796 | 0.018277 | -0.029565 | -0.010343 | -0.148636 |
3SsnPorch | -0.043802 | 0.075575 | 0.021505 | 0.031621 | 0.025419 | 0.031717 | 0.045596 | 0.020270 | 0.029879 | -0.030090 | ... | -0.032472 | -0.005148 | -0.037427 | 1.000000 | -0.031526 | -0.006770 | 0.000326 | 0.029386 | 0.018516 | 0.056065 |
ScreenPorch | -0.025979 | 0.047832 | 0.045620 | 0.067732 | 0.054617 | -0.049703 | -0.038176 | 0.065296 | 0.070026 | 0.088676 | ... | -0.073489 | 0.077261 | -0.083154 | -0.031526 | 1.000000 | 0.063724 | 0.031884 | 0.022863 | 0.010383 | 0.123860 |
PoolArea | 0.007957 | 0.075160 | 0.033875 | 0.018121 | 0.008079 | -0.014373 | -0.009490 | -0.016025 | 0.016380 | 0.053178 | ... | 0.068309 | 0.030360 | 0.068796 | -0.006770 | 0.063724 | 1.000000 | 0.035480 | -0.022901 | -0.062640 | 0.040679 |
MiscVal | -0.007666 | 0.005636 | 0.039192 | -0.031068 | 0.068729 | -0.034193 | -0.010100 | -0.029691 | 0.005149 | 0.004871 | ... | -0.009287 | -0.018277 | 0.018277 | 0.000326 | 0.031884 | 0.035480 | 1.000000 | -0.006657 | 0.004806 | -0.019752 |
MoSold | -0.013512 | 0.028042 | 0.007188 | 0.076414 | -0.003135 | 0.013881 | 0.022629 | 0.004019 | -0.001773 | -0.015726 | ... | 0.024595 | 0.072515 | -0.029565 | 0.029386 | 0.022863 | -0.022901 | -0.006657 | 1.000000 | -0.146229 | 0.062231 |
YrSold | -0.021330 | 0.010598 | -0.013014 | -0.024321 | 0.043755 | -0.012593 | 0.036597 | -0.004880 | 0.018506 | 0.031384 | ... | 0.023860 | -0.056326 | -0.010343 | 0.018516 | 0.010383 | -0.062640 | 0.004806 | -0.146229 | 1.000000 | -0.034319 |
SalePrice_Log1p | -0.075083 | 0.363286 | 0.258945 | 0.819240 | -0.036843 | 0.588977 | 0.568986 | 0.430073 | 0.382710 | 0.006420 | ... | 0.330573 | 0.327038 | -0.148636 | 0.056065 | 0.123860 | 0.040679 | -0.019752 | 0.062231 | -0.034319 | 1.000000 |
37 rows × 37 columns
mask = np.zeros_like(corrmat) # zeros array of the same shape as the correlation matrix
mask[np.triu_indices_from(mask)] = True # triu_indices_from gives the indices of the upper triangle
mask
array([[1., 1., 1., ..., 1., 1., 1.],
[0., 1., 1., ..., 1., 1., 1.],
[0., 0., 1., ..., 1., 1., 1.],
...,
[0., 0., 0., ..., 1., 1., 1.],
[0., 0., 0., ..., 0., 1., 1.],
[0., 0., 0., ..., 0., 0., 1.]])
# Plot the heatmap
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, mask=mask, linewidths=.5, vmax=0.9, square=True, cmap="YlGnBu")
<matplotlib.axes._subplots.AxesSubplot at 0x7f8c20d156d8>
Feature transformations:
For categorical data, we usually apply a LabelEncoder to turn each category into an integer, or use one-hot encoding to produce a sparse matrix.
For numerical data, we try to bring the distribution as close to normal as possible.
encode_cat_variables = ('Alley', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual', 'CentralAir',
'Condition1', 'Condition2', 'Electrical', 'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd', 'Fence',
'FireplaceQu', 'Foundation', 'Functional', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType',
'Heating', 'HeatingQC', 'HouseStyle', 'KitchenQual', 'LandContour', 'LandSlope', 'LotConfig', 'LotShape',
'MSSubClass', 'MSZoning', 'MasVnrType', 'MiscFeature', 'MoSold', 'Neighborhood', 'OverallCond', 'PavedDrive',
'PoolQC', 'RoofMatl', 'RoofStyle', 'SaleCondition', 'SaleType', 'Street', 'YrSold')
numerical_features = [col for col in data.columns if col not in encode_cat_variables]
# for variable in encode_cat_variables:
# lbl = LabelEncoder()
# lbl.fit(list(data[variable].values))
# data[variable] = lbl.transform(list(data[variable].values))
for variable in data.columns:
    if variable not in encode_cat_variables:
        data[variable] = data[variable].apply(float)
    else:
        data[variable] = data[variable].apply(str)
print(data.shape)
data = pd.get_dummies(data)
print(data.shape)
(2915, 78)
(2915, 343)
# We can also build a total floor area feature
data['TotalSF'] = data['TotalBsmtSF'] + data['1stFlrSF'] + data['2ndFlrSF']
print("Categorical Features: %d"%len(encode_cat_variables))
print("Numerical Features: %d"%len(numerical_features))
Categorical Features: 46
Numerical Features: 32
Distribution of the numerical variables
#Boxplot for numerical_features
sns.set_style("whitegrid")
f, ax = plt.subplots(figsize=(15, 15))
ax.set_xscale("log")
ax = sns.boxplot(data=data[numerical_features] , orient="h", palette="ch:2.5,-.2,dark=.3")
ax.set(ylabel="Features")
ax.set(xlabel="Value")
ax.set(title="Distribution")
sns.despine(trim=True, left=True)
Box-Cox transform. The Box-Cox transform is a family of generalized power transforms proposed by Box and Cox in 1964 for continuous response variables that do not follow a normal distribution. Applying it can reduce, to some extent, the unobservable error and its correlation with the predictors. Its defining feature is a single parameter, estimated from the data itself, that determines which transform to apply.
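Concretely, the shifted variant used below, scipy's boxcox1p, computes y(\lambda) = \frac{(1+x)^{\lambda} - 1}{\lambda} for \lambda \neq 0 and \log(1+x) for \lambda = 0, so \lambda = 0 reduces exactly to the log1p transform we chose for the target. A quick check:
print(boxcox1p(9.0, 0.0))   # 2.3026, identical to np.log1p(9.0)
print(boxcox1p(9.0, 0.15))  # 2.7503, using the lambda applied below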
# Compute the skewness of each numerical feature
skewed_features = data[numerical_features].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewed_features
MiscVal 21.932147
PoolArea 18.701829
LotArea 13.123758
LowQualFinSF 12.080315
3SsnPorch 11.368094
KitchenAbvGr 4.298845
BsmtFinSF2 4.142863
EnclosedPorch 4.000796
ScreenPorch 3.943508
BsmtHalfBath 3.942892
MasVnrArea 2.600697
OpenPorchSF 2.529245
WoodDeckSF 1.848285
1stFlrSF 1.253011
LotFrontage 1.092709
GrLivArea 0.977860
BsmtFinSF1 0.974138
BsmtUnfSF 0.920135
2ndFlrSF 0.843237
TotRmsAbvGrd 0.749579
Fireplaces 0.725958
HalfBath 0.698770
TotalBsmtSF 0.662657
BsmtFullBath 0.622820
BedroomAbvGr 0.328129
GarageArea 0.217748
OverallQual 0.181902
FullBath 0.159917
GarageCars -0.219402
YearRemodAdd -0.449113
YearBuilt -0.598087
GarageYrBlt -3.903046
dtype: float64
skewed_features = skewed_features[abs(skewed_features) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewed_features.shape[0]))
skewed_features_name = skewed_features.index
lam = 0.15 # hyperparameter; alternatively boxcox_normmax could estimate a per-feature lambda
for feat in skewed_features_name:
    data[feat] = boxcox1p(data[feat], lam)  # boxcox1p was already imported at the top
data[numerical_features].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
There are 20 skewed numerical features to Box Cox transform
PoolArea 16.487849
3SsnPorch 8.918476
LowQualFinSF 8.737917
MiscVal 5.592866
BsmtHalfBath 3.798589
KitchenAbvGr 3.695780
ScreenPorch 2.975708
BsmtFinSF2 2.561989
EnclosedPorch 2.023182
TotRmsAbvGrd 0.749579
Fireplaces 0.725958
HalfBath 0.698770
TotalBsmtSF 0.662657
MasVnrArea 0.636713
BsmtFullBath 0.622820
2ndFlrSF 0.329999
BedroomAbvGr 0.328129
WoodDeckSF 0.225571
GarageArea 0.217748
OverallQual 0.181902
LotArea 0.178523
1stFlrSF 0.173668
FullBath 0.159917
GrLivArea 0.102287
OpenPorchSF 0.100679
GarageCars -0.219402
YearRemodAdd -0.449113
BsmtFinSF1 -0.488447
YearBuilt -0.598087
LotFrontage -0.808022
BsmtUnfSF -1.536727
GarageYrBlt -3.922152
dtype: float64
#Boxplot for numerical_features
sns.set_style("whitegrid")
f, ax = plt.subplots(figsize=(15, 15))
ax.set_xscale("log")
ax = sns.boxplot(data=data[numerical_features] , orient="h", palette="ch:2.5,-.2,dark=.3")
ax.set(ylabel="Features")
ax.set(xlabel="Value")
ax.set(title="Distribution")
sns.despine(trim=True, left=True)
With feature processing done, we split the data back apart:
train = data[:size_train_df]
test = data[size_train_df:]
Build the models. We try several commonly used algorithms and set up cross-validation:
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
Define a cross-validation scoring function:
#Validation function
n_folds = 5
def rmsle_cv(model):
    # pass the KFold object itself so that shuffle and random_state actually take effect
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train.values, target_variable, scoring="neg_mean_squared_error", cv=kf))
    return rmse
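Because the target is already log1p(SalePrice), the RMSE computed here on the log scale is effectively the competition metric, the root mean squared logarithmic error: \mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(1+\hat{y}_i) - \log(1+y_i)\right)^2}.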
LASSO Regression
Lasso penalizes the L1 norm of the weights, i.e. the sum of their absolute values.
Note: when the data contain many outliers, scaling by mean and variance may not be a good choice; robust_scale or RobustScaler can be used instead.
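For reference, sklearn's Lasso minimizes \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_1, so the alpha below directly scales the L1 penalty: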
lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Lasso score: 0.1101 (0.0058)
Kernel Ridge Regression. Ridge penalizes the L2 norm, i.e. the sum of squares; the kernelized version used here combines that penalty with a polynomial kernel.
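For reference, ridge regression minimizes \lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_2^2, and sklearn's polynomial kernel is k(x, x') = (\gamma\langle x, x'\rangle + c_0)^d; below, d = 2 and c_0 = 2.5: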
KRR = make_pipeline(RobustScaler(), KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5))
score = rmsle_cv(KRR)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Lasso score: 0.1152 (0.0043)
Elastic Net Regression. ElasticNet is a linear regression model whose regularizer uses both L1 and L2 penalties; it combines ridge and Lasso through a weighted mix of the two. See https://www.jianshu.com/p/65a573b9fe32
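For reference, sklearn's ElasticNet minimizes \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\rho\lVert w\rVert_1 + \frac{\alpha(1-\rho)}{2}\lVert w\rVert_2^2, where \rho is the l1_ratio; with l1_ratio = 0.9 below, the penalty is mostly Lasso-like: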
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
score = rmsle_cv(ENet)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Lasso score: 0.1100 (0.0059)
Ensemble models. The ensembles in common use include gradient boosting (Gradient Boosting Regression), XGBoost, and LightGBM. Gradient boosting first:
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
max_depth=4, max_features='sqrt',
min_samples_leaf=15, min_samples_split=10,
loss='huber', random_state=5)
score = rmsle_cv(GBoost)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Lasso score: 0.1164 (0.0082)
XGBoost
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
learning_rate=0.05, max_depth=3,
min_child_weight=1.7817, n_estimators=2200,
reg_alpha=0.4640, reg_lambda=0.8571,
subsample=0.5213, silent=1,
random_state =7, nthread = -1)
score = rmsle_cv(model_xgb)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Lasso score: 0.1172 (0.0053)
LightGBM
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
learning_rate=0.05, n_estimators=720,
max_bin = 55, bagging_fraction = 0.8,
bagging_freq = 5, feature_fraction = 0.2319,
feature_fraction_seed=9, bagging_seed=9,
min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
score = rmsle_cv(model_lgb)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Lasso score: 0.1162 (0.0060)
Finding good hyperparameters. Let's visualize how the validation score varies as we sweep a hyperparameter:
alphas = [0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01]
cv_lasso_score = [rmsle_cv(make_pipeline(RobustScaler(), Lasso(alpha=alpha, random_state=1))).mean()
                  for alpha in alphas]
cv_lasso = pd.Series(cv_lasso_score, index=alphas)
cv_lasso.plot(title="Validation - Just Do It")
plt.xlabel("alpha")
plt.ylabel("rmse")
Text(0, 0.5, 'rmse')
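An alternative to the manual sweep is sklearn's grid search. A minimal sketch over the same pipeline (the step name 'lasso' is generated by make_pipeline from the class name):
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(RobustScaler(), Lasso(random_state=1))
grid = GridSearchCV(pipe, param_grid={'lasso__alpha': alphas},
                    scoring='neg_mean_squared_error', cv=5)
grid.fit(train.values, target_variable)
print(grid.best_params_, np.sqrt(-grid.best_score_))  # best alpha and its RMSE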
Comparing the predictions of a few base models:
train_size = int(len(train)*0.7)
X_train = train.values[:train_size]
X_test = train.values[train_size:]
y_train = target_variable[:train_size]
y_test = target_variable[train_size:]
GBoost.fit(X_train, y_train)
ENet.fit(X_train, y_train)
Pipeline(memory=None,
steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
with_scaling=True)), ('elasticnet', ElasticNet(alpha=0.0005, copy_X=True, fit_intercept=True, l1_ratio=0.9,
max_iter=1000, normalize=False, positive=False, precompute=False,
random_state=3, selection='cyclic', tol=0.0001, warm_start=False))])
Residual plots
preds = pd.DataFrame({"preds":GBoost.predict(X_test), "true":y_test})
preds["residuals"] = preds["true"] - preds["preds"]
preds.plot(x = "preds", y = "residuals",kind = "scatter")
<matplotlib.axes._subplots.AxesSubplot at 0x7f8c1a72aef0>
preds = pd.DataFrame({"preds":ENet.predict(X_test), "true":y_test})
preds["residuals"] = preds["true"] - preds["preds"]
preds.plot(x = "preds", y = "residuals",kind = "scatter")
<matplotlib.axes._subplots.AxesSubplot at 0x7f8c1a6ed128>
gboost_preds = np.expm1(GBoost.predict(X_test))
enet_preds = np.expm1(ENet.predict(X_test))
predictions = pd.DataFrame({"gboost": gboost_preds, "enet": enet_preds})
predictions.plot(x="gboost", y="enet", kind="scatter")
<matplotlib.axes._subplots.AxesSubplot at 0x7f8c00ba9d30>
Stacking models means combining several models to obtain predictions with lower variance. The two residual plots above show a degree of complementarity, which is exactly the point of blending: the combined residuals become more evenly distributed. First, a simple averaging model:
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)
        return self

    # Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)
averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))
score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Averaged base models score: 0.1081 (0.0056)
Stacked model (Stacking Averaged Models). It has two parts: a set of base models and a meta regressor. The base models produce out-of-fold predictions over the training data, and the meta-model is then trained on those predictions against the true target. Concretely:
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)  # clone the meta-model
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        # Train the cloned base models and turn their out-of-fold predictions into the features
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):  # split into training and holdout folds
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
        # Use the base models' out-of-fold predictions as features to train the meta-model
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_])
        return self.meta_model_.predict(meta_features)
from sklearn.linear_model import LinearRegression
meta_model = LinearRegression()
stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR, lasso),
meta_model = meta_model,
n_folds=10)
score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
Stacking Averaged models score: 0.1086 (0.0060)
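To finish, we can fit the stacked model on the full training set and write out a submission. A minimal sketch (the output filename is arbitrary); remember to invert the log1p transform with expm1:
stacked_averaged_models.fit(train.values, target_variable)
final_predictions = np.expm1(stacked_averaged_models.predict(test.values))  # back to the price scale
submission = pd.DataFrame({'Id': test_ID, 'SalePrice': final_predictions})
submission.to_csv('submission.csv', index=False)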
References:
https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
https://www.kaggle.com/jolasa/eda-anda-data-preparation-7th-place
https://www.kaggle.com/apapiu/regularized-linear-models