Machine Learning Starter Datasets -- 2. Boston Housing Prices

birdskyws
Published 2019-03-04 10:21:16

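sklearn ships a small housing price dataset with 13 features. The dataset used in this article has 79 features. Two models are applied to it, linear regression and random forest, and two evaluation metrics are used, R2 and MSE. The experiments show that the random forest performs better. One reason is that a random forest is a bagging model and resists overfitting better; another is that with this many housing features, decision-tree models can achieve better results.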

Data overview

The Boston housing price dataset can be loaded from sklearn in an already preprocessed form.

import numpy as np
# load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston

np.set_printoptions(suppress=True)  # print floats without scientific notation
boston = load_boston()

print("data shape:{}".format(boston.data.shape))
print("target shape:{}".format(boston.target.shape))
print("line head 5:\n{}".format(boston.data[:5]))
print("target head 5:\n{}".format(boston.target[:5]))

Output:

data shape:(506, 13)
target shape:(506,)
line head 5:
[[  0.00632  18.        2.31      0.        0.538     6.575    65.2
    4.09      1.      296.       15.3     396.9       4.98   ]
 [  0.02731   0.        7.07      0.        0.469     6.421    78.9
    4.9671    2.      242.       17.8     396.9       9.14   ]
 [  0.02729   0.        7.07      0.        0.469     7.185    61.1
    4.9671    2.      242.       17.8     392.83      4.03   ]
 [  0.03237   0.        2.18      0.        0.458     6.998    45.8
    6.0622    3.      222.       18.7     394.63      2.94   ]
 [  0.06905   0.        2.18      0.        0.458     7.147    54.2
    6.0622    3.      222.       18.7     396.9       5.33   ]]
target head 5:
[24.  21.6 34.7 33.4 36.2]

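This data can be handled with any simple model; see the following article for reference. https://www.jianshu.com/p/f828eae005a1?utm_campaign=haruki&utm_content=note&utm_medium=reader_share&utm_source=weixin_timeline&from=timeline There is also a second dataset, in CSV format, with 80 columns. The rest of this article works with that data.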

Boston housing price dataset

Data preprocessing

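Loading the data
import pandas as pd

# read_csv loads a CSV file; index_col=0 marks the first column as the Id column
train_df = pd.read_csv("/Users/wangsen/ai/03/9day_discuz/firstDiscuz/02_houseprice/data/train.csv",index_col=0)
test_df = pd.read_csv("/Users/wangsen/ai/03/9day_discuz/firstDiscuz/02_houseprice/data/test.csv",index_col=0)
print(train_df.info())
## print(train_df.describe().T)
# value_counts counts each value; unique lists the distinct values
print(train_df['MSSubClass'].value_counts())
print(train_df['MSSubClass'].unique())

# Assumption: train_target and train_len are used later but not defined in the
# listing; popping SalePrice here is consistent with the (2919, 79) shape reported below
train_target = train_df.pop('SalePrice')
train_len = len(train_df)

# Preprocess the training and test sets together
all_df = pd.concat((train_df,test_df),axis=0)
print(all_df.shape)
print(all_df['MSSubClass'].value_counts())
print(all_df['MSSubClass'].unique())
print(pd.get_dummies(all_df['MSSubClass']))

# Show the first five raw values next to their one-hot encoding
print(pd.concat((all_df['MSSubClass'][:5],pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass')[:5]),axis=1).T)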
  • df.info() shows how many columns there are, plus the non-null count and dtype of each column
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(34), object(43)
memory usage: 923.9+ KB
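  • pd.get_dummies applies dummy encoding (also called one-hot encoding) to discrete features. Because the pandas encoder has no separate fit and transform steps, the training and test sets have to be concatenated before encoding. Taking the first column, MSSubClass, as an example, you can first inspect the value distribution with unique() or value_counts(). pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass') encodes a single column. In the output, the feature values are sorted first: the smallest value, 20, becomes [1,0,0,...], and the second smallest, 30, becomes [0,1,0,...]. The result looks like this: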
Id               1   2   3   4   5
MSSubClass      60  20  60  70  60
MSSubClass_20    0   1   0   0   0
MSSubClass_30    0   0   0   0   0
MSSubClass_40    0   0   0   0   0
MSSubClass_45    0   0   0   0   0
MSSubClass_50    0   0   0   0   0
MSSubClass_60    1   0   1   0   1
MSSubClass_70    0   0   0   1   0
MSSubClass_75    0   0   0   0   0
MSSubClass_80    0   0   0   0   0
MSSubClass_85    0   0   0   0   0
MSSubClass_90    0   0   0   0   0
MSSubClass_120   0   0   0   0   0
MSSubClass_150   0   0   0   0   0
MSSubClass_160   0   0   0   0   0
MSSubClass_180   0   0   0   0   0
MSSubClass_190   0   0   0   0   0
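
The reason for concatenating before encoding can be seen on a toy example. This is a minimal sketch, not the article's data: the two small frames and their values are made up for illustration, and only the column name MSSubClass is borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy frames; the values are made up for illustration.
train = pd.DataFrame({"MSSubClass": [60, 20, 60]}, index=[1, 2, 3])
test = pd.DataFrame({"MSSubClass": [20, 70]}, index=[4, 5])

# Encoding each frame on its own yields different column sets,
# because each frame only knows the categories it contains.
train_only = pd.get_dummies(train["MSSubClass"], prefix="MSSubClass")
test_only = pd.get_dummies(test["MSSubClass"], prefix="MSSubClass")

# Encoding the concatenated frame gives both parts the same columns.
both = pd.concat((train, test), axis=0)
encoded = pd.get_dummies(both["MSSubClass"], prefix="MSSubClass")
train_enc = encoded.iloc[:len(train)]
test_enc = encoded.iloc[len(train):]

print(list(train_only.columns))
print(list(test_only.columns))
print(list(encoded.columns))
```

After splitting the encoded frame back by position, the training part carries an all-zero MSSubClass_70 column, so a model fitted on it can accept the test part without a column mismatch.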
  • Checking for missing values
all_dummy_df = pd.get_dummies(all_df)
# print(all_dummy_df.head())
print(all_dummy_df.shape)
print(all_dummy_df.isnull().sum().sort_values(ascending=False))
Output:
all_df shape:(2919, 79)
LotFrontage              486
GarageYrBlt              159
MasVnrArea                23
BsmtFullBath               2
BsmtHalfBath               2
BsmtFinSF1                 1
BsmtFinSF2                 1
BsmtUnfSF                  1
TotalBsmtSF                1
GarageArea                 1
GarageCars                 1
Condition1_RRNe            0
Condition1_RRNn            0
  • Filling missing values: fill with the column mean
mean_cols = all_dummy_df.mean()
all_dummy_df = all_dummy_df.fillna(mean_cols)
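
To make the effect of the mean fill concrete, here is a small sketch on a made-up frame. The column names are borrowed from the dataset, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "GarageCars": [2.0, 1.0, np.nan],
})

# Count missing values per column, largest first.
print(df.isnull().sum().sort_values(ascending=False))

# df.mean() skips NaN by default, so each column mean is taken
# over the observed values only.
mean_cols = df.mean()

# fillna with a Series fills each column with its own mean.
filled = df.fillna(mean_cols)
print(filled)
```

Note that in the article the means are computed over the training and test rows together. Computing them on the training rows only is a common alternative when the test set should stay completely unseen.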
  • Model training
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Rows of the encoded frame that belong to the training set
dummy_train_df = all_dummy_df[:train_len]

lr = LinearRegression()
lr.fit(dummy_train_df, train_target)  # the model must be fitted before score() can be called
print("Training set score:{}".format(lr.score(dummy_train_df,train_target)))

# The output below lists 20 scores, so 20 folds are assumed here
# (the original listing had n_splits=5)
kf = KFold(n_splits=20, shuffle=True)
score_ndarray = cross_val_score(lr, dummy_train_df, train_target, cv=kf)
print(score_ndarray)
print(score_ndarray.mean())

Output:

Training set score:0.9332679645484127
[ 0.88669507 -1.54529853  0.90133954  0.84720817  0.86750469  0.92840145
  0.8299786   0.91205312  0.92400129  0.91065317  0.55149449  0.87645062
  0.48737113  0.82570995  0.91949504  0.8890254   0.79646233  0.94457746
  0.65656125  0.91573777]
0.7162711000424071
score

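The score reported by LinearRegression is R^2. The model reaches 0.93 on the training set, but the cross-validation scores average only about 0.72, with one fold as low as -1.55, which indicates that the model is overfitting.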

R2

See the R2 source code: github
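
For reference, R^2 is the coefficient of determination, 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the targets from their mean. The sketch below computes it by hand in plain Python; the function name and the toy numbers are illustrative, not taken from sklearn's source.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2_score_manual(y_true, [3.0, 5.0, 7.0, 9.0])    # exact predictions
mean_only = r2_score_manual(y_true, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
reversed_ = r2_score_manual(y_true, [9.0, 7.0, 5.0, 3.0])  # worse than the mean

print(perfect, mean_only, reversed_)  # 1.0 0.0 -3.0
```

A perfect model scores 1, a model that always predicts the mean scores 0, and a model worse than that scores below 0. This is how a single cross-validation fold can produce a negative score.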

cross_val_score: cross-validation error

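Because R^2 does not directly express the magnitude of the error, we compare the MSE of two models, linear regression and random forest. The code reports the square root of the MSE (RMSE), which is in the same units as the sale price.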

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True)

# Cross-validated RMSE for linear regression
lr = LinearRegression()
score_ndarray = np.sqrt(-cross_val_score(lr, dummy_train_df, train_target, cv=kf,scoring="neg_mean_squared_error"))
print(score_ndarray.mean())

# Cross-validated RMSE for the random forest
clf = RandomForestRegressor(n_estimators=200, max_features=3)
score_ndarray = np.sqrt(-cross_val_score(clf, dummy_train_df, train_target, cv=kf,scoring="neg_mean_squared_error"))
print(score_ndarray.mean())

# RMSE on the training set itself, after fitting on all training rows
clf.fit(dummy_train_df,train_target)
train_predict = clf.predict(dummy_train_df)
print("Random forest error:",np.sqrt(mean_squared_error(train_target,train_predict)))

lr.fit(dummy_train_df,train_target)
train_predict = lr.predict(dummy_train_df)
print("Linear regression error:",np.sqrt(mean_squared_error(train_target,train_predict)))

Output (the two training-set errors):

Random forest error: 13134.86059929929
Linear regression error: 20514.990603536615
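
The comparison above can be packaged into a small helper so that both models are scored on identical folds. This is a sketch on synthetic data from make_regression, not on the housing data; the helper name cv_rmse and all parameter values are assumptions chosen for illustration. Since the synthetic targets are linear in the features, the ranking it prints says nothing about the housing dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: not the housing data).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# A fixed random_state makes both models see the same folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

def cv_rmse(model):
    """Return the per-fold RMSE of `model` under the shared KFold splitter."""
    # scoring="neg_mean_squared_error" returns negated MSE, so flip the sign.
    neg_mse = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
    return np.sqrt(-neg_mse)

lr_rmse = cv_rmse(LinearRegression())
rf_rmse = cv_rmse(RandomForestRegressor(n_estimators=50, max_features=3, random_state=0))

print("linear regression CV RMSE:", lr_rmse.mean())
print("random forest CV RMSE:", rf_rmse.mean())
```

Because every fold is held out from fitting, these figures estimate error on unseen data, unlike an RMSE computed on the rows the model was trained on.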

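Summary

The random forest model gives better results than the linear regression model: its training-set RMSE is about 13,135, against about 20,515 for linear regression. Both figures are measured on the training data, so the cross-validated RMSE printed by the same script is the fairer basis for comparison.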

This article is part of the Tencent Cloud self-media sharing program and is shared from the author's personal site/blog.
Originally published: 2019.02.10
