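Supervised machine learning problems fall mainly into two categories: classification and regression. The previous article walked through classification with decision trees and random forests. This time we predict house prices and put regression models in R into practice.
The competition we use this time is at: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The competition provides about 80 features for nearly 1,500 houses that have already been sold and asks us to predict each house's sale price from them. The dataset has a lot of fields: besides basics such as location, area, and number of floors, it also covers things like the basement, street frontage, and exterior wall material, which buyers in China would hardly ever consider. With housing prices in China as wild as they are, location and area alone are basically enough to estimate a price.
Before building any model, let's first get familiar with the missing values and the distributions in the data.
First download the training and test data into the directory D:/RData/House/, then combine the two. SalePrice is the house price field we need to predict.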
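# Read the training and test datasets
# stringsAsFactors = TRUE keeps text columns as factors (the default became FALSE in R 4.0); the str() output and the factor handling below rely on it
train <- read.csv("D:/RData/House/train.csv", stringsAsFactors = TRUE)
test <- read.csv("D:/RData/House/test.csv", stringsAsFactors = TRUE)
# Combine the training and test sets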
test$SalePrice <- NA
all <- rbind(train, test)
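As a minimal sketch of why the SalePrice column is added to the test set before combining, the example below uses two tiny made-up data frames (toy_train and toy_test are hypothetical stand-ins, not the Kaggle files). rbind() on data frames needs matching column names, and factor levels from both sets are merged.

```r
# Toy stand-ins for train.csv / test.csv (values are made up for illustration)
toy_train <- data.frame(Id = 1:3,
                        MSZoning = factor(c("RL", "RM", "RL")),
                        SalePrice = c(208500, 181500, 223500))
toy_test <- data.frame(Id = 4:5,
                       MSZoning = factor(c("FV", "RL")))

# rbind() on data frames needs the same columns, so add the target as NA first
toy_test$SalePrice <- NA
toy_all <- rbind(toy_train, toy_test)

dim(toy_all)                   # rows of both sets, one shared set of columns
levels(toy_all$MSZoning)       # levels seen in either set
sum(is.na(toy_all$SalePrice))  # equals the number of test rows
```

Because the target is NA exactly on the test rows, the combined frame can be split back apart later without keeping a separate marker column.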
First, take a look at each variable. There are a lot of them; detailed descriptions of the variables are provided in the attachment.
str(all)
Result:
'data.frame': 2919 obs. of 81 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
$ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
$ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
$ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
$ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
$ Alley: Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
$ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
$ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Utilities: Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
$ LotConfig: Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
$ LandSlope: Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
$ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
$ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
$ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
$ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
$ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
$ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
$ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
$ YearBuilt: int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
$ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
$ RoofStyle: Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
$ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
$ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
$ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
$ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
$ ExterQual: Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
$ ExterCond: Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
$ BsmtQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
$ BsmtCond : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
$ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
$ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
$ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
$ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
$ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
$ BsmtUnfSF: int 150 284 434 540 490 64 317 216 952 140 ...
$ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
$ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
$ HeatingQC: Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
$ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
$ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
$ X1stFlrSF: int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
$ X2ndFlrSF: int 854 0 866 756 1053 566 0 983 752 0 ...
$ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
$ GrLivArea: int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
$ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
$ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
$ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
$ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
$ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
$ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
$ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
$ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
$ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
$ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
$ FireplaceQu : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
$ GarageType : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
$ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
$ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
$ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
$ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
$ GarageQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
$ GarageCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
$ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
$ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
$ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
$ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
$ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
$ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolQC : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
$ Fence: Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
$ MiscFeature : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
$ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
$ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
$ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
$ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
$ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
$ SalePrice: int 208500 181500 223500 140000 250000 143000 307000 200000 1299
The variables fall into two main types: numeric and factor.
# Count the variables of each class
res <- sapply(all, class )
table(res)
Result:
factor integer
43 38
Overall, the dataset has 81 variables and 2919 records: 43 factor variables and 38 numeric variables.
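Since factor and numeric variables are treated differently later on, it can help to keep their names in two vectors. The following is a small sketch on a made-up data frame; the names toy, factor_vars and numeric_vars are illustrative and do not appear in the original code.

```r
# Made-up three-column frame that mimics the two variable types in the dataset
toy <- data.frame(LotArea = c(8450L, 9600L),
                  Street = factor(c("Pave", "Grvl")),
                  OverallQual = c(7L, 6L))

# Logical masks per column, then pick the matching column names
factor_vars  <- names(toy)[sapply(toy, is.factor)]
numeric_vars <- names(toy)[sapply(toy, is.numeric)]

factor_vars
numeric_vars
```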
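The variable values above show that many variables in the dataset have missing values, so the first step is to deal with them.
First, sort the variables by how many values each one is missing.
# Count the missing values in every variable
res <- sapply(all, function(x) sum(is.na(x)) )
# Sort by missing count, descending
miss <- sort(res, decreasing=TRUE)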
miss[miss>0]
Result (only variables with missing values are listed, annotated by hand; SalePrice, which is NA for all 1459 test rows, is left out):
# Variable     Missing  Ratio  Meaning
PoolQC         2909     100%   # pool quality
MiscFeature    2814     96%    # miscellaneous features
Alley          2721     93%    # alley access to the property
Fence          2348     80%    # fence around the house
FireplaceQu    1420     49%    # fireplace quality
LotFrontage    486      17%    # linear feet of street connected to the lot
GarageYrBlt    159      5%     # garage-related
GarageFinish   159      5%
GarageQual     159      5%
GarageCond     159      5%
GarageType     157      5%
BsmtCond       82       3%     # basement-related
BsmtExposure   82       3%
BsmtQual       81       3%
BsmtFinType2   80       3%
BsmtFinType1   79       3%
MasVnrType     24       1%     # masonry veneer
MasVnrArea     23       1%
MSZoning       4        0%     # others
Utilities      2        0%
BsmtFullBath   2        0%
BsmtHalfBath   2        0%
Functional     2        0%
Exterior1st    1        0%
Exterior2nd    1        0%
BsmtFinSF1     1        0%
BsmtFinSF2     1        0%
BsmtUnfSF      1        0%
TotalBsmtSF    1        0%
Electrical     1        0%
KitchenQual    1        0%
GarageCars     1        0%
GarageArea     1        0%
SaleType       1        0%
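The listing above includes a missing ratio that the sapply() call on its own does not produce. One way to get both columns at once might look like the sketch below; it runs on a made-up data frame, and the names toy and miss_tbl are illustrative.

```r
# Made-up frame with different amounts of missing data per column
toy <- data.frame(
  PoolQC      = factor(c(NA, NA, NA, "Ex")),
  LotFrontage = c(65, NA, 68, 60),
  LotArea     = c(8450, 9600, 11250, 9550)
)

na_count <- sapply(toy, function(x) sum(is.na(x)))  # missing values per column
na_ratio <- colMeans(is.na(toy))                    # share of rows that are NA

miss_tbl <- data.frame(variable = names(na_count),
                       missing  = as.integer(na_count),
                       percent  = round(100 * na_ratio, 1),
                       row.names = NULL)

# Keep only columns that have gaps, largest count first
miss_tbl <- miss_tbl[miss_tbl$missing > 0, ]
miss_tbl <- miss_tbl[order(-miss_tbl$missing), ]
miss_tbl
```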
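Next, look at a summary of the variables that have missing values. SalePrice shows up here as well, because its 1459 NAs are the test rows.
# Summarize the variables with missing data
summary(all[,names(miss)[miss>0]])
Result: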
PoolQC MiscFeature Alley Fence SalePrice FireplaceQu
Ex : 4 Gar2: 5 Grvl: 120 GdPrv: 118 Min. : 34900 Ex : 43
Fa : 2 Othr: 4 Pave: 78 GdWo : 112 1st Qu.:129975 Fa : 74
Gd : 4 Shed: 95 NA's:2721 MnPrv: 329 Median :163000 Gd : 744
NA's:2909 TenC: 1 MnWw : 12 Mean :180921 Po : 46
NA's:2814 NA's :2348 3rd Qu.:214000 TA : 592
Max. :755000 NA's:1420
NA's :1459
LotFrontage GarageYrBlt GarageFinish GarageQual GarageCond
Min. : 21.00 Min. :1895 Fin : 719 Ex : 3 Ex : 3
1st Qu.: 59.00 1st Qu.:1960 RFn : 811 Fa : 124 Fa : 74
Median : 68.00 Median :1979 Unf :1230 Gd : 24 Gd : 15
Mean : 69.31 Mean :1978 NA's: 159 Po : 5 Po : 14
3rd Qu.: 80.00 3rd Qu.:2002 TA :2604 TA :2654
Max. :313.00 Max. :2207 NA's: 159 NA's: 159
NA's :486 NA's :159
GarageType BsmtCond BsmtExposure BsmtQual BsmtFinType2 BsmtFinType1
2Types : 23 Fa : 104 Av : 418 Ex : 258 ALQ : 52 ALQ :429
Attchd :1723 Gd : 122 Gd : 276 Fa : 88 BLQ : 68 BLQ :269
Basment: 36 Po : 5 Mn : 239 Gd :1209 GLQ : 34 GLQ :849
BuiltIn: 186 TA :2606 No :1904 TA :1283 LwQ : 87 LwQ :154
CarPort: 15 NA's: 82 NA's: 82 NA's: 81 Rec : 105 Rec :288
Detchd : 779 Unf :2493 Unf :851
NA's : 157 NA's: 80 NA's: 79
MasVnrType MasVnrArea MSZoning Utilities BsmtFullBath
BrkCmn : 25 Min. : 0.0 C (all): 25 AllPub:2916 Min. :0.0000
BrkFace: 879 1st Qu.: 0.0 FV : 139 NoSeWa: 1 1st Qu.:0.0000
None :1742 Median : 0.0 RH : 26 NA's : 2 Median :0.0000
Stone : 249 Mean : 102.2 RL :2265 Mean :0.4299
NA's : 24 3rd Qu.: 164.0 RM : 460 3rd Qu.:1.0000
Max. :1600.0 NA's : 4 Max. :3.0000
NA's :23 NA's :2
BsmtHalfBath Functional Exterior1st Exterior2nd BsmtFinSF1
Min. :0.00000 Typ :2717 VinylSd:1025 VinylSd:1014 Min. : 0.0
1st Qu.:0.00000 Min2 : 70 MetalSd: 450 MetalSd: 447 1st Qu.: 0.0
Median :0.00000 Min1 : 65 HdBoard: 442 HdBoard: 406 Median : 368.5
Mean :0.06136 Mod : 35 Wd Sdng: 411 Wd Sdng: 391 Mean : 441.4
3rd Qu.:0.00000 Maj1 : 19 Plywood: 221 Plywood: 270 3rd Qu.: 733.0
Max. :2.00000 (Other): 11 (Other): 369 (Other): 390 Max. :5644.0
NA's :2 NA's : 2 NA's : 1 NA's : 1 NA's :1
BsmtFinSF2 BsmtUnfSF TotalBsmtSF Electrical KitchenQual
Min. : 0.00 Min. : 0.0 Min. : 0.0 FuseA: 188 Ex : 205
1st Qu.: 0.00 1st Qu.: 220.0 1st Qu.: 793.0 FuseF: 50 Fa : 70
Median : 0.00 Median : 467.0 Median : 989.5 FuseP: 8 Gd :1151
Mean : 49.58 Mean : 560.8 Mean :1051.8 Mix : 1 TA :1492
3rd Qu.: 0.00 3rd Qu.: 805.5 3rd Qu.:1302.0 SBrkr:2671 NA's: 1
Max. :1526.00 Max. :2336.0 Max. :6110.0 NA's : 1
NA's :1 NA's :1 NA's :1
GarageCars GarageArea SaleType
Min. :0.000 Min. : 0.0 WD :2525
1st Qu.:1.000 1st Qu.: 320.0 New : 239
Median :2.000 Median : 480.0 COD : 87
Mean :1.767 Mean : 472.9 ConLD : 26
3rd Qu.:2.000 3rd Qu.: 576.0 CWD : 12
Max. :5.000 Max. :1488.0 (Other): 29
NA's :1 NA's :1 NA's : 1
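Many variables have missing data. The handling falls into the following cases:
Drop variables with a large share of missing values
PoolQC, MiscFeature, Alley, Fence, and FireplaceQu are missing most often because the house simply has no pool, special feature, alley access, fence, or fireplace. Since so many values are missing, we remove these variables outright.
# Drop the following variables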
Drop <- names(all) %in% c("PoolQC","MiscFeature","Alley","Fence","FireplaceQu")
all <- all[!Drop]
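The five variables above were picked by name. An alternative is to select them by missing ratio, sketched below on a made-up data frame. The 0.4 threshold is an assumption chosen for illustration, not a value from the article. On the real combined data the target SalePrice is NA for about half the rows (the test set), so it would have to be excluded from such a rule.

```r
# Made-up frame: two sparse factor columns and one complete numeric column
toy <- data.frame(
  PoolQC  = factor(c(NA, NA, NA, NA, "Ex")),
  Fence   = factor(c(NA, "GdPrv", NA, NA, "MnPrv")),
  LotArea = c(8450, 9600, 11250, 9550, 14260)
)

threshold <- 0.4                          # assumed cut-off, tune as needed
na_ratio  <- colMeans(is.na(toy))         # share of missing rows per column
drop_vars <- names(na_ratio)[na_ratio > threshold]

# drop = FALSE keeps a data frame even if a single column survives
toy_kept <- toy[, !(names(toy) %in% drop_vars), drop = FALSE]
names(toy_kept)
```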
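Treat NA as a new factor level
The variable description file shows that the five garage variables GarageType, GarageYrBlt, GarageFinish, GarageQual, and GarageCond are likewise missing because the house has no garage.
Similarly, the five variables BsmtExposure, BsmtFinType2, BsmtQual, BsmtCond, and BsmtFinType1 describe the basement and are missing because the house has no basement.
Only a small share of values is missing for these variables, so we replace the missing values with None directly.
# Fill NA in the following variables with None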
Garage <- c("GarageType","GarageQual","GarageCond","GarageFinish")
Bsmt <- c("BsmtExposure","BsmtFinType2","BsmtQual","BsmtCond","BsmtFinType1")
for (x in c(Garage, Bsmt) )
{
all[[x]] <- factor( all[[x]], levels= c(levels(all[[x]]),c('None')))
all[[x]][is.na(all[[x]])] <- "None"
}
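The loop first extends the levels and only then assigns "None". The sketch below shows why that order matters: assigning a value that is not yet a level leaves the entry as NA (with a warning). The helper fill_na_level is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: add `label` as a level if needed, then fill NAs with it
fill_na_level <- function(x, label = "None") {
  stopifnot(is.factor(x))
  if (!(label %in% levels(x))) levels(x) <- c(levels(x), label)
  x[is.na(x)] <- label
  x
}

garage <- factor(c("Attchd", NA, "Detchd"))

# Naive assignment: "None" is not a level yet, so the entry stays NA
naive <- garage
suppressWarnings(naive[is.na(naive)] <- "None")
anyNA(naive)

# Extending the levels first makes the assignment stick
filled <- fill_na_level(garage)
as.character(filled)
levels(filled)
```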
GarageYrBlt is the year the garage was built; we substitute the year the house was built.
# Handle the garage year separately
all$GarageYrBlt[is.na(all$GarageYrBlt)] <- all$YearBuilt[is.na(all$GarageYrBlt)]
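Fill in missing values by hand
For the remaining variables we inspect the data one by one and handle each as follows.
Variable LotFrontage: linear feet of street connected to the lot
This is a numeric variable, so we fill it with the median.
# Fill with the median
all$LotFrontage[is.na(all$LotFrontage)] <- median(all$LotFrontage, na.rm = TRUE)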
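A short sketch of why the median is a reasonable fill value here: a single extreme value, such as the maximum of 313 in the summary above, pulls the mean upward but leaves the median alone. The vector below is made up for illustration.

```r
# Made-up frontage values with two gaps and one extreme value
lot_frontage <- c(65, 80, NA, 60, 313, NA, 51)

med <- median(lot_frontage, na.rm = TRUE)  # middle of the observed values
avg <- mean(lot_frontage, na.rm = TRUE)    # dragged up by the extreme value

# Fill the gaps with the median
filled <- lot_frontage
filled[is.na(filled)] <- med

med
avg
filled
```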
Variable MasVnrType: masonry veneer type
This variable probably has little effect on price; NAs in MasVnrType are replaced with its existing None level.
# Fill with None
all[["MasVnrType"]][is.na(all[["MasVnrType"]])] <- "None"
Variable MasVnrArea: masonry veneer area
These missing values go with a MasVnrType of None, so NA should be replaced with 0.
# Fill with 0
all[["MasVnrArea"]][is.na(all[["MasVnrArea"]])] <- 0
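The missing counts listed earlier differ by one (24 for MasVnrType, 23 for MasVnrArea), which suggests a row where the type is missing but an area was recorded. A check like the sketch below could locate such rows before filling; it uses a made-up data frame, and how to treat a row it finds is left open.

```r
# Made-up rows: row 2 lacks both fields, row 3 lacks only the type
toy <- data.frame(
  MasVnrType = factor(c("BrkFace", NA, NA, "None")),
  MasVnrArea = c(196, NA, 198, 0)
)

# Rows where the type is missing although an area was recorded
inconsistent <- which(is.na(toy$MasVnrType) & !is.na(toy$MasVnrArea))
inconsistent
toy[inconsistent, ]
```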
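Variable Utilities has no discriminating power (all but one of its non-missing values are AllPub), so we drop it.
# Drop the variable Utilities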
all$Utilities <- NULL
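Variables BsmtFullBath, BsmtHalfBath, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, GarageCars, and GarageArea are missing because the corresponding facility does not exist. They are all numeric, so filling them with 0 is enough.
# The facility is absent, so the quantity is missing; fill with 0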
Param0 <- c("BsmtFullBath","BsmtHalfBath","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","GarageCars","GarageArea")
for (x in Param0 ) all[[x]][is.na(all[[x]])] <- 0
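Variables MSZoning, Functional, Exterior1st, Exterior2nd, KitchenQual, Electrical, and SaleType are all factors with only a few missing values, so we fill them with the most frequent level.
# Fill with the most frequent level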
Req <- c("MSZoning","Functional","Exterior1st","Exterior2nd","KitchenQual","Electrical","SaleType")
for (x in Req ) all[[x]][is.na(all[[x]])] <- levels(all[[x]])[which.max(table(all[[x]]))]
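The one-line loop above packs several steps together. Unrolled into a small helper, the same idea might look like the sketch below; fill_mode is a hypothetical name, and the input vector is made up.

```r
# Hypothetical helper: replace NAs in a factor with its most frequent level
fill_mode <- function(x) {
  stopifnot(is.factor(x))
  counts <- table(x)                      # table() ignores NA by default
  mode_level <- names(counts)[which.max(counts)]  # ties: first level wins
  x[is.na(x)] <- mode_level
  x
}

kitchen <- factor(c("TA", "Gd", NA, "TA", "Ex"))
filled <- fill_mode(kitchen)
as.character(filled)
```

Because the fill value is already one of the factor's levels, no new level has to be added, unlike the "None" case earlier.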
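Build the training set
After all this imputation, 75 variables remain and, apart from SalePrice in the test rows, none has missing data. We split the dataset into training and test sets by whether SalePrice is NA, ready for model training later.
# Use whether SalePrice is NA to separate the training and test sets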
train <- all[!is.na(all$SalePrice), ]
test <- all[is.na(all$SalePrice), ]
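Before splitting, it may be worth asserting that the predictors really are complete. The sketch below does this on a made-up data frame; toy_all and the other names are illustrative, and dropping the all-NA target from the test half is an optional extra step not taken in the code above.

```r
# Made-up combined frame: the last two rows play the role of the test set
toy_all <- data.frame(Id = 1:4,
                      LotArea = c(8450, 9600, 11250, 9550),
                      SalePrice = c(208500, 181500, NA, NA))

# Every column except the target should be free of NA at this point
predictors <- setdiff(names(toy_all), "SalePrice")
stopifnot(sum(is.na(toy_all[predictors])) == 0)

# Split by whether the target is known
toy_train <- toy_all[!is.na(toy_all$SalePrice), ]
toy_test  <- toy_all[is.na(toy_all$SalePrice), ]
toy_test$SalePrice <- NULL   # optional: the column is all NA in the test half

nrow(toy_train)
nrow(toy_test)
```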
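The central problem in linear regression is choosing the independent variables. Selecting features that correlate strongly with the response we want to predict is the first step toward a successful model. There are many ways to select variables; the most important and also the most direct is for the analyst to pick them by hand based on the business context. We try this approach first, as the first step of our model.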
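As a rough illustration of what hand-picked variables look like in R's formula interface, the sketch below fits lm() on a tiny made-up data frame. The two predictors are an assumption chosen for illustration, not the selection used later in the series, and the target is constructed without noise so the fitted coefficients simply recover the numbers used to build it.

```r
# Made-up data; SalePrice is built from a known linear rule without noise
toy <- data.frame(GrLivArea   = c(1000, 1200, 1500, 1800, 2000, 2400),
                  OverallQual = c(5, 6, 5, 7, 8, 7))
toy$SalePrice <- 20000 + 80 * toy$GrLivArea + 10000 * toy$OverallQual

# Hand-picked predictors go on the right-hand side of the formula
fit <- lm(SalePrice ~ GrLivArea + OverallQual, data = toy)
coef(fit)

# Predict the price of a new, hypothetical house
new_house <- data.frame(GrLivArea = 1600, OverallQual = 6)
pred <- unname(predict(fit, newdata = new_house))
pred
```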
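Continued in the next post, "Kaggle in Practice - House Prices: Advanced Regression Techniques (2)"
Original-content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
If there is any infringement, please contact cloudcommunity@tencent.com for removal.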