# Kaggle in Practice: House Prices: Advanced Regression Techniques (Part 1)

## Getting to Know the Data

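``````
# Read the training and test sets
# (assumes the Kaggle files train.csv and test.csv are in the working directory)
train <- read.csv("train.csv", stringsAsFactors = TRUE)
test  <- read.csv("test.csv", stringsAsFactors = TRUE)

# Combine the training and test sets so they are preprocessed together
test$SalePrice <- NA
all <- rbind(train, test)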
``````
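
The combining step can be tried on a small scale first. The sketch below uses two made-up data frames (`train_toy` and `test_toy` are hypothetical stand-ins, not the Kaggle files) to show why the placeholder `SalePrice` column is added before `rbind()`.

```r
# Toy stand-ins for the Kaggle tables (hypothetical values).
train_toy <- data.frame(Id = 1:3,
                        LotArea = c(8450, 9600, 11250),
                        SalePrice = c(208500, 181500, 223500))
test_toy <- data.frame(Id = 4:5,
                       LotArea = c(11622, 14267))

# rbind() on data frames needs matching column names,
# so the test rows get a placeholder target first.
test_toy$SalePrice <- NA
all_toy <- rbind(train_toy, test_toy)

nrow(all_toy)                  # total number of rows
sum(is.na(all_toy$SalePrice))  # one NA per test row
```

Because only the test rows carry `NA` in `SalePrice`, that column can later be used to separate the two sets again.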

``````
str(all)
``````

``````
'data.frame':   2919 obs. of  81 variables:
 $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
 $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
 $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
 $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
 $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
 $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
 $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
 $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
 $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
 $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
 $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
 $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
 $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
 $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
 $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
 $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
 $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
 $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
 $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
 $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
 $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
 $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
 $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
 $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
 $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
 $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
 $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
 $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
 $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
 $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
 $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
 $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
 $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
 $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
 $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
 $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
 $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
 $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
 $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
 $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
 $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
 $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
 $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
 $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
 $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
 $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
 $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
 $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
 $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
 $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
 $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
 $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
 $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
 $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
 $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
 $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
 $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
 $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
 $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
 $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
 $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
 $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
 $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
 $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
 $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
 $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
 $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
 $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 1299
``````

``````
# Count the variables of each class (factor vs. integer)
res <- sapply(all, class )
table(res)
``````

``````
 factor integer
     43      38
``````

## Feature Processing

``````
# Count the missing values in every variable
res <- sapply(all, function(x) sum(is.na(x)))

# Sort by number of missing values, largest first
miss <- sort(res, decreasing = TRUE)
miss[miss > 0]
``````
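
The listing of missing values reports percentages as well as counts, while the code above only produces counts. A minimal sketch of how both could be computed together, using a small made-up data frame (`toy` and its values are assumptions for illustration):

```r
# Toy data frame with some missing entries (hypothetical values).
toy <- data.frame(
  PoolQC  = c(NA, NA, NA, "Ex"),
  LotArea = c(8450, 9600, 11250, 9550),
  Fence   = c("GdPrv", NA, NA, "MnPrv")
)

# Missing count per column, and the same as a rounded percentage of rows.
miss_n   <- sapply(toy, function(x) sum(is.na(x)))
miss_pct <- round(100 * miss_n / nrow(toy))

# One row per variable, sorted by missing count, largest first.
miss_tab <- data.frame(missing = miss_n, pct = miss_pct)
miss_tab <- miss_tab[order(-miss_tab$missing), ]

# Show only the variables that actually have missing values.
miss_tab[miss_tab$missing > 0, ]
```

On the real data the same lines would be applied to `all` instead of `toy`.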

``````
# Variable      Missing  Share  Meaning
PoolQC           2909    100%   # pool quality
MiscFeature      2814     96%   # miscellaneous features
Alley            2721     93%   # type of alley access
Fence            2348     80%   # fence quality
FireplaceQu      1420     49%   # fireplace quality

LotFrontage       486     17%   # linear feet of street connected to the property

GarageYrBlt       159      5%   # garage
GarageFinish      159      5%
GarageQual        159      5%
GarageCond        159      5%
GarageType        157      5%

BsmtCond           82      3%   # basement
BsmtExposure       82      3%
BsmtQual           81      3%
BsmtFinType2       80      3%
BsmtFinType1       79      3%

MasVnrType         24      1%   # masonry veneer
MasVnrArea         23      1%

MSZoning            4      0%   # others
Utilities           2      0%
BsmtFullBath        2      0%
BsmtHalfBath        2      0%
Functional          2      0%
Exterior1st         1      0%
Exterior2nd         1      0%
BsmtFinSF1          1      0%
BsmtFinSF2          1      0%
BsmtUnfSF           1      0%
TotalBsmtSF         1      0%
Electrical          1      0%
KitchenQual         1      0%
GarageCars          1      0%
GarageArea          1      0%
SaleType            1      0%

``````
# Summarize the variables that have missing values
summary(all[, names(miss)[miss > 0]])
``````

``````
PoolQC      MiscFeature Alley       Fence        SalePrice        FireplaceQu
Ex  :   4   Gar2:   5   Grvl: 120   GdPrv: 118   Min.   : 34900   Ex  :  43
Fa  :   2   Othr:   4   Pave:  78   GdWo : 112   1st Qu.:129975   Fa  :  74
Gd  :   4   Shed:  95   NA's:2721   MnPrv: 329   Median :163000   Gd  : 744
NA's:2909   TenC:   1               MnWw :  12   Mean   :180921   Po  :  46
            NA's:2814               NA's :2348   3rd Qu.:214000   TA  : 592
                                                 Max.   :755000   NA's:1420
                                                 NA's   :1459

LotFrontage      GarageYrBlt   GarageFinish GarageQual  GarageCond
Min.   : 21.00   Min.   :1895   Fin : 719    Ex  :   3   Ex  :   3
1st Qu.: 59.00   1st Qu.:1960   RFn : 811    Fa  : 124   Fa  :  74
Median : 68.00   Median :1979   Unf :1230    Gd  :  24   Gd  :  15
Mean   : 69.31   Mean   :1978   NA's: 159    Po  :   5   Po  :  14
3rd Qu.: 80.00   3rd Qu.:2002                TA  :2604   TA  :2654
Max.   :313.00   Max.   :2207                NA's: 159   NA's: 159
NA's   :486      NA's   :159

GarageType   BsmtCond    BsmtExposure BsmtQual    BsmtFinType2 BsmtFinType1
2Types :  23   Fa  : 104   Av  : 418    Ex  : 258   ALQ :  52    ALQ :429
Attchd :1723   Gd  : 122   Gd  : 276    Fa  :  88   BLQ :  68    BLQ :269
Basment:  36   Po  :   5   Mn  : 239    Gd  :1209   GLQ :  34    GLQ :849
BuiltIn: 186   TA  :2606   No  :1904    TA  :1283   LwQ :  87    LwQ :154
CarPort:  15   NA's:  82   NA's:  82    NA's:  81   Rec : 105    Rec :288
Detchd : 779                                        Unf :2493    Unf :851
NA's   : 157                                        NA's:  80    NA's: 79

MasVnrType     MasVnrArea        MSZoning     Utilities     BsmtFullBath
BrkCmn :  25   Min.   :   0.0   C (all):  25   AllPub:2916   Min.   :0.0000
BrkFace: 879   1st Qu.:   0.0   FV     : 139   NoSeWa:   1   1st Qu.:0.0000
None   :1742   Median :   0.0   RH     :  26   NA's  :   2   Median :0.0000
Stone  : 249   Mean   : 102.2   RL     :2265                 Mean   :0.4299
NA's   :  24   3rd Qu.: 164.0   RM     : 460                 3rd Qu.:1.0000
               Max.   :1600.0   NA's   :   4                 Max.   :3.0000
               NA's   :23                                    NA's   :2

BsmtHalfBath       Functional    Exterior1st    Exterior2nd     BsmtFinSF1
Min.   :0.00000   Typ    :2717   VinylSd:1025   VinylSd:1014   Min.   :   0.0
1st Qu.:0.00000   Min2   :  70   MetalSd: 450   MetalSd: 447   1st Qu.:   0.0
Median :0.00000   Min1   :  65   HdBoard: 442   HdBoard: 406   Median : 368.5
Mean   :0.06136   Mod    :  35   Wd Sdng: 411   Wd Sdng: 391   Mean   : 441.4
3rd Qu.:0.00000   Maj1   :  19   Plywood: 221   Plywood: 270   3rd Qu.: 733.0
Max.   :2.00000   (Other):  11   (Other): 369   (Other): 390   Max.   :5644.0
NA's   :2         NA's   :   2   NA's   :   1   NA's   :   1   NA's   :1

BsmtFinSF2        BsmtUnfSF       TotalBsmtSF     Electrical   KitchenQual
Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   FuseA: 188   Ex  : 205
1st Qu.:   0.00   1st Qu.: 220.0   1st Qu.: 793.0   FuseF:  50   Fa  :  70
Median :   0.00   Median : 467.0   Median : 989.5   FuseP:   8   Gd  :1151
Mean   :  49.58   Mean   : 560.8   Mean   :1051.8   Mix  :   1   TA  :1492
3rd Qu.:   0.00   3rd Qu.: 805.5   3rd Qu.:1302.0   SBrkr:2671   NA's:   1
Max.   :1526.00   Max.   :2336.0   Max.   :6110.0   NA's :   1
NA's   :1         NA's   :1        NA's   :1

GarageCars      GarageArea        SaleType
Min.   :0.000   Min.   :   0.0   WD     :2525
1st Qu.:1.000   1st Qu.: 320.0   New    : 239
Median :2.000   Median : 480.0   COD    :  87
Mean   :1.767   Mean   : 472.9   ConLD  :  26
3rd Qu.:2.000   3rd Qu.: 576.0   CWD    :  12
Max.   :5.000   Max.   :1488.0   (Other):  29
NA's   :1       NA's   :1        NA's   :   1
``````

``````
# Drop the following variables (the five with the most missing values)
Drop <- names(all) %in% c("PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu")
all <- all[!Drop]
``````

``````
# Fill NA in the following variables with the level "None"
Garage <- c("GarageType", "GarageQual", "GarageCond", "GarageFinish")
Bsmt <- c("BsmtExposure", "BsmtFinType2", "BsmtQual", "BsmtCond", "BsmtFinType1")
for (x in c(Garage, Bsmt)) {
  # "None" has to exist as a level before it can be assigned
  all[[x]] <- factor(all[[x]], levels = c(levels(all[[x]]), "None"))
  all[[x]][is.na(all[[x]])] <- "None"
}
``````
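
The loop above rebuilds each factor with an extra level before filling. A small sketch of why that step matters, on a made-up factor (`q` is a hypothetical example, not a column of the dataset):

```r
# Toy quality factor with two missing entries (hypothetical values).
q <- factor(c("TA", NA, "Gd", NA))

# Assigning a string that is not an existing level does not work:
# R issues an "invalid factor level" warning and leaves NA in place.
bad <- q
suppressWarnings(bad[is.na(bad)] <- "None")
sum(is.na(bad))    # the NAs are still there

# Adding the level first makes the assignment succeed.
good <- q
levels(good) <- c(levels(good), "None")
good[is.na(good)] <- "None"
table(good)
```

Appending to `levels()` and calling `factor(..., levels = ...)` as in the loop have the same effect here; either way the new level is placed last.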

``````
# Handle the garage year separately: fall back to the year the house was built
all$GarageYrBlt[is.na(all$GarageYrBlt)] <- all$YearBuilt[is.na(all$GarageYrBlt)]
``````

``````
# Fill with the median
all$LotFrontage[is.na(all$LotFrontage)] <- median(all$LotFrontage, na.rm = TRUE)
``````

``````
# Fill with "None" (already an existing level of MasVnrType)
all[["MasVnrType"]][is.na(all[["MasVnrType"]])] <- "None"
``````

``````
# Fill with 0
all[["MasVnrArea"]][is.na(all[["MasVnrArea"]])] <- 0
``````

``````
# Drop the variable Utilities
all$Utilities <- NULL
``````

``````
# These counts and areas are missing because the facility itself is absent; fill with 0
Param0 <- c("BsmtFullBath", "BsmtHalfBath", "BsmtFinSF1", "BsmtFinSF2",
            "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "GarageArea")
for (x in Param0) all[[x]][is.na(all[[x]])] <- 0
``````

``````
# Fill with the most frequent level
Req <- c("MSZoning", "Functional", "Exterior1st", "Exterior2nd",
         "KitchenQual", "Electrical", "SaleType")
for (x in Req) all[[x]][is.na(all[[x]])] <- levels(all[[x]])[which.max(table(all[[x]]))]
``````
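
The most-frequent-level fill can also be written as a small helper, which makes it easier to check on toy input. A sketch, assuming a hypothetical function name `fill_mode` and made-up zoning values:

```r
# Replace NA in a factor with its most frequent level.
# fill_mode is a hypothetical helper name, not part of the original code.
fill_mode <- function(f) {
  counts <- table(f)  # table() leaves NA out by default
  f[is.na(f)] <- names(counts)[which.max(counts)]
  f
}

# Toy zoning factor with two missing entries (hypothetical values).
zoning <- factor(c("RL", "RM", NA, "RL", "FV", NA))
filled <- fill_mode(zoning)
table(filled)
```

If two levels are tied for most frequent, `which.max()` returns the first of them in level order, so the result then depends on how the levels are sorted.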

``````
# Split back into training and test sets by whether SalePrice is NA
train <- all[!is.na(all$SalePrice), ]
test <- all[is.na(all$SalePrice), ]
``````
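
After the split it may be worth confirming that nothing was lost and that no predictor still contains `NA`. The sketch below runs such checks on a toy frame (`all_toy` and its values are hypothetical). On the real data, the `summary()` output implies 1459 test rows and therefore 1460 training rows out of 2919.

```r
# Toy combined frame: the last two rows play the role of the test set.
all_toy <- data.frame(Id = 1:5,
                      LotArea = c(8450, 9600, 11250, 11622, 14267),
                      SalePrice = c(208500, 181500, 223500, NA, NA))

train_toy <- all_toy[!is.na(all_toy$SalePrice), ]
test_toy  <- all_toy[is.na(all_toy$SalePrice), ]

# Every row lands in exactly one of the two sets.
stopifnot(nrow(train_toy) + nrow(test_toy) == nrow(all_toy))

# No predictor should still contain NA after imputation.
predictors <- setdiff(names(all_toy), "SalePrice")
stopifnot(sum(is.na(train_toy[predictors])) == 0)
stopifnot(sum(is.na(test_toy[predictors])) == 0)
```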
