# Kaggle in Practice: House Prices: Advanced Regression Techniques (Part 2)

## Initial Model

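• LotArea: lot area
• Neighborhood: neighborhood within the city, used as a first stand-in for district and residential community
• Condition1, Condition2: nearby traffic conditions
• BldgType: building type, e.g. detached house or townhouse
• HouseStyle: number of stories
• YearBuilt: year the house was built
• OverallQual: overall quality of the house, covering materials and finish
• OverallCond: overall condition of the house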

``````
# Load libraries
library(ggplot2)

# multiplot() helper, loaded from the author's local path
source('D:/RData/comm/multiplot.r')

# Plot a factor variable: level counts and its relation to SalePrice
# (aes_string() is soft-deprecated in recent ggplot2 but still works)
plot2_factor <- function(var_name){
  plots <- list()
  plots[[1]] <- ggplot(train, aes_string(x = var_name, fill = var_name)) +
    geom_bar() +
    guides(fill = "none") +
    ggtitle(paste("count of ", var_name)) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

  plots[[2]] <- ggplot(train, aes_string(x = var_name, y = "SalePrice", fill = var_name)) +
    geom_boxplot() +
    guides(fill = "none") +
    ggtitle(paste(var_name, " vs SalePrice")) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

  multiplot(plotlist = plots, cols = 2)
}

# Plot a continuous numeric variable: histogram and scatter against SalePrice
plot2_number <- function(var_name){
  plots <- list()
  plots[[1]] <- ggplot(train, aes_string(x = var_name)) +
    geom_histogram() +
    ggtitle(paste("count of ", var_name))

  plots[[2]] <- ggplot(train, aes_string(x = var_name, y = "SalePrice")) +
    geom_point() +
    ggtitle(paste(var_name, " vs SalePrice"))

  multiplot(plotlist = plots, cols = 2)
}

# Neighborhood, build year, and overall quality vs. sale price
plot2_factor("Neighborhood")
plot2_number("YearBuilt")
plot2_number("OverallQual")
``````

``````
# Correlation plot
library(corrgram)

``````

``````
# Build a formula from the hand-picked variables
fm.base <- SalePrice ~ LotArea + Neighborhood + BldgType + HouseStyle + YearBuilt + YearRemodAdd + OverallQual + OverallCond

# Train the model
lm.base <- lm(fm.base, train)

# Show the model summary
summary(lm.base)
``````

``````
Call:
lm(formula = fm.base, data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-208970  -20882   -2917   15544  351199

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)         -1.455e+06  1.850e+05  -7.862 7.42e-15 ***
LotArea              1.084e+00  1.156e-01   9.375  < 2e-16 ***
NeighborhoodBlueste -1.068e+03  2.953e+04  -0.036 0.971141
NeighborhoodBrDale  -1.440e+04  1.518e+04  -0.949 0.342806
NeighborhoodBrkSide -1.876e+04  1.278e+04  -1.468 0.142460
NeighborhoodClearCr -2.352e+03  1.332e+04  -0.177 0.859842
NeighborhoodCollgCr -2.917e+04  1.086e+04  -2.685 0.007335 **
NeighborhoodCrawfor  1.747e+04  1.246e+04   1.402 0.161225
NeighborhoodEdwards -2.813e+04  1.165e+04  -2.414 0.015924 *
NeighborhoodGilbert -4.030e+04  1.157e+04  -3.484 0.000508 ***
NeighborhoodIDOTRR  -3.357e+04  1.343e+04  -2.499 0.012570 *
NeighborhoodMitchel -2.819e+04  1.196e+04  -2.356 0.018617 *
NeighborhoodNAmes   -2.202e+04  1.130e+04  -1.950 0.051426 .
NeighborhoodNoRidge  6.105e+04  1.226e+04   4.980 7.13e-07 ***
NeighborhoodNPkVill  6.340e+03  1.650e+04   0.384 0.700928
NeighborhoodNridgHt  4.876e+04  1.104e+04   4.417 1.08e-05 ***
NeighborhoodNWAmes  -2.126e+04  1.166e+04  -1.823 0.068457 .
NeighborhoodOldTown -2.915e+04  1.243e+04  -2.344 0.019194 *
NeighborhoodSawyer  -2.575e+04  1.188e+04  -2.168 0.030350 *
NeighborhoodSawyerW -2.224e+04  1.154e+04  -1.927 0.054234 .
NeighborhoodSomerst -1.228e+04  1.093e+04  -1.123 0.261764
NeighborhoodStoneBr  5.984e+04  1.249e+04   4.790 1.84e-06 ***
NeighborhoodSWISU   -2.365e+04  1.433e+04  -1.651 0.099024 .
NeighborhoodTimber  -1.326e+04  1.236e+04  -1.073 0.283489
NeighborhoodVeenker  2.303e+04  1.555e+04   1.481 0.138905
BldgType2fmCon       1.230e+03  7.413e+03   0.166 0.868218
BldgTypeDuplex      -7.231e+02  5.831e+03  -0.124 0.901330
BldgTypeTwnhs       -6.675e+04  7.811e+03  -8.546  < 2e-16 ***
BldgTypeTwnhsE      -4.916e+04  4.892e+03 -10.049  < 2e-16 ***
HouseStyle1.5Unf    -2.835e+04  1.102e+04  -2.573 0.010184 *
HouseStyle1Story    -3.981e+03  3.977e+03  -1.001 0.316972
HouseStyle2.5Fin     5.328e+04  1.472e+04   3.619 0.000306 ***
HouseStyle2.5Unf    -4.606e+03  1.250e+04  -0.368 0.712613
HouseStyle2Story     4.069e+03  4.205e+03   0.968 0.333393
HouseStyleSFoyer    -1.173e+04  7.791e+03  -1.505 0.132424
HouseStyleSLvl      -6.438e+03  6.197e+03  -1.039 0.299077
YearBuilt            4.285e+02  8.428e+01   5.084 4.20e-07 ***
YearRemodAdd         3.114e+02  7.505e+01   4.149 3.53e-05 ***
OverallQual          2.849e+04  1.187e+03  24.010  < 2e-16 ***
OverallCond          1.613e+03  1.150e+03   1.402 0.161035
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 38880 on 1419 degrees of freedom
Multiple R-squared:  0.7671,    Adjusted R-squared:  0.7605
F-statistic: 116.8 on 40 and 1419 DF,  p-value: < 2.2e-16
``````

## Interpreting the Results

``````
Residuals:
    Min      1Q  Median      3Q     Max
-208970  -20882   -2917   15544  351199
``````

``````
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)         -1.455e+06  1.850e+05  -7.862 7.42e-15 ***
LotArea              1.084e+00  1.156e-01   9.375  < 2e-16 ***
NeighborhoodBlueste -1.068e+03  2.953e+04  -0.036 0.971141
NeighborhoodBrDale  -1.440e+04  1.518e+04  -0.949 0.342806
NeighborhoodBrkSide -1.876e+04  1.278e+04  -1.468 0.142460
...
...
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
``````

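• Estimate: the estimated regression coefficient
• Std. Error: the standard error of the regression coefficient
• t value: the t statistic for testing the hypothesis that this regression coefficient is 0
• Pr(>|t|): the p-value of that test, i.e. the probability of a t statistic at least this extreme if the hypothesis were true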

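The smaller the p-value, the less compatible the data are with the hypothesis (that the regression coefficient is 0), so the stronger the evidence that the coefficient is not 0 and the more significant the variable's role in the overall fit. A significance level of 0.05 is commonly used as the threshold.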
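
As a small check of how these columns relate: the t value is the estimate divided by its standard error, and the p-value is the two-sided tail probability of that t value under a t distribution with the residual degrees of freedom. The sketch below redoes this for the `OverallCond` row of the summary above (estimate 1.613e+03, standard error 1.150e+03, 1419 residual degrees of freedom). The variable names are only illustrative, and because the printed inputs are rounded, the result matches the printed t value and p-value only approximately.

```r
# Values copied from the OverallCond row of summary(lm.base)
estimate  <- 1.613e+03
std_error <- 1.150e+03
df_resid  <- 1419            # residual degrees of freedom from the summary

# t statistic for the hypothesis "coefficient = 0"
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(t_value, 3)
round(p_value, 3)
```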

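• The stars at the end of a row mark the variable as significant; the more stars, the more significant, up to three.
• The last line, Signif. codes, is the legend for these marks: three stars when the p-value is below 0.001, two below 0.01, one below 0.05, and a dot below 0.1; a p-value above 0.05 is considered not very significant.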
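
The mapping from p-values to significance marks can be reproduced with `cut()`. This is only a sketch of the legend, not how `summary()` computes it internally; the p-values are copied from five Neighborhood rows of the summary above, and the names `p_values` and `stars` are illustrative.

```r
# p-values copied from the coefficient table above
p_values <- c(NoRidge = 7.13e-07, CollgCr = 0.007335, Edwards = 0.015924,
              NAmes = 0.051426, Blueste = 0.971141)

# Bin the p-values using the cut points from the "Signif. codes" legend
stars <- cut(p_values,
             breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
             labels = c("***", "**", "*", ".", " "),
             include.lowest = TRUE)

data.frame(p_value = p_values, signif = stars)
```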

R² and adjusted R²

``````
Multiple R-squared:  0.7671,    Adjusted R-squared:  0.7605
``````

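R-squared (the coefficient of determination) takes values between 0 and 1; the closer it is to 1, the more of the variation in the response variable y is explained by the model's independent variables. As more independent variables are added, R-squared increases even when some of them have no significant linear relationship with y. Adjusted R-squared adds a penalty for the number of variables, so we use adjusted R-squared as the basic criterion for judging the model.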
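
The penalty can be made concrete with the usual formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of estimated coefficients excluding the intercept. The sketch below plugs in the figures printed in the summary above (R² = 0.7671, 40 model degrees of freedom, 1419 residual degrees of freedom); the variable names are illustrative, and the inputs are the rounded printed values.

```r
# Figures taken from the printed summary of lm.base
r2       <- 0.7671
df_model <- 40               # number of coefficients, excluding the intercept
df_resid <- 1419             # n - p - 1

# Number of observations implied by the degrees of freedom
n <- df_model + df_resid + 1

# Adjusted R-squared: penalise R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

round(adj_r2, 4)
```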

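F-statistic: 116.8 on 40 and 1419 DF, p-value: < 2.2e-16. The F statistic tests whether the model as a whole is significant. The null hypothesis is that all regression coefficients (other than the intercept) are 0, i.e. that the model is not significant. An F test of this hypothesis gives a p-value below 2.2e-16, so the hypothesis is rejected and the model is significant.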
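
The F statistic can also be recovered from R² and the two degrees of freedom, F = (R²/p) / ((1 - R²)/(n - p - 1)). The following sketch uses the rounded values printed in the summary above, so it reproduces 116.8 only approximately; the variable names are illustrative.

```r
# Figures taken from the printed summary of lm.base
r2       <- 0.7671
df_model <- 40
df_resid <- 1419

# Overall F statistic: explained variation per model df over unexplained per residual df
f_stat <- (r2 / df_model) / ((1 - r2) / df_resid)

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

round(f_stat, 1)
p_value < 2.2e-16
```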

``````
# Variables provisionally chosen for the lm.base model
fm.base <- SalePrice ~ LotArea + Neighborhood + BldgType + HouseStyle + YearBuilt + YearRemodAdd + OverallQual

# Train the model
lm.base <- lm(fm.base, train)
``````

``````
# Predict with the lm.base model
lm.pred <- predict(lm.base, test)

# Write out the result file
res <- data.frame(Id = test$Id, SalePrice = lm.pred)
write.csv(res, file = "D:/RData/House/res_base.csv", row.names = FALSE)
``````

## Initial Optimization

``````
# Quickly draw the residual plot, Q-Q plot, and the other diagnostic plots
layout(matrix(1:4, 2, 2))
plot(lm.base)
``````

• Residuals vs Fitted plot
• Scale-Location plot
• Normal Q-Q plot
• Residuals vs Leverage plot

``````
# Use Cook's distance to look for outliers
cooksd <- cooks.distance(lm.base)

# Plot Cook's distance for every observation
plot(cooksd, pch = ".", cex = 1, main = "Influential Obs by Cooks distance")
# Cutoff line at 4 times the mean
abline(h = 4 * mean(cooksd, na.rm = TRUE), col = "red")
# Label only the most extreme points (above 20 times the mean)
text(x = seq_along(cooksd) + 1, y = cooksd,
     labels = ifelse(cooksd > 20 * mean(cooksd, na.rm = TRUE), names(cooksd), ""),
     col = "red")
``````

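``````
# Points above 4 times the mean Cook's distance are treated as outliers
influential <- names(cooksd)[cooksd > 4 * mean(cooksd, na.rm = TRUE)]
# Drop them by row name; this also leaves train unchanged when nothing is flagged
train <- train[!(rownames(train) %in% influential), ]
``````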

``````
# Look at the distribution of SalePrice, raw and log-transformed
layout(matrix(1:2, 1, 2))
hist(train$SalePrice)
hist(log(train$SalePrice))
``````

``````
# New formula
fm.base <- log(SalePrice) ~ log(LotArea) + Neighborhood + BldgType + HouseStyle + YearBuilt + YearRemodAdd + OverallQual + OverallCond
# Train the model
lm.base <- lm(fm.base, train)
``````

## Comparing Variable Selection Methods

``````
#####################################################
# Null model (intercept only) and full model (all variables except Id)
null <- lm(log(SalePrice) ~ 1, data = train)
full <- lm(log(SalePrice) ~ . - Id, data = train)

# Forward stepwise selection
set.seed(999)
lm.for <- step(null, scope = list(lower = null, upper = full), direction = "forward")
summary(lm.for)
``````

``````
# Variables selected in the end
log(SalePrice) ~ OverallQual + Neighborhood + GrLivArea + BsmtFinSF1 +
    OverallCond + YearBuilt + TotalBsmtSF + GarageCars + MSZoning +
    BldgType + Functional + LotArea + SaleCondition + CentralAir +
    Condition1 + BsmtFullBath + Exterior1st + Fireplaces + YearRemodAdd +
    Heating + ScreenPorch + WoodDeckSF + LotFrontage + Foundation +
    KitchenQual + BsmtExposure + HeatingQC + SaleType + GarageCond +
    KitchenAbvGr + EnclosedPorch + BsmtFinSF2 + X3SsnPorch +
    HalfBath + FullBath + LotConfig + Street + GarageArea + OpenPorchSF
``````

LASSO

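``````
# Install and load
install.packages("glmnet")
library(glmnet)

# Prepare the data
formula <- as.formula(log(SalePrice) ~ . - Id)

# model.matrix() automatically turns categorical variables into dummy variables
x <- model.matrix(formula, train)
y <- log(train$SalePrice)

# Run the lasso (alpha = 1) with cross-validation
set.seed(999)
lm.lasso <- cv.glmnet(x, y, alpha = 1)

# Plot the cross-validation curve
plot(lm.lasso)

# Coefficients of each variable at lambda.min
coef(lm.lasso, s = "lambda.min")

# SalePrice is NA in test, so the model matrix cannot be built; fill in a placeholder
test$SalePrice <- 1
test_x <- model.matrix(formula, test)

# Predict and write out the results; exp() undoes the log transform
lm.pred <- predict(lm.lasso, newx = test_x, s = "lambda.min")
# predict() returns a one-column matrix; as.numeric() keeps the column name SalePrice
res <- data.frame(Id = test$Id, SalePrice = as.numeric(exp(lm.pred)))
write.csv(res, file = "D:/RData/House/res_lasso.csv", row.names = FALSE)
``````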

``````
# Load the random forest packages
library(randomForest)
library(caret)

# Set the seed
set.seed(223)

# Control parameters
# method = "cv" -- k-fold cross-validation
# number -- the k in k-fold cross-validation; number = 10 means 10-fold
# repeats -- how many times the cross-validation is repeated; it only takes
#            effect with method = "repeatedcv", so it is not set here
# verboseIter -- print the training log
ctrl <- trainControl(method = "cv", number = 10, verboseIter = TRUE)

# Train the model
lm.rf <- train(log(SalePrice) ~ . - Id, data = train, method = "rf", trControl = ctrl, tuneLength = 3)

# Write out the results
# (write_res() is not defined in this part, so the steps are written out explicitly)
lm.pred <- predict(lm.rf, test)
res <- data.frame(Id = test$Id, SalePrice = exp(lm.pred))
write.csv(res, file = "D:/RData/House/res_rf.csv", row.names = FALSE)
``````
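
To make the k-fold cross-validation that `trainControl(method = "cv", number = 10)` refers to more concrete, here is a hand-rolled sketch in base R. It is not how caret implements resampling internally. It uses the built-in `mtcars` data and a small `lm()` on a log-transformed response purely as a stand-in, since the competition data is not available here; `fold` and `rmse` are illustrative names.

```r
set.seed(223)

k <- 10
n <- nrow(mtcars)

# Randomly assign every row to one of k folds of nearly equal size
fold <- sample(rep(seq_len(k), length.out = n))

rmse <- numeric(k)
for (i in seq_len(k)) {
  train_part <- mtcars[fold != i, ]   # fit on the other k - 1 folds
  test_part  <- mtcars[fold == i, ]   # evaluate on the held-out fold

  fit  <- lm(log(mpg) ~ wt + hp, data = train_part)
  pred <- predict(fit, newdata = test_part)

  # RMSE on the log scale for this held-out fold
  rmse[i] <- sqrt(mean((log(test_part$mpg) - pred)^2))
}

# Cross-validated error estimate: average over the k folds
mean(rmse)
```

Each observation is held out exactly once, so the averaged error is an estimate of out-of-sample performance rather than of training fit.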

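GBDT stands for Gradient Boosting Decision Tree, a classification and regression algorithm built on decision trees. Like random forest, it is an ensemble method: both combine simple models, and the combination often performs better than a single, more complex model. The two differ in how the models are combined: random forest relies on randomization, whereas GBDT uses gradient boosting.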
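
The idea behind gradient boosting can be sketched in a few lines for squared-error loss: start from a constant prediction, then repeatedly fit a small tree to the current residuals and add a shrunken copy of its prediction. The code below is a minimal illustration on synthetic data using `rpart`, not the algorithm as implemented in the `gbm` package; `n_trees`, `shrinkage`, and the data are all assumptions made for the example.

```r
library(rpart)   # recommended package shipped with standard R installations

set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.2)      # synthetic nonlinear target
dat <- data.frame(x = x, y = y)

n_trees   <- 50     # number of boosting rounds (illustrative)
shrinkage <- 0.1    # learning rate (illustrative)

pred <- rep(mean(y), n)               # round 0: constant prediction
mse  <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  # For squared-error loss the negative gradient is the residual
  dat$resid <- y - pred

  # Fit a shallow tree to the residuals
  tree <- rpart(resid ~ x, data = dat, control = rpart.control(maxdepth = 2))

  # Move the prediction a small step in the direction the tree suggests
  pred <- pred + shrinkage * predict(tree, dat)

  mse[m] <- mean((y - pred)^2)        # training error after this round
}

c(start = mean((y - mean(y))^2), end = mse[n_trees])
```

Random forest, by contrast, would fit each tree independently on a bootstrap sample and average them, rather than fitting each tree to what the previous ones left unexplained.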

``````
# Install the package
install.packages("gbm")
# Train the model
lm.gbm <- train(log(SalePrice) ~ . - Id, data = train, method = "gbm", trControl = ctrl)

# Write out the results; exp() undoes the log transform
lm.pred <- predict(lm.gbm, test)
res <- data.frame(Id = test$Id, SalePrice = exp(lm.pred))
write.csv(res, file = "D:/RData/House/res_gbm.csv", row.names = FALSE)
``````
