# Machine Learning in Practice (2): Predicting House Prices

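```python
# -*- coding: utf-8 -*-
"""
Created on Sun Oct 21 14:37:15 2018

"""

# IPython-only magics; run them in an IPython/Spyder console, not in plain Python:
# %reset -f
# %clear

# In[*]
########## Step 1: import packages
# In[*]
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn import metrics
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
import numpy as np
import seaborn as sns
import os
from scipy.stats import skew
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated import path

# In[*]
########## Step 2: load the data
# In[*]
# The loading code is not shown in the original. The file names below are an
# ASSUMPTION (the usual Kaggle House Prices CSVs); adjust them to your setup.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Stack the predictor columns of train and test so both get identical preprocessing
all_data = pd.concat((train.loc[:, 'MSSubClass':'SaleCondition'],
                      test.loc[:, 'MSSubClass':'SaleCondition']))
```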
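
The label slice `.loc[:, 'MSSubClass':'SaleCondition']` is inclusive at both ends, so it keeps every column between the two labels and leaves out `Id` and the target `SalePrice`. A minimal sketch on toy frames follows; the values and the names `train_toy`, `test_toy`, and `all_toy` are made up for illustration, and only the column names echo the dataset.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (values are invented).
train_toy = pd.DataFrame({
    "Id": [1, 2],
    "MSSubClass": [60, 20],
    "SaleCondition": ["Normal", "Abnorml"],
    "SalePrice": [208500, 181500],
})
test_toy = pd.DataFrame({
    "Id": [3],
    "MSSubClass": [20],
    "SaleCondition": ["Normal"],
})

# Label-based slicing includes both endpoints, so Id and SalePrice are dropped.
all_toy = pd.concat((train_toy.loc[:, "MSSubClass":"SaleCondition"],
                     test_toy.loc[:, "MSSubClass":"SaleCondition"]))

print(all_toy.columns.tolist())
print(all_toy.shape)
```

Note that `pd.concat` keeps the original row labels, so the combined index repeats; the later split relies on row position, not on the index.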

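```python
# In[*]
# Step 3: normalize the target variable (compare the raw and log(1 + x) distributions)

matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price": train["SalePrice"],
                       "log(price + 1)": np.log1p(train["SalePrice"])})
prices.hist()
plt.show()

# The log transform of the target itself is applied in the next step.
```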
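
To see why the histogram comparison matters, the sketch below measures skewness before and after `np.log1p` and shows that `np.expm1` undoes the transform, which is what a prediction on the log scale would need to get back to prices. The data here is synthetic (a log-normal sample chosen as an assumption to mimic right-skewed prices), not the real `SalePrice` column.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" (an assumption, not the real data).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

log_prices = np.log1p(prices)      # log(1 + x), same transform as in the script
raw_skew = skew(prices)            # clearly positive for right-skewed data
log_skew = skew(log_prices)        # much closer to zero

# expm1 is the exact inverse of log1p, so predictions can be mapped back.
recovered = np.expm1(log_prices)

print(f"skew before: {raw_skew:.2f}, after: {log_skew:.2f}")
```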
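```python
# In[*]
# Step 4: log-transform the target and the skewed numeric predictors
# Log-transform the target:
train["SalePrice"] = np.log1p(train["SalePrice"])

# Log-transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Skewness of each numeric feature, computed on the training rows only
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna()))

# Keep only the features whose skewness exceeds 0.75
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
```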
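
The skewness filter can be checked on a small synthetic frame. In the sketch below the column names `symmetric` and `right_skewed` and the distributions are assumptions chosen for illustration; the selection logic mirrors the script (skewness above 0.75, NaNs dropped before measuring).

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Synthetic data: one roughly symmetric column, one right-skewed column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "symmetric": rng.normal(loc=50.0, scale=5.0, size=500),
    "right_skewed": rng.exponential(scale=10.0, size=500),
})
df.loc[0, "right_skewed"] = np.nan  # skew() needs NaNs dropped first

# Per-column skewness, then keep the columns above the 0.75 threshold.
skewness = df.apply(lambda col: skew(col.dropna()))
skewed_cols = skewness[skewness > 0.75].index

# log1p leaves NaN as NaN, so missing values survive for the later fillna step.
df[skewed_cols] = np.log1p(df[skewed_cols])

print(skewed_cols.tolist())
```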

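```python
# In[*]
# Step 5: one-hot encode the categorical variables and fill missing values
# In[*]
all_data = pd.get_dummies(all_data)
# Replace the remaining NaNs with the mean of their column
all_data = all_data.fillna(all_data.mean())
# In[*]
# Step 6: split back into training and test sets
# In[*]
# Creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice
```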
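
Steps 5 and 6 can be traced on a three-row toy example. The frames and values below are invented for illustration; the calls are the same ones the script uses.

```python
import numpy as np
import pandas as pd

# Toy frames (invented values): one numeric column with a gap, one categorical.
train_toy = pd.DataFrame({"LotArea": [8450.0, np.nan], "Street": ["Pave", "Grvl"]})
test_toy = pd.DataFrame({"LotArea": [11250.0], "Street": ["Pave"]})
all_toy = pd.concat((train_toy, test_toy))

# One-hot encode the categorical column, then fill NaNs with column means.
all_toy = pd.get_dummies(all_toy)
all_toy = all_toy.fillna(all_toy.mean())

# Split by row position: the first len(train) rows are the training set.
X_train_toy = all_toy[:train_toy.shape[0]]
X_test_toy = all_toy[train_toy.shape[0]:]

print(all_toy.columns.tolist())
print(X_train_toy["LotArea"].tolist())
```

Because the mean is taken over the combined frame, the fill value for a training row also reflects test rows. That is harmless in this toy case but worth knowing when evaluating a model.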

The data preprocessing here does nothing fancy. First, the skewed numeric features are transformed by taking log(feature + 1), which makes them more normal. Next, dummy variables are created for the categorical features. Finally, the numeric missing values (NaNs) are replaced with the mean of their respective columns.
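
The script imports `cross_val_score` and `linear_model` but ends once `X_train`, `X_test`, and `y` are built. One plausible next step, shown here as an assumption and not as the author's method, is to score a regularized linear model with cross-validated RMSE. The sketch uses synthetic data in place of `X_train` and `y`; the helper `rmse_cv`, the choice of `Ridge`, and `alpha=10.0` are all illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for X_train / y (an assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ coef + rng.normal(scale=0.1, size=200)


def rmse_cv(model, X, y, folds=5):
    """Hypothetical helper: per-fold RMSE from cross-validated negative MSE."""
    neg_mse = cross_val_score(model, X, y,
                              scoring="neg_mean_squared_error", cv=folds)
    return np.sqrt(-neg_mse)


scores = rmse_cv(Ridge(alpha=10.0), X_demo, y_demo)
print(scores.mean())
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y`. Since `y` is on the log scale, the resulting RMSE is an error in log prices.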
