前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >机器学习实战(2)之预测房价

机器学习实战(2)之预测房价

作者头像
用户1359560
发布2018-11-08 10:54:32
6030
发布2018-11-08 10:54:32
举报
文章被收录于专栏:生信小驿站生信小驿站

这一篇主要是系统地对数据进行机器学习前的预处理。

# -*- coding: utf-8 -*-
"""
Created on Sun Oct 21 14:37:15 2018

@author: Administrator
"""

% reset -f
% clear

# In[*]
##########第一步  导入包
# In[*]
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn import metrics
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
import numpy as np
import seaborn as sns
import os
from scipy.stats import skew
from scipy.stats.stats import pearsonr
os.chdir("C:\\Users\\Administrator\\Desktop\\all")

# In[*]
##########第二步  导入数据
# In[*]
train = pd.read_csv('train.csv',header = 0,index_col=0)
test  = pd.read_csv('test.csv',header = 0,index_col=0)

all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                      test.loc[:,'MSSubClass':'SaleCondition']))

前两步,导入包和数据。

数据大概80列,3000个观测值,属性包括有数字列,同时也有字符串列。

# In[*]
# 第三步,将目标变量标准化

matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price":train["SalePrice"],
                       "log(price + 1)":np.log1p(train["SalePrice"])})
prices.hist()

#log transform the target:
# In[*]
# 第四步,将预测变量标准化
train["SalePrice"] = np.log1p(train["SalePrice"])

#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) 

skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])

这一步主要目的是将数字类型的属性,将这些特征其中比较偏,不属于正态分布的特征做log标准化。

# In[*]
# 第五步,处理字符型变量以及将填充缺失值
# In[*]
all_data = pd.get_dummies(all_data)
all_data = all_data.fillna(all_data.mean())
# In[*]
# 第六步,划分训练集和测试集
# In[*]
#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice

数据预处理要点: 1.使用log(x+1)来转换偏斜的数字特征 -,这将使我们的数据更加正常 2.为分类要素创建虚拟变量 3.将数字缺失值(NaN)替换为各自列的平均值

全部代码:

# -*- coding: utf-8 -*-
"""
Created on Sun Oct 21 14:37:15 2018

@author: Administrator
"""

% reset -f
% clear

# In[*]
##########第一步  导入包
# In[*]
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn import metrics
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
import numpy as np
import seaborn as sns
import os
from scipy.stats import skew
from scipy.stats.stats import pearsonr
os.chdir("C:\\Users\\Administrator\\Desktop\\all")

# In[*]
##########第二步  导入数据
# In[*]
train = pd.read_csv('train.csv',header = 0,index_col=0)
test  = pd.read_csv('test.csv',header = 0,index_col=0)

all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                      test.loc[:,'MSSubClass':'SaleCondition']))
# In[*]

#Data preprocessing:

#We're not going to do anything fancy here:

#First I'll transform the skewed numeric features by taking log(feature + 1) - 
#this will make the features more normal
#Create Dummy variables for the categorical features
#Replace the numeric missing values (NaN's) with the mean of their respective columns

# In[*]
# 第三步,将目标变量标准化
# In[*]
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price":train["SalePrice"],
                       "log(price + 1)":np.log1p(train["SalePrice"])})
prices.hist()

#log transform the target:
# In[*]
# 第四步,将预测变量标准化
# In[*]
train["SalePrice"] = np.log1p(train["SalePrice"])

#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) 

skewed_feats = skewed_feats[skewed_feats > 0.75]


skewed_feats = skewed_feats.index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])

# In[*]
# 第五步,处理字符型变量以及将填充缺失值
# In[*]
all_data = pd.get_dummies(all_data)
all_data = all_data.fillna(all_data.mean())
# In[*]
# 第六步,划分训练集和测试集
# In[*]
#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice
本文参与 腾讯云自媒体分享计划,分享自作者个人站点/博客。
原始发表:2018.10.21 ,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档