Real Estate Valuation Model: Training and Prediction Results

潇洒坤
Published 2018-09-10 10:46:30
Included in the column: 简书专栏

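The source data for the valuation model in this article is housing price data for the city of Xiamen. File download link: https://pan.baidu.com/s/1vOact6MsyZZlTSxjmMqTbw password: 8zg6. Once downloaded and opened, the file looks like this: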

[Figure: screenshot of the opened file (文件打开图示.png)]

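As the figure above shows, the data has already gone through some simple processing and needs only minor adjustments before it can be used to train a model.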

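Two models: MLPR and GB

df_y = df['unitPrice'] selects the unitPrice column of the DataFrame, and y = df_y.values returns it as a numpy.ndarray of shape (21935,), that is, a one-dimensional array of length 21935.
df_x = df.drop(['unitPrice'],axis=1) keeps every column of the DataFrame except unitPrice, and x = df_x.values returns a numpy.ndarray of shape (21935,120), that is, a 21935*120 two-dimensional array.
The data is standardized with sklearn's preprocessing.StandardScaler(): the scaler is fitted on the training set, and the same fitted scaler is then applied to the test set.
MLPRegressor() creates the multilayer perceptron regression model, which is trained on the training set and then scored on the test set.
GradientBoostingRegressor() creates the ensemble regression model, which is trained and scored in the same way.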
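To illustrate the "fit on the training set, then transform the test set" step described above, here is a minimal sketch. It uses small synthetic data as a stand-in for the housing features (an assumption, since the real spreadsheet is not reproduced here); the names scaler, train_scaled and test_scaled are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (not the article's data).
rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

train_x, test_x = train_test_split(x, train_size=0.8, random_state=33)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_x)  # learns mean/std from the training rows only
test_scaled = scaler.transform(test_x)        # reuses the training mean/std

print(train_scaled.mean(axis=0))  # approximately 0 in every column
print(test_scaled.mean(axis=0))   # near 0, but generally not exactly 0
```

Fitting only on the training rows keeps information about the test set out of the preprocessing step, which is why the test set is passed to transform rather than fit_transform.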

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd

# Load the preprocessed data; unitPrice is the target column.
df = pd.read_excel("数据处理结果.xlsx")
df_y = df['unitPrice']
df_x = df.drop(['unitPrice'], axis=1)
x = df_x.values
y = df_y.values

train_x, test_x, train_y, test_y = train_test_split(x, y, train_size=0.8,
                                                    random_state=33)
# Fit the feature scaler on the training set only, then apply it to the test set.
ss_x = preprocessing.StandardScaler()
train_x1 = ss_x.fit_transform(train_x)
test_x1 = ss_x.transform(test_x)

# Standardize the target the same way (StandardScaler expects a 2-D array).
ss_y = preprocessing.StandardScaler()
train_y1 = ss_y.fit_transform(train_y.reshape(-1, 1))
test_y1 = ss_y.transform(test_y.reshape(-1, 1))

model_mlp = MLPRegressor(solver='lbfgs', hidden_layer_sizes=(20, 20, 20), random_state=1)
model_mlp.fit(train_x1, train_y1.ravel())
mlp_score = model_mlp.score(test_x1, test_y1.ravel())
print("sklearn MLP regression score", mlp_score)

model_gbr = GradientBoostingRegressor()
model_gbr.fit(train_x1, train_y1.ravel())
gbr_score = model_gbr.score(test_x1, test_y1.ravel())
print("sklearn ensemble regression score", gbr_score)

The printed output is:
sklearn MLP regression score 0.683941816792
sklearn ensemble regression score 0.762351806857
For a first round of model tuning, this result is acceptable.
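For readers unsure what these numbers mean: the score method of scikit-learn regressors returns the coefficient of determination R^2, where 1.0 is a perfect fit. The sketch below checks this by hand on synthetic data. LinearRegression is used only as a fast stand-in for the two models in the article, and all data and variable names here are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data (not the housing data).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = 3.0 * x[:, 0] - 2.0 * x[:, 1] + rng.normal(scale=0.5, size=100)

# Train on the first 80 rows, evaluate on the last 20.
model = LinearRegression().fit(x[:80], y[:80])
test_x, test_y = x[80:], y[80:]
pred = model.predict(test_x)

score = model.score(test_x, test_y)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((test_y - pred) ** 2)
ss_tot = np.sum((test_y - test_y.mean()) ** 2)
manual = 1.0 - ss_res / ss_tot

print(score, manual, r2_score(test_y, pred))  # three identical values
```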

Outlier handling

[Figure: image.png]

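The figure shows that some houses have a unit price in the hundreds of thousands or even over a million; outliers like these need to be removed.
I have not yet found a function that can be called directly to handle outliers, so one has to be written by hand. The code below defines a cleanOutlier function whose main job is to delete outliers.
First, the concepts of lower and upper quartile: given 100 numbers sorted in ascending order, the median is the 50th value, the lower quartile is the 25th value, and the upper quartile is the 75th value.
The interquartile range (IQR) is the upper quartile minus the lower quartile. For example, with an upper quartile of 900 and a lower quartile of 700, the IQR is 200.
Outliers are values that are too large or too small. In this method, any value below (lower quartile - 3*IQR) or above (upper quartile + 3*IQR) is treated as an outlier and deleted. For example, with an upper quartile of 900 and a lower quartile of 700, values below 100 or above 1500 are deleted.
A DataFrame can be converted to an ndarray with df.values. Model training generally works with float values, so df.values.astype('float') is used to get a floating-point array.
cleanOutlier removes the outliers; column 0 is then assigned to y, and columns 1 through the last are assigned to x.
Because most columns of x are one-hot encoded, x is not standardized again.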
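The worked example above (quartiles 700 and 900, multiplier 3) can be checked with a few lines of Python. This is only a sketch of the threshold arithmetic, not the cleanOutlier function itself; iqr_bounds and the sample prices are hypothetical names and values introduced for illustration.

```python
import numpy as np

def iqr_bounds(q1, q3, mul=3):
    """Return the (low, high) cut-offs: q1 - mul*IQR and q3 + mul*IQR."""
    iqr = q3 - q1
    return q1 - mul * iqr, q3 + mul * iqr

low, high = iqr_bounds(700, 900)
print(low, high)  # 100 1500

# Hypothetical prices: 50 is too small and 2000 is too large.
prices = np.array([50, 650, 700, 800, 900, 950, 2000], dtype=float)
kept = prices[(prices >= low) & (prices <= high)]
print(kept)  # [650. 700. 800. 900. 950.]
```

The boolean mask is a vectorized way to apply the same rule; the article's function instead sorts the rows and slices out the kept range, which gives the same set of rows in sorted order.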

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd

def cleanOutlier(data, column, mul=3):
    # Sort the rows by the given column in ascending order.
    data = data[data[:, column].argsort()]
    l = len(data)
    low = int(l / 4)
    high = int(l / 4 * 3)
    lowValue = data[low, column]    # lower quartile
    highValue = data[high, column]  # upper quartile
    iqr = highValue - lowValue      # interquartile range
    print("Lower quartile: {}  upper quartile: {}".format(lowValue, highValue))
    # Clamp the cut-offs to the observed minimum and maximum.
    delLowValue = max(lowValue - mul * iqr, data[0, column])
    delHighValue = min(highValue + mul * iqr, data[-1, column])
    print("Removing values in column {} below {} or above {}".format(
        column, delLowValue, delHighValue))
    # Find the first and last rows inside the cut-offs; the defaults keep the
    # names defined even if neither loop finds a match.
    recordLow, recordHigh = low, high
    for i in range(low):
        if data[i, column] >= delLowValue:
            recordLow = i
            break
    for i in range(len(data) - 1, high, -1):
        if data[i, column] <= delHighValue:
            recordHigh = i
            break
    # Print a summary of the outlier removal.
    print("Original array has {} rows".format(len(data)), end=', ')
    print("keeping rows {} to {}".format(recordLow, recordHigh), end=', ')
    data = data[recordLow:recordHigh + 1]
    print("{} rows remain after removing outliers in column {}".format(
        recordHigh + 1 - recordLow, column))
    return data

df = pd.read_excel("数据处理结果.xlsx")
data = df.values.astype('float')
data = cleanOutlier(data, 0)  # column 0 holds the unit price
x = data[:, 1:]
y = data[:, 0]

train_x, test_x, train_y, test_y = train_test_split(x, y, train_size=0.8,
                                                    random_state=33)

# Only y is standardized here; x is mostly one-hot encoded.
ss_y = preprocessing.StandardScaler()
train_y = ss_y.fit_transform(train_y.reshape(-1, 1))
test_y = ss_y.transform(test_y.reshape(-1, 1))

model_mlp = MLPRegressor(solver='lbfgs', hidden_layer_sizes=(20, 20, 20), random_state=1)
model_mlp.fit(train_x, train_y.ravel())
mlp_score = model_mlp.score(test_x, test_y.ravel())
print("sklearn MLP regression score", mlp_score)

# The original script standardized y a second time at this point; that step
# was redundant (y is already standardized above) and has been removed.
model_gbr = GradientBoostingRegressor(learning_rate=0.1)
model_gbr.fit(train_x, train_y.ravel())
gbr_score = model_gbr.score(test_x, test_y.ravel())
print("sklearn ensemble regression score", gbr_score)

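The printed output is:
sklearn MLP regression score 0.795028773029
sklearn ensemble regression score 0.767157061712
In this second round of tuning, the MLP regression score improves markedly (from about 0.68 to about 0.80), while the ensemble regression score barely changes. Overall, the outlier handling was successful.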

Normalization (log transform)

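Normalization here means taking the natural logarithm (base e) of each value of y and assigning the result back to y.
It is done with a loop: for i in range(len(y)): y[i] = math.log(y[i])
In principle, standardization should no longer be needed after this step, but in experiments standardizing both x and y still improved the scores.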
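As a side note, the element-by-element loop can also be written as a single NumPy call. The sketch below uses a few made-up unit prices (an assumption, not values from the dataset) to show that np.log gives the same result as the loop and that np.exp undoes it.

```python
import math
import numpy as np

y = np.array([20000.0, 35000.0, 50000.0])  # hypothetical unit prices

# Loop form, as used in the article.
looped = y.copy()
for i in range(len(looped)):
    looped[i] = math.log(looped[i])

# Vectorized form: natural log of every element at once.
vectorized = np.log(y)

# The transform is invertible, so original prices can be recovered.
restored = np.exp(vectorized)

print(np.allclose(looped, vectorized))  # True
print(np.allclose(restored, y))         # True
```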

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
import math

def cleanOutlier(data, column, mul=3):
    # Sort the rows by the given column in ascending order.
    data = data[data[:, column].argsort()]
    l = len(data)
    low = int(l / 4)
    high = int(l / 4 * 3)
    lowValue = data[low, column]    # lower quartile
    highValue = data[high, column]  # upper quartile
    iqr = highValue - lowValue      # interquartile range
    print("Lower quartile: {}  upper quartile: {}".format(lowValue, highValue))
    # Clamp the cut-offs to the observed minimum and maximum.
    delLowValue = max(lowValue - mul * iqr, data[0, column])
    delHighValue = min(highValue + mul * iqr, data[-1, column])
    print("Removing values in column {} below {} or above {}".format(
        column, delLowValue, delHighValue))
    # Find the first and last rows inside the cut-offs; the defaults keep the
    # names defined even if neither loop finds a match.
    recordLow, recordHigh = low, high
    for i in range(low):
        if data[i, column] >= delLowValue:
            recordLow = i
            break
    for i in range(len(data) - 1, high, -1):
        if data[i, column] <= delHighValue:
            recordHigh = i
            break
    # Print a summary of the outlier removal.
    print("Original array has {} rows".format(len(data)), end=', ')
    print("keeping rows {} to {}".format(recordLow, recordHigh), end=', ')
    data = data[recordLow:recordHigh + 1]
    print("{} rows remain after removing outliers in column {}".format(
        recordHigh + 1 - recordLow, column))
    return data

df = pd.read_excel("数据处理结果.xlsx")
data = df.values.astype('float')
data = cleanOutlier(data, 0)  # column 0 holds the unit price
x = data[:, 1:]
y = data[:, 0]
# Replace each unit price with its natural logarithm.
for i in range(len(y)):
    y[i] = math.log(y[i])

train_x, test_x, train_y, test_y = train_test_split(x, y, train_size=0.8,
                                                    random_state=33)

# Standardize x and y, fitting each scaler on the training set only.
ss_x = preprocessing.StandardScaler()
train_x = ss_x.fit_transform(train_x)
test_x = ss_x.transform(test_x)

ss_y = preprocessing.StandardScaler()
train_y = ss_y.fit_transform(train_y.reshape(-1, 1))
test_y = ss_y.transform(test_y.reshape(-1, 1))

model_mlp = MLPRegressor(solver='lbfgs', hidden_layer_sizes=(20, 20, 20), random_state=1)
model_mlp.fit(train_x, train_y.ravel())
mlp_score = model_mlp.score(test_x, test_y.ravel())
print("sklearn MLP regression score", mlp_score)

model_gbr = GradientBoostingRegressor(learning_rate=0.1)
model_gbr.fit(train_x, train_y.ravel())
gbr_score = model_gbr.score(test_x, test_y.ravel())
print("sklearn ensemble regression score", gbr_score)

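The printed output is:
sklearn MLP regression score 0.831448099649
sklearn ensemble regression score 0.780133207248
Compared with the previous run, both scores improved again, so this adjustment was also a success.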
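Because the models in this step are trained on standardized log prices, their raw predictions are not in price units. A hedged sketch of how a prediction could be mapped back to a unit price is shown below: undo the standardization with inverse_transform, then undo the logarithm with np.exp. The prices are made up, and the "predictions" are simply the scaled targets themselves, so that the round trip can be verified without training a model.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

prices = np.array([18000.0, 25000.0, 32000.0, 47000.0])  # hypothetical unit prices

# Forward direction, as in the article: log, then standardize.
log_prices = np.log(prices)
ss_y = StandardScaler()
scaled = ss_y.fit_transform(log_prices.reshape(-1, 1)).ravel()

# Stand-in for model.predict(...) output (assumption: a perfect model).
pred_scaled = scaled

# Backward direction: undo the standardization, then undo the log.
pred_log = ss_y.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
pred_price = np.exp(pred_log)

print(pred_price)  # matches the original prices up to floating-point error
```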

Cross-validation

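The training and test sets are chosen mainly with KFold from sklearn.model_selection.
kf = KFold(n_splits=5,shuffle=True) initializes the KFold object.
In the line for train_index,test_index in kf.split(x):, kf.split(x) is an iterable that yields n_splits items, here 5. Each item is a tuple whose elements are the index arrays of the training set and the test set.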
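To make the shape of what kf.split(x) produces concrete, here is a small sketch on a toy array of 10 rows. The toy data and the random_state=0 argument are assumptions added so the example is reproducible; the article's own call does not fix a seed, so its fold assignment changes from run to run.

```python
import numpy as np
from sklearn.model_selection import KFold

x = np.arange(20).reshape(10, 2)  # toy data: 10 rows, 2 columns

kf = KFold(n_splits=5, shuffle=True, random_state=0)
folds = list(kf.split(x))  # materialize the (train_index, test_index) pairs

print(len(folds))  # 5
for train_index, test_index in folds:
    # With 10 rows and 5 folds, each fold trains on 8 rows and tests on 2.
    print(len(train_index), len(test_index), test_index)
```

Every row lands in exactly one test fold, so across the five iterations each sample is used for testing once and for training four times.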

from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
import math
from sklearn.model_selection import KFold

def cleanOutlier(data, column, mul=3):
    # Sort the rows by the given column in ascending order.
    data = data[data[:, column].argsort()]
    l = len(data)
    low = int(l / 4)
    high = int(l / 4 * 3)
    lowValue = data[low, column]    # lower quartile
    highValue = data[high, column]  # upper quartile
    iqr = highValue - lowValue      # interquartile range
    print("Lower quartile: {}  upper quartile: {}".format(lowValue, highValue))
    # Clamp the cut-offs to the observed minimum and maximum.
    delLowValue = max(lowValue - mul * iqr, data[0, column])
    delHighValue = min(highValue + mul * iqr, data[-1, column])
    print("Removing values in column {} below {} or above {}".format(
        column, delLowValue, delHighValue))
    # Find the first and last rows inside the cut-offs; the defaults keep the
    # names defined even if neither loop finds a match.
    recordLow, recordHigh = low, high
    for i in range(low):
        if data[i, column] >= delLowValue:
            recordLow = i
            break
    for i in range(len(data) - 1, high, -1):
        if data[i, column] <= delHighValue:
            recordHigh = i
            break
    # Print a summary of the outlier removal.
    print("Original array has {} rows".format(len(data)), end=', ')
    print("keeping rows {} to {}".format(recordLow, recordHigh), end=', ')
    data = data[recordLow:recordHigh + 1]
    print("{} rows remain after removing outliers in column {}".format(
        recordHigh + 1 - recordLow, column))
    return data

df = pd.read_excel("数据处理结果.xlsx")
data = df.values.astype('float')
data = cleanOutlier(data, 0)  # column 0 holds the unit price
x = data[:, 1:]
y = data[:, 0]
# Replace each unit price with its natural logarithm.
for i in range(len(y)):
    y[i] = math.log(y[i])

kf = KFold(n_splits=5, shuffle=True)

for train_index, test_index in kf.split(x):
    train_x = x[train_index]
    test_x = x[test_index]
    train_y = y[train_index]
    test_y = y[test_index]

    # Fit the scalers on this fold's training part only.
    ss_x = preprocessing.StandardScaler()
    train_x = ss_x.fit_transform(train_x)
    test_x = ss_x.transform(test_x)

    ss_y = preprocessing.StandardScaler()
    train_y = ss_y.fit_transform(train_y.reshape(-1, 1))
    test_y = ss_y.transform(test_y.reshape(-1, 1))

    model_mlp = MLPRegressor(solver='lbfgs', hidden_layer_sizes=(20, 20, 20), random_state=1)
    model_mlp.fit(train_x, train_y.ravel())
    mlp_score = model_mlp.score(test_x, test_y.ravel())
    print("sklearn MLP regression score", mlp_score)

    model_gbr = GradientBoostingRegressor(learning_rate=0.1)
    model_gbr.fit(train_x, train_y.ravel())
    gbr_score = model_gbr.score(test_x, test_y.ravel())
    print("sklearn ensemble regression score", gbr_score)

The printed output is:
sklearn MLP regression score 0.8427725943791746
sklearn ensemble regression score 0.7915684454283963
sklearn MLP regression score 0.8317854959807023
sklearn ensemble regression score 0.7705608099963528
sklearn MLP regression score 0.8369280445356948
sklearn ensemble regression score 0.7851823734454625
sklearn MLP regression score 0.8364897250676866
sklearn ensemble regression score 0.7833199279062474
sklearn MLP regression score 0.8335782493590231
sklearn ensemble regression score 0.7722233325504181
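Ten separate numbers are hard to compare at a glance, so a common way to summarize cross-validation output is the mean and standard deviation per model. The short sketch below does this with the ten scores printed above, copied in by hand; the variable names are chosen for this example only.

```python
import numpy as np

# Fold scores copied from the printed output above.
mlp_scores = np.array([0.8427725943791746, 0.8317854959807023,
                       0.8369280445356948, 0.8364897250676866,
                       0.8335782493590231])
gbr_scores = np.array([0.7915684454283963, 0.7705608099963528,
                       0.7851823734454625, 0.7833199279062474,
                       0.7722233325504181])

mlp_mean, gbr_mean = mlp_scores.mean(), gbr_scores.mean()

print("MLP      mean {:.4f}  std {:.4f}".format(mlp_mean, mlp_scores.std()))
print("Ensemble mean {:.4f}  std {:.4f}".format(gbr_mean, gbr_scores.std()))
```

The means come out at roughly 0.836 for the MLP and 0.781 for the ensemble model, in line with the single-split scores from the previous step.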

Originally published: 2018.06.26
