
Bagging and Boosting

Author: 用户1733462
Published: 2018-07-04 12:10:25
Included in the column: 数据处理 (Data Processing)

Loading the data

import pandas as pd
df_wine = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
df_wine.columns = ['Class label', 'Alcohol', 
                   'Malic acid', 'Ash', 
                   'Alcalinity of ash', 'Magnesium', 
                   'Total phenols', 'Flavanoids', 
                   'Nonflavanoid phenols', 'Proanthocyanins', 
                   'Color intensity', 'Hue', 
                   'OD280/OD315 of diluted wines', 'Proline']
y = df_wine['Class label'].values

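Feature selection

To make the later visualization easier, we keep only two features, chosen by the correlation coefficient between each feature and the target y.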

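# pearsonr returns the correlation coefficient r and its p-value.
# A small p-value (e.g. p < 0.01) means the correlation is statistically
# significant; the strength of the relationship is given by |r|.
from scipy.stats import pearsonr

label = df_wine['Class label'].values
lr = []
# Skip column 0: it is the class label itself and would always give r = 1.
for i in range(1, df_wine.shape[1]):
    r, p = pearsonr(label, df_wine.iloc[:, i].values)
    lr.append((r, i))
lr.sort()
# Keep the most negatively and the most positively correlated features.
X = df_wine[[df_wine.columns[lr[0][1]], df_wine.columns[lr[-1][1]]]].values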
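
To see how the features rank against the class label, a short sketch like the following may help. It assumes scikit-learn's bundled `load_wine` copy of the UCI wine data (so it runs offline; its column names are lowercase and differ from the ones set above), and it ranks by the absolute value of r, which is a different rule from picking the two extremes.

```python
from sklearn.datasets import load_wine

# Assumption: load_wine() stands in for the CSV download so the sketch is self-contained.
wine = load_wine(as_frame=True)
features, target = wine.data, wine.target

# Pearson correlation of every feature column with the class label.
corr = features.corrwith(target)

# Rank by |r| so that strong negative correlations count as much as positive ones.
ranked = corr.abs().sort_values(ascending=False)
top_two = ranked.index[:2].tolist()

print(corr.loc[top_two])
```

Printing `corr.sort_values()` instead shows the signed ranking, which corresponds to the first and last entries of the sorted list in the article's own code.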

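The features can also be chosen by reducing dimensionality with PCA, although in this example classification on the PCA-reduced data did not work well.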
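
A minimal sketch of such a PCA reduction might look like this. It is not the author's code: the bundled `load_wine` data set and the standardization step are assumptions made here so the example is self-contained and the features are on a comparable scale.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: load_wine() is used in place of the CSV download.
X_all, y_all = load_wine(return_X_y=True)

# Assumption: standardize first, because PCA is sensitive to feature scale.
X_std = StandardScaler().fit_transform(X_all)

# Project the 13 features onto the first two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)
print(pca.explained_variance_ratio_)
```

PCA ignores the class labels, so the two components that capture the most variance are not necessarily the ones that separate the classes best.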

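Because the data is labeled, LDA can also be used for dimensionality reduction. It works better here: the data is classified 100% correctly.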

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X = df_wine.iloc[:, 1:].values
lda = LinearDiscriminantAnalysis(n_components=2)
X = lda.fit_transform(X, y)
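
One caveat: fitting LDA on all rows before splitting lets the projection see the labels of samples that later end up in the test set. A hedged way to check how well the LDA features classify held-out data is to put LDA inside a pipeline and cross-validate it. The sketch below assumes the bundled `load_wine` data and an arbitrary tree depth of 3; neither comes from the article.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_wine() is used in place of the CSV download.
X_all, y_all = load_wine(return_X_y=True)

# LDA is re-fitted on the training part of every fold, so no test labels leak in.
model = make_pipeline(
    LinearDiscriminantAnalysis(n_components=2),
    DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=10),
)

scores = cross_val_score(model, X_all, y_all, cv=10)
print("mean CV accuracy: %.3f" % scores.mean())
```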

Hyperparameter tuning: here we tune only the depth of the decision tree

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Hold out 30% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
param_test1 = {'max_depth': range(1, 20, 1)}
gsearch1 = GridSearchCV(estimator=DecisionTreeClassifier(criterion="entropy",
                                                         random_state=10),
                        param_grid=param_test1, cv=10)
gsearch1.fit(X_train, y_train)
# print(gsearch1.cv_results_)
print(gsearch1.best_params_)
print(gsearch1.best_score_)

Output

{'max_depth': 8}
0.822580645161
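
To look at the cross-validated score for every depth rather than only the best one, the `cv_results_` attribute can be loaded into a DataFrame. This is a sketch under assumptions: it uses the bundled `load_wine` data with all 13 features, so its numbers will not match the output above.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_wine() with all features replaces the two selected features.
X_all, y_all = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.3, random_state=1)

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=10),
    param_grid={"max_depth": list(range(1, 20))},
    cv=10,
)
search.fit(X_tr, y_tr)

# One row per candidate depth, with the mean and spread of the 10 fold scores.
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False).head())
```

When several depths score almost the same, the standard deviation column helps judge whether the reported best depth is meaningfully better than a shallower tree.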

Measuring the accuracy of a single decision tree

# Measure the accuracy of a single decision tree
from sklearn.metrics import accuracy_score
tree = DecisionTreeClassifier(criterion="entropy", max_depth=gsearch1.best_params_['max_depth'])
tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f' % (tree_train, tree_test))

Output

Decision tree train/test accuracies 0.984/0.815

# Build 50 decision trees; see the official documentation for the full parameter list.
# Note: scikit-learn versions before 1.2 call the `estimator` parameter `base_estimator`.
bag = BaggingClassifier(estimator=tree, n_estimators=50,
                        max_samples=1.0, max_features=1.0,
                        bootstrap=True, bootstrap_features=False,
                        n_jobs=1, random_state=1)

# Measure the accuracy of the bagging classifier
bag = bag.fit(X_train, y_train)
y_train_pred = bag.predict(X_train)
y_test_pred = bag.predict(X_test)
bag_train = accuracy_score(y_train, y_train_pred)
bag_test = accuracy_score(y_test, y_test_pred)
print('Bagging train/test accuracies %.3f/%.3f' % (bag_train, bag_test))

The bagging classifier does outperform the single decision tree, but only slightly (test accuracy 0.852 versus 0.815).

Bagging train/test accuracies 1.000/0.852
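
To make the mechanism concrete, bagging can be sketched by hand: draw bootstrap samples, fit one tree per sample, and take a majority vote. This is a simplified illustration, not scikit-learn's implementation (`BaggingClassifier` averages predicted class probabilities when the base estimator provides them), and it assumes the bundled `load_wine` data with all 13 features.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_wine() is used in place of the CSV download.
X_all, y_all = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=1, stratify=y_all
)

rng = np.random.default_rng(1)
n_trees = 50
votes = np.zeros((n_trees, len(X_te)), dtype=int)

for t in range(n_trees):
    # Bootstrap: draw len(X_tr) row indices with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    member = DecisionTreeClassifier(criterion="entropy", random_state=t)
    member.fit(X_tr[idx], y_tr[idx])
    votes[t] = member.predict(X_te)

# Majority vote per test sample (labels are 0, 1, 2 in load_wine).
n_classes = len(np.unique(y_all))
counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
y_vote = counts.argmax(axis=0)

acc = (y_vote == y_te).mean()
print("hand-rolled bagging test accuracy: %.3f" % acc)
```

Because every tree sees a different resample, their individual errors are partly independent, and the vote tends to cancel them out.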

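Boosting classifier: Bagging combines its models by voting or averaging, whereas Boosting trains them sequentially, each round giving more weight to the samples misclassified so far.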

from sklearn.ensemble import AdaBoostClassifier

# Note: scikit-learn versions before 1.2 call the `estimator` parameter `base_estimator`.
ada = AdaBoostClassifier(estimator=tree, n_estimators=1000, learning_rate=0.1, random_state=0)
ada = ada.fit(X_train, y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)
ada_train = accuracy_score(y_train, y_train_pred)
ada_test = accuracy_score(y_test, y_test_pred)
print('AdaBoost train/test accuracies %.3f/%.3f' % (ada_train, ada_test))
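
The reweighting idea behind AdaBoost can be sketched in a few lines. The example below is a simplified binary version with decision stumps, written only for illustration: it assumes the bundled `load_wine` data restricted to classes 0 and 1 (recoded as -1/+1) and 20 rounds, and it is not how `AdaBoostClassifier` handles the three-class problem above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

# Assumption: keep two classes only and recode them as -1 / +1.
X_all, y_all = load_wine(return_X_y=True)
mask = y_all < 2
X_bin = X_all[mask]
y_bin = np.where(y_all[mask] == 1, 1, -1)

n = len(y_bin)
w = np.full(n, 1.0 / n)          # start with uniform sample weights
stumps, alphas = [], []

for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_bin, y_bin, sample_weight=w)
    pred = stump.predict(X_bin)

    # Weighted error of this round, clipped to avoid division by zero.
    err = np.clip(w[pred != y_bin].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)

    # Increase the weight of misclassified samples, decrease the rest.
    w = w * np.exp(-alpha * y_bin * pred)
    w = w / w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of stump outputs.
score = sum(a * s.predict(X_bin) for a, s in zip(alphas, stumps))
y_hat = np.sign(score)
train_acc = (y_hat == y_bin).mean()
print("boosted stumps train accuracy: %.3f" % train_acc)
```

Unlike bagging, each round here depends on the previous one through the weights `w`, so the members cannot be trained in parallel.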

Originally published on 2018.06.29 on the author's personal site/blog.
