# sklearn in Practice: The Decision Tree Algorithm

### How Decision Trees Work

##### Algorithm Variants

• ID3 splits on information gain.
• C4.5 splits on information gain ratio.
• For classification, CART uses the Gini value as the node-splitting criterion.
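For intuition, the Gini value CART uses measures how mixed the classes in a node are: 0 for a pure node, approaching 1 as classes become evenly mixed. A minimal sketch of the computation in plain NumPy (not sklearn's internal code):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has Gini 0; a 50/50 binary node has Gini 0.5.
print(gini([0, 0, 0, 0]))  # 0.0
print(gini([0, 0, 1, 1]))  # 0.5
```

CART evaluates candidate splits by the weighted Gini of the resulting child nodes and picks the split that lowers it the most.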

### Hands-On: Titanic Survival Prediction

##### Data Import and Preprocessing

```python
import numpy as np
import pandas as pd

# Load the Titanic training set (file path assumed)
df = pd.read_csv('train.csv')
```

```python
# Drop columns that are hard to use as features
df.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

# Encode Sex as 1 (male) / 0 (female)
def f1(x):
    if x == 'male':
        return 1
    else:
        return 0

df['Sex'] = df['Sex'].apply(f1)
```

```python
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Inspect the distribution of Embarked before filling missing values
sns.countplot(x="Embarked", data=df)
```

```python
# Fill missing Embarked values with the most common port, then label-encode
df['Embarked'] = df['Embarked'].fillna('S')
labels = df['Embarked'].unique().tolist()
df['Embarked'] = df['Embarked'].apply(lambda n: labels.index(n))
```

```python
sns.set(style="darkgrid", palette="muted", color_codes=True)
# Plot the Age distribution over the non-null values
sns.distplot(df[df['Age'].notnull()]['Age'])
```

```python
# Fill missing ages with the mean, then verify no nulls remain
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Age'].isnull().sum()
```

##### Splitting the Dataset

```python
from sklearn.model_selection import train_test_split

# Features: every column except Survived; target: Survived
X = df.iloc[:, 1:]
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)
```

##### Model Training and Evaluation

• criterion: the splitting criterion, either information entropy ('entropy') or the Gini index ('gini'); the default is 'gini'.
• max_depth: the maximum depth of the tree.
• min_samples_split: the minimum number of samples a node must contain before it can be split; the default is 2.
• min_impurity_decrease: the impurity-decrease threshold a split must reach.
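As an illustration of these parameters (the values below are arbitrary placeholders, not tuned recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

# Example values only; in practice they should be tuned
clf = DecisionTreeClassifier(
    criterion='entropy',         # split on information gain instead of Gini
    max_depth=5,                 # limit tree depth to reduce overfitting
    min_samples_split=10,        # a node needs at least 10 samples to split
    min_impurity_decrease=0.01,  # require an impurity drop of at least 0.01
)
```

The training below uses the defaults; the grid search section then tunes several of these parameters at once.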

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# result
# 0.82122905027932958
```

```python
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation gives a more stable estimate than a single split
result = cross_val_score(clf, X, y, cv=10)
print(result.mean())

# result
# 0.772279536942
```

##### Model Tuning

```python
def cv_score(d):
    """Train a tree of depth d and return (train score, test score)."""
    clf = DecisionTreeClassifier(max_depth=d)
    clf.fit(X_train, y_train)
    tr_score = clf.score(X_train, y_train)
    cv_score = clf.score(X_test, y_test)
    return (tr_score, cv_score)

depths = range(2, 15)
scores = [cv_score(d) for d in depths]
tr_scores = [s[0] for s in scores]
cv_scores = [s[1] for s in scores]

best_score_index = np.argmax(cv_scores)
best_score = cv_scores[best_score_index]
best_param = depths[best_score_index]
print('best param: {0}; best score: {1}'.format(best_param, best_score))

plt.figure(figsize=(10, 6), dpi=144)
plt.grid()
plt.xlabel('max depth of decision tree')
plt.ylabel('score')
plt.plot(depths, cv_scores, '.g-', label='cross-validation score')
plt.plot(depths, tr_scores, '.r--', label='training score')
plt.legend()

# result
# best param: 11; best score: 0.8212290502793296
```

##### Grid Search

Manually looping over one parameter, as above, has two drawbacks:

• The result is unstable: a different train/test split can produce a different best parameter and score.
• It cannot tune multiple parameters conveniently: tuning several at once requires many nested loops and far more code.

GridSearchCV addresses both by cross-validating over a full parameter grid.

```python
from sklearn.model_selection import GridSearchCV

thresholds = np.linspace(0, 0.5, 50)
param_grid = {'criterion': ['gini', 'entropy'],
              'min_impurity_decrease': thresholds,
              'max_depth': range(2, 15)}

clf = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
clf.fit(X, y)

print("best param: {0}\nbest score: {1}".format(clf.best_params_,
                                                clf.best_score_))

# result
# best param: {'criterion': 'entropy', 'max_depth': 8, 'min_impurity_decrease': 0.0}
# best score: 0.8204264870931538
```
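After `fit`, GridSearchCV also refits a model on the full data with the winning parameters and exposes it as `best_estimator_`, ready for predictions. A self-contained sketch on sklearn's built-in iris dataset (used here only because the Titanic CSV is not bundled with sklearn):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': range(2, 6)}, cv=5)
search.fit(X, y)

# best_estimator_ is the tree refit on the full data with the best params
print(search.best_params_)
print(search.best_estimator_.predict(X[:3]))
```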
