# Common Techniques in Applying Machine Learning Algorithms - 1

### 1. Sampling

```python
import pandas as pd
from IPython.display import display

indices = [100, 200, 300]

# Drop the samples' original index and assign a fresh one
samples = pd.DataFrame(data.loc[indices], columns=data.keys()).reset_index(drop=True)
print("Chosen samples:")
display(samples)
```

### 2. Splitting the Data

```python
from sklearn.model_selection import train_test_split

X = new_data
y = data['Milk']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(len(X_train), len(X_test), len(y_train), len(y_test))
```

### Separating Features & Label

```python
# Store the 'Survived' label in a new variable and remove it from the dataset
outcomes = full_data['Survived']
data = full_data.drop('Survived', axis=1)
```

### 3. Train the Model on the Training Set, Evaluate on the Test Set

```python
from sklearn import tree

# Fit a decision-tree regressor on the training set, then score it on the test set
regressor = tree.DecisionTreeRegressor()
regressor = regressor.fit(X_train, y_train)
score = regressor.score(X_test, y_test)  # R^2 of the predictions
```

### 4. Measuring Correlation Between Features

`pd.plotting.scatter_matrix(data, alpha=0.3, figsize=(14, 8), diagonal='kde');`
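The scatter matrix gives a visual impression; for a numeric view of the same relationships, `DataFrame.corr` returns pairwise Pearson coefficients. A minimal sketch — the `Milk`/`Grocery` values below are made up for illustration, standing in for the post's `data`:

```python
import pandas as pd

# Hypothetical spending data standing in for the post's `data` DataFrame
data = pd.DataFrame({'Milk':    [9656, 1762, 2405, 3045],
                     'Grocery': [9568, 2405, 3516, 4142]})

# Pearson correlation matrix: values near +/-1 indicate strong linear association
corr = data.corr()
print(corr)
```

Values close to 1 here suggest the two features carry largely redundant information.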

### 5. Scaling

`pd.plotting.scatter_matrix(log_data, alpha=0.3, figsize=(14, 8), diagonal='kde');`

Comparison plot before and after scaling:
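The `log_data` plotted above is assumed to come from a natural-log transform of the raw spending data, a common way to tame heavily right-skewed features before plotting or clustering. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical raw spending data standing in for the post's `data`
data = pd.DataFrame({'Milk':  [9810, 1762, 2405],
                     'Fresh': [12669, 7057, 6353]})

# Natural-log transform compresses the long right tail of each feature
log_data = np.log(data)
print(log_data.round(2))
```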

### 6. Outliers

```python
import numpy as np

for feature in log_data.keys():
    # Tukey's rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    Q1 = np.percentile(log_data[feature], 25)
    Q3 = np.percentile(log_data[feature], 75)
    step = 1.5 * (Q3 - Q1)
    print("Outliers for feature '{}':".format(feature))
    print(Q1, Q3, step)
    display(log_data[~((log_data[feature] >= Q1 - step) &
                       (log_data[feature] <= Q3 + step))].sort_values(by=feature))
```

```python
import matplotlib.pyplot as plt

plt.figure()
plt.boxplot([log_data.Fresh, log_data.Milk, log_data.Grocery, log_data.Frozen,
             log_data.Detergents_Paper, log_data.Delicassen],
            notch=False, sym='gD');
```
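Once outliers are identified, a common follow-up (not shown in the snippets above) is to drop the flagged rows before modeling. A minimal sketch applying the same Tukey's rule to hypothetical log-scaled data:

```python
import numpy as np
import pandas as pd

# Hypothetical log-scaled data with one obvious outlier in 'Milk'
log_data = pd.DataFrame({'Milk': [8.1, 8.3, 8.2, 8.0, 2.0]})

outlier_idx = set()
for feature in log_data.keys():
    Q1 = np.percentile(log_data[feature], 25)
    Q3 = np.percentile(log_data[feature], 75)
    step = 1.5 * (Q3 - Q1)
    mask = ~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))
    outlier_idx.update(log_data.index[mask])

# Drop the flagged rows to obtain a cleaned dataset
good_data = log_data.drop(index=sorted(outlier_idx)).reset_index(drop=True)
print(len(good_data))
```

Collecting indices across all features first avoids dropping a row more than once when it is an outlier in several features.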
