GPU-Accelerated Data Analysis and Machine Learning

代码医生工作室

Published 2019-07-30 16:22:22

Included in the column: 相约机器人

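Introduction

GPU acceleration is becoming increasingly important. The two main drivers of this shift are:

The amount of data in the world is doubling every year [1].
Moore's law is nearing its end because of limits at the quantum scale [2].

As evidence of this shift, a growing number of online data science platforms now offer GPU solutions. Examples include Kaggle, Google Colaboratory, Microsoft Azure, and Amazon Web Services (AWS).

This article first introduces RAPIDS, NVIDIA's open-source Python libraries, and then gives a practical demonstration of how RAPIDS can speed up data analysis by up to 50 times.

All the code used in this article is available on GitHub and Google Colaboratory.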

https://github.com/pierpaolo28/Artificial-Intelligence-Projects/tree/master/NVIDIA-RAPIDS%20AI

https://colab.research.google.com/drive/1oEoAxBbZONUqm4gt9w2PIzmLTa7IjjV9

RAPIDS

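Many solutions for handling large volumes of data have been proposed in recent years. Examples include MapReduce, Hadoop, and Spark.

RAPIDS is designed as the next step in the evolution of data processing. Thanks to the Apache Arrow in-memory format, RAPIDS can deliver speedups of around 50x compared with Spark in-memory processing (Figure 1). It can also scale from a single GPU to multiple GPUs [3].

All RAPIDS libraries are Python-based and are designed with Pandas-like and Sklearn-like interfaces to ease adoption.

Figure 1: Evolution of data processing [3]

All RAPIDS packages are freely available through Anaconda, Docker, and cloud-based solutions such as Google Colaboratory.

RAPIDS is built from several libraries that together accelerate data science end to end (Figure 2). Its main components are:

cuDF = used for data processing tasks (Pandas-like).
cuML = used for building machine learning models (Sklearn-like).
cuGraph = used for graph tasks (graph theory).

RAPIDS also integrates with PyTorch and Chainer for deep learning, Kepler GL for visualization, and Dask for distributed computing [4].

Figure 2: RAPIDS architecture [3]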

Demonstration

I will now show how RAPIDS enables faster data analysis than Pandas and Sklearn. All the code used is available on Google Colaboratory.

https://colab.research.google.com/drive/1oEoAxBbZONUqm4gt9w2PIzmLTa7IjjV9

To use RAPIDS, first enable GPU mode in the Google Colaboratory notebook so that it runs on a Tesla T4 GPU, then install the required dependencies.
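
Before installing anything, it can help to confirm which GPU the notebook was actually given. The following is a minimal sketch, assuming the `nvidia-smi` command-line tool is on the PATH whenever a GPU runtime is active; the helper name `detect_gpu_name` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess

def detect_gpu_name():
    """Return the name of the first GPU reported by nvidia-smi, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this machine
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    if result.returncode != 0:
        return None
    names = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return names[0] if names else None

gpu_name = detect_gpu_name()
print("GPU:", gpu_name if gpu_name else "none detected")
```

If the printed name is not a Tesla T4, the runtime type probably needs to be changed before continuing.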

Preprocessing

Once everything is set up, import all the necessary libraries.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from xgboost import XGBClassifier
import cudf
import xgboost as xgb
from sklearn.metrics import accuracy_score

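This example shows how RAPIDS can speed up a machine learning workflow compared with using Sklearn alone. Here I decided to use Pandas for preprocessing in both the RAPIDS and the Sklearn analyses. The Google Colaboratory notebook also contains a second example that uses cuDF for preprocessing. Using cuDF instead of Pandas can speed up preprocessing, especially with large amounts of data.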
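
Because cuDF mirrors much of the Pandas API, simple preprocessing code can often be written once against whichever library is available. The sketch below is an illustration under that assumption, not the notebook's own code; `load_dataframe_backend` is a hypothetical helper, and only basic calls (`DataFrame` from a dict, column selection, `mean`) are used.

```python
import importlib

def load_dataframe_backend():
    """Return the cudf module if it imports, else pandas, else None."""
    for name in ("cudf", "pandas"):
        try:
            return importlib.import_module(name)
        except Exception:
            continue  # not installed, or cudf found no usable GPU
    return None

xdf = load_dataframe_backend()
if xdf is not None:
    frame = xdf.DataFrame({"X1": [2.0, 9.0], "X2": [1.0, 7.0], "Y": [0, 1]})
    # The same calls work with either backend.
    print(xdf.__name__, "mean of X1:", float(frame["X1"].mean()))
```

More advanced Pandas features are not guaranteed to have a cuDF equivalent, so real pipelines should be checked against the cuDF documentation.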

For this example, I build a simple dataset of three features and a binary label (0/1), drawn from Gaussian distributions.

# Create a simple two-class dataset from Gaussian distributions.
# The first half of Y is 0 and the second half is 1.
# For X1 and X2 the two halves are drawn with clearly different means,
# while X3 keeps the same mean and differs only in standard deviation.
# This makes the two classes easy to tell apart (the data is almost
# linearly separable).
dataset_len = 8000000
dlen = int(dataset_len/2)
X_11 = pd.Series(np.random.normal(2,2,dlen))
X_12 = pd.Series(np.random.normal(9,2,dlen))
X_1 = pd.concat([X_11, X_12]).reset_index(drop=True)
X_21 = pd.Series(np.random.normal(1,3,dlen))
X_22 = pd.Series(np.random.normal(7,3,dlen))
X_2 = pd.concat([X_21, X_22]).reset_index(drop=True)
X_31 = pd.Series(np.random.normal(3,1,dlen))
X_32 = pd.Series(np.random.normal(3,4,dlen))
X_3 = pd.concat([X_31, X_32]).reset_index(drop=True)
Y = pd.Series(np.repeat([0,1],dlen))
df = pd.concat([X_1, X_2, X_3, Y], axis=1)
df.columns = ['X1', 'X2', 'X3', 'Y']
df.head()

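The means and standard deviations were chosen to make this classification problem fairly easy (the data is almost linearly separable).

Figure 3: Sample dataset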
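
To see why the chosen parameters make the classes easy to separate, the per-class means can be compared on a much smaller sample. This is a standard-library sketch, not the notebook's code: the helper `make_feature`, the seed, and the sample size of 5000 per class are all assumptions made for illustration.

```python
import random
import statistics

def make_feature(rng, mean0, std0, mean1, std1, n_per_class):
    """Draw one feature: n_per_class values for label 0, then for label 1."""
    class0 = [rng.gauss(mean0, std0) for _ in range(n_per_class)]
    class1 = [rng.gauss(mean1, std1) for _ in range(n_per_class)]
    return class0 + class1

rng = random.Random(0)   # fixed seed, chosen only for repeatability
n = 5000                 # far smaller than the article's 4,000,000 per class
x1 = make_feature(rng, 2, 2, 9, 2, n)   # same (mean, std) pairs as X1 above
x3 = make_feature(rng, 3, 1, 3, 4, n)   # same (mean, std) pairs as X3 above
labels = [0] * n + [1] * n

# Difference between the class-1 mean and the class-0 mean of each feature.
x1_gap = statistics.mean(x1[n:]) - statistics.mean(x1[:n])
x3_gap = statistics.mean(x3[n:]) - statistics.mean(x3[:n])
print("X1 class-mean gap:", round(x1_gap, 2))
print("X3 class-mean gap:", round(x3_gap, 2))
```

The X1 gap should come out close to 7 and the X3 gap close to 0, which suggests that X1 carries most of the class signal while X3 contributes little through its mean.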

Once the dataset is created, I split it into features and labels and define a function to preprocess it.

X = df.drop(['Y'], axis=1).values
y = df['Y']

def preprocess(df, X, y, train_size=0.80):
  # LabelEncoder maps the label values to integer class indices.
  label_encoder = preprocessing.LabelEncoder()

  # Encode labels
  y = label_encoder.fit_transform(y)

  # Identify the shape and the index at which to split
  num_rows, num_columns = df.shape
  delim_index = int(num_rows * train_size)

  # Split the dataset into training and test sets.
  # Note: rows are split in order, without shuffling, so with this dataset
  # (first half 0, second half 1) the test set contains only label 1.
  X_train, y_train = X[:delim_index, :], y[:delim_index]
  X_test, y_test = X[delim_index:, :], y[delim_index:]

  # Check the dimensions of the sets
  print('X_train dimensions: ', X_train.shape, 'y_train: ', y_train.shape)
  print('X_test dimensions:', X_test.shape, 'y_test: ', y_test.shape)

  # Check the split sizes as percentages
  total = X_train.shape[0] + X_test.shape[0]
  print('X_train Percentage:', (X_train.shape[0]/total)*100, '%')
  print('X_test Percentage:', (X_test.shape[0]/total)*100, '%')

  return X_train, y_train, X_test, y_test

X_train, y_train, X_test, y_test = preprocess(df, X, y)
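
The function above splits the rows in their original order. Since the labels are laid out as a block of zeros followed by a block of ones, the effect of that choice can be illustrated on a toy list. This is a standard-library sketch; `shuffled_split` is a hypothetical alternative that the original notebook does not use.

```python
import random

def sequential_split(rows, train_size=0.80):
    """Split rows in their existing order, as the function above does."""
    delim_index = int(len(rows) * train_size)
    return rows[:delim_index], rows[delim_index:]

def shuffled_split(rows, train_size=0.80, seed=0):
    """Hypothetical alternative: shuffle a copy first, then split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return sequential_split(shuffled, train_size)

labels = [0] * 50 + [1] * 50   # toy stand-in for Y
seq_train, seq_test = sequential_split(labels)
shuf_train, shuf_test = shuffled_split(labels)

print("sequential test labels:", sorted(set(seq_test)))
print("shuffled test labels:  ", sorted(set(shuf_test)))
```

With the ordered split, the held-out rows all carry label 1, so the reported accuracy only measures how well that one class is recognized. Shuffling first would give a test set with both classes, at the cost of changing the numbers reported in this article.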

With the training and test sets in hand, we are finally ready to start on the machine learning. This example uses XGBoost (Extreme Gradient Boosting) as the classifier.

RAPIDS

To use XGBoost with RAPIDS, the training and test inputs first need to be converted into DMatrix form.

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

Next, the model can be trained.

%%time
 
# Initial xgb parameters
params = {}
 
clf = xgb.train(params, dtrain)

The output of the cell above is shown below. With the XGBoost library provided by RAPIDS, training the model took just under two minutes.

CPU times: user 1min 54s, sys: 307 ms, total: 1min 54s
Wall time: 1min 54s
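
The training cell above passes an empty parameter dictionary. In stock XGBoost builds, GPU training is normally requested explicitly through parameters, and the parameter names changed between major versions. The sketch below only builds such a dictionary; the helper `gpu_train_params` is hypothetical, and whether the RAPIDS build used in the notebook needs these settings is not stated in the source.

```python
def gpu_train_params(xgboost_version):
    """Return training parameters that request GPU histogram tree building.

    Assumption: `xgboost_version` is a string such as "1.7.6" or "2.0.3".
    """
    major = int(xgboost_version.split(".")[0])
    if major >= 2:
        # XGBoost 2.x selects the GPU with the `device` parameter.
        return {"tree_method": "hist", "device": "cuda"}
    # Earlier releases used a dedicated GPU tree method.
    return {"tree_method": "gpu_hist"}

print(gpu_train_params("1.7.6"))
print(gpu_train_params("2.0.3"))
```

The resulting dictionary could be passed as the first argument of `xgb.train` in place of the empty `params`, with the installed version taken from `xgb.__version__`.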

The RAPIDS XGBoost library also provides a very handy function for ranking and plotting the importance of each feature in the dataset (Figure 4).

# Feature Importance plot!
xgb.plot_importance(clf)

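This is very useful for reducing the dimensionality of the data. Selecting the most important features and training the model on them lowers the risk of overfitting and also shortens training time.

Figure 4: XGBoost feature importance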
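
The selection step described above can be sketched as a simple ranking over importance scores. The scores below are hypothetical and only shaped like the dictionary that a trained booster's `get_score()` method returns (feature name mapped to a number); `top_features` is an illustrative helper, not part of XGBoost.

```python
def top_features(importance, k):
    """Return the k feature names with the highest importance scores."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical scores, for illustration only.
example_scores = {"f0": 310.0, "f1": 245.0, "f2": 42.0}
selected = top_features(example_scores, k=2)
print(selected)  # ['f0', 'f1']
```

The selected names could then be used to pick the matching columns before retraining on the reduced feature set.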

Finally, the accuracy of the classifier can be computed.

rapids_pred = clf.predict(dtest)
 
rapids_pred = np.round(rapids_pred)
rapids_acc = round(accuracy_score(y_test, rapids_pred), 2)
print("XGB accuracy using RAPIDS:", rapids_acc*100, '%')

The overall accuracy of the model trained with RAPIDS is 98%.

XGB accuracy using RAPIDS: 98.0 %
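
The accuracy code above rounds the model's predictions before comparing them with the labels, which suggests that the predictions are continuous scores rather than hard 0/1 labels. A minimal pure-Python sketch of the same calculation, using made-up labels and scores and a hypothetical helper `rounded_accuracy`:

```python
def rounded_accuracy(y_true, scores):
    """Round each score to the nearest integer label and return the hit rate."""
    predictions = [int(round(score)) for score in scores]
    hits = sum(1 for truth, pred in zip(y_true, predictions) if truth == pred)
    return hits / len(y_true)

# Made-up labels and scores, for illustration only.
toy_labels = [0, 0, 1, 1, 1]
toy_scores = [0.10, 0.62, 0.91, 0.48, 0.77]
toy_acc = rounded_accuracy(toy_labels, toy_scores)
print("accuracy:", toy_acc)  # accuracy: 0.6
```

Rounding at 0.5 acts as the decision threshold here: the second and fourth toy scores fall on the wrong side of it, so three of the five predictions match.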

Sklearn

The same analysis is now repeated using plain Sklearn.

%%time
 
model = XGBClassifier()
model.fit(X_train, y_train)

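This time, training the model took 11 minutes. For a problem of this size, Sklearn is therefore about 5.8 times slower than RAPIDS (662 s / 114 s). Using cuDF instead of Pandas in the preprocessing stage could further reduce the execution time of the whole workflow in this example.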

CPU times: user 11min 2s, sys: 594 ms, total: 11min 3s
Wall time: 11min 2s
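
The ratio quoted above follows directly from the two reported wall times. As a quick check, using a small hypothetical helper `to_seconds`:

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds wall time to seconds."""
    return minutes * 60 + seconds

rapids_s = to_seconds(1, 54)    # Wall time: 1min 54s
sklearn_s = to_seconds(11, 2)   # Wall time: 11min 2s
speedup = sklearn_s / rapids_s

print(rapids_s, sklearn_s, round(speedup, 1))  # 114 662 5.8
```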

Finally, the overall accuracy of the model trained with Sklearn is computed.

sk_pred = model.predict(X_test)
sk_pred = np.round(sk_pred)
sk_acc = round(accuracy_score(y_test, sk_pred), 2)
print("XGB accuracy using Sklearn:", sk_acc*100, '%')

In this case too, the overall accuracy is 98%. This means that RAPIDS delivers faster results without compromising model accuracy.