
Your Guide to DL with MLSQL Stack (3)

Author: 用户2936994
Published: 2019-05-17 14:40:22
Collected in the column: 祝威廉

This is the third article in the Your Guide to DL with MLSQL Stack series. We hope this series shows you how MLSQL Stack helps people do AI work.

As we saw in the previous posts, MLSQL Stack lets you use both its built-in algorithms and Python ML frameworks. Being able to use Python ML frameworks means you are free to use deep learning tools such as PyTorch and TensorFlow. This time, however, we will first show you how to use the built-in DL framework, BigDL, to accomplish an image classification task.

Requirements

This guide requires MLSQL Stack 1.3.0-SNAPSHOT. You can set up MLSQL Stack using one of the following options. We recommend deploying MLSQL Stack locally.

  1. Docker
  2. Manually Compile
  3. Prebuilt Distribution

If you run into any problems when deploying, please let me know, and feel free to file an issue at this link.

Project Structure

I have created a project named store1, which has a directory called image_classify containing all the MLSQL scripts we discuss today. It looks like this:

image.png

We will teach you how to build the project step by step.

Upload Image

First, download the raw CIFAR-10 images from https://github.com/allwefantasy/spark-deep-learning-toy/releases/download/v0.01/cifar.tgz. Gunzip the archive and make sure the result is a plain tar file.

Although MLSQL Console supports directory uploading, the huge number of files in this directory will crash the upload component in the web page. We hope to fix this issue in the future. For now, you can work around the crash by packaging the directory as a tar file.
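
If you prefer to do the decompression step in a script rather than on the command line, here is a minimal Python sketch. The function name gunzip_to_tar and the file names in the usage note are illustrative assumptions, not part of MLSQL Stack.

```python
import gzip
import shutil


def gunzip_to_tar(src_path: str, dst_path: str) -> None:
    """Decompress a gzip-compressed archive (e.g. cifar.tgz) into a plain tar file."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Stream in chunks so the whole archive never has to fit in memory.
        shutil.copyfileobj(src, dst)
```

Calling gunzip_to_tar("cifar.tgz", "cifar.tar") would leave a cifar.tar next to the download, which is the file to upload in the console.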

image.png

Then save the uploaded tar file to your home directory:

-- download cifar data from https://github.com/allwefantasy/spark-deep-learning-toy/releases/download/v0.01/cifar.tgz
!fs -mkdir -p /tmp/cifar;
!saveUploadFileToHome /cifar.tar /tmp/cifar;

The console shows a real-time log indicating that the system is extracting the images.

image.png

This may take a while because there are about 60,000 pictures.

Set up some paths

We create an env.mlsql file that contains the path-related variables:

set basePath="/tmp/cifar"; 
set labelMappingPath = "${basePath}/si";
set trainDataPath = "${basePath}/cifar_train_data";
set testDataPath = "${basePath}/cifar_test_data";
set modelPath = "${basePath}/bigdl";

The other scripts include this script to get all these paths.
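
To make the effect of the ${basePath} references concrete, here is a rough Python sketch that resolves the same variables with string.Template. It only mirrors the substitution idea. The helper name resolve_paths is hypothetical, and MLSQL's own variable resolution may differ in detail.

```python
from string import Template

# The same variable definitions as in env.mlsql, expressed as templates.
PATH_TEMPLATES = {
    "labelMappingPath": "${basePath}/si",
    "trainDataPath": "${basePath}/cifar_train_data",
    "testDataPath": "${basePath}/cifar_test_data",
    "modelPath": "${basePath}/bigdl",
}


def resolve_paths(base_path: str) -> dict:
    """Return every path variable with ${basePath} replaced by base_path."""
    resolved = {"basePath": base_path}
    for name, template in PATH_TEMPLATES.items():
        resolved[name] = Template(template).substitute(basePath=base_path)
    return resolved
```

With a base path of /tmp/cifar, modelPath resolves to /tmp/cifar/bigdl, which is the path passed to the model history command later in this article.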

Resize the pictures

We want to resize the images to 28*28, which you can do with the ET ImageLoaderExt. Here is how we use it:

include store1.`alg.image_classify.env.mlsql`;

-- {} or {number} is used as parameter holder.
set imageResize='''
run command as ImageLoaderExt.`/tmp/cifar/cifar/{}` where 
and code="
    def apply(params:Map[String,String]) = {
         Resize(28, 28) ->
          MatToTensor() -> ImageFrameToSample()
      }
"
as {}
''';

-- train should be quoted because it's a keyword.
!imageResize "train" data;
!imageResize test testData;

Because both the train and test datasets need resizing, we wrap the resize code as a command to avoid duplication, then use that command to process each dataset separately.
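
The {} and {number} holders behave much like positional placeholders in Python's str.format, so the idea can be sketched as follows. This is an analogy under that assumption, not MLSQL's actual implementation. expand_command is a hypothetical helper and the templates are simplified.

```python
def expand_command(template: str, *args: str) -> str:
    """Fill positional holders in a command template with the given arguments.

    '{}' holders are filled in order; '{0}', '{1}', ... are filled by index,
    so the same argument can appear more than once.
    """
    return template.format(*args)


# Holders filled in order, like `!extractLabel data newdata;`
SEQUENTIAL = "select labelStr, features from {} as {}"

# Holders filled by index, so the first argument can be reused.
INDEXED = "train {0} as ...; predict {0} as ...; select ... as {1}"
```

For example, expand_command(SEQUENTIAL, "data", "newdata") yields a statement that reads from data and registers the result as newdata.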

Extract label

For example, when we see the following path, we know that the picture shows a frog, so we should extract frog from the path.

/tmp/cifar/cifar/train/38189_frog.png

Again, we wrap the SQL as a command and process the train and test data separately.

set extractLabel='''
-- extract the string label (e.g. frog) from the image file name
select split(split(imageName,"_")[1],"\\.")[0] as labelStr,features from {} as {}
''';

!extractLabel data newdata;
!extractLabel testData newTestData;
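
The nested split in the SQL above can be traced with a small Python equivalent. This is only an illustration of the string logic. extract_label is a hypothetical name, and it assumes file names follow the <number>_<label>.png pattern shown earlier.

```python
def extract_label(image_name: str) -> str:
    """Mirror split(split(imageName, "_")[1], "\\.")[0] from the SQL."""
    after_underscore = image_name.split("_")[1]  # e.g. "frog.png"
    return after_underscore.split(".")[0]        # e.g. "frog"
```

Note that the SQL split takes a regular expression, which is why the dot is escaped there, while Python's str.split treats its argument literally.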

We then convert the label to a number and add 1, because BigDL expects labels to start from 1 instead of 0.

set numericLabel='''
train {0} as StringIndex.`/tmp/cifar/si` where inputCol="labelStr" and outputCol="labelIndex" as newdata1;
predict {0} as StringIndex.`/tmp/cifar/si` as newdata2;
select (cast(labelIndex as int) + 1) as label,features from newdata2 as {1}
''';

!numericLabel newdata trainData;
!numericLabel newTestData testData;
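
As a rough illustration of what string indexing followed by the +1 shift produces, here is a plain-Python sketch. It assumes the index is assigned by descending label frequency with ties broken alphabetically. That ordering is an assumption about the indexer, and the function names are hypothetical.

```python
from collections import Counter


def fit_label_index(labels):
    """Map each distinct string label to a 0-based index.

    Assumed ordering: most frequent label first, ties broken alphabetically.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def to_bigdl_label(label, mapping):
    """Shift the 0-based index by one so that labels start from 1."""
    return mapping[label] + 1
```

In this sketch the mapping is fitted once and then reused for every row, which keeps the numeric ids consistent between the train and test splits.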

Save what we have so far

We save the processed data so that we can reuse it later without running the steps above again:

save overwrite trainData as parquet.`${trainDataPath}`;
save overwrite testData as parquet.`${testDataPath}`;

Train the images with DL

We create a new script file named classify_train.mlsql. First, we load the data and convert the label to an array:

include store1.`alg.image_classify.env.mlsql`;

load parquet.`${trainDataPath}` as tmpTrainData;
load parquet.`${testDataPath}` as tmpTestData;

select array(cast(label as float)) as label,features from tmpTrainData as trainData;
select array(cast(label as float)) as label,features from tmpTestData as testData;

Finally, we use our algorithm to train on the data:

train trainData as BigDLClassifyExt.`${modelPath}` where
disableSparkLog = "true"
and fitParam.0.featureSize="[3,28,28]"
and fitParam.0.classNum="10"
and fitParam.0.maxEpoch="300"

-- print evaluate message
and fitParam.0.evaluate.trigger.everyEpoch="true"
and fitParam.0.evaluate.batchSize="1000"
and fitParam.0.evaluate.table="testData"
and fitParam.0.evaluate.methods="Loss,Top1Accuracy"
-- for unbalanced class 
-- and fitParam.0.criterion.classWeight="[......]"
and fitParam.0.code='''
                   def apply(params:Map[String,String])={
                        val model = Sequential()
                        model.add(Reshape(Array(3, 28, 28), inputShape = Shape(28, 28, 3)))
                        model.add(Convolution2D(6, 5, 5, activation = "tanh").setName("conv1_5x5"))
                        model.add(MaxPooling2D())
                        model.add(Convolution2D(12, 5, 5, activation = "tanh").setName("conv2_5x5"))
                        model.add(MaxPooling2D())
                        model.add(Flatten())
                        model.add(Dense(100, activation = "tanh").setName("fc1"))
                        model.add(Dense(params("classNum").toInt, activation = "softmax").setName("fc2"))
                    }
'''
;

In the code block, we use Keras-style code to build our model, and we give the system some information, e.g. how many classes there are and what the feature size is.
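
To see how the feature size flows through the layers, the shape arithmetic can be sketched in Python. This assumes the convolutions use no padding with stride 1 and that MaxPooling2D uses a 2x2 window. Those defaults are assumptions about the Keras-style API and are not stated in this article.

```python
def conv_out(size: int, kernel: int, stride: int = 1) -> int:
    """Output size of a convolution without padding."""
    return (size - kernel) // stride + 1


def pool_out(size: int, pool: int = 2) -> int:
    """Output size of non-overlapping max pooling."""
    return size // pool


def trace_shapes(height: int = 28, width: int = 28, class_num: int = 10):
    """Return (layer name, output shape) pairs for the model defined above."""
    shapes = [("input", (3, height, width))]
    h, w = conv_out(height, 5), conv_out(width, 5)
    shapes.append(("conv1_5x5", (6, h, w)))
    h, w = pool_out(h), pool_out(w)
    shapes.append(("pool1", (6, h, w)))
    h, w = conv_out(h, 5), conv_out(w, 5)
    shapes.append(("conv2_5x5", (12, h, w)))
    h, w = pool_out(h), pool_out(w)
    shapes.append(("pool2", (12, h, w)))
    shapes.append(("flatten", (12 * h * w,)))
    shapes.append(("fc1", (100,)))
    shapes.append(("fc2", (class_num,)))
    return shapes
```

Under these assumptions, a 3x28x28 input shrinks to 12x4x4 before the Flatten layer, so fc1 receives 192 values.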

If this training stage takes too long, you can decrease fitParam.0.maxEpoch to a smaller value.

The console prints messages during training:

image.png

and finally the validation result:

image.png

Use the model command to check the model's training history:

!model history /tmp/cifar/bigdl;

Here is the result:

image.png

Register the model as a function

Now that we have built our model, let us learn how to predict the label of an image. First, we load some data:

include store1.`alg.image_classify.env.mlsql`;

load parquet.`${trainDataPath}` as tmpTrainData;
load parquet.`${testDataPath}` as tmpTestData;

select array(cast(label as float)) as label,features from tmpTrainData as trainData;
select array(cast(label as float)) as label,features from tmpTestData as testData;

Now we can register the model as a function:

register BigDLClassifyExt.`${modelPath}` as cifarPredict;

Finally, we can use the function to predict a new picture:

select
vec_argmax(cifarPredict(vec_dense(to_array_double(features)))) as predict_label,
label from testData limit 10 
as output;
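
The query above takes the model output, a vector of per-class scores, and picks the position of the largest value. A minimal Python sketch of that arg-max step follows. The argmax function here is an illustrative stand-in, not the MLSQL vec_argmax implementation.

```python
def argmax(scores):
    """Return the 0-based position of the largest score (first one on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])
```

The index in this sketch is 0-based, while the label column was shifted to start from 1 earlier, so it is worth checking how predict_label lines up with label in your own output.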

Of course, you can also predict a whole table:

predict testData as BigDLClassifyExt.`${modelPath}` as predictdata;

Why BigDL

GPUs are very expensive, and a company normally already has lots of CPUs. Making full use of these CPUs can save a lot of money.

Originally published: 2019.05.16