Your Guide to NLP with MLSQL Stack (一)

Author: 用户2936994
Published: 2019-05-15 15:21:14
Column: 祝威廉

End2End NLP with MLSQL Stack

MLSQL stack supports a complete pipeline of train/predict. This means the following steps can be in the same script:

  1. collect data
  2. preprocess data
  3. train
  4. predict

Also, since any model and preprocessing ET can be registered as a function, you can reuse all of them in the Predict Service without writing any extra code.

Requirements

This guide requires MLSQL Stack 1.3.0-SNAPSHOT. You can set up the MLSQL stack by following the links below. We recommend deploying it locally.

  1. Docker
  2. Manually Compile
  3. Prebuild Distribution

If you run into any problems while deploying, please let me know; feel free to file an issue at this link.

Data Preparation

In this article we will work with Chinese text.

Download sogou news from this site: news_sohusite.


Upload file to MLSQL Stack File Server

Upload news_sohusite_xml.full.tar to MLSQL Stack file server, just drag the file to the upload area:


Once done, the web UI will indicate success by showing that one file has been uploaded.


Download the file and save to your home

In order to read this file, we should save it to our home directory. Use a command like the following:

-----------------------------------------
-- Download from file server.
-- run command as DownloadExt.`` where 
-- from="public/SogouCS.reduced.tar" and
-- to="/tmp/nlp/sogo";
-- or you can use command line.
-----------------------------------------

!saveUploadFileToHome public/SogouCS.reduced.tar /tmp/nlp/sogo;

Check if the file has been created:

!fs -ls /tmp/nlp/sogo;

Good, it has been created successfully.

Found 1 items
-rw-r--r--   1 allwefantasy admin 1537763850 2019-05-09 16:59 /tmp/nlp/sogo/news_sohusite_xml.dat

Load the xml data

MLSQL stack supports many data sources, including XML, and news_sohusite_xml.dat is in XML format. We can use a load statement to load the data:

-- load data with xml format
load xml.`/tmp/nlp/sogo/news_sohusite_xml.dat` where rowTag="doc" and charset="GBK" as xmlData; 
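Conceptually, rowTag="doc" tells the loader to treat every `<doc>` element as one row. A rough Python analogue, using a made-up sample record in the same shape as the Sogou file (the real file is GBK-encoded and much larger), looks like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the same shape as news_sohusite_xml.dat:
# each record is a <doc> element with <url> and <content> children.
sample = """<docs>
  <doc>
    <url>http://sports.sohu.com/20070422/n249599819.shtml</url>
    <content>some sports news</content>
  </doc>
</docs>"""

root = ET.fromstring(sample)
# Each <doc> becomes one row, mirroring rowTag="doc" in the load statement.
rows = [
    {"url": doc.findtext("url"), "content": doc.findtext("content")}
    for doc in root.iter("doc")
]
print(rows[0]["url"])
```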

Note that you can select any single statement, execute it, and check whether the result is what you expect.


Extract label from URL

The URL looks like this:

http://sports.sohu.com/20070422/n249599819.shtml

We need to extract sports from it, which means this article belongs to the sports category.

select temp.* from (select split(split(url,"/")[2],"\\.")[0] as labelStr,content from xmlData) as temp 
where temp.labelStr is not null 
as rawData;
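The nested split expression reads more easily when written out in plain Python; this sketch does exactly what split(split(url,"/")[2],"\\.")[0] does:

```python
# The MLSQL expression split(split(url,"/")[2],"\\.")[0] in plain Python:
# take the host part of the URL, then its first dot-separated component.
def extract_label(url: str) -> str:
    host = url.split("/")[2]   # "sports.sohu.com"
    return host.split(".")[0]  # "sports"

print(extract_label("http://sports.sohu.com/20070422/n249599819.shtml"))
```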

The label we extract from the URL is a string, but the RandomForest algorithm requires an integer label. Here we use StringIndex to build the mapping between strings and numbers:

train rawData as StringIndex.`/tmp/nlp/label_mapping` where inputCol="labelStr" and
outputCol="label";

Now we can convert all string labels to integer labels:

predict rawData as StringIndex.`/tmp/nlp/label_mapping` as rawDataWithLabel;

Note that we also register this model as a function, because we will need to convert numbers back to strings later in the predict stage. This is easy to do:

register StringIndex.`/tmp/nlp/label_mapping` as convert_label;
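To make the train/predict/register trio concrete, here is a minimal sketch of the idea behind StringIndex (not its actual implementation): build a string-to-int mapping, apply it, and keep the inverse around for the registered function:

```python
# Minimal sketch of what StringIndex does: fit a string->int mapping
# (the "train" step), apply it (the "predict" step), and invert it
# (what the registered convert_label function uses via convert_label_r).
labels = ["sports", "news", "sports", "business"]

mapping = {}
for s in labels:  # fit: assign one index per distinct label
    mapping.setdefault(s, len(mapping))

inverse = {i: s for s, i in mapping.items()}

encoded = [mapping[s] for s in labels]   # string -> number
decoded = [inverse[i] for i in encoded]  # number -> string

print(encoded, decoded == labels)
```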

Split the dataset

Sometimes we need to reduce the dataset because of limited resources. In other scenarios, we may need to split the data into train/validate/test sets. Both can be done with the ET RateSampler. In MLSQL, many ETs also have an easier-to-use form, which we call command line style. Here are the ET style and the command line style.

ET Style:

run xmlData as RateSampler.`` 
where labelCol="url" and sampleRate="0.9,0.1" 
as xmlDataArray;

Command Line Style:

!split rawDataWithLabel by label with "0.9,0.1" named xmlDataArray;

Now we have split the dataset of each category 0.9/0.1. To speed things up, we use only the 10% portion.

select * from xmlDataArray where __split__=1 as miniXmlData;
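The key property of this split is that it is done per label, so every category keeps the same 0.9/0.1 ratio. A toy sketch of that behavior (the group-by-label loop and the __split__ tag are the ideas, not RateSampler's actual code):

```python
import random

# Sketch of what !split ... by label with "0.9,0.1" does: split each
# label group independently at the given rates, tagging rows with __split__.
def rate_sample(rows, label_key, rates, seed=0):
    rnd = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    out = []
    for group in by_label.values():
        rnd.shuffle(group)
        cut = int(len(group) * rates[0])
        for i, row in enumerate(group):
            out.append({**row, "__split__": 0 if i < cut else 1})
    return out

rows = [{"label": l, "content": str(i)}
        for i, l in enumerate(["a"] * 10 + ["b"] * 10)]
split = rate_sample(rows, "label", (0.9, 0.1))
print(sum(r["__split__"] == 1 for r in split))  # one row per label group
```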

Save what we have so far (Optional)

save overwrite miniXmlData as parquet.`/tmp/nlp/miniXmlData`;
load parquet.`/tmp/nlp/miniXmlData` as miniXmlData;

This avoids recomputation every time we want miniXmlData. In production, you may prefer the cache (memory and disk), which you can use like this:

!cache miniXmlData script;

You do not need to release it manually; the MLSQL Engine will take care of it.

Use TF/IDF to process content

train miniXmlData as TfIdfInPlace.`/tmp/nlp/tfidf` where inputCol="content" as trainData;

Again, register the model as a function:

register TfIdfInPlace.`/tmp/nlp/tfidf` as tfidf_predict;
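As a refresher on what TfIdfInPlace produces, here is a toy tf-idf computation over a shared vocabulary. This is only the core formula; the real ET also handles Chinese word segmentation and uses its own smoothing, which this sketch does not reproduce:

```python
import math
from collections import Counter

# Toy tf-idf: turn each document's words into weights over a shared vocabulary.
docs = [["good", "news"], ["bad", "news"], ["good", "game"]]

vocab = sorted({w for d in docs for w in d})
df = Counter(w for d in docs for w in set(d))  # document frequency per word
n = len(docs)

def tfidf(doc):
    tf = Counter(doc)
    # term frequency * smoothed inverse document frequency, per vocab slot
    return [tf[w] / len(doc) * math.log((n + 1) / (df[w] + 1)) for w in vocab]

vec = tfidf(docs[0])
print(len(vec) == len(vocab))
```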

Save what we have so far (Optional)

save overwrite trainData as parquet.`/tmp/nlp/trainData`;
load parquet.`/tmp/nlp/trainData` as trainData;

Again, you can cache the trainData.

Cut the feature size

The feature size generated by the TfIdfInPlace ET is greater than 600,000, which will slow down training, so we use vec_range to take a subrange of the vector:

select vec_range(content,array(0,10000)) as content,label from trainData as trainData;

There are many vector-related functions in MLSQL; check here if you are interested.
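The semantics of vec_range are simple: keep only the dimensions inside the given range. With a dict-of-index sparse representation (an assumption for illustration, not MLSQL's internal format), it is just a filter:

```python
# vec_range(content, array(0,10000)) keeps only dimensions [0, 10000)
# of a sparse vector, re-based to the start of the range.
def vec_range(sparse, start, end):
    return {i - start: v for i, v in sparse.items() if start <= i < end}

v = {3: 0.5, 9999: 0.1, 600000: 0.7}
print(vec_range(v, 0, 10000))  # the 600000th dimension is dropped
```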

Train RandomForest

train trainData as RandomForest.`/tmp/nlp/rf` where 
keepVersion="true" 
and fitParam.0.featuresCol="content" 
and fitParam.0.labelCol="label"
and fitParam.0.maxDepth="4"
and fitParam.0.checkpointInterval="100"
and fitParam.0.numTrees="4"
;

You can use fitParam groups to configure multiple groups of parameters, like this:

train trainData as RandomForest.`/tmp/nlp/rf` where 
keepVersion="true" 
and fitParam.0.featuresCol="content" 
and fitParam.0.labelCol="label"
and fitParam.0.maxDepth="4"
and fitParam.0.checkpointInterval="100"
and fitParam.0.numTrees="4"
and fitParam.1.featuresCol="content" 
and fitParam.1.labelCol="label"
and fitParam.1.maxDepth="3"
and fitParam.1.checkpointInterval="100"
and fitParam.1.numTrees="10"
;

Then the MLSQL Engine will generate two models.
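The fitParam.N.key naming scheme is just a flat encoding of one config dict per model. A sketch of how such keys could be expanded into parameter groups (the parsing code is illustrative, not the engine's):

```python
# Expand flat fitParam.N.key settings into one parameter dict per group;
# each group yields one trained model.
params = {
    "fitParam.0.maxDepth": "4", "fitParam.0.numTrees": "4",
    "fitParam.1.maxDepth": "3", "fitParam.1.numTrees": "10",
}

groups = {}
for key, value in params.items():
    _, idx, name = key.split(".", 2)  # "fitParam", group index, param name
    groups.setdefault(int(idx), {})[name] = value

print(len(groups), groups[1]["numTrees"])
```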


Register the model as a function:

register RandomForest.`/tmp/nlp/rf` as rf_predict;

Predict

This is an end-to-end predict; you can also deploy it as an API service. Do not forget to subrange the TF/IDF feature:

select convert_label_r(vec_argmax(rf_predict(vec_range(tfidf_predict("新闻不错"),array(0,10000))))) as predicted as output;

As you can see, we use all the functions registered earlier to convert raw data all the way to the final string category. The steps are clear:

  1. use tfidf_predict to generate vector
  2. use vec_range to subrange the vector
  3. use rf_predict to get the number category
  4. use convert_label_r to convert the number back to a string
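The steps above are plain function composition. With stand-in implementations for the registered functions (the return values below are fabricated placeholders, not real model output), the chain reads exactly like the SQL expression:

```python
# Stand-ins for the registered functions; only the composition is the point.
def tfidf_predict(text):  # placeholder: text -> sparse vector
    return {0: 0.2, 1: 0.8, 20000: 0.5}

def vec_range(v, start, end):
    return {i: x for i, x in v.items() if start <= i < end}

def rf_predict(v):        # placeholder: vector -> class scores
    return [0.1, 0.7, 0.2]

def vec_argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

def convert_label_r(i):   # placeholder inverse label mapping
    return {0: "business", 1: "news", 2: "sports"}[i]

predicted = convert_label_r(
    vec_argmax(rf_predict(vec_range(tfidf_predict("新闻不错"), 0, 10000))))
print(predicted)
```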


Most of the time you will train several times; if you want to see the history, use a command like this:

!model history /tmp/nlp/rf;


How to deploy API service

Just start the MLSQL Engine in local mode, and then you can POST to http://127.0.0.1:9003/model/predict with the following params:

dataType=row
data=[{"content":"新闻不错"}]
sql=select convert_label_r(vec_argmax(rf_predict(vec_range(tfidf_predict(content),array(0,10000))))) as predicted
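Those params can be sent as a form-encoded POST from any HTTP client. A sketch with Python's standard library (the endpoint and parameter names come from the text above; the response format is not shown here, so the actual network call is left commented out):

```python
import json
import urllib.parse
import urllib.request

# Form-encode the three parameters described in the text.
payload = urllib.parse.urlencode({
    "dataType": "row",
    "data": json.dumps([{"content": "新闻不错"}]),
    "sql": ('select convert_label_r(vec_argmax(rf_predict('
            'vec_range(tfidf_predict(content),array(0,10000))))) as predicted'),
}).encode("utf-8")

req = urllib.request.Request("http://127.0.0.1:9003/model/predict", data=payload)
# with urllib.request.urlopen(req) as resp:   # requires a running engine
#     print(resp.read().decode("utf-8"))
print(req.get_method())  # providing data makes this a POST
```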

That's All.

Bonus

Thanks to the include statement and the script store support, if you have set up the MLSQL stack, you can use the script from the store immediately:

set inputDir="/tmp/nlp/sogo/news_sohusite_xml.dat";
set outputDir="/tmp/nlp2";

include store.`/alg/text_classify.mlsql`;

!textClassify "${inputDir}" "${outputDir}";
!textPredict "新闻很不错";

The MLSQL Engine will download the script from repo.store.mlsql.tech automatically. Any script you have written can be wrapped as a command and used by others.

The Final Complete Script

-----------------------------------------
-- Download from file server.
-- run command as DownloadExt.`` where 
-- from="public/SogouCS.reduced.tar" and
-- to="/tmp/nlp/sogo";
-- or you can use command line.
-----------------------------------------

!saveUploadFileToHome public/SogouCS.reduced.tar /tmp/nlp/sogo;


-- load data with xml format
load xml.`/tmp/nlp/sogo/news_sohusite_xml.dat` where rowTag="doc" and charset="GBK" as xmlData; 


--extract `sports` from url[http://sports.sohu.com/20070422/n249599819.shtml]
select temp.* from (select split(split(url,"/")[2],"\\.")[0] as labelStr,content from xmlData) as temp 
where temp.labelStr is not null 
as rawData;


-- Tips:
----------------------------------------------------------------------------------
-- Try the following SQL to explore how many labels we have and what they look like.
--
-- select distinct(split(split(url,"/")[2],"\\.")[0]) as labelStr from rawData as output;
-- select split(split(url,"/")[2],"\\.")[0] as labelStr,url from rawData as output;
----------------------------------------------------------------------------------

-- the label we extract from the url is a string, but the RandomForest algorithm
-- requires integer labels. Here we use StringIndex to implement this.
-- train a model which can map label to number and vice versa
train rawData as StringIndex.`/tmp/nlp/label_mapping` where inputCol="labelStr" and
outputCol="label";

-- convert label to number 
predict rawData as StringIndex.`/tmp/nlp/label_mapping` as rawDataWithLabel;


-- you can use register to convert a model to a function
register StringIndex.`/tmp/nlp/label_mapping` as convert_label; 



-- we can reduce the dataset: with too much data and limited resources,
-- training may take too long. You can use the command line,
-- or you can use the raw ET:
--
-- run xmlData as RateSampler.`` 
-- where labelCol="url" and sampleRate="0.9,0.1" 
-- as xmlDataArray;
!split rawDataWithLabel by label with "0.9,0.1" named xmlDataArray;
-- then we fetch the split with position one to get the 10% portion.
select * from xmlDataArray where __split__=1 as miniXmlData;

-- we can save the result data, because computing it takes a long time.
save overwrite miniXmlData as parquet.`/tmp/nlp/miniXmlData`;

load parquet.`/tmp/nlp/miniXmlData` as miniXmlData;
-- select * from miniXmlData limit 10 as output;

--convert the content to tfidf format
train miniXmlData as TfIdfInPlace.`/tmp/nlp/tfidf` where inputCol="content" as trainData;
-- again, register the model as a function
register TfIdfInPlace.`/tmp/nlp/tfidf` as tfidf_predict;


save overwrite trainData as parquet.`/tmp/nlp/trainData`;
load parquet.`/tmp/nlp/trainData` as trainData;

-- the feature size generated by tfidf is greater than 600,000, which will slow down training;
-- here we use vec_range to subrange the vector.
select vec_range(content,array(0,10000)) as content,label from trainData as trainData;

-- use the RandomForest algorithm to train 
-- you can use fitParam groups to configure multiple groups of params
train trainData as RandomForest.`/tmp/nlp/rf` where 
keepVersion="true" 
and fitParam.0.featuresCol="content" 
and fitParam.0.labelCol="label"
and fitParam.0.maxDepth="4"
and fitParam.0.checkpointInterval="100"
and fitParam.0.numTrees="4"
;

-- register the RF model as a function
register RandomForest.`/tmp/nlp/rf` as rf_predict;

-- end to end predict; you can also deploy this as an API service
-- do not forget to subrange the tfidf feature
select convert_label_r(vec_argmax(rf_predict(vec_range(tfidf_predict("新闻不错"),array(0,10000))))) as predicted as output;

-- !model history /tmp/nlp/rf;
Originally published on the author's personal site/blog: 2019.05.12