Your Guide to NLP with MLSQL Stack (一)

Author: 用户2936994
Published: 2019-05-15 15:21:14
Column: 祝威廉

End2End NLP with MLSQL Stack

MLSQL stack supports a complete pipeline of train/predict. This means the following steps can be in the same script:

  1. collect data
  2. preprocess data
  3. train
  4. predict

Also, since any model and preprocessing ET can be registered as a function, you can reuse all of them in the Predict Service without writing any extra code.

Requirements

This guide requires MLSQL Stack 1.3.0-SNAPSHOT. You can set up the MLSQL stack by following the links below. We recommend deploying it locally.

  1. Docker
  2. Manually Compile
  3. Prebuild Distribution

If you run into any problems while deploying, please let me know; feel free to file an issue at this link.

Data Preparation

In this article we will work with Chinese text.

Download sogou news from this site: news_sohusite.


Upload file to MLSQL Stack File Server

Upload news_sohusite_xml.full.tar to MLSQL Stack file server, just drag the file to the upload area:


Once done, the web UI will indicate success by showing that one file has been uploaded.


Download the file and save to your home

In order to read this file, we should save it to our home directory. Use a command like the following:

-----------------------------------------
-- Download from file server.
-- run command as DownloadExt.`` where 
-- from="public/SogouCS.reduced.tar" and
-- to="/tmp/nlp/sogo";
-- or you can use command line.
-----------------------------------------

!saveUploadFileToHome public/SogouCS.reduced.tar /tmp/nlp/sogo;

Check if the file has been created:

!fs -ls /tmp/nlp/sogo;

Good, it has been created successfully.

Found 1 items
-rw-r--r--   1 allwefantasy admin 1537763850 2019-05-09 16:59 /tmp/nlp/sogo/news_sohusite_xml.dat

Load the xml data

MLSQL stack supports many data sources, including XML, and news_sohusite_xml.dat is in XML format. We can use a load statement to load the data:

-- load data with xml format
load xml.`/tmp/nlp/sogo/news_sohusite_xml.dat` where rowTag="doc" and charset="GBK" as xmlData; 
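Conceptually, rowTag="doc" tells the loader to treat every `<doc>` element as one row. A rough Python analogue, using a made-up sample record in the same shape as the Sogou file (the real file is GBK-encoded and much larger), looks like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the same shape as news_sohusite_xml.dat:
# each record is a <doc> element with <url> and <content> children.
sample = """<docs>
  <doc>
    <url>http://sports.sohu.com/20070422/n249599819.shtml</url>
    <content>some sports news</content>
  </doc>
</docs>"""

root = ET.fromstring(sample)
# Each <doc> becomes one row, mirroring rowTag="doc" in the load statement.
rows = [
    {"url": doc.findtext("url"), "content": doc.findtext("content")}
    for doc in root.iter("doc")
]
print(rows[0]["url"])
```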

Note that you can select any single statement, execute it, and check whether the result is what you expect.


Extract label from URL

The URL looks like this:

http://sports.sohu.com/20070422/n249599819.shtml

We need to extract sports from it, which means this article belongs to the sports category.

select temp.* from (select split(split(url,"/")[2],"\\.")[0] as labelStr,content from xmlData) as temp 
where temp.labelStr is not null 
as rawData;
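The nested split expression reads more easily when written out in plain Python; this sketch does exactly what split(split(url,"/")[2],"\\.")[0] does:

```python
# The MLSQL expression split(split(url,"/")[2],"\\.")[0] in plain Python:
# take the host part of the URL, then its first dot-separated component.
def extract_label(url: str) -> str:
    host = url.split("/")[2]   # "sports.sohu.com"
    return host.split(".")[0]  # "sports"

print(extract_label("http://sports.sohu.com/20070422/n249599819.shtml"))
```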

The label we extract from the URL is a string, but the RandomForest algorithm requires an integer label. Here we use StringIndex to build the mapping between strings and numbers:

train rawData as StringIndex.`/tmp/nlp/label_mapping` where inputCol="labelStr" and
outputCol="label";

Now we can convert all string labels to integer labels:

predict rawData as StringIndex.`/tmp/nlp/label_mapping` as rawDataWithLabel;

Note that we also register this model as a function, because we will need to convert numbers back to strings later in the predict stage. This is easy to do:

register StringIndex.`/tmp/nlp/label_mapping` as convert_label;
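To make the train/predict/register trio concrete, here is a minimal sketch of the idea behind StringIndex (not its actual implementation): build a string-to-int mapping, apply it, and keep the inverse around for the registered function:

```python
# Minimal sketch of what StringIndex does: fit a string->int mapping
# (the "train" step), apply it (the "predict" step), and invert it
# (what the registered convert_label function uses via convert_label_r).
labels = ["sports", "news", "sports", "business"]

mapping = {}
for s in labels:  # fit: assign one index per distinct label
    mapping.setdefault(s, len(mapping))

inverse = {i: s for s, i in mapping.items()}

encoded = [mapping[s] for s in labels]   # string -> number
decoded = [inverse[i] for i in encoded]  # number -> string

print(encoded, decoded == labels)
```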

Split the dataset

Sometimes we need to reduce the dataset because of limited resources. In other scenarios, we may need to split the data into train/validate/test sets. Both can be done with the ET RateSampler. In MLSQL, many ETs also have an easier-to-use form, which we call command line style. Here are the ET style and the command line style.

ET Style:

run xmlData as RateSampler.`` 
where labelCol="url" and sampleRate="0.9,0.1" 
as xmlDataArray;

Command Line Style:

!split rawDataWithLabel by label with "0.9,0.1" named xmlDataArray;

Now we have split the dataset of each category 0.9/0.1. To speed things up, we use only the 10% portion.

select * from xmlDataArray where __split__=1 as miniXmlData;
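The key property of this split is that it is done per label, so every category keeps the same 0.9/0.1 ratio. A toy sketch of that behavior (the group-by-label loop and the __split__ tag are the ideas, not RateSampler's actual code):

```python
import random

# Sketch of what !split ... by label with "0.9,0.1" does: split each
# label group independently at the given rates, tagging rows with __split__.
def rate_sample(rows, label_key, rates, seed=0):
    rnd = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    out = []
    for group in by_label.values():
        rnd.shuffle(group)
        cut = int(len(group) * rates[0])
        for i, row in enumerate(group):
            out.append({**row, "__split__": 0 if i < cut else 1})
    return out

rows = [{"label": l, "content": str(i)}
        for i, l in enumerate(["a"] * 10 + ["b"] * 10)]
split = rate_sample(rows, "label", (0.9, 0.1))
print(sum(r["__split__"] == 1 for r in split))  # one row per label group
```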

Save what we have so far (Optional)

save overwrite miniXmlData as parquet.`/tmp/nlp/miniXmlData`;
load parquet.`/tmp/nlp/miniXmlData` as miniXmlData;

This avoids recomputation every time we want miniXmlData. In production, you may prefer the cache (memory and disk), which you can use like this:

!cache miniXmlData script;

You do not need to release it manually; the MLSQL Engine will take care of it.

Use TF/IDF to process content

train miniXmlData as TfIdfInPlace.`/tmp/nlp/tfidf` where inputCol="content" as trainData;

Again, register the model as a function:

register TfIdfInPlace.`/tmp/nlp/tfidf` as tfidf_predict;
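As a refresher on what TfIdfInPlace produces, here is a toy tf-idf computation over a shared vocabulary. This is only the core formula; the real ET also handles Chinese word segmentation and uses its own smoothing, which this sketch does not reproduce:

```python
import math
from collections import Counter

# Toy tf-idf: turn each document's words into weights over a shared vocabulary.
docs = [["good", "news"], ["bad", "news"], ["good", "game"]]

vocab = sorted({w for d in docs for w in d})
df = Counter(w for d in docs for w in set(d))  # document frequency per word
n = len(docs)

def tfidf(doc):
    tf = Counter(doc)
    # term frequency * smoothed inverse document frequency, per vocab slot
    return [tf[w] / len(doc) * math.log((n + 1) / (df[w] + 1)) for w in vocab]

vec = tfidf(docs[0])
print(len(vec) == len(vocab))
```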

Save what we have so far (Optional)

save overwrite trainData as parquet.`/tmp/nlp/trainData`;
load parquet.`/tmp/nlp/trainData` as trainData;

Again, you can cache the trainData.

Cut the feature size

The feature size generated by the TfIdfInPlace ET is greater than 600,000, which will slow down training, so we use vec_range to take a subrange of the vector:

select vec_range(content,array(0,10000)) as content,label from trainData as trainData;

There are many vector-related functions in MLSQL; check here if you are interested.
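The semantics of vec_range are simple: keep only the dimensions inside the given range. With a dict-of-index sparse representation (an assumption for illustration, not MLSQL's internal format), it is just a filter:

```python
# vec_range(content, array(0,10000)) keeps only dimensions [0, 10000)
# of a sparse vector, re-based to the start of the range.
def vec_range(sparse, start, end):
    return {i - start: v for i, v in sparse.items() if start <= i < end}

v = {3: 0.5, 9999: 0.1, 600000: 0.7}
print(vec_range(v, 0, 10000))  # the 600000th dimension is dropped
```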

Train RandomForest

train trainData as RandomForest.`/tmp/nlp/rf` where 
keepVersion="true" 
and fitParam.0.featuresCol="content" 
and fitParam.0.labelCol="label"
and fitParam.0.maxDepth="4"
and fitParam.0.checkpointInterval="100"
and fitParam.0.numTrees="4"
;

You can use fitParam groups to configure multiple groups of parameters, like this:

train trainData as RandomForest.`/tmp/nlp/rf` where 
keepVersion="true" 
and fitParam.0.featuresCol="content" 
and fitParam.0.labelCol="label"
and fitParam.0.maxDepth="4"
and fitParam.0.checkpointInterval="100"
and fitParam.0.numTrees="4"
and fitParam.1.featuresCol="content" 
and fitParam.1.labelCol="label"
and fitParam.1.maxDepth="3"
and fitParam.1.checkpointInterval="100"
and fitParam.1.numTrees="10"
;

Then the MLSQL Engine will generate two models.
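The fitParam.N.key naming scheme is just a flat encoding of one config dict per model. A sketch of how such keys could be expanded into parameter groups (the parsing code is illustrative, not the engine's):

```python
# Expand flat fitParam.N.key settings into one parameter dict per group;
# each group yields one trained model.
params = {
    "fitParam.0.maxDepth": "4", "fitParam.0.numTrees": "4",
    "fitParam.1.maxDepth": "3", "fitParam.1.numTrees": "10",
}

groups = {}
for key, value in params.items():
    _, idx, name = key.split(".", 2)  # "fitParam", group index, param name
    groups.setdefault(int(idx), {})[name] = value

print(len(groups), groups[1]["numTrees"])
```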


Register the model as a function:

register RandomForest.`/tmp/nlp/rf` as rf_predict;

Predict

This is an end-to-end predict; you can also deploy it as an API service. Do not forget to subrange the TF/IDF feature:

select convert_label_r(vec_argmax(rf_predict(vec_range(tfidf_predict("新闻不错"),array(0,10000))))) as predicted as output;

As you can see, we use all the functions registered earlier to convert raw data all the way to the final string category. The steps are clear:

  1. use tfidf_predict to generate vector
  2. use vec_range to subrange the vector
  3. use rf_predict to get the number category
  4. use convert_label_r to convert the number back to a string
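The steps above are plain function composition. With stand-in implementations for the registered functions (the return values below are fabricated placeholders, not real model output), the chain reads exactly like the SQL expression:

```python
# Stand-ins for the registered functions; only the composition is the point.
def tfidf_predict(text):  # placeholder: text -> sparse vector
    return {0: 0.2, 1: 0.8, 20000: 0.5}

def vec_range(v, start, end):
    return {i: x for i, x in v.items() if start <= i < end}

def rf_predict(v):        # placeholder: vector -> class scores
    return [0.1, 0.7, 0.2]

def vec_argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

def convert_label_r(i):   # placeholder inverse label mapping
    return {0: "business", 1: "news", 2: "sports"}[i]

predicted = convert_label_r(
    vec_argmax(rf_predict(vec_range(tfidf_predict("新闻不错"), 0, 10000))))
print(predicted)
```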


Most of the time you will train several times; if you want to see the history, use a command like this:

!model history /tmp/nlp/rf;


How to deploy API service

Just start the MLSQL Engine in local mode, and then you can POST to http://127.0.0.1:9003/model/predict with the following params:

dataType=row
data=[{"content":"新闻不错"}]
sql=select convert_label_r(vec_argmax(rf_predict(vec_range(tfidf_predict(content),array(0,10000))))) as predicted
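Those params can be sent as a form-encoded POST from any HTTP client. A sketch with Python's standard library (the endpoint and parameter names come from the text above; the response format is not shown here, so the actual network call is left commented out):

```python
import json
import urllib.parse
import urllib.request

# Form-encode the three parameters described in the text.
payload = urllib.parse.urlencode({
    "dataType": "row",
    "data": json.dumps([{"content": "新闻不错"}]),
    "sql": ('select convert_label_r(vec_argmax(rf_predict('
            'vec_range(tfidf_predict(content),array(0,10000))))) as predicted'),
}).encode("utf-8")

req = urllib.request.Request("http://127.0.0.1:9003/model/predict", data=payload)
# with urllib.request.urlopen(req) as resp:   # requires a running engine
#     print(resp.read().decode("utf-8"))
print(req.get_method())  # providing data makes this a POST
```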

That's All.

Bonus

Thanks to the include statement and the script store support, if you have set up the MLSQL stack, you can use the script from the store immediately:

set inputDir="/tmp/nlp/sogo/news_sohusite_xml.dat";
set outputDir="/tmp/nlp2";

include store.`/alg/text_classify.mlsql`;

!textClassify "${inputDir}" "${outputDir}";
!textPredict "新闻很不错";

The MLSQL Engine will download the script from repo.store.mlsql.tech automatically. Any script you have written can be wrapped as a command and used by others.

The Final Complete Script

-----------------------------------------
-- Download from file server.
-- run command as DownloadExt.`` where 
-- from="public/SogouCS.reduced.tar" and
-- to="/tmp/nlp/sogo";
-- or you can use command line.
-----------------------------------------

!saveUploadFileToHome public/SogouCS.reduced.tar /tmp/nlp/sogo;


-- load data with xml format
load xml.`/tmp/nlp/sogo/news_sohusite_xml.dat` where rowTag="doc" and charset="GBK" as xmlData; 


--extract `sports` from url[http://sports.sohu.com/20070422/n249599819.shtml]
select temp.* from (select split(split(url,"/")[2],"\\.")[0] as labelStr,content from xmlData) as temp 
where temp.labelStr is not null 
as rawData;


-- Tips:
----------------------------------------------------------------------------------
-- Try the following SQL to explore how many labels we have and what they look like.
--
-- select distinct(split(split(url,"/")[2],"\\.")[0]) as labelStr from rawData as output;
-- select split(split(url,"/")[2],"\\.")[0] as labelStr,url from rawData as output;
----------------------------------------------------------------------------------

-- the label we extract from the url is a string, but the RandomForest algorithm
-- requires integer labels. Here we use StringIndex to implement this.
-- train a model which can map label to number and vice versa
train rawData as StringIndex.`/tmp/nlp/label_mapping` where inputCol="labelStr" and
outputCol="label";

-- convert label to number 
predict rawData as StringIndex.`/tmp/nlp/label_mapping` as rawDataWithLabel;


-- you can use register to convert a model to a function
register StringIndex.`/tmp/nlp/label_mapping` as convert_label; 



-- we can reduce the dataset: with too much data and limited resources,
-- training may take too long. You can use the command line,
-- or you can use the raw ET:
--
-- run xmlData as RateSampler.`` 
-- where labelCol="url" and sampleRate="0.9,0.1" 
-- as xmlDataArray;
!split rawDataWithLabel by label with "0.9,0.1" named xmlDataArray;
-- then we fetch the split with position one to get the 10% portion.
select * from xmlDataArray where __split__=1 as miniXmlData;

-- we can save the result data, because computing it takes a long time.
save overwrite miniXmlData as parquet.`/tmp/nlp/miniXmlData`;

load parquet.`/tmp/nlp/miniXmlData` as miniXmlData;
-- select * from miniXmlData limit 10 as output;

--convert the content to tfidf format
train miniXmlData as TfIdfInPlace.`/tmp/nlp/tfidf` where inputCol="content" as trainData;
-- again, register the model as a function
register TfIdfInPlace.`/tmp/nlp/tfidf` as tfidf_predict;


save overwrite trainData as parquet.`/tmp/nlp/trainData`;
load parquet.`/tmp/nlp/trainData` as trainData;

-- the feature size generated by tfidf is greater than 600,000, which will slow down training;
-- here we use vec_range to subrange the vector.
select vec_range(content,array(0,10000)) as content,label from trainData as trainData;

-- use the RandomForest algorithm to train 
-- you can use fitParam groups to configure multiple groups of params
train trainData as RandomForest.`/tmp/nlp/rf` where 
keepVersion="true" 
and fitParam.0.featuresCol="content" 
and fitParam.0.labelCol="label"
and fitParam.0.maxDepth="4"
and fitParam.0.checkpointInterval="100"
and fitParam.0.numTrees="4"
;

-- register the RF model as a function
register RandomForest.`/tmp/nlp/rf` as rf_predict;

-- end to end predict; you can also deploy this as an API service
-- do not forget to subrange the tfidf feature
select convert_label_r(vec_argmax(rf_predict(vec_range(tfidf_predict("新闻不错"),array(0,10000))))) as predicted as output;

-- !model history /tmp/nlp/rf;
Originally published on the author's personal site/blog: 2019.05.12