Create a natural language classifier that identifies spam

With the advent of cognitive computing and smart machines, machine learning and its related algorithms and techniques are incredibly important. We can use machine learning to help us understand and extract useful insights from an abundance of ever-evolving data. Machine learning can be used to recognize and identify complex patterns, generate predictions, learn over time, and ultimately help us to make better, more informed decisions.

IBM now provides a number of cognitive services on Bluemix. In this article, I introduce the Watson Natural Language Classifier service. Watson Natural Language Classifier is a machine-learning classifier that combines complex convolutional neural networks with a sophisticated language model to learn and understand language. Despite its internal complexity, Watson Natural Language Classifier is very easy to use.

In this article, we will create a spam classifier app by creating a new instance of Watson Natural Language Classifier, training it to distinguish between spam and non-spam, and testing its accuracy.

What you'll need

To build your own spam classifier that uses Watson Natural Language Classifier and Bluemix, you'll need the following accounts or resources:

  • A Bluemix account.
  • An IBM DevOps Services account.
  • curl, a command-line tool for transferring data with URL syntax.
  • A Python interpreter.

Training the Watson Natural Language Classifier service

To use the Watson Natural Language Classifier service to sort spam, we first need to train it. Training uses a known set of labeled observations to build a model that is capable of generating reasonable predictions. This type of algorithm is known as a supervised learning algorithm.

A labeled observation is nothing more than a feature vector and a label for that vector. In the case of Watson Natural Language Classifier, each observation consists of some text (instead of a feature vector) and a class label, as shown in Listing 1:

Listing 1. Labeled observation in Watson Natural Language Classifier
Label       Text
spam        Join xyz.com NOW!!!! And WIN $1,000,000!!!!!!
ham         Hi Mom. I hope dinner went well with Auntie Jane. Love you.

Now, let's get started.

Step 1. Create a Watson Natural Language Classifier service

Our first step is to create an instance of Watson Natural Language Classifier, which will become our Watson Natural Language Classifier service.

  1. Log in to your Bluemix account (or sign up for a free trial).
  2. Navigate to the Bluemix Catalog, refine your search by Watson, and select Watson Natural Language Classifier.
  3. On the right, under Add Service, make sure that Leave unbound is specified for the App, specify a name for your service (for example, spam_classifier), and click CREATE.
  4. Navigate back to the Bluemix dashboard, scroll down to your services, and click the new service instance.
  5. Select Service Credentials to see the Watson Natural Language Classifier service URL and credentials.

Make a note of these credentials; you'll use them later in this tutorial.

Step 2. Set up your development environment

Next, you'll need to gather some files from the WatsonNLCSpam project repository:

  1. Clone the WatsonNLCSpam Git repository.
    1. In your terminal, enter the following command: git clone https://hub.jazz.net/git/dimascio/WatsonNLCSpam
    2. When prompted, enter your IBM ID and password.
  2. Review the repository contents:
    • README.md describes the project.
    • data contains the training data set (SpamHam-Train.csv) and the test data set (SpamHam-Test.json) for the spam classifier.
    • spam.py is the script that is used to perform a basic accuracy test.
    • web contains the source code for the sample web application.

About the training data

The training data for the spam classifier is contained in the file SpamHam-Train.csv, which holds 90 percent of the original data set; the other 10 percent is set aside as the test set. The file is formatted as CSV in a structure that is compatible with Watson Natural Language Classifier: each line contains text,label.

Review the following sample data that is taken from SpamHam-Train.csv:

"=Bring home some Wendy =D",ham
"100 dating service cal;l 09064012103 box334sk38ch",spam
"Whatsup there. Dont u want to sleep",ham
"""Are you comingdown later?""",ham
"Alright i have a new goal now",ham

This classifier is fairly simple. Two characteristics make it simple:

  1. It is a binary classifier, in that it has two classes: spam and ham.
  2. Each observation is associated with a single class: "Alright i have a new goal now",ham

Watson Natural Language Classifier does support more than two classes, as well as observations that carry more than one class label, but I am not taking advantage of those capabilities in this tutorial.
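To get a feel for the text,label format before uploading it, here is a minimal Python sketch that parses the five sample rows shown above with the standard csv module and counts the labels. The helper name read_observations is my own for illustration; it is not part of the WatsonNLCSpam repository, and it assumes exactly one label per row, as in this tutorial.

```python
import csv
import io
from collections import Counter

# The five sample rows from SpamHam-Train.csv shown above.
SAMPLE = '''"=Bring home some Wendy =D",ham
"100 dating service cal;l 09064012103 box334sk38ch",spam
"Whatsup there. Dont u want to sleep",ham
"""Are you comingdown later?""",ham
"Alright i have a new goal now",ham
'''


def read_observations(lines):
    """Yield (text, label) pairs from text,label CSV rows."""
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        text, label = row  # assumes a single label per observation
        yield text, label


observations = list(read_observations(io.StringIO(SAMPLE)))
label_counts = Counter(label for _, label in observations)
print(label_counts)
```

Note how the fourth row uses doubled quotes: the csv module decodes them, so the text of that observation keeps its surrounding quotation marks.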

About the training metadata

Training metadata is data that describes the training data. The training metadata indicates the target language (in this case, English, en) and the classifier's name (in this case, Spam Ham). The training metadata must be formatted as JSON. Here's an example:

{
  "language":"en",
  "name":"Spam Ham"
}
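If you would rather not escape the quotation marks by hand, the same metadata can be generated from a Python dictionary. This is only a small sketch; the variable names are mine.

```python
import json

# The two fields described above: target language and classifier name.
metadata = {"language": "en", "name": "Spam Ham"}

# json.dumps produces a string suitable for the training_metadata form field.
training_metadata = json.dumps(metadata)
print(training_metadata)  # {"language": "en", "name": "Spam Ham"}
```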

About the test data

After a classifier is trained, it is important to test its accuracy. SpamHam-Test.json is the file that contains the test data. Like the training data, the test data is a set of labeled observations, for instance, "I love you, mom!",ham. Test observations are not included in the training data set and thus can be used to evaluate the accuracy of our classifier.

It is considered bad practice to reuse training observations to test the accuracy of a classifier. Because training observations are "seen" during training, reusing them during testing will likely lead to an overly optimistic accuracy result.

Watson Natural Language Classifier is a perfect example. If all test observations are seen during training, then the accuracy of Watson Natural Language Classifier equals 100 percent. We are ultimately interested in the accuracy of the classifier given unseen observations, however. This testing will give us a better sense of how the classifier will generalize to new observations.
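The repository already ships with the data split 90/10, and I don't describe here how that split was produced. As an illustration of the idea only, a simple random hold-out split could be sketched like this in Python (split_observations is a hypothetical helper, and the seeded shuffle is my assumption):

```python
import random


def split_observations(observations, test_fraction=0.1, seed=0):
    """Shuffle the observations and hold out a fraction of them for testing.

    Returns (train, test). No observation appears in both lists, so the
    test set stays unseen during training.
    """
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    test_size = int(round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]
```

Fixing the seed makes the split reproducible, which matters when you want to compare accuracy numbers across training runs.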

The provided test data, SpamHam-Test.json, is structured differently from the provided training data, SpamHam-Train.csv. Each line in the test data is a JSON object that represents a single labeled observation. The test data can just as easily be formatted as CSV. Later in this tutorial, I'll demonstrate how to use this test data in JSON format with spam.py to calculate the accuracy of the spam classifier.

Step 3. Create and train the spam classifier

In the first step, we created an instance of Watson Natural Language Classifier to serve as our Watson Natural Language Classifier service. Now, we'll use that service to create a spam classifier.

Creating the classifier is easy. All we need to do is POST to the /v1/classifiers endpoint by using the following curl command:

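curl -X POST -u <username>:<password> -F training_data=@SpamHam-Train.csv -F training_metadata="{\"language\":\"en\",\"name\":\"My Classifier\"}" "<url>/v1/classifiers"

This curl command requires a <username>, <password>, and <url>. Replace these placeholders with the service credentials values that you noted in Step 1; the <url> looks like, for example, https://gateway.watsonplatform.net/natural-language-classifier/api. Run the command from the directory that contains SpamHam-Train.csv (the data directory of the repository). The name in training_metadata is the classifier's name; you can substitute Spam Ham for My Classifier.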

After you invoke this command, make a note of the classifier_id in the response. We'll use it shortly. Training the classifier can take up to 30 minutes, so now is a good time for a break.
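For readers who prefer Python to curl, the following sketch builds the same multipart POST with only the standard library. It constructs the request but does not send it. The function name build_create_request is hypothetical; the form field names training_data and training_metadata are the ones used in the curl command above.

```python
import base64
import json
import uuid
import urllib.request


def build_create_request(url, username, password, csv_bytes, metadata):
    """Build (but do not send) the multipart POST that creates a classifier."""
    boundary = uuid.uuid4().hex
    parts = [
        # Part 1: the training CSV, sent as a file upload.
        (
            'Content-Disposition: form-data; name="training_data"; '
            'filename="SpamHam-Train.csv"\r\n'
            "Content-Type: text/csv\r\n\r\n"
        ).encode("utf-8") + csv_bytes,
        # Part 2: the training metadata, sent as a JSON string.
        (
            'Content-Disposition: form-data; name="training_metadata"\r\n\r\n'
            + json.dumps(metadata)
        ).encode("utf-8"),
    ]
    delimiter = ("--" + boundary + "\r\n").encode("utf-8")
    body = b"".join(delimiter + part + b"\r\n" for part in parts)
    body += ("--" + boundary + "--\r\n").encode("utf-8")

    # HTTP basic authentication, equivalent to curl's -u option.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url.rstrip("/") + "/v1/classifiers",
        data=body,
        method="POST",
        headers={
            "Authorization": "Basic " + token,
            "Content-Type": "multipart/form-data; boundary=" + boundary,
        },
    )
```

To actually submit the request, you would pass the returned object to urllib.request.urlopen and parse the JSON body of the response to read the classifier_id.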

Step 4. Check the training progress

To find out whether the classifier is ready to use, we can invoke the Watson Natural Language Classifier endpoint by using the following GET request:

curl -u <username>:<password> "<url>/v1/classifiers/<classifier_id>"

When the classifier indicates that it's ready to use, move on to the next step.
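Rather than re-running the GET request by hand, you could poll in a loop. The sketch below is written under two assumptions that this article does not confirm: that the response is a JSON object with a status field, and that the value of that field is Available once training has finished. Check the actual response from your service and adjust accordingly. The function wait_until_available and its parameters are hypothetical.

```python
import time


def wait_until_available(fetch_status, poll_seconds=60, max_polls=60, sleep=time.sleep):
    """Poll until the classifier reports that it is ready, or give up.

    fetch_status is any callable that returns the parsed JSON body of
    GET <url>/v1/classifiers/<classifier_id> as a dict.
    """
    for attempt in range(max_polls):
        info = fetch_status()
        # Assumption: a "status" field that reads "Available" when ready.
        if info.get("status") == "Available":
            return True
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # wait before asking again
    return False
```

Passing the fetch function in as a parameter keeps the loop independent of any particular HTTP library and makes it easy to try out with canned responses.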

Step 5. Try out the spam classifier

Now that we've trained the spam classifier service, we can take it for a trial run. As a quick exercise, make the following POST request to the /classify endpoint:

curl -X POST -u <username>:<password> -H "Content-Type:application/json" -d "{\"text\":\"I love you mom\"}" "<url>/v1/classifiers/<classifier_id>/classify"

Alternatively, you can make a GET request:

curl -G -u <username>:<password> "<url>/v1/classifiers/<classifier_id>/classify" --data-urlencode "text=what is your phone number?"

The <classifier_id> is the ID that was returned by the POST to /v1/classifiers in Step 3 (it looks something like 6C76AF-nlc-43). If you forgot that ID, you can retrieve it with the following curl command, which returns a list of all of your classifiers:

curl -u <username>:<password> "<url>/v1/classifiers"
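The POST variant of the classify call translates to Python roughly as follows. This is a sketch that uses only the standard library and builds the request without sending it; build_classify_request is a hypothetical name, not a function from spam.py.

```python
import base64
import json
import urllib.request


def build_classify_request(url, username, password, classifier_id, text):
    """Build (but do not send) the JSON POST to the /classify endpoint."""
    # HTTP basic authentication, equivalent to curl's -u option.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{url.rstrip('/')}/v1/classifiers/{classifier_id}/classify",
        data=json.dumps({"text": text}).encode("utf-8"),  # same body as curl -d
        method="POST",
        headers={
            "Authorization": "Basic " + token,
            "Content-Type": "application/json",
        },
    )
```

Sending the request with urllib.request.urlopen returns a JSON document describing the predicted class; inspect it once with curl to see which fields your version of the service returns before relying on any of them in code.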

Step 6. Test the classifier's accuracy

Finally, to test our classifier and calculate its accuracy, we will use the provided Python script, spam.py. The script invokes the same POST request that is described in the previous step for each test observation, then counts the number of predictions that match the observation's label. Accuracy is calculated by dividing the number of correct predictions by the total number of test observations.

Let's run the script.

  • Open spam.py and update YOUR_CLASSIFIER_ID, YOUR_CLASSIFIER_USERNAME, and YOUR_CLASSIFIER_PASSWORD to refer to your classifier ID (from Step 3) and your Watson Natural Language Classifier service credentials (from Step 1).
  • In the project directory, run the following command: python spam.py

When the script completes, you should see the following output:

accuracy: 0.993079584775
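The accuracy calculation itself is simple enough to sketch in a few lines. The code below is not the contents of spam.py; it is an illustrative version in which the call to the service is replaced by a toy stand-in function (toy_classify, my own invention) so that it runs without credentials.

```python
def accuracy(observations, classify):
    """Fraction of (text, label) pairs whose predicted class equals the label."""
    observations = list(observations)
    if not observations:
        raise ValueError("need at least one test observation")
    correct = sum(1 for text, label in observations if classify(text) == label)
    return correct / len(observations)


def toy_classify(text):
    # Hypothetical stand-in for the real /classify call.
    return "spam" if "WIN" in text else "ham"


test_set = [
    ("Join xyz.com NOW!!!! And WIN $1,000,000!!!!!!", "spam"),
    ("Hi Mom. I hope dinner went well with Auntie Jane. Love you.", "ham"),
    ("100 dating service cal;l 09064012103 box334sk38ch", "spam"),
    ("Alright i have a new goal now", "ham"),
]
print(accuracy(test_set, toy_classify))  # 0.75 (the toy rule misses one spam)
```

With the real service, classify would send each text to the /classify endpoint and return the predicted class.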

Conclusion

Watson Natural Language Classifier brings a sophisticated machine-learning classifier to Bluemix. Its approachable and intuitive REST interface makes it very easy for developers of all backgrounds to quickly train, test, and apply new classifiers to real-world problems.

In this article you've learned how to use Watson Natural Language Classifier to build, train, and test a spam classifier. We can't wait to see what you do with it!

Originally published on the WeChat public account 智能计算时代 (intelligentinterconn)

Original publication date: 2015-10-30
