I. Dataset

```
// strip the header row first: sed 1d train.tsv > train_noheader.tsv
// load the cleaned file (assumes the spark-shell SparkContext `sc`)
val rawData = sc.textFile("train_noheader.tsv")
val records = rawData.map(line => line.split("\t"))
records.first
// Array[String] = Array("http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html", "4042", ...
```

II. Linear Models

1. Feature Extraction

```
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val data = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}
data.cache

val numData = data.count
// numData: Long = 7395

// note that some of our data contains negative feature values.
// For naive Bayes we convert these to zeros
```
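The per-row cleanup (quote stripping, label from the last column, features from columns 4 to n-1 with "?" mapped to 0.0) can be tried on a single row without Spark. The row below is made up for illustration; real rows in the dataset have many more columns:

```
// hypothetical short row; the real data has many more columns
val row = Array("\"http://example.com\"", "\"4042\"", "\"boilerplate\"", "\"business\"",
  "\"0.5\"", "\"?\"", "\"1\"")
val trimmed = row.map(_.replaceAll("\"", ""))  // strip the surrounding quotes
val label = trimmed(row.size - 1).toInt        // last column is the label
val features = trimmed.slice(4, row.size - 1)  // numeric columns only
  .map(d => if (d == "?") 0.0 else d.toDouble) // "?" marks a missing value
// label: Int = 1
// features: Array[Double] = Array(0.5, 0.0)
```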

2. Logistic Regression

```
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val numIterations = 10
val lrModel = LogisticRegressionWithSGD.train(data, numIterations)
```

```
// make prediction on a single data point
val dataPoint = data.first
// dataPoint: org.apache.spark.mllib.regression.LabeledPoint = LabeledPoint(0.0, [0.789131,2.055555556,0.676470588, ...
val prediction = lrModel.predict(dataPoint.features)
// prediction: Double = 1.0
val trueLabel = dataPoint.label
// trueLabel: Double = 0.0
val predictions = lrModel.predict(data.map(lp => lp.features))
predictions.take(5)
// res1: Array[Double] = Array(1.0, 1.0, 1.0, 1.0, 1.0)
```
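Under the hood, the logistic regression model predicts by passing the weighted sum of the features through the sigmoid function and thresholding at 0.5. A minimal sketch with made-up weights (the real weights are learned by SGD):

```
// sigmoid squashes the raw margin into a probability in (0, 1)
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// predict 1.0 when the estimated probability reaches the 0.5 threshold
def predictLR(weights: Array[Double], features: Array[Double]): Double = {
  val margin = weights.zip(features).map { case (w, x) => w * x }.sum
  if (sigmoid(margin) >= 0.5) 1.0 else 0.0
}

val w = Array(0.5, -1.2, 0.3)      // hypothetical learned weights
predictLR(w, Array(1.0, 0.2, 2.0))
// res: Double = 1.0  (margin = 0.86, sigmoid ≈ 0.70)
```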

3. Linear Support Vector Machine

```
import org.apache.spark.mllib.classification.SVMWithSGD

val svmModel = SVMWithSGD.train(data, numIterations)
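The trained SVM classifies much like the logistic model, except that the raw margin w·x is thresholded directly at 0 rather than mapped through a sigmoid. A sketch with the same hypothetical weights:

```
// SVM prediction: sign of the raw margin (no probability involved)
def predictSVM(weights: Array[Double], features: Array[Double]): Double = {
  val margin = weights.zip(features).map { case (w, x) => w * x }.sum
  if (margin >= 0.0) 1.0 else 0.0
}

predictSVM(Array(0.5, -1.2, 0.3), Array(1.0, 0.2, 2.0))
// res: Double = 1.0  (margin = 0.86 >= 0)
```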

III. Naive Bayes Model

```
val nbData = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1)
    .map(d => if (d == "?") 0.0 else d.toDouble)
    .map(d => if (d < 0) 0.0 else d)  // naive Bayes requires non-negative features
  LabeledPoint(label, Vectors.dense(features))
}
```

```
import org.apache.spark.mllib.classification.NaiveBayes

// note we use nbData here for the NaiveBayes model training
val nbModel = NaiveBayes.train(nbData)
```
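Multinomial naive Bayes treats each feature value as a count, which is why the negative values had to be zeroed out: prediction picks the class maximizing log P(c) + Σᵢ xᵢ · log P(featureᵢ | c), and a negative count has no meaning in that model. A sketch with hypothetical parameters:

```
// score for one class: log-prior plus count-weighted log-likelihoods
def nbScore(logPrior: Double, logTheta: Array[Double], features: Array[Double]): Double =
  logPrior + logTheta.zip(features).map { case (lt, x) => lt * x }.sum

// hypothetical two-class model over two features
val logPriors = Array(math.log(0.5), math.log(0.5))
val logThetas = Array(
  Array(math.log(0.9), math.log(0.1)),  // class 0 feature likelihoods
  Array(math.log(0.1), math.log(0.9))   // class 1 feature likelihoods
)
val counts = Array(3.0, 0.0)            // features interpreted as counts
val scores = logPriors.indices.map(c => nbScore(logPriors(c), logThetas(c), counts))
val predicted = scores.indexOf(scores.max)
// predicted: Int = 0
```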

IV. Decision Trees

```
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Entropy

val maxTreeDepth = 5
val dtModel = DecisionTree.train(data, Algo.Classification, Entropy, maxTreeDepth)
```
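The `Entropy` impurity passed to the trainer measures how mixed the class labels are at each candidate split: −Σᵢ pᵢ · log₂ pᵢ over the class proportions, 0 for a pure node and 1 bit for a 50/50 binary split. A small standalone sketch:

```
// entropy of a label distribution given per-class counts
def entropy(counts: Seq[Int]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * (math.log(p) / math.log(2))  // log base 2
  }.sum
}

entropy(Seq(5, 5))   // maximally mixed: 1.0 bit
entropy(Seq(10, 0))  // pure node: 0.0
```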
