Using unsupervised learning to merge features (PCA)
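The basic idea of PCA is to reduce the number of dimensions of a problem. This can be useful for countering the curse of dimensionality, or for merging data so that you can see the trend in the data without the noise of correlated data.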
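In this example we will use PCA to merge the stock prices of 24 stocks over the years 2002 to 2012 into 1. This single value (over time) then represents a stock market index based on the data of these 24 stocks. Merging the 24 stock prices into 1 greatly reduces the amount of data to process and lowers the dimensionality of our data, which is a big advantage if we later apply other machine learning algorithms, such as regression, for prediction. To see how well the reduction from 24 features to 1 performs, we will compare our result to the Dow Jones Index (DJI) over the same time period.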
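To make the idea of merging correlated features concrete, here is a minimal sketch in plain Scala that reduces 2 correlated columns to 1 value per row. It is not the Smile implementation used later in this example: it finds the first principal component with a simple power iteration, and the object name `PcaSketch`, its helper functions, and the sample prices are all hypothetical.

```scala
// Minimal PCA sketch in plain Scala (no Smile): merges correlated features into 1.
// All names and the sample data are hypothetical, for illustration only.
object PcaSketch {
  type Matrix = Array[Array[Double]]

  // Subtract the per-column mean so every feature is centered on 0.
  def center(data: Matrix): Matrix = {
    val n = data.length.toDouble
    val means = data.transpose.map(_.sum / n)
    data.map(row => row.zip(means).map { case (v, m) => v - m })
  }

  // Sample covariance matrix of already-centered data.
  def covariance(centered: Matrix): Matrix = {
    val n = centered.length
    val cols = centered.transpose
    cols.map(a => cols.map(b => a.zip(b).map { case (x, y) => x * y }.sum / (n - 1)))
  }

  // Power iteration: repeatedly multiply a vector by the covariance matrix and
  // normalize it. Assumes the data has non-zero variance and that the start
  // vector (all ones) is not orthogonal to the dominant direction.
  def firstComponent(cov: Matrix, iterations: Int = 200): Array[Double] = {
    var v = Array.fill(cov.length)(1.0)
    for (_ <- 0 until iterations) {
      val w = cov.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
      val norm = math.sqrt(w.map(x => x * x).sum)
      v = w.map(_ / norm)
    }
    v
  }

  // Project each centered row onto the first component: one value per row.
  def project(data: Matrix): Array[Double] = {
    val centered = center(data)
    val pc = firstComponent(covariance(centered))
    centered.map(row => row.zip(pc).map { case (a, b) => a * b }.sum)
  }

  def main(args: Array[String]): Unit = {
    // Two made-up price columns that move together.
    val prices: Matrix =
      Array(Array(1.0, 2.1), Array(2.0, 3.9), Array(3.0, 6.2), Array(4.0, 7.8))
    println(project(prices).mkString(", "))
  }
}
```

Each input row with 2 values comes out as a single number along the direction in which the data varies most. The 24-stock case follows the same principle with 24 columns instead of 2.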
The next step is to load the data. For this, we provide you with 2 files: Data file 1 and Data file 2.
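import java.io.File
import java.text.SimpleDateFormat
import java.util.Date
import scala.swing.{MainFrame, SimpleSwingApplication}

object PCA extends SimpleSwingApplication {
  def top = new MainFrame {
    title = "PCA Example"
    //Get the example data
    val basePath = "/users/.../Example Data/"
    val exampleDataPath = basePath + "PCA_Example_1.csv"
    val trainData = getStockDataFromCSV(new File(exampleDataPath))
  }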
  def getStockDataFromCSV(file: File): (Array[Date], Array[Array[Double]]) = {
    val source = scala.io.Source.fromFile(file)
    //Get all the records (minus the header)
    val data = source
      .getLines()
      .drop(1)
      .map(x => getStockDataFromString(x))
      .toArray
    source.close()
    //group all records by date, and sort the groups on date ascending
    val groupedByDate = data.groupBy(x => x._1).toArray.sortBy(x => x._1)
    //extract the values from the 3-tuple and turn them into
    // an array of tuples: Array[(Date, Array[Double])]
    val dateArrayTuples = groupedByDate
      .map(x => (x._1, x
        ._2
        .sortBy(x => x._2) //sort on stock name so each stock keeps the same position
        .map(y => y._3)
        )
      )
    //turn the tuples into two separate arrays for easier use later on
    val dateArray = dateArrayTuples.map(x => x._1).toArray
    val doubleArray = dateArrayTuples.map(x => x._2).toArray
    (dateArray, doubleArray)
  }
  def getStockDataFromString(dataString: String): (Date, String, Double) = {
    //Split the comma separated value string into an array of strings
    val dataArray: Array[String] = dataString.split(',')
    val format = new SimpleDateFormat("yyyy-MM-dd")
    //Extract the values from the strings
    val date = format.parse(dataArray(0))
    val stock: String = dataArray(1)
    val close: Double = dataArray(2).toDouble
    //And return the result in a format that can later
    //easily be used to feed to Smile
    (date, stock, close)
  }
}
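The reshaping done in getStockDataFromCSV can be hard to picture, so here is a small sketch of the same group-and-sort steps on a few in-memory records. The object name `PivotSketch`, the tickers, and the prices are made up, and dates are kept as ISO-formatted strings (which sort chronologically) instead of `Date` values to keep the sketch short.

```scala
// Sketch of the long-to-wide reshaping done by getStockDataFromCSV, on in-memory rows.
// The tickers and prices are made up; dates are plain ISO strings to keep it short.
object PivotSketch {
  // One record per (date, stock, close), like one parsed CSV line.
  type Record = (String, String, Double)

  def pivot(records: Seq[Record]): (Array[String], Array[Array[Double]]) = {
    val rows = records
      .groupBy(_._1) // date -> all records of that date
      .toArray
      .sortBy(_._1)  // dates ascending
      .map { case (date, recs) =>
        // Sort on stock name so each stock always lands in the same column.
        (date, recs.sortBy(_._2).map(_._3).toArray)
      }
    (rows.map(_._1), rows.map(_._2))
  }

  def main(args: Array[String]): Unit = {
    val records = Seq(
      ("2002-01-03", "BBB", 20.5), ("2002-01-02", "BBB", 20.0),
      ("2002-01-02", "AAA", 10.0), ("2002-01-03", "AAA", 10.5))
    val (dates, prices) = pivot(records)
    dates.zip(prices).foreach { case (d, p) => println(s"$d -> ${p.mkString(", ")}") }
  }
}
```

The result has one row per date and one column per stock, which is the `Array[Array[Double]]` shape that is passed to the PCA step. Note that the columns only line up if every date has a record for every stock; a missing record would shift the remaining values of that row to the left.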
With this training data, and the fact that we already know that we want to merge the 24 features into 1 single feature, we can do the PCA and retrieve the values for the datapoints as follows.
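//Add to `def top`
//Assumed imports (Smile 1.x): smile.projection.PCA, smile.plot.{Line, LinePlot, PlotCanvas},
//and java.awt.{Color, Dimension}
val pca = new PCA(trainData._2)
//Keep only the first principal component
pca.setProjection(1)
val points = pca.project(trainData._2)
//x = index of the day, y = negated value of the single merged feature
val plotData = points
  .zipWithIndex
  .map(x => Array(x._2.toDouble, -x._1(0)))
val canvas: PlotCanvas = LinePlot.plot("Merged Features Index",
  plotData,
  Line.Style.DASH,
  Color.RED)
peer.setContentPane(canvas)
size = new Dimension(400, 400)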
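This code not only performs the PCA but also plots the results, with the feature value on the y axis and the individual days on the x axis.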
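The `plotData` expression above can be tried in isolation with the following sketch, which uses made-up projection values and a hypothetical `PlotDataSketch` object. The sign of a principal component is arbitrary, so negating the value only mirrors the curve vertically; that the negation is there to make the curve move in the same direction as the prices is an assumption, not something the example states.

```scala
// Sketch of how projected points become (x, y) pairs for a line plot.
// The projection values are made up for illustration.
object PlotDataSketch {
  // Each point is a 1-element array (the single merged feature for one day).
  // x is the day index, y is the negated feature value.
  def toPlotData(points: Array[Array[Double]]): Array[Array[Double]] =
    points.zipWithIndex.map { case (p, day) => Array(day.toDouble, -p(0)) }

  def main(args: Array[String]): Unit = {
    val points = Array(Array(-2.0), Array(0.5), Array(1.5))
    toPlotData(points).foreach(xy => println(xy.mkString("(", ", ", ")")))
  }
}
```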