# Using LDA for classification

Linear Discriminant Analysis (LDA) attempts to fit a linear combination of features to predict the outcome variable. LDA is often used as a preprocessing step. We'll walk through both methods in this recipe.

In this recipe, we will do the following:

1. Grab stock data from Yahoo.

2. Rearrange it in a shape we're comfortable with.

3. Create an LDA object to fit and predict the class labels.

4. Give an example of how to use LDA for dimensionality reduction.


## How to do it…

In this example, we will perform an analysis similar to Altman's Z-score. In that paper [1], Altman looked at a company's likelihood of defaulting within two years based on several financial metrics. The following is taken from the Wikipedia page on the Altman Z-score:

T1 = Working Capital / Total Assets. Measures liquid assets in relation to the size of the company.

T2 = Retained Earnings / Total Assets. Measures profitability that reflects the company's age and earning power.

T3 = Earnings Before Interest and Taxes / Total Assets. Measures operating efficiency apart from tax and leveraging factors. It recognizes operating earnings as being important to long-term viability.

T4 = Market Value of Equity / Book Value of Total Liabilities. Adds a market dimension that can show up security price fluctuation as a possible red flag.

T5 = Sales / Total Assets. Standard measure for total asset turnover (varies greatly from industry to industry).


From Wikipedia:

[1]: Altman, Edward I. (September 1968). "Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy". Journal of Finance: 189–209.
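The five ratios above combine into Altman's original score with fixed weights: Z = 1.2 T1 + 1.4 T2 + 3.3 T3 + 0.6 T4 + 1.0 T5. A minimal sketch, with made-up figures for a hypothetical company:

```python
# Altman's original (1968) weights for the five ratios
def altman_z(t1, t2, t3, t4, t5):
    return 1.2 * t1 + 1.4 * t2 + 3.3 * t3 + 0.6 * t4 + 1.0 * t5

# hypothetical company figures, purely illustrative
z = altman_z(t1=0.2, t2=0.3, t3=0.15, t4=1.5, t5=1.1)
# Z above roughly 2.99 was traditionally read as the "safe" zone,
# below roughly 1.81 as the "distress" zone
```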

In this analysis, we'll look at some financial data from Yahoo via pandas. We'll try to predict whether a stock will be higher exactly 6 months from today, based on the current attributes of the stock. It's obviously nowhere near as refined as Altman's Z-score. Let's use a basket of auto stocks:

```python
from pandas_datareader import data as external_data

tickers = ["F", "TM", "GM", "TSLA"]
# fetch daily OHLCV data for the basket of auto stocks
stock_panel = external_data.get_data_yahoo(tickers)
```

This data structure is a Panel from pandas. It's similar to an OLAP cube or a 3D DataFrame. Let's take a look at the closing prices to get some familiarity with them, since that's what we care about when comparing:

```python
stock_df = stock_panel.Close.dropna()
stock_df.plot(figsize=(7, 5))
```

The following is the output:

Ok, so now we need to compare each stock price with its price in 6 months. If it's higher, we'll code it with 1, and if not, we'll code it with 0. To do this, we'll just shift the dataframe back 180 days and compare:

```python
# this DataFrame indicates whether the stock was higher 180 days later
classes = (stock_df.shift(-180) > stock_df).astype(int)
```
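The shift-and-compare trick is easy to see on a tiny Series (the prices and the 2-day horizon here are made up for illustration):

```python
import pandas as pd

# hypothetical prices; label each day by whether the price is higher 2 days later
s = pd.Series([10.0, 11.0, 9.0, 12.0, 8.0])
# s.shift(-2) pulls each price 2 steps forward: [9.0, 12.0, 8.0, NaN, NaN]
labels = (s.shift(-2) > s).astype(int)
# NaN comparisons evaluate to False, so the last rows are coded 0
```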

The next thing we need to do is flatten out the dataset:

```python
import pandas as pd

classes = classes.unstack()
classes = classes.swaplevel(0, 1).sort_index()
classes = classes.to_frame()
classes.index.names = ['Date', 'minor']

X = stock_panel.unstack().swaplevel(2, 0, 1).to_frame().unstack()
data = pd.concat([X, classes], axis=1)
data.rename(columns={0: 'is_higher'}, inplace=True)
```

The following is the output:

Ok, so now we need to create matrices for SciPy. To do this, we'll use the patsy library. This is a great library that can be used to create a design matrix in a fashion similar to R:

```python
import patsy

X = patsy.dmatrix("Open + High + Low + Close + Volume + is_higher - 1",
                  data.reset_index(), return_type='dataframe')
```

The following is the output:

patsy is a very strong package; for example, suppose we want to apply some of the preprocessing from Chapter 1, Premodel Workflow. In patsy, it's possible, as in R, to modify the formula in a way that corresponds to modifications in the design matrix.


It won't be done here, but if we wanted to scale a value to mean 0 and standard deviation 1, the term would be "scale(Open) + scale(High)".
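A minimal sketch of that, assuming patsy is installed and using a toy DataFrame in place of the stock data:

```python
import pandas as pd
import patsy

# hypothetical numbers standing in for the real Open/High columns
df = pd.DataFrame({"Open": [10.0, 12.0, 11.0, 13.0],
                   "High": [11.0, 13.0, 12.0, 14.0]})
# scale() standardizes each column inside the formula itself
X_scaled = patsy.dmatrix("scale(Open) + scale(High) - 1", df,
                         return_type='dataframe')
```

Each resulting column now has mean 0, without any separate preprocessing step outside the formula.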

Awesome! So, now that we have our dataset, let's fit the LDA object:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA()
lda.fit(X.iloc[:, :-1], X.iloc[:, -1])
```

We can see that it's not too bad when predicting against the dataset. Certainly, we will want to improve this with other parameters and test the model:

```python
from sklearn.metrics import classification_report

print(classification_report(X.iloc[:, -1].values,
                            lda.predict(X.iloc[:, :-1])))
```

```
              precision    recall  f1-score   support

         0.0       0.61      1.00      0.76      3083
         1.0       0.00      0.00      0.00      1953

    accuracy                           0.61      5036
   macro avg       0.31      0.50      0.38      5036
weighted avg       0.37      0.61      0.46      5036
```

These metrics describe how well the model fits the data in various ways.

The precision and recall parameters are fairly similar. In some ways, as shown in the following list, they can be thought of as conditional proportions:

1. For precision, given that the model predicts a positive value, what proportion of those predictions is correct?

2. For recall, given that one class is in fact true, what proportion did we "select"? I say select because recall is a common metric in search problems. For example, there can be a set of underlying web pages that, in fact, relate to a search term; recall is the proportion of those that is returned.

The f1-score parameter attempts to summarize the relationship between recall and precision .

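These conditional proportions are quick to compute by hand on a toy set of labels (made up here purely for illustration):

```python
# hypothetical true labels and model predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)  # of the predicted positives, how many were right?
recall = tp / (tp + fn)     # of the actual positives, how many did we select?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```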

## How it works…

LDA is actually fairly similar to clustering that we did previously. We fit a basic model from the data. Then, once we have the model, we try to predict and compare the likelihoods of the data given in each class. We choose the option that's more likely.


LDA is actually a simplification of QDA, which we'll talk about in the next chapter. Here, we assume that the covariance of each class is the same, but in QDA, that assumption is relaxed. Think about the connection between KNN and GMM; the relationship between LDA and QDA is similar.
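A small synthetic sketch of that relaxed assumption: two classes drawn with deliberately different covariances, a setting where QDA's per-class covariance can help while LDA is stuck with a single shared one. The data here is made up for illustration:

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# class 0: tight cluster around 0; class 1: much wider spread around 2
X = np.vstack([rng.normal(0, 1.0, size=(200, 2)),
               rng.normal(2, 3.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

# fit both models on the same data and compare training accuracy
lda_acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)
qda_acc = QuadraticDiscriminantAnalysis().fit(X, y).score(X, y)
```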

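Step 4 from the recipe's list, using LDA for dimensionality reduction, can be sketched with scikit-learn's built-in iris data; LDA can project onto at most n_classes - 1 directions:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
# project the 4 iris features onto 2 class-separating directions
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(iris.data, iris.target)
```

`X_2d` now has shape (150, 2) and can be fed to any downstream estimator or plotted directly.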
