Using LDA for classification

Linear Discriminant Analysis (LDA) attempts to fit a linear combination of features to predict the outcome variable. Besides classification, LDA is often used as a preprocessing step for dimensionality reduction. We'll walk through both uses in this recipe.
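As a quick orientation, here is a minimal sketch of both uses on scikit-learn's bundled iris dataset. The dataset and the choice of `n_components=2` are assumptions made for illustration; the recipe itself works with stock data.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris ships with scikit-learn, so nothing is downloaded here.
X, y = load_iris(return_X_y=True)

# With 3 classes, LDA can project onto at most 2 discriminant axes.
lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(X, y)

# Use 1: classification.
predicted = lda.predict(X)
train_accuracy = (predicted == y).mean()

# Use 2: dimensionality reduction (4 features -> 2 components).
X_reduced = lda.transform(X)

print(train_accuracy)
print(X_reduced.shape)
```

The same fitted object serves both purposes: `predict` returns class labels, while `transform` returns the projected coordinates.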

Getting ready

In this recipe, we will do the following:

1. Grab stock data from Yahoo.

2. Rearrange it into a shape we're comfortable with.

3. Create an LDA object to fit and predict the class labels.

4. Give an example of how to use LDA for dimensionality reduction.

How to do it…

In this example, we will perform an analysis similar to Altman's Z-score. In his paper, Altman looked at a company's likelihood of defaulting within two years based on several financial metrics. The following is taken from the Wikipedia page on Altman's Z-score:

T1 = Working Capital / Total Assets. Measures liquid assets in relation to the size of the company.

T2 = Retained Earnings / Total Assets. Measures profitability that reflects the company's age and earning power.

T3 = Earnings Before Interest and Taxes / Total Assets. Measures operating efficiency apart from tax and leveraging factors. It recognizes operating earnings as being important to long-term viability.

T4 = Market Value of Equity / Book Value of Total Liabilities. Adds a market dimension that can show up security price fluctuation as a possible red flag.

T5 = Sales / Total Assets. Standard measure for total asset turnover (varies greatly from industry to industry).

From Wikipedia:

[1]: Altman, Edward I. (September 1968). "Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy". Journal of Finance 23 (4): 589–609.
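To make the five ratios concrete, the sketch below computes them for a made-up company. Every balance-sheet figure is hypothetical, and the weights 1.2, 1.4, 3.3, 0.6 and 1.0 are the commonly quoted coefficients for the original Z-score; they are an assumption here, since this recipe does not state them.

```python
# Hypothetical figures (in millions); none of these describe a real company.
financials = {
    "working_capital": 200.0,
    "retained_earnings": 300.0,
    "ebit": 100.0,
    "market_value_equity": 800.0,
    "total_liabilities": 400.0,
    "sales": 1500.0,
    "total_assets": 1000.0,
}

total_assets = financials["total_assets"]
ratios = {
    "T1": financials["working_capital"] / total_assets,
    "T2": financials["retained_earnings"] / total_assets,
    "T3": financials["ebit"] / total_assets,
    "T4": financials["market_value_equity"] / financials["total_liabilities"],
    "T5": financials["sales"] / total_assets,
}

# Commonly quoted weights for the original Z-score (assumed, not from this recipe).
weights = {"T1": 1.2, "T2": 1.4, "T3": 3.3, "T4": 0.6, "T5": 1.0}

# The score is a linear combination of the ratios, which is why it resembles LDA.
z_score = sum(weights[name] * ratios[name] for name in ratios)

print(ratios)
print(round(z_score, 2))
```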

In this analysis, we'll look at some financial data from Yahoo via pandas. We'll try to predict whether a stock will be higher 6 months from now, based on its current attributes. This is obviously nowhere near as refined as Altman's Z-score. Let's use a basket of auto stocks:

tickers = ["F", "TM", "GM", "TSLA"]
from pandas_datareader import data as external_data
stock_panel = external_data.DataReader(tickers, "yahoo")

In older versions of pandas this data structure was a Panel, which is similar to an OLAP cube or a 3D DataFrame. Panel has since been removed from pandas, and the code in this recipe treats the result as a DataFrame whose columns carry two levels (attribute and ticker). Let's take a look at the closing prices to get familiar with the data, since closes are what we will compare:

stock_df = stock_panel.Close.dropna()
stock_df.plot(figsize=(7, 5))

The following is the output:

[Figure: line plot of the closing prices of F, TM, GM and TSLA]

OK, so now we need to compare each stock's price with its price 6 months later. If the later price is higher, we'll code it as 1, and if not, we'll code it as 0. To do this, we'll shift the DataFrame back by 180 rows and compare. Note that `shift(-180)` moves the data by 180 rows, that is, 180 trading days rather than 180 calendar days:

# This DataFrame indicates whether each stock closed higher 180 rows (trading days) later.
# The last 180 rows have no future price; NaN > x is False, so they are coded 0.
classes = (stock_df.shift(-180) > stock_df).astype(int)
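The labeling trick is easier to see on a tiny example. The sketch below uses a made-up five-row price series and a horizon of 2 rows instead of 180; the column name `A` and the prices are hypothetical.

```python
import pandas as pd

# Hypothetical closing prices for a single made-up ticker.
prices = pd.DataFrame({"A": [1.0, 2.0, 3.0, 2.0, 1.0]})

horizon = 2  # stands in for the 180 rows used in the recipe

# shift(-horizon) pulls each future price up to the current row.
future = prices.shift(-horizon)

# 1 if the future price is higher than the current one, else 0.
labels = (future > prices).astype(int)

# The last `horizon` rows have no future price, so their 0 is not a real label.
has_future = future.notna()

print(labels["A"].tolist())      # [1, 0, 0, 0, 0]
print(has_future["A"].tolist())  # [True, True, True, False, False]
```

One design choice worth considering is to drop the rows where `has_future` is False before fitting, so that missing future prices are not counted as "not higher".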

The next thing we need to do is flatten out the dataset:

import pandas as pd

# Turn the wide 0/1 table into a long table indexed by (Date, ticker)
classes = classes.unstack()
classes = classes.swaplevel(0, 1).sort_index()
classes = classes.to_frame()
classes.index.names = ['Date', 'minor']

# Reshape the raw data the same way: one row per (Date, ticker), one column per attribute
X = stock_panel.unstack().swaplevel(2, 0).to_frame().unstack()
data = pd.concat([X, classes], axis=1)
data.rename(columns={0: 'is_higher'}, inplace=True)
data.columns = ['Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume', 'is_higher']
data.head()

The following is the output:

[Table: first five rows of data, one row per date and ticker]

OK, so now we need to create matrices that scikit-learn can consume. To do this, we'll use the patsy library. This is a great library that can be used to create a design matrix in a fashion similar to R:
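To see what this reshaping does without downloading anything, here is a minimal sketch on a tiny hand-made table. The level names `Attributes` and `Symbols`, the dates and the values are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# A small wide table: columns are (attribute, ticker) pairs, rows are dates.
dates = pd.date_range("2020-01-01", periods=3, name="Date")
columns = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["F", "GM"]], names=["Attributes", "Symbols"]
)
wide = pd.DataFrame(np.arange(12).reshape(3, 4), index=dates, columns=columns)

# Step 1: unstack everything into one long Series
# indexed by (Attributes, Symbols, Date).
as_series = wide.unstack()

# Step 2: move the attribute level back into columns, leaving
# one row per (Symbols, Date) pair.
long = as_series.unstack("Attributes")

# Step 3: put Date first and sort, so rows read date by date.
long = long.swaplevel(0, 1).sort_index()

print(long)
```

The result has one row per date and ticker and one column per attribute, which is the layout a design matrix needs.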

import patsy
# "- 1" drops the intercept column; is_higher is listed last so it ends up as the last column.
X = patsy.dmatrix("Open + High + Low + Close + Volume + is_higher - 1",
                  data.reset_index(), return_type='dataframe')
X.head()

The following is the output:

[Table: first five rows of X]

patsy is a very powerful package. For example, suppose we want to apply some of the preprocessing from Chapter 1, Premodel Workflow. In patsy it's possible, as in R, to modify the formula in a way that corresponds to modifications in the design matrix.

It won't be done here, but if we wanted to scale the values to mean 0 and standard deviation 1, the formula would be "scale(Open) + scale(High)". Column names in a formula are case-sensitive, so they must match the DataFrame's column names.
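A minimal sketch of that scaling formula follows, using a tiny made-up table rather than the stock data. The prices are hypothetical; `scale` is patsy's built-in standardizing transform, and the check with `ddof=0` assumes patsy's default of dividing by the population standard deviation.

```python
import pandas as pd
import patsy

# Hypothetical prices, only to show the effect of scale().
prices = pd.DataFrame({
    "Open": [10.0, 11.0, 12.0, 13.0],
    "High": [10.5, 11.8, 12.1, 14.0],
})

# Each scale(...) term is centered to mean 0 and rescaled to standard deviation 1.
scaled = patsy.dmatrix("scale(Open) + scale(High) - 1", prices,
                       return_type="dataframe")

print(scaled.columns.tolist())
print(scaled.mean())
print(scaled.std(ddof=0))
```

Because the transform lives in the formula, patsy remembers the fitted mean and standard deviation and can apply the same scaling to new data built from the same design.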

Awesome! Now that we have our dataset, let's fit the LDA object:

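import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA()
# The last column of X is is_higher (the label); the other columns are features.
# .iloc replaces the .ix indexer, which was removed in pandas 1.0.
lda.fit(X.iloc[:, :-1], X.iloc[:, -1])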

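Next, we score the model against the same dataset it was fitted on. In the report below, the model labels every row as class 0, so the 0.61 accuracy only reflects the share of class 0 in the data (3083 of 5036 rows), and recall for class 1 is 0.00. We will certainly want to improve this with other parameters and test the model on held-out data: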

from sklearn.metrics import classification_report
print(classification_report(X.iloc[:,-1].values, lda.predict(X.iloc[:,:-1])))
              precision    recall  f1-score   support

         0.0       0.61      1.00      0.76      3083
         1.0       0.00      0.00      0.00      1953

    accuracy                           0.61      5036
   macro avg       0.31      0.50      0.38      5036
weighted avg       0.37      0.61      0.46      5036

These metrics describe how the model fits the data in various ways.

Precision and recall are closely related. As the following list shows, both can be thought of as conditional proportions:

1. For precision, given that the model predicts a positive value, what proportion of those predictions is correct?

2. For recall, given that the true state is one class, what proportion did we "select"? I say "select" because recall is a common metric in search problems. For example, there can be a set of underlying web pages that do, in fact, relate to a search term; recall is the proportion of them that is returned.

The f1-score attempts to summarize the relationship between recall and precision in a single number: it is their harmonic mean.
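The numbers in the report can be reproduced from the support counts alone. The sketch below assumes the model predicted class 0 for every row, which is what a recall of 1.00 for class 0 and 0.00 for class 1 implies; the arrays are rebuilt from the counts 3083 and 1953, not from the stock data.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Rebuild labels from the support column: 3083 rows of class 0, 1953 of class 1.
y_true = np.array([0] * 3083 + [1] * 1953)
# Assumed predictions: every row labeled 0.
y_pred = np.zeros_like(y_true)

# Class 0 treated as the "positive" class.
precision_0 = precision_score(y_true, y_pred, pos_label=0)  # 3083 / 5036
recall_0 = recall_score(y_true, y_pred, pos_label=0)        # 3083 / 3083
f1_0 = f1_score(y_true, y_pred, pos_label=0)

# Class 1 is never predicted, so its precision is undefined; report it as 0.
precision_1 = precision_score(y_true, y_pred, pos_label=1, zero_division=0)
recall_1 = recall_score(y_true, y_pred, pos_label=1)

# f1 is the harmonic mean of precision and recall.
f1_0_by_hand = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(round(precision_0, 2), round(recall_0, 2), round(f1_0, 2))  # 0.61 1.0 0.76
print(precision_1, recall_1)                                      # 0.0 0.0
```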

How it works…

LDA is actually fairly similar to the clustering we did previously. We fit a basic model from the data. Then, once we have the model, we compute the likelihood of each data point under each class and predict the class that is more likely.
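The idea can be sketched by hand with NumPy. The following is an illustrative implementation on synthetic two-class data, not scikit-learn's actual solver: it estimates one mean per class and a single pooled covariance, then picks the class with the highest Gaussian log-likelihood plus log prior. The data, seed and variable names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two classes with different means and the same covariance.
shared_cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], shared_cov, size=200),
    rng.multivariate_normal([3.0, 3.0], shared_cov, size=200),
])
y = np.repeat([0, 1], 200)

# "Fit": class means, class priors and one covariance pooled over both classes.
classes = np.unique(y)
means = np.array([X[y == k].mean(axis=0) for k in classes])
priors = np.array([(y == k).mean() for k in classes])
centered = X - means[y]  # works because the labels are 0 and 1
pooled_cov = centered.T @ centered / (len(X) - len(classes))
pooled_inv = np.linalg.inv(pooled_cov)


def class_scores(points):
    """Log-likelihood under each class plus log prior, up to a shared constant."""
    diffs = points[:, None, :] - means[None, :, :]  # shape (n, classes, features)
    mahalanobis = np.einsum("nkd,de,nke->nk", diffs, pooled_inv, diffs)
    return -0.5 * mahalanobis + np.log(priors)


# "Predict": choose the more likely class for every point.
predicted = classes[np.argmax(class_scores(X), axis=1)]
accuracy = (predicted == y).mean()
print(accuracy)
```

Because the covariance is shared, the quadratic terms cancel when two class scores are compared, which is why the resulting decision boundary is linear.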

LDA is actually a simplification of QDA, which we'll talk about in the next chapter. Here, we assume that the covariance of each class is the same, but in QDA, that assumption is relaxed. Think about the connection between k-means and GMM, and how the relationship there mirrors the one here.
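The effect of the shared-covariance assumption shows up clearly on data that violates it. The sketch below uses synthetic data in which both classes share a center but have very different spreads; the data, seed and sizes are assumptions, and accuracy is measured on the training data only.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)

# Same center for both classes, but class 0 is tight and class 1 is wide.
X = np.vstack([
    rng.normal(0.0, 0.5, size=(500, 2)),
    rng.normal(0.0, 3.0, size=(500, 2)),
])
y = np.repeat([0, 1], 500)

# LDA assumes one covariance for all classes, so it can only draw a line.
lda_accuracy = LinearDiscriminantAnalysis().fit(X, y).score(X, y)

# QDA estimates one covariance per class, so it can separate by spread.
qda_accuracy = QuadraticDiscriminantAnalysis().fit(X, y).score(X, y)

print(lda_accuracy, qda_accuracy)
```

When the class covariances really are similar, LDA is usually the safer choice because it has fewer parameters to estimate.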

Original source: http://www.packtpub.com

Original author: Trent Hauck
