
Using linear methods for classification – logistic regression

到不了的都叫做远方
Modified 2020-04-22 15:42:18

Linear models can actually be used for classification tasks. This involves fitting a linear model to the probability of a certain class, and then using a function to create a threshold at which we specify the outcome of one of the classes.

Getting ready

The function used here is typically the logistic function (surprise!). It's a pretty simple function:

f(t) = 1 / (1 + e^(-t))

Visually, it is the familiar S-shaped (sigmoid) curve, rising from 0 to 1 with its midpoint at t = 0.
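As a quick sketch (not part of the original recipe; the function name `logistic` is our own), the logistic function can be computed and inspected directly:

```python
import numpy as np

def logistic(t):
    """The logistic (sigmoid) function, mapping any real t into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# The midpoint is at t = 0, where the output is exactly 0.5;
# large positive inputs approach 1, large negative inputs approach 0.
print(logistic(0))    # 0.5
print(logistic(10))   # very close to 1
print(logistic(-10))  # very close to 0
```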

Let's use the make_classification method, create a dataset, and get to classifying:

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4)

How to do it...

The LogisticRegression object works in the same way as the other linear models:

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

Since we're good data scientists, we will pull out the last 200 samples to test the trained model on. Since this is a random dataset, it's fine to hold out the last 200; if you're dealing with structured data, don't do this (for example, if you deal with time series data):

X_train = X[:-200]
X_test = X[-200:]
y_train = y[:-200]
y_test = y[-200:]
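For i.i.d. data like this, the same split is more often done with scikit-learn's `train_test_split` helper, a sketch of which (not used in this recipe) is:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=4)
# shuffle=True (the default) is appropriate here because the rows are i.i.d.;
# for time series, you would keep the ordering instead.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=200)
print(X_train.shape, X_test.shape)  # (800, 4) (200, 4)
```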

We'll discuss more on cross-validation later in the book. For now, we need to fit the model with logistic regression. We'll keep the predictions on the train set, just like the test set. It's a good idea to see how often you are correct on both sets. Often, you'll do better on the train set; the question is how much worse you do on the test set:

lr.fit(X_train, y_train)
y_train_predictions = lr.predict(X_train)
y_test_predictions = lr.predict(X_test)

Now that we have the predictions, let's take a look at how good our predictions were. Here, we'll simply look at the number of times we were correct; later, we'll talk about evaluating classification models in more detail.

The calculation is simple; it's the number of times we were correct over the total sample:

(y_train_predictions == y_train).sum().astype(float) / y_train.shape[0]
0.90125

And now the test sample:

(y_test_predictions == y_test).sum().astype(float) / y_test.shape[0]
0.925
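The manual fraction computed above is just classification accuracy; a sketch (on hypothetical labels, not the dataset from the recipe) showing it agrees with scikit-learn's built-in `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Number of correct predictions over the total sample, as in the recipe
manual = (y_pred == y_true).sum().astype(float) / y_true.shape[0]
print(manual, accuracy_score(y_true, y_pred))  # both 0.8
```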

So, here we were correct about as often in the test set as we were in the train set. Sadly, in practice, this isn't often the case.

The question then changes to how to move on from the logistic function to a method by which we can classify groups.

First, recall that linear regression hopes to find the linear function that fits the expected value of Y, given the values of X; this is E(Y|X) = Xβ. Here, the Y values are the probabilities of the classes. Therefore, the problem we're trying to solve is E(p|X) = Xβ. Then, once the logit link is applied, this becomes Logit(p) = Xβ. This idea, expanded, is how other forms of regression work, for example, Poisson regression.
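To make the link between the fitted probabilities and the class threshold concrete, here is a sketch (refitting a model like the one above; `random_state=0` is our addition for reproducibility) showing that thresholding `predict_proba` at 0.5 reproduces `predict`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=4, random_state=0)
lr = LogisticRegression().fit(X, y)

# predict_proba returns the fitted class probabilities E(p|X);
# applying the 0.5 threshold gives the same labels as predict
probs = lr.predict_proba(X)[:, 1]
labels = (probs > 0.5).astype(int)
print((labels == lr.predict(X)).all())  # True
```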

There's more...

You'll surely see this again. There will be situations where one class is weighted differently from the others; for example, one class may be 99 percent of cases. This situation pops up all over the place in classification work. The canonical example is fraud detection, where most transactions aren't fraud, but the cost associated with misclassification is asymmetric between classes.

Let's create a classification problem with 95 percent imbalance and see how the basic stock logistic regression handles this case:

X, y = make_classification(n_samples=5000, n_features=4, weights=[.95])
sum(y) / (len(y)*1.)  # to confirm the class imbalance
0.0562

Create the train and test sets, and then fit logistic regression:

X_train = X[:-500]
X_test = X[-500:]
y_train = y[:-500]
y_test = y[-500:]
lr.fit(X_train, y_train)
y_train_predictions = lr.predict(X_train)
y_test_predictions = lr.predict(X_test)

Now, to see how well our model fits the data, do the following:

(y_train_predictions == y_train).sum().astype(float) / y_train.shape[0]
0.9837777777777778
(y_test_predictions == y_test).sum().astype(float) / y_test.shape[0]
0.978

At first, it looks like we did well, but it turns out that when we always guessed that a transaction was not fraud (or class 0 in general) we were right around 95 percent of the time. If we look at how well we did in classifying the 1 class, it's not nearly as good:

x = (y_test[y_test==1] == y_test_predictions[y_test==1]).sum().astype(float)
x / y_test[y_test==1].shape[0]
0.6923076923076923
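This per-class accuracy on the 1 class is exactly the recall of the positive class; a sketch (on hypothetical labels, not the recipe's dataset) showing the manual calculation agrees with scikit-learn's `recall_score`:

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical labels: 4 positives (fraud), of which 3 are caught
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 0, 0, 1, 0, 0, 1, 0, 1, 1])

# Fraction of true positives we classified correctly, as in the recipe
manual = (y_true[y_true == 1] == y_pred[y_true == 1]).sum() / (y_true == 1).sum()
print(manual, recall_score(y_true, y_pred))  # both 0.75
```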

Hypothetically, we might care more about identifying fraud cases than non-fraud cases; this could be due to a business rule, so we might alter how we weigh the correct and incorrect values.

By default, the classes are weighted equally in the training loss. However, because we care more about fraud cases, let's weight the fraud cases more heavily relative to the nonfraud cases.

We know that our relative weighting right now is 95 percent nonfraud; let's change this to overweight fraud cases:

lr = LogisticRegression(class_weight={0: .15, 1: .85})
lr.fit(X_train, y_train)

Let's predict the outputs again:

y_train_predictions = lr.predict(X_train)
y_test_predictions = lr.predict(X_test)

We can see that we did a much better job on classifying the fraud cases:

x = (y_test[y_test==1] == y_test_predictions[y_test==1]).sum().astype(float) 
x / y_test[y_test==1].shape[0]
0.7307692307692307

But, at what expense do we do this? To find out, use the following command:

(y_test_predictions == y_test).sum().astype(float) / y_test.shape[0]
0.966

Here, there's only about 1 percent less accuracy. Whether that's okay depends on your problem. Put in the context of the problem, if the estimated cost associated with fraud is sufficiently large, it can eclipse the cost associated with tracking fraud.

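One way to see the whole trade-off at once is a confusion matrix for the default and the re-weighted model; a sketch (rebuilding an imbalanced problem like the one above; `random_state=0` is our addition, so the exact counts will differ from the recipe's numbers):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=5000, n_features=4, weights=[.95],
                           random_state=0)
X_train, X_test = X[:-500], X[-500:]
y_train, y_test = y[:-500], y[-500:]

for weights in (None, {0: .15, 1: .85}):
    lr = LogisticRegression(class_weight=weights).fit(X_train, y_train)
    # Rows are true classes, columns are predicted classes;
    # up-weighting class 1 trades false negatives for false positives
    print(weights)
    print(confusion_matrix(y_test, lr.predict(X_test)))
```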
This article is a translation of an external original.

In case of infringement, please contact cloudcommunity@tencent.com for removal.