[scikit-learn Machine Learning] 6. Logistic Regression

Michael阿明
Published 2020-07-13 15:11:52

These are study notes on the book scikit-learn机器学习(第2版) (scikit-learn Machine Learning, 2nd edition).

Logistic regression is commonly used for classification tasks.

1. Binary classification with logistic regression

See also: 《统计学习方法》 (Statistical Learning Methods), the logistic regression (LR) model

Definition: let $X$ be a continuous random variable. $X$ follows a logistic distribution if it has the following distribution function and density function, where $\mu$ is the location parameter and $\gamma > 0$ is the shape parameter:

$$F(x) = P(X \leq x) = \frac{1}{1+e^{-(x-\mu)/\gamma}}$$

$$f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma \left(1+e^{-(x-\mu)/\gamma}\right)^2}$$


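In logistic regression, an instance is predicted as the positive class when its predicted probability is >= the threshold, and as the negative class otherwise.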
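
As a minimal sketch of the two ideas above (not code from the book): with μ = 0 and γ = 1 the logistic distribution function is the sigmoid that turns a real-valued score into a probability, and the threshold rule then turns that probability into a class. The names `logistic_cdf` and `predict_label`, the toy scores, and the 0.5 threshold are assumptions for illustration.

```python
import numpy as np

def logistic_cdf(x, mu=0.0, gamma=1.0):
    """Logistic distribution function F(x); with mu=0, gamma=1 it is the sigmoid."""
    return 1.0 / (1.0 + np.exp(-(x - mu) / gamma))

def predict_label(prob, threshold=0.5):
    """Positive class (1) when the probability is >= threshold, otherwise negative (0)."""
    return (prob >= threshold).astype(int)

scores = np.array([-2.0, 0.0, 3.0])   # hypothetical linear scores w.x + b
probs = logistic_cdf(scores)          # probabilities in (0, 1)
labels = predict_label(probs)         # class decisions at threshold 0.5
print(probs, labels)
```

Raising the threshold makes the classifier more conservative about predicting the positive class; lowering it does the opposite.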

2. Spam filtering

Extract TF-IDF features from the messages and classify them with logistic regression.

import pandas as pd
data = pd.read_csv("SMSSpamCollection", delimiter='\t',header=None)
data
data[data[0]=='ham'][0].count() # 4825 ham (legitimate) messages
data[data[0]=='spam'][0].count() # 747 spam messages
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelBinarizer

X = data[1].values  # message text
y = data[0].values  # 'ham' / 'spam'
lb = LabelBinarizer()
y = lb.fit_transform(y)  # ham -> 0, spam -> 1; returns a column vector of shape (n, 1)

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=520)

# Fit TF-IDF on the training messages only, then reuse the fitted vocabulary on the test messages
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)

classifier = LogisticRegression()
classifier.fit(X_train, y_train.ravel())  # ravel() passes the 1-D label array that fit() expects

pred = classifier.predict(X_test)
for i, pred_i in enumerate(pred[:5]):
    print("Predicted: %s, message: %s, actual: %s" % (pred_i, X_test_raw[i], y_test[i]))
Output:
Predicted: 0, message: Aww that's the first time u said u missed me without asking if I missed u first. You DO love me! :), actual: [0]
Predicted: 0, message: Poor girl can't go one day lmao, actual: [0]
Predicted: 0, message: Also remember the beads don't come off. Ever., actual: [0]
Predicted: 0, message: I see the letter B on my car, actual: [0]
Predicted: 0, message: My love ! How come it took you so long to leave for Zaher's? I got your words on ym and was happy to see them but was sad you had left. I miss you, actual: [0]
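
To make the TF-IDF features less of a black box, here is a hand-rolled sketch on a hypothetical three-message corpus. It assumes raw term counts, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization, which are the defaults that the scikit-learn documentation describes for `TfidfVectorizer`; the corpus and the helper `tfidf_row` are made up for illustration, and tokenization is skipped.

```python
import math
from collections import Counter

# Hypothetical, already tokenized messages
docs = [["free", "prize", "now"],
        ["call", "me", "now"],
        ["free", "call", "now"]]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many messages each term occurs
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """TF-IDF weights of one message over the vocabulary, scaled to unit L2 norm."""
    tf = Counter(doc)
    raw = [tf[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

X_toy = [tfidf_row(doc) for doc in docs]
for term, weight in zip(vocab, X_toy[0]):
    print(term, round(weight, 3))
```

A term that occurs in every message ("now") gets the smallest idf, so rarer terms such as "prize" carry more weight within a message.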

2.1 Performance metrics

Confusion matrix

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, pred)  # separate name so the imported function is not overwritten
plt.matshow(cm)
plt.title("Confusion matrix")
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.colorbar()
plt.show()

2.2 Accuracy

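# 5-fold cross-validated accuracy on the training set (ravel() gives the 1-D labels the estimator expects)
scores = cross_val_score(classifier, X_train, y_train.ravel(), cv=5)
print('Accuracies: %s' % scores)
print('Mean accuracy: %s' % np.mean(scores))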
Output:
Accuracies: [0.94976077 0.95933014 0.96650718 0.95215311 0.95688623]
Mean accuracy: 0.9569274847434318

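Accuracy is not a very suitable performance metric here: it cannot tell the two kinds of error apart, that is, whether a positive was predicted as negative (false negative) or a negative as positive (false positive).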
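
A quick way to see the limitation is a hypothetical classifier that labels every message as ham. The sketch below uses only the class counts reported earlier (4825 ham, 747 spam); the always-ham classifier is an assumption for illustration, not something from the book.

```python
n_ham, n_spam = 4825, 747               # class counts reported earlier

y_true = [0] * n_ham + [1] * n_spam     # 0 = ham, 1 = spam
y_pred = [0] * len(y_true)              # hypothetical classifier: always predicts ham

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

spam_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = spam_caught / n_spam

print('Accuracy: %.3f' % accuracy)
print('Recall:   %.3f' % recall)
```

The accuracy looks respectable only because ham dominates the data, while not a single spam message is caught.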

2.3 Precision and recall

See also: [Hands On ML] 3. Classification (MNIST handwritten digit prediction)


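Precision or recall taken alone is not meaningful: a classifier that flags every message as spam reaches a recall of 1 with poor precision, and one that flags almost nothing can reach high precision with poor recall.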

from sklearn.metrics import precision_score, recall_score, f1_score
precisions = precision_score(y_test, pred)
print('Precision: %s' % precisions)
recalls = recall_score(y_test, pred)
print('Recall: %s' % recalls)
Output:
Precision: 0.9852941176470589
Almost every message predicted as spam really is spam.

Recall: 0.6979166666666666
About 30% of the spam messages were predicted as ham (not spam).
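
Both scores follow directly from the counts of true positives (TP), false positives (FP), and false negatives (FN). The source does not list these counts, so the values below are hypothetical, chosen only because they reproduce the scores printed above; any counts in the same ratio would give the same result.

```python
# Hypothetical counts (not given in the source), consistent with the reported scores
tp = 134   # spam predicted as spam
fp = 2     # ham predicted as spam
fn = 58    # spam predicted as ham

precision = tp / (tp + fp)   # of everything flagged as spam, how much really is spam
recall = tp / (tp + fn)      # of all real spam, how much was flagged

print('Precision: %s' % precision)
print('Recall: %s' % recall)
```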

2.4 F1 score

The F1 score balances precision and recall: it is their harmonic mean.

f1s = f1_score(y_test, pred)
print('F1 score: %s' % f1s)
# F1 score: 0.8170731707317074
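
As a small check, the F1 value can be recomputed by hand as the harmonic mean of the precision and recall printed earlier. The comparison with the arithmetic mean is added here only for illustration.

```python
precision = 0.9852941176470589   # value reported in section 2.3
recall = 0.6979166666666666      # value reported in section 2.3

f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
arithmetic_mean = (precision + recall) / 2

print('F1 score: %s' % f1)
print('Arithmetic mean: %s' % arithmetic_mean)
```

The harmonic mean sits closer to the smaller of the two values, so a model cannot reach a high F1 by excelling at only one of them.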

2.5 ROC and AUC

  • The closer a classifier's AUC is to 1, the better; a random classifier has an AUC of 0.5
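from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# Score with the predicted probability of the positive class (spam) instead of the hard 0/1
# predictions, so that the curve is traced over many thresholds
y_score = classifier.predict_proba(X_test)[:, 1]
false_positive_rate, recall, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)  # separate name so the imported function is not overwritten

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')  # diagonal: a random classifier
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()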
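
One way to read the AUC: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way with NumPy on made-up scores; `auc_by_ranking` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]           # every positive against every negative
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]                            # made-up labels
scores = [0.1, 0.4, 0.35, 0.8]                   # made-up predicted probabilities

auc_toy = auc_by_ranking(y_true, scores)
auc_constant = auc_by_ranking(y_true, [0.5, 0.5, 0.5, 0.5])   # uninformative scores
auc_perfect = auc_by_ranking(y_true, [0.1, 0.2, 0.8, 0.9])    # perfect ranking
print(auc_toy, auc_constant, auc_perfect)
```

Scores that carry no information land at 0.5 and a perfect ranking at 1, which matches the two reference points in the bullet above.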

3. Tuning hyperparameters with grid search

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score


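pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    # liblinear supports both the 'l1' and 'l2' penalties searched below;
    # the default solver in recent scikit-learn versions (lbfgs) does not support 'l1'
    ('clf', LogisticRegression(solver='liblinear'))
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),  # keys have the form <step name>__<parameter name>
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}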

if __name__ == "__main__":
    df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
    X = df[1].values
    y = df[0].values
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    grid_search.fit(X_train, y_train)
    
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
        
    predictions = grid_search.predict(X_test)
    print('Accuracy: %s' % accuracy_score(y_test, predictions))
    print('Precision: %s' % precision_score(y_test, predictions))
    print('Recall: %s' % recall_score(y_test, predictions))
Output:
Best score: 0.985
Best parameters set:
	clf__C: 10
	clf__penalty: 'l2'
	vect__max_df: 0.5
	vect__max_features: 5000
	vect__ngram_range: (1, 2)
	vect__stop_words: None
	vect__use_idf: True
Accuracy: 0.9791816223977028
Precision: 1.0
Recall: 0.8605769230769231

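After tuning, recall improved from about 0.70 to about 0.86. The two runs use different random train/test splits, so the comparison is approximate.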
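
Grid search is exhaustive, so its cost grows with the product of the option counts. The sketch below enumerates the grid from section 3 with `itertools.product` to show how many pipelines get trained; it only counts combinations and does not fit anything.

```python
from itertools import product

# Same grid as in the search above
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

# Every candidate is one choice per parameter
candidates = [dict(zip(parameters.keys(), combo))
              for combo in product(*parameters.values())]

n_candidates = len(candidates)
n_fits = n_candidates * 3          # cv=3: each candidate is trained on 3 folds

print('Candidates: %d' % n_candidates)
print('Cross-validation fits: %d' % n_fits)
```

Adding one more parameter with two options would double both numbers, which is why `n_jobs=-1` is used to spread the fits over all CPU cores.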

4. Multi-class classification

Predicting the sentiment of movie reviews

data = pd.read_csv("./chapter5_movie_train.csv",header=0,delimiter='\t')
data
data['Sentiment'].describe()
Output:
count    156060.000000
mean          2.063578
std           0.893832
min           0.000000
25%           2.000000
50%           2.000000
75%           3.000000
max           4.000000
Name: Sentiment, dtype: float64

On average the sentiment is fairly neutral: the mean is about 2.06 on the 0 to 4 scale.

data["Sentiment"].value_counts()/data["Sentiment"].count()
Output:
2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
Name: Sentiment, dtype: float64

About 51% of the examples carry the neutral sentiment (label 2).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('./chapter5_movie_train.csv', header=0, delimiter='\t')
X, y = df['Phrase'], df['Sentiment'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
Output:
Best score: 0.619
Best parameters set:
	clf__C: 10
	vect__max_df: 0.25
	vect__ngram_range: (1, 2)
	vect__use_idf: False
  • Performance metrics
predictions = grid_search.predict(X_test)

print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_test, predictions))
print('Classification Report:')
print(classification_report(y_test, predictions))
Output:
Accuracy: 0.6292323465333846
Confusion Matrix:
[[ 1013  1742   682   106    11]
 [  794  5914  6275   637    49]
 [  196  3207 32397  3686   222]
 [   28   488  6513  8131  1299]
 [    1    59   548  2388  1644]]
Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.29      0.36      3554
           1       0.52      0.43      0.47     13669
           2       0.70      0.82      0.75     39708
           3       0.54      0.49      0.52     16459
           4       0.51      0.35      0.42      4640

    accuracy                           0.63     78030
   macro avg       0.55      0.48      0.50     78030
weighted avg       0.61      0.63      0.62     78030
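
The figures in the classification report can be recomputed from the confusion matrix printed above. A sketch using only NumPy, assuming rows are true classes and columns are predicted classes (the layout scikit-learn documents for `confusion_matrix`); the majority-class baseline at the end is added for comparison and is not part of the original output.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([
    [1013,  1742,   682,   106,    11],
    [ 794,  5914,  6275,   637,    49],
    [ 196,  3207, 32397,  3686,   222],
    [  28,   488,  6513,  8131,  1299],
    [   1,    59,   548,  2388,  1644],
])

accuracy = np.trace(cm) / cm.sum()            # correct predictions lie on the diagonal
recall = np.diag(cm) / cm.sum(axis=1)         # per class: correct / all instances of that class
precision = np.diag(cm) / cm.sum(axis=0)      # per class: correct / all predictions of that class

# Accuracy of always predicting the most frequent class on this test split
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print('Accuracy: %s' % accuracy)
print('Recall per class: %s' % np.round(recall, 2))
print('Precision per class: %s' % np.round(precision, 2))
print('Majority-class baseline: %.3f' % majority_baseline)
```

The model beats the baseline of always predicting class 2, but most of its errors fall in the cells next to the diagonal, that is, it confuses neighbouring sentiment levels.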

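5. Multi-label classification

  • An instance can be assigned more than one label

Problem transformation:

  • Turn each combination of labels that occurs (for example L1 and L2) into a single class (L1 and L2), and so on. Drawbacks: this produces a large number of classes, and the model can only predict combinations that appear in the training data, so many possible combinations may not be covered
  • Train one binary classifier per label (is this instance L1? is it L2?). Drawback: this ignores the relationships between labels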
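
The two transformations above can be sketched on a toy label matrix. Everything here is hypothetical (4 instances, 3 labels named L1 to L3), and the sketch only builds the training targets; no classifier is fitted.

```python
import numpy as np

# Hypothetical multi-label targets: one row per instance, one column per label (L1, L2, L3)
Y = np.array([
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])

# 1) Each distinct label combination becomes one class of a single multi-class problem
combos = [tuple(row) for row in Y.tolist()]
classes = sorted(set(combos))                        # only combinations seen in the data
powerset_target = [classes.index(c) for c in combos]

# 2) One independent binary target per label
binary_targets = {'L%d' % (j + 1): Y[:, j] for j in range(Y.shape[1])}

print('Classes seen: %d of %d possible' % (len(classes), 2 ** Y.shape[1]))
print('Combined-class target:', powerset_target)
for name, target in binary_targets.items():
    print(name, target)
```

With 3 labels there are 8 possible combinations, yet only 3 occur here, which illustrates the coverage problem of the first approach; the second approach never sees that L1 always appears together with L2 in this data.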

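5.1 Performance metrics for multi-label classification

  • Hamming loss: the average fraction of incorrect labels; 0 is best
  • Jaccard similarity coefficient: size of the intersection of the predicted and true labels / size of their union; 1 is best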
from sklearn.metrics import hamming_loss, jaccard_score
# help(jaccard_score)

print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[0.0, 1.0], [1.0, 1.0]])))

print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [1.0, 1.0]])))

print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [0.0, 1.0]])))

print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[0.0, 1.0], [1.0, 1.0]]),average=None))

print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [1.0, 1.0]]),average=None))

print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [0.0, 1.0]]),average=None))
Output:
0.0
0.25
0.5
[1. 1.]
[0.5 1. ]
[0. 1.]
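
Both metrics are simple enough to compute by hand. The NumPy sketch below reproduces the last pair of results (0.5 and [0. 1.]) for the third prediction; the per-column Jaccard mirrors `average=None`, which reports one score per label. It assumes every label occurs at least once in the truth or the prediction, otherwise the union would be empty.

```python
import numpy as np

y_true = np.array([[0, 1], [1, 1]])
y_pred = np.array([[1, 1], [0, 1]])

# Hamming loss: fraction of individual label slots that are wrong
hamming = np.mean(y_true != y_pred)

# Jaccard per label (column): |true AND pred| / |true OR pred|
t, p = y_true.astype(bool), y_pred.astype(bool)
jaccard_per_label = (t & p).sum(axis=0) / (t | p).sum(axis=0)

print(hamming)
print(jaccard_per_label)
```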
Originally published 2020/06/30 on the author's personal site/blog.
