Logistic Regression Classification with sklearn

大鹅

Published 2021-06-15 15:25:59

Part of the column 大鹅专栏：大数据到机器学习 (From Big Data to Machine Learning)

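Data cleaning and feature selection with Pandas + Logistic Regression classification with sklearn

(Notes from a Data Mining assignment.) For background on LR basics, see here.

Data description and analysis

We have a dataset that records the video-streaming events of students watching a course's videos on the academic affairs website. The task is to use it to predict whether a student fails the course. (Is there really a relationship between the two?)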

user_id: Identifies the individual who is performing the action.
session: This 32-character value is a key that identifies the user's session. All browser events include a value for the session. Other mobile events do not include a session value.
load_video: This tag appears when the video is rendered and ready to play.
play_video: This tag appears when a user selects the video player's play control.
pause_video: This tag appears when a user selects the video player's pause control.
seek_video: This tag appears when a user selects a user interface control to go to a different point in the video file.
stop_video: This tag appears when the video player reaches the end of the video file and play automatically stops.
speed_change_video: This tag appears when a user selects a different playing speed for the video.
event_time: The time at which the event occurred. Gives the UTC time at which the event was emitted, in 'YYYY-MM-DDThh:mm:ss.xxxxxx' format.
new_time: The time in the video, in seconds, that the user selected as the destination point. This field appears for the seek_video action only.
old_time: The time in the video, in seconds, at which the user chose to go to a different point in the file. This field appears for the seek_video action only.
old_speed: The speed at which the video was playing. This field appears for the speed_change_video action only.
new_speed: The speed that the user selected for the video to play. This field appears for the speed_change_video action only.
grade: Final performance status, 0 for not pass and 1 for pass.
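As a small illustration of the fields above, the sketch below parses event_time strings in the stated format and reads old_time/new_time for a seek_video row. The sample rows are made up for illustration; the real file's column order and values are not shown in the post.

```python
import pandas as pd

# Hypothetical sample rows (not taken from the assignment's dataset).
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_type": ["play_video", "seek_video", "play_video"],
    "event_time": [
        "2017-03-01T08:15:30.123456",
        "2017-03-01T08:16:02.000001",
        "2017-03-02T21:40:00.500000",
    ],
    # old_time / new_time are only filled for seek_video rows.
    "old_time": [float("nan"), 12.0, float("nan")],
    "new_time": [float("nan"), 95.5, float("nan")],
})

# Parse the 'YYYY-MM-DDThh:mm:ss.xxxxxx' strings as timezone-aware UTC timestamps.
events["event_time"] = pd.to_datetime(events["event_time"], utc=True)

# How far each seek jumped, in seconds of video time.
seeks = events[events["event_type"] == "seek_video"]
seek_jump = (seeks["new_time"] - seeks["old_time"]).tolist()
print(seek_jump)
```

Timestamps parsed this way could feed time-based features, although the post itself only uses counts.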

Training environment

OS: Win 10
Python version: 3.6.3
Scikit-learn: 0.19.1
Pandas: 0.21.0
Numpy: 1.13.3

A typical example is run as:

python lr.py

Feature selection

The number of distinct videos the student has watched.
The number of times the student played a video.
The number of times the student paused a video.
The number of times the student stopped a video while watching.
The number of times the student changed the playback speed.
The number of sessions per student (how many times the student opened the browser to watch videos).

PS: These are of course very simple features; the timestamps and other fields in the dataset are not used at all.
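The count features listed above can be sketched compactly with pandas. This is a minimal illustration on a made-up event log, not the post's own cleaning code; in particular `video_id` is an assumed name for the video identifier column, which the field list does not name.

```python
import pandas as pd

# Hypothetical event log; video_id is an assumed column name.
log = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 2, 2],
    "video_id":   ["a", "a", "b", "b", "a", "a"],
    "session":    ["s1", "s1", "s2", "s2", "s3", "s3"],
    "event_type": ["play_video", "pause_video", "play_video",
                   "stop_video", "play_video", "speed_change_video"],
})

# One row per user, one column per event type; missing combinations become 0.
event_counts = pd.crosstab(log["user_id"], log["event_type"])

# Distinct videos and distinct sessions per user.
distinct = (
    log.groupby("user_id")[["video_id", "session"]]
    .nunique()
    .rename(columns={"video_id": "videos_watched", "session": "session_count"})
)

# Both frames are indexed by user_id, so a plain join lines them up.
features = distinct.join(event_counts).reset_index()
print(features)
```

Using `crosstab` fills absent event types with 0 directly, so no separate outer merges or `fillna` step is needed in this sketch.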

Model selection (LR, of course)

Use the logistic regression model.

Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome. The outcome is measured with a dichotomous variable (one with only two possible outcomes). In logistic regression, the dependent variable is binary, i.e. it only contains data coded as 1 (TRUE, success, pregnant, etc.) or 0 (FALSE, failure, non-pregnant, etc.).

The goal of logistic regression is to find the best fitting (yet biologically reasonable) model to describe the relationship between the dichotomous characteristic of interest (dependent variable = response or outcome variable) and a set of independent (predictor or explanatory) variables. Logistic regression generates the coefficients (and their standard errors and significance levels) of a formula to predict a logit transformation of the probability of presence of the characteristic of interest.

In scikit-learn, binary-class L2-penalized logistic regression is fit by minimizing a cost function that combines an L2 penalty on the weights with the logistic loss over the training samples.
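The cost function for the binary L2-penalized case appears to have been an image that did not survive on this page. The scikit-learn documentation listed in the references gives it in the following form, reproduced here on the assumption that this is the formula the text refers to:

```latex
\min_{w,\,c} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{n} \log\left( \exp\left( -y_i \left( X_i^{T} w + c \right) \right) + 1 \right)
```

Here $w$ is the weight vector, $c$ the intercept, $X_i$ the feature vector of sample $i$, $y_i \in \{-1, 1\}$ its label, $n$ the number of samples, and $C$ the inverse regularization strength (the `C` parameter of `LogisticRegression`).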

Default parameter values of LogisticRegression in sklearn

class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)

We can train with the default parameters directly, or tune the parameters (for example the regularization strength C) to suit the dataset.
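One common way to tune instead of relying on the defaults is a small grid search over `C`. The sketch below is only an illustration: it runs on synthetic data from `make_classification` rather than the assignment's features, and the candidate values for `C` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for six numeric features; not the assignment's data.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           random_state=0)

# Candidate regularization strengths; a smaller C means a stronger penalty.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validated search, scored by accuracy.
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)

best_C = search.best_params_["C"]
best_score = search.best_score_
print(best_C, best_score)
```

After fitting, `search.best_estimator_` holds a model refit on all of `X, y` with the selected `C`.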

Output

0.860396039604
0.866336633663
0.890099009901
0.869306930693
0.869306930693
0.880198019802
0.862376237624
0.870297029703
0.892079207921
0.887128712871
             precision    recall  f1-score   support

        neg       0.93      0.93      0.93       827
        pos       0.69      0.68      0.69       183

avg / total       0.89      0.89      0.89      1010

time spent: 7.203231573104858

The script then plots the P/R curve (the title shows AUC = 0.5, which is the value hard-coded in the call to plot_pr):
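The reference code passes a fixed 0.5 to `plot_pr` rather than computing the area. A minimal sketch of computing the area under a P/R curve with `sklearn.metrics.auc` is shown below, using made-up labels and scores rather than the assignment's predictions.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

# Hypothetical labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# auc() integrates y over x with the trapezoidal rule; here x is recall.
pr_auc = auc(recall, precision)
print("P/R AUC: %0.2f" % pr_auc)
```

In the script, the same call on its own `recall` and `precision` arrays would give a value that could be passed to `plot_pr` in place of the constant.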

Reference code

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, roc_curve, auc
from sklearn.metrics import classification_report
from matplotlib import pyplot
from matplotlib import pylab
import pandas as pd
import numpy as np
import time

start_time = time.time()
trainDf = pd.read_csv('TrainFeatures.csv')
testDf = pd.read_csv('TestFeatures.csv')
labelDf = pd.read_csv('TrainLabel.csv')


# Draw the precision/recall (P/R) curve
def plot_pr(auc_score, precision, recall, label=None):
    pylab.figure(num=None, figsize=(6, 5))
    pylab.xlim([0.0, 1.0])
    pylab.ylim([0.0, 1.0])
    pylab.xlabel('Recall')
    pylab.ylabel('Precision')
    pylab.title('P/R (AUC=%0.2f) / %s' % (auc_score, label))
    pylab.fill_between(recall, precision, alpha=0.5)
    pylab.grid(True, linestyle='-', color='0.75')
    pylab.plot(recall, precision, lw=1)
    pylab.show()


# Clean the raw event log and build one row of count features per user
def data_cleaning(df):
    # Feature: number of distinct videos per student
    # (columns 0 and 1 are presumably user_id and the video identifier)
    video_number = df.iloc[:, 0:2].drop_duplicates().dropna()
    video_number = video_number.groupby(by=['user_id']).size().reset_index(name='watchVideoTimes')
    # Feature: number of distinct sessions per student (columns 0 and 2)
    session_number = df.iloc[:, [0, 2]].drop_duplicates()
    session_number = session_number.groupby(by=['user_id']).size().reset_index(name='sessionCount')
    # Feature: number of events of each type per student (columns 0 and 7)
    video_type_number = df.iloc[:, [0, 7]].dropna()
    video_type_number = video_type_number.groupby(by=['user_id', 'event_type']).size()\
        .reset_index(name='video_type_number')
    # split the counts into one frame per event type
    play_video_times = video_type_number[video_type_number.event_type == 'play_video'].drop(['event_type'], axis=1)
    pause_video_times = video_type_number[video_type_number.event_type == 'pause_video'].drop(['event_type'], axis=1)
    seek_video_times = video_type_number[video_type_number.event_type == 'seek_video'].drop(['event_type'], axis=1)
    stop_video_times = video_type_number[video_type_number.event_type == 'stop_video'].drop(['event_type'], axis=1)
    speed_change_times = video_type_number[video_type_number.event_type == 'speed_change_video']\
        .drop(['event_type'], axis=1)
    # rename columns
    play_video_times.rename(columns={'video_type_number': 'play_video_times'}, inplace=True)
    pause_video_times.rename(columns={'video_type_number': 'pause_video_times'}, inplace=True)
    seek_video_times.rename(columns={'video_type_number': 'seek_video_times'}, inplace=True)
    stop_video_times.rename(columns={'video_type_number': 'stop_video_times'}, inplace=True)
    speed_change_times.rename(columns={'video_type_number': 'speed_change_times'}, inplace=True)
    # merge the feature columns on key = user_id
    feature_df = pd.merge(video_number, session_number, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, play_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, pause_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, seek_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, stop_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, speed_change_times, on='user_id', how='outer')
    # a student with no events of some type has NaN after the outer merges; replace with 0
    feature_df = feature_df.fillna(0)
    return feature_df

trainingFeature = data_cleaning(trainDf)
testingFeature = data_cleaning(testDf)
trainingFeature = pd.merge(trainingFeature, labelDf, on='user_id')
# trainingFeature.to_csv('cleaning_data_training.csv')
# testingFeature.to_csv('cleaning_data_testing.csv')

# Train on 10 random 80/20 splits and accumulate the accuracy of each
average = 0
testNum = 10
for i in range(0, testNum):
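    # Columns 1-6 are the features from watchVideoTimes through stop_video_times
    # (speed_change_times in column 7 is left out by this slice); column 8 is
    # taken as the label (grade).
    X_train, X_test, y_train, y_test = train_test_split(
        trainingFeature.iloc[:, 1:7], trainingFeature.iloc[:, 8], test_size=0.2)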
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    p = np.mean(y_pred == y_test)  # accuracy on the held-out split
    print(p)
    average += p

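# precision and recall (uses the model and split from the last iteration)
answer = lr.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, answer)
report = (answer > 0.5).astype(int)
print(classification_report(y_test, report, target_names=['neg', 'pos']))
# p above is the fraction of correct predictions, so this average is an accuracy
print("average accuracy:", average / testNum)
print("time spent:", time.time() - start_time)
# 0.5 is a hard-coded placeholder for the AUC shown in the plot title
plot_pr(0.5, precision, recall, "pos")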

# Predict on the test data with the model from the last iteration
predict = lr.predict(testingFeature.iloc[:, 1:7])
output = pd.DataFrame(predict.T, columns=['grade'])
output.insert(0, 'user_id', testingFeature.iloc[:, 0])
output.to_csv('prediction.csv', index=False)
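The ten random 80/20 splits in the loop above can also be expressed with scikit-learn's own splitters. The following is a sketch of that alternative on synthetic data, not the post's method; `ShuffleSplit` with `cross_val_score` repeats the fit-and-score step and returns one accuracy per split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in for the six count features; not the assignment's data.
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           random_state=0)

# Ten independent random splits, each holding out 20% for testing.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# One accuracy value per split (accuracy is the default score for classifiers).
scores = cross_val_score(LogisticRegression(solver="liblinear"), X, y, cv=cv)
print(scores)
print("average accuracy:", scores.mean())
```

Fixing `random_state` makes the splits reproducible between runs, which the manual loop does not do.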

References

Scikit-learn documentation, Logistic regression: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Li Hang, Statistical Learning Methods (李航, 统计学习方法)
Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation: https://czep.net/stat/mlelr.pdf

Originally published: 2017-11-26
