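(Notes from a Data Mining assignment)
For the basics of LR (logistic regression), see here.
The dataset records students' video-stream events for one course on the school's academic affairs website, and the task is to predict from those events whether a student fails the course. (Is there really a connection between the two?)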
user_id: Identifies the individual who is performing the action.
session: This 32-character value is a key that identifies the user's session. All browser events include a value for the session. Other mobile events do not include a session value.
load_video: This tag appears when the video is rendered and ready to play.
play_video: This tag appears when a user selects the video player's play control.
pause_video: This tag appears when a user selects the video player's pause control.
seek_video: This tag appears when a user selects a user interface control to go to a different point in the video file.
stop_video: This tag appears when the video player reaches the end of the video file and play automatically stops.
speed_change_video: This tag appears when a user selects a different playing speed for the video.
event_time: The time that this event occurs. Gives the UTC time at which the event was emitted in 'YYYY-MM-DDThh:mm:ss.xxxxxx' format.
new_time: The time in the video, in seconds, that the user selected as the destination point. This field appears for the seek_video action only.
old_time: The time in the video, in seconds, at which the user chose to go to a different point in the file. This field appears for the seek_video action only.
old_speed: The speed at which the video was playing. This field appears for the speed_change_video action only.
new_speed: The speed that the user selected for the video to play. This field appears for the speed_change_video action only.
grade: Final performance status, 0 for fail and 1 for pass.
OS: Win 10
Python version: 3.6.3
Scikit-learn: 0.19.1
Pandas: 0.21.0
Numpy: 1.13.3
A typical example is run as:
python lr.py
PS: These are of course very simple features; the timestamps and the other time fields in the dataset are not used at all.
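As a rough idea of how the unused time fields could be turned into features, here is a minimal sketch. It is not part of the assignment code: the tiny `events` table is made up, and the feature names `active_seconds` and `net_seek_seconds` are hypothetical. Only the field names (`user_id`, `event_type`, `event_time`, `old_time`, `new_time`) come from the description above.

```python
import pandas as pd

nan = float("nan")

# Made-up mini log that reuses the field names from the dataset description.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_type": ["play_video", "seek_video", "pause_video",
                   "play_video", "seek_video"],
    "event_time": [
        "2017-10-01T08:00:00.000000",
        "2017-10-01T08:05:00.000000",
        "2017-10-01T08:20:00.000000",
        "2017-10-02T09:00:00.000000",
        "2017-10-02T09:01:30.000000",
    ],
    "old_time": [nan, 120.0, nan, nan, 30.0],
    "new_time": [nan, 300.0, nan, nan, 10.0],
})

events["event_time"] = pd.to_datetime(events["event_time"])

# Seconds between the first and the last logged event of each user.
span = events.groupby("user_id")["event_time"].agg(["min", "max"])
active_seconds = (span["max"] - span["min"]).dt.total_seconds()
active_seconds = active_seconds.rename("active_seconds")

# Net seek distance per user: positive means skipping forward overall.
seeks = events[events["event_type"] == "seek_video"].copy()
seeks["seek_delta"] = seeks["new_time"] - seeks["old_time"]
net_seek = seeks.groupby("user_id")["seek_delta"].sum().rename("net_seek_seconds")

# One row per user, with 0 for users who never used the seek control.
time_features = active_seconds.to_frame().join(net_seek).fillna(0).reset_index()
print(time_features)
```

A table like `time_features` could then be merged onto the count features by `user_id`, the same way the other features are merged in the script.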
Use the logistic regression model.
Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes). In logistic regression, the dependent variable is binary or dichotomous, i.e. it only contains data coded as 1 (TRUE, success, pregnant, etc.) or 0 (FALSE, failure, non-pregnant, etc.). The goal of logistic regression is to find the best fitting (yet biologically reasonable) model to describe the relationship between the dichotomous characteristic of interest (dependent variable = response or outcome variable) and a set of independent (predictor or explanatory) variables. Logistic regression generates the coefficients (and its standard errors and significance levels) of a formula to predict a logit transformation of the probability of presence of the characteristic of interest. Binary class L2 penalized logistic regression minimizes the following cost function:
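For reference, the scikit-learn user guide writes this objective in the following form, with labels $y_i \in \{-1, 1\}$, weight vector $w$, intercept $c$, and $C$ the inverse regularization strength (the `C` parameter of `LogisticRegression`):

```latex
\min_{w, c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left( \exp\left( -y_i \left( X_i^T w + c \right) \right) + 1 \right)
```

The first term is the L2 penalty and the second is the logistic loss summed over the $n$ training samples, so a smaller $C$ means stronger regularization.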
Default parameter values of LogisticRegression in sklearn
class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
We can train with the default parameters directly; of course, the parameters can also be tuned to suit the dataset.
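One possible way to do that tuning is a small grid search over `C` and `penalty`. This is only a sketch: the data is synthetic (`make_classification` stands in for the real feature table), and the grid values and the `f1` scoring choice are assumptions, not what the assignment used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (an assumption, not the course dataset).
X, y = make_classification(n_samples=500, n_features=7, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

# Candidate values; smaller C means stronger regularization.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}

# liblinear supports both the l1 and the l2 penalty.
search = GridSearchCV(LogisticRegression(solver="liblinear"),
                      param_grid, scoring="f1", cv=5)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV f1: %.3f" % search.best_score_)
```

Scoring on f1 rather than accuracy is a deliberate choice here, because with few positive samples accuracy alone says little about the minority class.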
0.860396039604
0.866336633663
0.890099009901
0.869306930693
0.869306930693
0.880198019802
0.862376237624
0.870297029703
0.892079207921
0.887128712871
             precision    recall  f1-score   support

        neg       0.93      0.93      0.93       827
        pos       0.69      0.68      0.69       183

avg / total       0.89      0.89      0.89      1010

time spent: 7.203231573104858
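To put these numbers in context, a quick back-of-the-envelope check using only the support counts and f1 scores printed in the report: it computes what a classifier that always answers "neg" would score, and re-derives the "avg / total" f1 as a support-weighted mean. The variable names are just for illustration.

```python
# Support counts copied from the classification report.
neg_support, pos_support = 827, 183
total = neg_support + pos_support

# Accuracy of a trivial classifier that always predicts "neg".
majority_baseline = neg_support / total
print("majority-class baseline accuracy: %.3f" % majority_baseline)  # 0.819

# Support-weighted mean of the per-class f1 scores ("avg / total" row).
weighted_f1 = (0.93 * neg_support + 0.69 * pos_support) / total
print("weighted f1: %.2f" % weighted_f1)  # 0.89
```

So the accuracies of about 0.86 to 0.89 sit only a few points above the 0.819 baseline, and the "pos" row (0.69 / 0.68) is the more informative part of the report.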
Plot the P/R curve (in the original run the AUC shown in the plot title was a hard-coded 0.5):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, roc_curve, auc
from sklearn.metrics import classification_report
from matplotlib import pyplot
from matplotlib import pylab
import pandas as pd
import numpy as np
import time
start_time = time.time()
trainDf = pd.read_csv('TrainFeatures.csv')
testDf = pd.read_csv('TestFeatures.csv')
labelDf = pd.read_csv('TrainLabel.csv')
# Draw the precision/recall (P/R) curve
def plot_pr(auc_score, precision, recall, label=None):
    pylab.figure(num=None, figsize=(6, 5))
    pylab.xlim([0.0, 1.0])
    pylab.ylim([0.0, 1.0])
    pylab.xlabel('Recall')
    pylab.ylabel('Precision')
    pylab.title('P/R (AUC=%0.2f) / %s' % (auc_score, label))
    pylab.fill_between(recall, precision, alpha=0.5)
    pylab.grid(True, linestyle='-', color='0.75')
    pylab.plot(recall, precision, lw=1)
    pylab.show()
# Clean the raw event log and build one row of count features per user
def data_cleaning(df):
    # Feature: number of videos per student
    video_number = df.iloc[:, 0:2].drop_duplicates().dropna()
    video_number = video_number.groupby(by=['user_id']).size().reset_index(name='watchVideoTimes')
    # Feature: number of sessions per student
    session_number = df.iloc[:, [0, 2]].drop_duplicates()
    session_number = session_number.groupby(by=['user_id']).size().reset_index(name='sessionCount')
    # Feature: number of events per student and event type
    video_type_number = df.iloc[:, [0, 7]].dropna()
    video_type_number = video_type_number.groupby(by=['user_id', 'event_type']).size()\
        .reset_index(name='video_type_number')
    # Split the counts by event_type (play, pause, seek, stop, speed change)
    play_video_times = video_type_number[video_type_number.event_type == 'play_video'].drop(['event_type'], axis=1)
    pause_video_times = video_type_number[video_type_number.event_type == 'pause_video'].drop(['event_type'], axis=1)
    seek_video_times = video_type_number[video_type_number.event_type == 'seek_video'].drop(['event_type'], axis=1)
    stop_video_times = video_type_number[video_type_number.event_type == 'stop_video'].drop(['event_type'], axis=1)
    speed_change_times = video_type_number[video_type_number.event_type == 'speed_change_video']\
        .drop(['event_type'], axis=1)
    # Rename the count column so that each event type gets its own feature name
    play_video_times.rename(columns={'video_type_number': 'play_video_times'}, inplace=True)
    pause_video_times.rename(columns={'video_type_number': 'pause_video_times'}, inplace=True)
    seek_video_times.rename(columns={'video_type_number': 'seek_video_times'}, inplace=True)
    stop_video_times.rename(columns={'video_type_number': 'stop_video_times'}, inplace=True)
    speed_change_times.rename(columns={'video_type_number': 'speed_change_times'}, inplace=True)
    # Merge the feature columns on key = user_id (outer join keeps every user)
    feature_df = pd.merge(video_number, session_number, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, play_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, pause_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, seek_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, stop_video_times, on='user_id', how='outer')
    feature_df = pd.merge(feature_df, speed_change_times, on='user_id', how='outer')
    # Replace NaN with 0 (the user never triggered that event)
    feature_df = feature_df.fillna(0)
    return feature_df
trainingFeature = data_cleaning(trainDf)
testingFeature = data_cleaning(testDf)
trainingFeature = pd.merge(trainingFeature, labelDf, on='user_id')
# trainingFeature.to_csv('cleaning_data_training.csv')
# testingFeature.to_csv('cleaning_data_testing.csv')
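# training model: 10 random 80/20 splits, averaging the accuracy
average = 0
testNum = 10
for i in range(0, testNum):
    # Features are columns 1..6; note that this slice stops before column 7 (speed_change_times).
    # Column 8 is the label column merged in from TrainLabel.csv.
    X_train, X_test, y_train, y_test = train_test_split(trainingFeature.iloc[:, 1:7], trainingFeature.iloc[:, 8],
                                                        test_size=0.2)
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    p = np.mean(y_pred == y_test)  # accuracy on this split
    print(p)
    average += p
# precision and recall, computed with the model and test set of the last split
answer = lr.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, answer)
report = answer > 0.5
print(classification_report(y_test, report, target_names=['neg', 'pos']))
# p above is accuracy, so this is the average accuracy over all splits
print("average accuracy:", average / testNum)
print("time spent:", time.time() - start_time)
# Use the area under the P/R curve in the plot title instead of a hard-coded 0.5
plot_pr(auc(recall, precision), precision, recall, "pos")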
# predict testing data
predict = lr.predict(testingFeature.iloc[:, 1:7])
output = pd.DataFrame(predict.T, columns=['grade'])
output.insert(0, 'user_id', testingFeature.iloc[:, 0])
output.to_csv('prediction.csv', index=False)
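The repeated random split in the training loop can also be expressed with scikit-learn's cross-validation helpers. The sketch below is an alternative, not what the script above does: it uses synthetic data from `make_classification` as a stand-in for the feature table, and stratified 10-fold splitting so that every fold keeps roughly the same share of positive samples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the features and grade labels (an assumption).
X, y = make_classification(n_samples=1000, n_features=7, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

# 10 stratified folds; a fixed seed makes the run repeatable.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = LogisticRegression(solver="liblinear")

accuracy = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
f1 = cross_val_score(model, X, y, cv=cv, scoring="f1")

print("mean accuracy: %.3f" % accuracy.mean())
print("mean f1 (positive class): %.3f" % f1.mean())
```

Compared with independent random splits, each sample lands in a test fold exactly once, which makes the averaged score a little less dependent on luck.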