This article is a new Kaggle case study based on a dataset of Black Friday purchases from overseas.
Black Friday in Western countries is similar to China's "Double Eleven" shopping festival and generates a large amount of consumer data.
The dataset records the purchases users made across a large selection of products on Black Friday, consisting mainly of two parts: user attributes and product purchase information.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as mpatches
import matplotlib
# Fonts for displaying Chinese characters
plt.rcParams["font.sans-serif"] = ["SimHei"]  # set the font
plt.rcParams["axes.unicode_minus"] = False  # display minus signs correctly
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
plt.style.use('ggplot')
sns.set(context='notebook',
style='darkgrid',
palette='colorblind',
font='sans-serif',
font_scale=1,
rc=None)
matplotlib.rcParams['figure.figsize'] =[8,8]
matplotlib.rcParams.update({'font.size': 15})
matplotlib.rcParams['font.family'] = 'sans-serif'
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
# Suppress warnings in the notebook
import warnings
warnings.filterwarnings("ignore")
df1 = pd.read_csv("train.csv")
df1.head()
Basic information about the data:
In [3]:
df1.shape
Out[3]:
In total there are 550k+ rows:
(550068, 12)
In [4]:
columns = df1.columns  # all column names
columns
Out[4]:
Index(['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category',
'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
'Product_Category_2', 'Product_Category_3', 'Purchase'],
dtype='object')
In [5]:
df1.dtypes
Out[5]:
User_ID int64
Product_ID object
Gender object
Age object
Occupation int64
City_Category object
Stay_In_Current_City_Years object
Marital_Status int64
Product_Category_1 int64
Product_Category_2 float64
Product_Category_3 float64
Purchase int64
dtype: object
In [6]:
# share of each column dtype
df1.dtypes.value_counts().plot.pie(explode=[0.1,0.1,0.1],
autopct='%1.2f%%',
shadow=True)
plt.title('type of our data')
plt.show()
Let's look at the overall missing-value situation; the missing values are handled in a dedicated step later.
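As a quick first check (a minimal sketch; the full isnull() count appears in the missing-value section below), df1.info() reports dtypes and non-null counts in one call:
df1.info()  # non-null counts show that Product_Category_2/3 contain missing values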
Next we count and visualize the data from several different angles.
In [9]:
df2 = df1["Gender"].value_counts().reset_index()
df2
Out[9]:
 | index | Gender
---|---|---
0 | M | 414259
1 | F | 135809
Count distribution by gender:
In [10]:
colors = ["red", "blue"]
sns.countplot(x="Gender", data=df1, palette=colors)
plt.title("Gender Count")
plt.show()
Percentage distribution by gender:
# count the number of each gender
size = df1['Gender'].value_counts()
labels = ['Male', 'Female']
colors = ['#C4061D', 'green']
explode = [0, 0.1]
plt.rcParams['figure.figsize'] = (10, 10)
plt.pie(size,
colors = colors,
labels = labels,
shadow = True,
explode = explode,
autopct = '%.2f%%')
plt.title('Gender Percent', fontsize = 20)
plt.axis('off')
plt.legend()
plt.show()
In [12]:
df3 = df1["Occupation"].value_counts().sort_index().reset_index()
df3.head()
Out[12]:
 | index | Occupation
---|---|---
0 | 0 | 69638
1 | 1 | 47426
2 | 2 | 26588
3 | 3 | 17650
4 | 4 | 72308
In [13]:
fig = px.bar(df3,
             x="index",
             y="Occupation",
             color="Occupation")
fig.show()
Counts by occupation: occupations 0, 4, and 7 have the most shoppers.
Below is the same count using seaborn:
palette=sns.color_palette("Set2")
plt.rcParams['figure.figsize'] = (18, 9)
sns.countplot(x='Occupation', data=df1, palette = palette)
plt.title('Total count by Occupation',  # number of shoppers per occupation
          fontsize = 20)
plt.xlabel('Occupation')
plt.ylabel('number')
plt.show()
Total purchase amount by occupation:
In [69]:
sum_by_occ = df1.groupby('Occupation')['Purchase'].sum()
plt.figure(figsize=(20, 6))
sns.barplot(x=sum_by_occ.index,y=sum_by_occ.values)
plt.title('total amount of Occupation')
plt.show()
In [16]:
df4 = df1["Age"].value_counts().reset_index().sort_values("index")
df4
Out[16]:
 | index | Age
---|---|---
6 | 0-17 | 15102
2 | 18-25 | 99660
0 | 26-35 | 219587
1 | 36-45 | 110013
3 | 46-50 | 45701
4 | 51-55 | 38501
5 | 55+ | 21504
In [17]:
fig = px.pie(df4, names="index",values="Age")
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
The same data shown as a bar chart:
In [19]:
columns
Out[19]:
Index(['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category',
'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
'Product_Category_2', 'Product_Category_3', 'Purchase'],
dtype='object')
In [20]:
df5 = df1.groupby(["Gender", "Age"]).size().reset_index()
df5.columns = ["Gender" ,"Age", "Number"]
df5
Out[20]:
 | Gender | Age | Number
---|---|---|---
0 | F | 0-17 | 5083
1 | F | 18-25 | 24628
2 | F | 26-35 | 50752
3 | F | 36-45 | 27170
4 | F | 46-50 | 13199
5 | F | 51-55 | 9894
6 | F | 55+ | 5083
7 | M | 0-17 | 10019
8 | M | 18-25 | 75032
9 | M | 26-35 | 168835
10 | M | 36-45 | 82843
11 | M | 46-50 | 32502
12 | M | 51-55 | 28607
13 | M | 55+ | 16421
In [21]:
fig = px.bar(df5, x="Age", y="Number", color="Gender", text="Number")
fig.show()
In [72]:
plt.rcParams['figure.figsize'] = (18, 9)
# count the number of shoppers in each city category
sns.countplot(x='City_Category', data=df1, palette = palette)
plt.title('Number of people by city', fontsize = 20)
plt.xlabel('city')
plt.ylabel('people')
plt.show()
Next, statistical analysis of the Stay_In_Current_City_Years field:
In [23]:
df6 = df1["Stay_In_Current_City_Years"].value_counts().reset_index()
df6
Out[23]:
 | index | Stay_In_Current_City_Years
---|---|---
0 | 1 | 193821
1 | 2 | 101838
2 | 3 | 95285
3 | 4+ | 84726
4 | 0 | 74398
In [24]:
fig = px.pie(df6, names="index",values="Stay_In_Current_City_Years")
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
Marital_Status: 0 = unmarried, 1 = married.
In [25]:
df1["Marital_Status"].value_counts(normalize=True)
Out[25]:
0 0.590347
1 0.409653
Name: Marital_Status, dtype: float64
In [26]:
df8 = df1["Product_Category_1"].value_counts(normalize=True).reset_index()
df8.head()
Out[26]:
 | index | Product_Category_1
---|---|---
0 | 5 | 0.274390
1 | 1 | 0.255201
2 | 8 | 0.207111
3 | 11 | 0.044153
4 | 2 | 0.043384
The chart below shows the share of each value of Product_Category_1; purchases are concentrated in categories 5, 1, and 8.
In [27]:
fig = px.bar(df8, x="index", y="Product_Category_1")
fig.show()
View the summary statistics of the Product_Category_1 field; a sketch is shown below.
The same method applies to the distributions and summary statistics of Product_Category_2 and Product_Category_3.
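The statistics themselves are not shown above; a minimal sketch of the usual call:
df1["Product_Category_1"].describe()  # count, mean, std, min, quartiles, max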
In [29]:
df1.isnull().sum()  # check the missing values
Out[29]:
User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Product_Category_2 173638
Product_Category_3 383247
Purchase 0
dtype: int64
How the missing values are handled:
In [30]:
# For Product_Category_2, use the median
median = df1["Product_Category_2"].median()
median
Out[30]:
9.0
In [31]:
df1["Product_Category_2"].fillna(median, inplace=True) # 填充均值
assert断言是否有缺失值:
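A minimal sketch of such a check (the original does not show the assert itself):
# raises AssertionError if the column still contains missing values
assert df1["Product_Category_2"].isnull().sum() == 0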
In [32]:
df1.isnull().sum()
Out[32]:
User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Product_Category_2 0 # no missing values anymore
Product_Category_3 383247
Purchase 0
dtype: int64
In [33]:
# rows that contain at least one missing value
df1[df1.isnull().any(axis=1)]
We find that Product_Category_3 has missing values at both the head and the tail of the data, so we fill using forward fill and backward fill together, as sketched below:
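A minimal sketch of this combined fill, assuming forward fill first and then backward fill to cover missing values in the very first rows:
# forward fill, then backward fill so that missing values at the
# start of the data are also covered
df1["Product_Category_3"].fillna(method="ffill", inplace=True)
df1["Product_Category_3"].fillna(method="bfill", inplace=True)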
Fitting the target variable Purchase against a normal distribution:
In [38]:
from scipy import stats
from scipy.stats import norm
In [39]:
plt.rcParams['figure.figsize'] = (20, 7)
sns.distplot(df1['Purchase'],  # values to plot
             color = 'green',  # color
             fit = norm  # overlay a fitted normal curve
             )
# mean and standard deviation of the fitted normal
mu, sigma = norm.fit(df1['Purchase'])
print("The mu {} and Sigma {} for the curve".format(mu, sigma))
plt.title('Distribution of the target variable')
plt.legend([r'Normal distribution ($\mu$: {:.2f}, $\sigma$: {:.2f})'.format(mu, sigma)], loc = 'best')
plt.show()
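As an extra normality check (not in the original; a minimal sketch using the scipy.stats imported above), a probability plot compares sample quantiles against a theoretical normal:
plt.figure(figsize=(8, 6))
stats.probplot(df1['Purchase'], plot=plt)  # Q-Q plot against the normal distribution
plt.show()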
In [40]:
df1.drop(["User_ID", "Product_ID"], inplace=True, axis=1)
In [41]:
df1.columns
Out[41]:
Index(['Gender', 'Age', 'Occupation', 'City_Category',
'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
'Product_Category_2', 'Product_Category_3', 'Purchase'],
dtype='object')
In [42]:
df9 = df1.select_dtypes(include="object")
cat_col = df9.columns
cat_col
Out[42]:
Index(['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years'], dtype='object')
In [43]:
df10 = df1.select_dtypes(exclude="object")  # numeric columns
One-hot encode the string columns to generate dummy variables:
In [44]:
df_Gender = pd.get_dummies(df1['Gender'])
df_Age = pd.get_dummies(df1['Age'])
df_City_Category = pd.get_dummies(df1['City_Category'])
df_Stay_In_Current_City_Years = pd.get_dummies(df1['Stay_In_Current_City_Years'])
In [45]:
df11 = pd.concat([df10, df_Gender, df_Age, df_City_Category, df_Stay_In_Current_City_Years], axis=1)
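A more compact alternative (a sketch; the dummy columns get prefixes such as Gender_F instead of the bare category names) encodes all categorical columns in one call while keeping the numeric ones:
df11 = pd.get_dummies(df1, columns=list(cat_col))  # broadly equivalent to the concat above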
In [46]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
In [47]:
X = df11.drop("Purchase", axis=1)
y = df11["Purchase"]
In [48]:
columns = X.columns
In [49]:
ss = StandardScaler()
X = ss.fit_transform(X)
# to recover the original values:
# origin_data = ss.inverse_transform(X)
X = pd.DataFrame(X, columns=columns)
X
Out[49]:
In [50]:
# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=10)
X_train.shape
Out[50]:
(440054, 22)
In [51]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
Out[51]:
LinearRegression()
In [53]:
print('Intercept parameter:', lr.intercept_)
coeff_df = pd.DataFrame(lr.coef_,
X.columns,
columns=['Coefficient'])
coeff_df
Intercept parameter: 9262.589703346423
Predictions on the test set:
In [54]:
predictions = lr.predict(X_test)
predictions
Out[54]:
array([ 7982.58970335, 9525.58970335, 7910.58970335, ...,
12058.58970335, 9946.58970335, 8621.83970335])
In [55]:
from sklearn import metrics
# two key metrics for regression problems
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
MAE: 3591.343500225385
MSE: 21934597.75355938
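MSE is on a squared scale; a natural companion (a small addition, not computed in the original) is RMSE, the square root of MSE:
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))  # ≈ 4683 given the MSE above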
Three important attributes of the random forest (sketched after the fit below):
In [56]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(max_depth=5)
rf.fit(X_train, y_train)
Out[56]:
RandomForestRegressor(max_depth=5)
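The three attributes are not listed in the text; a sketch of three commonly inspected attributes of a fitted RandomForestRegressor (an assumption about which three were meant):
rf.feature_importances_  # impurity-based importance of each feature
rf.estimators_           # the list of individual fitted decision trees
rf.n_features_in_        # number of features seen during fit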
In [57]:
predictions_rf = rf.predict(X_test)
predictions_rf
Out[57]:
array([ 7895.32997516, 6153.32785418, 7895.32997516, ...,
15413.87676836, 2229.59133606, 6153.32785418])
In [58]:
from sklearn import metrics
# two key metrics for regression problems
print('MAE:', metrics.mean_absolute_error(y_test, predictions_rf))
print('MSE:', metrics.mean_squared_error(y_test, predictions_rf))
MAE: 2396.6110975637253
MSE: 10568250.95352542
Inputs to a neural network are usually kept small, so we shrink the targets y_train and y_test by a factor of 10,000, i.e., measure them in units of 10,000:
In [59]:
import tensorflow as tf
from tensorflow.keras import models
from tensorflow.keras import layers
np.random.seed(123)
pd.options.mode.chained_assignment = None
In [60]:
y_train /= 10000
y_test /= 10000
In [61]:
model = models.Sequential()
model.add(layers.Dense(64,
                       activation="relu",
                       input_shape=(X_train.shape[1],)))
model.add(layers.Dense(64,
                       activation="relu"))
model.add(layers.Dense(1))
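To double-check the architecture (a small addition, not in the original notebook), Keras can print a layer-by-layer summary:
model.summary()  # layer output shapes and parameter counts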
In [62]:
model.compile(optimizer="rmsprop",  # optimizer
              loss="mse",  # loss function
              metrics=["mae"]  # metric: mean absolute error
)
In [63]:
history = model.fit(X_train,
y_train,
epochs=100,
validation_split=0.2,
batch_size=4500,
verbose=0  # 0 = silent, 1 = progress log
)
In [64]:
mae_history = history.history["mae"]
loss_history = history.history["loss"]
In [65]:
len(mae_history)
Out[65]:
100
In [66]:
# plot the training loss and MAE
import matplotlib.pyplot as plt
epochs = range(1,len(loss_history) + 1)
plt.plot(epochs,  # epoch index
         loss_history,  # loss values
"r",
label="loss"
)
plt.plot(epochs,
mae_history,
"b",
label="mae"
)
plt.title("Loss and Mae")
plt.xlabel("Epochs")
plt.legend()
plt.show()
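Because validation_split=0.2 was passed to fit(), history.history also holds validation metrics; a minimal sketch for plotting them (keys 'val_loss' and 'val_mae' follow from the compile settings above):
plt.plot(epochs, history.history["val_loss"], label="val_loss")
plt.plot(epochs, history.history["val_mae"], label="val_mae")
plt.title("Validation loss and MAE")
plt.xlabel("Epochs")
plt.legend()
plt.show()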
In [67]:
model.evaluate(X_test, y_test)
3438/3438 [==============================] - 7s 2ms/step - loss: 0.0975 - mae: 0.2382
Out[67]:
[0.09745179861783981, 0.2381529062986374]
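Because y was divided by 10,000, the MSE scales back by 10,000² and the MAE by 10,000; a sketch of the conversion:
loss, mae = model.evaluate(X_test, y_test, verbose=0)
print("MSE (original units):", loss * 10000 ** 2)  # ≈ 9,745,000
print("MAE (original units):", mae * 10000)        # ≈ 2,381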
Comparison of the three approaches (the deep-learning metrics must be converted back to the original units):
 | LOSS (MSE) | MAE
---|---|---
Linear regression | 21,934,597 | 3591
Random forest regression | 10,568,250 | 2396
Keras deep learning | 0.09745 (≈ 9,745,000) | 0.2381 (≈ 2381)
Conclusion: deep learning still comes out slightly ahead!