Credit Data Modeling with LightGBM

Original

皮大大

Published 2024-01-19 11:35:36

Column: Machine Learning / Data Visualization

WeChat public account: 尤而小屋. Author: Peter. Editor: Peter.

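Hi everyone, I'm Peter.

This is the second article in the series on the UCI credit dataset: binary classification with LightGBM. It covers:

Basic information about the data
Missing values
Summary statistics for the individual fields
Imbalance of the target variable
Correlation analysis between variables
Normality check of the fields with QQ plots
Data preprocessing (encoding, normalization, dimensionality reduction)
Evaluation metrics for classification models
Building the model with LightGBM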

1 Importing libraries

The first step is to import the libraries needed for data processing and modeling:

In 1:

import pandas as pd 
import numpy as np
pd.set_option('display.max_columns', 100)
from IPython.display import display_html


import plotly.express as px
import plotly.graph_objects as go

import matplotlib
import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"]=["SimHei"] # font that can render Chinese characters
plt.rcParams["axes.unicode_minus"]=False # render the minus sign "-" correctly

import seaborn as sns
%matplotlib inline 

import missingno as ms 
import gc

from datetime import datetime 
from sklearn.model_selection import train_test_split,StratifiedKFold,GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from imblearn.under_sampling import ClusterCentroids
from imblearn.over_sampling import KMeansSMOTE, SMOTE
from sklearn.model_selection import KFold

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, auc
from sklearn.metrics import roc_auc_score,precision_recall_curve, confusion_matrix,classification_report

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import tree
from pydotplus import graph_from_dot_data
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
import xgboost as xgb

from scipy import stats

import warnings 
warnings.filterwarnings("ignore")

2 Loading the data

In 2:

df = pd.read_csv("UCI.csv")

df.head()

Out2:

3 Basic information about the data

1. Overall size

The dataset has 30,000 records and 25 fields:

In 3:

df.shape

Out3:

(30000, 25)

2. Field information

In 4:

df.columns  # all field names

Out4:

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default.payment.next.month'],
      dtype='object')

Field types:

In 5:

df.dtypes  # data type of each field

Out5:

ID                              int64
LIMIT_BAL                     float64
SEX                             int64
EDUCATION                       int64
MARRIAGE                        int64
AGE                             int64
PAY_0                           int64
PAY_2                           int64
PAY_3                           int64
PAY_4                           int64
PAY_5                           int64
PAY_6                           int64
BILL_AMT1                     float64
BILL_AMT2                     float64
BILL_AMT3                     float64
BILL_AMT4                     float64
BILL_AMT5                     float64
BILL_AMT6                     float64
PAY_AMT1                      float64
PAY_AMT2                      float64
PAY_AMT3                      float64
PAY_AMT4                      float64
PAY_AMT5                      float64
PAY_AMT6                      float64
default.payment.next.month      int64
dtype: object

In 6:

df.dtypes.value_counts()  # number of fields per dtype

Out6:

float64    13
int64      12
Name: count, dtype: int64

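The result shows that every field is numeric, split almost evenly between float64 (13 fields) and int64 (12 fields). The last field, default.payment.next.month, is the target.

The meaning of each field:

ID: unique identifier
LIMIT_BAL: credit limit (in New Taiwan dollars, covering both individual and family credit)
SEX: gender: 1-male, 2-female
EDUCATION: 1-graduate school; 2-university; 3-high school; 4-other; 0/5/6-unknown
MARRIAGE: marital status: 1-married; 2-single; 3-other
AGE: age
PAY_0: repayment status in September 2005 (-2-no consumption, -1-paid on time, 1-payment delayed by one month, 2-payment delayed by two months, ..., 8-payment delayed by 8 months, 9-payment delayed by 9 months)
PAY_2: repayment status in August 2005 (same scale)
PAY_3: repayment status in July 2005 (same scale)
PAY_4: repayment status in June 2005 (same scale)
PAY_5: repayment status in May 2005 (same scale)
PAY_6: repayment status in April 2005 (same scale)
BILL_AMT1: bill amount in September 2005
BILL_AMT2: bill amount in August 2005
BILL_AMT3: bill amount in July 2005
BILL_AMT4: bill amount in June 2005
BILL_AMT5: bill amount in May 2005
BILL_AMT6: bill amount in April 2005
PAY_AMT1: amount of the previous payment in September 2005
PAY_AMT2: amount of the previous payment in August 2005
PAY_AMT3: amount of the previous payment in July 2005
PAY_AMT4: amount of the previous payment in June 2005
PAY_AMT5: amount of the previous payment in May 2005
PAY_AMT6: amount of the previous payment in April 2005
default.payment.next.month: the target variable, whether the client defaults on next month's payment (1-yes, default; 0-no, no default)

Notes:

If PAY_AMT is below the bank's minimum required payment, the client is treated as in default;
If PAY_AMT is greater than the previous month's bill amount BILL_AMT, the payment is treated as made on time;
If PAY_AMT is greater than the minimum payment but below the previous month's bill amount, the payment is treated as delayed.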
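The three rules above can be sketched as a small function. This is only an illustration: the dataset has no minimum-payment column, so `min_payment` is a hypothetical input, and the notes do not say what happens when a payment exactly equals a threshold, so the sketch assumes that paying the full bill counts as on time.

```python
def payment_status(pay_amt: float, bill_amt: float, min_payment: float) -> str:
    """Classify one month's payment using the three rules listed above.

    min_payment is a hypothetical input: the dataset has no such column.
    Boundary cases are an assumption: paying the full bill counts as on time.
    """
    if pay_amt < min_payment:
        return "default"
    if pay_amt >= bill_amt:
        return "on time"
    return "delayed"


print(payment_status(500, 10_000, 1_000))     # default
print(payment_status(10_000, 10_000, 1_000))  # on time
print(payment_status(3_000, 10_000, 1_000))   # delayed
```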

3. Descriptive statistics (only part of the fields shown)

In 7:

df.describe().T  # many fields, so the transposed view is easier to read

4. Overall field information

In 8:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          30000 non-null  int64  
 1   LIMIT_BAL                   30000 non-null  float64
 2   SEX                         30000 non-null  int64  
 3   EDUCATION                   30000 non-null  int64  
 4   MARRIAGE                    30000 non-null  int64  
 5   AGE                         30000 non-null  int64  
 6   PAY_0                       30000 non-null  int64  
 7   PAY_2                       30000 non-null  int64  
 8   PAY_3                       30000 non-null  int64  
 9   PAY_4                       30000 non-null  int64  
 10  PAY_5                       30000 non-null  int64  
 11  PAY_6                       30000 non-null  int64  
 12  BILL_AMT1                   30000 non-null  float64
 13  BILL_AMT2                   30000 non-null  float64
 14  BILL_AMT3                   30000 non-null  float64
 15  BILL_AMT4                   30000 non-null  float64
 16  BILL_AMT5                   30000 non-null  float64
 17  BILL_AMT6                   30000 non-null  float64
 18  PAY_AMT1                    30000 non-null  float64
 19  PAY_AMT2                    30000 non-null  float64
 20  PAY_AMT3                    30000 non-null  float64
 21  PAY_AMT4                    30000 non-null  float64
 22  PAY_AMT5                    30000 non-null  float64
 23  PAY_AMT6                    30000 non-null  float64
 24  default.payment.next.month  30000 non-null  int64  
dtypes: float64(13), int64(12)
memory usage: 5.7 MB

For convenience, rename the original default.payment.next.month field to Label:

In 9:

df.rename(columns={"default.payment.next.month":"Label"},inplace=True)

4 Missing values

4.1 Missing value counts

Count the missing values in each field:

In 10:

df.isnull().sum().sort_values(ascending=False)

Out10:

ID           0
BILL_AMT2    0
PAY_AMT6     0
PAY_AMT5     0
PAY_AMT4     0
PAY_AMT3     0
PAY_AMT2     0
PAY_AMT1     0
BILL_AMT6    0
BILL_AMT5    0
BILL_AMT4    0
BILL_AMT3    0
BILL_AMT1    0
LIMIT_BAL    0
PAY_6        0
PAY_5        0
PAY_4        0
PAY_3        0
PAY_2        0
PAY_0        0
AGE          0
MARRIAGE     0
EDUCATION    0
SEX          0
Label        0
dtype: int64

In 11:

# number of missing values
total = df.isnull().sum().sort_values(ascending=False)

In 12:

# percentage of missing values
percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False) 

percent

Out12:

ID           0.0
BILL_AMT2    0.0
PAY_AMT6     0.0
PAY_AMT5     0.0
PAY_AMT4     0.0
PAY_AMT3     0.0
PAY_AMT2     0.0
PAY_AMT1     0.0
BILL_AMT6    0.0
BILL_AMT5    0.0
BILL_AMT4    0.0
BILL_AMT3    0.0
BILL_AMT1    0.0
LIMIT_BAL    0.0
PAY_6        0.0
PAY_5        0.0
PAY_4        0.0
PAY_3        0.0
PAY_2        0.0
PAY_0        0.0
AGE          0.0
MARRIAGE     0.0
EDUCATION    0.0
SEX          0.0
Label        0.0
dtype: float64

Combine the counts and percentages into one complete missing-value summary:

In 13:

pd.concat([total, percent],axis=1,keys=["Total","Percent"]).T
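The same summary can be wrapped in a reusable helper. The sketch below runs on a toy frame with made-up values (the credit data has no gaps, so it would only show zeros); `missing_summary` is a hypothetical helper name, not something from the article.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most-missing first."""
    total = frame.isnull().sum()
    percent = frame.isnull().mean() * 100  # mean of booleans = fraction missing
    summary = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    return summary.sort_values("Total", ascending=False)


# Toy frame (hypothetical values) with a few gaps
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [1, 2, 3, 4]})
summary = missing_summary(toy)
print(summary)
```

Using `isnull().mean()` gives the same ratio as dividing the missing count by the row count.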

4.2 Visualizing missing values

In 14:

ms.bar(df,color="blue")                                                     

plt.show()

Rotating the axis labels:

In 15:

# ms.matrix(df, labels=True,label_rotation=45)
# plt.show()

Next, explore the individual fields in more detail:

In 16:

df.columns

Out16:

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'Label'],
      dtype='object')

The ID field carries no information for modeling, so drop it:

In 17:

df.drop("ID",inplace=True,axis=1)

5 Summary statistics

5.1 Personal Information

Look at the clients' personal information, such as credit limit, education, marital status, and age:

In 18:

df[['LIMIT_BAL', 'EDUCATION', 'MARRIAGE', 'AGE']].describe()

Out18:

	LIMIT_BAL	EDUCATION	MARRIAGE	AGE
count	30000.000000	30000.000000	30000.000000	30000.000000
mean	167484.322667	1.853133	1.551867	35.485500
std	129747.661567	0.790349	0.521970	9.217904
min	10000.000000	0.000000	0.000000	21.000000
25%	50000.000000	1.000000	1.000000	28.000000
50%	140000.000000	2.000000	2.000000	34.000000
75%	240000.000000	2.000000	2.000000	41.000000
max	1000000.000000	6.000000	3.000000	79.000000

In 19:

df["EDUCATION"].value_counts().sort_values(ascending=False)

Out19:

EDUCATION
2    14030
1    10585
3     4917
5      280
4      123
6       51
0       14
Name: count, dtype: int64

The most common education level is university (EDUCATION=2).

In 20:

df["MARRIAGE"].value_counts().sort_values(ascending=False)

Out20:

MARRIAGE
2    15964
1    13659
3      323
0       54
Name: count, dtype: int64

The most common marital status is MARRIAGE=2, i.e. single clients.

5.2 LIMIT_BAL

Distribution of LIMIT_BAL

In 21:

df["LIMIT_BAL"].value_counts().sort_values(ascending=False)

Out21:

LIMIT_BAL
50000.0      3365
20000.0      1976
30000.0      1610
80000.0      1567
200000.0     1528
             ... 
800000.0        2
1000000.0       1
327680.0        1
760000.0        1
690000.0        1
Name: count, Length: 81, dtype: int64

The most frequent credit limit is 50,000.

In 22:

plt.figure(figsize = (14,6))
plt.title('Density Plot of LIMIT_BAL')

sns.set_color_codes("pastel")
sns.histplot(df['LIMIT_BAL'], kde=True, bins=200, stat="density")  # distplot is deprecated in recent seaborn

plt.show()

5.3 PAY_0-PAY_6

Repayment status for each month:

In 23:

df[["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]].describe()

Out23:

	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6
count	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000	30000.000000
mean	-0.016700	-0.133767	-0.166200	-0.220667	-0.266200	-0.291100
std	1.123802	1.197186	1.196868	1.169139	1.133187	1.149988
min	-2.000000	-2.000000	-2.000000	-2.000000	-2.000000	-2.000000
25%	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000	-1.000000
50%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
max	8.000000	8.000000	8.000000	8.000000	8.000000	8.000000

Comparison of the repayment statuses:

In 24:

repay = df[['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'Label']]

repay = pd.melt(repay, 
                id_vars="Label",
                var_name="Payment Status",
                value_name="Delay(Month)"
               )
repay.head()

Out24:

	Label	Payment Status	Delay(Month)
0	1	PAY_0	2
1	1	PAY_0	-1
2	0	PAY_0	0
3	0	PAY_0	0
4	0	PAY_0	-1

In 25:

fig = px.box(repay, x="Payment Status", y="Delay(Month)",color="Label")

fig.show()

5.4 BILL_AMT1-BILL_AMT6

Bill amount for each month

In 26:

df[['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']].describe()

Out26:

Comparison between defaulting and non-defaulting clients:

In 27:

df.columns

Out27:

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'Label'],
      dtype='object')

In 28:

BILL_AMTS = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']

plt.figure(figsize=(12,6))

for i, col in enumerate(BILL_AMTS):
    plt.subplot(2,3,i+1)
    sns.kdeplot(df.loc[(df["Label"] == 0),col], label="NO DEFAULT", color="red", fill=True)  # fill replaces the deprecated shade
    sns.kdeplot(df.loc[(df["Label"] == 1),col], label="DEFAULT", color="blue", fill=True)
    
    plt.xlim(-40000, 200000)
    plt.ylabel("")
    plt.xlabel(col, fontsize=12)
    plt.legend()
    plt.tight_layout()
    
plt.show()

5.5 PAY_AMT1-PAY_AMT6

Previous payment amount for each month

In 29:

df[['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']].describe()

Out29:

	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6
count	30000.000000	3.000000e+04	30000.00000	30000.000000	30000.000000	30000.000000
mean	5663.580500	5.921163e+03	5225.68150	4826.076867	4799.387633	5215.502567
std	16563.280354	2.304087e+04	17606.96147	15666.159744	15278.305679	17777.465775
min	0.000000	0.000000e+00	0.00000	0.000000	0.000000	0.000000
25%	1000.000000	8.330000e+02	390.00000	296.000000	252.500000	117.750000
50%	2100.000000	2.009000e+03	1800.00000	1500.000000	1500.000000	1500.000000
75%	5006.000000	5.000000e+03	4505.00000	4013.250000	4031.500000	4000.000000
max	873552.000000	1.684259e+06	896040.00000	621000.000000	426529.000000	528666.000000

In 30:

PAY_AMTS = ['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

plt.figure(figsize=(12,6))

for i, col in enumerate(PAY_AMTS):
    plt.subplot(2,3,i+1)
    sns.kdeplot(df.loc[(df["Label"] == 0),col], label="NO DEFAULT", color="red", fill=True)  # fill replaces the deprecated shade
    sns.kdeplot(df.loc[(df["Label"] == 1),col], label="DEFAULT", color="blue", fill=True)
    
    plt.xlim(-10000, 70000)
    plt.ylabel("")
    plt.xlabel(col, fontsize=12)
    plt.legend()
    plt.tight_layout()
    
plt.show()

6 Label

Compare the number of clients who defaulted with those who did not (default.payment.next.month was renamed to Label):

In 31:

df["Label"].value_counts()

Out31:

Label
0    23364
1     6636
Name: count, dtype: int64

In 32:

label = df["Label"].value_counts()
df_label = pd.DataFrame(label).reset_index()  

df_label

Out32:

	Label	count
0	0	23364
1	1	6636

In 33:

# plt.figure(figsize = (6,6))
# plt.title('Not Default = 0 & Default = 1')         
# sns.set_color_codes("pastel")

# sns.barplot(x = 'Label', y="count", data=df_label) 
# locs, labels = plt.xticks() 
# plt.show()

In 34:

plt.figure(figsize = (5,5))
graph = sns.countplot(x="Label", data=df, palette=["red","blue"])

# annotate each bar with the share of its class
for p in graph.patches:
    h = p.get_height()  # bar height = number of clients in the class
    percentage = round(100 * h / len(df), 2)
    graph.text(p.get_x() + p.get_width() / 2., h - 100, f"{percentage} %", ha="center")
    
plt.title("class distribution")
plt.xticks([0,1], ["Non-Default","Default"])
plt.xlabel("Default Payment Next Month",fontsize=12)
plt.ylabel("Number of Clients")

plt.show()

The two classes are clearly imbalanced.
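To put a number on the imbalance, the class counts reported above can be turned into a default rate and a negative-to-positive ratio. This is a small arithmetic sketch; the article itself trains the model without any reweighting.

```python
# Class counts as reported by df["Label"].value_counts() above
n_negative = 23364  # Label = 0, no default
n_positive = 6636   # Label = 1, default

default_rate = n_positive / (n_negative + n_positive)
imbalance_ratio = n_negative / n_positive

print(f"default rate: {default_rate:.4f}")               # 0.2212
print(f"negatives per positive: {imbalance_ratio:.2f}")  # 3.52
```

A ratio of this kind is what is commonly passed to the `scale_pos_weight` parameter of `lgb.LGBMClassifier` when one wants to weight the minority class more heavily.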

In 36:

value_counts = df['Label'].value_counts()  

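# share of each class
percentages = value_counts / len(df)
# bar chart with matplotlib
plt.bar(value_counts.index, value_counts.values)    

# add the percentage label on top of each bar (y is the bar height, not the percentage)
for x, count, pct in zip(value_counts.index, value_counts.values, percentages.values):
    plt.text(x, count, f'{pct*100:.2f}%', ha='center', va="bottom")
    
# axis labels and title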
plt.title("Class Distribution")
plt.xticks([0,1], ["Non-Default","Default"])
plt.xlabel("Default Payment Next Month",fontsize=12)
plt.ylabel("Number of Clients")

plt.show()

7 Correlation analysis

7.1 Correlation heatmap

In 37:

numeric = ['LIMIT_BAL','AGE','PAY_0','PAY_2',
           'PAY_3','PAY_4','PAY_5','PAY_6',
           'BILL_AMT1','BILL_AMT2','BILL_AMT3',
           'BILL_AMT4','BILL_AMT5','BILL_AMT6']  # numeric fields used for the correlation
numeric

Out37:

['LIMIT_BAL',
 'AGE',
 'PAY_0',
 'PAY_2',
 'PAY_3',
 'PAY_4',
 'PAY_5',
 'PAY_6',
 'BILL_AMT1',
 'BILL_AMT2',
 'BILL_AMT3',
 'BILL_AMT4',
 'BILL_AMT5',
 'BILL_AMT6']

In 38:

corr = df[numeric].corr()
corr.head()

Out38:
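The cell that draws the heatmap is not shown here. As a rough sketch of how a correlation matrix like `corr` is commonly rendered with `sns.heatmap`, the block below uses a synthetic stand-in for `df[numeric]`; the data, the `demo` and `corr_demo` names, and the choice to mask the upper triangle are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for df[numeric]: two correlated columns and one unrelated one
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "BILL_AMT1": base,
    "BILL_AMT2": base + rng.normal(scale=0.3, size=500),
    "AGE": rng.normal(size=500),
})

corr_demo = demo.corr()

# Hide the upper triangle, which only mirrors the lower one
mask = np.triu(np.ones_like(corr_demo, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr_demo, mask=mask, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap")
fig.tight_layout()
```

In a notebook, `plt.show()` displays the figure; with the real data, `corr_demo` would simply be replaced by `corr`.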

7.2 Pairwise relationships

In 40:

plt.figure(figsize=(12,10))

pair_plot = sns.pairplot(df[['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6','Label']], 
                         hue='Label',
                         diag_kind='kde', 
                         corner=True)

pair_plot._legend.remove()

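8 Normality check with QQ plots

To check whether the data follows a Gaussian distribution, we use a graphical method, the quantile-quantile (QQ) plot, for a qualitative assessment.

In a QQ plot, the ordered values of a variable are plotted against the corresponding theoretical quantiles of a normal distribution. If the variable is normally distributed, the points fall along the 45-degree diagonal.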
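The visual check can be complemented by a number. When called without a `plot` argument, `scipy.stats.probplot` returns the QQ points together with a least-squares fit, including the correlation coefficient `r` between sample and theoretical quantiles. The sketch below compares a roughly normal sample with a heavily skewed one; both samples are synthetic and their parameters are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=35, scale=9, size=2000)       # roughly bell-shaped
skewed_sample = rng.lognormal(mean=8, sigma=1.2, size=2000)  # heavy right tail

# probplot returns ((theoretical, ordered), (slope, intercept, r))
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"r (normal sample): {r_normal:.4f}")
print(f"r (skewed sample): {r_skewed:.4f}")
```

An `r` close to 1 means the points hug the reference line; a clearly lower value signals the kind of curvature that skewed fields such as amounts typically show.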

In 41:

sns.set_color_codes('pastel')  # set the color codes
fig, axs = plt.subplots(5, 3, figsize=(18,18))  # figure size and subplot grid

numeric = ['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5',
           'BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']

i, j = 0, 0
for f in numeric:
    if j == 3:
        j = 0
        i = i + 1
    stats.probplot(df[f],  # data to plot: all values of one field
                   dist='norm', # compare against the normal distribution
                   sparams=(df[f].mean(), df[f].std()), 
                   plot=axs[i,j])  # target subplot
    
    axs[i,j].get_lines()[0].set_marker('.') 
    
    axs[i,j].grid() 
    axs[i,j].get_lines()[1].set_linewidth(3.0)
    j = j+1

fig.tight_layout()
axs[4,2].set_visible(False)
plt.show()

9 Data preprocessing

9.1 Categorical data

Handling the categorical fields:

In 42:

df["EDUCATION"].value_counts()

Out42:

EDUCATION
2    14030
1    10585
3     4917
5      280
4      123
6       51
0       14
Name: count, dtype: int64

In 43:

df["GRAD_SCHOOL"] = (df["EDUCATION"] == 1).astype("category")
df["UNIVERSITY"] = (df["EDUCATION"] == 2).astype("category")
df["HIGH_SCHOOL"] = (df["EDUCATION"] == 3).astype("category")  # 3 = high school

df.drop("EDUCATION",axis=1,inplace=True)

In 44:

df['MALE'] = (df['SEX'] == 1).astype('category')
df.drop('SEX', axis=1, inplace=True)

In 45:

df['MARRIED'] = (df['MARRIAGE'] == 1).astype('category')
df.drop('MARRIAGE', axis=1, inplace=True)
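The indicator columns can also be generated from a single code-to-name mapping, which keeps each code next to its label. This is a sketch on a toy frame with made-up rows; the mapping follows the field description in section 3 (1-graduate school, 2-university, 3-high school), and it uses integer 0/1 indicators rather than the category dtype used in the article.

```python
import pandas as pd

# Toy sample covering documented and undocumented EDUCATION codes
toy = pd.DataFrame({"EDUCATION": [1, 2, 3, 4, 0, 2]})

# indicator column -> EDUCATION code, following the field description
levels = {"GRAD_SCHOOL": 1, "UNIVERSITY": 2, "HIGH_SCHOOL": 3}
for name, code in levels.items():
    toy[name] = (toy["EDUCATION"] == code).astype(int)

print(toy)
# Rows with codes 0 and 4 have all three indicators equal to 0
```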

9.2 Train/test split

In 46:

# separate features and target

y = df['Label']
X = df.drop('Label', axis=1, inplace=False)

Split the data with stratification on the class proportions in y:

In 47:

# split the data

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=24, stratify=y)
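What `stratify=y` buys can be seen on synthetic labels that have the same 22.12% positive rate as the dataset. This is a sketch with a placeholder feature column; only the label proportions matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same 22.12% positive rate as the dataset
y_demo = np.array([1] * 2212 + [0] * 7788)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=24, stratify=y_demo  # default test_size is 0.25
)

print(len(y_tr), len(y_te))      # 7500 2500
print(y_tr.mean(), y_te.mean())  # both close to 0.2212
```

Without stratification the two rates would differ by random sampling noise, which matters more as the minority class gets smaller.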

9.3 Feature normalization/standardization

Min-max normalization:

In 48:

mm = MinMaxScaler()

X_train_norm = X_train_raw.copy()
X_test_norm = X_test_raw.copy()

In 49:

# LIMIT_BAL + AGE

X_train_norm['LIMIT_BAL'] = mm.fit_transform(X_train_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_test_norm['LIMIT_BAL'] = mm.transform(X_test_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_train_norm['AGE'] = mm.fit_transform(X_train_raw['AGE'].values.reshape(-1, 1))
X_test_norm['AGE'] = mm.transform(X_test_raw['AGE'].values.reshape(-1, 1))

In 50:

pay_list = ["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]

for pay in pay_list:
    X_train_norm[pay] = mm.fit_transform(X_train_raw[pay].values.reshape(-1, 1))
    X_test_norm[pay] = mm.transform(X_test_raw[pay].values.reshape(-1, 1))

In 51:

for i in range(1,7):
    X_train_norm['BILL_AMT' + str(i)] = mm.fit_transform(X_train_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
    X_test_norm['BILL_AMT' + str(i)] = mm.transform(X_test_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
    X_train_norm['PAY_AMT' + str(i)] = mm.fit_transform(X_train_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
    X_test_norm['PAY_AMT' + str(i)] = mm.transform(X_test_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
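Because `MinMaxScaler` scales each column independently, the per-column loops above can be collapsed into one `fit_transform` call over a list of columns. The sketch below shows this on toy frames with made-up values; the key point it keeps is that the scaler is fitted on the training data only and then reused for the test data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy train/test frames (hypothetical values)
train = pd.DataFrame({"LIMIT_BAL": [10000.0, 50000.0, 200000.0],
                      "AGE": [21.0, 35.0, 60.0]})
test = pd.DataFrame({"LIMIT_BAL": [500000.0], "AGE": [30.0]})

cols = ["LIMIT_BAL", "AGE"]
scaler = MinMaxScaler()

train_norm = train.copy()
test_norm = test.copy()
# fit on the training data only, then reuse the same min/max for the test data
train_norm[cols] = scaler.fit_transform(train[cols])
test_norm[cols] = scaler.transform(test[cols])

print(train_norm)
print(test_norm)  # LIMIT_BAL exceeds 1 because 500000 is above the training maximum
```

Test values outside the training range fall outside [0, 1], which is expected behavior and not an error.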

Standardization:

In 52:

ss = StandardScaler()
X_train_std = X_train_raw.copy()
X_test_std = X_test_raw.copy()

X_train_std['LIMIT_BAL'] = ss.fit_transform(X_train_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_test_std['LIMIT_BAL'] = ss.transform(X_test_raw['LIMIT_BAL'].values.reshape(-1, 1))
X_train_std['AGE'] = ss.fit_transform(X_train_raw['AGE'].values.reshape(-1, 1))
X_test_std['AGE'] = ss.transform(X_test_raw['AGE'].values.reshape(-1, 1))

In 53:

pay_list = ["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6"]

for pay in pay_list:
    X_train_std[pay] = ss.fit_transform(X_train_raw[pay].values.reshape(-1, 1))  # use the StandardScaler here
    X_test_std[pay] = ss.transform(X_test_raw[pay].values.reshape(-1, 1))

In 54:

for i in range(1,7):
    X_train_std['BILL_AMT' + str(i)] = ss.fit_transform(X_train_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
    X_test_std['BILL_AMT' + str(i)] = ss.transform(X_test_raw['BILL_AMT' + str(i)].values.reshape(-1, 1))
    X_train_std['PAY_AMT' + str(i)] = ss.fit_transform(X_train_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))
    X_test_std['PAY_AMT' + str(i)] = ss.transform(X_test_raw['PAY_AMT' + str(i)].values.reshape(-1, 1))

Plot the distributions of the scaled features:

In 55:

sns.set_color_codes('deep')
numeric = ['LIMIT_BAL','AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5',
           'BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']

fig, axs = plt.subplots(1, 2, figsize=(24,6))

sns.boxplot(data=X_train_norm[numeric], ax=axs[0])  
axs[0].set_title('Boxplot of normalized numeric features')
axs[0].set_xticklabels(labels=numeric, rotation=25)
axs[0].set_xlabel(' ')

sns.boxplot(data=X_train_std[numeric], ax=axs[1])
axs[1].set_title('Boxplot of standardized numeric features')
axs[1].set_xticklabels(labels=numeric, rotation=25)
axs[1].set_xlabel(' ')

fig.tight_layout()
plt.show()

9.4 Dimensionality reduction

In 56:

pc = len(X_train_norm.columns.values) # 25
pca = PCA(n_components=pc)  # number of principal components
pca.fit(X_train_norm)

sns.reset_orig()
sns.set_color_codes('pastel') # color codes for plotting
plt.figure(figsize = (8,4)) # figure size
plt.grid()  # show the grid
plt.title('Explained Variance of Principal Components') # title
plt.plot(pca.explained_variance_ratio_, marker='o')  # explained variance ratio of each component
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o')  # cumulative explained variance

plt.legend(["Individual Explained Variance", "Cumulative Explained Variance"])  # legend
plt.xlabel('Principal Component Indexes')  # axis labels
plt.ylabel('Explained Variance Ratio')  
plt.tight_layout()  # tighter layout
plt.axvline(12, 0, ls='--')  # dashed vertical line at x=12
plt.show()  # display the figure

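What each part of the code does:

pc = len(X_train_norm.columns.values) # 25: counts the features in the training set; the result here is 25.
pca = PCA(n_components=pc): creates a PCA object with pc (here 25) principal components.
pca.fit(X_train_norm): fits the PCA on the training set X_train_norm.
sns.reset_orig() and sns.set_color_codes('pastel'): seaborn style settings. reset_orig() restores matplotlib's original rc parameters, and set_color_codes('pastel') maps the shorthand color codes (such as "b" or "r") to the pastel palette.
plt.figure(figsize = (8,4)): creates a new figure of size 8x4.
plt.grid(): shows the grid on the figure.
plt.title('Explained Variance of Principal Components'): sets the figure title.
plt.plot(pca.explained_variance_ratio_, marker='o'): plots the explained variance ratio of each individual principal component.
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o'): plots the cumulative explained variance ratio.
plt.legend(["Individual Explained Variance", "Cumulative Explained Variance"]): adds a legend for the two curves.
plt.xlabel('Principal Component Indexes'): sets the x-axis label.
plt.ylabel('Explained Variance Ratio'): sets the y-axis label.
plt.tight_layout(): adjusts the layout automatically so that it is more compact.
plt.axvline(12, 0, ls='--'): draws a dashed vertical line at x=12 spanning the full height of the axes (ymin=0, ymax defaults to 1). It presumably marks the cutoff for the number of components to keep (12 are kept in section 9.4.2; note that the x-axis here is a zero-based index).
plt.show(): displays the figure.

PCA orders the principal components by explained variance, from largest to smallest; apart from this ranking, the component index carries no meaning.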

9.4.1 Cumulative explained variance

In 57:

cumsum = np.cumsum(pca.explained_variance_ratio_)  # cumulative explained variance
cumsum

Out57:

array([0.44924877, 0.6321187 , 0.8046163 , 0.87590932, 0.92253799,
       0.95438576, 0.96762706, 0.97773098, 0.9842774 , 0.98824928,
       0.99088299, 0.99280785, 0.99444757, 0.99576128, 0.99690533,
       0.99781622, 0.99844676, 0.99890236, 0.99924315, 0.99955744,
       0.9997182 , 0.99983861, 0.99992993, 1.        , 1.        ])
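One way to read this array is to ask how many leading components are needed to reach a given share of the variance. The sketch below copies the first 15 values from the output above; `components_needed` is a hypothetical helper, and the thresholds are arbitrary examples.

```python
import numpy as np

# Cumulative explained variance copied from the output above (first 15 values)
cumsum = np.array([0.44924877, 0.6321187, 0.8046163, 0.87590932, 0.92253799,
                   0.95438576, 0.96762706, 0.97773098, 0.9842774, 0.98824928,
                   0.99088299, 0.99280785, 0.99444757, 0.99576128, 0.99690533])


def components_needed(cumulative: np.ndarray, threshold: float) -> int:
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    # cumulative is non-decreasing, so searchsorted finds the first index >= threshold
    return int(np.searchsorted(cumulative, threshold, side="left")) + 1


for t in (0.90, 0.95, 0.99):
    print(f"{t:.0%} of variance: {components_needed(cumsum, t)} components")
```

By this reading, 11 components already pass 99%, and the first 12 components together explain about 99.28% of the variance.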

In 58:

indexes = ['PC' + str(i) for i in range(1, pc+1)]

cumsum_df = pd.DataFrame(data=cumsum, index=indexes, columns=['var1'])

cumsum_df.head()

Out58:

	var1
PC1	0.449249
PC2	0.632119
PC3	0.804616
PC4	0.875909
PC5	0.922538

In 59:

# round to 4 decimal places
cumsum_df['var2'] = pd.Series([round(val, 4) for val in cumsum_df['var1']], 
                              index = cumsum_df.index)
# format as a percentage
cumsum_df['Cumulative Explained Variance'] = pd.Series(["{0:.2f}%".format(val * 100) for val in cumsum_df['var2']], 
                                                       index = cumsum_df.index)

cumsum_df.head()

Out59:

In 60:

cumsum_df = cumsum_df.drop(['var1','var2'], axis=1, inplace=False)
cumsum_df.T.iloc[:,:15]

9.4.2 Keeping 12 principal components

In 61:

pc = 12
pca = PCA(n_components=pc)
pca.fit(X_train_norm)

X_train = pd.DataFrame(pca.transform(X_train_norm))
X_test = pd.DataFrame(pca.transform(X_test_norm))

# set the column names
X_train.columns = ['PC' + str(i) for i in range(1, pc+1)]
X_test.columns = ['PC' + str(i) for i in range(1, pc+1)]

X_train.head()

Out61:

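10 Model evaluation

Cross-validation

k-fold cross-validation splits the data into k folds. In each round, k-1 folds are used for training and the remaining fold for validation, so that every fold serves as the validation set exactly once.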
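A minimal sketch of stratified 5-fold cross-validation with scikit-learn follows. It runs on synthetic data with roughly the same class balance as the credit data, and it swaps in `LogisticRegression` as a stand-in classifier to keep the example light; the F1 scoring choice is also an assumption, not something the article prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (about 22% positives), 12 features like the PCA output
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=12, weights=[0.78, 0.22], random_state=24
)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=24)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         cv=cv, scoring="f1")

print(scores)         # one F1 score per fold
print(scores.mean())  # average over the 5 folds
```

Any estimator with the scikit-learn interface, including `lgb.LGBMClassifier()`, could be passed in place of the stand-in model.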

Evaluation metrics for classification models

1. Confusion matrix

$$\begin{array}{ccc}
& \text { Predicted Negative } & \text { Predicted Positive } \\
\hline \text { Actual Negative } & \text { TN } & \text { FP } \\
\text { Actual Positive } & \text { FN } & \text { TP }
\end{array}$$

2. Accuracy

$$\text { Accuracy }=\frac{T P+T N}{T P+F P+T N+F N}$$

3. Precision

$$\text { Precision, } p=\frac{T P}{T P+F P}$$

4. Recall

$$\text { Recall, } r=\frac{T P}{T P+F N}$$

5. F1 score

$$\text { F1 score }=\frac{2}{\frac{1}{r}+\frac{1}{p}}=\frac{2 r p}{r+p}$$
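The four formulas can be checked by hand on a small confusion matrix. The counts below are hypothetical, chosen only to make the arithmetic easy to follow; `classification_metrics` is an illustrative helper, not part of scikit-learn.

```python
def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Accuracy, precision, recall and F1 computed from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only for illustration
metrics = classification_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 85/100, precision 35/45, recall 35/40, and F1 works out to 70/85.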

11 Building a binary classifier with LightGBM

In 62:

# train the model
lgb_clf = lgb.LGBMClassifier()
lgb_clf.fit(X_train, y_train)
[LightGBM] [Info] Number of positive: 4977, number of negative: 17523
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000619 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3060
[LightGBM] [Info] Number of data points in the train set: 22500, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.221200 -> initscore=-1.258687
[LightGBM] [Info] Start training from score -1.258687

Out62:

LGBMClassifier

LGBMClassifier()

In 63:

# predict on the test set

y_pred = lgb_clf.predict(X_test)
y_pred

Out63:

array([1, 0, 0, ..., 0, 0, 0], dtype=int64)

Accuracy of the baseline model:

In 64:

acc = accuracy_score(y_test, y_pred)

print("Model accuracy:",acc)
Model accuracy: 0.8130666666666667

Classification report of the model:

In 65:

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.84      0.94      0.89      5841
           1       0.64      0.36      0.46      1659

    accuracy                           0.81      7500
   macro avg       0.74      0.65      0.67      7500
weighted avg       0.79      0.81      0.79      7500

Confusion matrix of the model:

In 66:

# compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# convert the confusion matrix to a DataFrame
cm_df = pd.DataFrame(cm, index=['Non-Defaulters', 'Defaulters'], columns=['Non-Defaulters', 'Defaulters'])

# draw the confusion matrix as a seaborn heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(cm_df, annot=True, cmap='Blues', fmt='d')
plt.title('Confusion Matrix')
plt.xlabel('Predicted value')
plt.ylabel('True Value')
plt.show()