前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >信贷违约预测,随机森林准确率第一

信贷违约预测,随机森林准确率第一

原创
作者头像
皮大大
发布2023-08-18 10:29:55
4020
发布2023-08-18 10:29:55
举报

大家好,我是Peter~

本文是一个基于kaggle机器学习实战案例:基于机器学习的信贷违约预测实战,采用了多种模型,最终结果随机森林模型排名第一。

主要内容包含:

  • 数据基本信息与EDA
  • 数据预处理与特征工程
  • 多种模型预测及指标对比

导入库

In 1:

代码语言:txt
复制
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import missingno as mso
import seaborn as sns
import warnings
import os
import scipy

from scipy import stats
from scipy.stats import pearsonr
from scipy.stats import ttest_ind
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import CategoricalNB
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

数据基本信息

导入数据

In 2:

代码语言:txt
复制
df = pd.read_csv("data.csv")
df.head()

Out2:

基本信息

In 3:

代码语言:txt
复制
# 整体的数据量

df.shape

Out3:

代码语言:txt
复制
(614, 13)

In 4:

代码语言:txt
复制
# 全部字段

df.columns

Out4:

代码语言:txt
复制
Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

可以看到主要是贷款人ID、性别、是否结婚、工作部门、受教育程度等字段信息

In 5:

代码语言:txt
复制
# 查看字段类型

df.dtypes

Out5:

代码语言:txt
复制
Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

In 6:

代码语言:txt
复制
# 查看描述统计信息

df.describe()  

Out6:

ApplicantIncome

CoapplicantIncome

LoanAmount

Loan_Amount_Term

Credit_History

count

614.000000

614.000000

592.000000

600.00000

564.000000

mean

5403.459283

1621.245798

146.412162

342.00000

0.842199

std

6109.041673

2926.248369

85.587325

65.12041

0.364878

min

150.000000

0.000000

9.000000

12.00000

0.000000

25%

2877.500000

0.000000

100.000000

360.00000

1.000000

50%

3812.500000

1188.500000

128.000000

360.00000

1.000000

75%

5795.000000

2297.250000

168.000000

360.00000

1.000000

max

81000.000000

41667.000000

700.000000

480.00000

1.000000

缺失值情况

In 7:

代码语言:txt
复制
df.isnull().sum()

Out7:

代码语言:txt
复制
Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

可以看到部分字段存在缺失值

In 8:

代码语言:txt
复制
mso.bar(df,color="blue")

plt.show()

后面会针对缺失值进行填充处理。

数据探索EDA

分类型变量

Loan_ID

In 9:

代码语言:txt
复制
df.Loan_ID.value_counts(dropna=False)

Out9:

代码语言:txt
复制
LP001002    1
LP002328    1
LP002305    1
LP002308    1
LP002314    1
           ..
LP001692    1
LP001693    1
LP001698    1
LP001699    1
LP002990    1
Name: Loan_ID, Length: 614, dtype: int64

可以看到每个Loan_ID刚好一条记录

Gender

In 10:

代码语言:txt
复制
df.Gender.value_counts(dropna=False)

Out10:

代码语言:txt
复制
Male      489
Female    112
NaN        13
Name: Gender, dtype: int64

In 11:

代码语言:txt
复制
sns.countplot(x="Gender", data=df, palette="hls")

plt.show()

针对不同性别做缺失值处理:

In 12:

代码语言:txt
复制
countMale = len(df[df.Gender == 'Male'])  # 男性数据
countFemale = len(df[df.Gender == 'Female'])  # 女性数据
countNull = len(df[df.Gender.isnull()])  # 缺失值数量

In 13:

代码语言:txt
复制
print("Percentage of Male: {:.2f}%".format((countMale / (len(df.Gender)) * 100)))
print("Percentage of Female: {:.2f}%".format((countFemale / (len(df.Gender)) * 100)))
print("Missing values percentage: {:.2f}%".format((countNull / (len(df.Gender)) * 100)))
Percentage of Male: 79.64%
Percentage of Female: 18.24%
Missing values percentage: 2.12%
Married

In 14:

代码语言:txt
复制
df.Married.value_counts(dropna=False)

Out14:

代码语言:txt
复制
Yes    398
No     213
NaN      3
Name: Married, dtype: int64

In 15:

代码语言:txt
复制
sns.countplot(x="Married", data=df, palette="Paired")

plt.show()

是否结婚的人群对比:

In 16:

代码语言:txt
复制
countMarried = len(df[df.Married == 'Yes'])
countNotMarried = len(df[df.Married == 'No'])
countNull = len(df[df.Married.isnull()])

In 17:

代码语言:txt
复制
print("Percentage of married: {:.2f}%".format((countMarried / (len(df.Married))*100)))
print("Percentage of Not married applicant: {:.2f}%".format((countNotMarried / (len(df.Married))*100)))
print("Missing values percentage: {:.2f}%".format((countNull / (len(df.Married))*100)))
Percentage of married: 64.82%
Percentage of Not married applicant: 34.69%
Missing values percentage: 0.49%
Education

In 18:

代码语言:txt
复制
df.Education.value_counts(dropna=False)

Out18:

代码语言:txt
复制
Graduate        480
Not Graduate    134
Name: Education, dtype: int64

In 19:

代码语言:txt
复制
sns.countplot(x="Education", data=df, palette="rocket")
plt.show()

不同受教育程度的人群对比:

In 20:

代码语言:txt
复制
countGraduate = len(df[df.Education == 'Graduate'])
countNotGraduate = len(df[df.Education == 'Not Graduate'])
countNull = len(df[df.Education.isnull()])

print("Percentage of graduate applicant: {:.2f}%".format((countGraduate / (len(df.Education))*100)))
print("Percentage of Not graduate applicant: {:.2f}%".format((countNotGraduate / (len(df.Education))*100)))
print("Missing percentage: {:.2f}%".format((countNull / (len(df.Education))*100)))
Percentage of graduate applicant: 78.18%
Percentage of Not graduate applicant: 21.82%
Missing percentage: 0.00%
Self Employed

In 21:

代码语言:txt
复制
df.Self_Employed.value_counts(dropna=False)

Out21:

代码语言:txt
复制
No     500
Yes     82
NaN     32
Name: Self_Employed, dtype: int64

In 22:

代码语言:txt
复制
sns.countplot(x="Self_Employed", data=df, palette="crest")
plt.show()

是否为自聘员工对比:

In 23:

代码语言:txt
复制
countNo = len(df[df.Self_Employed == 'No'])
countYes = len(df[df.Self_Employed == 'Yes'])
countNull = len(df[df.Self_Employed.isnull()])

print("Percentage of Not self employed: {:.2f}%".format((countNo / (len(df.Self_Employed))*100)))
print("Percentage of self employed: {:.2f}%".format((countYes / (len(df.Self_Employed))*100)))
print("Missing values percentage: {:.2f}%".format((countNull / (len(df.Self_Employed))*100)))
Percentage of Not self employed: 81.43%
Percentage of self employed: 13.36%
Missing values percentage: 5.21%
Credit History

In 24:

代码语言:txt
复制
df.Credit_History.value_counts(dropna=False)

Out24:

代码语言:txt
复制
1.0    475
0.0     89
NaN     50
Name: Credit_History, dtype: int64

In 25:

代码语言:txt
复制
sns.countplot(x="Credit_History", data=df, palette="viridis")
plt.show()

是否有信用卡历史的人群对比:

In 26:

代码语言:txt
复制
count1 = len(df[df.Credit_History == 1])
count0 = len(df[df.Credit_History == 0])
countNull = len(df[df.Credit_History.isnull()])

In 27:

代码语言:txt
复制
print("Percentage of Good credit history: {:.2f}%".format((count1 / (len(df.Credit_History))*100)))
print("Percentage of Bad credit history: {:.2f}%".format((count0 / (len(df.Credit_History))*100)))
print("Missing values percentage: {:.2f}%".format((countNull / (len(df.Credit_History))*100)))
Percentage of Good credit history: 77.36%
Percentage of Bad credit history: 14.50%
Missing values percentage: 8.14%
Property Area

In 28:

代码语言:txt
复制
df.Property_Area.value_counts(dropna=False)

Out28:

代码语言:txt
复制
Semiurban    233
Urban        202
Rural        179
Name: Property_Area, dtype: int64

In 29:

代码语言:txt
复制
sns.countplot(x="Property_Area", data=df, palette="cubehelix")
plt.show()

不同地区的人群对比:

In 30:

代码语言:txt
复制
countUrban = len(df[df.Property_Area == 'Urban'])
countRural = len(df[df.Property_Area == 'Rural'])
countSemiurban = len(df[df.Property_Area == 'Semiurban'])
countNull = len(df[df.Property_Area.isnull()])

In 31:

代码语言:txt
复制
print("Percentage of Urban: {:.2f}%".format((countUrban / (len(df.Property_Area))*100)))
print("Percentage of Rural: {:.2f}%".format((countRural / (len(df.Property_Area))*100)))
print("Percentage of Semiurban: {:.2f}%".format((countSemiurban / (len(df.Property_Area))*100)))
print("Missing values percentage: {:.2f}%".format((countNull / (len(df.Property_Area))*100)))
Percentage of Urban: 32.90%
Percentage of Rural: 29.15%
Percentage of Semiurban: 37.95%
Missing values percentage: 0.00%

这个字段在3个不同的取值下分布是均匀的,而且没有缺失值

Loan Status

In 32:

代码语言:txt
复制
df.Loan_Status.value_counts(dropna=False)

Out32:

代码语言:txt
复制
Y    422
N    192
Name: Loan_Status, dtype: int64

In 33:

代码语言:txt
复制
sns.countplot(x="Loan_Status", data=df, palette="YlOrBr")
plt.show()

是否贷款的人群占比对比:

In 34:

代码语言:txt
复制
countY = len(df[df.Loan_Status == 'Y'])
countN = len(df[df.Loan_Status == 'N'])
countNull = len(df[df.Loan_Status.isnull()])

print("Percentage of Approved: {:.2f}%".format((countY / (len(df.Loan_Status))*100)))
print("Percentage of Rejected: {:.2f}%".format((countN / (len(df.Loan_Status))*100)))
print("Missing values percentage: {:.2f}%".format((countNull / (len(df.Loan_Status))*100)))
Percentage of Approved: 68.73%
Percentage of Rejected: 31.27%
Missing values percentage: 0.00%
Loan Amount Term

In 35:

代码语言:txt
复制
df.Loan_Amount_Term.value_counts(dropna=False)

Out35:

代码语言:txt
复制
360.0    512
180.0     44
480.0     15
NaN       14
300.0     13
240.0      4
84.0       4
120.0      3
60.0       2
36.0       2
12.0       1
Name: Loan_Amount_Term, dtype: int64

In 36:

代码语言:txt
复制
sns.countplot(x="Loan_Amount_Term", data=df, palette="rocket")
plt.show()

不同贷款周期的人群对比:

In 37:

代码语言:txt
复制
count12 = len(df[df.Loan_Amount_Term == 12.0])
count36 = len(df[df.Loan_Amount_Term == 36.0])
count60 = len(df[df.Loan_Amount_Term == 60.0])
count84 = len(df[df.Loan_Amount_Term == 84.0])
count120 = len(df[df.Loan_Amount_Term == 120.0])
count180 = len(df[df.Loan_Amount_Term == 180.0])
count240 = len(df[df.Loan_Amount_Term == 240.0])
count300 = len(df[df.Loan_Amount_Term == 300.0])
count360 = len(df[df.Loan_Amount_Term == 360.0])
count480 = len(df[df.Loan_Amount_Term == 480.0])
countNull = len(df[df.Loan_Amount_Term.isnull()])

print("Percentage of 12: {:.2f}%".format((count12 / (len(df.Loan_Amount_Term))*100)))
print("Percentage of 36: {:.2f}%".format((count36 / (len(df.Loan_Amount_Term))*100)))
print("Percentage of 60: {:.2f}%".format((count60 / (len(df.Loan_Amount_Term))*100)))
print("Percentage of 84: {:.2f}%".format((count84 / (len(df.Loan_Amount_Term))*100)))
print("Percentage of 120: {:.2f}%".format((count120 / (len(df.Loan_Amount_Term))*100)))
print("Percentage of 180: {:.2f}%".format((count180 / (len(df.Loan_Amount_Term))*100)))
print("Percentage of 240: {:.2f}%".format((count240 / (len(df.Loan_Amount_Term))*100)))
print("Percentage of 300: {:.2f}%".format((count300 / (len(df.Loan_Amount_Term))*100)))
print("Percentage of 360: {:.2f}%".format((count360 / (len(df.Loan_Amount_Term))*100)))
print("Percentage of 480: {:.2f}%".format((count480 / (len(df.Loan_Amount_Term))*100)))
print("Missing values percentage: {:.2f}%".format((countNull / (len(df.Loan_Amount_Term))*100)))
Percentage of 12: 0.16%
Percentage of 36: 0.33%
Percentage of 60: 0.33%
Percentage of 84: 0.65%
Percentage of 120: 0.49%
Percentage of 180: 7.17%
Percentage of 240: 0.65%
Percentage of 300: 2.12%
Percentage of 360: 83.39%
Percentage of 480: 2.44%
Missing values percentage: 2.28%

数值型变量

描述统计信息

In 38:

代码语言:txt
复制
df.select_dtypes(exclude=["object"]).columns

Out38:

代码语言:txt
复制
Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History'],
      dtype='object')

In 39:

代码语言:txt
复制
df[['ApplicantIncome','CoapplicantIncome','LoanAmount']].describe()

Out39:

ApplicantIncome

CoapplicantIncome

LoanAmount

count

614.000000

614.000000

592.000000

mean

5403.459283

1621.245798

146.412162

std

6109.041673

2926.248369

85.587325

min

150.000000

0.000000

9.000000

25%

2877.500000

0.000000

100.000000

50%

3812.500000

1188.500000

128.000000

75%

5795.000000

2297.250000

168.000000

max

81000.000000

41667.000000

700.000000

字段直方图分布

In 40:

代码语言:txt
复制
sns.set(style="darkgrid")
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

sns.histplot(data=df, x="ApplicantIncome", kde=True, ax=axs[0, 0], color='green')
sns.histplot(data=df, x="CoapplicantIncome", kde=True, ax=axs[0, 1], color='skyblue')
sns.histplot(data=df, x="LoanAmount", kde=True, ax=axs[1, 0], color='orange');

可以看到这3个字段呈现一定的偏态,后面会做数据转换处理。

字段小提琴图分布

In 41:

代码语言:txt
复制
sns.set(style="darkgrid")
fig, axs1 = plt.subplots(2, 2, figsize=(10, 10))

sns.violinplot(data=df, y="ApplicantIncome", ax=axs1[0, 0], color='green')
sns.violinplot(data=df, y="CoapplicantIncome", ax=axs1[0, 1], color='skyblue')
sns.violinplot(data=df, y="LoanAmount", ax=axs1[1, 0], color='orange');

两两特征分布

两个分类型变量

分类型变量主要是基于统计分析查看分布情况:

In 42:

代码语言:txt
复制
pd.crosstab(df.Gender,df.Married).plot(kind="bar", stacked=True, figsize=(5,5), color=['#f64f59','#12c2e9'])

plt.title('Gender vs Married')
plt.xlabel('Gender')
plt.ylabel('Frequency')
plt.xticks(rotation=0)
plt.show()

好坏信用卡历史对比:

In 43:

代码语言:txt
复制
pd.crosstab(df.Self_Employed,df.Credit_History).plot(kind="bar", stacked=True, figsize=(5,5), color=['#544a7d','#ffd452'])
plt.title('Self Employed vs Credit History')
plt.xlabel('Self Employed')
plt.ylabel('Frequency')
plt.legend(["Bad Credit", "Good Credit"])
plt.xticks(rotation=0)
plt.show()

不同地区的人群是否贷款对比:

In 44:

代码语言:txt
复制
pd.crosstab(df.Property_Area,df.Loan_Status).plot(kind="bar", stacked=True, figsize=(5,5), color=['#333333','#dd1818'])
plt.title('Property Area vs Loan Status')
plt.xlabel('Property Area')
plt.ylabel('Frequency')
plt.xticks(rotation=0)
plt.show()

分类型+数值型变量

In 45:

代码语言:txt
复制
sns.boxplot(x="Loan_Status", y="ApplicantIncome", data=df, palette="mako");
代码语言:python
代码运行次数:0
复制
sns.boxplot(x="CoapplicantIncome", y="Loan_Status", data=df, palette="rocket");
代码语言:python
代码运行次数:0
复制
sns.boxplot(x="Loan_Status", y="LoanAmount", data=df, palette="YlOrBr");

两个数值型变量

In 48:

代码语言:txt
复制
df.plot(x='ApplicantIncome', y='CoapplicantIncome', style='o')  
plt.title('Applicant Income - Co Applicant Income')  
plt.xlabel('ApplicantIncome')
plt.ylabel('CoapplicantIncome')  
plt.show()

相关性计算:

In 49:

代码语言:txt
复制
print('Pearson correlation:', df['ApplicantIncome'].corr(df['CoapplicantIncome']))
print('T Test and P value: \n', stats.ttest_ind(df['ApplicantIncome'], df['CoapplicantIncome']))
Pearson correlation: -0.11660458122889966
T Test and P value: 
 Ttest_indResult(statistic=13.835753259915661, pvalue=1.4609839484240346e-40)

相关性分析

In 50:

代码语言:txt
复制
plt.figure(figsize=(10,7))

sns.heatmap(df.corr(), annot=True, cmap='inferno')

plt.show()

特征工程(数据预处理)

删除无效变量Drop Unecessary Variables

In 51:

代码语言:txt
复制
df.drop("Loan_ID",axis=1, inplace=True)  

填充缺失值Data Imputation

In 52:

代码语言:txt
复制
df.isnull().sum()

Out52:

代码语言:txt
复制
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In 53:

代码语言:txt
复制
df.dtypes

Out53:

代码语言:txt
复制
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

In 54:

代码语言:txt
复制
df["Credit_History"].value_counts()

Out54:

代码语言:txt
复制
1.0    475
0.0     89
Name: Credit_History, dtype: int64

In 55:

代码语言:txt
复制
df["Loan_Amount_Term"].value_counts()

Out55:

代码语言:txt
复制
360.0    512
180.0     44
480.0     15
300.0     13
240.0      4
84.0       4
120.0      3
60.0       2
36.0       2
12.0       1
Name: Loan_Amount_Term, dtype: int64
分类型变量

针对分类型变量的缺失值,我们使用众数mode进行填充:

In 56:

代码语言:txt
复制
df['Gender'].fillna(df['Gender'].mode()[0],inplace=True)
df['Married'].fillna(df['Married'].mode()[0],inplace=True)
df['Dependents'].fillna(df['Dependents'].mode()[0],inplace=True)
df['Self_Employed'].fillna(df['Self_Employed'].mode()[0],inplace=True)
# 信用卡历史记录  0-bad credit  1-good credit history 
df['Credit_History'].fillna(df['Credit_History'].mode()[0],inplace=True)
# 还款周期(天)
df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0],inplace=True)
数值型变量

数值型变量的缺失值使用均值mean进行填充

In 57:

代码语言:txt
复制
df['LoanAmount'].fillna(df['LoanAmount'].mean(),inplace=True)  # 贷款金额

独热码One-hot Encoding

In 58:

代码语言:txt
复制
df = pd.get_dummies(df)
df.head()

处理后的全部字段信息:

In 59:

代码语言:txt
复制
df.columns

Out59:

代码语言:txt
复制
Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Gender_Female', 'Gender_Male',
       'Married_No', 'Married_Yes', 'Dependents_0', 'Dependents_1',
       'Dependents_2', 'Dependents_3+', 'Education_Graduate',
       'Education_Not Graduate', 'Self_Employed_No', 'Self_Employed_Yes',
       'Property_Area_Rural', 'Property_Area_Semiurban', 'Property_Area_Urban',
       'Loan_Status_N', 'Loan_Status_Y'],
      dtype='object')

In 60:

代码语言:txt
复制
# 删除部分字段
df = df.drop(['Gender_Female', 'Married_No', 'Education_Not Graduate','Self_Employed_No', 'Loan_Status_N'], axis = 1)

In 61:

代码语言:txt
复制
# 字段重命名

new = {'Gender_Male': 'Gender', 
       'Married_Yes': 'Married', 
       'Education_Graduate': 'Education', 
       'Self_Employed_Yes': 'Self_Employed',
       'Loan_Status_Y': 'Loan_Status'}
       
df.rename(columns=new, inplace=True)

删除离群点remove outliers

以上下4分位数作为临界点:

In 62:

代码语言:txt
复制
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]

偏态分布处理Skewed Distribution Treatment

对数据开平方做数据转换:np.sqrt

In 63:

代码语言:txt
复制
df.ApplicantIncome = np.sqrt(df.ApplicantIncome)
df.CoapplicantIncome = np.sqrt(df.CoapplicantIncome)
df.LoanAmount = np.sqrt(df.LoanAmount)

再次查看数据分布:

In 64:

代码语言:txt
复制
sns.set(style="darkgrid")
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

sns.histplot(data=df, x="ApplicantIncome", kde=True, ax=axs[0, 0], color='green')
sns.histplot(data=df, x="CoapplicantIncome", kde=True, ax=axs[0, 1], color='skyblue')
sns.histplot(data=df, x="LoanAmount", kde=True, ax=axs[1, 0], color='orange');

建模

特征分离

In 65:

代码语言:txt
复制
X = df.drop(["Loan_Status"], axis=1)
y = df["Loan_Status"]

SMOTE上采样

In 66:

代码语言:txt
复制
pd.value_counts(y)  # 采样前

Out66:

代码语言:txt
复制
1    112
0     24
Name: Loan_Status, dtype: int64

In 67:

代码语言:txt
复制
X, y = SMOTE().fit_resample(X, y)

In 68:

代码语言:txt
复制
pd.value_counts(y)  # 采样后

Out68:

代码语言:txt
复制
1    112
0    112
Name: Loan_Status, dtype: int64

In 69:

代码语言:txt
复制
sns.set_theme(style="darkgrid")
sns.countplot(y=y, data=df, palette="coolwarm")
plt.ylabel('Loan Status')
plt.xlabel('Total')
plt.show()

数据标准化Data Normalization

In 70:

代码语言:txt
复制
mm = MinMaxScaler()
X = mm.fit_transform(X)

切分训练集和测试集

In 71:

代码语言:txt
复制
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

模型1-Logistic Regression

In 72:

代码语言:txt
复制
pd.value_counts(y_train)

Out72:

代码语言:txt
复制
1    90
0    89
Name: Loan_Status, dtype: int64

In 73:

代码语言:txt
复制
pd.value_counts(y_test)

Out73:

代码语言:txt
复制
0    23
1    22
Name: Loan_Status, dtype: int64

In 74:

代码语言:txt
复制
LRclassifier = LogisticRegression(solver='saga', max_iter=500, random_state=1)
LRclassifier.fit(X_train, y_train)

# 模型预测
y_pred = LRclassifier.predict(X_test)

In 75:

代码语言:txt
复制
print(classification_report(y_test, y_pred))  # 分类结果报告
              precision    recall  f1-score   support

           0       0.86      0.83      0.84        23
           1       0.83      0.86      0.84        22

    accuracy                           0.84        45
   macro avg       0.84      0.84      0.84        45
weighted avg       0.85      0.84      0.84        45

In 76:

代码语言:txt
复制
print(confusion_matrix(y_test, y_pred))  # 混淆矩阵
[[19  4]
 [ 3 19]]

In 77:

代码语言:txt
复制
LRAcc = accuracy_score(y_pred,y_test)   # 准确率

print('LR accuracy: {:.2f}%'.format(LRAcc * 100))
LR accuracy: 84.44%

模型2-K-Nearest Neighbour(KNN)

新版本报错解决,参考:https://blog.csdn.net/weixin_51723388/article/details/128577782

在使用KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)时,只有在weights='uniform' 时才会用到stats.mode。其中uniform是均等权重,即邻域中的所有点的权重相等,相当于取众数。可将其改为weights='distance'

In 78:

代码语言:txt
复制
score_list = []

for i in range(1,21):
    knn = KNeighborsClassifier(n_neighbors = i,weights='distance')
    knn.fit(X_train, y_train)
    score_list.append(knn.score(X_test, y_test))  # 测试集预测得分
    
plt.plot(range(1,21), score_list)
plt.xticks(np.arange(1,21,1))
plt.xlabel("K value")
plt.ylabel("Score")
plt.show()

KNAcc = max(score_list)
print("KNN best accuracy: {:.2f}%".format(KNAcc*100))
代码语言:txt
复制
KNN best accuracy: 86.67%

模型3-支持向量机Support Vector Machine (SVM)

In 79:

代码语言:txt
复制
svc = SVC(kernel='rbf', max_iter=500)
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.95      0.78      0.86        23
           1       0.81      0.95      0.88        22

    accuracy                           0.87        45
   macro avg       0.88      0.87      0.87        45
weighted avg       0.88      0.87      0.87        45

[[18  5]
 [ 1 21]]

In 80:

代码语言:txt
复制
SVCAcc = accuracy_score(y_pred,y_test)
print('SVC accuracy: {:.2f}%'.format(SVCAcc*100))
SVC accuracy: 86.67%

模型4-高斯朴素贝叶斯Gaussian NB

In 81:

代码语言:txt
复制
NBclassifier2 = GaussianNB()
NBclassifier2.fit(X_train, y_train)

y_pred = NBclassifier2.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.68      0.83      0.75        23
           1       0.76      0.59      0.67        22

    accuracy                           0.71        45
   macro avg       0.72      0.71      0.71        45
weighted avg       0.72      0.71      0.71        45

[[19  4]
 [ 9 13]]

In 82:

代码语言:txt
复制
GNBAcc = accuracy_score(y_pred,y_test)
print('Gaussian Naive Bayes accuracy: {:.2f}%'.format(GNBAcc*100))
Gaussian Naive Bayes accuracy: 71.11%

模型5-决策树Decision Tree

In 83:

代码语言:txt
复制
scoreListDT = []
for i in range(2,21):
    DTclassifier = DecisionTreeClassifier(max_leaf_nodes=i)
    DTclassifier.fit(X_train, y_train)
    scoreListDT.append(DTclassifier.score(X_test, y_test))
    
plt.plot(range(2,21), scoreListDT)
plt.xticks(np.arange(2,21,1))
plt.xlabel("Leaf")
plt.ylabel("Score")
plt.show()
DTAcc = max(scoreListDT)
print("Decision Tree Accuracy: {:.2f}%".format(DTAcc*100))
代码语言:txt
复制
Decision Tree Accuracy: 84.44%

模型6-随机森林Random Forest

In 84:

代码语言:txt
复制
scoreListRF = []
for i in range(2,25):
    RFclassifier = RandomForestClassifier(n_estimators = 1000, 
                                          random_state = 1, 
                                          max_leaf_nodes=i)
    RFclassifier.fit(X_train, y_train)
    scoreListRF.append(RFclassifier.score(X_test, y_test))
    
plt.plot(range(2,25), scoreListRF)
plt.xticks(np.arange(2,25,1))
plt.xlabel("RF Value")
plt.ylabel("Score")
plt.show()
RFAcc = max(scoreListRF)

print("Random Forest Accuracy:  {:.2f}%".format(RFAcc*100))
代码语言:txt
复制
Random Forest Accuracy:  91.11%

模型7-梯度提升树Gradient Boosting

In 85:

代码语言:txt
复制
# 设置参数
params={'n_estimators':[100,200,300,400,500],
      'max_depth':[1,2,3,4,5],
      'subsample':[0.5,1],
      'max_leaf_nodes':[2,5,10,20,30,40,50]}

In 86:

代码语言:txt
复制
# 基于随机搜索查找参数组合

GB = RandomizedSearchCV(GradientBoostingClassifier(), params, cv=20)
GB.fit(X_train, y_train)

Out86:

代码语言:txt
复制
RandomizedSearchCV(cv=20, estimator=GradientBoostingClassifier(),
                   param_distributions={'max_depth': [1, 2, 3, 4, 5],
                                        'max_leaf_nodes': [2, 5, 10, 20, 30, 40,
                                                           50],
                                        'n_estimators': [100, 200, 300, 400,
                                                         500],
                                        'subsample': [0.5, 1]})

In 87:

代码语言:txt
复制
print(GB.best_estimator_)
print(GB.best_score_)
GradientBoostingClassifier(max_depth=4, max_leaf_nodes=10, n_estimators=500,
                           subsample=1)
0.7993055555555555

In 88:

代码语言:txt
复制
print(GB.best_params_)  # 最佳参数组合
{'subsample': 1, 'n_estimators': 500, 'max_leaf_nodes': 10, 'max_depth': 4}

In 89:

代码语言:txt
复制
GB.best_params_["subsample"]

Out89:

代码语言:txt
复制
1

基于查找到的参数再重新建模:

In 90:

代码语言:txt
复制
gbc = GradientBoostingClassifier(subsample=GB.best_params_["subsample"], 
                                 n_estimators=GB.best_params_["n_estimators"], 
                                 max_depth=GB.best_params_["max_depth"], 
                                 max_leaf_nodes=GB.best_params_["max_leaf_nodes"], 
                                )
gbc.fit(X_train, y_train)

y_pred = gbc.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.78      0.91      0.84        23
           1       0.89      0.73      0.80        22

    accuracy                           0.82        45
   macro avg       0.83      0.82      0.82        45
weighted avg       0.83      0.82      0.82        45

[[21  2]
 [ 6 16]]

In 91:

代码语言:txt
复制
GBAcc = accuracy_score(y_pred,y_test)

print('Gradient Boosting accuracy: {:.2f}%'.format(GBAcc*100))
Gradient Boosting accuracy: 82.22%

模型比较

In 92:

代码语言:txt
复制
models = pd.DataFrame({'Model': ['Logistic Regression',
                                 'K Neighbors', 
                                  'Support Vector Machine', 
                                  'Gaussian NB',
                                  'Decision Tree', 
                                  'Random Forest', 
                                 'Gradient Boost'], 
                        'Accuracy': [LRAcc*100, KNAcc*100, 
                                     SVCAcc*100,GNBAcc*100, 
                                     DTAcc*100, RFAcc*100, 
                                     GBAcc*100]})
models.sort_values(by='Accuracy', ascending=False)

Out92:

Model

Accuracy

5

Random Forest

91.111111

1

K Neighbors

86.666667

2

Support Vector Machine

86.666667

0

Logistic Regression

84.444444

4

Decision Tree

84.444444

6

Gradient Boost

82.222222

3

Gaussian NB

71.111111

In :

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 导入库
  • 数据基本信息
    • 导入数据
      • 基本信息
        • 缺失值情况
        • 数据探索EDA
          • 分类型变量
            • Loan_ID
            • Gender
            • Married
            • Education
            • Self Employed
            • Credit History
            • Property Area
            • Loan Status
            • Loan Amount Term
          • 数值型变量
            • 描述统计信息
            • 字段直方图分布
            • 字段小提琴图分布
        • 两两特征分布
          • 两个分类型变量
            • 分类型+数值型变量
              • 两个数值型变量
              • 相关性分析
              • 特征工程(数据预处理)
                • 删除无效变量Drop Unecessary Variables
                  • 填充缺失值Data Imputation
                    • 分类型变量
                    • 数值型变量
                  • 独热码One-hot Encoding
                    • 删除离群点remove outliers
                      • 偏态分布处理Skewed Distribution Treatment
                      • 建模
                        • 特征分离
                          • SMOTE上采样
                            • 数据标准化Data Normalization
                              • 切分训练集和测试集
                                • 模型1-Logistic Regression
                                  • 模型2-K-Nearest Neighbour(KNN)
                                    • 模型3-支持向量机Support Vector Machine (SVM)
                                      • 模型4-高斯朴素贝叶斯Gaussian NB
                                        • 模型5-决策树Decision Tree
                                          • 模型6-随机森林Random Forest
                                            • 模型7-梯度提升树Gradient Boosting
                                            • 模型比较
                                            领券
                                            问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档