An Enterprise-Grade Hands-On Project Based on Automated Machine Learning

数据STUDIO

Published 2023-12-26 15:39:53


Included in the column: 数据STUDIO

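This article is part of a hands-on data mining series. 云朵君 shares a data mining case study that differs from earlier ones: part of the model building and tuning is handled by automated machine learning, which gives a closer look at the convenience and results it brings.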

1. Introduction

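This article is a hands-on data mining case study. It explores in detail data collected from the Taiwan Economic Journal for the years 1999 to 2009, looks at what useful insights emerge during data exploration, and determines which model most accurately predicts whether a company will go bankrupt.

Company bankruptcy is defined according to the business regulations of the Taiwan Stock Exchange.

The modeling uses the automated machine learning library pycaret. pycaret is an open-source, low-code machine learning library written in Python that automates the machine learning workflow. If you want to explore the library and better understand its capabilities, see the official documentation: https://pycaret.gitbook.io/docs/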

Set up the environment and read the data

import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns

bankruptcy_df = pd.read_csv("Bankruptcy.csv")  
# How to obtain the full dataset is described at the end of the article
bankruptcy_df.head()

2. Understanding the data

bankruptcy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6819 entries, 0 to 6818
Data columns (total 96 columns):
 #   Column                                                    Non-Null Count  Dtype  
---  ------                                                    --------------  -----  
 0   Bankrupt?                                                 6819 non-null   int64  
 1    ROA(C) before interest and depreciation before interest  6819 non-null   float64
 2    ROA(A) before interest and % after tax                   6819 non-null   float64
 3    ROA(B) before interest and depreciation after tax        6819 non-null   float64
 4    Operating Gross Margin                                   6819 non-null   float64
 5    Realized Sales Gross Margin                              6819 non-null   float64
 6    Operating Profit Rate                                    6819 non-null   float64
 7    Pre-tax net Interest Rate                                6819 non-null   float64
 8    After-tax net Interest Rate                              6819 non-null   float64
 9    Non-industry income and expenditure/revenue              6819 non-null   float64
 10   Continuous interest rate (after tax)                     6819 non-null   float64
 11   Operating Expense Rate                                   6819 non-null   float64
 12   Research and development expense rate                    6819 non-null   float64
 13   Cash flow rate                                           6819 non-null   float64
 14   Interest-bearing debt interest rate                      6819 non-null   float64
 15   Tax rate (A)                                             6819 non-null   float64
 16   Net Value Per Share (B)                                  6819 non-null   float64
 17   Net Value Per Share (A)                                  6819 non-null   float64
 18   Net Value Per Share (C)                                  6819 non-null   float64
 19   Persistent EPS in the Last Four Seasons                  6819 non-null   float64
 20   Cash Flow Per Share                                      6819 non-null   float64
 21   Revenue Per Share (Yuan ¥)                               6819 non-null   float64
 22   Operating Profit Per Share (Yuan ¥)                      6819 non-null   float64
 23   Per Share Net profit before tax (Yuan ¥)                 6819 non-null   float64
 24   Realized Sales Gross Profit Growth Rate                  6819 non-null   float64
 25   Operating Profit Growth Rate                             6819 non-null   float64
 26   After-tax Net Profit Growth Rate                         6819 non-null   float64
 27   Regular Net Profit Growth Rate                           6819 non-null   float64
 28   Continuous Net Profit Growth Rate                        6819 non-null   float64
 29   Total Asset Growth Rate                                  6819 non-null   float64
 30   Net Value Growth Rate                                    6819 non-null   float64
 31   Total Asset Return Growth Rate Ratio                     6819 non-null   float64
 32   Cash Reinvestment %                                      6819 non-null   float64
 33   Current Ratio                                            6819 non-null   float64
 34   Quick Ratio                                              6819 non-null   float64
 35   Interest Expense Ratio                                   6819 non-null   float64
 36   Total debt/Total net worth                               6819 non-null   float64
 37   Debt ratio %                                             6819 non-null   float64
 38   Net worth/Assets                                         6819 non-null   float64
 39   Long-term fund suitability ratio (A)                     6819 non-null   float64
 40   Borrowing dependency                                     6819 non-null   float64
 41   Contingent liabilities/Net worth                         6819 non-null   float64
 42   Operating profit/Paid-in capital                         6819 non-null   float64
 43   Net profit before tax/Paid-in capital                    6819 non-null   float64
 44   Inventory and accounts receivable/Net value              6819 non-null   float64
 45   Total Asset Turnover                                     6819 non-null   float64
 46   Accounts Receivable Turnover                             6819 non-null   float64
 47   Average Collection Days                                  6819 non-null   float64
 48   Inventory Turnover Rate (times)                          6819 non-null   float64
 49   Fixed Assets Turnover Frequency                          6819 non-null   float64
 50   Net Worth Turnover Rate (times)                          6819 non-null   float64
 51   Revenue per person                                       6819 non-null   float64
 52   Operating profit per person                              6819 non-null   float64
 53   Allocation rate per person                               6819 non-null   float64
 54   Working Capital to Total Assets                          6819 non-null   float64
 55   Quick Assets/Total Assets                                6819 non-null   float64
 56   Current Assets/Total Assets                              6819 non-null   float64
 57   Cash/Total Assets                                        6819 non-null   float64
 58   Quick Assets/Current Liability                           6819 non-null   float64
 59   Cash/Current Liability                                   6819 non-null   float64
 60   Current Liability to Assets                              6819 non-null   float64
 61   Operating Funds to Liability                             6819 non-null   float64
 62   Inventory/Working Capital                                6819 non-null   float64
 63   Inventory/Current Liability                              6819 non-null   float64
 64   Current Liabilities/Liability                            6819 non-null   float64
 65   Working Capital/Equity                                   6819 non-null   float64
 66   Current Liabilities/Equity                               6819 non-null   float64
 67   Long-term Liability to Current Assets                    6819 non-null   float64
 68   Retained Earnings to Total Assets                        6819 non-null   float64
 69   Total income/Total expense                               6819 non-null   float64
 70   Total expense/Assets                                     6819 non-null   float64
 71   Current Asset Turnover Rate                              6819 non-null   float64
 72   Quick Asset Turnover Rate                                6819 non-null   float64
 73   Working capitcal Turnover Rate                           6819 non-null   float64
 74   Cash Turnover Rate                                       6819 non-null   float64
 75   Cash Flow to Sales                                       6819 non-null   float64
 76   Fixed Assets to Assets                                   6819 non-null   float64
 77   Current Liability to Liability                           6819 non-null   float64
 78   Current Liability to Equity                              6819 non-null   float64
 79   Equity to Long-term Liability                            6819 non-null   float64
 80   Cash Flow to Total Assets                                6819 non-null   float64
 81   Cash Flow to Liability                                   6819 non-null   float64
 82   CFO to Assets                                            6819 non-null   float64
 83   Cash Flow to Equity                                      6819 non-null   float64
 84   Current Liability to Current Assets                      6819 non-null   float64
 85   Liability-Assets Flag                                    6819 non-null   int64  
 86   Net Income to Total Assets                               6819 non-null   float64
 87   Total assets to GNP price                                6819 non-null   float64
 88   No-credit Interval                                       6819 non-null   float64
 89   Gross Profit to Sales                                    6819 non-null   float64
 90   Net Income to Stockholder's Equity                       6819 non-null   float64
 91   Liability to Equity                                      6819 non-null   float64
 92   Degree of Financial Leverage (DFL)                       6819 non-null   float64
 93   Interest Coverage Ratio (Interest expense to EBIT)       6819 non-null   float64
 94   Net Income Flag                                          6819 non-null   int64  
 95   Equity to Liability                                      6819 non-null   float64
dtypes: float64(93), int64(3)
memory usage: 5.0 MB

bankruptcy_df.shape

(6819, 96)

bankruptcy_df.describe()

3. Data exploration and cleaning

3.1 Handling missing values

bankruptcy_df.columns[bankruptcy_df.isna().any()]

Index([], dtype='object')

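The result shows that this dataset is complete, with no missing values.

.any() checks whether a column has at least one missing value, whereas the corresponding .all() checks whether every value in the column is missing.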
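To make the difference concrete, here is a minimal sketch on a made-up frame (the `toy` DataFrame and its column names are illustrative assumptions, not part of the bankruptcy data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "a" is complete, "b" is partly missing,
# and "c" is entirely missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [1.0, np.nan, 3.0],
    "c": [np.nan, np.nan, np.nan],
})

missing = toy.isna()
cols_with_any_missing = list(toy.columns[missing.any()])  # at least one NaN
cols_all_missing = list(toy.columns[missing.all()])       # every value is NaN

print(cols_with_any_missing)  # ['b', 'c']
print(cols_all_missing)       # ['c']
```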

Cleaning up the column names

def clean_col_names(col_name):
    col_name = (
        col_name.strip()
        .replace("?", "_")
        .replace("(", "_")
        .replace(")", "_")
        .replace(" ", "_")
        .replace("/", "_")
        .replace("-", "_")
        .replace("__", "_")
        .replace("'", "")
        .lower()
    )
    return col_name

bank_columns = list(bankruptcy_df.columns)
bank_columns = [clean_col_names(col_name) for col_name in bank_columns]
bankruptcy_df.columns = bank_columns
display(bankruptcy_df.columns)

Index(['bankrupt_', 'roa_c_before_interest_and_depreciation_before_interest',
       'roa_a_before_interest_and_%_after_tax',
       'roa_b_before_interest_and_depreciation_after_tax',
       'operating_gross_margin', 'realized_sales_gross_margin',
       'operating_profit_rate', 'pre_tax_net_interest_rate',
       'after_tax_net_interest_rate',
       'non_industry_income_and_expenditure_revenue',
       'continuous_interest_rate_after_tax_', 'operating_expense_rate',
       'research_and_development_expense_rate', 'cash_flow_rate',
       'interest_bearing_debt_interest_rate', 'tax_rate_a_',
       'net_value_per_share_b_', 'net_value_per_share_a_',
       'net_value_per_share_c_', 'persistent_eps_in_the_last_four_seasons',
       'cash_flow_per_share', 'revenue_per_share_yuan_¥_',
       'operating_profit_per_share_yuan_¥_',
       'per_share_net_profit_before_tax_yuan_¥_',
       'realized_sales_gross_profit_growth_rate',
       'operating_profit_growth_rate', 'after_tax_net_profit_growth_rate',
       'regular_net_profit_growth_rate', 'continuous_net_profit_growth_rate',
       'total_asset_growth_rate', 'net_value_growth_rate',
       'total_asset_return_growth_rate_ratio', 'cash_reinvestment_%',
       'current_ratio', 'quick_ratio', 'interest_expense_ratio',
       'total_debt_total_net_worth', 'debt_ratio_%', 'net_worth_assets',
       'long_term_fund_suitability_ratio_a_', 'borrowing_dependency',
       'contingent_liabilities_net_worth', 'operating_profit_paid_in_capital',
       'net_profit_before_tax_paid_in_capital',
       'inventory_and_accounts_receivable_net_value', 'total_asset_turnover',
       'accounts_receivable_turnover', 'average_collection_days',
       'inventory_turnover_rate_times_', 'fixed_assets_turnover_frequency',
       'net_worth_turnover_rate_times_', 'revenue_per_person',
       'operating_profit_per_person', 'allocation_rate_per_person',
       'working_capital_to_total_assets', 'quick_assets_total_assets',
       'current_assets_total_assets', 'cash_total_assets',
       'quick_assets_current_liability', 'cash_current_liability',
       'current_liability_to_assets', 'operating_funds_to_liability',
       'inventory_working_capital', 'inventory_current_liability',
       'current_liabilities_liability', 'working_capital_equity',
       'current_liabilities_equity', 'long_term_liability_to_current_assets',
       'retained_earnings_to_total_assets', 'total_income_total_expense',
       'total_expense_assets', 'current_asset_turnover_rate',
       'quick_asset_turnover_rate', 'working_capitcal_turnover_rate',
       'cash_turnover_rate', 'cash_flow_to_sales', 'fixed_assets_to_assets',
       'current_liability_to_liability', 'current_liability_to_equity',
       'equity_to_long_term_liability', 'cash_flow_to_total_assets',
       'cash_flow_to_liability', 'cfo_to_assets', 'cash_flow_to_equity',
       'current_liability_to_current_assets', 'liability_assets_flag',
       'net_income_to_total_assets', 'total_assets_to_gnp_price',
       'no_credit_interval', 'gross_profit_to_sales',
       'net_income_to_stockholders_equity', 'liability_to_equity',
       'degree_of_financial_leverage_dfl_',
       'interest_coverage_ratio_interest_expense_to_ebit_', 'net_income_flag',
       'equity_to_liability'],
      dtype='object')

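Counting and plotting the target variable

The purpose of this step is to check whether the target variable is balanced; if it is not, the imbalance needs to be handled specifically.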

class_bar=sns.countplot(data=bankruptcy_df,x="bankrupt_")
ax = plt.gca()
for p in ax.patches:
        ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()+500))
class_bar
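Besides the bar chart, the imbalance can also be read off as numbers. The sketch below uses synthetic labels with the sizes that appear in this article (6819 rows, 220 of them bankrupt) rather than the real file, so treat it as an illustration only:

```python
import pandas as pd

# Synthetic labels matching the sizes reported in this article:
# 6819 rows in total, of which 220 are bankrupt (label 1).
labels = pd.Series([1] * 220 + [0] * (6819 - 220), name="bankrupt_")

counts = labels.value_counts()                # absolute counts per class
shares = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares.round(4))
```

On the real data the same two calls would be applied to `bankruptcy_df["bankrupt_"]`.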

3.2 Feature distributions

Checking skewness

# Return true/false if skewed
import scipy.stats
skew_df = pd.DataFrame(bankruptcy_df.select_dtypes(np.number).columns, columns = ['Feature'])

skew_df['Skew'] = skew_df['Feature'].apply(lambda feature: scipy.stats.skew(bankruptcy_df[feature])) 

skew_df['Absolute Skew'] = skew_df['Skew'].abs()
# Magnitude of the skew, regardless of its direction
skew_df['Skewed'] = skew_df['Absolute Skew'] >= 0.5
with pd.option_context("display.max_rows", 1000):
    display(skew_df)
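As a small aside, this check uses `scipy.stats.skew`, while the log-transform step later in the article calls the pandas `.skew()` method. The two use slightly different estimators, so values near the 0.5 threshold may fall on different sides. A minimal sketch on made-up numbers (the two Series are assumptions for illustration):

```python
import pandas as pd
import scipy.stats

symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = pd.Series([1.0, 1.0, 1.0, 2.0, 10.0])

# scipy.stats.skew uses the biased (population) estimator by default,
# while pandas applies a small-sample bias correction.
scipy_sym = scipy.stats.skew(symmetric.to_numpy())
scipy_right = scipy.stats.skew(right_skewed.to_numpy())
pandas_right = right_skewed.skew()

print(scipy_sym, scipy_right, pandas_right)
```

With thousands of rows the difference is negligible; it matters more on small samples such as the down-sampled set used later.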

Visualizing the distributions

cols = list(bankruptcy_df.columns)
ncols = 8
nrows = math.ceil(len(cols) / ncols)

fig, ax = plt.subplots(nrows, ncols, figsize = (4.5 * ncols, 4 * nrows))
for i in range(len(cols)):
    sns.kdeplot(bankruptcy_df[cols[i]], ax = ax[i // ncols, i % ncols])
    if i % ncols != 0:
        ax[i // ncols, i % ncols].set_ylabel(" ")
plt.tight_layout()
plt.show()

Listing the skewed features

query_skew=skew_df.query("Skewed == True")["Feature"]
with pd.option_context("display.max_rows", 1000):
    display(query_skew)

0                                             bankrupt_
2                 roa_a_before_interest_and_%_after_tax
3      roa_b_before_interest_and_depreciation_after_tax
4                                operating_gross_margin
5                           realized_sales_gross_margin
6                                 operating_profit_rate
7                             pre_tax_net_interest_rate
8                           after_tax_net_interest_rate
9           non_industry_income_and_expenditure_revenue
10                  continuous_interest_rate_after_tax_
11                               operating_expense_rate
12                research_and_development_expense_rate
13                                       cash_flow_rate
14                  interest_bearing_debt_interest_rate
15                                          tax_rate_a_
16                               net_value_per_share_b_
17                               net_value_per_share_a_
18                               net_value_per_share_c_
19              persistent_eps_in_the_last_four_seasons
20                                  cash_flow_per_share
21                            revenue_per_share_yuan_¥_
22                   operating_profit_per_share_yuan_¥_
23              per_share_net_profit_before_tax_yuan_¥_
24              realized_sales_gross_profit_growth_rate
25                         operating_profit_growth_rate
26                     after_tax_net_profit_growth_rate
27                       regular_net_profit_growth_rate
28                    continuous_net_profit_growth_rate
29                              total_asset_growth_rate
30                                net_value_growth_rate
31                 total_asset_return_growth_rate_ratio
32                                  cash_reinvestment_%
33                                        current_ratio
34                                          quick_ratio
35                               interest_expense_ratio
36                           total_debt_total_net_worth
37                                         debt_ratio_%
38                                     net_worth_assets
39                  long_term_fund_suitability_ratio_a_
40                                 borrowing_dependency
41                     contingent_liabilities_net_worth
42                     operating_profit_paid_in_capital
43                net_profit_before_tax_paid_in_capital
44          inventory_and_accounts_receivable_net_value
45                                 total_asset_turnover
46                         accounts_receivable_turnover
47                              average_collection_days
48                       inventory_turnover_rate_times_
49                      fixed_assets_turnover_frequency
50                       net_worth_turnover_rate_times_
51                                   revenue_per_person
52                          operating_profit_per_person
53                           allocation_rate_per_person
57                                    cash_total_assets
58                       quick_assets_current_liability
59                               cash_current_liability
60                          current_liability_to_assets
61                         operating_funds_to_liability
62                            inventory_working_capital
63                          inventory_current_liability
64                        current_liabilities_liability
65                               working_capital_equity
66                           current_liabilities_equity
67                long_term_liability_to_current_assets
68                    retained_earnings_to_total_assets
69                           total_income_total_expense
70                                 total_expense_assets
71                          current_asset_turnover_rate
72                            quick_asset_turnover_rate
73                       working_capitcal_turnover_rate
74                                   cash_turnover_rate
75                                   cash_flow_to_sales
76                               fixed_assets_to_assets
77                       current_liability_to_liability
78                          current_liability_to_equity
79                        equity_to_long_term_liability
81                               cash_flow_to_liability
83                                  cash_flow_to_equity
84                  current_liability_to_current_assets
85                                liability_assets_flag
86                           net_income_to_total_assets
87                            total_assets_to_gnp_price
88                                   no_credit_interval
89                                gross_profit_to_sales
90                    net_income_to_stockholders_equity
91                                  liability_to_equity
92                    degree_of_financial_leverage_dfl_
93    interest_coverage_ratio_interest_expense_to_ebit_
95                                  equity_to_liability
Name: Feature, dtype: object

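Next, we down-sample the dataset until the ratio of bankrupt to non-bankrupt companies in the sample is 50/50. Afterwards we check the skewness again to decide whether a log transform is needed, and we also analyze the correlation matrix.

3.3 Down-sampling

First, down-sample the dataset to a target ratio of bankrupt vs. non-bankrupt = 50 vs. 50.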

bankruptcy_df2 = bankruptcy_df.sample(frac=1) #Shuffle Bankruptcy df

bankruptcy_df_b = bankruptcy_df2.loc[bankruptcy_df2["bankrupt_"] == 1]
bankruptcy_df_nb = bankruptcy_df2.loc[bankruptcy_df2["bankrupt_"] == 0][:220]

bankruptcy_subdf_comb = pd.concat([bankruptcy_df_b,bankruptcy_df_nb])
bankruptcy_subdf = bankruptcy_subdf_comb.sample(frac=1,random_state=42)

bankruptcy_subdf

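Plot the counts of positive and negative samples again.

sns.countplot(data=bankruptcy_subdf, x="bankrupt_")

The sample now contains 220 randomly selected non-bankrupt companies and the 220 bankrupt companies.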
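The shuffle-and-slice approach above is not seeded on its first shuffle, so the 220 non-bankrupt rows change from run to run. A reproducible variant might look like the following sketch; `downsample_to_minority` is a hypothetical helper written for this illustration, and the toy frame stands in for the real data:

```python
import pandas as pd

def downsample_to_minority(df, target, random_state=42):
    """Return a shuffled frame in which every class has as many rows as the smallest class."""
    n_min = df[target].value_counts().min()
    # Draw n_min rows from each class with a fixed seed
    balanced = df.groupby(target).sample(n=n_min, random_state=random_state)
    # Shuffle so the classes are interleaved
    return balanced.sample(frac=1, random_state=random_state)

# Toy data: 3 positive rows and 17 negative rows
toy = pd.DataFrame({"bankrupt_": [1] * 3 + [0] * 17, "x": range(20)})
balanced = downsample_to_minority(toy, "bankrupt_")
print(balanced["bankrupt_"].value_counts())
```

Fixing `random_state` makes later results comparable across runs, at the cost of always analyzing the same subset.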

4. Feature engineering

bankruptcy_subdf2 = bankruptcy_subdf.drop(["net_income_flag"],axis=1)
bankruptcy_subdf2.shape

(440, 95)
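The article does not say why `net_income_flag` is dropped. One common reason for dropping a column is that it holds a single value and therefore carries no information; whether that is the case here is an assumption. A quick check for such columns could be sketched as follows (the `toy` frame is made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "flag": [1, 1, 1, 1],           # constant: carries no information
    "ratio": [0.1, 0.4, 0.2, 0.9],
})

# Columns with at most one distinct value (NaN counted as a value)
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
reduced = toy.drop(columns=constant_cols)

print(constant_cols)  # ['flag']
```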

4.1 Correlation matrix

fig = plt.figure(figsize=(30,20))
ax1 = fig.add_subplot(1,1,1)
sns.heatmap(bankruptcy_subdf2.corr(),ax=ax1,cmap="coolwarm")

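4.1.1 Finding the features most strongly correlated with bankruptcy

Based on a basic understanding of bankrupt companies, they tend to have few assets, high liabilities, low profitability, and little cash flow. We can analyze the dataset along these lines.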

corr=bankruptcy_subdf2[bankruptcy_subdf2.columns[:-1]].corr()['bankrupt_'][:]

corr_df = pd.DataFrame(corr)

print("Correlations to Bankruptcy:")
for index, row in corr_df["bankrupt_"].items():  # .iteritems() was removed in pandas 2.0
    if row!=1.0 and row>=0.5:
        print(f'Positive Correlation: {index}')
    elif row!=1.0 and row<=-0.5:
        print(f'Negative Correlation: {index}')

Correlations to Bankruptcy:
Negative Correlation: roa_c_before_interest_and_depreciation_before_interest
Negative Correlation: roa_b_before_interest_and_depreciation_after_tax
Negative Correlation: net_value_per_share_b_
Negative Correlation: net_value_per_share_a_
Negative Correlation: net_value_per_share_c_
Negative Correlation: persistent_eps_in_the_last_four_seasons
Negative Correlation: per_share_net_profit_before_tax_yuan_¥_
Positive Correlation: debt_ratio_%
Negative Correlation: net_worth_assets
Negative Correlation: net_profit_before_tax_paid_in_capital
Negative Correlation: total_income_total_expense

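What these features represent

roa_c_before_interest_and_depreciation_before_interest (ROA(C) before interest and depreciation before interest): return on total assets. The lower the return on total assets, the higher the bankruptcy risk.
roa_a_before_interest_and_after_tax (ROA(A) before interest and after tax): return on total assets. The lower the return on total assets, the higher the bankruptcy risk.
roa_b_before_interest_and_depreciation_after_tax (ROA(B) before interest and depreciation after tax): return on total assets. The lower the return on total assets, the higher the bankruptcy risk.
debt_ratio (debt ratio): liabilities as a share of total assets. The higher the value, the larger the share of liabilities and the higher the bankruptcy risk.
net_worth_assets (net worth to assets): the lower the net worth, the higher the bankruptcy risk.
retained_earnings_to_total_assets (retained earnings to total assets): the lower the retained earnings, the higher the bankruptcy risk.
total_income_total_expense (total income to total expense): the lower the ratio of income to expense, the higher the bankruptcy risk.
net_income_to_total_assets (net income to total assets): the lower the net income, the higher the bankruptcy risk.

Judging from the results, the features associated with a higher bankruptcy risk appear to be consistent with this background knowledge.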

4.2 Visualizing the feature distributions after down-sampling

# Visualisation of distributions after sub-sampling
cols = list(bankruptcy_subdf2.columns)
ncols = 8
nrows = math.ceil(len(cols) / ncols)

fig, ax = plt.subplots(nrows, ncols, figsize = (4.5 * ncols, 4 * nrows))
for i in range(len(cols)):
    sns.kdeplot(bankruptcy_subdf2[cols[i]], ax = ax[i // ncols, i % ncols])
    if i % ncols != 0:
        ax[i // ncols, i % ncols].set_ylabel(" ")
plt.tight_layout()
plt.show()

4.3 Box plots of all features

plt.figure(figsize=(30,20))
boxplot=sns.boxplot(data=bankruptcy_subdf2,orient="h")
boxplot.set(xscale="log")
plt.show()

4.4 Handling outliers

quartile1 = bankruptcy_subdf2.quantile(q=0.25,axis=0)
# display(quartile1)
quartile3 = bankruptcy_subdf2.quantile(q=0.75,axis=0)
# display(quartile3)
IQR = quartile3 -quartile1
lower_limit = quartile1-1.5*IQR
upper_limit = quartile3+1.5*IQR

lower_limit = lower_limit.drop(["bankrupt_"])
upper_limit = upper_limit.drop(["bankrupt_"])
# print(lower_limit)
# print(" ")
# print(upper_limit)

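# Compare only the feature columns so that the labels match the limits
# (comparing a DataFrame with a differently labeled Series fails in recent pandas)
feature_df = bankruptcy_subdf2.drop(columns=["bankrupt_"])
# True for rows that contain at least one value outside the IQR limits
outlier_rows = ((feature_df<lower_limit) | (feature_df>upper_limit)).any(axis=1)
bankruptcy_subdf2_out = bankruptcy_subdf2[outlier_rows]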
display(bankruptcy_subdf2_out.shape)
display(bankruptcy_subdf2.shape)

(423, 95)

(440, 95)
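Because the direction of the boolean mask decides which rows survive, it can help to see both options side by side. The following is a minimal sketch of the 1.5 × IQR rule on a made-up frame; `iqr_outlier_mask` is a hypothetical helper, not a function from the article:

```python
import pandas as pd

def iqr_outlier_mask(features, k=1.5):
    """Boolean Series: True for rows with at least one value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = features.quantile(0.25)
    q3 = features.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return ((features < lower) | (features > upper)).any(axis=1)

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 100.0],  # 100.0 lies far above Q3 + 1.5*IQR
    "y": [5.0, 6.0, 7.0, 8.0, 9.0],
})
mask = iqr_outlier_mask(toy)
rows_with_outliers = toy[mask]      # rows flagged by the rule
rows_without_outliers = toy[~mask]  # rows that pass the rule

print(mask.tolist())  # [False, False, False, False, True]
```

Indexing with `mask` keeps the flagged rows, while `~mask` keeps the remaining ones, so it is worth checking the resulting shape against what the next step expects.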

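Make an extra copy of the table for the subsequent analysis. Note that `bankruptcy_subdf2_out` holds the 423 rows that contain at least one value outside the IQR limits, so this copy keeps those rows rather than discarding them.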

bankruptcy_subdf3 = bankruptcy_subdf2_out.copy()
bankruptcy_subdf3

Visualize the distributions after down-sampling and the outlier step.

# Visualisation of distributions after sub-sampling after outlier removal
cols = list(bankruptcy_subdf3.columns)
ncols = 8
nrows = math.ceil(len(cols) / ncols)

fig, ax = plt.subplots(nrows, ncols, figsize = (4.5 * ncols, 4 * nrows))
for i in range(len(cols)):
    sns.kdeplot(bankruptcy_subdf3[cols[i]], ax = ax[i // ncols, i % ncols],fill=True,color="red")
    sns.kdeplot(bankruptcy_subdf2[cols[i]], ax = ax[i // ncols, i % ncols],color="green")
    if i % ncols != 0:
        ax[i // ncols, i % ncols].set_ylabel(" ")
plt.tight_layout()
plt.show()

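5 Data preprocessing

5.1 Feature encoding

All categories are already encoded in the underlying data, so no columns need to be encoded again here. In real-world work this step is very likely to be required, and encoding techniques are particularly important and worth mastering.

5.2 Log transform

This step is intended to reduce the skewness of the data distributions.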

# Log transform to remove skews
target = bankruptcy_subdf3['bankrupt_']
bankruptcy_subdf4 = bankruptcy_subdf3.drop(["bankrupt_"],axis=1)

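def log_trans(data):
    # Note: modifies `data` in place, so the frame passed in is transformed too
    for col in data:
        skew = data[col].skew()
        # Transform only columns that are noticeably skewed in either direction
        if abs(skew) >= 0.5:
            data[col] = np.log1p(data[col])
    return data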

bankruptcy_subdf4_log = log_trans(bankruptcy_subdf4)
bankruptcy_subdf4_log.head()
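If the untransformed frame should stay available for comparison, the transform can work on a copy instead. This is a sketch under that assumption, with a hypothetical helper name and toy data; note that `np.log1p` is only defined for values greater than -1:

```python
import numpy as np
import pandas as pd

def log_transform_skewed(data, threshold=0.5):
    """Return a copy in which columns with |skew| >= threshold are replaced by log1p."""
    out = data.copy()
    for col in out.columns:
        if abs(out[col].skew()) >= threshold:
            out[col] = np.log1p(out[col])
    return out

toy = pd.DataFrame({
    "skewed": [0.0, 0.1, 0.1, 0.2, 50.0],    # one large value creates a long right tail
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],  # skew of zero, left untouched
})
transformed = log_transform_skewed(toy)
print(transformed)
```

Returning a copy means any later step has to be passed the transformed frame explicitly, which makes the data flow easier to follow.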

5.2.1 Box plots of the log-transformed data

plt.figure(figsize=(30,20))
boxplot=sns.boxplot(data=bankruptcy_subdf4_log,orient="h")
boxplot.set(xscale="log")
plt.show()

5.2.2 Visualizing the distributions after the log transform

# Distributions after down-sampling, the outlier step, and the log transform
compare_subdf2 = bankruptcy_subdf2.drop(["bankrupt_"],axis=1)

cols = list(bankruptcy_subdf4.columns)
ncols = 8
nrows = math.ceil(len(cols) / ncols)

fig, ax = plt.subplots(nrows, ncols, figsize = (4.5 * ncols, 4 * nrows))
for i in range(len(cols)):
    sns.kdeplot(bankruptcy_subdf4_log[cols[i]], ax = ax[i // ncols, i % ncols],fill=True,color="red")
    sns.kdeplot(bankruptcy_subdf2[cols[i]], ax = ax[i // ncols, i % ncols],color="green")
    if i % ncols != 0:
        ax[i // ncols, i % ncols].set_ylabel(" ")
plt.tight_layout()
plt.show()
print("Red represents distributions after log transforms, green represents before log transform")

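Red shows the distributions after the log transform; green shows the distributions before it. (Full dataset: follow the WeChat official account 数据STUDIO and contact 云朵君 to obtain it.)

6 Building models with Pycaret

This model-building step uses the automated machine learning framework pycaret. If you have not installed it yet, the following command will install it.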

pip install -U --ignore-installed --pre pycaret

pycaret automatically splits the data into training and test sets.

from pycaret.classification import *
exp_name = setup(data = bankruptcy_subdf4,  target = bankruptcy_subdf3["bankrupt_"])

compare_models()
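Conceptually, comparing models means cross-validating several candidates on the same data and ranking them by a metric. The sketch below imitates that idea with plain scikit-learn on synthetic data; it is not PyCaret's implementation, and the candidate list, the data, and the variable names are all assumptions made for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a small balanced sample
X, y = make_classification(n_samples=440, n_features=20, random_state=0)

candidates = {
    "lr": LogisticRegression(max_iter=1000),
    "dt": DecisionTreeClassifier(random_state=0),
    "gbc": GradientBoostingClassifier(random_state=0),
}

rows = []
for name, model in candidates.items():
    # 5-fold cross-validated accuracy for each candidate
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    rows.append({"model": name, "mean_accuracy": scores.mean(), "std": scores.std()})

leaderboard = pd.DataFrame(rows).sort_values("mean_accuracy", ascending=False)
print(leaderboard)
```

The standard deviation column is worth reading alongside the mean: on a few hundred rows, differences of one or two points between models are often within fold-to-fold noise.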

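Pycaret reports that the three models with the highest accuracy are:

LightGBM classifier
Gradient Boosting classifier (GBC)
XGBoost classifier

Next, these three models are used for hyperparameter tuning.

6.1 Cross-validating the selected models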

LightGBM

print("LGBM Model")
lgb_clf = create_model("lightgbm")
lgb_clf_scoregrid = pull()

LGBM Model

GBC

print("GBC Model")
gbc_clf = create_model("gbc")
gbc_clf_scoregrid = pull()

GBC Model

XGBoost

print("XGB Model")
xgb_clf = create_model("xgboost")
xgb_clf_scoregrid = pull()

XGB Model

7 Hyperparameter tuning with Pycaret

7.1 Model tuning

LightGBM

print("Before Tuning")
print(lgb_clf_scoregrid.loc[["Mean","Std"]])
print("")
lgb_clf = tune_model(lgb_clf,choose_better=True)
print(lgb_clf)

Before Tuning
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold                                                          
Mean    0.8433  0.9233  0.8562  0.8497  0.8495  0.6866  0.6929
Std     0.0524  0.0429  0.0802  0.0681  0.0506  0.1046  0.1048

GBC

print("Before Tuning")
print(gbc_clf_scoregrid.loc[["Mean","Std"]])
print("")
gbc_clf = tune_model(gbc_clf,choose_better=True)
print(gbc_clf)

Before Tuning
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold                                                       
Mean    0.8329  0.9242  0.8558  0.8324  0.8419  0.6649  0.6691
Std     0.0599  0.0403  0.0634  0.0750  0.0557  0.1204  0.1198

XGBoost

print("Before Tuning")
print(xgb_clf_scoregrid.loc[["Mean","Std"]])
print("")
xgb_clf = tune_model(xgb_clf,choose_better = True)
print(xgb_clf)

Before Tuning
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold                                                          
Mean    0.8400  0.9270  0.8562  0.8410  0.8460  0.6797  0.6852
Std     0.0582  0.0382  0.0906  0.0586  0.0583  0.1161  0.1187

7.2 Model ensembling

Bagging & Boosting methods
Blending
Stacking

LightGBM

# Original
print(lgb_clf_scoregrid.loc[['Mean', 'Std']])

# Compare the original against bagged and boosted

# Bagged
lgb_clf = ensemble_model(lgb_clf,fold =5,choose_better = True)
# Boosted
lgb_clf = ensemble_model(lgb_clf,method="Boosting",choose_better = True)

      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold                                                          
Mean    0.8433  0.9233  0.8562  0.8497  0.8495  0.6866  0.6929
Std     0.0524  0.0429  0.0802  0.0681  0.0506  0.1046  0.1048

GBC

# Original
print(gbc_clf_scoregrid.loc[['Mean', 'Std']])

# Compare the original against bagged and boosted

# Bagged
gbc_clf = ensemble_model(gbc_clf,fold =5,choose_better = True)
# Boosted
gbc_clf = ensemble_model(gbc_clf,method="Boosting",choose_better = True)

      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold                                                          
Mean    0.8329  0.9242  0.8558  0.8324  0.8419  0.6649  0.6691
Std     0.0599  0.0403  0.0634  0.0750  0.0557  0.1204  0.1198

XGBoost

# Original
print(xgb_clf_scoregrid.loc[['Mean', 'Std']])

# Compare the original against bagged and boosted

# Bagged
xgb_clf = ensemble_model(xgb_clf,fold =5,choose_better = True)
# Boosted
xgb_clf = ensemble_model(xgb_clf,method="Boosting",choose_better = True)

      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
Fold                                                          
Mean    0.8400  0.9270  0.8562  0.8410  0.8460  0.6797  0.6852
Std     0.0582  0.0382  0.0906  0.0586  0.0583  0.1161  0.1187

7.3.1 Blend Models

blend_models([lgb_clf, gbc_clf, xgb_clf],choose_better=True)

7.3.2 Stacking

stacker = stack_models([lgb_clf, gbc_clf])  # xgb is left out because it caused some issues

print(stacker)
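For readers who want to see what blending and stacking mean outside of PyCaret, the following scikit-learn sketch may help. It assumes that blending corresponds to soft voting and stacking to a meta-model trained on cross-validated base predictions; the base models, data, and names are illustrative choices, not the article's setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

base = [
    ("gbc", GradientBoostingClassifier(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Blending in the soft-voting sense: average the predicted probabilities
blended = VotingClassifier(estimators=base, voting="soft").fit(X_train, y_train)

# Stacking: a meta-model learns from the base models' cross-validated predictions
stacked = StackingClassifier(estimators=base, final_estimator=LogisticRegression(), cv=5)
stacked.fit(X_train, y_train)

blend_acc = blended.score(X_test, y_test)
stack_acc = stacked.score(X_test, y_test)
print(blend_acc, stack_acc)
```

Voting needs no extra training beyond the base models, whereas stacking fits an additional model and therefore benefits more from having enough data.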

8 Model evaluation

# evaluate_model(lgb_clf)
# evaluate_model(gbc_clf)
# evaluate_model(xgb_clf)

8.1 ROC-AUC

plot_model(stacker, plot = 'auc')   
# Stacked classifier from ensembling
plot_model(lgb_clf, plot = 'auc')   
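# For lgb, the Bagging ensemble performed best and was selected
plot_model(gbc_clf, plot = 'auc')   
# For gbc, the Boosting ensemble performed best and was selected
plot_model(xgb_clf, plot = 'auc')   
# The base xgb classifier still performed best after tuning and ensembling, so it was kept

8.2 Confusion matrix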

plot_model(stacker, 
           plot = 'confusion_matrix', 
           plot_kwargs = {'percent' : True})
plot_model(lgb_clf, 
           plot = 'confusion_matrix', 
           plot_kwargs = {'percent' : True})
plot_model(gbc_clf, 
           plot = 'confusion_matrix', 
           plot_kwargs = {'percent' : True})
plot_model(xgb_clf,
           plot = 'confusion_matrix', 
           plot_kwargs = {'percent' : True})
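A percentage-based confusion matrix can also be computed by hand, which helps when reading the plots. The sketch below normalizes within each actual class; whether the `percent` option above normalizes in exactly this way is an assumption, and the labels are made up:

```python
import pandas as pd

y_true = pd.Series([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], name="actual")
y_pred = pd.Series([1, 1, 1, 0, 0, 0, 0, 0, 1, 0], name="predicted")

counts = pd.crosstab(y_true, y_pred)                      # raw counts
row_pct = pd.crosstab(y_true, y_pred, normalize="index")  # share within each actual class

print(counts)
print(row_pct.round(3))
```

In `row_pct`, the entry for actual 1 and predicted 1 is the recall for the bankrupt class (0.75 in this toy example), which is usually the number of most interest in a bankruptcy setting.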