Chi-Square Binning for Categorical Variables

When categorical variables appear in modeling, they are usually converted to dummy variables. If a categorical variable has many levels, however, it produces a large number of dummies, which inflates the dimensionality; moreover, often only some of the dummies enter the model, so part of the information carried by the variable can be lost. Besides dummy encoding, a categorical variable can also be binned to reduce the number of levels.
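
As a quick, hypothetical illustration of the dimensionality issue (the variable name and values below are made up, not taken from the article's data), dummy encoding creates one column per level, whereas binning collapses the levels into a handful of groups:

import pandas as pd

# hypothetical high-cardinality categorical variable
df = pd.DataFrame({'prov': ['北京', '上海', '广东', '新疆', '西藏'] * 20})

# dummy encoding: one new column per level
dummies = pd.get_dummies(df['prov'], prefix='prov')
print(dummies.shape[1])   # 5 columns here; a real province field would give 30+

# binning instead maps the levels into a few groups, so a single column suffices
bin_map = {'北京': 1, '上海': 1, '广东': 2, '新疆': 3, '西藏': 3}
df['prov_bin'] = df['prov'].map(bin_map)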

Binning Algorithm

The levels of a categorical variable have no inherent order, so unlike a continuous variable there is no ordering constraint to preserve between adjacent bins. Before binning, compute the bad-sample rate of each level of the categorical variable, sort the levels by bad rate, and then repeatedly merge adjacent levels until the stopping conditions are met.

The algorithm is as follows:

(1) For each level of the categorical variable, compute the total sample count, good count, bad count, sample share and bad rate, then sort the levels by bad rate. This is the initial binning, with each level forming its own group;

(2) Compute the chi-square value of every pair of adjacent groups (the statistic is written out after this list) and merge the adjacent pair with the smallest chi-square value;

(3) Repeat step (2) until the number of groups is <= BinMax;

(4) Check whether every group contains both bad and good samples; if a group contains only bad or only good samples, merge it with whichever adjacent group gives the smallest chi-square value;

(5) Repeat step (4) until every group contains both bad and good samples;

(6) Check whether every group's sample share is >= BinPcntMin; if a group's share is < BinPcntMin, merge it with whichever adjacent group gives the smallest chi-square value;

(7) Repeat step (6) until every group's sample share is >= BinPcntMin.
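
The chi-square value used in steps (2), (4) and (6) is the ordinary Pearson chi-square statistic on the 2x2 table formed by two adjacent groups and the good/bad label. Writing A_ij for the observed count of class j (good or bad) in group i, N_i for the total of group i, C_j for the total of class j over the two groups and N for the grand total, the expected counts and the statistic are:

E_{ij} = N_i \cdot \frac{C_j}{N}, \qquad
\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}

A small chi-square means the two groups have very similar bad rates, so merging them loses little information; this is exactly what the calcChi2 function below computes.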

Code Implementation

1. Load the modules

import pandas as pd
import numpy as np
from pandas import Series

2. Function to compute the total, good and bad sample counts and the bad rate for each level

def BinBadRate(df, col, target, BadRateIndicator = True):
    
    # df: dataframe on which the good/bad statistics are computed
    # col: variable for which the statistics are computed
    # target: good/bad label, 0 = good, 1 = bad
    # BadRateIndicator: whether to compute the bad rate
    
    group = df.groupby([col])[target].agg(['count', 'sum'])
    group.columns = ['total', 'bad']
    group.reset_index(inplace=True)
    group['good'] = group['total'] - group['bad']
    
    if BadRateIndicator:
        group['BadRate'] = group['bad']/group['total']
         
    return group 
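
A minimal usage sketch of BinBadRate (the toy dataframe below is made up purely for illustration):

toy = pd.DataFrame({'grade': ['A', 'A', 'B', 'B', 'B', 'C'],
                    'y':     [0, 1, 0, 0, 1, 1]})
print(BinBadRate(toy, 'grade', 'y', BadRateIndicator=True))
#   grade  total  bad  good   BadRate
# 0     A      2    1     1  0.500000
# 1     B      3    1     2  0.333333
# 2     C      1    1     0  1.000000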

3. Function to compute the chi-square value

def calcChi2(df, total_col, bad_col, good_col):

    # df: dataframe containing the total, bad and good sample counts of each level
    # total_col: column holding the total sample count
    # bad_col: column holding the bad sample count
    # good_col: column holding the good sample count

    df2 = df.copy()
    # overall bad rate and good rate across the rows of df2
    badRate = df2[bad_col].sum() * 1.0 / df2[total_col].sum()
    goodRate = df2[good_col].sum() * 1.0 / df2[total_col].sum()
    
    # if the pooled samples are all bad or all good, the chi-square value is taken as 0
    if badRate in [0,1]:
        return 0

    # expected numbers of bad and good samples in each row
    df2['bad_Exp'] = df2[total_col].map(lambda x: x*badRate)
    df2['good_Exp'] = df2[total_col].map(lambda x: x*goodRate)
    
    # compute the chi-square value
    badzip = zip(df2['bad_Exp'], df2[bad_col])
    goodzip = zip(df2['good_Exp'], df2[good_col])
    badChi2 = [(elem[1]-elem[0])**2/elem[0] for elem in badzip]
    goodChi2 = [(elem[1] - elem[0])**2/elem[0] for elem in goodzip]
    chi2 = sum(badChi2) + sum(goodChi2)
    
    return chi2
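
Continuing the same toy data (again purely illustrative), the output of BinBadRate already carries the total/bad/good columns that calcChi2 expects, so the chi-square of two adjacent rows can be computed directly:

toy = pd.DataFrame({'grade': ['A', 'A', 'B', 'B', 'B', 'C'],
                    'y':     [0, 1, 0, 0, 1, 1]})
grp = BinBadRate(toy, 'grade', 'y', BadRateIndicator=False)

# chi-square of the adjacent rows 'A' and 'B' (the smaller it is, the better the merge candidate)
print(calcChi2(grp.loc[0:1, :], 'total', 'bad', 'good'))   # ~= 0.139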

4. Below is the single-variable binning function; it calls the two functions above and returns the binning result for one variable. Following the algorithm described earlier, it has three parts: merging adjacent bins, checking that every bin contains both good and bad samples, and checking that every bin's share is at least BinPcntMin. The argument spe_attri holds special attribute values, each of which forms its own bin and does not take part in the chi-square binning.

################# split the category variable using Chi2 value #################
def CateVarChi2Bin(df, col, target, BinMax, BinPcntMin, spe_attri = []):
    
    # df: dataframe containing the target variable and the variable to bin
    # col: variable to bin
    # target: target variable, taking values 0 or 1
    # BinMax: maximum number of bins
    # BinPcntMin: minimum sample share per bin
    # spe_attri: special attribute values; each forms its own bin and does not take part in the chi-square binning
    # returns: dataframe with the binning result of this variable
        
    if len(spe_attri)>=1:
        df1 = df.loc[df[col].isin(spe_attri)]
        df2 = df.loc[~df[col].isin(spe_attri)]
        BinMax -= len(set(df1[col]))
    else:
        df2 = df.copy()
     
    binBadRate = BinBadRate(df2, col, target, BadRateIndicator = True)
    binBadRate = binBadRate.sort_values(by='BadRate', ascending=True)
    binBadRate.reset_index(inplace=True, drop=True)

    binBadRate = binBadRate.drop('BadRate', axis=1)
    bindf = binBadRate.copy()
    
    # 1. Merge adjacent bins until the number of bins <= BinMax
    while binBadRate.shape[0] > BinMax:
        chi2List = []
        for i in range(0, binBadRate.shape[0]-1):
            col_binBadRate = binBadRate.loc[i:i+1, :]
            chi2 = calcChi2(col_binBadRate, 'total', 'bad', 'good')
            chi2List.append(chi2)
            
        combineIndex = chi2List.index(min(chi2List))
        combine_binBadRate = binBadRate.loc[combineIndex:combineIndex+1, :]
        
        binBadRate.loc[combineIndex+1, 'total'] = combine_binBadRate['total'].sum()
        binBadRate.loc[combineIndex+1, 'bad'] = combine_binBadRate['bad'].sum()
        binBadRate.loc[combineIndex+1, 'good'] = combine_binBadRate['good'].sum()
        binBadRate.loc[combineIndex+1, col] = binBadRate.loc[combineIndex, col] + ',' + binBadRate.loc[combineIndex+1, col]
        
        binBadRate = binBadRate.loc[binBadRate.index != combineIndex , :]
        binBadRate.reset_index(drop=True, inplace=True)

    # 2. Check that every bin contains both good and bad samples
    binBadRate['BadRate'] = binBadRate['bad']/binBadRate['total']
    minBadRate, maxBadRate = min(binBadRate['BadRate']), max(binBadRate['BadRate'])
    while minBadRate == 0 or maxBadRate == 1:
        BadRate_01 = binBadRate[col][binBadRate['BadRate'].isin([0, 1])]
        index_01 = BadRate_01.index[0]
        
        if index_01 == 0:
            
            combineIndex = 0
            combine_binBadRate = binBadRate.loc[combineIndex:combineIndex+1, :]
        
            binBadRate.loc[combineIndex+1, 'total'] = sum(combine_binBadRate['total'])
            binBadRate.loc[combineIndex+1, 'bad'] = sum(combine_binBadRate['bad'])
            binBadRate.loc[combineIndex+1, 'good'] = sum(combine_binBadRate['good'])
            binBadRate.loc[combineIndex+1, col] = binBadRate.loc[combineIndex, col] + ',' + binBadRate.loc[combineIndex+1, col]
            
            binBadRate = binBadRate.loc[binBadRate.index != combineIndex , :]
            binBadRate.reset_index(drop=True, inplace=True)
            
        elif index_01 == binBadRate.shape[0]-1:
            
            combineIndex = binBadRate.shape[0]-2
            combine_binBadRate = binBadRate.loc[combineIndex:combineIndex+1, :]
        
            binBadRate.loc[combineIndex+1, 'total'] = sum(combine_binBadRate['total'])
            binBadRate.loc[combineIndex+1, 'bad'] = sum(combine_binBadRate['bad'])
            binBadRate.loc[combineIndex+1, 'good'] = sum(combine_binBadRate['good'])
            binBadRate.loc[combineIndex+1, col] = binBadRate.loc[combineIndex, col] + ',' + binBadRate.loc[combineIndex+1, col]
            
            binBadRate = binBadRate.loc[binBadRate.index != combineIndex , :]
            binBadRate.reset_index(drop=True, inplace=True)
            
        else:
            
            temp1_binBadRate = binBadRate.loc[index_01-1:index_01, :]
            chi2_1 = calcChi2(temp1_binBadRate, 'total', 'bad', 'good')
            
            temp2_binBadRate = binBadRate.loc[index_01:index_01+1, :]
            chi2_2 = calcChi2(temp2_binBadRate, 'total', 'bad', 'good')
            
            if chi2_1 < chi2_2:
                combineIndex = index_01-1
            else:
                combineIndex = index_01
                
            combine_binBadRate = binBadRate.loc[combineIndex:combineIndex+1, :]
        
            binBadRate.loc[combineIndex+1, 'total'] = sum(combine_binBadRate['total'])
            binBadRate.loc[combineIndex+1, 'bad'] = sum(combine_binBadRate['bad'])
            binBadRate.loc[combineIndex+1, 'good'] = sum(combine_binBadRate['good'])
            binBadRate.loc[combineIndex+1, col] = binBadRate.loc[combineIndex, col] + ',' + binBadRate.loc[combineIndex+1, col]
            
            binBadRate = binBadRate.loc[binBadRate.index != combineIndex , :]
            binBadRate.reset_index(drop=True, inplace=True)
        
        binBadRate['BadRate'] = binBadRate['bad']/binBadRate['total']
        minBadRate, maxBadRate = min(binBadRate['BadRate']), max(binBadRate['BadRate'])

    # 3. Check that every bin's sample share is >= BinPcntMin
    binBadRate['Percent'] = binBadRate['total']/binBadRate['total'].sum()       
    minPercent = min(binBadRate['Percent'])
    while minPercent < BinPcntMin:
        minPercent_temp = binBadRate[col][binBadRate['Percent']==minPercent]
        index_minPercent = minPercent_temp.index[0]
        
        if index_minPercent == 0:
          
            combineIndex = 0
            combine_binBadRate = binBadRate.loc[combineIndex:combineIndex+1, :]
        
            binBadRate.loc[combineIndex+1, 'total'] = sum(combine_binBadRate['total'])
            binBadRate.loc[combineIndex+1, 'bad'] = sum(combine_binBadRate['bad'])
            binBadRate.loc[combineIndex+1, 'good'] = sum(combine_binBadRate['good'])
            binBadRate.loc[combineIndex+1, col] = binBadRate.loc[combineIndex, col] + ',' + binBadRate.loc[combineIndex+1, col]
            
            binBadRate = binBadRate.loc[binBadRate.index != combineIndex , :]
            binBadRate.reset_index(drop=True, inplace=True)
            
        elif index_minPercent == binBadRate.shape[0]-1:
            
            combineIndex = binBadRate.shape[0]-2
            combine_binBadRate = binBadRate.loc[combineIndex:combineIndex+1, :]
        
            binBadRate.loc[combineIndex+1, 'total'] = sum(combine_binBadRate['total'])
            binBadRate.loc[combineIndex+1, 'bad'] = sum(combine_binBadRate['bad'])
            binBadRate.loc[combineIndex+1, 'good'] = sum(combine_binBadRate['good'])
            binBadRate.loc[combineIndex+1, col] = binBadRate.loc[combineIndex, col] + ',' + binBadRate.loc[combineIndex+1, col]
            
            binBadRate = binBadRate.loc[binBadRate.index != combineIndex , :]
            binBadRate.reset_index(drop=True, inplace=True)
            
        else:
            
            temp1_binBadRate = binBadRate.loc[index_minPercent-1:index_minPercent, :]
            chi2_1 = calcChi2(temp1_binBadRate, 'total', 'bad', 'good')
            
            temp2_binBadRate = binBadRate.loc[index_minPercent:index_minPercent+1, :]
            chi2_2 = calcChi2(temp2_binBadRate, 'total', 'bad', 'good')
            
            if chi2_1 < chi2_2:
                combineIndex = index_minPercent-1
            else:
                combineIndex = index_minPercent
                
            combine_binBadRate = binBadRate.loc[combineIndex:combineIndex+1, :]
        
            binBadRate.loc[combineIndex+1, 'total'] = sum(combine_binBadRate['total'])
            binBadRate.loc[combineIndex+1, 'bad'] = sum(combine_binBadRate['bad'])
            binBadRate.loc[combineIndex+1, 'good'] = sum(combine_binBadRate['good'])
            binBadRate.loc[combineIndex+1, col] = binBadRate.loc[combineIndex, col] + ',' + binBadRate.loc[combineIndex+1, col]
            
            binBadRate = binBadRate.loc[binBadRate.index != combineIndex , :]
            binBadRate.reset_index(drop=True, inplace=True)
        
        binBadRate['Percent'] = binBadRate['total']/binBadRate['total'].sum()
        minPercent = min(binBadRate['Percent'])
        
    binBadRate = binBadRate.drop(['BadRate', 'Percent'], axis=1)
    
    if len(spe_attri)>=1:
        binBadRate0 = BinBadRate(df1, col, target, BadRateIndicator = False) 
        binBadRate = pd.concat([binBadRate0, binBadRate])
        binBadRate.reset_index(drop=True, inplace=True)
        bindf = pd.concat([binBadRate0, bindf])
        bindf.reset_index(drop=True, inplace=True)
    
    binBadRate['Percent'] = binBadRate['total']/binBadRate['total'].sum()
    binBadRate['BadRate'] = binBadRate['bad']/binBadRate['total']
    binBadRate['bin'] = range(1, len(binBadRate)+1)
    
    bindf['each_Percent'] = bindf['total']/sum(bindf['total'])
    bindf['each_BadRate'] = bindf['bad']/bindf['total']
    
    for i in binBadRate.index:
        bindf.loc[bindf[col].map(lambda x: x in binBadRate[col][i].split(',')), 'bin'] = binBadRate['bin'][i]
        bindf.loc[bindf[col].map(lambda x: x in binBadRate[col][i].split(',')), 'Percent'] = binBadRate['Percent'][i]
        bindf.loc[bindf[col].map(lambda x: x in binBadRate[col][i].split(',')), 'BadRate'] = binBadRate['BadRate'][i]
        
    return bindf

Take the categorical variable marr_sex as an example. train_cate is a dataframe containing the categorical variables, y is the target variable, and 'miss' stands for the missing value of the categorical variable.

marr_sex_bin1 = CateVarChi2Bin(train_cate, 'marr_sex', 'y', BinMax=6, BinPcntMin=0.05, spe_attri = ['miss'])

The binning result is shown below; the missing value 'miss' forms a bin of its own.

Next, no special attribute is set, so the missing value 'miss' takes part in the chi-square binning; let us see how the result differs.

marr_sex_bin2 = CateVarChi2Bin(train_cate, 'marr_sex', 'y', BinMax=6, BinPcntMin=0.05, spe_attri = [])

The result is shown below: the missing value 'miss' ends up in the same bin as '已婚男性' (married male).

5. The batch binning function bins all categorical variables that need binning in one pass and returns a dictionary holding each variable's binning result.

########### split the category variable using Chi2 value by batch ############
def CateVarChi2BinBatch(df, key, target, BinMax, BinPcntMin, spe_attri = []):
    
    # df: dataframe containing the primary key, the target variable and the variables to bin
    # key: primary key
    # target: target variable
    # BinMax, BinPcntMin, spe_attri: passed straight through to CateVarChi2Bin
    
    df_Xvar = df.drop([key, target], axis=1)
    x_vars = df_Xvar.columns.tolist()
    
    dict_bin = {}
    for col in x_vars:
        dict_bin[col] = CateVarChi2Bin(df, col, target, BinMax, BinPcntMin, spe_attri)
    
    return dict_bin

Take the training sample train_cate as an example. It contains the primary key cus_num, the target variable y, the ID-card province id_prov, the mobile-phone home province cell_prov, and the marital-status/gender variable marr_sex. train_cate looks like this:

Batch-binning train_cate:

dict_train_cate = CateVarChi2BinBatch(train_cate, 'cus_num', 'y', BinMax=6, BinPcntMin=0.05, spe_attri=['miss'])

The dictionary dict_train_cate holds the binning result of each categorical variable:

Take a look at the binning result of the variable cell_prov:
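
The original screenshots are not reproduced here; assuming the batch call above has been run, each entry of the dictionary can be inspected directly:

print(dict_train_cate.keys())        # the categorical variables that were binned
print(dict_train_cate['cell_prov'])  # per-level counts plus bin id, Percent and BadRate for cell_prov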

6. Function to replace variable values with their bin values

def txtCateVarBin(df, key, target, dict_bin, testIndicator):
    # df: dataframe whose variable values are to be replaced by bin values
    # key: primary key
    # target: target variable
    # dict_bin: dictionary holding the binning result of each variable
    # testIndicator: whether df is a test dataframe; if True, also compute the per-bin
    #                share, bad rate, etc. of the test data and return them in a dictionary
    
    df_bin = df[[key, target]].copy()
    df_Xvar = df.drop([key, target], axis=1)
    DictBin = {}
    for col in df_Xvar.columns:

        Bin = dict_bin[col]
        ls = Series([np.nan] * len(df))
        for i in range(len(Bin)):
            ls[(df[col] == Bin[col][i]).tolist()] = Bin.bin[i]
        df_bin[col] = ls.tolist()
        
        if testIndicator:
            
            col_bin = BinBadRate(df, col, target, BadRateIndicator = False)
            col_bin = Bin[[col]].merge(col_bin, on=col, how='left')
            col_bin['each_Percent'] = col_bin['total']/col_bin['total'].sum()
            col_bin['each_BadRate'] = col_bin['bad']/col_bin['total']
            col_bin = col_bin.merge(Bin[[col, 'bin']], on=col, how='left')
            
            col_bin_BadRate = BinBadRate(df_bin, col, target, BadRateIndicator = False)
            col_bin_BadRate['Percent'] = col_bin_BadRate['total']/col_bin_BadRate['total'].sum()
            col_bin_BadRate['BadRate'] = col_bin_BadRate['bad']/col_bin_BadRate['total']
            col_bin_BadRate = col_bin_BadRate[[col, 'Percent', 'BadRate']]
            col_bin_BadRate.columns = ['bin', 'Percent', 'BadRate']
        
            col_bin = col_bin.merge(col_bin_BadRate[['bin', 'Percent', 'BadRate']], on='bin', how='left')
         
            DictBin[col] = col_bin
        
    if testIndicator:
        return df_bin, DictBin

    return df_bin

Earlier, batch-binning the training sample train_cate gave the binning dictionary dict_train_cate. This dictionary is now used to map the categorical values in train_cate to their bin values. Here testIndicator=0, so only the mapped training sample is returned:

train_cate_bin = txtCateVarBin(train_cate, 'cus_num', 'y', dict_train_cate, testIndicator=0)

Take a look at the mapped training sample:
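
The screenshot is omitted here; assuming the call above, a quick check of the mapping looks like this:

print(train_cate_bin.head())   # id_prov, cell_prov and marr_sex now hold bin numbers instead of raw values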

The test sample must also be mapped to bin values using the binning result obtained on the training sample. With testIndicator=1, the function additionally returns the risk distribution of the test-sample variables after they have been mapped with the training-sample bins.

test_cate_bin, dict_test_cate = txtCateVarBin(test_cate, 'cus_num', 'y', dict_train_cate, testIndicator=1)

test_cate_bin is the mapped test sample:

dict_test_cate, also a dictionary, holds the risk distribution of the test sample after binning:

Taking the variable marr_sex as an example, compare its risk distribution on the training and test samples; in the figure below, the upper panel shows the training-sample result and the lower panel the test-sample result.

Region features were used above for ease of presentation. Bins produced by the algorithm usually still need manual adjustment based on business understanding. For instance, the binning result of cell_prov shows that regions such as Xinjiang, Tibet, Ningxia and Inner Mongolia have rather low bad rates, while their sample shares are also extremely low. It is therefore quite likely that earlier policy already contained region-related rules that applied stricter approval criteria to these areas; if such a feature is still used in the model, it will partly offset the effect of those rules. In most cases it is better to avoid features like this.

Also, categorical variables with a large number of levels should be used with caution: with too many levels, each level may contain very few samples, so the estimated risk statistics carry large variance, and even after binning the result is not very reliable.

This article has given a detailed account of chi-square binning for categorical variables. The algorithm for continuous variables is essentially the same; only some of the feature handling differs, and it will be shared in a later article.

