简单易学的机器学习算法——梯度提升决策树GBDT

felixzhao

发布于 2018-03-20 15:59:30

2K0

发布于 2018-03-20 15:59:30

文章被收录于专栏：null的专栏

梯度提升决策树（Gradient Boosting Decision Tree，GBDT）算法是近年来被提及比较多的一个算法，这主要得益于其算法的性能，以及该算法在各类数据挖掘以及机器学习比赛中的卓越表现，有很多人对GBDT算法进行了开源代码的开发，比较火的是陈天奇的XGBoost和微软的LightGBM。

一、监督学习

1、监督学习的主要任务

2、梯度下降法

梯度下降法的具体过程如下图所示：

3、在函数空间的优化

二、Boosting

1、集成方法之Boosting

Boosting方法是集成学习中重要的一种方法，在集成学习方法中最主要的两种方法为Bagging和Boosting，在Bagging中，通过对训练样本重新采样的方法得到不同的训练样本集，在这些新的训练样本集上分别训练学习器，最终合并每一个学习器的结果，作为最终的学习结果，Bagging方法的具体过程如下图所示：

在Bagging方法中，最重要的算法为随机森林Random Forest算法。由以上的图中可以看出，在Bagging方法中，bb个学习器之间彼此是相互独立的，这样的特点使得Bagging方法更容易并行。与Bagging方法不同，在Boosting算法中，学习器之间是存在先后顺序的，同时，每一个样本是有权重的，初始时，每一个样本的权重是相等的。首先，第11个学习器对训练样本进行学习，当学习完成后，增大错误样本的权重，同时减小正确样本的权重，再利用第22个学习器对其进行学习，依次进行下去，最终得到bb个学习器，最终，合并这bb个学习器的结果，同时，与Bagging中不同的是，每一个学习器的权重也是不一样的。Boosting方法的具体过程如下图所示：

在Boosting方法中，最重要的方法包括：AdaBoost和GBDT。

2、Gradient Boosting

上建立模型，由于上述是一个求解梯度的过程，因此也称为基于梯度的Boost方法，其具体过程如下所示：

三、Gradient Boosting Decision Tree

在上面简单介绍了Gradient Boost框架，梯度提升决策树Gradient Boosting Decision Tree是Gradient Boost框架下使用较多的一种模型，在梯度提升决策树中，其基学习器是分类回归树CART，使用的是CART树中的回归树。

1、分类回归树CART

分类回归树CART算法是一种基于二叉树的机器学习算法，其既能处理回归问题，又能处理分类为题，在梯度提升决策树GBDT算法中，使用到的是CART回归树算法，对于CART树算法的更多信息，可以参考简单易学的机器学习算法——分类回归树CART。

注意：对于上述最优划分标准的选择，以上的计算过程可以进一步优化。

2、GBDT——二分类

以参考文献3 Idiots’ Approach for Display Advertising Challenge中提供的代码为例：

GBDT训练的主要代码为：

void GBDT::fit(Problem const &Tr, Problem const &Va)
{
        bias = calc_bias(Tr.Y); //用于初始化的F

        std::vector<float> F_Tr(Tr.nr_instance, bias), F_Va(Va.nr_instance, bias);

        Timer timer;
        printf("iter     time    tr_loss    va_loss\n");
        // 开始训练每一棵CART树
        for(uint32_t t = 0; t < trees.size(); ++t)
        {
                timer.tic();

                std::vector<float> const &Y = Tr.Y;
                std::vector<float> R(Tr.nr_instance), F1(Tr.nr_instance); // 记录残差和F

                #pragma omp parallel for schedule(static)
                for(uint32_t i = 0; i < Tr.nr_instance; ++i)
                        R[i] = static_cast<float>(Y[i]/(1+exp(Y[i]*F_Tr[i]))); //计算残差，或者称为梯度下降的方向

                // 利用上面的残差值，在此函数中构造一棵树
                trees[t].fit(Tr, R, F1); // 分类树的生成

                double Tr_loss = 0;
                // 用上面训练的结果更新F_Tr，并计算log_loss
                #pragma omp parallel for schedule(static) reduction(+: Tr_loss)
                for(uint32_t i = 0; i < Tr.nr_instance; ++i)
                {
                        F_Tr[i] += F1[i];
                        Tr_loss += log(1+exp(-Y[i]*F_Tr[i]));
                }
                Tr_loss /= static_cast<double>(Tr.nr_instance);

                // 用上面训练的结果预测测试集，打印log_loss
                #pragma omp parallel for schedule(static)
                for(uint32_t i = 0; i < Va.nr_instance; ++i)
                {
                        std::vector<float> x = construct_instance(Va, i);
                        F_Va[i] += trees[t].predict(x.data()).second;
                }

                double Va_loss = 0;
                #pragma omp parallel for schedule(static) reduction(+: Va_loss)
                for(uint32_t i = 0; i < Va.nr_instance; ++i)
                        Va_loss += log(1+exp(-Va.Y[i]*F_Va[i]));
                Va_loss /= static_cast<double>(Va.nr_instance);

                printf("%4d %8.1f %10.5f %10.5f\n", t, timer.toc(), Tr_loss, Va_loss);
                fflush(stdout);
        }
}

CART回归树的训练代码为：

void CART::fit(Problem const &prob, std::vector<float> const &R, std::vector<float> &F1){
    uint32_t const nr_field = prob.nr_field; // 特征的个数
    uint32_t const nr_sparse_field = prob.nr_sparse_field;
    uint32_t const nr_instance = prob.nr_instance; // 样本的个数

    std::vector<Location> locations(nr_instance); // 样本信息

    #pragma omp parallel for schedule(static)
    for(uint32_t i = 0; i < nr_instance; ++i)
        locations[i].r = R[i]; // 记录每一个样本的残差

    for(uint32_t d = 0, offset = 1; d < max_depth; ++d, offset *= 2){// d:深度

        uint32_t const nr_leaf = static_cast<uint32_t>(pow(2, d)); // 叶子节点的个数


        std::vector<Meta> metas0(nr_leaf); // 叶子节点的信息

        for(uint32_t i = 0; i < nr_instance; ++i){

            Location &location = locations[i]; //第i个样本的信息

            if(location.shrinked)
                continue;

            Meta &meta = metas0[location.tnode_idx-offset]; //找到对应的叶子节点

            meta.s += location.r; //残差之和
            ++meta.n;
        }

        std::vector<Defender> defenders(nr_leaf*nr_field); //记录每一个叶节点的每一维特征
        std::vector<Defender> defenders_sparse(nr_leaf*nr_sparse_field);
        // 针对每一个叶节点

        for(uint32_t f = 0; f < nr_leaf; ++f){

            Meta const &meta = metas0[f]; // 叶子节点

            double const ese = meta.s*meta.s/static_cast<double>(meta.n); //该叶子节点的ese

            for(uint32_t j = 0; j < nr_field; ++j)
                defenders[f*nr_field+j].ese = ese;

            for(uint32_t j = 0; j < nr_sparse_field; ++j)
                defenders_sparse[f*nr_sparse_field+j].ese = ese;
        }

        std::vector<Defender> defenders_inv = defenders;

        std::thread thread_f(scan, std::ref(prob), std::ref(locations),
                std::ref(metas0), std::ref(defenders), offset, true);
        std::thread thread_b(scan, std::ref(prob), std::ref(locations),
                std::ref(metas0), std::ref(defenders_inv), offset, false);
        scan_sparse(prob, locations, metas0, defenders_sparse, offset, true);
        thread_f.join();
        thread_b.join();

        // 找出最佳的ese，scan里是每个字段的最佳ese，这里是所有字段的最佳ese，赋值给相应的tnode
        for(uint32_t f = 0; f < nr_leaf; ++f){
            // 对于每一个叶节点都找到最好的划分
            Meta const &meta = metas0[f];
            double best_ese = meta.s*meta.s/static_cast<double>(meta.n);

            TreeNode &tnode = tnodes[f+offset];
            for(uint32_t j = 0; j < nr_field; ++j){

                Defender defender = defenders[f*nr_field+j];//每一个叶节点都对应着所有的特征

                if(defender.ese > best_ese)
                {
                    best_ese = defender.ese;
                    tnode.feature = j;
                    tnode.threshold = defender.threshold;
                }

                defender = defenders_inv[f*nr_field+j];
                if(defender.ese > best_ese)
                {
                    best_ese = defender.ese;
                    tnode.feature = j;
                    tnode.threshold = defender.threshold;
                }
            }
            for(uint32_t j = 0; j < nr_sparse_field; ++j)
            {
                Defender defender = defenders_sparse[f*nr_sparse_field+j];
                if(defender.ese > best_ese)
                {
                    best_ese = defender.ese;
                    tnode.feature = nr_field + j;
                    tnode.threshold = defender.threshold;
                }
            }
        }

        // 把每个instance都分配给树里的一个叶节点下
        #pragma omp parallel for schedule(static)
        for(uint32_t i = 0; i < nr_instance; ++i){

            Location &location = locations[i];
            if(location.shrinked)
                continue;

            uint32_t &tnode_idx = location.tnode_idx;
            TreeNode &tnode = tnodes[tnode_idx];
            if(tnode.feature == -1){
                location.shrinked = true;
            }else if(static_cast<uint32_t>(tnode.feature) < nr_field){

                if(prob.Z[tnode.feature][i].v < tnode.threshold)
                    tnode_idx = 2*tnode_idx; 
                else
                    tnode_idx = 2*tnode_idx+1; 
            }else{
                uint32_t const target_feature = static_cast<uint32_t>(tnode.feature-nr_field);
                bool is_one = false;
                for(uint64_t p = prob.SJP[i]; p < prob.SJP[i+1]; ++p) 
                {
                    if(prob.SJ[p] == target_feature)
                    {
                        is_one = true;
                        break;
                    }
                }
                if(!is_one)
                    tnode_idx = 2*tnode_idx; 
                else
                    tnode_idx = 2*tnode_idx+1; 
            }
        }
    }

    // 用于计算gamma
    std::vector<std::pair<double, double>> 
        tmp(max_tnodes, std::make_pair(0, 0));
    for(uint32_t i = 0; i < nr_instance; ++i)
    {
        float const r = locations[i].r;
        uint32_t const tnode_idx = locations[i].tnode_idx;
        tmp[tnode_idx].first += r;
        tmp[tnode_idx].second += fabs(r)*(1-fabs(r));
    }

    for(uint32_t tnode_idx = 1; tnode_idx <= max_tnodes; ++tnode_idx)
    {
        double a, b;
        std::tie(a, b) = tmp[tnode_idx];
        tnodes[tnode_idx].gamma = (b <= 1e-12)? 0 : static_cast<float>(a/b);
    }

#pragma omp parallel for schedule(static)
    for(uint32_t i = 0; i < nr_instance; ++i)
        F1[i] = tnodes[locations[i].tnode_idx].gamma;// 重新更新F1的值
}

在参考文献A simple GBDT in Python中提供了Python实现的GBDT的版本。