Q: How are feature_importances in RandomForestClassifier determined?

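Stack Overflow user

Asked 2013-04-04 19:53:04

3 answers, 71.4K views, 0 followers, 134 votes

I have a classification task with a time series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result, I would like to find out which attributes/dates contribute to the result, and to what extent. For that I am simply using feature_importances_, which works well for me.

However, I would like to know how these values are calculated and which measure/algorithm is used. Unfortunately, I could not find any documentation on this topic.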

random-forest

feature-selection

scikit-learn

3 Answers

Stack Overflow user

Answered 2013-04-05 05:16:55

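There are indeed several ways to get feature "importances". As is often the case, there is no strict consensus about what this word means.

In scikit-learn, we implement the importance as described in [1]. It is sometimes called "Gini importance" or "mean decrease impurity" and is defined as the total decrease in node impurity (weighted by the probability of reaching that node, which is approximated by the proportion of samples reaching that node), averaged over all trees of the ensemble.

In the literature or in some other packages, you can also find feature importances implemented as the "mean decrease accuracy". Basically, the idea is to measure the decrease in accuracy on out-of-bag (OOB) data when you randomly permute the values of that feature. If the decrease is low, then the feature is not important, and vice versa.

(Note that both algorithms are available in the randomForest R package.)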
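As a rough illustration of the "mean decrease accuracy" idea, here is a minimal sketch using scikit-learn's `permutation_importance`. Note the assumptions: it permutes features on a held-out test split rather than on OOB data, and the iris dataset, the split, and the hyperparameters are chosen only for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Illustrative data only; the question's 23-point time series is not available here.
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0, stratify=data.target
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Shuffle one feature at a time and record the drop in accuracy.
# Assumption: a held-out split stands in for the OOB samples mentioned above.
result = permutation_importance(
    forest, X_test, y_test, scoring="accuracy", n_repeats=10, random_state=0
)
mean_decrease_accuracy = result.importances_mean

for name, score in zip(data.feature_names, mean_decrease_accuracy):
    print(f"{name}: {score:.4f}")
```

A small or near-zero value means that shuffling the feature barely hurts accuracy, so the model does not rely on it much.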

[1]: Breiman, Friedman, "Classification and regression trees", 1984.

Votes: 165

Stack Overflow user

Answered 2018-08-06 15:59:50

Code:

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = DecisionTreeClassifier()
clf.fit(X, y)

Decision tree plot:

[Figure: plot of the fitted decision tree]

We get

compute_feature_importance:[0. ,0.01333333,0.06405596,0.92261071]

Checking the source code:

cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    cdef Node* left
    cdef Node* right
    cdef Node* nodes = self.nodes
    cdef Node* node = nodes
    cdef Node* end_node = node + self.node_count

    cdef double normalizer = 0.

    cdef np.ndarray[np.float64_t, ndim=1] importances
    importances = np.zeros((self.n_features,))
    cdef DOUBLE_t* importance_data = <DOUBLE_t*>importances.data

    with nogil:
        while node != end_node:
            if node.left_child != _TREE_LEAF:
                # ... and node.right_child != _TREE_LEAF:
                left = &nodes[node.left_child]
                right = &nodes[node.right_child]

                importance_data[node.feature] += (
                    node.weighted_n_node_samples * node.impurity -
                    left.weighted_n_node_samples * left.impurity -
                    right.weighted_n_node_samples * right.impurity)
            node += 1

    importances /= nodes[0].weighted_n_node_samples

    if normalize:
        normalizer = np.sum(importances)

        if normalizer > 0.0:
            # Avoid dividing by zero (e.g., when root is pure)
            importances /= normalizer

    return importances
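For readers who prefer plain Python, the same loop can be sketched against the public arrays of a fitted tree's `tree_` attribute. This is only an illustrative port: the helper name `manual_feature_importances` is hypothetical, and the leaf marker `-1` is assumed to match scikit-learn's `_TREE_LEAF` constant.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def manual_feature_importances(fitted_tree, normalize=True):
    """Hypothetical helper mirroring the Cython loop above in pure Python."""
    tree = fitted_tree.tree_
    importances = np.zeros(tree.n_features)

    for node in range(tree.node_count):
        left = tree.children_left[node]
        right = tree.children_right[node]
        if left == -1:  # assumed leaf marker (_TREE_LEAF); leaves do not split
            continue
        # Weighted impurity of the node minus that of its two children
        importances[tree.feature[node]] += (
            tree.weighted_n_node_samples[node] * tree.impurity[node]
            - tree.weighted_n_node_samples[left] * tree.impurity[left]
            - tree.weighted_n_node_samples[right] * tree.impurity[right]
        )

    # Divide by the (weighted) number of samples at the root
    importances /= tree.weighted_n_node_samples[0]

    if normalize:
        total = importances.sum()
        if total > 0.0:  # avoid dividing by zero when the root is pure
            importances /= total
    return importances


X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
manual = manual_feature_importances(clf)
print(manual)
print(np.allclose(manual, clf.feature_importances_))
```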

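Trying to compute the feature importances by hand, using the sample counts and impurities shown in the tree plot: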

print("sepal length (cm)",0)
print("sepal width (cm)",(3*0.444-(0+0)))
print("petal length (cm)",(54* 0.168 - (48*0.041+6*0.444)) +(46*0.043 -(0+3*0.444)) + (3*0.444-(0+0)))
print("petal width (cm)",(150* 0.667 - (0+100*0.5)) +(100*0.5-(54*0.168+46*0.043))+(6*0.444 -(0+3*0.444)) + (48*0.041-(0+0)))

We get feature_importance: np.array([0, 1.332, 6.418, 92.30]).

After normalization, we get array([0., 0.01331334, 0.06414793, 0.92253873]), which matches clf.feature_importances_ up to the rounding of the impurity values read off the plot.
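The normalization step itself is just a division by the sum of the raw values. A minimal check, using the rounded hand-computed numbers from this answer:

```python
import numpy as np

# Raw (unnormalized) values computed by hand above; they are based on
# impurities rounded to three decimals, so the result is approximate.
raw = np.array([0.0, 1.332, 6.418, 92.30])

normalized = raw / raw.sum()
print(normalized)
```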

Note that all classes are assumed to have a weight of one.

Votes: 2

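Stack Overflow user

Answered 2018-08-14 03:41:00

For those looking for a reference to scikit-learn's documentation on this topic, or a reference to the answer by @GillesLouppe:

In RandomForestClassifier, the estimators_ attribute is a list of DecisionTreeClassifier objects (as mentioned in the documentation). To compute feature_importances_ for the RandomForestClassifier, scikit-learn's source code averages the feature_importances_ attributes of all estimators (all DecisionTreeClassifier instances) in the ensemble.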
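This averaging can be checked with a short sketch. The dataset and hyperparameters are illustrative assumptions, and the comparison relies on every tree in this particular forest having at least one split (the details of how degenerate trees are handled may differ between scikit-learn versions).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One row of (already normalized) importances per tree in the ensemble
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over trees and compare with the forest-level attribute
averaged = per_tree.mean(axis=0)
print(averaged)
print(np.allclose(averaged, forest.feature_importances_))
```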

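The documentation for DecisionTreeClassifier mentions that "The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [1]."

Here is a direct link for more information on variable and Gini importance, as provided by scikit-learn's reference below.

[1] L. Breiman and A. Cutler, "Random Forests", http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Votes: 1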

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/15810339
