# Statistical Modeling: The Two Cultures

Abstract

1. Introduction

3. Projects in consulting

5. The use of data models

6. The limitations of data models

7. Algorithmic modeling

8. Rashomon and the multiplicity of good models

9. Occam and simplicity vs. accuracy

10. Bellman and the curse of dimensionality

11. Information from a black box

12. Final remarks

Statistical Modeling: The Two Cultures

Leo Breiman

Professor, Department of Statistics, University of California, Berkeley, California

Abstract

There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

1. INTRODUCTION

Statistics starts with data. Think of the data as being generated by a black box in which a vector of input variables x (independent variables) go in one side, and on the other side the response variables y come out. Inside the black box, nature functions to associate the predictor variables with the response variables, so the picture is like this:

y ← nature ← x


There are two goals in analyzing the data:

* Prediction. To be able to predict what the responses are going to be to future input variables;
* Information. To extract some information about how nature is associating the response variables to the input variables.

There are two different approaches toward these goals:

The Data Modeling Culture

The analysis in this culture starts with assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from

response variables = f(predictor variables, random noise, parameters)

The values of the parameters are estimated from the data and the model then used for information and/or prediction. Thus the black box is filled in like this:

y ← [linear regression, logistic regression, Cox model] ← x

Model validation. Yes-no using goodness-of-fit tests and residual examination.

Estimated culture population. 98% of all statisticians.
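The data modeling workflow described above (assume a stochastic model, estimate its parameters, then check fit and residuals) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the linear form of f, the Gaussian noise, and the names `simulate`, `fit_line`, and `r_squared` are hypothetical choices made for the example and do not come from the paper.

```python
import random

def simulate(n, slope, intercept, noise_sd, seed=0):
    """Draw n independent (x, y) pairs from y = intercept + slope * x + noise.

    Assumption: a linear f with Gaussian noise, chosen only for illustration.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Estimate the parameters by ordinary least squares (single predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(xs, ys, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# "Nature" is simulated here, so the assumed model is true by construction.
xs, ys = simulate(n=200, slope=2.0, intercept=1.0, noise_sd=1.0)
intercept_hat, slope_hat = fit_line(xs, ys)

# Validation in this culture: inspect the residuals and a goodness-of-fit summary.
residuals = [y - (intercept_hat + slope_hat * x) for x, y in zip(xs, ys)]
fit_r2 = r_squared(xs, ys, intercept_hat, slope_hat)
print(intercept_hat, slope_hat, fit_r2)
```

Because the data here really are generated by the assumed model, the fit looks good. The checks say nothing about how the same procedure behaves when the assumed model is wrong.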

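Cox regression model

proportional hazards regression

The Cox regression model, also called the proportional hazards model (Cox model for short), is a semiparametric regression model proposed by the British statistician D. R. Cox in 1972. It takes the survival outcome and the survival time as dependent variables, can analyze the effect of many factors on survival time at once, can handle data with censored survival times, and does not require estimating the form of the survival distribution. Because of these properties, the model has been widely used in medical follow-up studies since its introduction and remains the most widely used multivariate method in survival analysis.

Excerpted from the Baidu Baike entry "COX回归模型" (Cox regression model) [1]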
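For reference, the proportional hazards model is usually written in the following form. The notation is an assumption introduced here and does not appear in the quoted entry: $h(t \mid x)$ is the hazard at time $t$ for covariate vector $x$, $h_0(t)$ is a baseline hazard left unspecified (which is what makes the model semiparametric), and $\beta$ is the vector of regression coefficients.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right)
\qquad\Longrightarrow\qquad
\frac{h(t \mid x_1)}{h(t \mid x_2)} = \exp\!\left(\beta^{\top}(x_1 - x_2)\right)
```

The ratio on the right does not depend on $t$, which is the "proportional hazards" property that gives the model its name.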

The Algorithmic Modeling Culture

The analysis in this culture considers the inside of the box complex and unknown. Their approach is to find a function f(x), an algorithm that operates on x to predict the responses y. Their black box looks like this:

y ← unknown ← x, with algorithms such as decision trees and neural nets fitted to map x to y

Model validation. Measured by predictive accuracy.

Estimated culture population. 2% of statisticians, many in other fields.

In this paper I will argue that the focus in the statistical community on data models has:

* Led to irrelevant theory and questionable scientific conclusions;
* Kept statisticians from using more suitable algorithmic models;
* Prevented statisticians from working on exciting new problems.

I will also review some of the interesting new developments in algorithmic modeling in machine learning and look at applications to three data sets.
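The algorithmic modeling approach described earlier, finding a function f(x) and judging it by predictive accuracy, can be sketched as follows. This is a minimal illustration under stated assumptions: k-nearest-neighbor averaging is used only because it is short, the sine-shaped "nature" is invented for the example, and the names `make_data`, `knn_predict`, and `mean_squared_error` are hypothetical. None of these choices come from the paper.

```python
import math
import random

def make_data(n, seed):
    """Simulated 'nature': y = sin(x) + noise. The analyst does not use this form."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys

def knn_predict(train_x, train_y, x, k=5):
    """Predict y at x as the average response of the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

def mean_squared_error(predicted, actual):
    """Average squared difference between predictions and observed responses."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Separate training and test samples; accuracy is judged on the test set only.
train_x, train_y = make_data(300, seed=1)
test_x, test_y = make_data(100, seed=2)

knn_pred = [knn_predict(train_x, train_y, x) for x in test_x]
knn_mse = mean_squared_error(knn_pred, test_y)

# Naive baseline for comparison: always predict the training mean.
baseline = sum(train_y) / len(train_y)
baseline_mse = mean_squared_error([baseline] * len(test_y), test_y)

print(knn_mse, baseline_mse)
```

No stochastic model of the data mechanism is assumed here. The only validation is the error on held-out data, compared against a naive baseline.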

• Original link: https://kuaibao.qq.com/s/20180630G1JMSC00?refer=cp_1026