The Ten Data Mining Algorithms Data Scientists Use Most

The latest KDnuggets Poll asked: "Which methods/algorithms you used in the past 12 months for an actual Data Science-related application?"

Here are the results, based on 844 voters.

The top 10 algorithms and their share of voters are:

Fig. 1: Top 10 algorithms used by Data Scientists. See full table of all algorithms at the end of the post.

The average respondent used 8.1 algorithms, a big increase vs a similar poll in 2011.

Comparing with the 2011 poll, "Algorithms for data analysis / data mining", we note that the top methods are still Regression, Clustering, Decision Trees/Rules, and Visualization. The biggest relative increases, measured by (pct2016 / pct2011 - 1), are for

  • Boosting, up 40% to 32.8% share in 2016 from 23.5% share in 2011
  • Text Mining, up 30% to 35.9% from 27.7%
  • Visualization, up 27% to 48.7% from 38.3%
  • Time series/Sequence analysis, up 25% to 37.0% from 29.6%
  • Anomaly/Deviation detection, up 19% to 19.5% from 16.4%
  • Ensemble methods, up 19% to 33.6% from 28.3%
  • SVM, up 18% to 33.6% from 28.6%
  • Regression, up 16% to 67.1% from 57.9%
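The relative-increase measure used in the list above, (pct2016 / pct2011 - 1), can be checked with a few lines of Python. This is only a minimal sketch: the function name `relative_change` and the dictionary layout are illustrative assumptions, and the shares are copied from four of the bullets above.

```python
def relative_change(pct_2016: float, pct_2011: float) -> float:
    """Relative change in share of voters: pct2016 / pct2011 - 1."""
    return pct_2016 / pct_2011 - 1.0


# (2016 share, 2011 share) in percent, copied from the list above.
shares = {
    "Boosting": (32.8, 23.5),
    "Text Mining": (35.9, 27.7),
    "Visualization": (48.7, 38.3),
    "Regression": (67.1, 57.9),
}

changes = {name: relative_change(new, old) for name, (new, old) in shares.items()}

# Print the algorithms from the largest to the smallest relative increase.
for name, change in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {change:+.0%}")
```

Because the measure is relative, a method with a small 2011 share can show a large percentage change from a modest absolute shift.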

Most popular among new options added in 2016 are

  • K-nearest neighbors, 46% share
  • PCA, 43%
  • Random Forests, 38%
  • Optimization, 24%
  • Neural networks - Deep Learning, 19%
  • Singular Value Decomposition, 16%

The biggest declines are for

  • Association rules, down 47% to 15.3% from 28.6%
  • Uplift modeling, down 36% to 3.1% from 4.8% (that is a surprise, given strong results published)
  • Factor Analysis, down 24% to 14.2% from 18.6%
  • Survival Analysis, down 15% to 7.9% from 9.3%

The following table shows usage of the different algorithm types (Supervised, Unsupervised, Meta, and Other) by employment type. We excluded the NA (4.5%) and Other (3%) employment types.

Table 1: Algorithm usage by Employment Type

| Employment Type | % Voters | Avg Num Algorithms Used | % Used Supervised | % Used Unsupervised | % Used Meta | % Used Other Methods |
|---|---|---|---|---|---|---|
| Industry | 59% | 8.4 | 94% | 81% | 55% | 83% |
| Government/Non-profit | 4.1% | 9.5 | 91% | 89% | 49% | 89% |
| Student | 16% | 8.1 | 94% | 76% | 47% | 77% |
| Academia | 12% | 7.2 | 95% | 81% | 44% | 77% |
| All | | 8.3 | 94% | 82% | 48% | 81% |

We note that almost everyone uses supervised learning algorithms. Government and Industry Data Scientists used more different types of algorithms than students or academic researchers, and Industry Data Scientists were more likely to use Meta-algorithms.

Next, we analyzed the usage of top 10 algorithms + Deep Learning by employment type.

Table 2: Top 10 Algorithms + Deep Learning usage by Employment Type

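| Algorithm | Industry | Government/Non-profit | Academia | Student | All |
|---|---|---|---|---|---|
| Regression | 71% | 63% | 51% | 64% | 67% |
| Clustering | 58% | 63% | 51% | 58% | 57% |
| Decision Trees/Rules | 59% | 63% | 38% | 57% | 55% |
| Visualization | 55% | 71% | 28% | 47% | 49% |
| K-NN | 46% | 54% | 48% | 47% | 46% |
| PCA | 43% | 57% | 48% | 40% | 43% |
| Statistics | 47% | 49% | 37% | 36% | 43% |
| Random Forests | 40% | 40% | 29% | 36% | 38% |
| Time series | 42% | 54% | 26% | 24% | 37% |
| Text Mining | 36% | 40% | 33% | 38% | 36% |
| Deep Learning | 18% | 9% | 24% | 19% | 19% |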

To make the differences easier to see, we compute the algorithm bias for a particular employment type relative to average algorithm usage as Bias(Alg,Type)=Usage(Alg,Type)/Usage(Alg,All) - 1.
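As a minimal sketch of the bias formula above, the snippet below applies it to the Deep Learning row of Table 2. The function name `usage_bias` and the dictionary layout are illustrative assumptions; the percentages are the rounded values from the table, so the results are approximate.

```python
def usage_bias(usage_type: float, usage_all: float) -> float:
    """Bias(Alg, Type) = Usage(Alg, Type) / Usage(Alg, All) - 1."""
    return usage_type / usage_all - 1.0


# Deep Learning usage (percent) by employment type, from Table 2.
deep_learning = {
    "Industry": 18,
    "Government/Non-profit": 9,
    "Academia": 24,
    "Student": 19,
}
deep_learning_all = 19  # the "All" column for the same row

bias = {
    employment: usage_bias(pct, deep_learning_all)
    for employment, pct in deep_learning.items()
}

for employment, value in bias.items():
    print(f"{employment}: {value:+.2f}")
```

A positive bias means the group uses the algorithm more than the average respondent, a negative bias means less, and zero means the group matches the overall share.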

Fig. 2: Algorithm usage bias by Employment.

We note that Industry Data Scientists are more likely to use Regression, Visualization, Statistics, Random Forests, and Time Series. Government/non-profit are more likely to use Visualization, PCA, and Time Series. Academic researchers are more likely to use PCA and Deep Learning. Students generally use fewer algorithms, but do more text mining and Deep Learning.

Next, we look at regional participation, which was representative of overall KDnuggets visitors.

Regional distribution of poll participants:

  • US/Canada, 40%
  • Europe, 32%
  • Asia, 18%
  • Latin America, 5.0%
  • Africa/Middle East, 3.4%
  • Australia/NZ, 2.2%

As in the 2011 poll, we combined Industry/Government into one group and Academic researchers/Students into a second group, and computed the "affinity" of each algorithm to Industry/Gov as

Affinity(Alg) = [N(Alg,Ind_Gov) / N(Alg,Aca_Stu)] / [N(Ind_Gov) / N(Aca_Stu)] - 1

Thus an algorithm with affinity 0 is used equally in Industry/Government and by academic researchers or students. The higher the IG affinity, the more "industrial" the algorithm; the lower it is, the more "academic" the algorithm.
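The affinity definition above can be sketched in Python as follows. The respondent counts and the algorithm names A, B, and C are hypothetical, chosen only to illustrate the zero, positive, and negative cases; they are not poll data, and the function name `industry_affinity` is likewise an assumption.

```python
def industry_affinity(n_alg_ig: int, n_alg_as: int, n_ig: int, n_as: int) -> float:
    """Affinity of an algorithm to Industry/Government.

    n_alg_ig: Industry/Government respondents who used the algorithm
    n_alg_as: Academia/Student respondents who used the algorithm
    n_ig, n_as: total respondents in each group
    """
    return (n_alg_ig / n_alg_as) / (n_ig / n_as) - 1.0


# Hypothetical group sizes (not poll data).
N_IG, N_AS = 600, 300

# Hypothetical (Industry/Gov users, Academia/Student users) per algorithm.
examples = {
    "A": (120, 60),  # same usage rate in both groups
    "B": (90, 15),   # used mostly in Industry/Government
    "C": (60, 60),   # usage rate twice as high in Academia/Students
}

affinities = {
    name: industry_affinity(ig, aca, N_IG, N_AS)
    for name, (ig, aca) in examples.items()
}

for name, value in affinities.items():
    print(f"Algorithm {name}: affinity {value:+.2f}")
```

Dividing by N(Ind_Gov) / N(Aca_Stu) corrects for the different group sizes, so the measure compares usage rates rather than raw counts.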

The most "Industrial Algorithms" were:

  • Uplift modeling, 2.01
  • Anomaly Detection, 1.61
  • Survival Analysis, 1.39
  • Factor Analysis, 0.83
  • Time series/Sequences, 0.69
  • Association Rules, 0.5

While uplift modeling was again the most "industrial" algorithm, the surprising finding is that it is used by so few respondents: only 3.1%, the lowest share of any algorithm in this poll.

The most academic algorithms were

  • Neural networks - regular, -0.35
  • Naive Bayes, -0.35
  • SVM, -0.24
  • Deep Learning, -0.19
  • EM, -0.17

The next figure shows all the algorithms and their Industry/Academic affinity.

Fig. 3. KDnuggets Poll: Top Algorithms used by Data Scientists: Industry vs Academia

Table 3: KDnuggets 2016 Poll: Algorithms Used by Data Scientists

The next table has the details on the algorithms, with the following columns:

  • N: Rank according to share of usage
  • Algorithm: algorithm name,
  • Type: S - Supervised, U - Unsupervised, M - Meta, Z - Other,
  • % respondents who used this algorithm in 2016 Poll
  • % respondents who used this algorithm in 2011 Poll
  • change (%2016 / %2011 - 1), and
  • Industry affinity as explained above.

Table 4: KDnuggets 2016 Poll: Algorithms Used by Data Scientists

| N | Algorithm | Type | 2016 % used | 2011 % used | % Change | Industry Affinity |
|---|---|---|---|---|---|---|
| 1 | Regression | S | 67% | 58% | 16% | 0.21 |
| 2 | Clustering | U | 57% | 52% | 8.7% | 0.05 |
| 3 | Decision Trees/Rules | S | 55% | 60% | -7.3% | 0.21 |
| 4 | Visualization | Z | 49% | 38% | 27% | 0.44 |
| 5 | K-nearest neighbors | S | 46% | | | 0.32 |
| 6 | PCA | U | 43% | | | 0.02 |
| 7 | Statistics | Z | 43% | 48% | -11.0% | 1.39 |
| 8 | Random Forests | S | 38% | | | 0.22 |
| 9 | Time series/Sequence analysis | Z | 37% | 30% | 25.0% | 0.69 |
| 10 | Text Mining | Z | 36% | 28% | 29.8% | 0.01 |
| 11 | Ensemble methods | M | 34% | 28% | 18.9% | -0.17 |
| 12 | SVM | S | 34% | 29% | 17.6% | -0.24 |
| 13 | Boosting | M | 33% | 23% | 40% | 0.24 |
| 14 | Neural networks - regular | S | 24% | 27% | -10.5% | -0.35 |
| 15 | Optimization | Z | 24% | | | 0.07 |
| 16 | Naive Bayes | S | 24% | 22% | 8.9% | -0.02 |
| 17 | Bagging | M | 22% | 20% | 8.8% | 0.02 |
| 18 | Anomaly/Deviation detection | Z | 20% | 16% | 19% | 1.61 |
| 19 | Neural networks - Deep Learning | S | 19% | | | -0.35 |
| 20 | Singular Value Decomposition | U | 16% | | | 0.29 |
| 21 | Association rules | Z | 15% | 29% | -47% | 0.50 |
| 22 | Graph / Link / Social Network Analysis | Z | 15% | 14% | 8.0% | -0.08 |
| 23 | Factor Analysis | U | 14% | 19% | -23.8% | 0.14 |
| 24 | Bayesian networks | S | 13% | | | -0.10 |
| 25 | Genetic algorithms | Z | 8.8% | 9.3% | -6.0% | 0.83 |
| 26 | Survival Analysis | Z | 7.9% | 9.3% | -14.9% | -0.15 |
| 27 | EM | U | 6.6% | | | -0.19 |
| 28 | Other methods | Z | 4.6% | | | -0.06 |
| 29 | Uplift modeling | S | 3.1% | 4.8% | -36.1% | 2.01 |

Originally published on the WeChat official account 智能计算时代 (intelligentinterconn)

Original publication date: 2016-09-28
