A Getting-Started Roundup for Machine Learning

量化投资与机器学习 (Quantitative Investing and Machine Learning), WeChat official account

Published 2018-01-29 10:57:52

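Machine learning

Machine learning is an interdisciplinary field that has risen over the past twenty-odd years, drawing on probability theory, statistics, approximation theory, convex analysis, computational complexity theory, and other disciplines. Machine learning theory is mainly about designing and analyzing algorithms that let computers "learn" automatically. A machine learning algorithm automatically analyzes data to extract regularities and then uses those regularities to make predictions on unseen data. Because learning algorithms draw heavily on statistical theory, machine learning is especially closely tied to statistical inference and is also called statistical learning theory. On the algorithm-design side, machine learning theory focuses on learning algorithms that are practical and effective.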

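The following tries to map out the scope of machine learning from the micro level to the macro level: one concrete algorithm, the finer subdivisions of the field, practical application scenarios, and the relationship to other fields.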

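Figure 1: A machine learning example: the NLTK supervised-learning workflow diagram (source: http://www.nltk.org/book/ch06.html)

Figure 2: Overview map of machine learning by Yaser Abu-Mostafa (Caltech) (source: Map of Machine Learning (Abu-Mostafa))

Figure 3: Machine learning in practice: choosing a machine learning algorithm in Python scikit-learn, by Nishant Chandra (source: In pursuit of happiness!: Picking the right Machine Learning Algorithm)

Figure 4: Machine learning and other disciplines: the data science metro map by Swami Chandrasekaran (source: Becoming a Data Scientist)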

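Getting-started guides

They fall roughly into three categories: first-hand beginner experience, hands-on notes, and expert guides.

机器学习入门者学习指南 (A Study Guide for Machine Learning Beginners) @果壳网 (Guokr) (2013), by 白马 -- [beginner experience] The first-hand experience of a graduate-student beginner
有没有做机器学习的哥们？能否介绍一下是如何起步的 (Anyone here doing machine learning? Could you describe how you got started?) @ourcoders -- [beginner experience] The first-hand experience of a graduate-student beginner; be sure to read reyoung's advice
tornadomeet's machine learning notes (2013) -- [hands-on notes] A top student's study notes; see how a peer masters "machine learning" step by step
Machine Learning Roadmap: Your Self-Study Guide to Machine Learning (2014) Jason Brownlee -- [expert guide] It is in English but very easy to read, and it covers Beginner, Novice, Intermediate, and Advanced readers.
- A Tour of Machine Learning Algorithms (2013): this article on categorizing machine learning algorithms is also very good
- Best Machine Learning Resources for Getting Started (2013): this one has a Chinese translation, 机器学习的最佳入门学习资源, by translator programmer_lin @伯乐在线 (Jobbole)
A few suggestions from the compiler (门主)
- You need both a mathematical foundation and programming practice
- Don't be afraid of English material: most of what you don't understand is technical terminology, and later on both writing papers and reading documentation will mostly be in English
- [a small ad] Subscribe to 机器学习日报 (Machine Learning Daily) to follow the field's trending material.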

More guides

机器学习该怎么入门 (How should one get started with machine learning?) @知乎 (Zhihu) (2014)
What's the easiest way to learn machine learning @quora (2013)
What is the best way to study machine learning @quora (2012)
Is there any roadmap for learning Machine Learning (ML) and its related courses at CMU (2014)

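Course resources

The courses by Tom Mitchell and Andrew Ng are both well suited to beginners.

Introductory courses

2011 Tom Mitchell (CMU) Machine Learning

Original English videos and lecture-slide PDFs. His book Machine Learning is used as a textbook in many courses, and a Chinese edition exists.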

Decision Trees
Probability and Estimation
Naive Bayes
Logistic Regression
Linear Regression
Practical Issues: Feature selection, Overfitting ...
Graphical models: Bayes networks, EM, Mixture of Gaussians clustering ...
Computational Learning Theory: PAC Learning, Mistake bounds ...
Semi-Supervised Learning
Hidden Markov Models
Neural Networks
Learning Representations: PCA, Deep belief networks, ICA, CCA ...
Kernel Methods and SVM
Active Learning
Reinforcement Learning

The above is a selection of the lecture titles.

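2014 Andrew Ng (Stanford) Machine Learning

Original English videos. This course was designed for self-study; it is free and offers a certificate of completion. "The instructor explains things in an accessible way, so there is no need to worry much about the math. The assignments are also well suited to beginners: each comes as a prepared program skeleton with an assignment guide, and you just fill in the parts you are asked to complete by following the guide." (see 白马's getting-started guide) "I recommend enrolling, following the lectures, and doing the exercises and the final exam. (If you only watch and never do, you won't learn anything.)" (see reyoung's advice)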

Introduction (Week 1)
Linear Regression with One Variable (Week 1)
Linear Algebra Review (Week 1, Optional)
Linear Regression with Multiple Variables (Week 2)
Octave Tutorial (Week 2)
Logistic Regression (Week 3)
Regularization (Week 3)
Neural Networks: Representation (Week 4)
Neural Networks: Learning (Week 5)
Advice for Applying Machine Learning (Week 6)
Machine Learning System Design (Week 6)
Support Vector Machines (Week 7)
Clustering (Week 8)
Dimensionality Reduction (Week 8)
Anomaly Detection (Week 9)
Recommender Systems (Week 9)
Large Scale Machine Learning (Week 10)
Application Example: Photo OCR
Conclusion

Advanced courses

2013 Yaser Abu-Mostafa (Caltech) Learning from Data -- the content is better suited to advanced study. Course videos, slide PDFs @Caltech

The Learning Problem
Is Learning Feasible?
The Linear Model I
Error and Noise
Training versus Testing
Theory of Generalization
The VC Dimension
Bias-Variance Tradeoff
The Linear Model II
Neural Networks
Overfitting
Regularization
Validation
Support Vector Machines
Kernel Methods
Radial Basis Functions
Three Learning Principles
Epilogue

2014 林軒田 (Hsuan-Tien Lin, National Taiwan University) 機器學習基石 (Machine Learning Foundations) -- the content is better suited to advanced study; the lectures are taught in Chinese. Course homepage

When Can Machines Learn? [何時可以使用機器學習] -- The Learning Problem [機器學習問題] -- Learning to Answer Yes/No [二元分類] -- Types of Learning [各式機器學習問題] -- Feasibility of Learning [機器學習的可行性]

Why Can Machines Learn? [為什麼機器可以學習] -- Training versus Testing [訓練與測試] -- Theory of Generalization [舉一反三的一般化理論] -- The VC Dimension [VC 維度] -- Noise and Error [雜訊一錯誤]

How Can Machines Learn? [機器可以怎麼樣學習] -- Linear Regression [線性迴歸] -- Linear `Soft' Classification [軟性的線性分類] -- Linear Classification beyond Yes/No [二元分類以外的分類問題] -- Nonlinear Transformation [非線性轉換]

How Can Machines Learn Better? [機器可以怎麼樣學得更好] -- Hazard of Overfitting [過度訓練的危險] -- Preventing Overfitting I: Regularization [避免過度訓練一：控制調適] -- Preventing Overfitting II: Validation [避免過度訓練二：自我檢測] -- Three Learning Principles [三個機器學習的重要原則]

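More options

2008 Andrew Ng CS229 Machine Learning -- These videos are a few years old, and the lecturer has become quite prominent in the last couple of years. The basic methods have not changed much, though, so the downloadable lecture PDFs are a plus. Videos with Chinese subtitles @网易公开课 (NetEase Open Courses) | English videos @youtube | Lecture PDFs @Stanford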

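Lecture 1. Motivation and applications of machine learning
Lecture 2. An application of supervised learning; gradient descent
Lecture 3. The concepts of underfitting and overfitting
Lecture 4. Newton's method
Lecture 5. Generative learning algorithms
Lecture 6. The naive Bayes algorithm
Lecture 7. The optimal margin classifier problem
Lecture 8. The sequential minimal optimization algorithm
Lecture 9. Empirical risk minimization
Lecture 10. Feature selection
Lecture 11. Bayesian statistics and regularization
Lecture 12. The K-means algorithm
Lecture 13. Gaussian mixture models
Lecture 14. Principal component analysis
Lecture 15. Singular value decomposition
Lecture 16. Markov decision processes
Lecture 17. Discretization and the curse of dimensionality
Lecture 18. Linear quadratic regulation
Lecture 19. Differential dynamic programming
Lecture 20. Policy search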

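2012 余凯 (Kai Yu, Baidu) and 张潼 (Tong Zhang, Rutgers) Machine Learning open course -- the content is better suited to advanced study. Course homepage @百度文库 (Baidu Wenku) | Slide PDFs @龙星计划 (Dragon Star Program)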

Session 1: Introduction to ML and review of linear algebra, probability, statistics (kai)
Session 2: linear model (tong)
Session 3: overfitting and regularization (tong)
Session 4: linear classification (kai)
Session 5: basis expansion and kernel methods (kai)
Session 6: model selection and evaluation (kai)
Session 7: model combination (tong)
Session 8: boosting and bagging (tong)
Session 9: overview of learning theory (tong)
Session 10: optimization in machine learning (tong)
Session 11: online learning (tong)
Session 12: sparsity models (tong)
Session 13: introduction to graphical models (kai)
Session 14: structured learning (kai)
Session 15: feature learning and deep learning (kai)
Session 16: transfer learning and semi supervised learning (kai)
Session 17: matrix factorization and recommendations (kai)
Session 18: learning on images (kai)
Session 19: learning on the web (tong)

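Forums and websites

Chinese

我爱机器学习 (I Love Machine Learning)

http://www.mitbbs.com/bbsdoc/DataSciences.html MITBBS - Computers and Networks - Data Science board

机器学习小组 (Machine Learning Group): 果壳 (Guokr) > Machine Learning Group

http://cos.name/cn/forum/22 统计之都 (Capital of Statistics) » World of Statistics » Data Mining and Machine Learning

北邮人论坛 (BYR Forum) >> Academics and Technology >> Machine Learning and Data Mining

English

josephmisiti/awesome-machine-learning · GitHub: a comprehensive collection of machine learning resources

Machine Learning Video Library: Caltech's library of machine learning video tutorials, one video per topic

Analytics, Data Mining, and Data Science: a well-known data mining site

http://www.datasciencecentral.com/ The Data Science Central website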

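Miscellany

Some good material that you may not understand before getting started; it only sinks in once you have made some progress.

The difference between machine learning and data mining

Machine learning focuses on making predictions from known properties learned from training data
Data mining focuses on discovering unknown properties in the data

Dan Levin, What is the difference between statistics, machine learning, AI and data mining?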

If there are up to 3 variables, it is statistics.
If the problem is NP-complete, it is machine learning.
If the problem is PSPACE-complete, it is AI.
If you don't know what is PSPACE-complete, it is data mining.

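A few high-level overviews of the machine learning field (see the original post)

The Discipline of Machine Learning: what Tom Mitchell wrote to the university president back when he was making the case for a machine learning department at CMU.
A Few Useful Things to Know about Machine Learning: big-picture lessons from Professor Pedro Domingos. Many of the concepts may not be clear when you are starting out, so be sure to read it again after finishing an open course.

A few good books

Dr. Li Hang's 《统计学习方法》 (Statistical Learning Methods).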

1. Math foundations

The math needed for machine learning mainly consists of multivariable calculus and linear algebra.

Calculus: Single Variable | Calculus One (optional)
Multivariable Calculus
Linear Algebra

2. Statistics foundations

Introduction to Statistics: Descriptive Statistics
Probabilistic Systems Analysis and Applied Probability | Probability (optional)
Introduction to Statistics: Inference

3. Programming foundations

Programming for Everybody (Python)
DataCamp: Learn R with R tutorials and coding challenges (R)
Introduction to Computer Science: Build a Search Engine & a Social Network

4. Machine learning

Statistical Learning (R)
Machine Learning
機器學習基石 (Machine Learning Foundations)
機器學習技法 (Machine Learning Techniques)

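Below are some recent popular-math books written for general readers, ordered from easy to more demanding. Beyond conveying the beauty of mathematics, they serve, more importantly, as a daily dose of motivation for studying, because all of them are fairly easy reads.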

1. 《数学之美》 (The Beauty of Mathematics), by 吴军 (Wu Jun)
2. A Mathematician's Lament (Chinese edition: 数学家的叹息), by Paul Lockhart
3. Think Stats: Probability and Statistics for Programmers (Chinese edition: 统计思维：程序员数学之概率统计), by Allen B. Downey
4. A History of Mathematics (Chinese edition: 数学史), by Carl B. Boyer
5. Journey Through Genius: The Great Theorems of Mathematics (Chinese edition: 天才引导的历程：数学中的伟大定理), by William Dunham
6. The Mathematical Experience (Chinese edition: 数学经验), by Philip J. Davis and Reuben Hersh
7. Proofs from THE BOOK (Chinese edition: 数学天书中的证明), by Martin Aigner and Günter M. Ziegler
8. Proofs and Refutations: The Logic of Mathematical Discovery (Chinese edition: 证明与反驳－数学发现的逻辑), by Imre Lakatos

1. Python/C++/R/Java - you will probably want to learn all of these languages at some point if you want a job in machine-learning. Python's NumPy and SciPy libraries [2] are awesome because they have similar functionality to MATLAB, but can be easily integrated into a web service and also used in Hadoop (see below). C++ will be needed to speed code up. R [3] is great for statistics and plots, and Hadoop [4] is written in Java, so you may need to implement mappers and reducers in Java (although you could use a scripting language via Hadoop streaming [5]).

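First, get familiar with these four languages. Python has many open-source libraries; look at NumPy and SciPy, both of which fit well into web development and Hadoop. C++ makes your code run faster, and R is a very good statistics tool. To use Hadoop well you also need to know Java and how to implement map-reduce.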

2. Probability and Statistics: A good portion of learning algorithms are based on this theory. Naive Bayes [6], Gaussian Mixture Models [7], Hidden Markov Models [8], to name a few. You need to have a firm understanding of Probability and Stats to understand these models. Go nuts and study measure theory [9]. Use statistics as a model evaluation metric: confusion matrices, receiver-operator curves, p-values, etc.

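I recommend 《统计学习方法》 (Statistical Learning Methods) by Li Hang, who counts as my mentor's mentor. Get to understand some of the theory, for example Bayes, SVM, CRF, HMM, decision trees, AdaBoost, and logistic regression, then take a quick look at how to do evaluation, for example P, R, and F (precision, recall, and F-measure). You can also look into hypothesis testing.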
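
As a small illustration of the P/R/F evaluation mentioned above, here is a minimal sketch in plain Python for a binary task. The labels are made up, and the helper names (confusion_counts, precision_recall_f1) are hypothetical, not taken from any library.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    counts = Counter()
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1); a ratio with a zero denominator is reported as 0.0."""
    c = confusion_counts(y_true, y_pred, positive)
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    precision = c["tp"] / predicted_pos if predicted_pos else 0.0
    recall = c["tp"] / actual_pos if actual_pos else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
p, r, f = precision_recall_f1(y_true, y_pred)
```

With these labels there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 0.75. In practice a library routine would normally be used instead; writing the counts out by hand is mainly useful for seeing where each number comes from.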

3. Applied Math + Algorithms: For discriminative models like SVMs [10], you need to have a firm understanding of algorithm theory. Even though you will probably never need to implement an SVM from scratch, it helps to understand how the algorithm works. You will need to understand subjects like convex optimization [11], gradient descent [12], quadratic programming [13], lagrange [14], partial differential equations [15], etc. Get used to looking at summations [16].

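Machine learning does, after all, demand a very, very strong mathematical foundation. I hope to understand the essence of some algorithms in depth from the start, and SVM is a good place to begin. Start there and find out what Lagrange multipliers and convex optimization are all about.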
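
To make the link between SVMs and convex optimization a little more concrete, the following is a rough sketch, not the Lagrangian dual treatment found in textbooks. It trains a linear SVM by full-batch subgradient descent on the regularized hinge loss (lam/2)*||w||^2 + mean(max(0, 1 - y*(w.x + b))), which is a convex objective. The toy points, the hyperparameter values, and the function names are all assumptions made for illustration.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on the regularized hinge loss; labels must be +1 or -1."""
    dim = len(points[0])
    n = len(points)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        # Gradient of the regularizer (lam / 2) * ||w||^2.
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:
                # Only points inside the margin contribute a hinge subgradient.
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Classify by the sign of the decision function w.x + b."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Made-up, linearly separable toy data.
points = [(2.0, 2.0), (3.0, 3.0), (2.5, 3.5), (-2.0, -2.0), (-3.0, -1.5), (-2.5, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
predictions = [predict(w, b, x) for x in points]
```

The points whose margin falls below 1 are the only ones that move the hyperplane, which is a first intuition for what support vectors are. A fixed learning rate is used here for simplicity; a decaying step size is the more usual choice for subgradient methods.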

4. Distributed Computing: Most machine learning jobs require working with large data sets these days (see Data Science) [17]. You cannot process this data on a single machine, you will have to distribute it across an entire cluster. Projects like Apache Hadoop [4] and cloud services like Amazon's EC2 [18] make this very easy and cost-effective. Although Hadoop abstracts away a lot of the hard-core, distributed computing problems, you still need to have a firm understanding of map-reduce [22], distributed file systems [19], etc. You will most likely want to check out Apache Mahout [20] and Apache Whirr [21].

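Get familiar with distributed computing. Machine learning today has to run big data across many machines, otherwise there is little point. Get familiar with Hadoop; it matters a great deal when job hunting. Companies such as Baidu all require a Hadoop background.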
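
For readers who have not met map-reduce yet, the idea can be sketched on a single machine in a few lines of Python. This is only an in-memory imitation of the three stages (map, shuffle, reduce) using word counting as the example; it does not use Hadoop, and the function names and the two sample documents are made up.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would do between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


# Two made-up "documents"; on a cluster each could live on a different machine.
documents = ["big data needs big clusters", "data beats algorithms"]
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Because each mapper call looks at one document only and each reducer call looks at one key only, both stages can be spread across machines, which is the property a framework like Hadoop exploits.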

5. Expertise in Unix Tools: Unless you are very fortunate, you are going to need to modify the format of your data sets so they can be loaded into R, Hadoop, HBase [23], etc. You can use a scripting language like python (using re) to do this but the best approach is probably just master all of the awesome unix tools that were designed for this: cat [24], grep [25], find [26], awk [27], sed [28], sort [29], cut [30], tr [31], and many more. Since all of the processing will most likely be on linux-based machine (Hadoop doesn't run on Windows I believe), you will have access to these tools. You should learn to love them and use them as much as possible. They certainly have made my life a lot easier. A great example can be found here [1].

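Get familiar with Unix tools and commands. Companies such as Baidu all work on Linux, and by now there are probably rather few companies whose services run on Windows. So you should at least know these Unix commands. I remember a Baidu interview question that asked about copying files.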
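
The quoted answer also mentions Python's re module as an alternative for reshaping data files. A minimal sketch of that route follows. The raw records, their inconsistent separators, and the helper names are invented for illustration; a real data set would need its own pattern.

```python
import csv
import io
import re

# Hypothetical raw records: fields separated by inconsistent runs of spaces, tabs, or semicolons.
raw_lines = [
    "user_1;  23\t0.75",
    "user_2 ; 31   0.10",
    "",  # blank lines are skipped
    "user_3\t\t45;0.42",
]

# One or more semicolons or whitespace characters count as a single separator.
SEPARATOR = re.compile(r"[;\s]+")


def normalize(lines):
    """Split each non-empty line into fields on the separator pattern."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rows.append(SEPARATOR.split(line))
    return rows


def to_csv(rows):
    """Serialize rows as CSV text that other tools can load."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = normalize(raw_lines)
csv_text = to_csv(rows)
```

Each record ends up as three clean fields, and csv_text holds one comma-separated line per record. For large files the Unix tools listed above are often quicker to reach for, since they stream the data without any script.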

6. Become familiar with the Hadoop sub-projects: HBase, Zookeeper [32], Hive [33], Mahout, etc. These projects can help you store/access your data, and they scale.

Machine learning is ultimately closely tied to big data, so keep an eye on the Hadoop sub-projects, such as HBase, Zookeeper, and Hive.

7. Learn about advanced signal processing techniques: feature extraction is one of the most important parts of machine-learning. If your features suck, no matter which algorithm you choose, you're going to see horrible performance. Depending on the type of problem you are trying to solve, you may be able to utilize really cool advanced signal processing algorithms like: wavelets [42], shearlets [43], curvelets [44], contourlets [45], bandlets [46]. Learn about time-frequency analysis [47], and try to apply it to your problems. If you have not read about Fourier Analysis [48] and Convolution [49], you will need to learn about this stuff too. The latter is signal processing 101 stuff though.

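This point is mainly about feature extraction. Whether the problem is classification or regression, you have to deal with feature selection and extraction. He lists some basic feature-extraction tools such as wavelets, and also says you need to master Fourier analysis, convolution, and so on. I don't know this area well; roughly, it says you need to understand signal processing, Fourier analysis and the like.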
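
As a rough idea of what a Fourier-based feature can look like, the sketch below computes a naive discrete Fourier transform in plain Python and uses the strongest frequency of a signal as a single feature. The synthetic signal, the sample rate, and the function names are assumptions made for illustration; real code would normally use an FFT routine from a numerical library rather than this O(n^2) loop.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each discrete Fourier transform coefficient (naive O(n^2) version)."""
    n = len(signal)
    magnitudes = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)
        )
        magnitudes.append(abs(coeff))
    return magnitudes


def dominant_frequency(signal, sample_rate):
    """Frequency in Hz of the strongest component, ignoring the DC bin and bins at or above Nyquist."""
    n = len(signal)
    mags = dft_magnitudes(signal)
    k = max(range(1, n // 2), key=lambda i: mags[i])
    return k * sample_rate / n


# Made-up signal: one second sampled at 64 Hz, a strong 5 Hz tone plus a weaker 12 Hz tone.
sample_rate = 64
signal = [
    math.sin(2 * math.pi * 5 * t / sample_rate)
    + 0.3 * math.sin(2 * math.pi * 12 * t / sample_rate)
    for t in range(sample_rate)
]
feature = dominant_frequency(signal, sample_rate)
```

Here the extracted feature is the 5 Hz component, since it carries the most energy. A classifier could then be trained on features like this one instead of on the raw samples.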

Finally, practice and read as much as you can. In your free time, read papers like Google Map-Reduce [34], Google File System [35], Google Big Table [36], The Unreasonable Effectiveness of Data [37], etc. There are great free machine learning books online and you should read those also. [38][39][40]. Here is an awesome course I found and re-posted on github [41]. Instead of using open source packages, code up your own, and compare the results. If you can code an SVM from scratch, you will understand the concept of support vectors, gamma, cost, hyperplanes, etc. It's easy to just load some data up and start training, the hard part is making sense of it all.

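In short, getting started with machine learning has two sides. One is studying the algorithms, which takes a very strong mathematical foundation (really, very strong); start from SVM and build understanding bit by bit. The other is learning the tools, such as the distributed-computing tools and Unix.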

This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.

Originally published: 2016-02-16. In case of infringement, contact cloudcommunity@tencent.com for removal.
