How do I get started with machine learning?

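I'm an undergraduate who is very interested in machine learning and would like to do research in this area. I've seen online that there are some classic machine learning books, such as Bishop's PRML, Tom Mitchell's Machine Learning, and Pattern Classification, but I don't know where to start. Which book is the easiest to understand?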

自然语言处理民工

I'm going to translate an answer from Quora (Quora - The best answer to any question) and add some of my own understanding. I believe it will make a good answer.

1. Python/C++/R/Java - you will probably want to learn all of these languages at some point if you want a job in machine learning. Python's NumPy and SciPy libraries are awesome because they have similar functionality to MATLAB, but can be easily integrated into a web service and also used in Hadoop (see below). C++ will be needed to speed code up. R is great for statistics and plots, and Hadoop is written in Java, so you may need to implement mappers and reducers in Java (although you could use a scripting language via Hadoop streaming).

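First, get familiar with these four languages. Python has a large number of open-source libraries; take a look at NumPy and SciPy, both of which integrate well with web development and with Hadoop. C++ makes your code run faster, and R is a great statistics tool. To use Hadoop well you also need to know Java and how to implement map-reduce.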

2. Probability and Statistics: A good portion of learning algorithms are based on this theory. Naive Bayes, Gaussian Mixture Models, Hidden Markov Models, to name a few. You need to have a firm understanding of Probability and Stats to understand these models. Go nuts and study measure theory. Use statistics as a model evaluation metric: confusion matrices, receiver operating characteristic (ROC) curves, p-values, etc.

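I recommend Statistical Learning Methods (统计学习方法) by Li Hang, who is effectively my mentor's mentor. Work through the theory behind methods such as Bayesian classifiers, SVM, CRF, HMM, decision trees, AdaBoost, and logistic regression, then take a quick look at how evaluation is done, for example precision, recall, and F-measure (P/R/F). It is also worth reading a little about hypothesis testing.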
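As a small illustration of the P/R/F evaluation mentioned above, here is a minimal sketch in plain Python. The function names (`confusion_counts`, `precision_recall_f1`) and the toy labels are assumptions made up for this example; they do not come from the original answer or from any particular library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary task with the given positive label."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif guess == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1, returning 0.0 when a ratio is undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold labels and predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p, r, f = precision_recall_f1(y_true, y_pred)
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
print(p, r, f)
```

The four counts are exactly the cells of a 2x2 confusion matrix, so the same helper can be reused when you move on to ROC curves by sweeping a decision threshold.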

3. Applied Math + Algorithms: For discriminative models like SVMs, you need to have a firm understanding of algorithm theory. Even though you will probably never need to implement an SVM from scratch, it helps to understand how the algorithm works. You will need to understand subjects like convex optimization, gradient descent, quadratic programming, Lagrange multipliers, partial differential equations, etc. Get used to looking at summations.

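Machine learning does, after all, demand a very strong mathematical foundation. I'd suggest starting by digging into what some of the algorithms really do, and SVM is a good place to begin. From there you can find out what Lagrange multipliers and convex optimization are all about.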
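To make the SVM starting point concrete, here is a minimal sketch of a linear SVM trained by full-batch subgradient descent on the regularized hinge loss, (lam/2)·||w||² + (1/n)·Σ max(0, 1 − y·(w·x + b)). Note that this is the primal formulation, not the dual quadratic program that the Lagrangian derivation leads to, so treat it as one simple way in rather than the textbook solver. The names `train_linear_svm` and `predict`, the hyperparameters `lam`, `lr`, `epochs`, and the toy data are all assumptions for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by subgradient descent.

    X is a list of feature lists; y holds labels in {-1, +1}.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Only points inside the margin contribute to the hinge subgradient.
            if yi * (dot(w, xi) + b) < 1:
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Classify x by the side of the hyperplane w.x + b = 0 it falls on."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical, linearly separable toy data.
X = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

Comparing the learned `w` and `b` against an established package on the same data is a useful exercise: the points whose margin stays at or below 1 are the ones that keep contributing to the update, which is the primal view of support vectors.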

4. Distributed Computing: Most machine learning jobs require working with large data sets these days (see Data Science). You cannot process this data on a single machine, you will have to distribute it across an entire cluster. Projects like Apache Hadoop and cloud services like Amazon's EC2 make this very easy and cost-effective. Although Hadoop abstracts away a lot of the hard-core, distributed computing problems, you still need to have a firm understanding of map-reduce, distributed file systems, etc. You will most likely want to check out Apache Mahout and Apache Whirr.

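Get familiar with distributed computing. These days machine learning means running big data across many machines; otherwise there is not much point. Learn Hadoop, which matters a great deal when looking for a job. Companies such as Baidu require a Hadoop background.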
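The map-reduce idea can be tried out on a single machine before touching a cluster. The sketch below simulates the three stages (map, shuffle, reduce) for a word count in plain Python; it is not Hadoop code, and the function names and sample lines are assumptions for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


# Hypothetical input split into lines.
lines = ["big data needs big clusters", "data beats algorithms"]
counts = word_count(lines)
print(counts["big"], counts["data"])  # 2 2
```

On a real cluster the mapper and reducer run on different machines and the shuffle moves data over the network, but the contract of each function is the same as in this sketch.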

5. Expertise in Unix Tools: Unless you are very fortunate, you are going to need to modify the format of your data sets so they can be loaded into R, Hadoop, HBase, etc. You can use a scripting language like Python (using re) to do this, but the best approach is probably just to master all of the awesome Unix tools that were designed for this: cat, grep, find, awk, sed, sort, cut, tr, and many more. Since all of the processing will most likely be on a Linux-based machine (Hadoop doesn't run on Windows, I believe), you will have access to these tools. You should learn to love them and use them as much as possible. They certainly have made my life a lot easier. A great example can be found here.

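Get familiar with Unix tools and commands. Companies such as Baidu work on Linux, and companies that rely on Windows services are probably fairly rare by now. So at the very least you should know these Unix commands. I remember a Baidu interview question that asked about copying files.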
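For the scripting route mentioned in the quoted answer (Python with `re`), a reformatting job might look like the sketch below. It turns loosely formatted `key=value` records into CSV rows. The record layout, the `PAIR` pattern, and the function names are hypothetical, chosen only to show the kind of cleanup that grep, sed, and cut are typically used for.

```python
import re

# One key=value pair: optional spaces around "=", value ends at ";" or whitespace.
PAIR = re.compile(r"(\w+)\s*=\s*([^;\s]+)")


def parse_record(line):
    """Extract all key=value pairs from one raw line into a dict."""
    return dict(PAIR.findall(line))


def to_csv_rows(lines, columns):
    """Convert raw lines to CSV text rows, skipping lines with no fields."""
    rows = [",".join(columns)]
    for line in lines:
        record = parse_record(line)
        if not record:
            continue  # e.g. comments or blank lines
        rows.append(",".join(record.get(col, "") for col in columns))
    return rows


# Hypothetical messy input, as it might arrive from a log export.
raw = [
    "id=1; label = spam ;score=0.91",
    "  id = 2;label=ham; score = 0.07",
    "# comment line without fields",
]

rows = to_csv_rows(raw, ["id", "label", "score"])
print("\n".join(rows))
```

This sketch does not quote or escape values, so it only suits fields without commas; for anything messier, the standard `csv` module is the safer choice for writing output.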

6. Become familiar with the Hadoop sub-projects: HBase, Zookeeper, Hive, Mahout, etc. These projects can help you store/access your data, and they scale.

Machine learning is ultimately closely tied to big data, so keep an eye on the Hadoop sub-projects, such as HBase, Zookeeper, and Hive.

7. Learn about advanced signal processing techniques: feature extraction is one of the most important parts of machine learning. If your features suck, no matter which algorithm you choose, you're going to see horrible performance. Depending on the type of problem you are trying to solve, you may be able to utilize really cool advanced signal processing algorithms like: wavelets, shearlets, curvelets, contourlets, bandlets. Learn about time-frequency analysis, and try to apply it to your problems. If you have not read about Fourier Analysis and Convolution, you will need to learn about this stuff too. The latter is signal processing 101 stuff though.

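This point is mainly about feature extraction. Whether the problem is classification or regression, you have to deal with feature selection and extraction. He lists some basic feature extraction tools such as wavelets, and says you also need to master Fourier analysis and convolution. I don't know this area well; the gist is that you need to understand signal processing, such as the Fourier transform.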
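As one small example of a Fourier-based feature, the sketch below computes a naive discrete Fourier transform with the standard library and reports the dominant frequency of a signal. It is an O(n²) teaching version, not an FFT, and the function names, the sample rate, and the synthetic signal are assumptions for illustration.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform: X[k] = sum_t x[t] * exp(-2*pi*i*k*t/n)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def dominant_frequency(signal, sample_rate):
    """Return the frequency (in Hz) of the strongest non-DC bin below Nyquist."""
    n = len(signal)
    spectrum = dft(signal)
    magnitudes = [abs(c) for c in spectrum[1 : n // 2]]  # skip the DC bin at k=0
    k = 1 + max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return k * sample_rate / n


# Hypothetical signal: a strong 5 Hz tone plus a weaker 12 Hz tone, 1 second long.
sample_rate = 64
signal = [
    math.sin(2 * math.pi * 5 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 12 * t / sample_rate)
    for t in range(sample_rate)
]

print(dominant_frequency(signal, sample_rate))  # 5.0
```

The bin magnitudes themselves, or summary values such as the dominant frequency, can then be fed to a classifier as features instead of the raw samples.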

Finally, practice and read as much as you can. In your free time, read papers like Google Map-Reduce, Google File System, Google Big Table, The Unreasonable Effectiveness of Data, etc. There are great free machine learning books online and you should read those also. Here is an awesome course I found and re-posted on GitHub. Instead of using open source packages, code up your own, and compare the results. If you can code an SVM from scratch, you will understand the concept of support vectors, gamma, cost, hyperplanes, etc. It's easy to just load some data up and start training; the hard part is making sense of it all.

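  • In short, getting started with machine learning has two sides. One is studying the algorithms, which requires a very strong mathematical foundation (really, very strong); start with SVM and build up your understanding bit by bit. The other is learning the tools, such as distributed computing tools and Unix.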

Originally published on the WeChat official account 大数据挖掘DT数据分析 (datadw)

Original publication date: 2015-07-08
