# [Data Scientist] The Path to Becoming a Data Scientist

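1. Learn statistics and data preprocessing.
2. Understand statistical pitfalls. You need to know that biases and common mistakes affect everyone who does statistical analysis.
3. Understand how several machine learning and statistical techniques work.
4. Time series analysis.
5. Programming skills (R, Java, Python, Scala).
6. Databases (SQL and NoSQL databases).
7. Web crawling (Apache Nutch, Scrapy, Jsoup).
8. Text data.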

## Data Preprocessing

• Data Preparation for Data Mining by Dorian Pyle.
• Mining Imperfect Data: Dealing with Contamination and Incomplete Records by Pearson.
• Exploratory Data Mining and Data Cleaning by Johnson and Dasu.
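
To give a flavor of what these books cover, here is a minimal sketch of two routine preprocessing steps: filling missing values with the median and flagging outliers with the interquartile-range rule. It is not taken from any of the books above; the function names, the sample data, and the conventional `k=1.5` threshold are illustrative assumptions.

```python
from statistics import median, quantiles

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings with two gaps and one contaminated record.
raw = [12.0, 15.0, None, 14.0, 13.0, 250.0, None, 16.0]
clean = impute_median(raw)
outliers = iqr_outliers(clean)
print(clean)
print(outliers)
```

The median is used for the fill value because, unlike the mean, it is barely moved by the contaminated record at 250.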

## Understanding the Pitfalls

• Statistical Truisms in the Age of Big Data
• The Hidden Biases of Big Data.

• Quora question: What are common fallacies or mistakes made by beginners in Statistics / Machine Learning / Data Analysis?
• Identifying and Overcoming Common Data Mining Mistakes by SAS Institute.

• Common Errors in Statistics (and How to Avoid Them) by P. Good and J. Hardin.

## Understanding How Common Machine Learning and Statistical Algorithms Work

• Practical Selection of SVM Parameters and Noise Estimation for SVM Regression.

• Applied Predictive Modeling by Kuhn and Johnson. It works through a large number of examples using the caret R package, which makes parameter tuning much easier.

• Data Mining: Practical Machine Learning Tools and Techniques by Witten and Frank.
• The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman.
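
The parameter tuning mentioned for caret boils down to scoring each candidate setting by cross-validation and keeping the best one. The sketch below shows that idea in plain Python for a 1-D k-nearest-neighbour classifier with leave-one-out cross-validation. It does not use or reproduce caret's API; the toy data and all function names are assumptions made for illustration.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D features)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out cross-validated accuracy for a given k."""
    hits = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # hold out the i-th point
        hits += knn_predict(train, x, k) == y
    return hits / len(data)

def tune_k(data, candidates):
    """Score every candidate k; ties go to the first candidate listed."""
    scores = {k: loo_accuracy(data, k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical toy data: two well-separated classes on a line.
data = [(1.0, "a"), (1.2, "a"), (1.5, "a"), (2.0, "a"),
        (3.0, "b"), (3.3, "b"), (3.6, "b"), (4.0, "b")]
best_k, scores = tune_k(data, [1, 3, 5, 7])
print(best_k, scores)
```

With only four points per class, `k=7` always outvotes the held-out point's own class, which is the kind of failure a tuning loop is meant to expose.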

## Time Series Forecasting

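• Forecasting: Principles and Practice by Hyndman and Athanasopoulos is an excellent introduction to forecasting.
• Time Series Analysis and Its Applications: With R Examples by Shumway and Stoffer is another book on time series forecasting, with hands-on practice in R.
• If you are really interested in time series, I would also recommend the ForeCA R package, which tells you how forecastable a time series is.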
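
As a taste of what an introductory forecasting text covers, here is a minimal sketch of simple exponential smoothing, one of the standard baseline methods. The initialisation (level set to the first observation), the sample series, and the function name are assumptions chosen for brevity; the books above discuss better ways to choose the starting level and `alpha`.

```python
def ses_forecast(series, alpha):
    """One-step-ahead forecast by simple exponential smoothing.

    Update rule: level = alpha * y + (1 - alpha) * level,
    with the level initialised to the first observation.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

# Hypothetical monthly demand figures.
demand = [10, 12, 11, 13]
forecast = ses_forecast(demand, alpha=0.5)
naive = ses_forecast(demand, alpha=1.0)  # alpha = 1 reduces to "repeat the last value"
print(forecast, naive)
```

A smaller `alpha` gives a smoother forecast that reacts slowly to recent changes; a larger one tracks the latest observations more closely.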

## Web Crawling

• Scrapy
• Apache Nutch
• Jsoup

## Text Data

• GATE
• UIMA (text analytics)
• The “tm” R package
• LingPipe
• NLTK
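
Most of these toolkits start from the same basic representation: a document-term matrix of word counts. The following sketch builds one with the Python standard library only. It is a simplified stand-in and not the API of any tool listed above; the regex tokenizer (lowercase ASCII letters only) and the sample documents are assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix

# Hypothetical two-document corpus.
docs = ["Data mining finds patterns in data.",
        "Text mining is data mining on text."]
vocab, matrix = document_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Real toolkits add stop-word removal, stemming, and sparse storage on top of this, since such a matrix is mostly zeros for any realistic corpus.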

• Introduction to Information Retrieval by Manning, Raghavan and Schütze.
• Handbook of Natural Language Processing by Indurkhya, Damerau (Editors).
• The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data by Feldman and Sanger.

## Conclusion

Finally, here are a few more books that no data scientist should miss:

• Data Mining and Statistics for Decision Making by Stéphane Tufféry (A personal favorite).
• Introduction to Data Mining by Tan, Steinbach, and Kumar.
• Applied Predictive Modeling by Kuhn and Johnson.
• Data Mining with R: Learning with Case Studies by Torgo.
• Principles of Data Mining by Bramer.
