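This is a translation of a data science article titled Common Patterns for Analyzing Data.
The article is fairly long and assumes a little background in data science; more importantly, it takes a little patience.
data science: a general term for work related to data analysis and data mining, usually drawing on two disciplines, statistics and computer science
feature engineering: transforming existing data so that it carries additional meaning
Industry: data analysis
Dataset source: Kaggle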
Kaggle is the place to do data science projects
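impart: to give, tell, or teach; come in handy: to be useful; rated: judged, regarded; slice: to cut into slices; potential: latent ability, possible; interactive: acting on one another
boasts: to proudly possess. Example: Each home boasts an unprecedented level of quality throughout.
survive: to stay alive; survivor: a person who survived; acknowledgment: admission, thanks; comply: to obey, agree; complicated: complex in structure, confusing, troublesome; discrete: separate
The original text and its translation follow.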
Data Scientists spend [the] vast majority of their time by [doing] data preparation, not model optimization. — lorinc
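Data is always messy. When I was teaching myself machine learning a few months ago, I did not know how to get a better understanding of data. A key step in building an accurate model is a thorough understanding of the data you will be working with.
Data scientists spend most of their time on data preparation, not on model optimization.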
Describing a dataset with code
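As a rough illustration of what describing a dataset with code can look like, the sketch below computes a few summary statistics per numeric column using only the Python standard library. The rows and the column names `age` and `fare` are invented for the example and are not taken from any Kaggle dataset; in a notebook the same overview is usually obtained with pandas' `DataFrame.describe()`.

```python
import statistics

# Hypothetical toy rows, invented for illustration only.
rows = [
    {"age": 22.0, "fare": 7.25},
    {"age": 38.0, "fare": 71.28},
    {"age": 26.0, "fare": 7.92},
    {"age": 35.0, "fare": 53.10},
]


def describe(rows, column):
    """Return count, mean, min, median and max for one numeric column."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# One summary dictionary per column.
summary = {column: describe(rows, column) for column in ("age", "fare")}
for column, stats in summary.items():
    print(column, stats)
```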
Handling null and missing values is an important step in data preprocessing.
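One common way to handle missing values is to count them and then fill them with the median of the observed values. The sketch below assumes missing entries are represented as `None`; the `ages` list is invented, and median filling is only one of several possible strategies, not necessarily the one the EDA authors used.

```python
import statistics

# Hypothetical column with two missing entries, marked as None.
ages = [22.0, None, 26.0, 35.0, None, 54.0]


def count_missing(values):
    """Count entries that are missing (None)."""
    return sum(value is None for value in values)


def impute_median(values):
    """Replace each missing entry with the median of the observed ones."""
    observed = [value for value in values if value is not None]
    fill = statistics.median(observed)
    return [fill if value is None else value for value in values]


filled = impute_median(ages)
print(count_missing(ages), "missing before,", count_missing(filled), "after")
```

The median is less sensitive to extreme values than the mean, which is why it is a frequent default for skewed columns.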
In this article, I chose a number of Exploratory Data Analyses (or EDAs) that were made publicly available on Kaggle, a website for data science. These analyses mix interactive code snippets alongside prose, and can help offer a birds-eye view of the data or tease out patterns in the data.
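The data for this article comes from Kaggle, a website dedicated to data science, and the analyses can be regarded as exploratory data analysis (EDA). Analyzing the data alongside code snippets gives a bird's-eye view of its shape.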
I simultaneously looked at feature engineering, a technique for taking existing data and transforming it in such a way as to impart additional meaning (for example, taking a timestamp and pulling out a DAY_OF_WEEK column, which might come in handy for predicting sales in a store).
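I also looked at feature engineering (see https://www.quora.com/Does-deep-learning-reduce-the-importance-of-feature-engineering), which means taking existing data and adding meaning to it, for example pulling a DAY_OF_WEEK column out of a timestamp, which may come in handy when predicting a store's sales.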
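The DAY_OF_WEEK example can be sketched in a few lines. The code assumes the timestamp is stored as Unix seconds in a column called `timestamp` and interprets it in UTC; both choices, and the `sales` values, are assumptions made for the illustration.

```python
from datetime import datetime, timezone

# Index matches datetime.weekday(): Monday is 0, Sunday is 6.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def add_day_of_week(rows):
    """Return copies of the rows with a derived DAY_OF_WEEK column."""
    result = []
    for row in rows:
        moment = datetime.fromtimestamp(row["timestamp"], tz=timezone.utc)
        result.append({**row, "DAY_OF_WEEK": DAY_NAMES[moment.weekday()]})
    return result


# Hypothetical sales records with Unix timestamps (seconds, UTC).
sales = [
    {"timestamp": 1546300800, "sales": 120},  # 2019-01-01 00:00 UTC
    {"timestamp": 1546646400, "sales": 310},  # 2019-01-05 00:00 UTC
]

for row in add_day_of_week(sales):
    print(row)
```

A model cannot easily learn a weekly pattern from a raw timestamp, whereas a weekday column exposes it directly.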
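I wanted to look at a variety of different kinds of datasets, so I chose: Structured Data, NLP (Natural Language), Image
I wanted to look at different kinds of datasets, so I chose from the following categories: structured data, natural language processing, and image data.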
Feel free to jump ahead to the conclusions below, or read on to dive into the datasets.
Criteria
For each category I chose two competitions where the submission date had passed, and sorted (roughly) by how many teams had submitted.
For each category, I chose two competitions whose submission deadline had passed, sorted roughly by how many teams had submitted.
For each competition I searched for EDA tags, and chose three kernels that were highly rated or well commented. Final scores did not factor in (some EDAs didn’t even submit a score).
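In each competition, I searched by the EDA tag and chose three kernels that were highly rated or well commented. Final scores were not a factor.
Structured Data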
A structured data dataset is characterized by spreadsheets containing training and test data. The spreadsheets may contain categorical variables (colors, like green, red, and blue), continuous variables (ages, like 4, 15, and 67) and ordinal variables (educational level, like elementary, high school, college).
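Imputation: filling in missing values in the data
Binning: combining continuous data into buckets, a form of feature engineering
Structured data takes the form of spreadsheets holding training data and test data. [Training data and test data are standard terms in data science.]
The data may contain categorical variables (such as colors), continuous variables (such as ages), and ordinal variables (such as education level: elementary, high school, college).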
The training spreadsheet has a target column that you’re trying to solve for, which will be missing in the test data. The majority of the EDAs I examined focused on teasing out potential correlations between the target variable and the other columns.
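The training data contains a target column, which is the column to be predicted; the test data does not contain it. Most of the EDAs focus on potential correlations between the target variable and the other columns.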
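To make "correlation with the target" concrete, here is a minimal sketch that computes the Pearson correlation between each numeric column and a 0/1 target, then ranks the columns by the strength of the relationship. The column names and values are invented, and Pearson correlation is only one possible measure; the EDAs discussed here may well rely on plots or other statistics instead.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical training columns and a binary target.
columns = {
    "fare": [7, 8, 13, 26, 53, 71],
    "age": [30, 22, 40, 25, 35, 28],
}
target = [0, 0, 0, 1, 1, 1]

# Rank columns by absolute correlation with the target, strongest first.
ranked = sorted(
    ((name, pearson(values, target)) for name, values in columns.items()),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: {r:+.2f}")
```

A strong correlation only flags a column as worth a closer look; it does not by itself show that the column causes the outcome.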
Binning: grouping continuous data into buckets.
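A minimal sketch of binning, using the ages mentioned earlier (4, 15 and 67). The bucket edges and labels are hypothetical choices made for the illustration; with pandas, `pd.cut` is the usual tool for the same job.

```python
from bisect import bisect_right

# Hypothetical bucket edges: each edge is the lowest age of the next bucket.
AGE_EDGES = [13, 20, 65]
AGE_LABELS = ["child", "teen", "adult", "senior"]


def bin_age(age):
    """Map a continuous age onto one of the labelled buckets."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


for age in (4, 15, 67):
    print(age, "->", bin_age(age))
```

Where to place the edges is itself a feature engineering decision, which is one reason different authors end up with different bins for the same column.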
Because you’re mostly looking for correlations between different variables, there’s only so many ways you can slice and dice the data.
For visualizations, there’s more options, but even so, some techniques seem better suited for a task at hand than others, resulting in a lot of similar-looking notebooks.
Where you can really let your imagination run wild is with feature engineering. Each of the authors I looked at had different approaches to feature engineering, whether it was choosing how to bin a feature or combining categorical features into new ones.
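Feature engineering is where you can let your imagination run free. The authors I looked at each took a different approach, whether in choosing how to bin a feature or in combining categorical features into new ones.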
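Combining categorical features into a new one can be as simple as joining their values. The sketch below is an assumption about what such a combination might look like; the column names `sex` and `pclass` and the helper `combine_features` are hypothetical.

```python
def combine_features(row, columns, separator="_"):
    """Join the values of several categorical columns into one new value."""
    return separator.join(str(row[column]) for column in columns)


# Hypothetical passenger rows.
passengers = [
    {"sex": "female", "pclass": 1},
    {"sex": "male", "pclass": 3},
]

for row in passengers:
    # The derived column captures the interaction of the two originals.
    row["sex_pclass"] = combine_features(row, ["sex", "pclass"])
    print(row)
```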
Case studies
Let’s take a deeper look at two competitions, the Titanic competition, followed by the House Prices competition.
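Let's take a deeper look at two competitions: the Titanic competition and the House Prices competition.
Titanic survival prediction competition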
The Titanic competition is a popular beginners’ competition, and lots of folks on Kaggle cycle through it. As a result the EDAs tend to be well written and thoroughly documented, and were amongst the clearest I saw.
The dataset includes a training spreadsheet with a column Survived indicating whether a passenger survived or not, along with other supplementary data like their age, gender, ticket fare price, and more.
[Image: Titanic survival prediction]
Project goal
The competition home page describes the goal as follows:
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive.
Binary classification
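Use binary classification to analyze which groups of people were more likely to survive.
[Image: training data and test data]
The image above describes the training data and the test data. The training data comes with known outcomes; the test data does not, so its outcomes have to be produced by a predictive model.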
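The relationship between training data, test data and a binary prediction can be sketched with a deliberately naive model: learn the survival rate of each group from the training rows, then predict 1 for test rows whose group survived at least half the time. All rows below are invented toy data, not real Titanic records, and this rule is only a baseline for illustration, not the method used in the EDAs.

```python
from collections import defaultdict

# Hypothetical training rows: the Survived column is known.
train = [
    {"sex": "female", "Survived": 1},
    {"sex": "female", "Survived": 1},
    {"sex": "female", "Survived": 0},
    {"sex": "male", "Survived": 0},
    {"sex": "male", "Survived": 0},
    {"sex": "male", "Survived": 1},
]
# Hypothetical test rows: the Survived column is missing.
test = [{"sex": "male"}, {"sex": "female"}]


def fit_group_rates(rows, feature, target):
    """Learn the share of positive outcomes for each value of a feature."""
    totals = defaultdict(lambda: [0, 0])  # value -> [positives, row count]
    for row in rows:
        totals[row[feature]][0] += row[target]
        totals[row[feature]][1] += 1
    return {value: hits / count for value, (hits, count) in totals.items()}


def predict(rows, feature, rates):
    """Predict 1 when the learned rate is at least 0.5, otherwise 0."""
    return [1 if rates.get(row[feature], 0.0) >= 0.5 else 0 for row in rows]


rates = fit_group_rates(train, "sex", "Survived")
predictions = predict(test, "sex", rates)
print(rates, predictions)
```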
House Prices is another structured data competition. This one boasts many more variables than the Titanic competition, and includes categorical, ordinal and continuous features.
Python guide
A Python walkthrough is available here:
https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
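Understand how variables are distributed and how they interact
Apply different transformations before training machine learning models
That is: understand how the many variables are distributed and how they act on one another, and apply different transformations before training machine learning models.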
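One transformation commonly applied before training is a log transform of a right-skewed variable such as a sale price. The sketch below measures skewness before and after applying `log1p`. The prices are invented, and the choice of a log transform is an assumption made for illustration rather than something prescribed by the competition.

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    second = sum((value - mean) ** 2 for value in values) / n
    third = sum((value - mean) ** 3 for value in values) / n
    return third / second ** 1.5


# Hypothetical house prices with one very expensive outlier.
prices = [100_000, 120_000, 130_000, 150_000, 160_000, 200_000, 750_000]

# log1p compresses large values much more than small ones.
log_prices = [math.log1p(price) for price in prices]

print("skewness before:", round(skewness(prices), 2))
print("skewness after: ", round(skewness(log_prices), 2))
```

Predictions made on the log scale can be mapped back to prices with `math.expm1`.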
The EDAs I chose for analysis were Comprehensive Data Exploration with Python by Pedro Marcelino, Detailed Data Exploration in Python by Angela, and Fun Python EDA Step by Step by Sang-eon Park.
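Summary
The English article is long. It works well as English reading practice, but readers with no background in data analysis and machine learning will find it hard to follow, so here is a brief summary.
The article is about data analysis and machine learning. Using two real competitions hosted on the data competition site www.kaggle.com, it focuses on the data and tries to show readers the common patterns of data analysis.
It uses many domain terms, including dataset, test data, training data, data preprocessing, model, feature engineering, and data science.
Related links
Titanic survival prediction competition: https://www.kaggle.com/c/titanic
House price prediction: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Python tutorials: https://www.kaggle.com/c/titanic/overview/tutorials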