
Common Patterns for Analyzing Data

needrunning
Published 2019-10-08 16:28:24
Collected in the column: 图南科技

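This is a translation of a data science article titled Common Patterns for Analyzing Data.

The article is fairly long. It assumes a little background in data science and, more importantly, a little patience.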

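Related terminology

data science: a general term for work related to data analysis and data mining; it usually draws on two disciplines, statistics and computer science

feature engineering: taking existing data and transforming it so that it carries additional meaning

Industry: data analysis

Dataset source: Kaggle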

Kaggle is the place to do data science projects

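Vocabulary

impart: to give, inform, or teach
come in handy for: to be convenient or useful for
rated: judged, regarded
slice: to cut into slices
potential: latent ability; latent
interactive: mutually influencing, interacting
boasts: to proudly have. Example: "Each home boasts an unprecedented level of quality throughout" (every home is of first-rate quality).
survive: to stay alive
survivor: a person who survives
acknowledgment: admission; thanks
comply: to obey, to agree
complicated: structurally complex; messy, troublesome
discrete: separate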

The original text follows, with the translator's rendering after each passage.

Data Scientists spend [the] vast majority of their time by [doing] data preparation, not model optimization. — lorinc

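Data is always messy. When I was teaching myself machine learning a few months ago, I did not know how to understand data well. A key step in building an accurate model is a thorough understanding of the data you will be working with.

In other words, data scientists spend most of their time on data preparation, not on model optimization.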


Describing a dataset with code

Handling null and missing values is a step in data preprocessing that deserves serious attention.
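As a minimal sketch of that step, assuming pandas is available: the toy table, the column names `Age` and `Embarked`, and the fill strategies (median for numbers, most common value for categories) are illustrative assumptions rather than anything the article prescribes.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value.
df = pd.DataFrame({
    "Age": [22.0, None, 35.0, 58.0],
    "Embarked": ["S", "C", None, "S"],
})

# Count the missing values in each column.
missing_counts = df.isnull().sum()

# Fill numeric gaps with the median and categorical gaps with the most common value.
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(missing_counts.to_dict())
print(filled)
```

Whether to fill, drop, or flag missing values depends on the dataset; the median and the mode are only two common starting points.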

In this article, I chose a number of Exploratory Data Analyses (or EDAs) that were made publicly available on Kaggle, a website for data science. These analyses mix interactive code snippets alongside prose, and can help offer a birds-eye view of the data or tease out patterns in the data.

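For this article I chose a number of exploratory data analyses (EDAs) published on Kaggle, a website dedicated to data science. Mixing code snippets with prose, they give a bird's-eye view of the data.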

I simultaneously looked at feature engineering, a technique for taking existing data and transforming it in such a way as to impart additional meaning (for example, taking a timestamp and pulling out a DAY_OF_WEEK column, which might come in handy for predicting sales in a store).

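I also looked at feature engineering (see https://www.quora.com/Does-deep-learning-reduce-the-importance-of-feature-engineering), a technique for taking existing data and adding meaning to it. For example, pulling a DAY_OF_WEEK column out of a timestamp might come in handy when predicting a store's sales.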
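The timestamp example can be sketched in a few lines, assuming pandas. The sales table, its `timestamp` and `amount` columns, and the extra `IS_WEEKEND` flag are made up for illustration; only the DAY_OF_WEEK idea comes from the text.

```python
import pandas as pd

# Hypothetical store sales with raw timestamps.
sales = pd.DataFrame({
    "timestamp": ["2019-10-05 09:30:00", "2019-10-06 14:00:00", "2019-10-08 16:28:24"],
    "amount": [120.0, 95.5, 40.0],
})

# Parse the strings into datetimes, then derive new columns from them.
sales["timestamp"] = pd.to_datetime(sales["timestamp"])
sales["DAY_OF_WEEK"] = sales["timestamp"].dt.day_name()
# dayofweek counts from Monday=0, so 5 and 6 are Saturday and Sunday.
sales["IS_WEEKEND"] = sales["timestamp"].dt.dayofweek >= 5

print(sales)
```

The raw timestamp says little to a model on its own, while the derived columns make a weekly pattern directly visible.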

I wanted to look at a variety of different kinds of datasets, so I chose:

Structured Data
NLP (Natural Language)
Image

That is, I picked datasets from three categories: structured data, natural language processing, and image data.

Feel free to jump ahead to the conclusions below, or read on to dive into the datasets.

Criteria

For each category I chose two competitions where the submission date had passed, and sorted (roughly) by how many teams had submitted.

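For each category, I picked two competitions whose submission deadline had passed, ordered roughly by the number of teams that submitted.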

For each competition I searched for EDA tags, and chose three kernels that were highly rated or well commented. Final scores did not factor in (some EDAs didn’t even submit a score).

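Within each competition, I searched by the EDA tag and picked three kernels that were highly rated or well commented. Final scores were not a factor.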

Structured Data

A structured data dataset is characterized by spreadsheets containing training and test data. The spreadsheets may contain categorical variables (colors, like green, red, and blue), continuous variables (ages, like 4, 15, and 67) and ordinal variables (educational level, like elementary, high school, college).

Imputation: filling in missing values in the data
Binning: combining continuous data into buckets, a form of feature engineering
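Binning can be sketched with `pandas.cut`. The ages reuse the examples from the text plus one extra value; the bucket edges and labels are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

people = pd.DataFrame({"Age": [4, 15, 30, 67]})

# Each bucket includes its right edge: (0, 12], (12, 18], (18, 65], (65, 120].
people["AgeGroup"] = pd.cut(
    people["Age"],
    bins=[0, 12, 18, 65, 120],
    labels=["child", "teen", "adult", "senior"],
)

print(people)
```

Where the bucket edges go is a modelling decision, which is one reason different authors bin the same feature differently.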

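Structured data takes the form of spreadsheets containing training data and test data. [Translator's note: "training data" and "test data" are standard terms in data science.]

The data may contain categorical variables (such as colors), continuous variables (such as ages), and ordinal variables (such as education level: elementary, high school, college).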
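One way to represent these variable types, sketched under the assumption that pandas is used: an ordinal variable keeps its order as integer codes, while an unordered categorical variable is one-hot encoded. The values mirror the examples in the text; the encoding choices are illustrative, not the only option.

```python
import pandas as pd

# Ordinal: the order elementary < high school < college carries meaning.
levels = ["elementary", "high school", "college"]
education = pd.Series(["college", "elementary", "high school", "college"])
ordered = pd.Categorical(education, categories=levels, ordered=True)
education_codes = ordered.codes  # position of each value within `levels`

# Categorical without order: one indicator column per color.
colors = pd.Series(["green", "red", "blue", "red"])
color_dummies = pd.get_dummies(colors, prefix="color")

print(list(education_codes))
print(color_dummies)
```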

The training spreadsheet has a target column that you’re trying to solve for, which will be missing in the test data. The majority of the EDAs I examined focused on teasing out potential correlations between the target variable and the other columns.

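The training data contains a target column, which is the column to be predicted; the test data does not contain it. Most of the EDAs focus on potential correlations between the target variable and the other columns.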
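A quick first look at such correlations might be sketched as below, assuming pandas. The six rows are invented, and the column names `Fare`, `Age`, and `Survived` only echo the Titanic columns discussed later in the article.

```python
import pandas as pd

# Invented training rows with a binary target column.
train = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05, 53.10, 8.46, 51.86],
    "Age": [22, 38, 26, 35, 30, 54],
    "Survived": [0, 1, 0, 1, 0, 1],
})

# Pearson correlation of every other column with the target, strongest first.
corr_with_target = (
    train.corr()["Survived"]
    .drop("Survived")
    .sort_values(ascending=False)
)

print(corr_with_target)
```

A correlation coefficient only captures linear relationships, so the EDAs typically pair a table like this with plots.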

Binning: grouping continuous data into buckets.

Because you’re mostly looking for correlations between different variables, there are only so many ways you can slice and dice the data.

For visualizations, there are more options, but even so, some techniques seem better suited for the task at hand than others, resulting in a lot of similar-looking notebooks.

Where you can really let your imagination run wild is with feature engineering. Each of the authors I looked at had different approaches to feature engineering, whether it was choosing how to bin a feature or combining categorical features into new ones.

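Feature engineering is where you can give your imagination free rein. The authors I looked at took different approaches to it, whether in choosing how to bin a feature or in combining categorical features into new ones.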
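Combining categorical features can be as simple as joining two columns into one. The sketch below assumes pandas; the rows are invented, and the `Sex_Pclass` feature is a hypothetical example, not one attributed to any particular kernel.

```python
import pandas as pd

passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
})

# Join two categorical columns into a single combined category.
passengers["Sex_Pclass"] = passengers["Sex"] + "_" + passengers["Pclass"].astype(str)

print(passengers)
```

A combined feature like this lets a simple model treat, say, first-class women and third-class women as distinct groups.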

Case studies

Let’s take a deeper look at two competitions, the Titanic competition, followed by the House Prices competition.

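Let's take a deeper look at two competitions: the Titanic competition and the House Prices competition.

Titanic survival prediction competition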

The Titanic competition is a popular beginners’ competition, and lots of folks on Kaggle cycle through it. As a result the EDAs tend to be well written and thoroughly documented, and were amongst the clearest I saw.

The dataset includes a training spreadsheet with a column Survived indicating whether a passenger survived or not, along with other supplementary data like their age, gender, ticket fare price, and more.
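A typical first question for such a table is how the survival rate differs between groups. The sketch below assumes pandas and uses eight invented rows; only the `Survived` column name comes from the text, and `Sex` is assumed as the name of the gender column.

```python
import pandas as pd

# Invented rows standing in for the training spreadsheet.
train = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of passengers who survived in each group.
survival_by_sex = train.groupby("Sex")["Survived"].mean()

print(survival_by_sex)
```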

[Figure: Titanic survival prediction competition]

Project goal

The project home page describes the goal as follows:

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive.

Binary classification

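Use binary classification to analyze which groups of people were more likely to survive.

Training data and test data

[Figure: training data and test data]

The image above describes the training data and the test data. In the training data the outcome is already known; in the test data the outcome is unknown and has to be produced by a predictive model.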
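The split can be made concrete with a deliberately naive baseline: learn the majority outcome per group from the training data, then apply it to test rows that have no outcome column. This is a sketch under stated assumptions (pandas, invented rows, a hypothetical `PassengerId` identifier column), not a recommended model.

```python
import pandas as pd

# Training data: the outcome column "Survived" is known.
train = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Test data: same features, but no "Survived" column.
test = pd.DataFrame({
    "PassengerId": [101, 102, 103],
    "Sex": ["male", "female", "female"],
})

# "Fit": predict 1 for a group when at least half of it survived in training.
survival_rate = train.groupby("Sex")["Survived"].mean()
majority_outcome = (survival_rate >= 0.5).astype(int)

# "Predict": look up each test passenger's group.
predictions = test["Sex"].map(majority_outcome)
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions,
})

print(submission)
```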

House price prediction competition

House Prices is another structured data competition. This one boasts many more variables than the Titanic competition, and includes categorical, ordinal and continuous features.

Python programming guide

A guide to exploring the data with Python is available here:

https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

Understand how variables are distributed and how they interact
Apply different transformations before training machine learning models

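That is, understand how this many variables are distributed and how they interact with one another, and apply transformations before training machine learning models.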
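One transformation that is often applied to right-skewed prices is a logarithm. The sketch below assumes pandas and NumPy; the eight prices are invented, and `SalePrice` is assumed as the target column name. Whether a log transform helps depends on the data and the model.

```python
import numpy as np
import pandas as pd

# Invented prices with a long right tail.
prices = pd.Series(
    [80_000, 100_000, 110_000, 120_000, 130_000, 150_000, 250_000, 750_000],
    name="SalePrice",
)

# log1p computes log(1 + x); it compresses large values more than small ones.
log_prices = np.log1p(prices)

print("skew before:", prices.skew())
print("skew after: ", log_prices.skew())
```

Predictions made on the log scale can be mapped back to prices with `np.expm1`.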

The EDAs I chose for analysis were Comprehensive Data Exploration with Python by Pedro Marcelino, Detailed Data Exploration in Python by Angela, and Fun Python EDA Step by Step by Sang-eon Park.

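Summary

The English article is long. It works as English reading practice, but readers with no background in data analysis or machine learning may find it confusing, so here is a brief summary.

It is an article about data analysis and machine learning. Using two real competitions hosted on the data competition site www.kaggle.com, it works through the data to show readers common patterns in data analysis.

It involves many domain terms, including dataset, test data, training data, data preprocessing, model, feature engineering, and data science.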

Related links

Titanic survival prediction competition: https://www.kaggle.com/c/titanic

House price prediction: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Python programming tutorials: https://www.kaggle.com/c/titanic/overview/tutorials

Python development: a nine-story tower starts from a pile of earth.


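This article is part of the Tencent Cloud self-media sharing program and is shared from a WeChat official account.
Originally published: 2019-10-03. In case of infringement, please contact cloudcommunity@tencent.com for removal.

This article is shared from the 图南科技 WeChat official account.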
