
Data Cleaning in R: Notes and a Worked Example

I want to take you to Shangri-La,

to turn the prayer flags together.

The good guy is me, and I am He Kunyu (贺鲲羽). Hahaha.

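DataCamp mainly offers Python and R courses. The content is split into fine-grained pieces, and teaching combines written explanations with in-browser coding exercises, interspersed with short lecture videos; the level-by-level format makes it fun. The explanations are relatively shallow, though, there is little freedom in how you solve each problem, and most tasks are not challenging. It suits beginners well; those who already have some R background are better off working through a book.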

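I have recently been working through DataCamp's R courses and decided on a whim to share some notes. I have to take notes anyway, and sharing them helps me keep up my posting frequency. Notes for other courses will follow, so stay tuned.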

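Because the notes were taken while following the video lectures, they are fairly rough, so I have attached a simple case study from the DataCamp course. Click "Read the full text" to view the markdown report.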

If this post and my report are helpful to you, please consider starring my repo (top right corner of the page) while you are viewing the report.

The notes follow:

Cleaning Data in R

Exploring Raw Data

Check data structure:

class(),

dim(),

names(),

str(), or glimpse() from `dplyr` [1],

summary().

Looking at the data

head(),

tail().
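As a minimal sketch of the inspection functions listed above, the following applies them to a small made-up data frame. The `weather` data and its column names are assumptions for illustration only, not part of the course material.

```r
# Hypothetical toy data frame, used only for illustration
weather <- data.frame(
  day  = 1:6,
  temp = c(21.5, 23.1, 19.8, 22.4, 24.0, 20.3),
  sky  = c("sunny", "cloudy", "rain", "sunny", "sunny", "cloudy"),
  stringsAsFactors = FALSE
)

class(weather)    # what kind of object this is
dim(weather)      # number of rows, then number of columns
names(weather)    # column names
str(weather)      # compact view of each column's type and first values
summary(weather)  # per-column summary statistics

# Look at the actual rows at the top and bottom of the table
first_rows <- head(weather, n = 3)
last_rows  <- tail(weather, n = 3)
```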

Tidying Data

Tidy data: observations as rows and variables as columns, with only one type of observational unit per table

Addressing messy data with `tidyr` [2]

Q: Column headers are values, not variable names (wide data)

A: gather(data = ,

key = name of the new key column,

value = name of the new value column, ...,

-c(col) = columns to ignore)
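A small sketch of `gather()` on wide data, assuming `tidyr` is installed. The `wide_cases` table and its column names are hypothetical; the year columns stand in for values that really belong to a single "year" variable.

```r
library(tidyr)

# Hypothetical wide table: "2017" and "2018" are values, not variables
wide_cases <- data.frame(
  country = c("A", "B"),
  `2017`  = c(10, 20),
  `2018`  = c(11, 22),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Collapse every column except `country` into key/value pairs
long_cases <- gather(wide_cases, key = "year", value = "cases", -country)
long_cases
```

Newer releases of `tidyr` also offer `pivot_longer()` for the same job; `gather()` is used here because it is what the notes cover.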

Q: Variables are stored in both rows and columns (common in long data)

A: spread(data = ,

key = bare name of the key column,

value = bare name of the value column)
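A sketch of `spread()`, assuming `tidyr` is installed. The `long_measures` table is made up for illustration: its `measure` column holds what should be variable names.

```r
library(tidyr)

# Hypothetical long table: `measure` holds variable names, `value` their values
long_measures <- data.frame(
  id      = c(1, 1, 2, 2),
  measure = c("height", "weight", "height", "weight"),
  value   = c(170, 65, 182, 80),
  stringsAsFactors = FALSE
)

# Turn each distinct value of `measure` into its own column
wide_measures <- spread(long_measures, key = measure, value = value)
wide_measures
```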

Q: Multiple variables are stored in one column

A: separate(data = ,

col = column to separate,

into = vector of new column names,

sep = separator to split on)
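A sketch of `separate()`, assuming `tidyr` is installed. The `visits` table and the `year_month` column are hypothetical.

```r
library(tidyr)

# Hypothetical table where one column packs two variables together
visits <- data.frame(
  id         = 1:2,
  year_month = c("2018-07", "2018-08"),
  stringsAsFactors = FALSE
)

# Split `year_month` at the hyphen into two new character columns
visits_split <- separate(visits, col = year_month,
                         into = c("year", "month"), sep = "-")
visits_split
```

The new columns stay character by default, so "07" keeps its leading zero.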

Q: Multiple columns need to be combined into one

A: unite(data = ,

col = name of the new column, ...,

sep = separator placed between the values)
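A sketch of `unite()`, the inverse of `separate()`, assuming `tidyr` is installed. The `parts` table is made up for illustration.

```r
library(tidyr)

# Hypothetical table with a date spread over two columns
parts <- data.frame(
  year  = c("2018", "2018"),
  month = c("07", "08"),
  stringsAsFactors = FALSE
)

# Paste `year` and `month` together into one column, joined by a hyphen
parts_united <- unite(parts, col = "year_month", year, month, sep = "-")
parts_united
```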

Preparing Data for Analysis

Type conversions:

as.[class](),

as.Date(), as.POSIXct() or `lubridate` [3].
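The conversions above can be sketched in base R as follows. The `exam` data frame and its columns are assumptions for illustration; everything is read in as character first, which is a common starting point with raw data.

```r
# Hypothetical raw data in which every column arrived as character
exam <- data.frame(
  score    = c("90", "85.5", "70"),
  passed   = c("TRUE", "TRUE", "FALSE"),
  taken_on = c("2018-07-24", "2018-07-25", "2018-07-26"),
  stringsAsFactors = FALSE
)

exam$score    <- as.numeric(exam$score)    # character -> numeric
exam$passed   <- as.logical(exam$passed)   # character -> logical
exam$taken_on <- as.Date(exam$taken_on, format = "%Y-%m-%d")  # character -> Date

# Date-times with a time-of-day component go through as.POSIXct()
stamp <- as.POSIXct("2018-07-26 14:30:00", tz = "UTC")

str(exam)
```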

String manipulation with `stringr` [4]

str_trim(): trim white spaces;

str_pad(): pad with additional characters;

str_detect(): detect a pattern;

str_replace(): detect and replace a pattern;

tolower("ABC") = "abc";

toupper("abc") = "ABC".
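A short sketch of the string helpers above, assuming `stringr` is installed. The input strings are made up for illustration.

```r
library(stringr)

trimmed  <- str_trim("  messy value  ")                         # drop leading/trailing spaces
padded   <- str_pad("42", width = 5, side = "left", pad = "0")  # left-pad to width 5
has_an   <- str_detect(c("apple", "banana"), "an")              # one TRUE/FALSE per element
replaced <- str_replace("2018/07/26", "/", "-")                 # replaces only the first match

lower <- tolower("ABC")  # base R, no package needed
upper <- toupper("abc")

print(list(trimmed, padded, has_an, replaced, lower, upper))
```

Note that `str_replace()` changes only the first match in each string; `str_replace_all()` replaces every match.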

Missing data formats:

NA;

Inf: 1/0;

NaN: 0/0;

#N/A (from Excel);

. (from SPSS, SAS);

"": Empty string or blank.
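One possible way to bring these encodings under a single `NA`, sketched in base R. The `survey` data frame is hypothetical, and which placeholder strings count as missing is an assumption that depends on where the data came from.

```r
# Hypothetical raw data with three different "missing" placeholders
survey <- data.frame(
  name   = c("Ann", "", "Cy", "Dee"),
  income = c("52000", "#N/A", ".", "61000"),
  stringsAsFactors = FALSE
)

# Recode the placeholders to NA, then convert income to numeric
survey[survey == "" | survey == "#N/A" | survey == "."] <- NA
survey$income <- as.numeric(survey$income)

# Inf and NaN are special numeric values rather than NA,
# although is.na() also returns TRUE for NaN
inf_is_na <- is.na(1 / 0)
nan_is_na <- is.na(0 / 0)

survey
```

When reading from a file, the `na.strings` argument of `read.csv()` can do the same recoding at import time.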

Finding and dealing with missing values

any(is.na()): whether there are missing values;

sum(is.na()): how many missing values;

summary(): where and how many;

na.omit(): remove all rows that contain missing values.
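A minimal sketch of these checks on a made-up data frame. The `records` data is an assumption for illustration, and `colSums()` is an extra base R helper that is not in the course notes.

```r
# Hypothetical data with one missing value in each column
records <- data.frame(
  x = c(1, NA, 3, 4),
  y = c("a", "b", NA, "d"),
  stringsAsFactors = FALSE
)

has_missing <- any(is.na(records))       # is anything missing at all?
n_missing   <- sum(is.na(records))       # total count of missing values
per_column  <- colSums(is.na(records))   # count per column (not in the notes)

summary(records)  # reports NA counts for numeric columns such as x

complete_rows <- na.omit(records)  # keep only rows without any NA
complete_rows
```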

Finding outliers and obvious errors

summary()

boxplot(data)

hist()
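A sketch of how these three functions can surface implausible values. The `ages` vector is made up, with two deliberate errors; `boxplot.stats()` is an extra base R helper, not part of the notes, used here to list the points a boxplot would draw as outliers.

```r
# Hypothetical ages containing two obvious data-entry errors
ages <- c(23, 31, 27, 35, 29, 240, 26, -5)

summary(ages)   # the min and max already look suspicious
boxplot(ages)   # outliers are drawn as separate points
hist(ages)      # isolated bars far from the bulk of the data

# Values beyond 1.5 times the hinge spread (the default boxplot rule)
flagged <- boxplot.stats(ages)$out
flagged
```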

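The Data Cleaning in R course is only an introduction to data cleaning and covers a limited set of methods and functions; the Intermediate R course is a useful supplement.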

07/26/2018

He Kunyu (贺鲲羽)

Code: github.com/QuinninR/QuinninR-sample-analysis

Report: rpubs.com/QuinninR/407585

[1] Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2018). dplyr: A Grammar of Data Manipulation. R package version 0.7.5.

[2] Hadley Wickham and Lionel Henry (2018). tidyr: Easily Tidy Data with 'spread()' and 'gather()' Functions. R package version 0.8.1.

[3] Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25.

[4] Hadley Wickham (2018). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.3.1.

  • Original link: https://kuaibao.qq.com/s/20180726G1YF4X00?refer=cp_1026