
R Programming-week1 Reading Data

统计学家

Published 2019-04-10 17:10:58

This article is included in the column: 机器学习与统计学

Reading Data

There are a few principal functions for reading data into R.

read.table, read.csv, for reading tabular data

readLines, for reading lines of a text file

source, for reading in R code files (inverse of dump)

dget, for reading in R code files (inverse of dput)

load, for reading in saved workspaces

unserialize, for reading single R objects in binary form

Writing Data

There are analogous functions for writing data to files

write.table

writeLines

dump

dput

save

serialize
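Each writer above pairs with one of the readers from the previous list. A minimal round-trip sketch (the data frame and file names are illustrative, using temporary files):

```r
# Round-trip a data frame through three of the reader/writer pairs
d <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

f1 <- tempfile()
write.table(d, file = f1)              # textual table out
d2 <- read.table(f1)                   # ...and back in

f2 <- tempfile()
save(d, file = f2)                     # binary workspace out
rm(d)
load(f2)                               # restores the object `d` by name

f3 <- tempfile()
con <- file(f3, "wb")
serialize(d, con)                      # single object, binary form
close(con)
con <- file(f3, "rb")
d3 <- unserialize(con)                 # ...and back in
close(con)
```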

Reading Data Files with read.table

The read.table function is one of the most commonly used functions for reading data. It has a few important arguments:

file, the name of a file, or a connection

header, logical indicating if the file has a header line

sep, a string indicating how the columns are separated

colClasses, a character vector indicating the class of each column in the dataset

nrows, the number of rows in the dataset

comment.char, a character string indicating the comment character

skip, the number of lines to skip from the beginning

stringsAsFactors, should character variables be coded as factors?
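As a sketch, the arguments above might be combined like this (the file contents and column classes here are illustrative, written to a temporary file so the example is self-contained):

```r
# A small stand-in file: 2 comment lines, a header, then comma-separated rows
f <- tempfile()
writeLines(c("# a comment", "# another comment", "x,y", "1,a", "2,b"), f)

data <- read.table(f,
                   header = TRUE,                          # first data line is column names
                   sep = ",",                              # comma-separated columns
                   colClasses = c("numeric", "character"), # one class per column
                   nrows = 100,                            # mild overestimate is fine
                   comment.char = "#",                     # drop lines starting with #
                   skip = 2,                               # skip the two comment lines
                   stringsAsFactors = FALSE)               # keep strings as character
```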

read.table

For small to moderately sized datasets, you can usually call read.table without specifying any other arguments

data <- read.table("foo.txt")

R will automatically

skip lines that begin with a #

figure out how many rows there are (and how much memory needs to beallocated)

figure out what type of variable is in each column of the table

Telling R all these things directly makes R run faster and more efficiently.

read.csv is identical to read.table except that the default separator is a comma.

Reading in Larger Datasets with read.table

With much larger datasets, doing the following things will make yourlife easier and will prevent R from choking.

Read the help page for read.table, which contains many hints

Make a rough calculation of the memory required to store your dataset. If the dataset is larger than the amount of RAM on your computer, you can probably stop right here.

Set comment.char = "" if there are no commented lines in your file.

Use the colClasses argument. Specifying this option instead of using the default can make read.table run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are "numeric", for example, then you can just set colClasses = "numeric". A quick and dirty way to figure out the classes of each column is the following:

initial <- read.table("datatable.txt", nrows = 100)

classes <- sapply(initial, class)

tabAll <- read.table("datatable.txt", colClasses = classes)

Set nrows. This doesn't make R run faster but it helps with memory usage. A mild overestimate is okay. You can use the Unix tool wc to calculate the number of lines in a file.
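One hedged way to feed that line count to nrows from within R, assuming a Unix-like system where wc is available (a small temporary file stands in for a real dataset here):

```r
# A small stand-in file; in practice this would be your large dataset
f <- tempfile()
writeLines(c("1 2", "3 4", "5 6"), f)

# Count lines with the Unix tool wc, then pass the count to nrows
# (subtract 1 if the file has a header line)
n <- as.integer(system(paste("wc -l <", shQuote(f)), intern = TRUE))
tab <- read.table(f, nrows = n)
```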

Know Thy System

In general, when using R with larger datasets, it's useful to know a few things about your system.

How much memory is available?

What other applications are in use?

Are there other users logged into the same system?

What operating system? Is the OS 32 or 64 bit?

Calculating Memory Requirements

I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame?

1,500,000 × 120 × 8 bytes/numeric = 1,440,000,000 bytes

1,440,000,000 bytes / 2^20 bytes/MB = 1,373.29 MB ≈ 1.34 GB
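The same arithmetic, done in R:

```r
# Each numeric (double) takes 8 bytes
rows <- 1500000
cols <- 120
bytes <- rows * cols * 8   # 1,440,000,000 bytes
mb <- bytes / 2^20         # 2^20 bytes per MB
gb <- mb / 2^10            # 2^10 MB per GB
round(mb, 2)               # 1373.29
round(gb, 2)               # 1.34
```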

Textual Formats

Dumping and dputing are useful because the resulting textual format is editable, and in the case of corruption, potentially recoverable.

Unlike writing out a table or csv file, dump and dput preserve the metadata (sacrificing some readability), so that another user doesn't have to specify it all over again.

Textual formats can work much better with version control programs like subversion or git, which can only track changes meaningfully in text files.

Textual formats can be longer-lived; if there is corruption somewhere in the file, it can be easier to fix the problem.

Textual formats adhere to the "Unix philosophy".

Downside: the format is not very space-efficient.

dput-ting R Objects

Another way to pass data around is by deparsing the R object with dput and reading it back in using dget.

> y <- data.frame(a = 1, b = "a")

> dput(y)

structure(list(a = 1, b = structure(1L, .Label = "a", class = "factor")), .Names = c("a", "b"), row.names = c(NA, -1L), class = "data.frame")

> dput(y, file = "y.R")

> new.y <- dget("y.R")

> new.y

a b

1 1 a

Dumping R Objects

Multiple objects can be deparsed using the dump function and read back in using source

> x <- "foo"

> y <- data.frame(a = 1, b = "a")

> dump(c("x", "y"), file = "data.R")

> rm(x, y)

> source("data.R")

> y

a b

1 1 a

> x

[1] "foo"

Interfaces to the Outside World

Data are read in using connection interfaces. Connections can be made to files (most common) or to other more exotic things.

file, opens a connection to a file

gzfile, opens a connection to a file compressed with gzip

bzfile, opens a connection to a file compressed with bzip2

url, opens a connection to a webpage
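As a quick self-contained sketch of the compressed-file connections, gzfile lets you write and read gzip data without ever decompressing it on disk (url works the same way for webpages, given a live URL):

```r
# Write a gzip-compressed text file through a gzfile connection
f <- tempfile(fileext = ".gz")
con <- gzfile(f, "w")
writeLines(c("hello", "world"), con)
close(con)

# readLines opens (and closes) the unopened connection itself
x <- readLines(gzfile(f))
```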

File Connections

> str(file)

function (description = "", open = "", blocking = TRUE, encoding = getOption("encoding"))

description is the name of the file

open is a code indicating:

"r" read only

"w" writing (and initializing a new file)

"a" appending

"rb", "wb", "ab" reading, writing, or appending in binary mode (Windows)

Connections

In general, connections are powerful tools that let you navigate files or other external objects. In practice, we often don't need to deal with the connection interface directly.

con <- file("foo.txt", "r")

data <- read.csv(con)

close(con)

is the same as

data <- read.csv("foo.txt")
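One case where the explicit connection is genuinely useful: reading a file a few lines at a time, since an open connection remembers its position between calls (a small temporary file is used here for illustration):

```r
# Read a file in chunks through an explicit connection
f <- tempfile()
writeLines(as.character(1:10), f)

con <- file(f, "r")
first <- readLines(con, n = 3)   # lines 1-3
nxt <- readLines(con, n = 3)     # picks up where the last call stopped: lines 4-6
close(con)
```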

Originally published: 2015-04-16.