对于两个数据帧,我在R中使用合并函数有一点问题。我有两个大的数据框架,它们的列和数据是相同的,这就是我用来合并它们的方法。
例如,数据框架1的列有“data”、"RecordDate“、"HourMinuteSecond”、"MilliSecond“、.(和其他几列)虽然数据帧2有列“other”、"RecordDate“、"HourMinuteSecond”、"MilliSecond“、.(与数据帧2相比,还有其他几列)。我现在按照以下方式使用merge函数:
DataFrame3 <- merge(DataFrame2, DataFrame1, by=c("Instrument", "RecordDate","HourMinuteSecond","MilliSecond"))这给了我一个新的数据框架。现在,我遇到的问题是,数据帧之间的某些部分根本不匹配。例如,我发现有一部分数据没有正确匹配:
> DataFrame1[120486,]
Instrument RecordDate HourMinuteSecond MilliSecond
120486 DTE 6/4/2012 16:10:27 42
> DataFrame2[65,]
Instrument RecordDate HourMinuteSecond MilliSecond
65 DTE 6/4/2012 16:10:27 42 (注意还有其他列,但我省略了它们)。现在,我将这些部分作为向量进行比较。首先,我使用相同的函数来比较每个向量中的每个值,这给出了以下内容:
> identical(DataFrame1[120486,1] ,DataFrame2[65,1])
[1] FALSE
> identical(DataFrame1[120486,2] ,DataFrame2[65,2])
[1] TRUE
> identical(DataFrame1[120486,3] ,DataFrame2[65,3])
[1] FALSE
> identical(DataFrame1[120486,4] ,DataFrame2[65,4])
[1] TRUE从相同的函数来看,似乎“the”和"HourMintuteSecond“列中的值彼此不同。有人能告诉我是什么导致了这个问题吗?提前谢谢。
编辑:这是dput输出,希望这就是您所指的:
> dput(droplevels(DataFrame2[65,1:4]))
structure(list(Instrument = structure(1L, .Label = "DTE", class = "factor"),
RecordDate = structure(1L, .Label = "6/4/2012", class = "factor"),
HourMinuteSecond = structure(1L, .Label = "16:10:27", class = "factor"),
MilliSecond = 42L), .Names = c("Instrument", "RecordDate",
"HourMinuteSecond", "MilliSecond"), row.names = 65L, class = "data.frame")
> dput(droplevels(DataFrame1[120486,1:4]))
structure(list(Instrument = structure(1L, .Label = "DTE", class = "factor"),
RecordDate = structure(1L, .Label = "6/4/2012", class = "factor"),
HourMinuteSecond = structure(1L, .Label = "16:10:27", class = "factor"),
MilliSecond = 42L), .Names = c("Instrument", "RecordDate",
"HourMinuteSecond", "MilliSecond"), row.names = 120486L, class = "data.frame")下面是str()输出:
> str(DataFrame1)
'data.frame': 317495 obs. of 9 variables:
$ Instrument : Factor w/ 4 levels "CDD","DTE","ERA",..: 1 1 1 1 1 1 1 1 1 1 ...
$ RecordDate : Factor w/ 30 levels "5/18/2012","5/21/2012",..: 1 1 1 1 1 1 1 1 1 1 ...
$ HourMinuteSecond: Factor w/ 21763 levels "10:02:02","10:02:03",..: 14 14 14 17 19 22 24 25 25 25 ...
$ MilliSecond : int 26 57 158 70 73 8 926 448 457 458 ...
$ L1BidPrice : num 6.91 6.91 6.91 6.91 6.91 6.91 6.9 6.9 6.89 6.89 ...
$ L1BidVolume : int 520 504 504 504 504 508 20 4 20 20 ...
$ L1AskPrice : num 6.92 6.92 6.92 6.92 6.92 6.92 6.91 6.91 6.9 6.9 ...
$ L1AskVolume : int 3917 3917 3915 3932 3915 3915 3407 3407 13 30 ...
$ Midquote : num 6.92 6.92 6.92 6.92 6.92 ...
> str(DataFrame2)
'data.frame': 577 obs. of 15 variables:
$ Instrument : Factor w/ 2 levels "DTE","ERA": 1 1 1 1 1 1 1 1 1 1 ...
$ RecordDate : Factor w/ 30 levels "5/18/2012","5/21/2012",..: 1 1 1 1 1 1 2 2 2 2 ...
$ HourMinuteSecond : Factor w/ 317 levels "10:02:10","10:02:21",..: 301 301 301 301 301 301 2 98 129 130 ...
$ MilliSecond : int 45 45 45 45 45 45 485 6 92 300 ...
$ RecordType : Factor w/ 1 level "TRADE": 1 1 1 1 1 1 1 1 1 1 ...
$ Price : num 0.195 0.195 0.195 0.195 0.195 0.195 0.2 0.19 0.19 0.185 ...
$ Volume : int 2686 6350 6350 6350 1620 3064 1 13986 25000 23092 ...
$ UndisclosedVolume: Factor w/ 1 level "\\N": 1 1 1 1 1 1 1 1 1 1 ...
$ DollarValue : num 524 1238 1238 1238 316 ...
$ Qualifiers : Factor w/ 4 levels "\\N","AC","Bi",..: 2 2 2 2 2 2 4 4 3 4 ...
$ BidID : num 6.13e+18 6.13e+18 6.13e+18 6.13e+18 6.13e+18 ...
$ AskID : num 6.13e+18 6.13e+18 6.13e+18 6.13e+18 6.13e+18 ...
$ BidOrAsk : Factor w/ 1 level "\\N": 1 1 1 1 1 1 1 1 1 1 ...
$ BuyerBrokerID : int 229 229 229 229 229 229 236 129 229 112 ...
$ SellerBrokerID : int 297 210 210 210 110 157 229 229 299 229 ...发布于 2017-02-03 07:01:42
# load data table library used for large data sets
library('data.table')
# convert factors into character
col1 <- colnames(df1)[sapply(df1, is.factor)] # get columns that are factors for df1
col2 <- colnames(df2)[sapply(df2, is.factor)] # get columns that are factors for df2
for(col in col1){ # df1
set(df1, , col, as.character( df1[[col]] ) ) # for more info on set() function, read ?`:=`
}
for(col in col2){ # df2
set(df2, , col, as.character( df2[[col]] ) )
}
# join two data frames by the selected columns in 'on' argument
setDT(df1)[df2, on = c('Instrument', 'RecordDate', 'HourMinuteSecond','MilliSecond')] # setDT converts data frame to data table by reference
# Instrument RecordDate HourMinuteSecond MilliSecond L1BidPrice L1BidVolume L1AskPrice L1AskVolume Midquote i.L1BidPrice i.L1BidVolume i.L1AskPrice i.L1AskVolume
# 1: DTE 6/4/2012 16:10:27 42 6.91 520 6.92 3917 6.92 7 8 9 10
# i.Midquote
# 1: 11
# merge function in data table is faster than the same function in base R function. You just convert data frame into data tables.
setDT(df1)
setDT(df2)
merge(df1, df2, by = c('Instrument', 'RecordDate', 'HourMinuteSecond','MilliSecond'))数据:
df1 <- structure(list(Instrument = "DTE", RecordDate = "6/4/2012", HourMinuteSecond = "16:10:27",
MilliSecond = 42L, L1BidPrice = 6.91, L1BidVolume = 520,
L1AskPrice = 6.92, L1AskVolume = 3917, Midquote = 6.92), .Names = c("Instrument",
"RecordDate", "HourMinuteSecond", "MilliSecond", "L1BidPrice",
"L1BidVolume", "L1AskPrice", "L1AskVolume", "Midquote"), row.names = c(NA, -1L), class = "data.frame")
df2 <- structure(list(Instrument = "DTE", RecordDate = "6/4/2012", HourMinuteSecond = "16:10:27",
MilliSecond = 42L, L1BidPrice = 7, L1BidVolume = 8, L1AskPrice = 9,
L1AskVolume = 10, Midquote = 11), .Names = c("Instrument",
"RecordDate", "HourMinuteSecond", "MilliSecond", "L1BidPrice",
"L1BidVolume", "L1AskPrice", "L1AskVolume", "Midquote"), row.names = 120486L, class = "data.frame")
df1
# Instrument RecordDate HourMinuteSecond MilliSecond L1BidPrice L1BidVolume L1AskPrice L1AskVolume Midquote
# 1: DTE 6/4/2012 16:10:27 42 6.91 520 6.92 3917 6.92
df2
# Instrument RecordDate HourMinuteSecond MilliSecond L1BidPrice L1BidVolume L1AskPrice L1AskVolume Midquote
# 120486: DTE 6/4/2012 16:10:27 42 7 8 9 10 11https://stackoverflow.com/questions/42015715
复制相似问题