Handling real numbers wrapped in quotes in R's ff library

Stack Overflow user
Asked 2018-06-18 18:56:01

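I'm trying to explore the 2017 HMDA data. The flat file is about 9 GB and is available for download. The CSV is too large to read into memory, so I'm trying to use the ff library. However, I get an error when I try to read the file.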

> hmda.ff <- read.csv.ffdf(file = 'hmda_lar.csv')
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  scan() expected 'a real', got '"83.9800033569336"'

When I read only the first 1,000 rows, the error goes away; it first appears somewhere between rows 1,000 and 10,000:

> hmda.ff <- read.csv.ffdf(file = 'hmda_lar.csv', nrow = 1000)

I also tried specifying all of the column classes, but that returns an error as well:

> hmda.ff <- read.csv.ffdf(file = 'hmda_lar.csv',
                      nrow = 10000,
                      colClasses = c('real', 'real', 'integer', 'real', 'integer', 
                                     'integer', 'integer', 'integer', 'integer', 
                                     'factor', 'factor', 'character', 'character', 
                                     'factor', 'factor', 'factor', 'factor', 'factor', 
                                     'factor', 'factor', 'factor', 'factor', 'factor', 
                                     'factor', 'factor', 'factor', 'factor', 'factor', 
                                     'factor', 'factor', 'factor', 'factor', 'factor', 
                                     'factor', 'character', 'integer', 'character', 
                                     'factor', 'factor', 'factor', 'factor', 'factor', 
                                     'factor', 'factor', 'factor', 'factor', 'factor'))
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  scan() expected 'a real', got '"63.5"'

When I change all of the integer and real columns to character, I still get an error:

... Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, 
  : vmode 'character' not implemented

The only approach that works is to specify colClasses = 'factor', which reads every column as a factor.

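Edit: The problem appears to lie in the original CSV file. Some values are wrapped in quotes and others are not. If I export the first 10,000 rows to a CSV and use read.csv(), it works as expected and the data types are numeric. On that same subset, however, read.csv.ffdf() gives the error scan() expected 'a real', got '"63.5"'. Quoted values are legitimate CSV, but read.csv.ffdf() does not read them the way I expected.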
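The difference described above can be reproduced in base R without ff. The read.table documentation notes that quoting is only considered for columns read as character, which is all of them unless colClasses is specified. If read.csv.ffdf() infers column types from its first chunk and then reuses them as colClasses for later chunks (an assumption about ff's behaviour, not verified here), that would be consistent with the error appearing only after the first 1,000 rows. A minimal sketch on a made-up two-row file:

```r
# Tiny stand-in file: the same column holds both bare and quoted numbers.
tmp <- tempfile(fileext = ".csv")
writeLines(c("income,state", "87.5,CA", '"63.5",VA'), tmp)

# 1. No colClasses: fields are read as text, unquoted, then type-converted.
inferred <- read.csv(tmp)
class(inferred$income)

# 2. Numeric colClasses: scan() parses the field directly and rejects the quotes.
forced <- tryCatch(
  read.csv(tmp, colClasses = c("numeric", "character")),
  error = function(e) conditionMessage(e)  # keep the message instead of stopping
)
forced

# 3. colClasses = "factor": the column is read as text first, so quotes are stripped.
as_factor <- read.csv(tmp, colClasses = "factor")
levels(as_factor$income)
```

Case 3 matches the observation that colClasses = 'factor' is the only setting that gets through the whole file.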

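Since read.csv() works, I tried reading the file in chunks into 15 separate data frames of 1,000,000 rows each. However, it consistently freezes on reaching the 11th chunk, probably because it is loading the file into memory in order to find row 11,000,000.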

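So the questions are: how can I get ff to handle real numbers that are inconsistently wrapped in quotes? Or how can I clean the raw data to remove the quotes? Or how can I chunk the data in a RAM-efficient way?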
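One way to approach the last two questions together, sketched in base R under stated assumptions: keep a single file connection open so each read.csv() call continues where the previous one stopped (instead of skipping from the top of the file every time), read every field as character so quotes are handled, convert a chosen set of columns to numeric, and append each chunk to a cleaned file. The sample rows, the numeric_cols choice, and the chunk size below are hypothetical stand-ins, not values taken from the real HMDA file.

```r
# Small stand-in for the real file (made-up rows with the same quoting problem).
src <- tempfile(fileext = ".csv")
writeLines(c("tract_to_msamd_income,msamd_name,respondent_id",
             '63.5,"Chicago, Naperville - IL",0000852218',
             '"238.12","Kalamazoo, Portage - MI",86-0860478',
             '38.19,"",7442300004',
             '"132.32","Fresno - CA",0000617677',
             '87.5,"Macomb - MI",39-2001010'), src)
dst <- tempfile(fileext = ".csv")

numeric_cols <- "tract_to_msamd_income"  # columns to store as numbers
chunk_size <- 2L                         # something like 1e6 for a large file

con <- file(src, open = "r")
# Read the header line through read.csv so quoted column names are handled too.
col_names <- unname(unlist(read.csv(con, header = FALSE, nrows = 1L,
                                    colClasses = "character")))
rows_done <- 0L
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size,
             col.names = col_names, colClasses = "character"),
    error = function(e) NULL  # treat a read error at end of input as "done"
  )
  if (is.null(chunk) || nrow(chunk) == 0L) break
  for (col in numeric_cols) chunk[[col]] <- as.numeric(chunk[[col]])
  # write.table quotes character columns only, so numbers are written bare.
  write.table(chunk, dst, sep = ",", row.names = FALSE, qmethod = "double",
              col.names = rows_done == 0L, append = rows_done > 0L)
  rows_done <- rows_done + nrow(chunk)
  if (nrow(chunk) < chunk_size) break
}
close(con)

# The cleaned file can now be read with an explicit numeric column class.
cleaned <- read.csv(dst, colClasses = c("numeric", "character", "character"))
```

Only one chunk is held in memory at a time. Two caveats: write.table() writes numbers with at most 15 significant digits, so very long decimals are rounded slightly, and empty numeric fields come out as NA in the cleaned file.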

FYI, here is a glimpse of the data:

> glimpse(hmda.ff[,])
Observations: 14,285,496
Variables: 47
$ tract_to_msamd_income          <fct> 63.5, 238.1199951171875, 38.189998626708984, 132.32000732421875, 87.5, 138.16000366210938, 98.43000030517578, 93.04000091552...
$ rate_spread                    <fct> , , , , , , , , , , , , , , , , 01.85, , , , 03.92, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ population                     <fct> 7067, 5429, 6869, 3835, 1960, 7120, 1828, 4643, 16372, 2977, 9630, 3298, 4487, 3324, 4099, 3835, 6003, 5187, 4818, 5849, 422...
$ minority_population            <fct> 72.08000183105469, 6.559999942779541, 30.719999313354492, 65.73999786376953, 55.459999084472656, 23.309999465942383, 13.5699...
$ number_of_owner_occupied_units <fct> 1201, 1611, 236, 1027, 407, 2037, 615, 854, 3292, 317, 3052, 1104, 617, 1099, 1409, 1027, 1122, 1638, 1495, 1508, 1187, 1700...
$ number_of_1_to_4_family_units  <fct> 1303, 1807, 794, 1141, 601, 2431, 725, 1936, 5286, 1174, 3188, 1175, 1120, 1404, 1522, 1141, 1520, 2162, 1989, 2080, 1421, 2...
$ loan_amount_000s               <fct> 400, 525, 225, 621, 181, 70, 123, 5, 100, 34, 302, 680, 108, 99, 100, 100, 171, 443, 420, 50, 75, 361, 179, 338, 300, 544, 3...
$ hud_median_family_income       <fct> 107600, 77500, 61800, 75200, 50000, 68800, 79600, 75200, 58400, 70800, 79600, 107600, 79300, 83900, 108300, 75200, 63200, 72...
$ applicant_income_000s          <fct> 90, 300, , 255, 109, 238, 84, 75, 44, 195, 62, 159, 50, 84, 70, 124, 80, 264, 177, 214, 181, 57, 86, 157, 64, 96, , 30, 50, ...
$ state_name                     <fct> Virginia, Illinois, Michigan, California, California, South Carolina, Michigan, California, Florida, Pennsylvania, Michigan,...
$ state_abbr                     <fct> VA, IL, MI, CA, CA, SC, MI, CA, FL, PA, MI, VA, CA, CO, CT, CA, CA, WI, NY, CA, CA, CA, NE, VA, NY, CA, CA, FL, SC, CA, VA, ...
$ sequence_number                <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ respondent_id                  <fct> 7442300004, 0000852218, 0000146672, 0000852218, 86-0860478, 0000617677, 7197000003, 0000504713, 39-2001010, 3027509990, 0000...
$ purchaser_type_name            <fct> Life insurance company, credit union, mortgage bank, or finance company, Loan was not originated or was not sold in calendar...
$ property_type_name             <fct> One-to-four family dwelling (other than manufactured housing), One-to-four family dwelling (other than manufactured housing)...
$ preapproval_name               <fct> Not applicable, Not applicable, Not applicable, Not applicable, Not applicable, Not applicable, Not applicable, Not applicab...
$ owner_occupancy_name           <fct> Owner-occupied as a principal dwelling, Owner-occupied as a principal dwelling, Not owner-occupied as a principal dwelling, ...
$ msamd_name                     <fct> Washington, Arlington, Alexandria - DC, VA, MD, WV, Chicago, Naperville, Arlington Heights - IL, Kalamazoo, Portage - MI, Sa...
$ loan_type_name                 <fct> Conventional, Conventional, Conventional, Conventional, FHA-insured, Conventional, FHA-insured, Conventional, Conventional, ...
$ loan_purpose_name              <fct> Home purchase, Refinancing, Home purchase, Refinancing, Home purchase, Home improvement, Refinancing, Home improvement, Home...
$ lien_status_name               <fct> Secured by a first lien, Secured by a first lien, Secured by a first lien, Secured by a first lien, Secured by a first lien,...
$ hoepa_status_name              <fct> Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan...
$ edit_status_name               <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ denial_reason_name_3           <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , Insufficient cash (downpayment, closing costs), , , , , ...
$ denial_reason_name_2           <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , Debt-to-income ratio, , Debt-to-income ratio, , , , , , ...
$ denial_reason_name_1           <fct> , , , Debt-to-income ratio, , Credit history, , Credit history, , Credit application incomplete, , , , , , , , , , , , Credi...
$ county_name                    <fct> Fairfax County, Cook County, Kalamazoo County, Sacramento County, Fresno County, Charleston County, Macomb County, Sacrament...
$ co_applicant_sex_name          <fct> Male, No co-applicant, No co-applicant, Female, Female, Information not provided by applicant in mail, Internet, or telephon...
$ co_applicant_race_name_5       <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ co_applicant_race_name_4       <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ co_applicant_race_name_3       <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ co_applicant_race_name_2       <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ co_applicant_race_name_1       <fct> White, No co-applicant, No co-applicant, White, Asian, Information not provided by applicant in mail, Internet, or telephone...
$ co_applicant_ethnicity_name    <fct> Not Hispanic or Latino, No co-applicant, No co-applicant, Not Hispanic or Latino, Not Hispanic or Latino, Information not pr...
$ census_tract_number            <fct> 4522.00, 8198.01, 0015.07, 0093.30, 0049.02, 0046.09, 2515.00, 0018.00, 0432.04, 0007.00, 2234.00, 4703.00, 0017.00, 0102.10...
$ as_of_year                     <fct> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017...
$ application_date_indicator     <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ applicant_sex_name             <fct> Female, Male, Male, Male, Male, Information not provided by applicant in mail, Internet, or telephone application, Female, F...
$ applicant_race_name_5          <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ applicant_race_name_4          <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ applicant_race_name_3          <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ applicant_race_name_2          <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 
$ applicant_race_name_1          <fct> White, White, White, White, Asian, Information not provided by applicant in mail, Internet, or telephone application, White,...
$ applicant_ethnicity_name       <fct> Not Hispanic or Latino, Not Hispanic or Latino, Not Hispanic or Latino, Not Hispanic or Latino, Not Hispanic or Latino, Info...
$ agency_name                    <fct> Department of Housing and Urban Development, Consumer Financial Protection Bureau, Consumer Financial Protection Bureau, Con...
$ agency_abbr                    <fct> HUD, CFPB, CFPB, CFPB, HUD, CFPB, HUD, CFPB, FDIC, HUD, CFPB, FRS, CFPB, NCUA, CFPB, NCUA, HUD, FRS, HUD, CFPB, NCUA, HUD, C...
$ action_taken_name              <fct> Loan originated, Loan originated, Loan originated, Application denied by financial institution, Loan originated, Application...

1 Answer

Stack Overflow user

Answered 2018-06-19 20:15:49

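I wrote a function that converts the data from factor to numeric. For some reason, when working with the virtual data frame in ff, the following two statements behave differently: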

hmda[1] <- as.numeric(paste0(hmda[1]))
hmda$first_col <- as.numeric(paste0(hmda$first_col))
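A plausible explanation, sketched here on an ordinary data.frame rather than an ffdf (so whether it carries over to ff objects is an assumption): hmda[1] is a one-column data frame, so paste0() collapses the whole column into a single deparsed string, while hmda$first_col is the factor itself and is pasted element by element. The data below is a made-up stand-in.

```r
# Hypothetical one-column data frame holding numbers stored as a factor.
df <- data.frame(income = factor(c("63.5", "238.12", "87.5")))

# df[1] is still a data frame (a list), so paste0() yields ONE string that
# describes the whole column rather than one string per row.
collapsed <- paste0(df[1])
length(collapsed)
bad <- suppressWarnings(as.numeric(collapsed))  # not a plain number here, so NA

# df$income is the factor vector; paste0() returns its labels one by one.
good <- as.numeric(paste0(df$income))

# Idiom suggested in ?factor for the same conversion.
also_good <- as.numeric(levels(df$income))[df$income]
```

Because the collapsed string depends on the column's internal integer codes, a column with a single distinct value may even convert to a number by accident, which could account for the inconsistency.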

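The first statement (hmda[1] <- ...) returns a bunch of NAs (though very inconsistently), while the second (hmda$first_col <- ...) actually works as expected. So here is the working script: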

library(ff)

# function that converts all numeric-looking fields to numeric
hmda_cleanup <- function(hmda){
  hmda$tract_to_msamd_income <- as.numeric(paste0(hmda$tract_to_msamd_income))
  hmda$rate_spread <- as.numeric(paste0(hmda$rate_spread))
  hmda$population <- as.numeric(paste0(hmda$population))
  hmda$minority_population <- as.numeric(paste0(hmda$minority_population))
  hmda$number_of_owner_occupied_units <- as.numeric(paste0(hmda$number_of_owner_occupied_units))
  hmda$number_of_1_to_4_family_units <- as.numeric(paste0(hmda$number_of_1_to_4_family_units))
  hmda$loan_amount_000s <- as.numeric(paste0(hmda$loan_amount_000s))
  hmda$hud_median_family_income <- as.numeric(paste0(hmda$hud_median_family_income))
  hmda$applicant_income_000s <- as.numeric(paste0(hmda$applicant_income_000s))
  hmda$as_of_year <- as.numeric(paste0(hmda$as_of_year))
  return(hmda)
}

# read in large csv with all values as factors
hmda.ff <- read.csv.ffdf(file ='hmda_lar_2017.csv', 
                         colClasses = 'factor')

# pull the ffdf into an ordinary in-memory data.frame (needs enough RAM for the whole table)
hmda.ff.df <- hmda.ff[,]

# run user-defined function on the data
hmda.ff.df <- hmda_cleanup(hmda.ff.df)
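The per-column assignments in hmda_cleanup() can also be written as a loop over column names. This is a minimal sketch for an in-memory data.frame; hmda_cleanup_cols and the demo data are hypothetical names introduced for illustration, and as.character() is used in place of paste0(), which gives the same labels for a factor.

```r
# Columns that look numeric but were read as factors (subset of the list above).
numeric_cols <- c("tract_to_msamd_income", "rate_spread", "population")

# Convert the named factor columns to numeric, leaving all other columns alone.
hmda_cleanup_cols <- function(df, cols) {
  for (col in intersect(cols, names(df))) {
    # Go through the labels, not the internal integer codes of the factor.
    df[[col]] <- as.numeric(as.character(df[[col]]))
  }
  df
}

# Made-up rows shaped like the factor-only data frame from the script.
demo <- data.frame(
  tract_to_msamd_income = factor(c("63.5", "238.12", "87.5")),
  rate_spread           = factor(c("", "01.85", "")),
  population            = factor(c("7067", "5429", "6869")),
  state_abbr            = factor(c("VA", "IL", "MI"))
)
clean <- hmda_cleanup_cols(demo, numeric_cols)
str(clean)
```

Empty strings become NA after conversion, and columns not listed in numeric_cols stay factors.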
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/50915828
