
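Google Colab does not parse a CSV correctly, while Jupyter Notebook parses it fine

Stack Overflow user
Asked on 2022-09-16 06:59:11
3 answers · 65 views · 0 followers · 0 votes

I want to read a CSV with pandas on Google Colab. I used the standard code: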

train = pd.read_csv("here-I-enter-my-Google-Drive-URL-to-the-file", sep=",", lineterminator='\n', header=0)

But it returns:

ParserError: Error tokenizing data. C error: Expected 90 fields in line 3, saw 228

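The thing is, this is my first time trying Google Colab, and I searched online for this problem. Most people suggest the cause is a stray comma in the data (other than the ones separating the columns), but I searched with the "Find" function in several text and spreadsheet readers and found none. In fact, every spreadsheet reader reads the file correctly, and so does Jupyter Notebook.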
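The "stray comma" hypothesis can also be checked programmatically rather than with a text search. The following is a minimal sketch using only the standard library; the helper name `find_irregular_rows` and the sample text are hypothetical, and it assumes the most common field count in the file is the intended one.

```python
import csv
import io
from collections import Counter


def find_irregular_rows(text, delimiter=","):
    """Return (expected_field_count, [(row_number, field_count), ...]).

    Row numbers are 1-based and count parsed CSV records, so a quoted
    field containing a newline still counts as a single row.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    counts = [len(row) for row in rows]
    if not counts:
        return 0, []
    # Assumption: the most frequent field count is the "correct" one.
    expected = Counter(counts).most_common(1)[0][0]
    irregular = [(i, n) for i, n in enumerate(counts, start=1) if n != expected]
    return expected, irregular


# Hypothetical sample: the third row has one field too many.
sample = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"
print(find_irregular_rows(sample))  # (3, [(3, 4)])
```

If this reports no irregular rows for the local file, the file itself is probably fine and the problem more likely lies in what is actually being fetched from the URL.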

Here are the first 10 rows of the file; the full file is available for download via the link in the original post.

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65,8450,Pave,NA,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,NA,Attchd,2003,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,NA,NA,NA,0,2,2008,WD,Normal,208500
2,20,RL,80,9600,Pave,NA,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,None,0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,NA,NA,NA,0,5,2007,WD,Normal,181500
3,60,RL,68,11250,Pave,NA,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,NA,NA,NA,0,9,2008,WD,Normal,223500
4,70,RL,60,9550,Pave,NA,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,None,0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,NA,NA,NA,0,2,2006,WD,Abnorml,140000
5,60,RL,84,14260,Pave,NA,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,NA,NA,NA,0,12,2008,WD,Normal,250000
6,50,RL,85,14115,Pave,NA,IR1,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1.5Fin,5,5,1993,1995,Gable,CompShg,VinylSd,VinylSd,None,0,TA,TA,Wood,Gd,TA,No,GLQ,732,Unf,0,64,796,GasA,Ex,Y,SBrkr,796,566,0,1362,1,0,1,1,1,1,TA,5,Typ,0,NA,Attchd,1993,Unf,2,480,TA,TA,Y,40,30,0,320,0,0,NA,MnPrv,Shed,700,10,2009,WD,Normal,143000
7,20,RL,75,10084,Pave,NA,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,1Fam,1Story,8,5,2004,2005,Gable,CompShg,VinylSd,VinylSd,Stone,186,Gd,TA,PConc,Ex,TA,Av,GLQ,1369,Unf,0,317,1686,GasA,Ex,Y,SBrkr,1694,0,0,1694,1,0,2,0,3,1,Gd,7,Typ,1,Gd,Attchd,2004,RFn,2,636,TA,TA,Y,255,57,0,0,0,0,NA,NA,NA,0,8,2007,WD,Normal,307000
8,60,RL,NA,10382,Pave,NA,IR1,Lvl,AllPub,Corner,Gtl,NWAmes,PosN,Norm,1Fam,2Story,7,6,1973,1973,Gable,CompShg,HdBoard,HdBoard,Stone,240,TA,TA,CBlock,Gd,TA,Mn,ALQ,859,BLQ,32,216,1107,GasA,Ex,Y,SBrkr,1107,983,0,2090,1,0,2,1,3,1,TA,7,Typ,2,TA,Attchd,1973,RFn,2,484,TA,TA,Y,235,204,228,0,0,0,NA,NA,Shed,350,11,2009,WD,Normal,200000
9,50,RM,51,6120,Pave,NA,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,1.5Fin,7,5,1931,1950,Gable,CompShg,BrkFace,Wd Shng,None,0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,952,952,GasA,Gd,Y,FuseF,1022,752,0,1774,0,0,2,0,2,2,TA,8,Min1,2,TA,Detchd,1931,Unf,2,468,Fa,TA,Y,90,0,205,0,0,0,NA,NA,NA,0,4,2008,WD,Abnorml,129900
10,190,RL,50,7420,Pave,NA,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,Artery,Artery,2fmCon,1.5Unf,5,6,1939,1950,Gable,CompShg,MetalSd,MetalSd,None,0,TA,TA,BrkTil,TA,TA,No,GLQ,851,Unf,0,140,991,GasA,Ex,Y,SBrkr,1077,0,0,1077,1,0,1,0,2,2,TA,5,Typ,2,TA,Attchd,1939,RFn,1,205,Gd,TA,Y,0,4,0,0,0,0,NA,NA,NA,0,1,2008,WD,Normal,118000

Any ideas on how to solve this?

Thank you all!

3 Answers

Stack Overflow user

Accepted answer

Posted on 2022-09-20 08:31:57

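After doing some more research, I did come up with an answer that may help you and anyone else working with datasets from different online sources.

Here is my example of a "failure":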

import pandas as pd
df=pd.read_csv('https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=train.csv')

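This produces a similar error: "Error tokenizing data. C error: Expected 1 fields in line 9, saw 2".

The question is why the error occurs and how to resolve it. Here is what I did, using on_bad_lines="warn":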

df=pd.read_csv('https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=train.csv',on_bad_lines="warn")

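This produces a message:

'Skipping line 9: expected 1 fields, saw 2\nSkipping line 10: expected 1 fields, saw 3\nSkipping line 12: expected 1 fields, saw 4\nSkipping line 27: expected 1 fields, saw 8\nSkipping line 31...'

Even so, df is still produced, which lets us inspect the data and find the error: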

df
<!DOCTYPE html>
0   <html lang="en">
1   <head>
2   <title>House Prices - Advanced Regression Te...
3   <meta charset="utf-8" />
4   <meta name="turbolinks-cache-control" conten...
... ...
86  <div id="site-body" class="hide">
87  </div>
88  </main>
89  </body>
90  </html>

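In this example we can see that the URL returned an HTML page rather than the CSV file, and parsing that HTML as CSV is what raises the error. No error occurs with a locally saved copy of the file.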
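This kind of failure can be caught before the parser is involved by looking at the first few characters of whatever was downloaded. The sketch below is only a heuristic, and the helper name `looks_like_html` is hypothetical; it assumes the payload is already available as `str` or `bytes`.

```python
def looks_like_html(payload):
    """Heuristic: does the start of a downloaded payload look like HTML?"""
    if isinstance(payload, bytes):
        payload = payload.decode("utf-8", errors="replace")
    # Ignore a byte-order mark and leading whitespace, then check the prefix.
    head = payload.lstrip("\ufeff \t\r\n")[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


print(looks_like_html(b'<!DOCTYPE html>\n<html lang="en">'))  # True
print(looks_like_html("Id,MSSubClass,MSZoning\n1,60,RL\n"))   # False
```

If the check returns True, the URL points at a web page (a login, preview, or landing page) and not at the raw file, so no combination of `sep` or `lineterminator` arguments will help.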

After authenticating with API credentials (Kaggle, Google, GitHub, etc.), the file loads correctly and the result is as expected:

df.head()
Id  MSSubClass  MSZoning    LotFrontage LotArea Street  Alley   LotShape    LandContour Utilities   ... PoolArea    PoolQC  Fence   MiscFeature MiscVal MoSold  YrSold  SaleType    SaleCondition   SalePrice
0   1   60  RL  65.0    8450    Pave    NaN Reg Lvl AllPub  ... 0   NaN NaN NaN 0   2   2008    WD  Normal  208500
1   2   20  RL  80.0    9600    Pave    NaN Reg Lvl AllPub  ... 0   NaN NaN NaN 0   5   2007    WD  Normal  181500
2   3   60  RL  68.0    11250   Pave    NaN IR1 Lvl AllPub  ... 0   NaN NaN NaN 0   9   2008    WD  Normal  223500
3   4   70  RL  60.0    9550    Pave    NaN IR1 Lvl AllPub  ... 0   NaN NaN NaN 0   2   2006    WD  Abnorml 140000
4   5   60  RL  84.0    14260   Pave    NaN IR1 Lvl AllPub  ... 0   NaN NaN NaN 0   12  2008    WD  Normal  250000

Hopefully this helps in the future, regardless of the online source or environment.
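For the Google Drive case in the question, a commonly used alternative to API credentials is to turn the share link into a direct-download link before passing it to pd.read_csv. The following is a sketch under assumptions: the helper name `drive_direct_download_url` and the file ID `FILE_ID_123` are hypothetical, the file is assumed to be shared with "anyone with the link", and for large files Google Drive may still answer with an HTML confirmation page.

```python
import re


def drive_direct_download_url(share_url):
    """Build a direct-download URL from a Google Drive share link.

    Handles links of the form .../file/d/<id>/view and ...?id=<id>.
    """
    match = (re.search(r"/file/d/([A-Za-z0-9_-]+)", share_url)
             or re.search(r"[?&]id=([A-Za-z0-9_-]+)", share_url))
    if match is None:
        raise ValueError(f"No Google Drive file ID found in: {share_url}")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"


# Hypothetical share link; FILE_ID_123 is a placeholder, not a real file.
print(drive_direct_download_url(
    "https://drive.google.com/file/d/FILE_ID_123/view?usp=sharing"))
```

The resulting URL can then be handed to pd.read_csv; if a ParserError still appears, inspecting the first lines of the response shows whether an HTML page came back again.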

Votes: 1

Stack Overflow user

Posted on 2022-09-16 14:53:34

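I downloaded the dataset and reproduced the scenario on my machine. Here is my implementation, which works fine. I hope it helps.

Python version: 3.7.14

pandas version: 1.3.5

Environment: Google Colab

Implementation: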

import pandas as pd

from google.colab import files
uploaded = files.upload()

df = pd.read_csv('train.csv') 

df.head(5)

Votes: 2

Stack Overflow user

Posted on 2022-09-16 07:17:39

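I just downloaded the data and tried importing it in Colab with your code, and it worked fine there. Maybe try uploading it to your Google Drive again and then loading it into a DataFrame.

One thing I have run into is that the code breaks if you run it before the CSV has finished uploading (the file already shows up in the files section before the upload is complete).

I hope this helps at least a little.

Votes: 1

Original page content provided by Stack Overflow.
Original link: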

https://stackoverflow.com/questions/73741043
