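I can load the CSV into a pandas DataFrame, but every row ends up stuck inside a list. How can I load directly from PyDrill into a pandas DataFrame, or else remove the list wrapping from the DataFrame's column names and data? I tried unlisting, but that puts everything into one list.
I used to_dataframe(), but I can't find any documentation on whether it accepts a delimiter. Calling pd.DataFrame on the result doesn't work because it is a PyDrill query result.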
reviews = drill.query("SELECT * FROM hdfs.datasets.`titanic_ML/titanic.csv` LIMIT 1000", timeout=30)
print(reviews)
import pandas as pd
df2 = reviews.to_dataframe()
df2.rename(columns=df2.iloc[0])
headers = df2.iloc[0]
print(headers)
new_df = pd.DataFrame(df2.values[1:], columns=headers)
new_df.head()
The result still has everything in a single list per row:
["pclass","sex","age","sibsp","parch","fare","embarked","survived"]
0 ["3","1","38.0","0","0","7.8958","1","0"]
1 ["1","1","42.0","0","0","26.55","1","0"]
2 ["3","0","9.0","4","2","31.275","1","0"]
3 ["3","1","27.0","0","0","7.25","1","0"]
4 ["1","1","41.0","0","0","26.55","1","0"]
I want to get everything into a normal pandas DataFrame, with one column per field.
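For the pandas-side half of the question, here is a minimal sketch of unpacking a single list-valued column into ordinary columns. It assumes the column is named `columns` (the name Drill normally gives headerless delimited data) and that each cell is either a Python list or a JSON-encoded string of one. Both are assumptions, so check `df2.columns` and `type(df2.iloc[0, 0])` first. The helper name `expand_list_column` is hypothetical, and the sample frame only stands in for `reviews.to_dataframe()`.

```python
import json

import pandas as pd


def expand_list_column(df, col="columns"):
    """Split one column of row-lists into one DataFrame column per field.

    Hypothetical helper: assumes the first row holds the header names.
    """
    # Each cell may be a real list or a JSON string such as '["3","1"]'.
    rows = [json.loads(v) if isinstance(v, str) else list(v) for v in df[col]]
    header, data = rows[0], rows[1:]
    return pd.DataFrame(data, columns=header)


# Stand-in for reviews.to_dataframe(); values copied from the output above.
raw = pd.DataFrame({"columns": [
    '["pclass","sex","age","sibsp","parch","fare","embarked","survived"]',
    '["3","1","38.0","0","0","7.8958","1","0"]',
    '["1","1","42.0","0","0","26.55","1","0"]',
]})

tidy = expand_list_column(raw)
print(tidy.head())
```

The values in `tidy` are still strings at this point; only the shape has been fixed.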
Posted on 2019-10-11 17:58:04
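The solution I found is below.
It does not unlist the DataFrame, but it is an alternative way around the problem: read the data from a database with pd.read_sql instead of going through PyDrill.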
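import pandas as pd
import psycopg2

# Connection values are the placeholders from the original post.
connect_str = "dbname='dbname' user='dsa_ro_user' host='host database'"
conn = psycopg2.connect(connect_str)
SQL = "SELECT * "
SQL += "FROM train"
df = pd.read_sql(SQL, conn)
df.head()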
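The same read-from-a-connection pattern can be tried without a database server. The sketch below swaps in Python's built-in `sqlite3` with an in-memory database purely as a stand-in; the `train` table's columns and rows are made up for illustration and are not from the original answer.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database used only as a stand-in for the real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train (pclass INTEGER, fare REAL, survived INTEGER)")
conn.executemany(
    "INSERT INTO train VALUES (?, ?, ?)",
    [(3, 7.8958, 0), (1, 26.55, 0)],  # illustrative rows
)
conn.commit()

# pd.read_sql runs the query and returns typed columns, one per field.
sql = "SELECT * FROM train"
df = pd.read_sql(sql, conn)
print(df.head())

conn.close()
```

Because the database reports column types, the resulting DataFrame has numeric dtypes without any extra conversion step.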
Posted on 2020-09-21 02:46:36
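Try using the table function described in the O'Reilly text, Chapter 4, "Querying Delimited Data". It splits the file on the delimiter and uses the first row as your column names. Note: because everything is read as text, you may need to CAST values to float if you want to do arithmetic in the SELECT or WHERE clause.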
This should get you what you want:
sql="""
SELECT *
FROM table(hdfs.datasets.`/titanic_ML/titanic.csv`(
type => 'text',
extractHeader => true,
fieldDelimiter => ',')
) LIMIT 1000
"""
rows = drill.query(sql, timeout=30)
df = rows.to_dataframe()
df.head()
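If the conversion is done on the pandas side rather than with CAST in the query, a sketch along these lines may help. The sample frame is a stand-in for `rows.to_dataframe()` and assumes every value arrives as a string; the dtype chosen for each column is my assumption about the Titanic data, not something stated in the answer.

```python
import pandas as pd

# Stand-in for rows.to_dataframe(): every value is text.
df = pd.DataFrame({
    "pclass": ["3", "1", "3"],
    "age": ["38.0", "42.0", "9.0"],
    "fare": ["7.8958", "26.55", "31.275"],
    "survived": ["0", "0", "0"],
})

# Convert text columns to numeric dtypes so arithmetic and filters behave.
typed = df.astype({"pclass": int, "age": float, "fare": float, "survived": int})

print(typed.dtypes)
print(typed[typed["fare"] > 10])
```

Without the conversion, a comparison such as `df["fare"] > 10` would raise a TypeError, because the column holds strings.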
https://stackoverflow.com/questions/58342953