我正在尝试将一个xls
文件(只有一个选项卡)打开到一个熊猫数据框中。
这是一个我通常可以在excel或excel中读取的文件,实际上这里是原始文件:https://www.dropbox.com/scl/fi/zbxg8ymjp8zxo6k4an4dj/product-screener.xls?dl=0&rlkey=3aw7whab78jeexbdkthkjzkmu。
我注意到前两行合并了单元格,一些列也合并了。
我尝试过几种方法(从堆栈),但都失败了。
# method 1 - read excel
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file)
print(df)
错误:Excel file format cannot be determined, you must specify an engine manually.
# method 2 - pip install xlrd and use engine
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file, engine='xlrd')
print(df)
错误:Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'
# method 3 - rename to xlsx and open with openpyxl
file = "C:\\Users\\admin\\Downloads\\product-screener.xlsx"
df = pd.read_excel(file, engine='openpyxl')
print(df)
错误:File is not a zip file
(可能转换,而不是重命名,是一个选项)。
# method 4 - use read_xml
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_xml(file)
print(df)
这个方法实际上产生了一个结果,但是产生了一个与工作表无关的DataFrame。想必人们需要解释xml (似乎很复杂)?
Style Name Table
0 NaN None NaN
1 NaN All funds NaN
# method 5 - use read_table
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_table(file)
print(df)
此方法将文件读入一个列(系列) DataFrame中。那么,如何使用这些信息创建与xls文件形状相同的标准2d DataFrame呢?
0 <Workbook xmlns="urn:schemas-microsoft-com:off...
1 <Styles>
2 <Style ss:ID="Default">
3 <Alignment Horizontal="Left"/>
4 </Style>
... ...
226532 </Cell>
226533 </Row>
226534 </Table>
226535 </Worksheet>
226536 </Workbook>
# method 5 - use read_html
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_html(file)
print(df)
这将返回一个空列表[]
,而可能至少会有一个DataFrames列表。
因此,问题是,将该文件读取为dataframe (或类似的可用格式)的最简单方法是什么?
发布于 2022-02-02 15:36:45
不是一个完整的解决方案,但它应该让你开始。"xls"
文件实际上是SpreadsheetML
格式的普通xml
文件。将文件扩展名更改为.xml
--在internet浏览器中查看它,结构(至少是给定文件的结构)相当简单。
以下是将数据内容读入熊猫DataFrame中:
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('product-screener.xls')
root = tree.getroot()
data = [[c[0].text for c in r] for r in root[1][0][2:]]
types = [c[0].get('{urn:schemas-microsoft-com:office:spreadsheet}Type') for c in root[1][0][2]]
df = pd.DataFrame(data)
df = df.replace('-', None)
for c in df.columns:
if types[c] == 'Number':
df[c] = pd.to_numeric(df[c])
elif types[c] == 'DateTime':
df[c] = pd.to_datetime(df[c])
由于合并的单元格,从第0行和第1行获取列名会更复杂一些--我把它留给读者作为练习。
发布于 2022-02-02 20:35:41
我在这里发布完整的解决方案,其中包含上述已批准的解决方案(by @Stef),以及最终将头添加到DataFrame中。
'''
get xls file
convert to xml
parse into dataframe
add headers
'''
import pandas as pd
import xml.etree.ElementTree as ET
import shutil
file_xls = "C:\\Users\\admin\\Downloads\\product-screener.xls"
file_xml = 'C:\\Users\\admin\\Downloads\\product-screener.xml'
shutil.copyfile(file_xls, file_xml)
tree = ET.parse(file_xml)
root = tree.getroot()
data = [[c[0].text for c in r] for r in root[1][0][2:]]
types = [c[0].get('{urn:schemas-microsoft-com:office:spreadsheet}Type') for c in root[1][0][2]]
df = pd.DataFrame(data)
df = df.replace('-', None)
for c in df.columns:
if types[c] == 'Number':
df[c] = pd.to_numeric(df[c])
elif types[c] == 'DateTime':
df[c] = pd.to_datetime(df[c])
print(df)
headers = [[c[0].text for c in r] for r in root[1][0][:2]]
# print(headers[0])
# print(len(headers[0]))
# print()
# print(headers[1])
# print(len(headers[1]))
# print()
# upto column (AF) comes from headers[0]
df_headers = headers[0][0:32]
# the next 9 are discrete
x_list = ['discrete: ' + s for s in headers[1][0:9] ]
df_headers = df_headers + x_list
# the next 10 are annualised
x_list = ['annualised: ' + s for s in headers[1][9:19] ]
df_headers = df_headers + x_list
# the next 10 are cumulative
x_list = ['cumulative: ' + s for s in headers[1][19:29] ]
df_headers = df_headers + x_list
# the next 9 are calendar
x_list = ['calendar: ' + s for s in headers[1][29:38] ]
df_headers = df_headers + x_list
# the next 5 are portfolio characteristics (metrics)
x_list = ['metrics: ' + s for s in headers[1][38:43] ]
df_headers = df_headers + x_list
# the next 6 are portfolio characteristics
x_list = ['characteristics: ' + s for s in headers[1][43:49] ]
df_headers = df_headers + x_list
# the final 5 are sustainability characteristics
x_list = ['sustain: ' + s for s in headers[1][49:54] ]
df_headers = df_headers + x_list
print(df_headers)
# add headers to dataframe
df.columns = df_headers
print(df)
https://stackoverflow.com/questions/70954509
复制相似问题