
How do I convert this list into a DataFrame?

Stack Overflow user
Asked 2019-11-07 20:01:03
Answers: 3 | Views: 81 | Followers: 0 | Votes: 1

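I am looping through web pages, grabbing a row from a table on each page, and appending each row to a DataFrame. However, I end up with a list that I cannot concatenate into a single DataFrame. How can I convert this list so that pd.concat() works on it?

I tried pd.DataFrame(data), but it raises KeyError: 0.

Here is the result of print(data) (https://imgur.com/a/t0v0QaU):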

```text
[          Unnamed: 0 2015-2016 2016-2017 2017-2018
0  Average net price    $6,497    $8,311    $7,035,           Unnamed: 0 2015-2016 2016-2017 2017-2018
0  Average net price   $26,916   $27,175   $27,584,           Unnamed: 0 2015-2016 2016-2017 2017-2018
0  Average net price    $8,123    $8,022    $7,687,           Unnamed: 0 2015-2016 2016-2017 2017-2018
0  Average net price         —   $16,694   $21,842,           Unnamed: 0 2015-2016 2016-2017 2017-2018
0  Average net price   $13,888   $12,989   $13,314,           Unnamed: 0 2015-2016 2016-2017 2017-2018
0  Average net price   $28,095   $27,925   $28,406,           Unnamed: 0 2015-2016 2016-2017 2017-2018
0  Average net price    $7,242    $6,960    $8,436,           Unnamed: 0 2015-2016 2016-2017 2017-2018
0  Average net price   $25,839   $26,930   $26,710,           Unnamed: 0 2015-2016 2016-2017 2017-2018
0  Average net price   $18,603   $16,450   $17,145]
```

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

data = []
url = 'https://nces.ed.gov/collegenavigator/?id='
ids = pd.read_excel('ids.xlsx')
for index, row in ids.iterrows():
    try:
        r = requests.get(url+str(row[0]))
        soup = bs(r.content, 'lxml')
        table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')))
        data.append(table[0])
    except:
        pass
print(data)
```

The IDs are:

```text
UnitID
180203
177834
222178
138558
412173
126182
188429
188438
168528
133872
```

Ideally, I would like the output to have an id column plus one column per year range (2015-2016, 2016-2017, etc.), filled in as a matrix like this: https://imgur.com/a/RC0hoGz

3 Answers

Stack Overflow user

Accepted answer

Posted 2019-11-07 20:34:27

Basically, just save the id in a separate column of the parsed DataFrame. Right now it is being discarded.

```python
...
for index, row in ids.iterrows():
    try:
        r = requests.get(url+str(row[0]))
        soup = bs(r.content, 'lxml')
        # index_col=0 turns the "Unnamed: 0" label column into the index
        table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')), index_col=0)[0]
        table['id'] = row[0]  # save the id in a separate column
        data.append(table.set_index('id'))
    except Exception:
        # skip pages where the request fails or the table is missing
        pass

df = pd.concat(data)
```

Result:

```text
       2015-2016 2016-2017 2017-2018
id                                  
180203    $6,497    $8,311    $7,035
222178   $26,916   $27,175   $27,584
138558    $8,123    $8,022    $7,687
412173         —   $16,694   $21,842
126182   $13,888   $12,989   $13,314
188429   $28,095   $27,925   $28,406
188438    $7,242    $6,960    $8,436
168528   $25,839   $26,930   $26,710
133872   $18,603   $16,450   $17,145
```
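
The same pattern can be tried without any network access. The following is a minimal sketch, not the answerer's code: the helper `make_table` and the `scraped` dict are hypothetical stand-ins for what `pd.read_html` returns per page, with values copied from the printed list in the question.

```python
import pandas as pd

def make_table(prices):
    """Hypothetical helper: build a one-row frame shaped like the scraped table."""
    return pd.DataFrame(
        [["Average net price", *prices]],
        columns=["Unnamed: 0", "2015-2016", "2016-2017", "2017-2018"],
    )

# Hypothetical stand-ins for the per-page tables, keyed by UnitID.
scraped = {
    180203: make_table(["$6,497", "$8,311", "$7,035"]),
    222178: make_table(["$26,916", "$27,175", "$27,584"]),
}

frames = []
for unit_id, table in scraped.items():
    table = table.set_index("Unnamed: 0")  # mimics index_col=0 in read_html
    table["id"] = unit_id                  # tag the row with its id
    frames.append(table.set_index("id"))   # the id replaces the label index

df = pd.concat(frames)  # a list of DataFrames concatenates directly
print(df)
```

Because every frame shares the same year columns, `pd.concat` stacks them row by row, and the `id` index identifies which page each row came from.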
Votes: 2

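Stack Overflow user

Posted 2019-11-07 20:28:33

Cool question.

Whenever you do anything with pandas, it usually gives you a Series or a DataFrame as output. So when you create a list called data and append table[0] to it, you think you are appending a list to it (I assume). But pd.read_html returns a list of DataFrames, so table[0] is a DataFrame. You therefore just need to create data as a DataFrame and then append each DataFrame to it.

Here is the solution: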

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

data = pd.DataFrame()
url = 'https://nces.ed.gov/collegenavigator/?id='
ids = pd.read_excel('ids.xlsx')
for index, row in ids.iterrows():
    try:
        r = requests.get(url+str(row[0]))
        soup = bs(r.content, 'lxml')
        table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')))
        # Note: DataFrame.append was removed in pandas 2.0, so this line
        # only works on older pandas versions.
        data = data.append(table[0], ignore_index=True)
    except Exception:
        # skip pages where the request fails or the table is missing
        pass
```

Hope this helps.
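
`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the approach above needs `pd.concat` instead. A minimal sketch of the equivalent, assuming two hypothetical in-memory tables in place of the scraped ones:

```python
import pandas as pd

# Hypothetical stand-ins for the table[0] frames returned by pd.read_html.
tables = [
    pd.DataFrame({"2015-2016": ["$6,497"], "2016-2017": ["$8,311"]}),
    pd.DataFrame({"2015-2016": ["$26,916"], "2016-2017": ["$27,175"]}),
]

collected = []
for table in tables:
    collected.append(table)  # list.append, not DataFrame.append

# One concat at the end; ignore_index=True renumbers the rows 0..n-1,
# matching what data.append(..., ignore_index=True) used to produce.
data = pd.concat(collected, ignore_index=True)
print(data)
```

Concatenating once after the loop also avoids copying the growing frame on every iteration.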

Votes: 1

Stack Overflow user

Posted 2019-11-07 20:32:22

Use:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

data = []
url = 'https://nces.ed.gov/collegenavigator/?id='
ids = pd.read_excel('ids.xlsx')

for index, row in ids.iterrows():
    try:
        r = requests.get(url+str(row[0]))
        soup = bs(r.content, 'lxml')
        table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')))
        dataframe = table[0]
        # row holds the single UnitID value, so it becomes the one-row index
        dataframe.index = row
        data.append(dataframe)
    except Exception:
        # skip pages where the request fails or the table is missing
        pass

df_values = (pd.concat(data, sort=False)
               .drop('Unnamed: 0', axis=1)
               .rename_axis(index='UnitID'))
print(df_values)
```

Output:

```text
        2015-2016 2016-2017 2017-2018
UnitID                              
180203    $6,497    $8,311    $7,035
222178   $26,916   $27,175   $27,584
138558    $8,123    $8,022    $7,687
412173         —   $16,694   $21,842
126182   $13,888   $12,989   $13,314
188429   $28,095   $27,925   $28,406
188438    $7,242    $6,960    $8,436
168528   $25,839   $26,930   $26,710
133872   $18,603   $16,450   $17,145
```
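
An alternative way to label each row with its id, not used in any of the answers here, is the `keys` argument of `pd.concat`. The sketch below assumes hypothetical in-memory tables and an id list kept in the same order as the tables:

```python
import pandas as pd

# Hypothetical ids and tables, in matching order.
unit_ids = [180203, 222178]
tables = [
    pd.DataFrame({"Unnamed: 0": ["Average net price"],
                  "2015-2016": ["$6,497"], "2017-2018": ["$7,035"]}),
    pd.DataFrame({"Unnamed: 0": ["Average net price"],
                  "2015-2016": ["$26,916"], "2017-2018": ["$27,584"]}),
]

df_values = (
    pd.concat(tables, keys=unit_ids)  # outer index level = id, inner = old row number
      .droplevel(1)                   # drop the old row number level
      .rename_axis("UnitID")
      .drop(columns="Unnamed: 0")
)
print(df_values)
```

This leaves the individual tables unmodified, but it only works if ids for failed pages are left out of `unit_ids` so the two lists stay aligned.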
Votes: 1

Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/58756137
