How do I get statistics from a website and convert them into a DataFrame in Python?

Stack Overflow user
Asked 2019-05-18 02:50:56
Answers 1 · Views 61 · Followers 0 · Votes 0

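I'm trying to create a DataFrame from this website: http://mcubed.net/ncaab/seeds.shtml

I'm trying to turn these lists into a DataFrame so I can look at the history of each seed in the NCAA tournament. I'm not familiar with web scraping, and entering the data by hand would take a while. Is there an easier way than building this DataFrame manually?

I tried testing it with my own DataFrame, entering the data from the website by hand, but that is a very long process.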

import pandas as pd

# One column per seed; each column currently holds a single placeholder label
data = {"History of 1 Seed": ["1 seed versus 1 seed"],
        "History of 2 Seed": ["2 seed versus 1 seed"],
        "History of 3 Seed": ["3 seed versus 1 seed"],
        "History of 4 Seed": ["4 seed versus 1 seed"],
        "History of 5 Seed": ["5 seed versus 1 seed"],
        "History of 6 Seed": ["6 seed versus 1 seed"],
        "History of 7 Seed": ["7 seed versus 1 seed"],
        "History of 8 Seed": ["8 seed versus 1 seed"],
        "History of 9 Seed": ["9 seed versus 1 seed"],
        "History of 10 Seed": ["10 seed versus 1 seed"],
        "History of 11 Seed": ["11 seed versus 1 seed"],
        "History of 12 Seed": ["12 seed versus 1 seed"],
        "History of 13 Seed": ["13 seed versus 1 seed"],
        "History of 14 Seed": ["14 seed versus 1 seed"],
        "History of 15 Seed": ["15 seed versus 1 seed"],
        "History of 16 Seed": ["16 seed versus 1 seed"]}
df1 = pd.DataFrame(data)
df1

I've created my DataFrame, but I'm not sure how to enter values into it, and I'm hoping there is an easier way to do this. Thanks.


1 Answer

Stack Overflow user

Answered 2019-05-18 07:25:19

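Parsing the website

The first step is to parse the website and put the information into a DataFrame or a series of DataFrames. Here we use a combination of requests and BeautifulSoup to fetch the text and parse the HTML. The difficulty with your particular website is that the tables are just text rather than specific HTML elements, so we have to handle it slightly differently than usual.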

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from io import StringIO

url = 'http://mcubed.net/ncaab/seeds.shtml'

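# Download the page's raw HTML
data = requests.get(url).text

# Parse the HTML (the "html5lib" parser requires the html5lib package)
soup = BeautifulSoup(data, "html5lib")

# One DataFrame per table will be collected here
dflist = []

# If we look at the HTML, we don't want the <b> tag itself but the text next to it.
# b.next_element.next_element steps past the tag's own text to that block,
# and StringIO makes the text readable by pandas as if it were a file.
# The [2:-1] slice leaves out the first two <b> tags and the last one.
for b in soup.find_all("b")[2:-1]:
    dflist.append(pd.read_csv(StringIO(b.next_element.next_element), sep=r'\s+', header=None))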

dflist[0]

     0   1        2      3
0  vs.  #1  (23-23)  50.0%
1  vs.  #2  (40-35)  53.3%
2  vs.  #3  (25-15)  62.5%
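
To see why `pd.read_csv` with a whitespace separator is enough here, the following is a minimal offline sketch. The `sample_text` string is a made-up stand-in for one plain-text block from the page (its values are copied from the output above); nothing is fetched from the site, and the real text may contain more rows or extra lines.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample mimicking one plain-text block from the page
sample_text = """vs. #1  (23-23) 50.0%
vs. #2  (40-35) 53.3%
vs. #3  (25-15) 62.5%
"""

# Any run of whitespace acts as a separator; header=None because
# the text has no header row, so pandas numbers the columns 0..3
sample_df = pd.read_csv(StringIO(sample_text), sep=r"\s+", header=None)

print(sample_df.shape)
print(sample_df[2].tolist())
```

Every value stays a string at this point, because the parentheses and percent signs prevent pandas from inferring numeric types.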

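Cleaning and combining the DataFrames

Next we need to format all the DataFrames in the list. I have also decided to combine them into a single DataFrame, with the team name in one column and the opponent they are up against in another. This allows easy filtering to get whatever information we need.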

# Collect the melted frames in a new list, because melt returns a new DataFrame
# rather than modifying the ones in dflist in place
meltedDF = []

# enumerate(..., start=1) yields the team number alongside each DataFrame
for teamnumber, df in enumerate(dflist, start=1):

    # Creating the team name
    name = "Team " + str(teamnumber)

    # Add a column named after the team that holds the opponent label,
    # built by joining df[0] ("vs.") and df[1] ("#1", "#2", ...)
    df[name] = df[0] + df[1]

    # Melt so the team name becomes a value in the 'variable' column
    # and the opponent label lands in the 'value' column
    meltedDF.append(df.melt(id_vars = [0, 1, 2, 3]))

# Concat all the melted DataFrames
allTeamStats = pd.concat(meltedDF)

# Final cleaning: give the columns readable names and keep only the four we need
allTeamStats = allTeamStats.rename(columns = {2:'Record', 3:'Win Percent', 'variable':'Team' , 'value': 'VS'})\
                           .reindex(['Team', 'VS', 'Record', 'Win Percent'], axis = 1)

allTeamStats.head()

     Team    VS     Record  Win Percent
0   Team 1  vs.#1   (23-23) 50.0%
1   Team 1  vs.#2   (40-35) 53.3%
2   Team 1  vs.#3   (25-15) 62.5%
3   Team 1  vs.#4   (53-22) 70.7%
4   Team 1  vs.#5   (45-9)  83.3%
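
The melt step is the least obvious part, so here is a small sketch of it in isolation. The `toy` frame is hypothetical: two rows in the same shape as one entry of `dflist`, with values taken from the output above.

```python
import pandas as pd

# Hypothetical two-row frame shaped like one entry of dflist
toy = pd.DataFrame({0: ["vs.", "vs."], 1: ["#1", "#2"],
                    2: ["(23-23)", "(40-35)"], 3: ["50.0%", "53.3%"]})

# New column named after the team; its values are the opponent labels
toy["Team 1"] = toy[0] + toy[1]

# melt keeps columns 0-3 as identifiers; the name of the remaining column
# goes into 'variable' and its values go into 'value'
melted = toy.melt(id_vars=[0, 1, 2, 3])

print(melted.columns.tolist())
print(melted[["variable", "value"]].values.tolist())
```

After melting, every frame has the same six columns regardless of the team, which is what lets `pd.concat` stack them into one table.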

Querying our new DataFrame

Now that all the information is in one DataFrame, we can filter it to pull out whatever we want!

allTeamStats[allTeamStats['VS'] == 'vs.#1'].head()

     Team    VS     Record  Win Percent
0   Team 1  vs.#1   (23-23)   50.0%
0   Team 2  vs.#1   (35-40)   46.7%
0   Team 3  vs.#1   (15-25)   37.5%
0   Team 4  vs.#1   (22-53)   29.3%
0   Team 5  vs.#1   (9-45)    16.7%
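
Two related points can be sketched on a small stand-in table: filters can be combined to pick out a single matchup, and the repeated row label 0 in the output above (left over from concatenating frames that each start at 0) can be replaced with `reset_index(drop=True)`. The `sample` frame below is hypothetical, with values copied from the outputs shown earlier.

```python
import pandas as pd

# Hypothetical subset of allTeamStats, with values copied from the outputs above
sample = pd.DataFrame(
    {
        "Team": ["Team 1", "Team 1", "Team 2", "Team 3"],
        "VS": ["vs.#1", "vs.#2", "vs.#1", "vs.#1"],
        "Record": ["(23-23)", "(40-35)", "(35-40)", "(15-25)"],
        "Win Percent": ["50.0%", "53.3%", "46.7%", "37.5%"],
    },
    index=[0, 1, 0, 0],  # repeated labels, as left behind by pd.concat
)

# Combine two conditions with & (each condition needs its own parentheses)
one_matchup = sample[(sample["Team"] == "Team 2") & (sample["VS"] == "vs.#1")]

# Replace the repeated labels with a fresh 0..n-1 index
vs_top_seed = sample[sample["VS"] == "vs.#1"].reset_index(drop=True)

print(one_matchup["Record"].iloc[0])
print(vs_top_seed.index.tolist())
```

Passing `ignore_index=True` to `pd.concat` would be another way to avoid the repeated labels from the start.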

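If you want an easier way to look into a team's wins and losses, we can go further and create two new columns that split the wins and losses out of the record.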

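# Capture the digits after "(" as wins and the digits after "-" as losses.
# expand=False returns a Series for a single capture group; the values are still strings.
allTeamStats['Win'] = allTeamStats['Record'].str.extract(r'\((\d+)', expand=False)
allTeamStats['Lose'] = allTeamStats['Record'].str.extract(r'\(\d+-(\d+)', expand=False)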

allTeamStats.head()

     Team    VS     Record  Win Percent Win Lose
0   Team 1  vs.#1   (23-23)   50.0%     23  23
1   Team 1  vs.#2   (40-35)   53.3%     40  35
2   Team 1  vs.#3   (25-15)   62.5%     25  15
3   Team 1  vs.#4   (53-22)   70.7%     53  22
4   Team 1  vs.#5   (45-9)    83.3%     45  9
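
As a possible variation, both numbers can be pulled out in one pass with named capture groups and converted to integers so they can be used in arithmetic. This is only a sketch on a hypothetical `records` Series (values copied from the output above), not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for allTeamStats['Record']
records = pd.Series(["(23-23)", "(40-35)", "(45-9)"], name="Record")

# Named groups become the column names of the extracted DataFrame;
# astype(int) works here because every row matches the pattern
wins_losses = records.str.extract(r"\((?P<Win>\d+)-(?P<Lose>\d+)\)").astype(int)

# Recompute the win percentage from the counts, rounded to one decimal
wins_losses["Win Percent"] = (
    100 * wins_losses["Win"] / (wins_losses["Win"] + wins_losses["Lose"])
).round(1)

print(wins_losses)
```

If some rows might not match the pattern, the extracted values would contain NaN and the integer conversion would fail, so that case would need handling first.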
Votes 0
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/56192061
