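I'm trying to create a DataFrame from this website: http://mcubed.net/ncaab/seeds.shtml
I'd like to turn those lists into a DataFrame so I can look at the history of each seed in the NCAA tournament. I'm not familiar with web scraping, and entering the data by hand would take a long time. Is there an easier way to build this DataFrame than typing it in manually?
I tried building my own DataFrame as a test, planning to enter the data from the website by hand, but that is a very slow process.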
import pandas as pd
data= {"History of 1 Seed":["1 seed versus 1 seed"],
"History of 2 Seed":["2 seed versus 1 seed"],
"History of 3 Seed":["3 seed versus 1 seed"],
"History of 4 Seed":["4 seed versus 1 seed"],
"History of 5 Seed":["5 seed versus 1 seed"],
"History of 6 Seed":["6 seed versus 1 seed"],
"History of 7 Seed":["7 seed versus 1 seed"],
"History of 8 Seed":["8 seed versus 1 seed"],
"History of 9 Seed":["9 seed versus 1 seed"],
"History of 10 Seed":["10 seed versus 1 seed"],
"History of 11 Seed":["11 seed versus 1 seed"],
"History of 12 Seed":["12 seed versus 1 seed"],
"History of 13 Seed":["13 seed versus 1 seed"],
"History of 14 Seed":["14 seed versus 1 seed"],
"History of 15 Seed":["15 seed versus 1 seed"],
"History of 16 Seed":["16 seed versus 1 seed"]
}
df1= pd.DataFrame(data)
df1
I created my DataFrame, but I'm not sure how to fill in the values, and I'm hoping there is an easier way to do this. Thanks.
Posted on 2019-05-18 07:25:19
Parsing the website
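The first step is to parse the website and put the information into a DataFrame, or a series of DataFrames. Here we use a combination of requests and BeautifulSoup to fetch the text and parse the HTML. The difficulty with your particular site is that the tables are just plain text rather than dedicated HTML elements, so we have to handle them a little differently than usual.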
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from io import StringIO
url = 'http://mcubed.net/ncaab/seeds.shtml'
#Getting the website text
data = requests.get(url).text
#Parsing the website
soup = BeautifulSoup(data, "html5lib")
#Create an empty list
dflist = []
#If we look at the html, we don't want the <b> tag itself, but the text that follows it
#StringIO(b.next_element.next_element) wraps that text in a file-like object that pandas can read
for b in soup.find_all("b")[2:-1]:
    dflist.append(pd.read_csv(StringIO(b.next_element.next_element), sep = r'\s+', header = None))
dflist[0]
0 1 2 3
0 vs. #1 (23-23) 50.0%
1 vs. #2 (40-35) 53.3%
2 vs. #3 (25-15) 62.5%
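To see why `read_csv` can handle these plain-text tables, here is a minimal offline sketch. The `sample_text` string is a hypothetical stand-in copied from the output above, not text fetched from the site, so the real page may contain rows that need extra handling.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: three rows copied from the output shown above
sample_text = (
    "vs. #1 (23-23) 50.0%\n"
    "vs. #2 (40-35) 53.3%\n"
    "vs. #3 (25-15) 62.5%\n"
)

# sep=r'\s+' splits each line on runs of whitespace;
# header=None keeps the first line as data and numbers the columns 0..3
sample_df = pd.read_csv(StringIO(sample_text), sep=r'\s+', header=None)
print(sample_df)
```

Each whitespace-separated token becomes its own column, which is why "vs." and "#1" land in columns 0 and 1 and are joined back together in the next step.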
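Cleaning and combining the DataFrames
Next we need to format all of the DataFrames in the list. I also decided to combine them into a single DataFrame, with the team name in one column and the opponent seed in another. That makes it easy to filter for whatever information we need.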
#We need a new list, because melt returns a new DataFrame rather than
#modifying the ones in dflist in place
meltedDF = []
#The second item in the loop is the team number starting from 1
for df, teamnumber in zip(dflist, (np.arange(len(dflist))+1)):
    #Creating the team name
    name = "Team " + str(teamnumber)
    #Adding a column named after the team, holding df[0] and df[1] joined together (e.g. 'vs.#1')
    df[name] = df[0] + df[1]
    #Melting the dataframe to make the team name its own column
    meltedDF.append(df.melt(id_vars = [0, 1, 2, 3]))
# Concat all the melted DataFrames
allTeamStats = pd.concat(meltedDF)
# Final cleaning of our new single DataFrame
allTeamStats = allTeamStats.rename(columns = {2:'Record', 3:'Win Percent', 'variable':'Team' , 'value': 'VS'})\
    .reindex(['Team', 'VS', 'Record', 'Win Percent'], axis = 1)
allTeamStats.head()
Team VS Record Win Percent
0 Team 1 vs.#1 (23-23) 50.0%
1 Team 1 vs.#2 (40-35) 53.3%
2 Team 1 vs.#3 (25-15) 62.5%
3 Team 1 vs.#4 (53-22) 70.7%
4 Team 1 vs.#5 (45-9) 83.3%
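The melt step can be hard to picture, so here is a small sketch on a toy frame. The two rows are hypothetical values shaped like one parsed table and are not scraped data.

```python
import pandas as pd

# Toy frame with the same integer column labels that read_csv(header=None) produces
toy = pd.DataFrame({
    0: ['vs.', 'vs.'],
    1: ['#1', '#2'],
    2: ['(23-23)', '(40-35)'],
    3: ['50.0%', '53.3%'],
})

# Add a column named after the team that holds the joined opponent label
toy['Team 1'] = toy[0] + toy[1]

# melt keeps columns 0..3 as identifiers and turns the remaining column into
# two new columns: 'variable' (the old column name) and 'value' (its contents)
melted = toy.melt(id_vars=[0, 1, 2, 3])
print(melted)
```

After melting, the team name sits in `variable` and the opponent label in `value`, which is what the later rename to `Team` and `VS` relies on.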
Querying our new DataFrame
Now that all of the information is in one DataFrame, we can filter it to pull out whatever we want!
allTeamStats[allTeamStats['VS'] == 'vs.#1'].head()
Team VS Record Win Percent
0 Team 1 vs.#1 (23-23) 50.0%
0 Team 2 vs.#1 (35-40) 46.7%
0 Team 3 vs.#1 (15-25) 37.5%
0 Team 4 vs.#1 (22-53) 29.3%
0 Team 5 vs.#1 (9-45) 16.7%
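Every filtered row above carries index 0 because `pd.concat` keeps each frame's original index by default. If unique row labels are preferred, passing `ignore_index=True` is one option. The sketch below uses two hypothetical toy frames rather than the scraped data.

```python
import pandas as pd

# Two toy frames, each with its own default 0..1 index
first = pd.DataFrame({'Team': ['Team 1', 'Team 1'], 'VS': ['vs.#1', 'vs.#2']})
second = pd.DataFrame({'Team': ['Team 2', 'Team 2'], 'VS': ['vs.#1', 'vs.#2']})

# Default behaviour: the original index values are kept, so labels repeat
kept = pd.concat([first, second])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([first, second], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels are harmless for boolean filtering like the query above, so this is a matter of preference rather than a required change.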
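If you want an easier way to look at a team's wins and losses, we can go a step further and create two new columns that split the wins and losses out of the record.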
allTeamStats['Win'] = allTeamStats['Record'].str.extract(r'\((\d+)')
allTeamStats['Lose'] = allTeamStats['Record'].str.extract(r'\(\d+-(\d+)')
allTeamStats.head()
Team VS Record Win Percent Win Lose
0 Team 1 vs.#1 (23-23) 50.0% 23 23
1 Team 1 vs.#2 (40-35) 53.3% 40 35
2 Team 1 vs.#3 (25-15) 62.5% 25 15
3 Team 1 vs.#4 (53-22) 70.7% 53 22
4 Team 1 vs.#5 (45-9) 83.3% 45 9
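Note that `str.extract` returns strings, so the new columns cannot be compared numerically as they stand. A possible follow-up, sketched here on a hypothetical toy frame named `stats` rather than on `allTeamStats`, converts them to numbers in one pass using named groups.

```python
import pandas as pd

# Toy frame with the same 'Record' and 'Win Percent' layout as the output above
stats = pd.DataFrame({
    'Record': ['(23-23)', '(40-35)', '(45-9)'],
    'Win Percent': ['50.0%', '53.3%', '83.3%'],
})

# Named groups become column names, so one regex yields both Win and Lose
parts = stats['Record'].str.extract(r'\((?P<Win>\d+)-(?P<Lose>\d+)\)')

# Convert the extracted strings to integers and attach them to the frame
stats = stats.join(parts.astype(int))

# Strip the trailing percent sign and convert to float
stats['Win Percent'] = stats['Win Percent'].str.rstrip('%').astype(float)

# Numeric comparisons now work, e.g. matchups with more than 30 wins
print(stats[stats['Win'] > 30])
```

With numeric columns in place, sorting and aggregating by wins or win percentage behave as expected instead of comparing text.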
https://stackoverflow.com/questions/56192061