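I wrote the code below to practice web scraping with Python, Pandas, etc. In general, I am trying to follow four steps to get the output I want.
I was able to get #1 and #2 working. The components of #3 seem to work, but I believe something is wrong with my try/except: if I run just one line of code to scrape a specific playerUrl, the tables DF populates as expected. The first player scraped has no data, so I believe I am failing with the error catching.
For #4, I really couldn't find a solution. How do I add the name to the list while iterating in the for loop?
Any help is greatly appreciated.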
import requests
import pandas as pd
from bs4 import BeautifulSoup

### get the player data to create player specific urls
res = requests.get("https://www.mlssoccer.com/players?page=0")
soup = BeautifulSoup(res.content,'html.parser')
data = soup.find('div', class_ = 'item-list' )
names=[]
for player in data:
    name = data.find_all('div', class_ = 'name')
for obj in name:
    names.append(obj.find('a').text.lower().lstrip().rstrip().replace(' ','-'))

### create a list of player specific urls
url = 'https://www.mlssoccer.com/players/'
playerUrl = []
x = 0
for name in (names):
    playerList = names
    newUrl = url + str(playerList[x])
    print("Gathering url..."+newUrl)
    playerUrl.append(newUrl)
    x +=1

### now take the list of urls and gather stats tables
tbls = []
i = 0
for url in (playerUrl):
    try: ### added the try, except, pass because some players have no stats table
        tables = pd.read_html(playerUrl[i], header = 0)[2]
        tbls.append(tables)
        i +=1
    except Exception:
        continue

Answered on 2019-01-02 06:21:46
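One likely reason nothing gets collected after the first failure: if the counter increment sits inside the try block (which is how the flattened code in the question reads, so treat that as an assumption), it is skipped whenever the scrape raises, and every later iteration retries the same failing URL. The sketch below reproduces the effect offline. `fetch_table` and the URLs are hypothetical stand-ins for `pd.read_html(...)[2]` and the real player pages.

```python
def fetch_table(url):
    """Hypothetical stand-in for pd.read_html(url)[2]; raises for players without stats."""
    if url.endswith("no-stats"):
        raise ValueError("No tables found")
    return "table for " + url


# Made-up URLs; the first player has no stats table.
urls = ["players/no-stats", "players/a", "players/b"]

# Pattern from the question: the counter only advances on success.
stalled = []
i = 0
for url in urls:
    try:
        stalled.append(fetch_table(urls[i]))
        i += 1  # skipped whenever fetch_table raises, so i stays at 0
    except Exception:
        continue

# Using the loop variable directly: a failure skips that one entry only.
collected = []
for url in urls:
    try:
        collected.append(fetch_table(url))
    except Exception:
        continue

print(stalled)    # []
print(collected)  # ['table for players/a', 'table for players/b']
```

Dropping the manual index removes the problem entirely, because the loop variable moves on regardless of what happens inside the try block.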
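There are a few things you can do to improve the code and get steps 3 and 4 done.
(i) When looping with for name in names, there is no need for an explicit index; just use the loop variable name.
(ii) You can save each player's name and its corresponding URL in a dict, with the name as the key. Then in steps 3 and 4 you can use that name.
(iii) Construct a DataFrame for each parsed HTML table and add the player's name to it as a column. Save each of these DataFrames separately.
(iv) Finally, concatenate these DataFrames to form a single DataFrame.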
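Points (iii) and (iv) can be tried without touching the network. This is only a sketch: the two small tables, their column names (`Year`, `G`) and the player slugs are made up to stand in for whatever `pd.read_html` returns for a real player page.

```python
import pandas as pd

# Made-up tables standing in for the parsed stats tables of two players.
parsed = {
    "player-one": pd.DataFrame({"Year": [2017, 2018], "G": [3, 5]}),
    "player-two": pd.DataFrame({"Year": [2018], "G": [1]}),
}

frames = []
for name, table in parsed.items():
    tagged = table.copy()     # leave the parsed table untouched
    tagged["Player"] = name   # the scalar is broadcast to every row
    frames.append(tagged)

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each table's own index.
result = pd.concat(frames, ignore_index=True)
print(result)
```

Whether to pass `ignore_index=True` is a matter of taste; without it, each player's rows keep their original row numbers.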
Here is your code modified with the changes suggested above:
import requests
import pandas as pd
from bs4 import BeautifulSoup

### get the player data to create player specific urls
res = requests.get("https://www.mlssoccer.com/players?page=0")
soup = BeautifulSoup(res.content,'html.parser')
data = soup.find('div', class_ = 'item-list' )
names=[]
for player in data:
    name = data.find_all('div', class_ = 'name')
for obj in name:
    names.append(obj.find('a').text.lower().lstrip().rstrip().replace(' ','-'))

### create a dict of player specific urls, keyed by player name
url = 'https://www.mlssoccer.com/players/'
playerUrl = {}
for name in names:
    newUrl = url + str(name)
    print("Gathering url..."+newUrl)
    playerUrl[name] = newUrl

### now take the urls and gather stats tables
tbls = []
for name, url in playerUrl.items():
    try:
        tables = pd.read_html(url, header = 0)[2]
        df = pd.DataFrame(tables)
        df['Player'] = name
        tbls.append(df)
    except Exception as e:
        print(e)
        continue
result = pd.concat(tbls)
print(result.head())

https://stackoverflow.com/questions/54001743