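I've written a script in Python using BeautifulSoup to parse an address from a webpage. However, when I run the script below, I get AttributeError: 'NavigableString' object has no attribute 'text' when it reaches the line starting with address = [item.find_next_sibling().get_text(strip=True). If I try the commented-out lines instead, the error goes away. However, I would like to stick to the approach I'm currently using. What can I do?

This is my attempt: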
import requests
from bs4 import BeautifulSoup

URL = "https://beta.companieshouse.gov.uk/officers/lX9snXUPL09h7ljtMYLdZU9LmOo/appointments"

def fetch_names(session,link):
    session.headers = {"User-Agent":"Mozilla/5.0"}
    res = session.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select("#content-container dt"):
        #the error appears in the following line
        address = [item.find_next_sibling().get_text(strip=True) for item in items if "correspondence address" in item.text.lower()][0]
        print(address)

if __name__ == '__main__':
    with requests.Session() as session:
        fetch_names(session,URL)

I can get rid of the error by doing it as shown below, but I would like to stick to the approach I tried in my script:

items = soup.select("#content-container dt")
address = [item.find_next_sibling().get_text(strip=True) for item in items if "correspondence address" in item.text.lower()][0]
print(address)

Edit:
This is not an answer, but this is what I have been trying (I'm still not sure how to apply .find_previous_sibling()):

import requests
from bs4 import BeautifulSoup

URL = "https://beta.companieshouse.gov.uk/officers/lX9snXUPL09h7ljtMYLdZU9LmOo/appointments"

def fetch_names(session,link):
    session.headers = {"User-Agent":"Mozilla/5.0"}
    res = session.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select("#content-container dt"):
        address = [item for item in items.strings if "correspondence address" in item.lower()]
        print(address)

if __name__ == '__main__':
    with requests.Session() as session:
        fetch_names(session,URL)

It produces the following output (without the NavigableString error):

[]
['Correspondence address']
[]
[]

Posted on 2018-06-17 05:05:40
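items is not a list of nodes but a single node, so you should not iterate over it here with for item in items. Iterating over a single tag walks its children, and for a text-only dt element the only child is a NavigableString, which is why the .text lookup fails. Simply replace the list comprehension with the following:
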
for items in soup.select("#content-container dt"):
    if "correspondence address" in items.text.lower():
        address = items.find_next_sibling().get_text(strip=True)
        print(address)
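
The point above can be checked offline with a small sketch. The HTML snippet here is made up for illustration (it only assumes that the page pairs dt and dd elements inside #content-container), and find_address is a hypothetical helper name, not part of the original script.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup, loosely modelled on a dt/dd layout; not copied from the real page.
HTML = """
<div id="content-container">
  <dl>
    <dt>Name</dt><dd>Jane Doe</dd>
    <dt>Correspondence address</dt><dd> 1 Example Street, Exampletown </dd>
  </dl>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
dts = soup.select("#content-container dt")

# Iterating over one Tag yields its children; a text-only <dt> has a single
# NavigableString child, which is what the list comprehension was looping over.
child_types = [type(child) for child in dts[1]]


def find_address(soup):
    """Return the text of the <dd> that follows the matching <dt>, or None."""
    for dt in soup.select("#content-container dt"):
        # Test the tag itself instead of iterating over it.
        if "correspondence address" in dt.get_text().lower():
            return dt.find_next_sibling("dd").get_text(strip=True)
    return None


address = find_address(soup)
print(child_types, address)
```

Returning None when nothing matches avoids the IndexError that indexing an empty list with [0] would raise on dt elements that do not match.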

Posted on 2018-06-16 02:35:24
You can change the BeautifulSoup selector to look up the correspondence address directly by its id, #correspondence-address-value-1.

import requests
from bs4 import BeautifulSoup

URL = "https://beta.companieshouse.gov.uk/officers/lX9snXUPL09h7ljtMYLdZU9LmOo/appointments"

def fetch_names(session,link):
    session.headers = {"User-Agent":"Mozilla/5.0"}
    res = session.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    addresses = [a.text for a in soup.select("#correspondence-address-value-1")]
    print(addresses)

if __name__ == '__main__':
    with requests.Session() as session:
        fetch_names(session,URL)

Result:

13:32 $ python test.py
['21 Maes Y Llan, Conwy, Wales, LL32 8NB']
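
If only a single address is expected, the same id lookup could be written with select_one, which returns one element or None. The following is a minimal sketch under that assumption; the inline HTML and the address_by_id helper are hypothetical and exist only so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the id from the selector above.
HTML = (
    '<dl><dt>Correspondence address</dt>'
    '<dd id="correspondence-address-value-1"> 1 Example Street </dd></dl>'
)


def address_by_id(html, element_id="correspondence-address-value-1"):
    """Return the stripped text of the element with the given id, or None."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(f"#{element_id}")
    # select_one returns None when nothing matches, so guard before reading text.
    return node.get_text(strip=True) if node is not None else None


print(address_by_id(HTML))
```

The None check matters because the id may be absent on other pages, in which case calling get_text on the result would raise AttributeError.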
https://stackoverflow.com/questions/50880811