Q: How do I extract unspecified links from HTML with BeautifulSoup?

Stack Overflow user

Asked 2018-08-04 04:25:53

Answers: 2 · Views: 335 · Followers: 0 · Votes: 1

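I haven't found a good answer on how to extract links from an HTML document. I've seen answers where you specify the link directly, but what if you want to extract a URL you don't know in advance? I just want to know whether that is possible. Here is the HTML I have:

I put this into PyCharm: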

html = """
<
<html>
<head>
    <title>About me</title>

</head>

<body>
<h1>About Me</h1>

<h4>My Hobbies</h4>
<a href="http://www.google.com"> hello world </a>
<a href="http://www.nytimes.com">byeworld </a>

<ul>
    <li>Cooking</li>
    <li>Gym</li>
    <li>Code</li>
</ul>
</body>
</html> """

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.get_text())

#<html>
#<head>
#   <title>About me</title>
#</head>
#<body>
#<h1>About Me</h1>
#<h4>My Hobbies</h4>
# <a href="http://www.google.com"> hello world </a>
# <a href="http://www.nytimes.com">byeworld </a>
#<ul>
#   <li>Cooking</li>
#   <li>Gym</li>
#   <li>Code</li>
#</ul>
#</body>
#</html>

The output I get is:

About me


About Me
My Hobbies


Cooking
Gym
Code

This is basically what I want, but I would also like the two URLs to be extracted as plain text.

I tried

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a', href=True):
print(link['href'])
print(soup.get_text())

and

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
soup.find_all("a")
    for link in soup.final_all('a'):
print(link.get('href'))
print(soup.get_text())

I'm just really confused about how to do this. Could anyone help?

python

beautifulsoup

urllib2

2 Answers

Stack Overflow user

Answered 2018-08-04 05:09:44

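Your first code block works as long as you indent the code after the for loop. In Python, indentation is what delimits a code block, so anything indented one level deeper than the for statement runs on every iteration of the loop.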

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a', href=True):
    print(link['href'])
print(soup.get_text())

This should print:

http://www.google.com
http://www.nytimes.com

<


About me


About Me
My Hobbies
 hello world
byeworld

Cooking
Gym
Code

Note that your HTML also contains a stray <.
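If the goal is to have each URL appear inside the plain text rather than printed separately, one possible approach is to replace every link with its label followed by its URL before calling get_text(). The following is only a minimal sketch under that assumption: the helper name text_with_links and the "label (url)" format are made up for illustration, and the HTML is a shortened copy of the question's document.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (without the stray "<").
html = """<html>
<body>
<h4>My Hobbies</h4>
<a href="http://www.google.com"> hello world </a>
<a href="http://www.nytimes.com">byeworld </a>
<ul>
    <li>Cooking</li>
</ul>
</body>
</html>"""


def text_with_links(markup):
    """Return the document text with each link rendered as 'label (url)'.

    Hypothetical helper for illustration, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # find_all returns a list, so replacing tags while looping is safe.
    for a in soup.find_all("a", href=True):
        label = a.get_text(strip=True)
        # Swap the whole <a> tag for a plain string holding label and URL.
        a.replace_with(f"{label} ({a['href']})")
    return soup.get_text()


print(text_with_links(html))
```

Because the URLs are written into the tree before get_text() runs, they keep their position relative to the surrounding text.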

Votes: 0

Stack Overflow user

Answered 2018-08-04 05:17:13

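Your code is almost fine. To select all <a> tags that have an href attribute, you can use the CSS selector soup.select('a[href]'). Then just iterate over the matched elements and print the URL and text of each one: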

html = """<html>
<head>
    <title>About me</title>

</head>

<body>
<h1>About Me</h1>

<h4>My Hobbies</h4>
<a href="http://www.google.com"> hello world </a>
<a href="http://www.nytimes.com">byeworld </a>

<ul>
    <li>Cooking</li>
    <li>Gym</li>
    <li>Code</li>
</ul>
</body>
</html> """

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

print([(a['href'], a.text.strip()) for a in soup.select('a[href]')])

Prints:

[('http://www.google.com', 'hello world'), ('http://www.nytimes.com', 'byeworld')]
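To illustrate what the href filter buys you, here is a small sketch comparing find_all('a', href=True) with select('a[href]'). The extra <a name="top"> anchor is a hypothetical addition, not part of the original HTML, included only to show that anchors without an href are skipped by both calls.

```python
from bs4 import BeautifulSoup

# The first anchor has no href; it is a made-up addition for this example.
html = """<body>
<a name="top">no href here</a>
<a href="http://www.google.com"> hello world </a>
<a href="http://www.nytimes.com">byeworld </a>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# Both calls keep only <a> tags that actually carry an href attribute.
via_find_all = [a["href"] for a in soup.find_all("a", href=True)]
via_select = [a["href"] for a in soup.select("a[href]")]

# Without the filter, the href-less anchor is included as well,
# and .get('href') returns None for it.
all_hrefs = [a.get("href") for a in soup.find_all("a")]

print(via_find_all)
print(via_select)
print(all_hrefs)
```

Filtering on href up front means link['href'] cannot raise a KeyError inside the loop, which is why both answers use it.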

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/51679679
