Here's a short snippet using the SoupStrainer class in BeautifulSoup:
<pre><code>import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
</code></pre>
The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:
<a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="noreferrer">https://www.crummy.com/software/BeautifulSoup/bs4/doc/</a>
Edit: Note that I used the SoupStrainer class because it is a bit more efficient (in both memory and speed) when you know in advance what you're parsing.
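The idea of looking only at <code>a</code> tags can also be sketched with just the standard library. This is not SoupStrainer itself, only a rough analogue built on <code>html.parser</code>; the names <code>AnchorCollector</code> and <code>extract_hrefs</code> and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags and ignore every other tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href of every <a> tag found in the HTML string."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Hypothetical sample markup: one link, one anchor without href, one other tag.
sample = '<p><a href="/about">About</a> <b>bold</b> <a name="top">no href</a></p>'
print(extract_hrefs(sample))
```

Anchors without an <code>href</code> are skipped, which mirrors the <code>has_attr('href')</code> check in the snippet above.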

<pre><code>import urllib2
import BeautifulSoup

request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a', href=True):  # skip anchors that have no href
    if 'national-park' in a['href']:
        print 'found a url with national-park in the link'
</code></pre>

The following code retrieves all the links in a webpage using <code>urllib2</code> and <code>BeautifulSoup4</code>:

<pre><code>import urllib2
from bs4 import BeautifulSoup

url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url)

for line in soup.find_all('a'):
    print(line.get('href'))
</code></pre>

BeautifulSoup can now use lxml under the hood. Requests, lxml &amp; list comprehensions make a killer combo.

<pre><code>import requests
import lxml.html

dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)

[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]
</code></pre>

In the list comprehension, the condition <code>if '//' in x and 'nytimes.com' not in x</code> is a simple way to scrub the site's own "internal" navigation URLs from the list, keeping only links to other sites.
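The substring test is quick, but it also drops any external URL that merely mentions the site's name, for example in a query string. A slightly stricter variant compares hostnames instead. This is only a sketch: <code>is_external</code> and the sample hrefs are hypothetical, and treating subdomains as internal is an assumption.

```python
from urllib.parse import urlparse


def is_external(href, site_host):
    """True if href is an absolute or protocol-relative URL on another host."""
    host = urlparse(href).hostname
    if host is None:
        # Relative links such as "/section/world" carry no host: internal.
        return False
    # Assumption: subdomains of site_host (www., static., ...) count as internal.
    return host != site_host and not host.endswith("." + site_host)


# Hypothetical hrefs, for illustration only.
hrefs = [
    "/section/world",
    "https://www.nytimes.com/section/us",
    "//static.example.org/logo.png",
    "https://example.com/?ref=nytimes.com",
]
print([h for h in hrefs if is_external(h, "nytimes.com")])
```

The last sample URL is kept here, whereas the plain <code>'nytimes.com' not in x</code> check would have discarded it.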

Just to get the links, without BeautifulSoup or regex:

<pre><code>import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("&lt;/a&gt;")
tag="&lt;a href=\""
endtag="\"&gt;"
for item in data:
    if "&lt;a href" in item:
        try:
            ind = item.index(tag)
            item = item[ind+len(tag):]
            end = item.index(endtag)
        except ValueError:
            # index() raises ValueError when the tag or its end is not found
            pass
        else:
            print item[:end]
</code></pre>

For more complex operations, BeautifulSoup is of course still preferred.

This script does what you're looking for, but also resolves relative links to absolute links.

<pre><code>import urllib
import lxml.html
import urlparse

def get_dom(url):
    connection = urllib.urlopen(url)
    return lxml.html.fromstring(connection.read())

def get_links(url):
    # Pass a list, not a generator: the links are iterated twice below
    # (once to guess the root, once to resolve each link).
    return resolve_links(list(get_dom(url).xpath('//a/@href')))

def guess_root(links):
    for link in links:
        if link.startswith('http'):
            parsed_link = urlparse.urlparse(link)
            scheme = parsed_link.scheme + '://'
            netloc = parsed_link.netloc
            return scheme + netloc

def resolve_links(links):
    root = guess_root(links)
    for link in links:
        if not link.startswith('http'):
            link = urlparse.urljoin(root, link)
        yield link

for link in get_links('http://www.google.com'):
    print link
</code></pre>

Why not use regular expressions?

<pre><code>import urllib2
import re
url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"&lt;a.*?\s*href=\"(.*?)\".*?&gt;(.*?)&lt;/a&gt;", page)
for link in links:
    print('href: %s, HTML text: %s' % (link[0], link[1]))
</code></pre>
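One thing worth checking before relying on this pattern is how it behaves on markup that is valid but written slightly differently. The small sketch below runs the same pattern on three made-up anchors; the sample HTML is an assumption for illustration only.

```python
import re

# Same pattern as in the answer above.
LINK_RE = re.compile(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>")

# Hypothetical sample: double-quoted, single-quoted and upper-case anchors.
html = (
    '<a href="/one">One</a>\n'
    "<a href='/two'>Two</a>\n"
    '<A HREF="/three">Three</A>\n'
)

# Only the double-quoted, lower-case anchor is matched.
print(LINK_RE.findall(html))
```

Single-quoted attributes and upper-case tags are skipped, so a real HTML parser is the safer choice when the page's markup style is not known in advance.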

Here's an example that builds on @ars's accepted answer and uses the <code>BeautifulSoup4</code>, <code>requests</code>, and <code>wget</code> modules to handle the downloads.

<pre><code>import requests
import wget
import os

from bs4 import BeautifulSoup, SoupStrainer

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'

response = requests.get(url)

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = url + link['href']
            wget.download(full_path)
</code></pre>

I got the answer by @Blairg23 working after the following correction (covering a scenario where it failed to work correctly):

<pre><code>for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = urlparse.urljoin(url, link['href'])  # the urlparse module needs to be imported
            wget.download(full_path)
</code></pre>

For Python 3:

<code>urllib.parse.urljoin</code> has to be used in order to obtain the full URL instead.
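A short Python 3 sketch of why joining is safer than plain concatenation: <code>urljoin</code> handles relative, root-relative and already-absolute hrefs. The base URL is the one used above; the three hrefs are hypothetical examples, not files known to exist on that server.

```python
from urllib.parse import urljoin

base = "https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/"

# Hypothetical hrefs: relative, root-relative, and already absolute.
hrefs = ["sample.tar.gz", "/ml/index.html", "https://example.com/data.tar.gz"]

for href in hrefs:
    # Plain concatenation (base + href) would only be right for the first one.
    print(urljoin(base, href))
```

With <code>base + href</code>, the second and third hrefs would produce malformed URLs, which is the scenario the correction above covers.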

A page can contain many duplicate links, both external and internal. To tell the two apart and keep only unique links, use sets:

<pre><code># Python 3.
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.espncricinfo.com/"
resp = urllib.request.urlopen(url)
# Get server encoding per recommendation of Martijn Pieters.
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
external_links = set()
internal_links = set()
for line in soup.find_all('a'):
    link = line.get('href')
    if not link:
        continue
    if link.startswith('http'):
        external_links.add(link)
    else:
        internal_links.add(link)

# Depending on usage, full internal links may be preferred.
full_internal_links = {
    urllib.parse.urljoin(url, internal_link)
    for internal_link in internal_links
}

# Print all unique external and full internal links.
for link in external_links.union(full_internal_links):
    print(link)
</code></pre>

<pre><code>import urllib2
from bs4 import BeautifulSoup
a=urllib2.urlopen('http://dir.yahoo.com')
code=a.read()
soup=BeautifulSoup(code)
links=soup.findAll("a")
#To get href part alone
print links[0].attrs['href']
</code></pre>

How can I retrieve the links of a webpage and copy the url address of the links using Python?

retrieve links from web page using python and BeautifulSoup
