I always use <a href="http://lxml.de/" rel="nofollow noreferrer">lxml</a> for such tasks. You could use <a href="http://www.crummy.com/software/BeautifulSoup/" rel="nofollow noreferrer">BeautifulSoup</a> as well.
<pre><code>import lxml.html
t = lxml.html.parse(url)
print(t.find(&quot;.//title&quot;).text)
</code></pre>
EDIT based on comment:
<pre><code>from urllib.request import urlopen  # Python 3; on Python 2: from urllib2 import urlopen
from lxml.html import parse

url = &quot;https://www.google.com&quot;
page = urlopen(url)
p = parse(page)
print(p.find(&quot;.//title&quot;).text)
</code></pre>


No need to import any library other than Requests. You can slice the title out of the response text with plain string methods.

<pre><code>&gt;&gt;&gt; import requests
&gt;&gt;&gt; headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
&gt;&gt;&gt; n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
&gt;&gt;&gt; al = n.text
&gt;&gt;&gt; al[al.find('&lt;title&gt;') + 7 : al.find('&lt;/title&gt;')]
u'Friends (TV Series 1994\u20132004) - IMDb'
</code></pre>
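The slicing above assumes that both tags are present; if either is missing, <code>find</code> returns -1 and the slice silently yields the wrong text. A minimal sketch of a more defensive variant, working on an already downloaded string (the helper name <code>title_by_slicing</code> is hypothetical and not part of Requests):

```python
def title_by_slicing(html):
    """Return the text between <title> and </title>, or None if a tag is missing."""
    open_tag, close_tag = '<title>', '</title>'
    start = html.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)  # move past the opening tag
    end = html.find(close_tag, start)
    if end == -1:
        return None
    return html[start:end].strip()


sample = '<html><head><title> Example Domain </title></head></html>'
print(title_by_slicing(sample))
print(title_by_slicing('<html><body>no title here</body></html>'))
```

With Requests, this would be called as <code>title_by_slicing(n.text)</code>. It still only matches a lowercase opening tag without attributes, which is why a real parser is usually the safer choice.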


The mechanize Browser object has a title() method. So the code from <a href="https://stackoverflow.com/questions/51233/how-can-i-retrieve-the-page-title-of-a-webpage-using-python#51242">this post</a> can be rewritten as:

<pre><code>from mechanize import Browser
br = Browser()
br.open("http://www.google.com/")
print(br.title())
</code></pre>


This is probably overkill for such a simple task, but if you plan to do more than that, it is saner to start with these tools (mechanize, BeautifulSoup), because they are much easier to use than the alternatives (urllib to fetch the content and regexes or some other parser to parse the HTML).
Links:
<a href="http://crummy.com/software/BeautifulSoup" rel="nofollow noreferrer">BeautifulSoup</a>
<a href="http://wwwsearch.sourceforge.net/mechanize/" rel="nofollow noreferrer">mechanize</a>
<pre><code>#!/usr/bin/env python
#coding:utf-8

from bs4 import BeautifulSoup
from mechanize import Browser

#This retrieves the webpage content
br = Browser()
res = br.open(&quot;https://www.google.com/&quot;)
data = res.get_data() 

#This parses the content
soup = BeautifulSoup(data, 'html.parser')
title = soup.find('title')

#This outputs the content :)
print(title.get_text())
</code></pre>


Using <a href="https://docs.python.org/3.4/library/html.parser.html" rel="noreferrer">HTMLParser</a>:
<pre><code>from urllib.request import urlopen
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.match = False
        self.title = ''

    def handle_starttag(self, tag, attributes):
        self.match = tag == 'title'

    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False

url = &quot;http://example.com/&quot;
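# Decode the downloaded bytes; str() on bytes would produce the "b'...'" representation instead.
html_string = urlopen(url).read().decode(&quot;utf-8&quot;, errors=&quot;replace&quot;)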

parser = TitleParser()
parser.feed(html_string)
print(parser.title) # prints: Example Domain
</code></pre>
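The parser above keeps only the first piece of text it sees after the opening tag. <code>handle_data</code> may be called more than once for a single element, for example when the document is fed in pieces, so a variant that collects every piece until the closing tag is a little more robust. This is a sketch under that assumption; the class name <code>FullTitleParser</code> is made up for illustration:

```python
from html.parser import HTMLParser


class FullTitleParser(HTMLParser):
    """Collect all text that appears between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

    @property
    def title(self):
        return ''.join(self.parts).strip()


parser = FullTitleParser()
# Feeding in two chunks may deliver the title text in more than one handle_data call.
parser.feed('<html><head><title>Example ')
parser.feed('Domain</title></head><body><p>text</p></body></html>')
print(parser.title)
```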


Use <code>soup.select_one</code> to target the title tag:

<pre><code>import requests
from bs4 import BeautifulSoup as bs

r = requests.get('url')
soup = bs(r.content, 'lxml')
print(soup.select_one('title').text)
</code></pre>


Using regular expressions:

<pre><code>import re
# raw_html is the page source as a string
match = re.search(r'&lt;title[^&gt;]*&gt;(.*?)&lt;/title&gt;', raw_html, re.IGNORECASE | re.DOTALL)
title = match.group(1) if match else 'No title'
</code></pre>
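A minimal sketch that wraps the regular-expression approach in a function and cleans up the result. The function name <code>title_from_html</code> and the whitespace and entity handling are additions for illustration, not part of the answer above:

```python
import html
import re

# Case-insensitive, allows attributes on the tag and newlines inside the title.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)


def title_from_html(raw_html, default='No title'):
    match = TITLE_RE.search(raw_html)
    if not match:
        return default
    # Decode entities such as &amp; and collapse runs of whitespace.
    text = html.unescape(match.group(1))
    return ' '.join(text.split())


sample = '<html><head><TITLE>\n  Tom &amp; Jerry\n</TITLE></head></html>'
print(title_from_html(sample))
print(title_from_html('<p>nothing</p>'))
```

Regular expressions remain fragile on real-world HTML (for instance, a title mentioned inside a comment or script would also match), so this suits quick scripts better than general-purpose scraping.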


<code>soup.title.string</code> actually returns a Unicode string.
To convert it to a plain ASCII byte string, dropping any characters that ASCII cannot represent, use
<code>string=string.encode('ascii','ignore')</code>
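To make the effect of <code>'ignore'</code> concrete, here is a small sketch using a title string that appears elsewhere on this page as sample data. It compares the ASCII conversion with UTF-8, which is shown only as a lossless alternative:

```python
title = u'Friends (TV Series 1994\u20132004) - IMDb'

# 'ignore' silently drops the en dash, which ASCII cannot represent.
ascii_bytes = title.encode('ascii', 'ignore')

# UTF-8 keeps the en dash (encoded as three bytes), so decoding restores the original.
utf8_bytes = title.encode('utf-8')

print(ascii_bytes)
print(utf8_bytes.decode('utf-8') == title)
```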


Here is a fault-tolerant <code>HTMLParser</code> implementation.
You can throw pretty much anything at <code>get_title()</code> without it breaking. If anything unexpected happens,
<code>get_title()</code> will return <code>None</code>.
When <code>Parser()</code> downloads the page, it converts it to <code>ASCII</code>
regardless of the charset used in the page, ignoring any errors.
It would be trivial to change <code>to_ascii()</code> to convert the data to <code>UTF-8</code> or any other encoding: just add an encoding argument and rename the function to something like <code>to_encoding()</code>.
By default, <code>HTMLParser()</code> will break on broken HTML; it will even break on trivial things like mismatched tags. To prevent this behavior, I replaced <code>HTMLParser()</code>'s error method with a function that ignores the errors.

<pre><code>#-*-coding:utf8;-*-
#qpy:3
#qpy:console

''' 
Extract the title from a web page using
the standard lib.
'''

from html.parser import HTMLParser
from urllib.request import urlopen
import urllib.error

def error_callback(*_, **__):
    pass

def is_string(data):
    return isinstance(data, str)

def is_bytes(data):
    return isinstance(data, bytes)

def to_ascii(data):
    # Always return an ASCII-only str (feed() expects str), dropping other characters.
    if is_bytes(data):
        return data.decode('ascii', errors='ignore')
    if not is_string(data):
        data = str(data)
    return data.encode('ascii', errors='ignore').decode('ascii')


class Parser(HTMLParser):
    def __init__(self, url):
        self.title = None
        self.rec = False
        HTMLParser.__init__(self)
        # Install the no-op error handler before parsing so it is in effect during feed().
        self.error = error_callback
        try:
            self.feed(to_ascii(urlopen(url).read()))
        except urllib.error.HTTPError:
            return
        except urllib.error.URLError:
            return
        except ValueError:
            return

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.rec = True

    def handle_data(self, data):
        if self.rec:
            self.title = data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.rec = False


def get_title(url):
    return Parser(url).title

print(get_title('http://www.google.com'))
</code></pre>
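The <code>to_encoding()</code> variant suggested in the text could look roughly like the sketch below. The parameter name <code>encoding</code> and its default are assumptions, and the function returns a <code>str</code> because <code>HTMLParser.feed()</code> expects text rather than bytes:

```python
def to_encoding(data, encoding='utf-8'):
    """Return data as a str, keeping only what the given encoding can represent."""
    if isinstance(data, bytes):
        # Undecodable bytes are dropped rather than raising an error.
        return data.decode(encoding, errors='ignore')
    if not isinstance(data, str):
        data = str(data)
    # Round-trip through the encoding to drop unrepresentable characters.
    return data.encode(encoding, errors='ignore').decode(encoding)


page = 'Caf\u00e9 \u2013 menu'.encode('utf-8')
print(to_encoding(page))           # decoded as UTF-8, nothing lost
print(to_encoding(page, 'ascii'))  # non-ASCII bytes dropped
```

Inside <code>Parser.__init__</code>, the call would then become <code>self.feed(to_encoding(urlopen(url).read()))</code>.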


In Python 3, we can call <code>urlopen</code> from <code>urllib.request</code> and <code>BeautifulSoup</code> from the <code>bs4</code> library to fetch the page title.
<pre><code>from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen(&quot;https://www.google.com&quot;)
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)
</code></pre>
Here we use the <code>lxml</code> parser, which is generally the fastest of the parsers BeautifulSoup supports. It must be installed separately.


Using lxml...

Getting it from page meta tagged according to the Facebook opengraph protocol:

<pre><code>import lxml.html
html_doc = lxml.html.parse(some_url)

t = html_doc.xpath('//meta[@property="og:title"]/@content')[0]
</code></pre>

or using .xpath with lxml:

<pre><code>t = html_doc.xpath(".//title")[0].text
</code></pre>
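Both XPath lookups above raise <code>IndexError</code> when the element is missing, because <code>xpath()</code> then returns an empty list. The sketch below shows the same idea with a fallback: prefer the <code>og:title</code> meta content and fall back to the title tag. Note that it swaps lxml for the standard library's <code>html.parser</code> so that it runs without extra packages; the names <code>OgTitleParser</code> and <code>best_title</code> are hypothetical:

```python
from html.parser import HTMLParser


class OgTitleParser(HTMLParser):
    """Record the og:title meta content and the <title> text of a document."""

    def __init__(self):
        super().__init__()
        self.og_title = None
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'meta' and attrs.get('property') == 'og:title':
            self.og_title = attrs.get('content')
        elif tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def best_title(html_text):
    """Return the og:title if present, else the <title> text, else None."""
    parser = OgTitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.og_title or ''.join(parser.title_parts).strip() or None


sample = ('<html><head><title>Plain title</title>'
          '<meta property="og:title" content="Open Graph title"></head></html>')
print(best_title(sample))
```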

How can I retrieve the page title of a webpage (title html tag) using Python?

How can I retrieve the page title of a webpage using Python?
