Q: Parsing UTF-8/unicode strings with lxml HTML

Stack Overflow user

Asked 2012-08-14 01:03:16

Answers: 1, Views: 14K, Followers: 0, Votes: 23

I have been trying to parse UTF-8 encoded text with etree.HTML(), without success.

→ python
Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> import requests
>>> headers = {'User-Agent': "Opera/9.80 (Macintosh; Intel Mac OS X 10.8.0) Presto/2.12.363 Version/12.50"}
>>> r = requests.get("http://www.rakuten.co.jp/", headers=headers)
>>> r.status_code
200
>>> r.headers
{'x-cache': 'MISS from www.rakuten.co.jp', 'transfer-encoding': 'chunked', 'set-cookie': 'wPzd=lng%3DNA%3Acnt%3DCA; expires=Tue, 13-Aug-2013 16:51:38 GMT; path=/; domain=www.rakuten.co.jp', 'server': 'Apache', 'pragma': 'no-cache', 'cache-control': 'private', 'date': 'Mon, 13 Aug 2012 16:51:38 GMT', 'content-type': 'text/html; charset=EUC-JP'}
>>> responsetext = r.text

So far so good. The response text is correct and is a unicode string. If I now try to get the list of CSS URIs, that works too.

>>> tree = etree.HTML(responsetext)
>>> csspathlist = tree.xpath('//link[@rel="stylesheet"]/@href')
>>> csspathlist
['http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/common.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/layout.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/sidecolumn.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/api.css?v=1207111500', '/com/inc/home/20080930/beta/css/liquid/myrakuten_dpgs.css', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/leftcolumn.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/header.css?v=1207111500', '/com/inc/home/20080930/opt/css/normal/footer.css', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/ipad.css', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/genre.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/opt/css/normal/supersale.css?v=1207111500', '/com/inc/home/20080930/beta/css/liquid/rakuten_membership.css', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/noscript/set.css?v=1207111500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/suggest-2.0.1.css?v=1204231500', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/liquid_banner.css?v=1203011138', 'http://a.ichiba.jp.rakuten-static.com/com/inc/home/20080930/beta/css/liquid/area_announce.css?v=1203011138']

Now let's convert from unicode to UTF-8 and request the list of CSS URIs again.

>>> htmltext = responsetext.encode('utf-8')
>>> tree2 = etree.HTML(htmltext)
>>> csspathlist2 = tree2.xpath('//link[@rel="stylesheet"]/@href')
>>> csspathlist2
[]

I get an empty list.

>>> etree.tostring(tree2)
'<html lang="ja" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml"><head><meta http-equiv="Content-Type" content="text/html; charset=EUC-JP"/><meta http-equiv="Content-Style-Type" content="text/css"/><meta http-equiv="Content-Script-Type" content="text/javascript"/><title/></head></html>'

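In fact, the second parse stops as soon as it reaches the first Japanese character in the title. The original markup at that point is: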

<meta http-equiv="Content-Script-Type" content="text/javascript"/>
<title> 【楽天市場】Shopping is Entertainment! ： インターネット最大級の通信販売、通販オンラインショッピングコミュニティ </title>

I am still trying to understand what I am doing wrong.

python

parsing

unicode

utf-8

lxml

1 Answer

Stack Overflow user

Accepted answer

Posted 2012-08-14 01:23:30

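OK, just found it. Writing the question up on Stack Overflow often helps.

When given a byte string, etree.HTML() tries to guess the encoding from the metadata in the document: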

<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP"/>
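To see why a wrong guess matters, here is a rough standard-library sketch (Python 3) that is independent of lxml. It takes the first four characters of the page title and shows that the same text has different bytes in UTF-8 and in EUC-JP, so reading one as the other cannot give back the original. The variable names are made up for illustration.

```python
# The first four characters of the page title from the question.
title = "楽天市場"

# The same text, serialized in two different encodings.
utf8_bytes = title.encode("utf-8")
eucjp_bytes = title.encode("euc_jp")

# Decoding UTF-8 bytes as EUC-JP, which is what the <meta> tag asks for.
# errors="replace" keeps this from raising on byte sequences that are
# invalid in EUC-JP; the result is garbled text either way.
misread = utf8_bytes.decode("euc_jp", errors="replace")

print(repr(utf8_bytes))
print(repr(eucjp_bytes))
print(misread == title)
```

A parser that trusts the declared charset ends up in the same position as the `misread` line: it is handed UTF-8 bytes but interprets them as EUC-JP.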

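In this case, I had manually converted the document to UTF-8, so its bytes were no longer in the Japanese encoding EUC-JP that the meta tag still declares. To fix this, just force the HTML parser to treat the input as UTF-8. In our case, the code becomes: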

>>> myparser = etree.HTMLParser(encoding="utf-8")
>>> tree = etree.HTML(htmltext, parser=myparser)
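
For anyone who wants to try this without hitting the network, here is a minimal self-contained sketch for Python 3. The page in `html_text` is a made-up miniature of the real one (the stylesheet path `/css/common.css` is invented), but it keeps the important detail: a meta tag that declares EUC-JP while the bytes handed to the parser are UTF-8.

```python
from lxml import etree

# Hypothetical miniature page; the <meta> tag declares EUC-JP as on the real site.
html_text = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP"/>'
    '<title>楽天市場</title>'
    '<link rel="stylesheet" href="/css/common.css"/>'
    '</head><body><p>通販</p></body></html>'
)

# Re-encode to UTF-8, so the bytes no longer match the declared charset.
utf8_bytes = html_text.encode("utf-8")

# Tell the parser explicitly which encoding the bytes are in,
# rather than letting it rely on the <meta> declaration.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.HTML(utf8_bytes, parser=parser)

css_paths = tree.xpath('//link[@rel="stylesheet"]/@href')
title = tree.findtext(".//title")

print(css_paths)
print(title)
```

Parsing `utf8_bytes` without the `parser` argument is the failing case from the question; what exactly comes back in that case may vary with the libxml2 version, so it is not shown here.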

Votes: 27

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/11938924
