import lxml.etree
import re
return lxml.etree.parse(url, lxml.etree.HTMLParser
>>> import lxml.etree
>>> tree = lxml.etree.parse('http://finance.yahoo.com/q?
>", line 1,
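I want to write a script that walks through a directory and checks whether any of the HTML files in it are badly formed. Here is my code:
for root, dirs, files in os.walk(directory):
    for file in files:
        if str(file).endswith('.html'):
            if file is badly formed:  # pseudocode for the missing check
                print("Badly Formed")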
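One possible way to fill in the missing check is sketched below. It rests on an assumption the question does not state: that "badly formed" means "not well-formed XML", which suits XHTML-style files but will also flag ordinary HTML that uses unclosed tags or named entities such as &nbsp;. The helper names is_badly_formed and find_badly_formed are hypothetical, and the fallback to the standard library's ElementTree is only there so the sketch still runs where lxml is not installed.

```python
import os

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: fall back to the stdlib parser when lxml is unavailable.
    import xml.etree.ElementTree as etree


def is_badly_formed(path):
    """Return True if the file at `path` is not well-formed XML.

    Assumption: well-formedness is judged by a strict XML parse.
    Both lxml.etree and xml.etree.ElementTree expose ParseError.
    """
    try:
        etree.parse(path)
    except etree.ParseError:
        return True
    return False


def find_badly_formed(directory):
    """Return the sorted paths of .html files under `directory` that fail the check."""
    bad = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            if name.endswith('.html'):
                path = os.path.join(root, name)
                if is_badly_formed(path):
                    bad.append(path)
    return sorted(bad)
```

Calling find_badly_formed(directory) and printing "Badly Formed" for each returned path reproduces the behaviour the question asks for. If a more lenient, HTML-aware check is wanted, the strict XML parse would need to be replaced with something else; that is outside what this sketch covers.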
('//*[%s]' % refs)
File "lxml.etree.pyx", line 1201, in lxml.etree._Element.iterchildren (src/lxml/lxml.etree.c:36294)
File "lxml.etree.pyx", line 2163, in lxml.etree.ElementChildI
This is my code so far:
root = etree.fromstring(string.encode('
/lxml.etree.c:68106)
File "parser.pxi", line 1785, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102
/lxml.etree.c:181079: error: ‘XML_PARSE_NOCDATA’ undeclared (first use in this function)
src/lxml/lxml.etree.c:182556: error: ‘__pyx_v_4lxml_5etree_XSLT_DOC_DEFAULT_LOADERfirst use in this functio