I have the following HTML structure:
<ul>
<li>
<div>
<h3>TheFirst</h3>
</div>
<div class='LastDiv'>TheLast</div>
</li>
<li>
<div>
<h3>TheSecond</h3>
</div>
<div class='LastDiv'>TheLast</div>
</li>
<li>
<div>
<h3>TheNew</h3>
</div>
<div class='LastDiv'>TheLastNew</div>
</li>
</ul>
What I am trying to do is extract the following data from this structure:
{
'TheLast': ['TheFirst', 'TheSecond'],
'TheLastNew': ['TheNew']
}
Here is what I did:
data = {}
list = response.css('ul li').extract()
for li in list:
    data[li.css('div.LastDiv::text')].append(li.css('div > h3::text'))
print(data)
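But I keep getting this error:
AttributeError: 'str' object has no attribute 'css'
Is there a faster way to extract this data from a collection like this?
The value of list is: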
['<li>\r\n <div>\r\n <h3>TheFirst</h3>\r\n </div>\r\n <div class="LastDiv">TheLast</div>\r\n </li>', '<li>\r\n <div>\r\n <h3>TheSecond</h3>\r\n </div>\r\n <div class="LastDiv">TheLast</div>\r\n </li>', '<li>\r\n <div>\r\n <h3>TheNew</h3>\r\n </div>\r\n <div class="LastDiv">TheLastNew</div>\r\n </li>']
The full output before the print() call is:
>>> data = {}
>>> list = response.css('ul li').extract()
>>> for li in list:
...     data[li.css('div.LastDiv::text')].append(li.css('div > h3::text'))
...
Traceback (most recent call last):
  File "<console>", line 2, in <module>
AttributeError: 'str' object has no attribute 'css'
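Posted on 2018-06-09 23:12:50
This happens because you extracted the HTML from 'ul li' as strings and then tried to call .css() on those strings. When building the list variable for the loop, you must drop .extract(), as follows: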
from scrapy.selector import Selector
with open('input.html') as fd:
    content = fd.read()
response = Selector(text=content)
data = {}
list = response.css('ul li')
for li in list:
    key = li.css('div.LastDiv::text').extract_first()
    if key not in data:
        data[key] = []
    data[key].append(li.css('div > h3::text').extract_first())
print(data)
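The `if key not in data` check above can also be written with dict.setdefault. The following is only a minimal sketch of that grouping step in plain Python: it assumes the text of each <li> has already been extracted as strings, and the `pairs` list and the `group_pairs` helper are hypothetical names used for illustration, not part of Scrapy.

```python
# Hypothetical stand-in for the strings extracted from each <li>:
# (text of div.LastDiv, text of h3).
pairs = [
    ("TheLast", "TheFirst"),
    ("TheLast", "TheSecond"),
    ("TheLastNew", "TheNew"),
]


def group_pairs(pairs):
    """Group the second element of each pair under the first."""
    grouped = {}
    for key, value in pairs:
        # setdefault returns the list already stored under key,
        # or stores a new empty list and returns that.
        grouped.setdefault(key, []).append(value)
    return grouped


print(group_pairs(pairs))
```

This keeps the result a regular dict, so no conversion is needed before printing.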
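Posted on 2018-06-10 03:19:56
Oleg's answer is incomplete. data is a dictionary, which requires its keys to implement __hash__, and SelectorList does not, so the result of li.css(...) cannot be used directly as a key. That is why you get this error.
The correct solution is: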
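The hashing point can be illustrated without Scrapy. This is a rough sketch that assumes SelectorList behaves like a list subclass; `FakeSelectorList` is a hypothetical stand-in, not the real class.

```python
class FakeSelectorList(list):
    """Hypothetical stand-in for SelectorList: a list subclass.

    list sets __hash__ to None, and the subclass inherits that,
    so instances cannot be used as dictionary keys.
    """


data = {}
key = FakeSelectorList(["TheLast"])

try:
    data[key] = []  # fails: the key is unhashable
except TypeError as exc:
    print("TypeError:", exc)

# Taking the string out of the list gives a hashable key.
data[key[0]] = []
print(data)
```

In the real code, the equivalent of `key[0]` is extracting the text from the selector before using it as a key.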
#!/usr/bin/env python3
import collections
from scrapy.selector import Selector
with open('input.html') as fd:
    content = fd.read()
response = Selector(text=content)
data = collections.defaultdict(list)
lst = response.css('ul li')  # no .extract() here
for li in lst:
    key = li.css('div.LastDiv::text')[0].extract()
    data[key].append(li.css('div > h3::text')[0].extract())
print(dict(data))
where input.html is a file containing the HTML snippet from the question. This prints what you are looking for:
{'TheLast': ['TheFirst', 'TheSecond'], 'TheLastNew': ['TheNew']}
https://stackoverflow.com/questions/50774421