I scraped all the li tags of a particular class and got this output:
<li>Aug 14-18, <a href="https://ai4good.org/fragile-earth-2021/">Fragile Earth 2021</a>, develop radically new technological foundations for advancing and meeting the Sustainable Development Goals. Online KDD-21 workshop.
</li>
<li>Aug 19-26, <a href="https://ijcai-21.org/">IJCAI-21: 30th Int. Joint Conference on Artificial Intelligence</a>. Montreal-themed Virtual Reality, Online.
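</li>
I can extract the href and the text separately, but I would also like to store the dates in a DataFrame column, or at least get the dates on their own. Do you know how I can do that?
Here is the link to the site: https://www.kdnuggets.com/meetings/index.html#Y21-10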
Posted on 2021-08-16 20:57:14
I think this should do it:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<li>Aug 14-18, <a href="https://ai4good.org/fragile-earth-2021/">Fragile Earth 2021</a>, develop radically new technological foundations for advancing and meeting the Sustainable Development Goals. Online KDD-21 workshop.
</li>
<li>Aug 19-26, <a href="https://ijcai-21.org/">IJCAI-21: 30th Int. Joint Conference on Artificial Intelligence</a>. Montreal-themed Virtual Reality, Online.
</li>""", "lxml")
# The date is everything before the first comma in each <li>'s text
dates = [x.text.split(',')[0] for x in soup.find_all('li')]
print(dates)
Output:
['Aug 14-18', 'Aug 19-26']
https://stackoverflow.com/questions/68809053
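Since the question also asks for the dates alongside the href and link text, here is a minimal sketch that collects all three per item. It assumes every `<li>` has the shape shown above, with the date as plain text directly before the `<a>` tag. The names `extract_rows` and `rows` are made up for illustration, and it uses the built-in `html.parser` so that lxml is not required. Reading the text node before the link, rather than splitting on the first comma, should keep working even if a date itself contains a comma.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<li>Aug 14-18, <a href="https://ai4good.org/fragile-earth-2021/">Fragile Earth 2021</a>, develop radically new technological foundations for advancing and meeting the Sustainable Development Goals. Online KDD-21 workshop.
</li>
<li>Aug 19-26, <a href="https://ijcai-21.org/">IJCAI-21: 30th Int. Joint Conference on Artificial Intelligence</a>. Montreal-themed Virtual Reality, Online.
</li>"""


def extract_rows(markup):
    """Return one dict per <li> with its date, link text and href (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for li in soup.find_all("li"):
        link = li.find("a")
        if link is None:
            continue  # skip items without a link
        # Assumption: the date is the text node immediately before the <a> tag.
        prev = link.previous_sibling
        if isinstance(prev, NavigableString):
            date = str(prev).strip().rstrip(",").strip()
        else:
            date = ""
        rows.append({
            "date": date,
            "title": link.get_text(strip=True),
            "href": link.get("href"),
        })
    return rows


rows = extract_rows(html)
for row in rows:
    print(row["date"], "|", row["title"], "|", row["href"])
```

If pandas is installed, `pandas.DataFrame(rows)` turns this list of dicts into a DataFrame with `date`, `title` and `href` columns.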