
Output differs from the feed when scraping RSS with requests.get and Beautiful Soup

Stack Overflow user
Asked 2021-10-14 15:43:22
Answers: 1, Views: 36, Followers: 0, Votes: 0

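I want to scrape data from this feed: https://news.ycombinator.com/rss. In the source, each item contains markup of the form <link>the URL</link>, with a proper opening and closing link tag. When I run the code below, however, the link prints as <link/>the URL, and in the JSON file the 'link' key has no content.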

import json

import requests
from bs4 import BeautifulSoup


def rss(x):
    r = requests.get(x)
    # features='html5lib' requires the html5lib package to be installed
    s = BeautifulSoup(r.content, features='html5lib')
    the_list = []
    for i in s.find_all('item'):
        title = i.find('title').text
        link = i.find('link').text  # comes back empty: this is the problem
        date = i.find('pubdate').text

        article = {
            'title': title,
            'link': link,
            'date': date
        }

        the_list.append(article)
    with open('the_list.json', 'w') as f:
        json.dump(the_list, f)


rss('https://news.ycombinator.com/rss')

Stack Overflow user

Accepted answer

Posted 2021-10-14 17:10:45

What happens?

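As you already noted, the <link> element is not well formed in the soup. An HTML parser treats <link> as an empty (void) element, so the tag is closed immediately and the URL that followed it ends up outside the tag, with the closing </link> left over. That is why the text attribute returns nothing for it.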
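The behaviour described above can be reproduced without touching the network. The sketch below is only an illustration under stated assumptions: the one-item feed in `SAMPLE` is hypothetical, and it uses Beautiful Soup's built-in `html.parser` rather than the `html5lib` parser from the question, since both are HTML parsers and should treat `<link>` the same way.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item feed, shaped like the <item> entries in the question.
SAMPLE = (
    "<rss><channel><item>"
    "<title>Example</title>"
    "<link>https://example.com/post</link>"
    "<pubDate>Thu, 14 Oct 2021 13:31:43 +0000</pubDate>"
    "</item></channel></rss>"
)

# Assumption: html.parser stands in for html5lib here.
soup = BeautifulSoup(SAMPLE, features="html.parser")
link = soup.find("item").find("link")

# The tag itself holds no text, because it was closed as soon as it opened.
link_text = link.text

# The URL became a plain text node that follows the tag.
sibling_text = str(link.next_sibling).strip()

print(repr(link_text))
print(sibling_text)
```

The `strip()` call is a precaution: if the feed had whitespace between elements, it would probably be merged into the same text node as the URL.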

The good news is that there is a solution.

How to fix it?

Just take the text from the next_sibling of the <link> element:

i.find('link').next_sibling
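For context, here is one way the `next_sibling` lookup might fit into the question's function. This is a sketch, not the answerer's exact code: the split into `parse_items` and `rss`, the `parser` and `path` parameters, and the timeout are all assumptions made so the parsing step can be tried on a string. The default parser is `html.parser`; pass `parser="html5lib"` to match the question.

```python
import json

import requests
from bs4 import BeautifulSoup


def parse_items(markup, parser="html.parser"):
    """Return a list of title/link/date dicts from RSS markup.

    Hypothetical helper: the name and the parser argument are illustrative.
    """
    soup = BeautifulSoup(markup, features=parser)
    articles = []
    for item in soup.find_all("item"):
        # HTML parsers close <link> immediately, so the URL is the
        # text node right after the tag, not the tag's own text.
        link = str(item.find("link").next_sibling).strip()
        articles.append({
            "title": item.find("title").text,
            "link": link,
            # HTML parsers lowercase tag names, so pubDate becomes pubdate.
            "date": item.find("pubdate").text,
        })
    return articles


def rss(url, path="the_list.json"):
    """Fetch the feed, parse it, and write the items to a JSON file."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    articles = parse_items(response.content)
    with open(path, "w") as f:
        json.dump(articles, f)
    return articles
```

Calling `rss('https://news.ycombinator.com/rss')` should then write a file in which every entry has a non-empty `link` value.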

Output

[
{"title": "Gitlab from YC to IPO", "link": "https://blog.ycombinator.com/gitlab-from-yc-to-ipo/", "date": "Thu, 14 Oct 2021 13:31:43 +0000"},
{"title": "Apple Joins Blender Development Fund", "link": "https://www.blender.org/press/apple-joins-blender-development-fund/", "date": "Thu, 14 Oct 2021 14:48:59 +0000"},
{"title": "Sunset Geometry (2016)", "link": "https://www.shapeoperator.com/2016/12/12/sunset-geometry/", "date": "Thu, 14 Oct 2021 14:29:08 +0000"},
{"title": "iPhone Macro: A Big Day for Small Things", "link": "https://lux.camera/iphone-macro-camera-a-big-day-for-small-things/", "date": "Mon, 11 Oct 2021 10:22:06 +0000"},
{"title": "Michelin Airless", "link": "https://www.michelin.com/en/innovation/vision-concept/airless/", "date": "Thu, 14 Oct 2021 14:36:58 +0000"},
{"title": "Release (YC W20) Is Hiring \u2013 Product Marketing Manager", "link": "https://releasehub.com/company#hire", "date": "Thu, 14 Oct 2021 17:00:15 +0000"},
{"title": "Global Climate Report \u2013 September 2021", "link": "https://www.ncdc.noaa.gov/sotc/global/202109", "date": "Thu, 14 Oct 2021 14:49:59 +0000"},
{"title": "Esbuild \u2013 An extremely fast JavaScript bundler", "link": "https://esbuild.github.io/", "date": "Thu, 14 Oct 2021 05:07:27 +0000"},
{"title": "Small Language Models Are Also Few-Shot Learners", "link": "https://aclanthology.org/2021.naacl-main.185/", "date": "Tue, 12 Oct 2021 09:59:34 +0000"},
{"title": "Who was Aleph Null? (2013)", "link": "http://bit-player.org/2013/who-was-aleph-null", "date": "Mon, 11 Oct 2021 08:35:29 +0000"},
{"title": "Hands-On Rust: Effective Learning Through 2D Game Development and Play", "link": "https://pragprog.com/titles/hwrust/hands-on-rust/", "date": "Thu, 14 Oct 2021 07:59:24 +0000"},
{"title": "Ask HN: What's the Point of Life?", "link": "https://news.ycombinator.com/item?id=28866558", "date": "Thu, 14 Oct 2021 16:38:15 +0000"},
{"title": "What I wish I knew when learning F#", "link": "https://danielbachler.de/2020/12/23/what-i-wish-i-knew-when-learning-fsharp.html", "date": "Thu, 14 Oct 2021 12:07:40 +0000"},
{"title": "Investing in Startups by Passing the Series 65", "link": "https://www.natecation.com/accredited-investor-investing-startups-series-65/", "date": "Wed, 13 Oct 2021 17:57:25 +0000"},
{"title": "OpenBSD 7.0", "link": "https://www.openbsd.org/70.html", "date": "Thu, 14 Oct 2021 10:24:21 +0000"},
{"title": "Countries are gathering in an effort to stop a biodiversity collapse", "link": "https://www.nytimes.com/2021/10/14/climate/un-biodiversity-conference-climate-change.html", "date": "Thu, 14 Oct 2021 13:32:00 +0000"},
{"title": "Alden Global Capital, the secretive hedge fund gutting newsrooms", "link": "https://www.theatlantic.com/magazine/archive/2021/11/alden-global-capital-killing-americas-newspapers/620171/", "date": "Thu, 14 Oct 2021 15:17:06 +0000"},
{"title": "Child suicides in Japan hit record high", "link": "https://www3.nhk.or.jp/nhkworld/en/news/20211013_19/", "date": "Thu, 14 Oct 2021 08:52:39 +0000"},
{"title": "Every search bar looks like a URL bar to users", "link": "https://shkspr.mobi/blog/2021/10/every-search-bar-looks-like-a-url-bar-to-users/", "date": "Thu, 14 Oct 2021 13:27:58 +0000"},
{"title": "Psychonetics: A nerd's toolset to work with mind and perception", "link": "http://deconcentration-of-attention.com/psychonetics.html", "date": "Tue, 12 Oct 2021 11:28:43 +0000"},
{"title": "FB seals off some internal message boards to prevent leaking, immediately leaked", "link": "https://www.businessinsider.com/facebook-whistleblower-leaks-restricts-staff-access-message-boards-elections-safety-2021-10", "date": "Thu, 14 Oct 2021 11:09:08 +0000"},
{"title": "Working around expired root certificates", "link": "https://scotthelme.co.uk/should-clients-care-about-the-expiration-of-a-root-certificate/", "date": "Mon, 11 Oct 2021 21:27:27 +0000"},
{"title": "An unprecedented wave of online bank fraud is hitting Britain", "link": "https://www.reuters.com/world/uk/welcome-britain-bank-scam-capital-world-2021-10-14/", "date": "Thu, 14 Oct 2021 09:57:39 +0000"},
{"title": "Interoperable Serendipity", "link": "https://noeldemartin.com/blog/interoperable-serendipity", "date": "Wed, 13 Oct 2021 12:02:37 +0000"},
{"title": "Instagram took down post with figure from paper showing male advantage in sports", "link": "https://twitter.com/SwipeWright/status/1448064426670583814", "date": "Thu, 14 Oct 2021 16:36:30 +0000"},
{"title": "IoT hacking and rickrolling my high school district", "link": "https://whitehoodhacker.net/posts/2021-10-04-the-big-rick", "date": "Tue, 12 Oct 2021 19:38:06 +0000"},
{"title": "Boeing says certain 787 parts improperly manufactured", "link": "https://www.reuters.com/business/aerospace-defense/boeing-deals-with-new-defect-787-dreamliner-wsj-2021-10-14/", "date": "Thu, 14 Oct 2021 13:26:46 +0000"},
{"title": "Practice Problems for Hardware Engineers", "link": "https://arxiv.org/abs/2110.06526", "date": "Thu, 14 Oct 2021 03:48:24 +0000"},
{"title": "Interface ergonomics: automation isn't just about time saved", "link": "https://macoy.me/blog/programming/InterfaceFriction", "date": "Wed, 13 Oct 2021 01:05:52 +0000"},
{"title": "Syncthing \u2013 a continuous file synchronization program", "link": "https://syncthing.net/", "date": "Thu, 14 Oct 2021 01:23:19 +0000"}
]
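Since an RSS feed is XML, a different route from the sibling lookup in this answer is to parse it with an XML parser, where `<link>` is an ordinary element that keeps its text. The sketch below uses the standard library's `xml.etree.ElementTree`. It is an alternative technique, not what the answer proposes, and the `parse_items_xml` name is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_items_xml(xml_text):
    """Return title/link/date dicts from RSS text using an XML parser.

    Hypothetical helper; accepts str or bytes, e.g. the body of a response.
    """
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": item.findtext("title", default=""),
            # In XML, <link> keeps its text, so no sibling lookup is needed.
            "link": item.findtext("link", default=""),
            # XML is case-sensitive: the RSS element is pubDate, not pubdate.
            "date": item.findtext("pubDate", default=""),
        })
    return articles
```

With the question's code, this would be called as `parse_items_xml(r.content)` in place of building the soup.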
Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/69573625
