Here is pretty much everything you need to get started. Read the section "Listing 7. Simple Python Web site crawler". The examples are even written in Python.

<a href="http://www.ibm.com/developerworks/linux/library/l-spider/" rel="nofollow noreferrer">http://www.ibm.com/developerworks/linux/library/l-spider/</a>
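For a rough idea of what such a crawler involves, here is a minimal sketch using only the Python 3 standard library. It is not the listing from the linked article. The names `LinkParser`, `fetch`, `crawl`, and the `max_pages` limit are made up for illustration, and it assumes pages are UTF-8 and that staying on the starting host is the desired scope.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its HTML as text (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl limited to the host of start_url.

    Returns a dict mapping each visited URL to its raw HTML.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments
            link = urldefrag(urljoin(url, href)).url
            parts = urlparse(link)
            same_site = parts.scheme in ("http", "https") and parts.netloc == host
            if same_site and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl("http://www.someurl.com/")` would return the pages it reached on that host. Passing your own `fetch_page` function lets you add delays between requests or try the logic offline first.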

Good luck!

A popular web scraping module for Python is <a href="http://scrapy.org/" rel="nofollow noreferrer">Scrapy</a>. For an example, take a look at the tutorial link at the bottom.

You're looking for "web scraping".

You can Google around to find quite a few different techniques and utilities, such as this one:

<a href="http://www.webscrape.com/" rel="nofollow noreferrer">http://www.webscrape.com/</a>

More info:

<a href="http://blogs.computerworld.com/node/324" rel="nofollow noreferrer">http://blogs.computerworld.com/node/324</a>

Is it necessary to do it in Python? If not, <a href="http://www.httrack.com/" rel="nofollow noreferrer">HTTrack</a> could be the perfect solution for you. It can copy entire sites into a hierarchy of HTML files. If you are looking for a Python solution, try <a href="http://scrapy.org/" rel="nofollow noreferrer">Scrapy</a>.

You could use <a href="http://www.gnu.org/software/wget/" rel="nofollow noreferrer">wget</a> with the <code>--spider</code> option.

The last time I had to do something like this, I started with something like this:

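<pre><code># Python 3 with Beautiful Soup 4 (pip install beautifulsoup4)
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page and read the raw HTML
html = urlopen("http://www.someurl.com").read()
# Parse the HTML into a searchable tree
soup = BeautifulSoup(html, "html.parser")
</code></pre>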

Here is the documentation for Beautiful Soup (<a href="http://www.crummy.com/software/BeautifulSoup/documentation.html" rel="nofollow noreferrer">http://www.crummy.com/software/BeautifulSoup/documentation.html</a>) and while it may be overkill for your purposes, it is handy to know in my opinion.
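To get from parsed HTML to the line-numbered full text the question asks for, something like the sketch below might do. To keep it runnable without installing anything, it swaps in the standard library's `html.parser` rather than Beautiful Soup. The names `TextExtractor` and `numbered_text` and the choice of tags that start a new line are assumptions for illustration, not part of any library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gather visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    # Tags treated as line breaks (an assumed, incomplete list)
    BREAK = {"p", "div", "br", "li", "tr", "title",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BREAK:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BREAK:
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def numbered_text(html):
    """Return the page's non-empty text lines, each prefixed with a line number."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    lines = [line.strip() for line in "".join(parser.chunks).splitlines()]
    lines = [line for line in lines if line]
    return [f"{n}: {line}" for n, line in enumerate(lines, start=1)]
```

Writing `"\n".join(numbered_text(html))` to a file for each downloaded page would give one numbered text dump per page.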

First time here -- thought I'd field a question on behalf of a coworker.

Somebody in my lab is doing a content analysis (e.g. reading an article or transcript line by line and identifying relevant themes) of the web presences of various privatized neuroimaging centers (e.g. <a href="http://www.canmagnetic.com/" rel="nofollow noreferrer">http://www.canmagnetic.com/</a>). She's been copying and pasting entire site maps by hand, and I know I could slap something together with Python to follow links and dump full text (with line numbers) for her, but I've never actually done anything quite like this. Any ideas for how I'd get started?

Cheers,
-alex

downloading full page text from a web domain
