
Obey robots.txt (and don't crawl too aggressively, as has already been said).

You might want to think about your user-agent string: it's a good place to be up-front about what you're doing and how you can be contacted.
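As a minimal sketch of both points, the standard-library `urllib.robotparser` can check a URL against robots.txt rules for a given user agent. The bot name and contact URL below are hypothetical placeholders, and the robots.txt text is a made-up sample rather than anything fetched from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot name and contact URL; replace with your own.
USER_AGENT = "ExampleMarkupBot/0.1 (+https://example.com/bot-info)"


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """Return True if the parsed rules permit our user agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


# Made-up sample rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = build_robots_parser(SAMPLE_ROBOTS)
print(is_allowed(parser, "https://example.com/index.html"))
print(is_allowed(parser, "https://example.com/private/data.html"))
```

In a real crawler you would fetch `/robots.txt` once per host, cache the parser, and send the same user-agent string in every request header.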


Besides WillDean's and Einar's good answers, I would really recommend taking the time to read about the meaning of the HTTP response codes and what your crawler should do when it encounters each one, since that will make a big difference to your performance and to whether or not you get banned from some sites.

Some useful links:

<a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html" rel="nofollow noreferrer">HTTP/1.1: Status Code Definitions</a>

<a href="http://diveintomark.org/tests/client/http/" rel="nofollow noreferrer">Aggregator client HTTP tests</a>

<a href="http://en.wikipedia.org/wiki/List_of_HTTP_status_codes" rel="nofollow noreferrer">Wikipedia</a>
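One way to make the status-code advice concrete is a small dispatch function that maps each code to what the crawler does next. The `Action` names and the specific policy here are assumptions for illustration, not a standard; adjust them to your own crawler.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical crawler actions, named for this sketch only."""
    PROCESS = "process"
    FOLLOW_REDIRECT = "follow_redirect"
    USE_CACHED = "use_cached"
    RETRY_LATER = "retry_later"
    DROP = "drop"


def action_for_status(status: int) -> Action:
    """Map an HTTP status code to the crawler's next step (assumed policy)."""
    if status == 304:
        # Not Modified: our cached copy is still valid.
        return Action.USE_CACHED
    if status in (301, 302, 303, 307, 308):
        return Action.FOLLOW_REDIRECT
    if status in (408, 429) or 500 <= status < 600:
        # Timeouts, rate limiting, and server errors are often temporary,
        # so back off and try again later instead of hammering the site.
        return Action.RETRY_LATER
    if 200 <= status < 300:
        return Action.PROCESS
    # Remaining client errors such as 403, 404, and 410.
    return Action.DROP


for code in (200, 301, 304, 404, 429, 503):
    print(code, action_for_status(code).value)
```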


Please be sure to include a URL in your user-agent string that explains who/what/why your robot is crawling.


Also, do not forget to obey the robots meta tags: <a href="http://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.2" rel="nofollow noreferrer">http://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.2</a>
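A rough sketch of reading those tags with the standard-library `html.parser`: it collects the comma-separated directives from any `<meta name="robots">` tag so the crawler can check for `noindex` or `nofollow` before indexing the page or following its links. The class and function names are made up for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # HTMLParser lowercases attribute names, but not their values.
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in the document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="ROBOTS" content="NOINDEX, nofollow"></head></html>'
print(sorted(robots_directives(page)))
```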

Another thing to think about: when spidering pages, don't be too hasty in deciding that things don't exist or have errors. Some pages are offline due to maintenance work or errors that are corrected within a short period.
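One simple way to avoid hasty decisions is to count consecutive failures per URL and only give up after several. This is a sketch; the `FailureTracker` name and the threshold of three are assumptions, and a real crawler would also space the retries out over hours or days.

```python
from collections import defaultdict

MAX_FAILURES = 3  # assumed threshold; tune for your crawl schedule


class FailureTracker:
    """Track consecutive failed fetches per URL before declaring a page gone."""

    def __init__(self, max_failures: int = MAX_FAILURES):
        self.max_failures = max_failures
        self._failures = defaultdict(int)

    def record_failure(self, url: str) -> bool:
        """Record a failed fetch; return True once the URL should be dropped."""
        self._failures[url] += 1
        return self._failures[url] >= self.max_failures

    def record_success(self, url: str) -> None:
        """A successful fetch resets the failure count for that URL."""
        self._failures.pop(url, None)


tracker = FailureTracker()
url = "https://example.com/sometimes-down.html"
for attempt in range(1, 4):
    print(attempt, tracker.record_failure(url))
```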


All good points, the ones made here. You will also have to deal with dynamically-generated Java and JavaScript links, parameters and session IDs, escaping single and double quotes, failed attempts at relative links (using ../../ to go past the root directory), case sensitivity, frames, redirects, cookies....

I could go on for days, and kinda have. I have a <a href="http://searchtools.com/robots/robot-checklist.html" rel="nofollow noreferrer" title="Robots Checklist">Robots Checklist</a> that covers most of this, and I'm happy to answer what I can.

You should also think about using open-source robot crawler code, because it gives you a huge leg up on all these issues. I have a page on that as well: <a href="http://searchtools.com/robots/robot-code.html" rel="nofollow noreferrer" title="Source Code for Web Robot Spiders">open source robot code</a>. Hope that helps!
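For a taste of a few of the link problems listed above (relative links that climb past the root, session IDs in parameters, case sensitivity), here is a minimal normalization sketch using `urllib.parse`. The list of session parameter names is an assumption, and it only covers query-string parameters, not path-style ones such as `;jsessionid=...`.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# Assumed names of session parameters; extend for the sites you crawl.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}


def normalize_link(base_url: str, href: str):
    """Resolve href against base_url and return a canonical URL, or None."""
    # urljoin discards ".." segments that would climb above the root.
    absolute = urljoin(base_url, href)
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https"):
        return None  # skip mailto:, javascript:, and similar links
    # Drop session identifiers so the same page is not crawled repeatedly.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # Host names are case-insensitive, but paths may not be, so only the
    # host is lowercased. The fragment is dropped.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", urlencode(query), "")
    )


print(normalize_link("http://Example.com/a/b.html", "../../c.html?PHPSESSID=abc&x=1#top"))
```

JavaScript-generated links, frames, and cookies need considerably more machinery, which is one more argument for starting from existing open-source crawler code.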


I'd say that it is very important to consider how much load you are causing. For instance, if your crawler requests every object of a single site, more or less at once, it might cause load problems for that particular site.

In other words, make sure your crawler is not too aggressive.
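A minimal sketch of keeping load down is a per-host throttle that enforces a minimum delay between requests to the same site. The `HostThrottle` name and the two-second default are assumptions; the clock and sleep functions are injectable only so the class can be exercised without real waiting.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # assumed default; be more generous if unsure
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}

    def wait(self, url: str) -> float:
        """Block until the URL's host may be contacted; return seconds slept."""
        host = urlsplit(url).netloc.lower()
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # The next request to this host must wait at least min_delay more.
        self._next_allowed[host] = max(now, ready_at) + self.min_delay
        return delay


throttle = HostThrottle(min_delay=0.1)
for page in ("https://example.com/a", "https://example.com/b"):
    print(page, round(throttle.wait(page), 2))
```

Call `wait(url)` immediately before each request; requests to different hosts do not delay one another.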


It's perfectly acceptable to do; just make sure it only visits each page once per session. As you're technically creating a searchbot, you must obey robots.txt and <code>no-cache</code> rules. People can still block your bot specifically, if needed, by blocking IPs.

You're only looking for source code as far as I can tell so you'll want to build something to follow <code>&lt;link&gt;</code>s for stylesheets and <code>&lt;script src="..."&gt;&lt;/script&gt;</code> for JavaScripts.
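Both ideas can be sketched together with the standard-library `html.parser`: pull stylesheet and script URLs out of a page, and use a `seen` set so each URL is queued only once per session. The class and function names are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceExtractor(HTMLParser):
    """Collect absolute URLs of stylesheets and external scripts."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "link" and "stylesheet" in (attr.get("rel") or "").lower().split():
            target = attr.get("href")
        elif tag == "script":
            target = attr.get("src")  # inline scripts have no src
        else:
            return
        if target:
            self.resources.append(urljoin(self.base_url, target))


def new_resources(html: str, base_url: str, seen: set) -> list:
    """Return resource URLs not visited yet this session, marking them seen."""
    extractor = ResourceExtractor(base_url)
    extractor.feed(html)
    fresh = []
    for url in extractor.resources:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


seen = set()
html = '<link rel="stylesheet" href="/s.css"><script src="app.js"></script><script>var x;</script>'
print(new_resources(html, "http://example.com/dir/page.html", seen))
```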


Load is a big consideration. Put limits on how often you crawl a particular site, and work out the most basic information you need to accomplish your goal. If you are looking for text, do not download all the images, and so on.
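As a rough illustration of fetching only what you need, a crawler interested in markup can skip URLs that look like images or other binaries and check the `Content-Type` header before reading a body. The extension and media-type lists are assumptions to adapt to your goal.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed lists for a crawler that only needs HTML markup.
SKIPPED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".ico", ".pdf", ".zip", ".mp4"}
WANTED_TYPES = {"text/html", "application/xhtml+xml"}


def looks_like_binary(url: str) -> bool:
    """Guess from the path extension whether the URL is not worth fetching."""
    extension = posixpath.splitext(urlsplit(url).path)[1].lower()
    return extension in SKIPPED_EXTENSIONS


def is_wanted_type(content_type_header: str) -> bool:
    """Check a Content-Type header value, ignoring parameters like charset."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in WANTED_TYPES


print(looks_like_binary("https://example.com/images/logo.PNG?v=2"))
print(is_wanted_type("text/html; charset=utf-8"))
```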

Of course, obey robots.txt, but also make sure your user-agent string includes accurate contact info and maybe a link to a web page describing what you are doing and how you do it. If a web admin sees a lot of requests from you and is curious, you might be able to answer a lot of questions with an informative web page.


You will need to add some capability to blacklist sites / domains or other things (IP ranges, ASN, etc) to avoid your spider getting bogged down with spam sites.
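A minimal blocklist sketch covering domains and IP ranges, using the standard-library `ipaddress` module. The `Blocklist` class is hypothetical, the entries are reserved documentation values, and ASN-based blocking is left out because it needs an external IP-to-ASN data source.

```python
import ipaddress


class Blocklist:
    """Block hosts by domain (including subdomains) and by IP network."""

    def __init__(self, domains=(), networks=()):
        self.domains = {d.lower().rstrip(".") for d in domains}
        self.networks = [ipaddress.ip_network(n) for n in networks]

    def blocks_host(self, host: str) -> bool:
        """True if the host or any parent domain is on the list."""
        labels = host.lower().rstrip(".").split(".")
        return any(".".join(labels[i:]) in self.domains for i in range(len(labels)))

    def blocks_ip(self, ip: str) -> bool:
        """True if the address falls inside any blocked network."""
        address = ipaddress.ip_address(ip)
        return any(address in network for network in self.networks)


# Example entries use reserved documentation names and ranges.
blocklist = Blocklist(domains=["spam.example"], networks=["203.0.113.0/24"])
print(blocklist.blocks_host("www.spam.example"))
print(blocklist.blocks_ip("203.0.113.7"))
```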

You'll need to have an HTTP implementation with a lot of control over timeouts and behaviour. Expect a lot of sites to send back invalid responses, huge responses, or rubbish headers, or just to leave the connection open indefinitely with no response.
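As a sketch of that kind of control using only the standard library, the fetch below sets a socket timeout and caps how many bytes it reads. The limits are assumed values. Note that the `timeout` passed to `urlopen` applies to each blocking socket operation, not to the whole transfer, so a server that trickles data slowly would need an additional overall deadline.

```python
import io
import urllib.request

MAX_BYTES = 1_000_000  # assumed cap on response size
TIMEOUT_SECONDS = 10   # assumed per-operation socket timeout


def read_capped(stream, max_bytes: int = MAX_BYTES):
    """Read at most max_bytes from a file-like object; return (data, truncated)."""
    data = stream.read(max_bytes + 1)
    return data[:max_bytes], len(data) > max_bytes


def fetch(url: str, user_agent: str):
    """Fetch a URL with a timeout and size cap; return (status, body, truncated)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as response:
        body, truncated = read_capped(response)
        return response.status, body, truncated


# Demonstrate the cap on an in-memory stream, without touching the network.
print(read_capped(io.BytesIO(b"abcdef"), max_bytes=4))
```

Callers of `fetch` should also catch `urllib.error.URLError`, `TimeoutError`, and `http.client.HTTPException`, since malformed responses surface as exceptions.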

Also, don't trust a 200 status to mean "the page exists". In my experience, quite a large proportion of sites send back 200 for "Not found" or other errors (along with a large HTML document).
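One crude heuristic for spotting these "soft 404" pages is to look for error wording in the page title. This is only a sketch with assumed patterns, and it will miss many cases; a sturdier approach is to request a deliberately nonexistent URL on each site and compare what comes back.

```python
import re

# Assumed phrases; real sites vary widely and use many languages.
NOT_FOUND_PATTERN = re.compile(
    r"\b(404|page not found|not found|no longer available)\b", re.IGNORECASE
)
TITLE_PATTERN = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def looks_like_soft_404(status: int, html: str) -> bool:
    """Guess whether a 200 response is really an error page."""
    if status != 200:
        return False  # real error statuses are handled elsewhere
    match = TITLE_PATTERN.search(html)
    title = match.group(1) if match else ""
    return bool(NOT_FOUND_PATTERN.search(title))


print(looks_like_soft_404(200, "<html><title>Page Not Found</title></html>"))
print(looks_like_soft_404(200, "<html><title>Welcome</title></html>"))
```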

I just started thinking about creating/customizing a web crawler today, and I know very little about web crawler/robot etiquette. Most of the writing on etiquette I've found seems old and awkward, so I'd like to get some current (and practical) insights from the web developer community.

I want to use a crawler to walk over "the web" for a super simple purpose - "does the markup of site XYZ meet condition ABC?".

This raises a lot of questions for me, but I think the two main questions I need to get out of the way first are:

<ul>
<li>It feels a little "iffy" from the get go -- is this sort of thing acceptable?</li>
<li>What specific considerations should the crawler take to not upset people?</li>
</ul>

What are the key considerations when creating a web crawler?
