The robots protocol

py3study
Published 2020-01-16 14:50:18
This article is part of the column: python3

<div id="cnblogs_post_body" class="blogpost-body"><h3><strong>What is robots.txt?</strong></h3>
<p>robots.txt is a plain-text file, normally placed in the root directory of a website, and it is the first file a well-behaved crawler requests when it visits the site. It declares the limits that apply to crawlers on that site: which parts may be crawled and which may not. Compliance is voluntary, so it keeps honest crawlers out but cannot stop a malicious one.</p>
<p>For more on the robots.txt protocol, see www.robotstxt.org</p>
<p>Checking robots.txt before crawling a site reduces the chance that your crawler gets banned.</p>
<p>Below is part of Baidu's robots.txt: https://www.baidu.com/robots.txt</p>
<div class="cnblogs_code">
<pre>User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: MSNBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Baiduspider-image
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: YoudaoBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou spider2
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou blog
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou News Spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: *
Disallow: /</pre>
</div>
<p><span style="font-size: 15px;"><strong>Meaning of the fields in robots.txt:</strong></span></p>
<p>1. User-agent: names the search-engine spider that the record applies to. If robots.txt contains several User-agent records, several robots are bound by the protocol, so the file must contain at least one User-agent record. If the value is * (a wildcard), the record applies to every search-engine robot. A robots.txt file may contain only one "User-agent: *" record.</p>
<p>2. Disallow: a path that crawlers must not access.</p>
<p>For example, Disallow: /home/news/data/ means the crawler may not access any URL under /home/news/data/, but it may still access /home/news/data123.</p>
<p>Disallow: /home/news/data means the crawler may not access /home/news/data123, /home/news/datadasf, or any other URL whose path begins with /home/news/data, including everything under /home/news/data/.</p>
<p>Both rules are prefix matches. The trailing slash in the first limits it to the contents of that one directory, while the second blocks every path that starts with the same string.</p>
<p>3. Allow: a path that crawlers may access.</p>
<p>For example, suppose /home/ contains several paths such as news, video and image, and the file contains Disallow: /home/.</p>
<p>Adding Allow: /home/news then means that everything under /home/ is off limits except the /home/news path.</p></div>
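
The prefix-matching behaviour of Disallow and Allow described above can be tried out with Python's standard-library `urllib.robotparser`. What follows is only a sketch under stated assumptions: the rule set, the bot names `ExampleBot`, `OtherBot` and `UnlistedBot`, and the `example.com` URLs are invented for illustration and are not Baidu's real rules. `Allow` is written before `Disallow` because this particular parser applies the first rule that matches.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules modelled on the examples in the text (not a real site's file).
ROBOTS_TXT = """\
User-agent: ExampleBot
Allow: /home/news
Disallow: /home/

User-agent: OtherBot
Disallow: /home/news/data/

User-agent: *
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# (user agent, URL) pairs to check against the rules above.
checks = [
    ("ExampleBot", "https://example.com/home/news"),           # Allow wins
    ("ExampleBot", "https://example.com/home/video"),          # blocked by /home/
    ("OtherBot", "https://example.com/home/news/data/a.html"), # inside the directory
    ("OtherBot", "https://example.com/home/news/data123"),     # prefix does not match
    ("UnlistedBot", "https://example.com/index.html"),         # falls back to "*"
]

for agent, url in checks:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent:12} {url} -> {verdict}")
```

A real crawler would normally call `set_url()` with the site's robots.txt address and then `read()` to download the live file before asking `can_fetch()`.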

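This article is part of the Tencent Cloud self-media sharing program and is shared from the author's personal site/blog.
Originally published: 2019-06-04. In case of infringement, contact cloudcommunity@tencent.com to have it removed.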
