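The web is full of crawlers and spiders. Some help a site, for example Baidu (Baiduspider), Google (Googlebot) and Bing (bingbot). Others are pure junk: they bring the site nothing and burn a large share of server resources, for example BLEXBot, AhrefsBot, MJ12bot, hubspot, opensiteexplorer, leiki and webmeup. These junk crawlers can be blocked in nginx by matching the User-Agent header.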
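Return 403 for the listed User-Agents. Note that the patterns below also match Baiduspider, Google, Sogou and bingbot; drop those alternatives if the site should stay indexed by these search engines.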
# Place inside a server or location block
if ($http_user_agent ~* 'curl|python-requests|urllib|Baiduspider|YisouSpider|Google|Sogou|bingbot|python|AndroidDownloadManager|ZoominfoBot|SemrushBot|AhrefsBot|Java|Jullo|UniversalFeedParser|Swiftbot|Microsoft|oBot|FlightDeckReports|Linguee|DotBot|Indy|jaunty|HttpClient|WinHttp|ZmEu|ApacheBench|CrawlDaddy|BOT for JCE')
{
    return 403;
}
# If several server blocks need the same rule, use a map instead (declared once, in the http context)
# Banned user agents
map $http_user_agent $ban_ua {
    default '';
    '~*MJ12bot|curl|NetcraftSurvey|Go-http-client|polaris botnet|python-requests|urllib|Scrapy|Baiduspider|YisouSpider|Google|Sogou|bingbot|python|AndroidDownloadManager|ZoominfoBot|SemrushBot|AhrefsBot|Java|Jullo|UniversalFeedParser|Swiftbot|Microsoft|oBot|FlightDeckReports|Linguee|DotBot|Indy|jaunty|HttpClient|WinHttp|ZmEu|ApacheBench|CrawlDaddy|BOT for JCE' 'error';
}
# In each server block: $ban_ua is non-empty only when the User-Agent matched the ban list
if ($ban_ua) {
    return 403;
}
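The following fragment is a minimal sketch of where the two pieces sit in nginx.conf, assuming a single site. The host name example.com and the shortened pattern are placeholders for illustration, not part of the original configuration. The map directive is only valid in the http context, while the if test belongs in server (or location).

```nginx
http {
    # map is only allowed at the http level; the variable is evaluated
    # lazily, only when $ban_ua is actually read during a request
    map $http_user_agent $ban_ua {
        default '';
        '~*MJ12bot|AhrefsBot|SemrushBot' 'error';   # shortened placeholder list
    }

    server {
        listen 80;
        server_name example.com;   # hypothetical host name

        # a non-empty $ban_ua means the User-Agent matched the ban list
        if ($ban_ua) {
            return 403;
        }
    }
}
```

Because the map is declared once, every additional server block only needs the three-line if test, and the ban list is maintained in a single place.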
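Crawlers from the common search engines are worth allowing: they help pages get indexed and generally honor robots.txt.
Crawlers in the junk category do little for the site and may crawl pages aggressively; the rogue ones do not even honor robots.txt.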
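The two categories above can be combined in one map. The sketch below assumes that the search-engine crawlers should pass and the rest should be banned; both patterns are shortened placeholders rather than the full lists. It relies on nginx testing map regular expressions in the order they appear and using the first one that matches, so the allow entry must come before the ban entry.

```nginx
map $http_user_agent $ban_ua {
    default '';
    # allow entry first: a matching search-engine crawler maps to the
    # empty string, so the later ban pattern is never consulted for it
    '~*Baiduspider|Googlebot|bingbot' '';
    # everything matched here is banned (shortened placeholder list)
    '~*MJ12bot|AhrefsBot|SemrushBot|python|curl' 'error';
}
```

Keep in mind that the User-Agent header is supplied by the client and can be forged, so this only filters crawlers that identify themselves honestly.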