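I am developing an automated web scraper for a restaurant website, but I have run into a problem. The site uses Cloudflare's anti-bot protection, which I would like to bypass: not the Under Attack mode, but a captcha test that is only triggered when it detects a non-US IP or a bot. I am trying to bypass it because Cloudflare's protection does not trigger when I clear cookies, disable JavaScript, or use a US proxy.
Knowing this, I tried using Python's requests library as follows: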
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}
response = requests.get("https://grimaldis.myguestaccount.com/guest/accountlogin", headers=headers).text
print(response)
But this ends up triggering Cloudflare, no matter which proxy I use.
However, when using urllib.request with the same headers:
import urllib.request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}
request = urllib.request.Request("https://grimaldis.myguestaccount.com/guest/accountlogin", headers=headers)
r = urllib.request.urlopen(request).read()
print(r.decode('utf-8'))
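When run with the same US IP, this time it does not trigger Cloudflare's protection, even though it uses the same headers and IP as the requests version.
So I am trying to figure out what exactly triggers Cloudflare in the requests library that does not in the urllib library.
While the typical answer would be "just use urllib then", I would like to find out exactly what is different with requests and how to fix it, first to understand how requests works and how Cloudflare detects bots, but also so that I can apply any fix I find to other HTTP libraries (notably asynchronous ones).
EDIT N°2: Progress so far:
Thanks to @TuanGeek, we can now bypass the Cloudflare block using requests as long as we connect directly to the host IP rather than the domain name (for some reason, DNS resolution with requests triggers Cloudflare, but with urllib it does not):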
import requests
from collections import OrderedDict
import socket
# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
headers = OrderedDict({
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
})
s = requests.Session()
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", verify=False).text
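Note: trying to access the site over HTTP (rather than HTTPS with verify set to False) will trigger Cloudflare's block.
Now this is great, but unfortunately my final goal of making this work asynchronously with the HTTP library HTTPX is still not met: with the following code, the Cloudflare block is still triggered even though we connect directly through the host IP, with proper headers, and with verify set to False: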
import trio
import httpx
import socket
from collections import OrderedDict
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
headers = OrderedDict({
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
})
async def asks_worker():
    async with httpx.AsyncClient(headers=headers, verify=False) as s:
        r = await s.get(f'https://{address}/guest/accountlogin')
        print(r.text)
async def run_task():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(asks_worker)
trio.run(run_task)
EDIT N°1: For further details, here are the raw HTTP requests and responses from urllib and requests:
REQUESTS:
send: b'GET /guest/nologin/account-balance HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: grimaldis.myguestaccount.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden\r\n'
header: Date: Thu, 02 Jul 2020 20:20:06 GMT
header: Content-Type: text/html; charset=UTF-8
header: Transfer-Encoding: chunked
header: Connection: close
header: CF-Chl-Bypass: 1
header: Set-Cookie: __cfduid=df8902e0b19c21b364f3bf33e0b1ce1981593721256; expires=Sat, 01-Aug-20 20:20:06 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
header: Expires: Thu, 01 Jan 1970 00:00:01 GMT
header: X-Frame-Options: SAMEORIGIN
header: cf-request-id: 03b2c8d09300000ca181928200000001
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Set-Cookie: __cfduid=df8962e1b27c25b364f3bf66e8b1ce1981593723206; expires=Sat, 01-Aug-20 20:20:06 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Vary: Accept-Encoding
header: Server: cloudflare
header: CF-RAY: 5acb25c75c981ca1-EWR
URLLIB:
send: b'GET /guest/nologin/account-balance HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: grimaldis.myguestaccount.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 02 Jul 2020 20:20:01 GMT
header: Content-Type: text/html;charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Set-Cookie: __cfduid=db9de9687b6c22e6c12b33250a0ded3251292457801; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Expires: Thu, 2 Jul 2020 20:20:01 GMT
header: Cache-Control: no-cache, private, no-store
header: X-Powered-By: Undertow/1
header: Pragma: no-cache
header: X-Frame-Options: SAMEORIGIN
header: Content-Security-Policy: script-src 'self' 'unsafe-inline' 'unsafe-eval' https://www.google-analytics.com https://www.google-analytics.com/analytics.js https://use.typekit.net connect.facebook.net/ https://googleads.g.doubleclick.net/ app.pendo.io cdn.pendo.io pendo-static-6351154740266000.storage.googleapis.com pendo-io-static.storage.googleapis.com https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://www.google.com/recaptcha/api.js apis.google.com https://www.googletagmanager.com api.instagram.com https://app-rsrc.getbee.io/plugin/BeePlugin.js https://loader.getbee.io api.instagram.com https://bat.bing.com/bat.js https://www.googleadservices.com/pagead/conversion.js https://connect.facebook.net/en_US/fbevents.js https://connect.facebook.net/ https://fonts.googleapis.com/ https://ssl.gstatic.com/ https://tagmanager.google.com/;style-src 'unsafe-inline' *;img-src * data:;connect-src 'self' app.pendo.io api.feedback.us.pendo.io; frame-ancestors 'self' app.pendo.io pxsweb.com *.pxsweb.com;frame-src 'self' *.myguestaccount.com https://app.getbee.io/ *;
header: X-Lift-Version: Unknown Lift Version
header: CF-Cache-Status: DYNAMIC
header: cf-request-id: 01b2c5b1fa00002654a25485710000001
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Server: cloudflare
header: CF-RAY: 5acb58a62c5b5144-EWR
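Dumps in the send:/reply:/header: format shown above are what Python's http.client prints when its debug level is raised. Below is a minimal sketch of how one might turn that on. It assumes requests ultimately goes through http.client (via urllib3) and so picks up the class-level flag; the helper names build_debug_opener and enable_global_http_debug are made up for illustration.

```python
import http.client
import urllib.request


def build_debug_opener(level=1):
    """Return a urllib opener whose connections print send:/reply:/header: lines."""
    return urllib.request.build_opener(
        urllib.request.HTTPHandler(debuglevel=level),
        urllib.request.HTTPSHandler(debuglevel=level),
    )


def enable_global_http_debug(level=1):
    """Raise the class-level flag that connections built on http.client inherit."""
    http.client.HTTPConnection.debuglevel = level


opener = build_debug_opener()
# Network call, left commented out on purpose:
# opener.open("https://grimaldis.myguestaccount.com/guest/accountlogin")
```

Calling enable_global_http_debug() before a requests call should produce the same kind of output for that library, provided the assumption above holds for the installed versions.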
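Posted on 2020-07-02 00:56:42
This really piqued my interest. Here is the requests solution I was able to find.
Solution
I finally narrowed down the problem. requests uses a urllib3 connection pool, and there seems to be some inconsistency between a regular urllib3 connection and a pooled one. A working solution: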
import requests
from collections import OrderedDict
from requests import Session
import socket
# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
s = Session()
headers = OrderedDict({
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
})
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", headers=headers, verify=False).text
print(response)
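One caveat with the lookup above: socket.getaddrinfo returns a two-item sockaddr for IPv4 but a four-item one for IPv6, so unpacking it as (address, port) fails when the first answer is IPv6, and an IPv6 literal also needs square brackets inside a URL. Here is a small sketch of a more defensive variant. The helper names first_address and url_for are invented for illustration, and the sample tuples are hand-written (documentation-range IPs), not real lookup results.

```python
import socket


def first_address(answers):
    """Return (family, ip_string) from the first getaddrinfo() answer, IPv4 or IPv6."""
    family, _socktype, _proto, _canonname, sockaddr = answers[0]
    return family, sockaddr[0]


def url_for(family, address, path):
    """Build an https URL for a bare IP, bracketing IPv6 literals."""
    host = f"[{address}]" if family == socket.AF_INET6 else address
    return f"https://{host}{path}"


# Hand-written answers in the shape getaddrinfo() returns.
sample_v4 = [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("203.0.113.7", 443))]
sample_v6 = [(socket.AF_INET6, socket.SOCK_STREAM, 6, "", ("2001:db8::7", 443, 0, 0))]

print(url_for(*first_address(sample_v4), "/guest/accountlogin"))
print(url_for(*first_address(sample_v6), "/guest/accountlogin"))
```

With a real lookup, the result of socket.getaddrinfo('grimaldis.myguestaccount.com', 443) would be passed to first_address in place of the sample lists.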
Technical background
I ran both methods through Burp to compare the requests. Below are the raw dumps of the requests.
Using requests
GET /guest/accountlogin HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept-Encoding: gzip, deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Connection: close
Host: grimaldis.myguestaccount.com
Accept-Language: en-GB,en;q=0.5
Upgrade-Insecure-Requests: 1
dnt: 1
Using urllib
GET /guest/accountlogin HTTP/1.1
Host: grimaldis.myguestaccount.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: close
Upgrade-Insecure-Requests: 1
Dnt: 1
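The difference is the ordering of the headers. The difference in the capitalization of dnt is not actually the problem.
So I was able to make a successful request with the following raw request: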
GET /guest/accountlogin HTTP/1.1
Host: grimaldis.myguestaccount.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
So here the Host header is sent above User-Agent. If you want to keep using requests, consider using an OrderedDict to ensure the ordering of the headers.
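To make the ordering point concrete, here is a minimal standard-library sketch that serializes a request with the headers in exactly the order given, with Host first. build_raw_request is a hypothetical helper, not part of requests or urllib; it only builds the bytes and sends nothing.

```python
from collections import OrderedDict


def build_raw_request(method, path, headers):
    """Serialize an HTTP/1.1 request, keeping header order and casing as given."""
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


headers = OrderedDict([
    ("Host", "grimaldis.myguestaccount.com"),
    ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) "
                   "Gecko/20100101 Firefox/77.0"),
])
raw = build_raw_request("GET", "/guest/accountlogin", headers)
print(raw.decode("ascii"))
```

Since Python 3.7 a plain dict also preserves insertion order, so OrderedDict mainly serves to make the intent explicit.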
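Posted on 2020-07-09 13:53:49
After some debugging, and thanks to @TuanGeek's answer, we found that the issue with the requests library seems to come from a DNS issue on the requests side when dealing with Cloudflare. A simple fix is to connect directly to the host IP: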
import requests
from collections import OrderedDict
from requests import Session
import socket
# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
s = Session()
headers = OrderedDict({
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
})
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", headers=headers, verify=False).text
print(response)
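Now, this fix did not work with the HTTP library HTTPX, but I have found where the issue stems from.
The issue comes from the h11 library (which HTTPX uses to handle HTTP/1.1 requests): while urllib automatically fixes the letter case of headers, h11 takes a different approach and lowercases every header name. In theory this should not cause any issues, since servers are supposed to treat header names case-insensitively (and in many cases they do), but the reality is that HTTP is Hard™️, and services such as Cloudflare do not honor this and require capitalized header names.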
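The casing difference can be seen without any network traffic. The sketch below compares title-casing, which matches the urllib dump earlier in the thread (Dnt, User-Agent), with the lowercasing attributed to h11 above. The two helper functions are illustrative stand-ins, not either library's actual code.

```python
def title_cased(headers):
    """Normalize names the way the urllib dump shows them (e.g. 'Dnt', 'User-Agent')."""
    return {name.title(): value for name, value in headers.items()}


def lower_cased(headers):
    """Normalize names to lowercase, as described for h11 above."""
    return {name.lower(): value for name, value in headers.items()}


headers = {"HOST": "grimaldis.myguestaccount.com", "user-agent": "Mozilla/5.0", "dnt": "1"}
print(list(title_cased(headers)))
print(list(lower_cased(headers)))
```

A server that compares header names case-insensitively would treat both forms identically; the behaviour reported here suggests Cloudflare's bot check does not.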
Discussions about header capitalization have been going on for a while on h11:
https://github.com/python-hyper/h11/issues/31
They have also "recently" started to pop up on HTTPX's repo:
https://github.com/encode/httpx/issues/538
https://github.com/encode/httpx/issues/728
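Now, the unsatisfying answer to the issue between Cloudflare and HTTPX is that until something is done on h11's side (or until Cloudflare miraculously starts respecting RFC2616), not much will change in how HTTPX and Cloudflare handle header capitalization.
Either use a different HTTP library such as aiohttp or requests, try forking h11 and patching the header capitalization yourself, or wait and hope that the h11 team deals with the issue properly.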
https://stackoverflow.com/questions/62684468