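I am trying to write a web crawler, and I want it to connect to websites through a local proxy.
So, suppose we want to send a GET message to Google and retrieve its HTML, all through a local proxy (I work at my university, which has a proxy for reaching external sites such as Google).
Here is my code: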
#include <iostream>
#include <cstring>      // memset
#include <string>
#include <sys/types.h>
#include <sys/socket.h> // socket, connect, send, recv
#include <netdb.h>      // getaddrinfo, freeaddrinfo, gai_strerror
#include <unistd.h>     // close
using namespace std;

int main(int argc, char* argv[])
{
    addrinfo host_info;        // Hints passed to getaddrinfo().
    addrinfo *host_info_list;  // Result list that getaddrinfo() fills in.
    memset(&host_info, 0, sizeof host_info);
    host_info.ai_family = AF_INET; // AF_UNSPEC;
    host_info.ai_socktype = SOCK_STREAM;

    // PROXY = proxy.fing.edu.uy ; PORT = 3128 ; HTTP/1.0 proxy
    int status = getaddrinfo("proxy.fing.edu.uy", "3128", &host_info, &host_info_list);
    if (status != 0) {
        std::cout << "ERROR: getaddrinfo: " << gai_strerror(status) << std::endl;
        return 1;
    }

    int socketfd = socket(host_info_list->ai_family, host_info_list->ai_socktype,
                          host_info_list->ai_protocol);
    if (socketfd == -1) { std::cout << "ERROR: socket error" << std::endl; return 1; }

    std::cout << "Connect()ing..." << std::endl;
    status = connect(socketfd, host_info_list->ai_addr, host_info_list->ai_addrlen);
    freeaddrinfo(host_info_list);
    if (status == -1) { std::cout << "ERROR: connect error" << std::endl; return 1; }

    // First message: ask the proxy to open a tunnel.
    std::string msg = "CONNECT www.google.com HTTP/1.0\r\n\r\n";
    ssize_t bytes_sent = send(socketfd, msg.c_str(), msg.size(), 0);
    std::cout << "bytes_sent: " << bytes_sent << std::endl;

    std::cout << "Waiting to receive data..." << std::endl;
    char incoming_data_buffer[200];
    // Leave one byte free for the terminator; recv() does not null-terminate.
    ssize_t bytes_received = recv(socketfd, incoming_data_buffer, sizeof incoming_data_buffer - 1, 0);
    if (bytes_received == 0) std::cout << "host shut down." << std::endl;
    if (bytes_received == -1) std::cout << "ERROR: receive error!" << std::endl;
    std::cout << bytes_received << " bytes received" << std::endl;
    incoming_data_buffer[bytes_received > 0 ? bytes_received : 0] = '\0';
    std::cout << incoming_data_buffer << std::endl;

    // Second message: the GET request that never gets a reply.
    std::string msg2 = "GET http://www.google.com/ HTTP/1.0\r\n\r\n";
    std::cout << "Message sent to google: " << msg2 << std::endl;
    bytes_sent = send(socketfd, msg2.c_str(), msg2.size(), 0);
    std::cout << "bytes_sent: " << bytes_sent << std::endl;

    std::cout << "Waiting to receive data ..." << std::endl;
    char incoming_data_buffer2[1000];
    bytes_received = recv(socketfd, incoming_data_buffer2, sizeof incoming_data_buffer2 - 1, 0);
    if (bytes_received == 0) std::cout << "host shut down." << std::endl;
    if (bytes_received == -1) std::cout << "ERROR: receive error!" << std::endl;
    std::cout << bytes_received << " bytes received" << std::endl;
    incoming_data_buffer2[bytes_received > 0 ? bytes_received : 0] = '\0';
    std::cout << incoming_data_buffer2 << std::endl;

    close(socketfd);
    return 0;
}
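The problem I am running into is the following. First, incoming_data_buffer (the buffer that holds the reply to CONNECT) contains "HTTP/1.0 200 Connection established", which is fine; no problems so far. Next, I send the GET message to the proxy so that it forwards it to Google as expected (now that the connection is established). The program then sits idle in recv() for about a minute and returns 0 (which I guess means the connection was closed), with the buffer empty. My problem is that I do not know why recv() returns 0. Any ideas? It presumably means the connection was closed, but why? What else do I need to do so that the proxy keeps the connection open (assuming the closed connection is the problem)?
Thanks in advance!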
Posted on 2014-09-22 08:48:15
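The CONNECT method is an HTTP tunneling feature. A proxy that supports it may restrict its use to connections to HTTPS websites (source: Wikipedia, "HTTP tunnel"). You are trying to use CONNECT to establish a connection to a plain HTTP server, which the proxy may block.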
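For comparison, here is a rough sketch of how a tunnel request is normally formed when a tunnel really is wanted. The target of CONNECT is written as host:port, whereas the code in the question sends a bare host name. The helper name build_connect_request and the port 443 in the usage note are illustrative assumptions, not something taken from the question.

```cpp
#include <string>

// Hypothetical helper (not from the original post): builds a CONNECT request
// whose target is in "host:port" form, followed by the blank line that ends
// the request headers.
std::string build_connect_request(const std::string& host, unsigned short port) {
    return "CONNECT " + host + ":" + std::to_string(port) + " HTTP/1.0\r\n\r\n";
}
```

Calling build_connect_request("www.google.com", 443) yields the request a client would typically send before starting a TLS handshake through the tunnel. Whether a given proxy accepts it still depends on that proxy's policy.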
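After establishing the connection to the proxy, just send your request rather than setting up a tunnel. This will work because you are using an absoluteURI to specify the target of your GET.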
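A minimal sketch of that approach, assuming a socket that is already connected to the proxy as in the question: the GET with the absolute URI is the first and only message sent, and the reply is read until the peer closes the connection. The helper names build_proxy_get_request, send_all, and read_until_closed are hypothetical, introduced only for illustration.

```cpp
#include <cstddef>
#include <string>
#include <sys/types.h>
#include <sys/socket.h>

// Hypothetical helper: request line with an absolute URI, sent directly to
// the proxy with no CONNECT beforehand.
std::string build_proxy_get_request(const std::string& absolute_url) {
    return "GET " + absolute_url + " HTTP/1.0\r\n\r\n";
}

// Hypothetical helper: send() may write only part of the buffer, so keep
// going until every byte has been handed to the kernel.
bool send_all(int fd, const std::string& data) {
    std::size_t sent = 0;
    while (sent < data.size()) {
        ssize_t n = send(fd, data.data() + sent, data.size() - sent, 0);
        if (n <= 0) return false;
        sent += static_cast<std::size_t>(n);
    }
    return true;
}

// Hypothetical helper: collect the reply until recv() returns 0 (the peer
// closed the connection) or -1 (error). Appending by length avoids relying
// on a terminating '\0', which recv() never writes.
std::string read_until_closed(int fd) {
    std::string response;
    char chunk[1024];
    ssize_t n;
    while ((n = recv(fd, chunk, sizeof chunk, 0)) > 0) {
        response.append(chunk, static_cast<std::size_t>(n));
    }
    return response;
}
```

In the program from the question this would replace both send()/recv() pairs: call send_all(socketfd, build_proxy_get_request("http://www.google.com/")) right after connect(), then print the result of read_until_closed(socketfd). With an HTTP/1.0 response, a return value of 0 from recv() at the end is the normal end-of-response signal rather than a failure.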
https://stackoverflow.com/questions/25965194