A Java Crawler That Crawls Everything, Sina Included

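Introduction to HttpClient

HttpClient started as a subproject of Apache Jakarta Commons and is now maintained as part of Apache HttpComponents. It is an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol. The 4.x line used in this article supports HTTP/1.1. Its main features are:

(1) Implements all HTTP methods (GET, POST, PUT, HEAD, and so on)

(2) Follows redirects automatically

(3) Supports HTTPS

(4) Supports proxy servers, among other things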

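Introduction to Jsoup

jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. Its main features are:

(1) Parses HTML from a URL, a file, or a string;

(2) Finds and extracts data with DOM traversal or CSS selectors;

(3) Manipulates HTML elements, attributes, and text.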

Usage

Code

import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.junit.Test;

import java.util.List;

/**
 * HttpClient & Jsoup library test class
 *
 * Created by xuyh at 2017/11/6 15:28.
 */
public class HttpClientJsoupTest {

    @Test
    public void test() {
        // Fetch the page with HttpClient and read the response body as plain text
        HttpGet httpGet = new HttpGet("http://sports.sina.com.cn/");
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(30000).setConnectTimeout(30000).build());
        CloseableHttpClient httpClient = null;
        CloseableHttpResponse response = null;
        String responseStr = "";
        try {
            httpClient = HttpClientBuilder.create().build();
            HttpClientContext context = HttpClientContext.create();
            response = httpClient.execute(httpGet, context);
            int state = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            // Keep the body only for 200 OK; otherwise responseStr stays empty
            if (state == 200 && entity != null)
                responseStr = EntityUtils.toString(entity, "utf-8");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null)
                    response.close();
                if (httpClient != null)
                    httpClient.close();
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
        // Nothing to parse if the request failed or returned no body
        if (responseStr.isEmpty())
            return;
        // Parse the text into a Jsoup Document and query it
        Document document = Jsoup.parse(responseStr);
        Element newsBlock = document.getElementsByAttributeValue("class", "phdnews_txt fr").first();
        // first() returns null when the page no longer contains this block
        if (newsBlock == null)
            return;
        List<Element> elements = newsBlock.getElementsByAttributeValue("class", "phdnews_hdline");
        elements.forEach(element -> {
            for (Element e : element.getElementsByTag("a")) {
                System.out.println(e.attr("href"));
                System.out.println(e.text());
            }
        });
    }
}

Walkthrough

Create an HttpGet object that will send a GET request to http://sports.sina.com.cn/, and set both the socket timeout and the connect timeout to 30000 ms.
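For readers on Java 11 or later, the same request settings can be sketched without any third-party dependency, using the JDK's built-in java.net.http client. This is a substitute for Apache HttpClient, not the API used in this article, and the class and method names (TimeoutRequestSketch, newClient, newGet) are hypothetical. Note that HttpRequest's timeout bounds the wait for the response as a whole, which is close to, but not the same as, Apache's socket timeout. The sketch only builds the client and the request; it sends nothing over the network.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical sketch: JDK 11+ equivalent of "GET with 30000 ms timeouts".
final class TimeoutRequestSketch {

    // Builds a client with a connect timeout that also follows redirects.
    static HttpClient newClient(int timeoutMillis) {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(timeoutMillis))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // Builds a GET request whose response must arrive within the timeout.
    static HttpRequest newGet(String url, int timeoutMillis) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofMillis(timeoutMillis))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = newClient(30000);
        HttpRequest request = newGet("http://sports.sina.com.cn/", 30000);
        // Print the configured values instead of sending the request.
        System.out.println(request.method() + " " + request.uri());
        System.out.println("connect timeout: " + client.connectTimeout().orElse(null));
        System.out.println("request timeout: " + request.timeout().orElse(null));
    }
}
```

To actually send the request, one would call client.send(request, HttpResponse.BodyHandlers.ofString()) and pass the resulting body to Jsoup.parse as in the test above.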

Wrapping HttpClient and Jsoup together gives a utility class like the following:

import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.CookieStore;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.cookie.Cookie;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import javax.net.ssl.SSLContext;
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * HTTP utilities: sends plain HTTP and HTTPS requests with HttpClient.
 *
 * Created by xuyh at 2017/7/17 19:08.
 */
public class HttpUtils {
    /** Request timeout in milliseconds, 20000 by default */
    private int timeout = 20000;
    /** Cookie table: filled from responses and sent with every request */
    private Map<String, String> cookieMap = new HashMap<>();
    /** Charset used to decode response bodies, UTF-8 by default */
    private String charset = "UTF-8";
    private static HttpUtils httpUtils;

    private HttpUtils() {
    }

    /**
     * Returns the shared instance.
     * Synchronized so that concurrent first calls create only one instance.
     */
    public static synchronized HttpUtils getInstance() {
        if (httpUtils == null)
            httpUtils = new HttpUtils();
        return httpUtils;
    }

    /** Clears the cookie table. */
    public void invalidCookieMap() {
        cookieMap.clear();
    }

    public int getTimeout() {
        return timeout;
    }

    /** Sets the request timeout in milliseconds. */
    public void setTimeout(int timeout) {
        this.timeout = timeout;
    }

    public String getCharset() {
        return charset;
    }

    /** Sets the charset used to decode response bodies. */
    public void setCharset(String charset) {
        this.charset = charset;
    }

    /** Parses an HTML string into a Document, dropping &nbsp; entities. */
    public static Document parseHtmlToDoc(String html) throws Exception {
        return removeHtmlSpace(html);
    }

    private static Document removeHtmlSpace(String str) {
        Document doc = Jsoup.parse(str);
        String result = doc.html().replace("&nbsp;", "");
        return Jsoup.parse(result);
    }

    /** Executes a GET request and returns the parsed Document. */
    public Document executeGetAsDocument(String url) throws Exception {
        return parseHtmlToDoc(executeGet(url));
    }

    /** Executes a GET request and returns the response body. */
    public String executeGet(String url) throws Exception {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        String str = "";
        HttpClientContext context = HttpClientContext.create();
        // try-with-resources closes the response first, then the client
        try (CloseableHttpClient httpClient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpClient.execute(httpGet, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            int state = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            // A 404 yields an empty string; any other status returns the body
            if (state != 404 && entity != null) {
                str = EntityUtils.toString(entity, charset);
            }
        }
        return str;
    }

    /** Executes a GET request over HTTPS and returns the parsed Document. */
    public Document executeGetWithSSLAsDocument(String url) throws Exception {
        return parseHtmlToDoc(executeGetWithSSL(url));
    }

    /** Executes a GET request over HTTPS and returns the response body. */
    public String executeGetWithSSL(String url) throws Exception {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        String str = "";
        HttpClientContext context = HttpClientContext.create();
        try (CloseableHttpClient httpClient = createSSLInsecureClient();
             CloseableHttpResponse response = httpClient.execute(httpGet, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            int state = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            // A 404 yields an empty string; any other status returns the body
            if (state != 404 && entity != null) {
                str = EntityUtils.toString(entity, charset);
            }
        }
        return str;
    }

    /** Executes a form POST request and returns the parsed Document. */
    public Document executePostAsDocument(String url, Map<String, String> params) throws Exception {
        return parseHtmlToDoc(executePost(url, params));
    }

    /** Executes a form POST request and returns the response body. */
    public String executePost(String url, Map<String, String> params) throws Exception {
        String reStr = "";
        HttpPost httpPost = new HttpPost(url);
        httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
        List<NameValuePair> paramsRe = new ArrayList<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            paramsRe.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
        }
        httpPost.setEntity(new UrlEncodedFormEntity(paramsRe));
        HttpClientContext context = HttpClientContext.create();
        try (CloseableHttpClient httpclient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpclient.execute(httpPost, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                reStr = EntityUtils.toString(entity, charset);
            }
        }
        return reStr;
    }

    /** Executes a form POST request over HTTPS and returns the parsed Document. */
    public Document executePostWithSSLAsDocument(String url, Map<String, String> params) throws Exception {
        return parseHtmlToDoc(executePostWithSSL(url, params));
    }

    /** Executes a form POST request over HTTPS and returns the response body. */
    public String executePostWithSSL(String url, Map<String, String> params) throws Exception {
        String re = "";
        HttpPost post = new HttpPost(url);
        List<NameValuePair> paramsRe = new ArrayList<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            paramsRe.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
        }
        post.setHeader("Cookie", convertCookieMapToString(cookieMap));
        post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        post.setEntity(new UrlEncodedFormEntity(paramsRe));
        HttpClientContext contextRe = HttpClientContext.create();
        try (CloseableHttpClient httpClientRe = createSSLInsecureClient();
             CloseableHttpResponse response = httpClientRe.execute(post, contextRe)) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                re = EntityUtils.toString(entity, charset);
            }
            getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
        }
        return re;
    }

    /**
     * Sends a POST request with a JSON body.
     *
     * @param url      target address
     * @param jsonBody JSON body
     */
    public String executePostWithJson(String url, String jsonBody) throws Exception {
        String reStr = "";
        HttpPost httpPost = new HttpPost(url);
        httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpPost.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
        HttpClientContext context = HttpClientContext.create();
        try (CloseableHttpClient httpclient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpclient.execute(httpPost, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                reStr = EntityUtils.toString(entity, charset);
            }
        }
        return reStr;
    }

    /**
     * Sends a POST request with a JSON body over HTTPS.
     *
     * @param url      target address
     * @param jsonBody JSON body
     */
    public String executePostWithJsonAndSSL(String url, String jsonBody) throws Exception {
        String re = "";
        HttpPost post = new HttpPost(url);
        post.setHeader("Cookie", convertCookieMapToString(cookieMap));
        post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        post.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
        HttpClientContext contextRe = HttpClientContext.create();
        try (CloseableHttpClient httpClientRe = createSSLInsecureClient();
             CloseableHttpResponse response = httpClientRe.execute(post, contextRe)) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                re = EntityUtils.toString(entity, charset);
            }
            getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
        }
        return re;
    }

    // Copies every cookie from the store into the cookie table
    private void getCookiesFromCookieStore(CookieStore cookieStore, Map<String, String> cookieMap) {
        List<Cookie> cookies = cookieStore.getCookies();
        for (Cookie cookie : cookies) {
            cookieMap.put(cookie.getName(), cookie.getValue());
        }
    }

    // Joins the cookie table into a "name=value; name=value" header value
    private String convertCookieMapToString(Map<String, String> map) {
        StringBuilder cookie = new StringBuilder();
        for (Map.Entry<String, String> entry : map.entrySet()) {
            if (cookie.length() > 0) {
                cookie.append("; ");
            }
            cookie.append(entry.getKey()).append("=").append(entry.getValue());
        }
        return cookie.toString();
    }

    /**
     * Creates a client for HTTPS requests.
     * Warning: it trusts every certificate and skips hostname verification,
     * so use it only against hosts you already trust.
     */
    private static CloseableHttpClient createSSLInsecureClient() throws GeneralSecurityException {
        SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, (chain, authType) -> true).build();
        SSLConnectionSocketFactory sslConnectionSocketFactory = new SSLConnectionSocketFactory(sslContext,
                (hostname, session) -> true);
        return HttpClients.custom().setSSLSocketFactory(sslConnectionSocketFactory).build();
    }
}
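The cookie handling in the utility class comes down to one step: joining a name-to-value map into a single Cookie header value. A minimal standalone sketch of that step follows. The class name CookieHeaderSketch, the method name toCookieHeader, and the sample cookie names are hypothetical, and the sketch uses a LinkedHashMap so that cookies keep their insertion order, whereas the HashMap in the utility class gives no ordering guarantee.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the cookie-map-to-header conversion.
final class CookieHeaderSketch {

    // Joins entries as "name=value" pairs separated by "; ".
    // An empty map produces an empty string.
    static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("sid", "abc123"); // sample values, not real Sina cookies
        cookies.put("lang", "zh");
        System.out.println(toCookieHeader(cookies)); // prints "sid=abc123; lang=zh"
    }
}
```

Because the map is keyed by cookie name only, cookies with the same name from different domains or paths overwrite each other. That is acceptable when crawling a single site, but it is a limitation to keep in mind when one shared instance talks to several hosts.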


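Original-content notice: this article is published on the Cloud+ Community (云+社区) with the author's authorization and may not be reproduced without permission.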
