Look at an HTML parser such as TagSoup, HTMLCleaner or NekoHTML.
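
To illustrate what a parser gives you compared with string matching, here is a minimal sketch that pulls the page title out of an HTML string. It does not use any of the three libraries named above; as an assumption for the sake of a dependency-free example, it uses the old HTML parser bundled with the JDK (`javax.swing.text.html.parser.ParserDelegator`). The class and method names (`TitleSketch`, `extractTitle`) are made up for this example, and the libraries above are likely to cope better with messy real-world markup.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: extracts the <title> text with the JDK's built-in HTML parser.
class TitleSketch {

    // Receives parser events and keeps only the text seen inside <title>.
    static class TitleCallback extends HTMLEditorKit.ParserCallback {
        private final StringBuilder title = new StringBuilder();
        private boolean inTitle = false;

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.TITLE) {
                inTitle = true;
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == HTML.Tag.TITLE) {
                inTitle = false;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (inTitle) {
                title.append(data);
            }
        }
    }

    // Returns the trimmed title text, or an empty string if there is no <title>.
    static String extractTitle(String html) {
        TitleCallback callback = new TitleCallback();
        try (Reader reader = new StringReader(html)) {
            // The last argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.title.toString().trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example page</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(extractTitle(html));
    }
}
```

In a real scraper you would feed `extractTitle` the body of each page you download, one `pageID` at a time.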

mechanize for Java would be a good fit for this and, as Wadjy Essam mentioned, it uses JSoup for the HTML. mechanize is a stateful HTTP/HTML client that supports navigation, form submission, and page scraping.

<a href="http://gistlabs.com/software/mechanize-for-java/" rel="nofollow">http://gistlabs.com/software/mechanize-for-java/</a> (and the GitHub here <a href="https://github.com/GistLabs/mechanize" rel="nofollow">https://github.com/GistLabs/mechanize</a>)

There is also Jaunt (Java Web Scraping &amp; JSON Querying): <a href="http://jaunt-api.com" rel="nofollow noreferrer">http://jaunt-api.com</a>

If you want to automate scraping of a large number of pages or a large amount of data, you could try <a href="https://github.com/maithilish/gotz" rel="nofollow noreferrer">Gotz ETL</a>.

It is completely model-driven, like a real ETL tool. The data structure, task workflow, and pages to scrape are defined in a set of XML definition files, and no coding is required. Queries can be written either as selectors with JSoup or as XPath with HtmlUnit.

For tasks of this type I usually use crawler4j + jsoup.
With crawler4j I download the pages from a domain; you can specify which URLs to fetch with a regular expression.
With jsoup I parse the HTML that crawler4j has found and downloaded.
You can normally download pages with jsoup as well, but crawler4j makes it easier to find links.
Another advantage of crawler4j is that it is multithreaded, and you can configure the number of concurrent threads.
<a href="https://github.com/yasserg/crawler4j/wiki" rel="nofollow noreferrer">https://github.com/yasserg/crawler4j/wiki</a>
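
To make the regular-expression part concrete, here is a rough sketch of the kind of URL filter I mean, written in plain Java rather than against the crawler4j API. The domain `example.com`, the list of file extensions, and the names `UrlFilterSketch`, `shouldVisit` and `filter` are all assumptions for illustration; check the wiki linked above for how crawler4j expects you to plug such a rule in.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical filter: decides which discovered URLs are worth downloading.
class UrlFilterSketch {

    // Assumed rule 1: stay on one site (example.com is a placeholder).
    private static final Pattern SAME_SITE =
            Pattern.compile("^https?://(www\\.)?example\\.com/.*");

    // Assumed rule 2: skip static assets that contain no HTML to parse.
    private static final Pattern STATIC_ASSET =
            Pattern.compile(".*\\.(css|js|png|jpe?g|gif|pdf|zip)$", Pattern.CASE_INSENSITIVE);

    // True if the URL is on the target site and is not a static asset.
    static boolean shouldVisit(String url) {
        return SAME_SITE.matcher(url).matches()
                && !STATIC_ASSET.matcher(url).matches();
    }

    // Keeps only the URLs that pass shouldVisit, preserving their order.
    static List<String> filter(List<String> urls) {
        return urls.stream()
                .filter(UrlFilterSketch::shouldVisit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> found = List.of(
                "https://example.com/page?id=1",
                "https://example.com/style.css",
                "https://other.org/page?id=2");
        System.out.println(filter(found));
    }
}
```

The pages that pass the filter are the ones you would then hand to jsoup for parsing.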

Normally I use Selenium, which is software for test automation.
You control a real browser through a WebDriver, so JavaScript-heavy pages are not a problem, and a full (non-headless) browser is usually harder for sites to detect. Headless browsers tend to be identified more easily.

I'm not able to find any good Java-based web scraping API. The site I need to scrape does not provide an API either; I want to iterate over all of its pages using some <code>pageID</code> and extract the HTML titles and other content from their DOM trees.

Are there ways other than web scraping?

Web scraping with Java
