I need to collect information hidden in the full HTML source of some websites. The problem is that I cannot read those sites' complete HTML. I have tried jsoup, Apache's HttpClient, the Java 7 HttpClient, and the newest Java HttpClient. The current (last) option works best, showing 3098 lines of HTML, but the full document has roughly another 1000 lines.
The method responsible for downloading the page:
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public static boolean saveMaterials(String link, String filePath) throws IOException, InterruptedException {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder().uri(URI.create(link)).build();
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    try {
        // Note: this writer is never closed, so whatever is still sitting in its
        // buffer when the method returns is silently lost.
        PrintWriter tmp = new PrintWriter(filePath);
        tmp.print(response.body());
        return true;
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    return false;
}
Does anyone know how to change the buffer size, or some other way to solve this?
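For reference, the most likely culprit here is not an HTTP buffer at all but the unclosed PrintWriter above: PrintWriter buffers its output, and without a close() (or try-with-resources) the final buffer is never flushed, which truncates the file. A minimal sketch that sidesteps the problem by letting the HTTP client stream the body straight to disk via HttpResponse.BodyHandlers.ofFile (the method name saveMaterialsToFile is made up for illustration):

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical variant of saveMaterials: the body is written to disk by the
// HTTP client itself, so there is no intermediate writer to forget to flush.
public static boolean saveMaterialsToFile(String link, String filePath)
        throws IOException, InterruptedException {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder().uri(URI.create(link)).build();
    HttpResponse<Path> response =
            client.send(request, HttpResponse.BodyHandlers.ofFile(Path.of(filePath)));
    return response.statusCode() == 200;
}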
Posted on 2021-07-22 19:24:31
Following Mr.Andersen's suggestion, I used the curl command:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public static boolean save(String link, String filePath) throws IOException {
    String command = "curl -X GET " + link;
    try (InputStream inputStream = new ProcessBuilder(command.split(" "))
            .directory(new File("/"))
            .start()
            .getInputStream();
         OutputStream output = new FileOutputStream(filePath, false)) {
        // Copy curl's stdout (the downloaded page) straight into the target file.
        inputStream.transferTo(output);
    } catch (IOException e) {
        return false;
    }
    return true;
}
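Two caveats with this approach, for anyone copying it: splitting the command on spaces breaks if the URL itself contains one, and the method returns true even when curl fails, since the exit code is never checked. A sketch of a slightly tighter variant (saveWithCurl is a made-up name, not from the answer) that lets ProcessBuilder redirect stdout to the file and reports success only on a zero exit code:

import java.io.File;
import java.io.IOException;

// Hypothetical variant: ProcessBuilder writes curl's stdout directly to the
// file, and success is reported only when curl exits with code 0.
public static boolean saveWithCurl(String link, String filePath) {
    try {
        Process curl = new ProcessBuilder("curl", "-sS", link)
                .redirectOutput(new File(filePath))             // stdout -> file, no copy loop
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show curl errors on our stderr
                .start();
        return curl.waitFor() == 0;
    } catch (IOException e) {
        return false;
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore the interrupt flag before bailing out
        return false;
    }
}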
https://stackoverflow.com/questions/68471455