For the last day or two I have been trying hard to learn Java. This is the first project I'm doing, so please bear with me. I'm working on a multithreaded web crawler. It's quite simple, but I'd like to ask for some advice.
The program starts at a web address (http://google.com in this code) and looks for all valid URLs in the response it gets. Every URL found in a response is added to a queue, and the crawler then keeps crawling the URLs in that queue. To stop the crawler, type exit on the input.
Http.java
package com.janchr;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Created by Jan on 8/20/2016.
 */
public class Http {
    public static BufferedReader Get(URL url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");
        // pretend that we are a new-ish browser. current user agent is actually from 2015.
        con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36");
        con.setInstanceFollowRedirects(true);

        int statusCode = con.getResponseCode();

        // https://www.mkyong.com/java/java-httpurlconnection-follow-redirect-example/
        boolean redirect = false;
        if (statusCode != HttpURLConnection.HTTP_OK) {
            if (statusCode == HttpURLConnection.HTTP_MOVED_TEMP
                    || statusCode == HttpURLConnection.HTTP_MOVED_PERM
                    || statusCode == HttpURLConnection.HTTP_SEE_OTHER)
                redirect = true;
        }

        if (redirect) {
            // get redirect url from "location" header field
            String newUrl = con.getHeaderField("Location");
            // get the cookie if need
            String cookies = con.getHeaderField("Set-Cookie");
            return Http.Get(new URL(newUrl));
        }

        return new BufferedReader(new InputStreamReader(con.getInputStream()));
    }
}
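For what it's worth, a minimal way to exercise this helper on its own could look like the following sketch (HttpDemo and the target URL are illustrative only, not part of the project):

import java.io.BufferedReader;
import java.net.URL;

public class HttpDemo {
    public static void main(String[] args) throws Exception {
        BufferedReader r = Http.Get(new URL("http://example.com"));
        String line;
        while ((line = r.readLine()) != null) {
            System.out.println(line); // dump the response body line by line
        }
        r.close(); // Http.Get returns an open reader, so the caller closes it
    }
}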
Crawler.java
package com.janchr;

import java.io.BufferedReader;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Created by Jan on 8/20/2016.
 */
class CrawlThread implements Runnable {
    final static Pattern urlPat = Pattern.compile("(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    Crawler c;
    int num;
    boolean stop;
    public Thread t;

    public CrawlThread(Crawler c, int num) {
        this.c = c;
        this.num = num;
        this.t = new Thread(this, "CrawlThread");
        t.start();
    }

    private LinkedList<String> parse(BufferedReader r) {
        String lineBuf = "";
        LinkedList<String> urls = new LinkedList<String>();
        do {
            try {
                lineBuf = r.readLine();
            } catch (IOException e) {
                System.out.println("(" + this.num + ") error parsing: " + e);
                return urls;
            }
            if (lineBuf == null) {
                return urls;
            }

            Matcher m = urlPat.matcher(lineBuf);
            while (m.find()) {
                //System.out.println("(" + this.num + ") match: " + m.group(0));
                urls.add(m.group(0));
            }
        } while (lineBuf != null);
        return urls;
    }

    public void run() {
        // pop_front the next URL and get it
        do {
            String surl = c.next();
            //System.out.println("(" + this.num + ") getting " + surl);
            URL url;
            try {
                url = new URL(surl);
            } catch (MalformedURLException e) {
                System.out.println("(" + this.num + ") bad url " + surl + ": " + e);
                continue;
            }

            BufferedReader r;
            try {
                r = Http.Get(url);
            } catch (IOException e) {
                System.out.println("(" + this.num + ") IOException Http.Get " + surl + ": " + e);
                continue;
            }

            c.done(surl);
            for (String newUrl : this.parse(r)) {
                c.addURL(newUrl);
            }
        } while (!this.stop);
    }
}

class VisitedURL {
    public String url;
    public int visits;

    VisitedURL(String url) {
        this.url = url;
    }
}

public class Crawler {
    private List<String> queue = Collections.synchronizedList(new LinkedList<>());
    private Map<String, VisitedURL> visited = Collections.synchronizedMap(new LinkedHashMap<>());
    private ArrayList<CrawlThread> threads = new ArrayList<>();
    private int maxThreads;

    public Crawler(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    public void start(String entryPoint) {
        this.queue.add(entryPoint);
        for (int i = 0; i < this.maxThreads; i++) {
            this.threads.add(new CrawlThread(this, i));
        }
    }

    public synchronized void stop() {
        for (CrawlThread t : this.threads) {
            // interrupting the thread should be fine for us in our use-case.
            t.stop = true;
            t.t.interrupt();
        }
    }

    public synchronized String next() {
        // I got IndexOutOfBoundsException here when starting up the crawler.
        // the only way to fix it for me was this loop. I don't know what would
        // be a better way to fix it. A mutex didn't work for me.
        do {
            if (this.queue.size() == 0) {
                try {
                    wait();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        } while (this.queue.size() == 0);

        synchronized (this.queue) {
            if (this.queue.size() == 1) {
                System.out.println("QUEUE EMPTY NOW");
            }
            return this.queue.remove(0);
        }
    }

    public void done(String url) {
        final VisitedURL obj = this.visited.putIfAbsent(url, new VisitedURL(url));
        if (obj == null) {
            this.visited.get(url).visits++;
        }
    }

    public synchronized void addURL(String url) {
        // TODO: we might want to ignore the URLs query
        if (this.queue.contains(url)) {
            return;
        }
        if (this.visited.containsKey(url)) {
            this.visited.get(url).visits++;
            return;
        }
        this.queue.add(url);
        notifyAll();
    }

    public Map<String, VisitedURL> getVisitedUrls() {
        return visited;
    }
}
Main.java
package com.janchr;

import java.util.Scanner;

public class Main {
    public static void main(String[] args) {
        Crawler c = new Crawler(8);
        System.out.println("starting crawler");
        c.start("http://google.com");

        Scanner s = new Scanner(System.in);
        while (!s.next().equals("exit"));
        c.stop();

        synchronized (c) {
            System.out.println("\n\n---------------------------------------------------------------------");
            for (VisitedURL u : c.getVisitedUrls().values()) {
                System.out.println(u.visits + "x " + u.url);
            }
            System.out.println("---------------------------------------------------------------------");
            System.out.println("visited " + c.getVisitedUrls().size() + " unique urls");
        }
    }
}
A few questions:

- I don't know why I have to put the wait call there, but I guess it's because I later use notifyAll(?). The standard form of this idiom is sketched right below.
- I updated my code to use threads as specified in one of the questions. Is this the better solution?
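For reference, this is the standard guarded-block idiom behind wait/notifyAll, as a minimal standalone sketch (UrlQueue is an illustrative name, not a class from the project). wait must sit in a loop because a woken thread has to re-check its condition before proceeding:

import java.util.LinkedList;

class UrlQueue {
    private final LinkedList<String> queue = new LinkedList<>();

    // Blocks until an element is available. The while loop is essential:
    // another consumer may have emptied the queue again between the
    // notifyAll() and this thread actually resuming.
    public synchronized String take() throws InterruptedException {
        while (queue.isEmpty()) {
            wait();
        }
        return queue.removeFirst();
    }

    // Adding an element changes the awaited condition, so wake the waiters.
    public synchronized void put(String url) {
        queue.add(url);
        notifyAll();
    }
}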
Crawler.java
package com.janchr;

import java.io.BufferedReader;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Created by Jan on 8/20/2016.
 */
class CrawlThread implements Runnable {
    final static Pattern urlPat = Pattern.compile("https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    Crawler c;
    String url;
    public Thread t;

    public CrawlThread(Crawler c, String url) {
        this.c = c;
        this.url = url;
        this.t = new Thread(this, "CrawlThread");
        t.start();
    }

    private LinkedList<String> parse(BufferedReader r) {
        String lineBuf = "";
        LinkedList<String> urls = new LinkedList<String>();
        do {
            try {
                lineBuf = r.readLine();
            } catch (IOException e) {
                System.out.println("error parsing: " + e);
                return urls;
            }
            if (lineBuf == null) {
                return urls;
            }

            Matcher m = urlPat.matcher(lineBuf);
            while (m.find()) {
                urls.add(m.group(0));
            }
        } while (lineBuf != null);
        return urls;
    }

    public void run() {
        URL url;
        try {
            url = new URL(this.url);
        } catch (MalformedURLException e) {
            System.out.println("bad url " + this.url + ": " + e);
            c.done(this, this.url);
            return;
        }

        BufferedReader r;
        try {
            r = Http.Get(url);
        } catch (IOException e) {
            System.out.println("IOException Http.Get " + this.url + ": " + e);
            c.done(this, this.url);
            return;
        }

        for (String newUrl : this.parse(r)) {
            c.addURL(newUrl);
        }
        c.done(this, this.url);
    }
}

class VisitedURL {
    public String url;
    public int visits;

    VisitedURL(String url) {
        this.url = url;
    }
}

public class Crawler {
    private List<String> queue = Collections.synchronizedList(new LinkedList<>());
    private Map<String, VisitedURL> visited = Collections.synchronizedMap(new LinkedHashMap<>());
    private ArrayList<CrawlThread> threads = new ArrayList<>();
    private int maxThreads;

    public Crawler(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    public void start(String entryPoint) {
        this.queue.add(entryPoint);
        this.tryNext();
    }

    public synchronized void stop() {
        for (CrawlThread t : this.threads) {
            // interrupting the thread should be fine for us in our use-case.
            t.t.interrupt();
        }
    }

    public synchronized boolean hasNext() {
        return this.queue.size() > 0;
    }

    public synchronized String next() {
        if (this.queue.size() == 0) {
            return null;
        }
        return this.queue.remove(0);
    }

    private void tryNext() {
        if (!this.hasNext() || this.threads.size() == this.maxThreads) {
            return;
        }
        String next = this.next();
        if (next == null) {
            System.out.println("invalid next string");
            return;
        }
        this.threads.add(new CrawlThread(this, next));
    }

    public void done(CrawlThread t, String url) {
        final VisitedURL obj = this.visited.putIfAbsent(url, new VisitedURL(url));
        if (obj == null) {
            this.visited.get(url).visits++;
        }
        this.threads.remove(t);
        this.tryNext();
    }

    public synchronized void addURL(String url) {
        // TODO: we might want to ignore the URLs query
        if (this.queue.contains(url)) {
            return;
        }
        if (this.visited.containsKey(url)) {
            this.visited.get(url).visits++;
            return;
        }
        this.queue.add(url);
        this.tryNext();
    }

    public Map<String, VisitedURL> getVisitedUrls() {
        return visited;
    }
}
The problem now is that I can no longer .interrupt my threads. How do I solve this?

Posted 2016-09-24 17:42:35
Nicely done for someone with only a few days of Java experience! Now for some improvements:

You are working on a derivative of the classic producer-consumer problem. This is a common problem, and the patterns for solving it in Java are well established.

The abstraction you want to use here is called ExecutorService. Essentially, it lets you submit Runnables to be executed by the ExecutorService. You can construct an ExecutorService easily with Executors#newFixedThreadPool. With a few changes to your CrawlThread class, we can make it work in this new model:
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Assumed shape of SeenUrl, matching the timesSeen use below.
class SeenUrl {
    final String url;
    int timesSeen = 1;

    SeenUrl(String url) {
        this.url = url;
    }
}

class Crawler implements Runnable {
    private final String url;
    private final ExecutorService executor;
    private final Map<String, SeenUrl> seenUrls;

    public Crawler(
            String url,
            ExecutorService executor,
            Map<String, SeenUrl> seenUrls) {
        this.url = url;
        this.executor = executor;
        this.seenUrls = seenUrls;
    }

    @Override
    public void run() {
        List<String> newUrls = parse(); // Very similar to your parse
        for (String newUrl : newUrls) {
            synchronized (seenUrls) {
                if (seenUrls.containsKey(newUrl)) {
                    seenUrls.get(newUrl).timesSeen++;
                } else {
                    seenUrls.put(newUrl, new SeenUrl(newUrl));
                    executor.submit(new Crawler(newUrl, executor, seenUrls));
                }
            }
        }
    }
}

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // Run with 5 threads, adjust as necessary.
        ExecutorService executorService = Executors.newFixedThreadPool(5);
        Map<String, SeenUrl> seenUrls = new LinkedHashMap<>();
        seenUrls.put("http://google.com", new SeenUrl("http://google.com"));
        executorService.submit(
                new Crawler("http://google.com", executorService, seenUrls));
        executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
    }
}
Now, there may be a few surprises in the snippet above. Most importantly, the program has no clean exit as written: awaitTermination by itself never stops the pool, so you will want to look at ExecutorService#shutdown().

The next improvement I would make is to replace the map of SeenUrls with a multiset. That, however, is not included in the standard collection library.
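Since the standard library lacks a multiset, a minimal thread-safe stand-in is easy to sketch (CountingSet is a hypothetical helper; Guava's Multiset would be a ready-made alternative):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Counts how often each element has been added; add() reports first sightings.
class CountingSet<T> {
    private final Map<T, Integer> counts = new ConcurrentHashMap<>();

    // Returns true if this is the first time the element was added.
    public boolean add(T element) {
        return counts.merge(element, 1, Integer::sum) == 1;
    }

    public int count(T element) {
        return counts.getOrDefault(element, 0);
    }
}

With that, run() could decide in a single add(newUrl) call whether a URL is new and should be submitted.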
Posted 2016-11-02 13:55:46
Matt H gave a great answer on how to take advantage of Java's extensive libraries. Executors are a great service. Don't shy away from those APIs; they work well and speed up the coding process many times over.

Your formatting is exemplary (indentation on point, correct cAsE), so I'll simply go over the usual coding standards / Java idioms.

Classes tend to be nouns, not verbs (they are categories of objects and do not define actions). They provide the means of performing actions through methods, and those should be verbs. CrawlThread would be better named ThreadCrawler, after what it is rather than what it does. Crawler.done() should be named Crawler.markAsDone().
class CrawlThread implements Runnable {
    final static Pattern urlPat = ...;

    Crawler c;
    String url;
    public Thread t;
You probably forgot a few access level modifiers here. I don't know whether you intend to subclass these classes, but making all of these private, especially the static constant, is a good idea. The default (package-private) access level is rarely what you want.

The public Thread handle in particular is an accident waiting to happen: it lets anyone, from anywhere, call into your thread and mess up your threading. Hide it and expose only safe control methods; a sketch follows below.

Your classes do not declare an access level either. Try to keep them as restrictive as possible, especially VisitedURL.

I am not sure whether secondary classes like VisitedURL are defined in their own files, but if they are nested instead, you have to make them static.
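Applied to the fields above, that advice might look like this (a sketch; the Crawler reference and crawling logic are omitted for brevity):

import java.util.regex.Pattern;

class CrawlThread implements Runnable {
    // private static final: nothing outside this class needs the pattern.
    private static final Pattern URL_PAT =
            Pattern.compile("https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    private final String url;
    // The Thread handle stays hidden; only safe operations are exposed.
    private final Thread t = new Thread(this, "CrawlThread");

    CrawlThread(String url) {
        this.url = url;
    }

    public void start() {
        t.start();
    }

    public void interrupt() {
        t.interrupt();
    }

    @Override
    public void run() {
        // crawling logic as before
    }
}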
class CrawlThread implements Runnable {
    ...
    public Thread t;
You are not being clear with this statement. You claim CrawlThread is a Runnable, so I should be handing it to a Thread. But as soon as I construct one, it starts a thread all by itself, and I get no control over that. This looks an awful lot like a Thread to me!

You need to split the responsibilities. Either keep CrawlThread a Runnable, but remove its Thread and manage that externally, or have it extend Thread directly.
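The first option could look roughly like this (CrawlTask is a hypothetical name; the fetching logic is elided):

// The task only knows how to crawl; the caller decides about threading.
class CrawlTask implements Runnable {
    private final String url;

    CrawlTask(String url) {
        this.url = url;
    }

    @Override
    public void run() {
        // fetch and parse this.url as before
    }
}

// The caller wires the Runnable to a Thread explicitly and keeps control:
// Thread t = new Thread(new CrawlTask("http://example.com"), "crawler-1");
// t.start();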
It changes on a case-by-case basis, but try to push try/catch blocks as far out as possible, especially if the catch clause exits the enclosing do/while. Prefer this:

try {
    do {
        lineBuf = r.readLine();
    } while (lineBuf != null);
} catch (IOException e) {
    System.out.println("error parsing: " + e);
    return urls;
}

This shows much better that you intend to exit the while when an exception occurs. That would not be the case if the catch clause contained a continue.
BufferedReader is a Closeable. Better yet, it is also an AutoCloseable. Since Java 7 there has been a much better way to write this code (which, as written, never closes the BufferedReader):

BufferedReader r;
try {
    r = Http.Get(url);
} catch (IOException e) {
    System.out.println("IOException Http.Get " + this.url + ": " + e);
    c.done(this, this.url);
    return;
}
for (String newUrl : this.parse(r)) {
    c.addURL(newUrl);
}
c.done(this, this.url);
It can now be written in a more concise and more robust way:

try (BufferedReader r = Http.Get(url)) {
    for (String newUrl : this.parse(r)) {
        c.addURL(newUrl);
    }
} catch (IOException e) {
    System.out.println("error parsing: " + e);
} finally {
    c.done(this, this.url);
}
private LinkedList<String> parse(BufferedReader r) {

It is a good idea to always return List<String> here instead, so that you can later change the implementation without lots of code edits.
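A sketch of the adjusted method, reusing the urlPat constant from the class and letting the IOException propagate (in line with the try/catch advice above):

// Callers see only the List interface; the implementation stays swappable.
private List<String> parse(BufferedReader r) throws IOException {
    List<String> urls = new ArrayList<>(); // could become LinkedList again without breaking callers
    String line;
    while ((line = r.readLine()) != null) {
        Matcher m = urlPat.matcher(line);
        while (m.find()) {
            urls.add(m.group(0));
        }
    }
    return urls;
}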
Your error logging does not play to Java's strengths:

System.out.println("(" + this.num + ") error parsing: " + e);

This simply prints the message (followed by the exception's label) to stdout (the standard output). The most useful part, the stack trace, is lost entirely.

You should make sure the stack trace is not lost; at the very least, calling e.printStackTrace() sends the stack to stderr (the error output).

The best way, though, is to include a logger utility. A basic one comes with the JDK:
import java.util.logging.Level;
import java.util.logging.Logger;

// assumes the current class is called MyLogger
private final static Logger LOGGER = Logger.getLogger(MyLogger.class.getName());

and is used to display both the message and the stack:

LOGGER.log(Level.SEVERE, "My message", theException); // Replaces both System.out.println() and printStackTrace()
if (next == null) {
    System.out.println("invalid next string");
    return;
}

When you run into a null value, you can always throw an IllegalArgumentException, an IllegalStateException, or the like. If you just return, the user will assume everything went smoothly even though there is a big problem with the object's state. Which brings me to my next point.

Failing early is always a good idea. Ideally, a null reference should never be let in if the object cannot deal with it later on. So check arguments as you receive them, and fail fast:
public void start(String entryPoint) throws IllegalArgumentException {
    if (entryPoint == null) {
        throw new IllegalArgumentException("blahblahblah");
    }
    this.queue.add(entryPoint);
    this.tryNext();
}
The same of course goes for addURL and the rest. Don't let anyone mess up your object's state!
The power of Javadoc

If you are new to Java, you may not yet realize how much time a good Javadoc can save you. Just make it happen. Don't write a novel; keep it concise. Good Javadoc is not a book. Some methods (getters, setters) ideally need none at all.

Get familiar with this side of Java! It is as much a part of the language as the for loop.
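For example, on a method like addURL (a sketch, deliberately short):

/**
 * Queues a URL for crawling unless it is already queued or was visited
 * before; for an already-visited URL only the visit counter is increased.
 *
 * @param url the absolute URL to enqueue, must not be {@code null}
 */
public synchronized void addURL(String url) {
    // ...
}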
Do less and less in main

Did you choose the name of this method? No. It is not part of your program, of your crawler, of anything. It is just an arbitrary entry point that should never be used again. Keep it that way: delegate the real work to a method of a designed object, a method with a meaningful name. That adds ease of use and direction.

This is what all your main methods should look like:

public static void main(String[] args) {
    new MyBusinessObject(args).startDoingSomethingUseful();
}
There are a few other methods to clean up, mostly by moving them onto other objects. If you need to call:

c.done(this, this.url);

it probably means the situation should be inverted:

this.done(c);

But that is probably because your Crawler is actually some kind of ThreadManager, and it would certainly disappear. I'll elaborate on this point when/if the code gets revised.
For the last day or two I have been trying hard to learn Java.

Ha! My first attempts at Java did not look anywhere near this good. Well done!
https://codereview.stackexchange.com/questions/139219