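DotnetSpider is a lightweight, flexible, high-performance, cross-platform distributed web crawler framework that helps .NET engineers build crawlers quickly.
A crawler's basic flow is: download data (send an HTTP request and receive the response) -> parse the returned text (which may be plain text, JSON, or HTML) -> store the parsed data. Each of these three main steps can be broken down further into modules.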
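The download -> parse -> store flow described above can be sketched in a few lines of plain C#. This is only an illustration of the three stages, not DotnetSpider code: the `Download`, `Parse`, and `Store` helpers, the canned HTML, and the example URL are all hypothetical, and the "download" returns a fixed string so the sketch runs offline.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Stage 1: "download". A stand-in that returns canned HTML instead of
// sending a real HTTP request (hypothetical helper, not a DotnetSpider API).
string Download(string pageUrl) =>
    "<html><head><title>Example page</title></head><body>hello</body></html>";

// Stage 2: parse the response text into named fields.
Dictionary<string, string> Parse(string pageUrl, string html)
{
    var match = Regex.Match(html, "<title>(.*?)</title>", RegexOptions.IgnoreCase);
    return new Dictionary<string, string>
    {
        ["url"] = pageUrl,
        ["title"] = match.Success ? match.Groups[1].Value : ""
    };
}

// Stage 3: store the parsed record. Here the "storage" is an in-memory list.
var store = new List<Dictionary<string, string>>();
void Store(Dictionary<string, string> record) => store.Add(record);

// Run the three stages in order for a single hypothetical URL.
var startUrl = "https://example.com/";
Store(Parse(startUrl, Download(startUrl)));

foreach (var record in store)
    Console.WriteLine($"{record["url"]} -> {record["title"]}");
```

In a real crawler each stage is swappable: the downloader could be an HTTP client, the parser an XPath or JSON selector, and the storage a database or the console.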
DotnetSpider official site (GitHub repository): https://github.com/dotnetcore/DotnetSpider
Install-Package DotnetSpider
Install-Package Serilog.AspNetCore
Install-Package Serilog.Sinks.Console
Install-Package Serilog.Sinks.File
Install-Package Serilog.Sinks.PeriodicBatching
using ConsoleTest; // project namespace; GithubSpider is assumed to be defined there (not shown)
using DotnetSpider.Scheduler.Component;
using Serilog.Events;
using Serilog;
using DotnetSpider;
using DotnetSpider.Scheduler;
using Microsoft.Extensions.Hosting;

// Configure the thread pool
ThreadPool.SetMaxThreads(255, 255);
ThreadPool.SetMinThreads(255, 255);

// Configure logging: write to the console and to logs/spider.log
Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Information()
    .MinimumLevel.Override("Microsoft.Hosting.Lifetime", LogEventLevel.Warning)
    .MinimumLevel.Override("Microsoft", LogEventLevel.Warning)
    .MinimumLevel.Override("System", LogEventLevel.Warning)
    .MinimumLevel.Override("Microsoft.AspNetCore.Authentication", LogEventLevel.Warning)
    .Enrich.FromLogContext()
    .WriteTo.Console().WriteTo.File("logs/spider.log")
    .CreateLogger();

var builder = Builder.CreateDefaultBuilder<GithubSpider>(options =>
{
    // 1 request per second
    options.Speed = 1;
});
builder.UseSerilog();
builder.UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>();
await builder.Build().RunAsync();
Console.WriteLine("Bye!");
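The scheduler registered above, `UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>`, suggests by its name a breadth-first queue whose duplicates are filtered through a hash set. The following is a minimal sketch of that idea in plain C#, assuming the name means what it says; it is not DotnetSpider's implementation, and `TryEnqueue` and the small link graph are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// A FIFO queue gives breadth-first order; the HashSet remembers every URL ever enqueued.
var pending = new Queue<string>();
var seen = new HashSet<string>();

// Enqueue a URL only if it has not been seen before (hypothetical helper).
bool TryEnqueue(string candidate)
{
    if (!seen.Add(candidate)) return false; // HashSet.Add returns false for duplicates
    pending.Enqueue(candidate);
    return true;
}

// Hypothetical link graph standing in for links discovered while parsing pages.
var links = new Dictionary<string, string[]>
{
    ["/a"] = new[] { "/b", "/c" },
    ["/b"] = new[] { "/c", "/a" },
    ["/c"] = new[] { "/a" },
};

var visited = new List<string>();
TryEnqueue("/a");
while (pending.Count > 0)
{
    var current = pending.Dequeue();
    visited.Add(current);
    foreach (var next in links[current])
        TryEnqueue(next); // repeated links are dropped here
}

Console.WriteLine(string.Join(" ", visited)); // prints "/a /b /c"
```

Even though the pages link back to each other, each URL is visited once, which is what keeps a crawler from looping on cyclic links.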
Log output
[20:51:24 INF]
  _____        _              _    _____       _     _
 |  __ \      | |            | |  / ____|     (_)   | |
 | |  | | ___ | |_ _ __   ___| |_| (___  _ __  _  __| | ___ _ __
 | |  | |/ _ \| __| '_ \ / _ \ __|\___ \| '_ \| |/ _` |/ _ \ '__|
 | |__| | (_) | |_| | | |  __/ |_ ____) | |_) | | (_| |  __/ |
 |_____/ \___/ \__|_| |_|\___|\__|_____/| .__/|_|\__,_|\___|_|     version: 5.0.8.0
                                        | |
                                        |_|
[20:51:24 INF] RequestedQueueCount: 1000
[20:51:24 INF] Depth: 0
[20:51:24 INF] RetriedTimes: 3
[20:51:24 INF] EmptySleepTime: 60
[20:51:24 INF] Speed: 1
[20:51:24 INF] Batch: 4
[20:51:24 INF] RemoveOutboundLinks: False
[20:51:24 INF] StorageType: DotnetSpider.MySql.MySqlEntityStorage, DotnetSpider.MySql
[20:51:24 INF] RefreshProxy: 30
[20:51:24 INF] Agent is starting
[20:51:24 INF] Agent started
[20:51:24 INF] Initialize spider 602e62cc5f337be5627cd768, Github
[20:51:25 INF] 602e62cc5f337be5627cd768 DataFlows: Parser -> ConsoleStorage
[20:51:25 INF] 602e62cc5f337be5627cd768 register topic DotnetSpider_602e62cc5f337be5627cd768
[20:51:25 INF] Statistics service starting
[20:51:25 INF] Statistics service started
[20:51:30 INF] 602e62cc5f337be5627cd768 total 1, speed: 0, success 0, failure 0, left 1
[20:51:31 INF] 602e62cc5f337be5627cd768 download https://github.com/zlzforever, l7nvHQ== completed
DATA: {"username":"zlzforever","author":"Lewis Zou"}
[20:51:35 INF] 602e62cc5f337be5627cd768 total 1, speed: 0.10, success 1, failure 0, left 0
[20:51:40 INF] 602e62cc5f337be5627cd768 total 1, speed: 0.07, success 1, failure 0, left 0
[20:52:28 INF] Statistics service stopping
[20:52:28 INF] Statistics service stopped
[20:52:28 INF] 602e62cc5f337be5627cd768 stopped
[20:52:28 INF] Agent is stopping
[20:52:28 INF] Agent stopped
Bye!