Yugong Series 2023: Configuration-Based Crawlers with DotnetSpider

2026-03-30 15:24
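Preface

1. DotnetSpider overview

DotnetSpider is a lightweight, flexible, high-performance, cross-platform distributed web crawler framework that helps .NET engineers build crawlers quickly.

2. DotnetSpider modules

The basic flow of a crawler is: download data (send an HTTP request and receive the response) -> parse the returned text (which may be plain text, JSON, or HTML) -> store the parsed data. These three main steps can be broken down further into the following modules.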

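  • Scheduler: deduplicates crawl requests and controls the crawl order. Breadth-first and depth-first schedulers are implemented by default. A scheduler can use different hash-based duplicate removers; the default HashSetDuplicateRemover is usually enough, and BloomFilterDuplicateRemover can be used when the crawl volume is very large. To schedule a massive number of requests, or to resume a crawl after a restart, you need to implement your own scheduler backed by a database (a relational database, Redis, etc.).
  • Download agent: download agents can be deployed on different machines; in a single-machine crawler, each crawler instance starts its own download agent. The agent receives the requests to be downloaded and hands them to the matching downloader (HttpClient, Puppeteer, or a custom downloader).
  • Download agent registry service: this service only receives registrations and heartbeats from download agents. The crawler still works if the service is not enabled. A single-machine crawler enables an in-memory registry service by default.
  • Statistics service: collects the running state of every crawler and download agent, such as a crawler's total and successful request counts, or an agent's total successful requests and total elapsed time.
  • Request supplier interface: in many scenarios the download requests are known in advance or stored somewhere (a file or a database); this interface lets the crawler load such requests.
  • Request configuration (Spider.ConfigureRequest): requests are normally built automatically, but special cases such as adding a sign can be handled here in one place.
  • DataFlow: there are two kinds of data flow, parsers and storages. In the simplest case, if you do not want that much structure, you can implement both parsing and storage yourself in a single DataFlow. A crawler can have several DataFlows; they run in the order in which they were added, and an exception thrown in any DataFlow interrupts the whole processing chain.
  • Proxy pool: each crawler instance starts a proxy background service that periodically fetches new proxies from the registered IProxySupplier. A new proxy enters the pool only after it passes a check. The test address, ProxyTestUri, can be set in the configuration file or when creating the Builder.
  • Concurrency controller: fetches requests from the Scheduler at a fixed rate and pushes them to the message queue. These requests are cached in RequestedQueue, which is implemented with a low-overhead HashedWheelTimer. If no message comes back from the download agent within a certain time, the request is treated as timed out and retried until the retry limit is exceeded.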
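
To make the Scheduler's two jobs concrete (deduplication and ordering), here is a minimal sketch in plain C#. It is not DotnetSpider's implementation: the class name SimpleBfsScheduler and its methods are hypothetical, and it only assumes that breadth-first order can be modelled with a FIFO queue and that deduplication can be modelled with a hash set, in the spirit of HashSetDuplicateRemover.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; DotnetSpider's real schedulers work on Request objects.
public sealed class SimpleBfsScheduler
{
    // FIFO order: URLs discovered first are crawled first (breadth-first).
    private readonly Queue<string> _queue = new Queue<string>();

    // Every URL ever accepted, so the same URL is never queued twice.
    private readonly HashSet<string> _seen = new HashSet<string>(StringComparer.Ordinal);

    // Number of URLs still waiting to be crawled.
    public int Count => _queue.Count;

    // Returns true when the URL is new and was queued, false when it is a duplicate.
    public bool Enqueue(string url)
    {
        if (!_seen.Add(url))
        {
            return false;
        }

        _queue.Enqueue(url);
        return true;
    }

    // Hands out the next URL in breadth-first order; returns false when nothing is left.
    public bool TryDequeue(out string url)
    {
        if (_queue.Count == 0)
        {
            url = string.Empty;
            return false;
        }

        url = _queue.Dequeue();
        return true;
    }
}
```

Calling Enqueue twice with the same URL returns true and then false, so only one copy is ever crawled. A depth-first variant would swap the queue for a stack, and a Bloom filter would trade a small false-positive rate for much lower memory use on very large crawls.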

DotnetSpider official site: github.com/dotnetcore/DotnetSpider

I. Configuration-based crawlers in the DotnetSpider framework

1. Install the packages

    Install-Package DotnetSpider
    Install-Package Serilog.AspNetCore
    Install-Package Serilog.Sinks.Console
    Install-Package Serilog.Sinks.File
    Install-Package Serilog.Sinks.PeriodicBatching
    Install-Package DotnetSpider.MySql

2. Create the EntitySpider class

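1. Below is a complete example of a configuration-based crawler. It crawls the news list pages of 博客园 (Cnblogs); the code is as follows.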

    using System;
    using System.Collections.Generic;
    using System.ComponentModel.DataAnnotations;
    using System.Threading;
    using System.Threading.Tasks;
    using DotnetSpider;
    using DotnetSpider.DataFlow.Parser;
    using DotnetSpider.DataFlow.Parser.Formatters;
    using DotnetSpider.DataFlow.Storage;
    using DotnetSpider.Downloader;
    using DotnetSpider.Http;
    using DotnetSpider.Infrastructure;
    using DotnetSpider.MySql.Scheduler;
    using DotnetSpider.Scheduler;
    using DotnetSpider.Scheduler.Component;
    using DotnetSpider.Selector;
    using Microsoft.Extensions.Hosting;
    using Microsoft.Extensions.Logging;
    using Microsoft.Extensions.Options;
    using Serilog;

    // Matches the "using ConsoleTest;" directive in Program.cs.
    namespace ConsoleTest;

    public class EntitySpider : Spider
    {
        // Runs the crawler with the in-memory breadth-first scheduler.
        public static async Task RunAsync()
        {
            var builder = Builder.CreateDefaultBuilder<EntitySpider>(options =>
            {
                options.Speed = 1;
            });
            // Use the HttpClient-based downloader.
            builder.UseDownloader<HttpClientDownloader>();
            // Use Serilog for logging.
            builder.UseSerilog();
            // Ignore server certificate errors.
            builder.IgnoreServerCertificateError();
            builder.UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>();
            await builder.Build().RunAsync();
        }

        // Runs the crawler with the MySQL-backed breadth-first scheduler.
        public static async Task RunMySqlQueueAsync()
        {
            var builder = Builder.CreateDefaultBuilder<EntitySpider>(options =>
            {
                options.Speed = 1;
            });
            builder.UseDownloader<HttpClientDownloader>();
            builder.UseSerilog();
            builder.IgnoreServerCertificateError();
            builder.UseMySqlQueueBfsScheduler(x =>
            {
                x.ConnectionString = builder.Configuration["SchedulerConnectionString"];
            });
            await builder.Build().RunAsync();
        }

        public EntitySpider(IOptions<SpiderOptions> options, DependenceServices services,
            ILogger<Spider> logger) : base(options, services, logger)
        {
        }

        protected override async Task InitializeAsync(CancellationToken stoppingToken = default)
        {
            // Add the parser for the CnblogsEntry entity.
            AddDataFlow(new DataParser<CnblogsEntry>());
            // Use the default storage (configured through StorageType).
            AddDataFlow(GetDefaultStorage());
            // Add the start request. The scheme is restored here because the URL must be absolute.
            // The property { "网站" (website): "博客园" (Cnblogs) } becomes an environment value.
            await AddRequestsAsync(
                new Request(
                    "https://news.cnblogs.com/n/page/1",
                    new Dictionary<string, object> { { "网站", "博客园" } }));
        }

        // Generates the spider ID and name.
        protected override SpiderId GenerateSpiderId()
        {
            return new(ObjectId.CreateId().ToString(), "博客园");
        }

        // Schema: database name, table name.
        [Schema("cnblogs", "news")]
        // Page objects: the XPath selector finds every block matching .//div[@class='news_block'];
        // each block is one data object, that is, one record.
        [EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]
        // If the XPath .//a[@class='current'] yields v, it is saved as { key: 类别 (category), value: v };
        // entity properties can then read it through an Environment selector.
        [GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]
        [GlobalValueSelector(Expression = "//title", Name = "Title", Type = SelectorType.XPath)]
        // By default an XPath selector: every link inside the elements matched by //div[@class='pager']
        // is offered to the Scheduler. Other selector types can be used as well.
        [FollowRequestSelector(Expressions = new[] { "//div[@class='pager']" })]
        public class CnblogsEntry : EntityBase<CnblogsEntry>
        {
            protected override void Configure()
            {
                HasIndex(x => x.Title);
                // The second argument (true) marks the index as unique.
                HasIndex(x => new { x.WebSite, x.Guid }, true);
            }

            public int Id { get; set; }

            [Required]
            [StringLength(200)]
            [ValueSelector(Expression = "类别", Type = SelectorType.Environment)]
            public string Category { get; set; }

            [Required]
            [StringLength(200)]
            [ValueSelector(Expression = "网站", Type = SelectorType.Environment)]
            public string WebSite { get; set; }

            [StringLength(200)]
            [ValueSelector(Expression = "Title", Type = SelectorType.Environment)]
            [ReplaceFormatter(NewValue = "", OldValue = " - 博客园")]
            public string Title { get; set; }

            [StringLength(40)]
            [ValueSelector(Expression = "GUID", Type = SelectorType.Environment)]
            public string Guid { get; set; }

            [ValueSelector(Expression = ".//h2[@class='news_entry']/a")]
            public string News { get; set; }

            [ValueSelector(Expression = ".//h2[@class='news_entry']/a/@href")]
            public string Url { get; set; }

            [ValueSelector(Expression = ".//div[@class='entry_summary']")]
            [TrimFormatter]
            public string PlainText { get; set; }

            [ValueSelector(Expression = "DATETIME", Type = SelectorType.Environment)]
            public DateTime CreationTime { get; set; }
        }
    }

The configuration file is as follows:

    {
      "StorageType": "DotnetSpider.MySql.MySqlEntityStorage, DotnetSpider.MySql",
      "MySql": {
        "ConnectionString": "Database='mysql';Data Source=localhost;password=123456;User ID=root;Port=3306;",
        "Mode": "InsertIgnoreDuplicate"
      },
      "SchedulerConnectionString": "Database='scheduler';Data Source=localhost;password=123456;User ID=root;Port=3306;"
    }

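Mode sets how the data storage writes records:

  • Insert: inserts directly. A duplicate index may raise an exception that stops the crawler. Supported by all databases.
  • InsertIgnoreDuplicate: inserts the record if it violates no uniqueness constraint and ignores it if it is a duplicate. Not every database supports this mode.
  • InsertAndUpdate: inserts the record if it does not exist and updates it if it does.
  • Update: updates only.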
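
One way to picture the four modes is as four shapes of MySQL statement. The sketch below builds such statements as strings. It is an assumption-laden illustration: the SketchMode enum and the SketchSql.Build helper are hypothetical names, and the SQL that DotnetSpider.MySql actually generates may differ. INSERT IGNORE and ON DUPLICATE KEY UPDATE are standard MySQL syntax.

```csharp
using System;
using System.Linq;

// Hypothetical enum that mirrors the four mode names from the configuration file.
public enum SketchMode { Insert, InsertIgnoreDuplicate, InsertAndUpdate, Update }

public static class SketchSql
{
    // Builds a parameterised MySQL statement for the given mode.
    // keyColumn is the column assumed to carry the unique constraint.
    public static string Build(SketchMode mode, string table, string keyColumn, params string[] columns)
    {
        var cols = string.Join(", ", columns);
        var args = string.Join(", ", columns.Select(c => "@" + c));
        // Every column except the key is assigned in the update clauses.
        var sets = string.Join(", ", columns.Where(c => c != keyColumn).Select(c => c + " = @" + c));

        switch (mode)
        {
            case SketchMode.Insert:
                // Fails with an error when a unique index is violated.
                return $"INSERT INTO {table} ({cols}) VALUES ({args})";
            case SketchMode.InsertIgnoreDuplicate:
                // Silently skips rows that would violate a unique index.
                return $"INSERT IGNORE INTO {table} ({cols}) VALUES ({args})";
            case SketchMode.InsertAndUpdate:
                // Inserts new rows and updates existing ones.
                return $"INSERT INTO {table} ({cols}) VALUES ({args}) ON DUPLICATE KEY UPDATE {sets}";
            case SketchMode.Update:
                // Touches existing rows only.
                return $"UPDATE {table} SET {sets} WHERE {keyColumn} = @{keyColumn}";
            default:
                throw new ArgumentOutOfRangeException(nameof(mode));
        }
    }
}
```

For example, SketchSql.Build(SketchMode.InsertIgnoreDuplicate, "news", "Guid", "Guid", "Title") yields an INSERT IGNORE statement, which is why a second crawl of the same pages under this mode does not stop on the unique index.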

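2. Parameter description:

  • Data entity: an entity must inherit from EntityBase<>. Only entities that inherit from EntityBase<> can be handled by the framework's default parser, DataParser<>, and by the entity storage.
  • Schema: defines which database and which table the entity is stored in. Table-name suffixes are supported: weekly, monthly, and daily.
  • EntitySelector: defines how data objects are extracted from the text. If this attribute is not set, the data object is page-level: one page produces exactly one data object, that is, one record.
  • GlobalValueSelector: stores values queried from the text in the environment data, where the entity's properties can read them. Several can be configured.
  • FollowRequestSelector: defines how links are extracted from the current text and added to the Scheduler. You can give an XPath that selects the elements to take links from, and you can also configure a pattern that a request must match. Links that do not match are ignored completely; even links added to the Scheduler in the crawler's InitializeAsync are subject to the pattern.
  • ValueSelector: the supported selector types are XPath, Regex, Css, JsonPath, and Environment. Environment means an environment value, which comes from one of these sources:
    • the Properties set when the Request is constructed
    • every value found by a GlobalValueSelector
    • certain system-defined values:
      • ENTITY_INDEX: the position of the current entity among all entities found in the current text
      • GUID: a random GUID
      • DATE: the current day as a string formatted "yyyy-MM-dd"
      • TODAY: the current day as a string formatted "yyyy-MM-dd"
      • DATETIME: the current time as a string formatted "yyyy-MM-dd HH:mm:ss"
      • NOW: the current time as a string formatted "yyyy-MM-dd HH:mm:ss"
      • MONTH: the first day of the current month as a string formatted "yyyy-MM-dd"
      • MONDAY: the Monday of the current week as a string formatted "yyyy-MM-dd"
      • SPIDER_ID: the ID of the current crawler
      • REQUEST_HASH: the hash of the request that the current entity belongs to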
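
The date-related system values can be worked through with the standard library alone. The following sketch computes DATE, TODAY, DATETIME, NOW, MONTH, and MONDAY for a given moment using the formats listed above. The class EnvironmentValueSketch is a hypothetical name, and treating Monday as the first day of the week (so Sunday belongs to the week that started six days earlier) is an assumption; the framework's own calculation is not shown in this article.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper; it only reproduces the formats described in the list above.
public static class EnvironmentValueSketch
{
    public static Dictionary<string, string> DateValues(DateTime now)
    {
        var inv = CultureInfo.InvariantCulture;
        var today = now.Date;

        // Days elapsed since Monday: Monday -> 0, Tuesday -> 1, ..., Sunday -> 6.
        int sinceMonday = ((int)today.DayOfWeek + 6) % 7;
        var monday = today.AddDays(-sinceMonday);
        var firstOfMonth = new DateTime(today.Year, today.Month, 1);

        return new Dictionary<string, string>
        {
            ["DATE"] = today.ToString("yyyy-MM-dd", inv),
            ["TODAY"] = today.ToString("yyyy-MM-dd", inv),
            ["DATETIME"] = now.ToString("yyyy-MM-dd HH:mm:ss", inv),
            ["NOW"] = now.ToString("yyyy-MM-dd HH:mm:ss", inv),
            ["MONTH"] = firstOfMonth.ToString("yyyy-MM-dd", inv),
            ["MONDAY"] = monday.ToString("yyyy-MM-dd", inv),
        };
    }
}
```

For a crawl running at 10:25:00 on Saturday 2023-01-07, this gives DATETIME = "2023-01-07 10:25:00", MONTH = "2023-01-01", and MONDAY = "2023-01-02". In the entity above, CreationTime is filled from DATETIME in exactly this string form before being stored.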

3. Program class

    using ConsoleTest;
    using Serilog;
    using Serilog.Events;

    // Raise the thread pool limits before the crawler starts.
    ThreadPool.SetMaxThreads(255, 255);
    ThreadPool.SetMinThreads(255, 255);

    // Log at Information level, quieten framework namespaces,
    // and write to both the console and a file.
    Log.Logger = new LoggerConfiguration()
        .MinimumLevel.Information()
        .MinimumLevel.Override("Microsoft.Hosting.Lifetime", LogEventLevel.Warning)
        .MinimumLevel.Override("Microsoft", LogEventLevel.Warning)
        .MinimumLevel.Override("System", LogEventLevel.Warning)
        .MinimumLevel.Override("Microsoft.AspNetCore.Authentication", LogEventLevel.Warning)
        .Enrich.FromLogContext()
        .WriteTo.Console()
        .WriteTo.File("logs/spider.log")
        .CreateLogger();

    // Run the crawler with the MySQL-backed scheduler.
    await EntitySpider.RunMySqlQueueAsync();

    Console.WriteLine("Bye!");

4. Run

    [10:25:00 INF] DotnetSpider (ASCII-art banner) version: 5.1.0.0
    [10:25:00 INF] RequestedQueueCount: 1000
    [10:25:00 INF] Depth: 0
    [10:25:00 INF] RetriedTimes: 3
    [10:25:00 INF] EmptySleepTime: 60
    [10:25:00 INF] Speed: 1
    [10:25:00 INF] Batch: 4
    [10:25:00 INF] RemoveOutboundLinks: False
    [10:25:00 INF] StorageType: DotnetSpider.MySql.MySqlEntityStorage, DotnetSpider.MySql
    [10:25:00 INF] RefreshProxy: 30
    [10:25:00 INF] Agent is starting
    [10:25:00 INF] Agent started
    [10:25:00 INF] Initialize spider 63b8d7fcc541adfc0b6171ba, 博客园
    [10:25:00 INF] Statistics service starting
    [10:25:00 INF] Statistics service started
    [10:25:01 INF] 63b8d7fcc541adfc0b6171ba DataFlows: DataParser`1 -> MySqlEntityStorage
    [10:25:01 INF] 63b8d7fcc541adfc0b6171ba register topic DotnetSpider_63b8d7fcc541adfc0b6171ba
    [10:25:02 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/1, SOwgJg completed
    [10:25:02 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/, aQJPCw completed
    [10:25:03 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/2/, NODtCA completed
    [10:25:04 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/3/, N7ixeQ completed
    [10:25:05 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/4/, iljmTA completed
    [10:25:06 INF] 63b8d7fcc541adfc0b6171ba total 11, speed: 0.92, success 5, failure 0, left 6
    [10:25:06 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/5/, ipkqCA completed
    [10:25:07 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/6/, UiIVjQ completed
    [10:25:08 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/7/, ju9xLA completed
    [10:25:09 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/8/, Wt3OPA completed
    [10:25:10 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/9/, a96d8w completed
    [10:25:11 INF] 63b8d7fcc541adfc0b6171ba total 15, speed: 0.96, success 10, failure 0, left 5
    [10:25:11 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/100/, oGrO3A completed
    [10:25:12 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/10/, /YND/g completed
    [10:25:13 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/11/, cD0+Ew completed
    [10:25:14 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/12/, OaGhdA completed
    [10:25:15 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/13/, gqmW8A completed
    [10:25:16 INF] 63b8d7fcc541adfc0b6171ba total 23, speed: 0.97, success 15, failure 0, left 8
    [10:25:17 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/99/, 78Y6Jg completed
    [10:25:18 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/96/, GL/kWw completed
    [10:25:19 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/97/, OzipWg completed
    [10:25:20 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/98/, QJFreQ completed
    [10:25:21 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/14/, UOhcHA completed
    [10:25:21 INF] 63b8d7fcc541adfc0b6171ba total 28, speed: 0.98, success 20, failure 0, left 8
    [10:25:22 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/15/, cMsXyw completed
    [10:25:23 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/16/, Q+5Riw completed
    [10:25:24 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/17/, 2DB6bg completed
    [10:25:25 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/95/, vDRh/A completed
    [10:25:26 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/92/, tHXcSw completed
    [10:25:26 INF] 63b8d7fcc541adfc0b6171ba total 35, speed: 0.98, success 25, failure 0, left 10
    [10:25:27 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/93/, WPTb1g completed
    [10:25:28 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/94/, Ap/n4A completed
    [10:25:29 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/18/, lGhQOQ completed
    [10:25:30 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/19/, LU6gew completed
    [10:25:31 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/20/, XtoVSw completed
    [10:25:31 INF] 63b8d7fcc541adfc0b6171ba total 38, speed: 0.98, success 30, failure 0, left 8
    [10:25:32 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/21/, SsTlZw completed
    [10:25:33 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/91/, ojdx1g completed
    [10:25:34 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/88/, Hh4qDw completed
    [10:25:35 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/89/, BAMFzQ completed
    [10:25:36 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/90/, fSM4Tg completed
    [10:25:36 INF] 63b8d7fcc541adfc0b6171ba total 43, speed: 0.99, success 35, failure 0, left 8
    [10:25:37 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/22/, YTBXLw completed
    [10:25:38 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/23/, 4Y6uCA completed
    [10:25:39 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/24/, PapL8Q completed
    [10:25:40 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/25/, ZAf0IQ completed
    [10:25:41 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/87/, 2OrK4w completed
    [10:25:41 INF] 63b8d7fcc541adfc0b6171ba total 48, speed: 0.99, success 40, failure 0, left 8
    [10:25:42 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/84/, XOLWTw completed
    [10:25:43 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/85/, XJ8kdQ completed
    [10:25:44 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/86/, uInV8A completed
    [10:25:45 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/26/, uvEDhw completed
    [10:25:46 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/27/, 8PI77Q completed
    [10:25:46 INF] 63b8d7fcc541adfc0b6171ba total 52, speed: 0.97, success 44, failure 0, left 8
    [10:25:47 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/28/, Nr2Qyw completed
    [10:25:48 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/29/, LxR4WQ completed
    [10:25:49 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/83/, 3Uxp6g completed
    [10:25:50 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/80/, pUhrhA completed
    [10:25:51 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/81/, GkP5OA completed
    [10:25:51 INF] 63b8d7fcc541adfc0b6171ba total 59, speed: 0.97, success 49, failure 0, left 10
    [10:25:52 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/82/, zRverg completed
    [10:25:53 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/30/, jAh2Tg completed
    [10:25:54 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/31/, GFEeuQ completed
    [10:25:55 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/32/, yU66JA completed
    [10:25:56 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/33/, 2CPdcg completed
    [10:25:56 INF] 63b8d7fcc541adfc0b6171ba total 62, speed: 0.97, success 54, failure 0, left 8
    [10:25:57 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/79/, CHLgww completed
    [10:25:58 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/76/, dAXDqg completed
    [10:25:59 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/77/, 1XofyQ completed
    [10:26:00 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/78/, KtOnzg completed
    [10:26:01 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/34/, eeJHkg completed
    [10:26:01 INF] 63b8d7fcc541adfc0b6171ba total 67, speed: 0.98, success 59, failure 0, left 8
    [10:26:02 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/35/, zyMklA completed
    [10:26:03 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/36/, 5vP0xw completed
    [10:26:04 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/37/, kRTM+w completed
    [10:26:05 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/75/, tVP91A completed
    [10:26:06 INF] 63b8d7fcc541adfc0b6171ba total 72, speed: 0.98, success 64, failure 0, left 8
    [10:26:06 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/72/, QrGHwQ completed
    [10:26:07 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/73/, yJmVFQ completed
    [10:26:08 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/74/, t+r57A completed
    [10:26:09 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/38/, urOCpw completed
    [10:26:10 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/39/, c/4VkQ completed
    [10:26:11 INF] 63b8d7fcc541adfc0b6171ba total 77, speed: 0.98, success 69, failure 0, left 8
    [10:26:11 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/40/, KXzkAw completed
    [10:26:12 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/41/, AS4+Bw completed
    [10:26:13 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/71/, Q+lfgQ completed
    [10:26:14 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/68/, tmM+Cg completed
    [10:26:15 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/69/, qR9BHQ completed
    [10:26:16 INF] 63b8d7fcc541adfc0b6171ba total 83, speed: 0.98, success 74, failure 0, left 9
    [10:26:16 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/70/, J4RsOA completed
    [10:26:17 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/42/, 8gxdow completed
    [10:26:18 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/43/, mVidYA completed
    [10:26:19 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/44/, +QsxFQ completed
    [10:26:20 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/45/, LWGcYQ completed
    [10:26:21 INF] 63b8d7fcc541adfc0b6171ba total 87, speed: 0.98, success 79, failure 0, left 8
    [10:26:21 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/67/, 6q7F8w completed
    [10:26:22 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/64/, aERDwA completed
    [10:26:23 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/65/, 4jFTbg completed
    [10:26:24 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/66/, qYSqPw completed
    [10:26:25 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/46/, 9jOk0Q completed
    [10:26:26 INF] 63b8d7fcc541adfc0b6171ba total 92, speed: 0.98, success 84, failure 0, left 8
    [10:26:26 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/47/, o3hArg completed
    [10:26:27 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/48/, FGMOQw completed
    [10:26:28 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/49/, 6sgNQw completed
    [10:26:29 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/63/, LhVcGQ completed
    [10:26:30 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/60/, c0qfSg completed
    [10:26:31 INF] 63b8d7fcc541adfc0b6171ba total 99, speed: 0.98, success 89, failure 0, left 10
    [10:26:31 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/61/, n04eyQ completed
    [10:26:32 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/62/, TWpUBg completed
    [10:26:33 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/50/, IZKwxw completed
    [10:26:34 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/51/, JYvaXQ completed
    [10:26:35 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/52/, Efk6DA completed
    [10:26:36 INF] 63b8d7fcc541adfc0b6171ba total 101, speed: 0.98, success 94, failure 0, left 7
    [10:26:36 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/53/, aCoX6Q completed
    [10:26:37 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/59/, Egjyhw completed
    [10:26:38 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/56/, MEvTLg completed
    [10:26:39 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/57/, h/5ZUg completed
    [10:26:40 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/58/, si+4ag completed
    [10:26:41 INF] 63b8d7fcc541adfc0b6171ba total 101, speed: 0.98, success 99, failure 0, left 2
    [10:26:41 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/54/, VnR7pg completed
    [10:26:42 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/55/, BrGBUg completed
    [10:26:46 INF] 63b8d7fcc541adfc0b6171ba total 101, speed: 0.96, success 101, failure 0, left 0

63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/36/, 5vP0xw completed [10:26:04 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/37/, kRTM+w completed [10:26:05 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/75/, tVP91A completed [10:26:06 INF] 63b8d7fcc541adfc0b6171ba total 72, speed: 0.98, success 64, failure 0, left 8 [10:26:06 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/72/, QrGHwQ completed [10:26:07 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/73/, yJmVFQ completed [10:26:08 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/74/, t+r57A completed [10:26:09 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/38/, urOCpw completed [10:26:10 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/39/, c/4VkQ completed [10:26:11 INF] 63b8d7fcc541adfc0b6171ba total 77, speed: 0.98, success 69, failure 0, left 8 [10:26:11 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/40/, KXzkAw completed [10:26:12 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/41/, AS4+Bw completed [10:26:13 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/71/, Q+lfgQ completed [10:26:14 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/68/, tmM+Cg completed [10:26:15 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/69/, qR9BHQ completed [10:26:16 INF] 63b8d7fcc541adfc0b6171ba total 83, speed: 0.98, success 74, failure 0, left 9 [10:26:16 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/70/, J4RsOA completed [10:26:17 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/42/, 8gxdow completed [10:26:18 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/43/, mVidYA completed [10:26:19 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/44/, +QsxFQ completed [10:26:20 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/45/, LWGcYQ completed [10:26:21 INF] 
63b8d7fcc541adfc0b6171ba total 87, speed: 0.98, success 79, failure 0, left 8 [10:26:21 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/67/, 6q7F8w completed [10:26:22 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/64/, aERDwA completed [10:26:23 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/65/, 4jFTbg completed [10:26:24 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/66/, qYSqPw completed [10:26:25 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/46/, 9jOk0Q completed [10:26:26 INF] 63b8d7fcc541adfc0b6171ba total 92, speed: 0.98, success 84, failure 0, left 8 [10:26:26 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/47/, o3hArg completed [10:26:27 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/48/, FGMOQw completed [10:26:28 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/49/, 6sgNQw completed [10:26:29 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/63/, LhVcGQ completed [10:26:30 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/60/, c0qfSg completed [10:26:31 INF] 63b8d7fcc541adfc0b6171ba total 99, speed: 0.98, success 89, failure 0, left 10 [10:26:31 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/61/, n04eyQ completed [10:26:32 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/62/, TWpUBg completed [10:26:33 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/50/, IZKwxw completed [10:26:34 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/51/, JYvaXQ completed [10:26:35 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/52/, Efk6DA completed [10:26:36 INF] 63b8d7fcc541adfc0b6171ba total 101, speed: 0.98, success 94, failure 0, left 7 [10:26:36 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/53/, aCoX6Q completed [10:26:37 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/59/, Egjyhw completed [10:26:38 INF] 
63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/56/, MEvTLg completed [10:26:39 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/57/, h/5ZUg completed [10:26:40 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/58/, si+4ag completed [10:26:41 INF] 63b8d7fcc541adfc0b6171ba total 101, speed: 0.98, success 99, failure 0, left 2 [10:26:41 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/54/, VnR7pg completed [10:26:42 INF] 63b8d7fcc541adfc0b6171ba download news.cnblogs.com/n/page/55/, BrGBUg completed [10:26:46 INF] 63b8d7fcc541adfc0b6171ba total 101, speed: 0.96, success 101, failure 0, left 0
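The periodic statistics lines in this log follow a fixed shape (`total`, `speed`, `success`, `failure`, `left`), and in every such line of this run `total` equals `success + failure + left`. The final line reports 101 successful requests, which matches the downloads listed: pages 1 through 100 plus the site root. To check a run automatically, a small parser along the following lines may help. This is only a sketch: `SpiderStats` and `SpiderLogParser` are hypothetical names that are not part of DotnetSpider, and the regular expression assumes the exact console format shown in this log (a `[HH:mm:ss INF]` prefix followed by the spider ID), which could differ under another Serilog output template or framework version.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical container for one statistics line; not a DotnetSpider type.
public readonly record struct SpiderStats(
    string SpiderId, int Total, double Speed, int Success, int Failure, int Left);

// Hypothetical helper; not part of DotnetSpider.
public static class SpiderLogParser
{
    // Assumed format, taken from the log in this article:
    // "[10:26:46 INF] <spiderId> total 101, speed: 0.96, success 101, failure 0, left 0"
    private static readonly Regex StatsPattern = new(
        @"^\[\d{2}:\d{2}:\d{2} INF\] (?<id>\S+) total (?<total>\d+), " +
        @"speed: (?<speed>\d+(?:\.\d+)?), success (?<success>\d+), " +
        @"failure (?<failure>\d+), left (?<left>\d+)$",
        RegexOptions.Compiled);

    // Returns true and fills 'stats' when the line is a statistics line;
    // returns false for any other log line (for example a "download ... completed" entry).
    public static bool TryParse(string line, out SpiderStats stats)
    {
        stats = default;
        var match = StatsPattern.Match(line.Trim());
        if (!match.Success)
        {
            return false;
        }

        stats = new SpiderStats(
            match.Groups["id"].Value,
            int.Parse(match.Groups["total"].Value, CultureInfo.InvariantCulture),
            double.Parse(match.Groups["speed"].Value, CultureInfo.InvariantCulture),
            int.Parse(match.Groups["success"].Value, CultureInfo.InvariantCulture),
            int.Parse(match.Groups["failure"].Value, CultureInfo.InvariantCulture),
            int.Parse(match.Groups["left"].Value, CultureInfo.InvariantCulture));
        return true;
    }

    // A run can be treated as finished cleanly when nothing is left and nothing failed.
    public static bool IsCleanFinish(SpiderStats stats) =>
        stats.Left == 0 && stats.Failure == 0 && stats.Success == stats.Total;
}
```

Feeding the last line of the log to `SpiderLogParser.TryParse` and passing the result to `IsCleanFinish` gives a simple pass/fail check for the crawl. Parsing with `CultureInfo.InvariantCulture` keeps the decimal point in `speed` from being misread on machines with a different locale.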