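I have a fairly common task: there are a few thousand websites, and I need to parse as many of them as possible (in a reasonable manner, of course).
First, I set up a StormCrawler-like configuration using the JSoup parser. Throughput was very good and very stable, at about 8k fetches per minute.
Then I wanted to add the ability to parse PDF/doc/etc., so I added the Tika parser to handle non-HTML documents. But then I saw metrics like these:
So sometimes there are good minutes, and sometimes the rate drops to a few hundred fetches per minute. When I remove the Tika stream entries, everything returns to normal. So the general question is how to find the cause of this behavior, the bottleneck. Maybe I missed some setting?
Here is what I see for the crawler topology in Storm: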
Injector:
name: "injector"
includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-custom-conf.yaml"
    override: true
  - resource: false
    file: "es-conf.yaml"
    override: true
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "feeds.txt"
      - true
bolts:
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
streams:
  - from: "spout"
    to: "status"
    grouping:
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byHost"
      streamId: "status"
ES-crawler.flux:
name: "crawler"
includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-custom-conf.yaml"
    override: true
  - resource: false
    file: "es-conf.yaml"
    override: true
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10
bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 5
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 4
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 1
  - id: "redirection_bolt"
    className: "com.digitalpebble.stormcrawler.tika.RedirectionBolt"
    parallelism: 1
  - id: "parser_bolt"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1
streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE
  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE
  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]
  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE
  # This is not needed as long as redirect_bolt is sending html content to index?
  # - from: "parse"
  #   to: "index"
  #   grouping:
  #     type: LOCAL_OR_SHUFFLE
  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "parse"
    to: "redirection_bolt"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "redirection_bolt"
    to: "parser_bolt"
    grouping:
      type: LOCAL_OR_SHUFFLE
      streamId: "tika"
  - from: "redirection_bolt"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "parser_bolt"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
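Update: I found that I am getting out-of-memory errors in workers.log. Even though I have set workers.heap.size to 4 GB, the worker process grows to 10-15 GB.
Update 2: After limiting memory usage I no longer see OutOfMemory errors, but performance is very low.
Without Tika I see 15k fetches per minute. With Tika, after a few high bars, it is only a few hundred per minute.
I see this in the worker log: https://paste.ubuntu.com/p/WKBTBf8HMV/
CPU usage is very high, but there is nothing in the logs.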
Answered 2019-03-09 11:01:25
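As you can see in the stats on the UI, the Tika parser bolt is the bottleneck: it has a capacity of 1.6 (a value > 1 means that it cannot process its input fast enough). If you give it the same parallelism as the JSoup parser, i.e. 4 or more, this should improve.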
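As a minimal sketch of that suggestion, the Tika bolt entry in the crawler Flux file from the question could be changed as follows. The value 4 comes from the answer; the right number depends on the available cores and memory, so treat it as a starting point to tune against the capacity figure in the Storm UI.

```yaml
bolts:
  # ... other bolts unchanged ...
  - id: "parser_bolt"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    # Was 1 in the question. 4 follows the "4 or more" suggestion and is
    # an assumption to tune, not a recommended constant.
    parallelism: 4
```

After resubmitting the topology, the capacity of parser_bolt should drop below 1 if parallelism was the limiting factor.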
Answered 2019-04-30 15:02:33
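Late reply, but it might be useful to others.
What happens with Tika in an open crawl is that the Tika parser gets everything the JSoupParser bolt did not handle: archives, images, videos, etc. These URLs are usually heavy and slow to process, and the incoming tuples quickly back up in the internal queues until memory explodes.
I have just committed "Set mimetype whitelist for Tika Parser #712", which lets you define a set of regular expressions that are tried against the document's content type. If there is a match, the document is processed; if not, the tuple is sent to the status stream as an error.
You can configure the whitelist like this: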
parser.mimetype.whitelist:
  - application/.+word.*
  - application/.+excel.*
  - application/.+powerpoint.*
  - application/.*pdf.*
This should make your topology faster and more stable. Let me know how it goes.
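One place this setting could live is the crawler-custom-conf.yaml that both Flux files already include. The sketch below assumes that file uses a top-level `config:` map, as StormCrawler configuration files usually do. The comments list some well-known MIME types to show what the patterns let through.

```yaml
config:
  # Only content types matching one of these regular expressions are
  # parsed by Tika. Anything else goes to the status stream as an error.
  parser.mimetype.whitelist:
    - application/.+word.*        # e.g. application/msword
    - application/.+excel.*       # e.g. application/vnd.ms-excel
    - application/.+powerpoint.*  # e.g. application/vnd.ms-powerpoint
    - application/.*pdf.*         # e.g. application/pdf
  # Not matched by any pattern above: image/png, video/mp4, application/zip
```

Note that these patterns key on substrings such as "excel". A type like application/vnd.openxmlformats-officedocument.spreadsheetml.sheet contains none of them, so it would need its own pattern if such files should be parsed.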
https://stackoverflow.com/questions/55071077