Q: Nutch segments folder grows every day

Stack Overflow user

Asked 2013-06-21 23:19:53

1 answer · 2.1K views · 0 followers · 2 votes

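I have configured Nutch/Solr 1.6 to crawl and index, every 12 hours, an intranet containing about 4000 documents and HTML pages.

If I run the crawler against an empty database, the process takes about 30 minutes. After the crawl has been running for several days, it becomes very slow. According to the log files, tonight the final step (SolrIndexer) started after 1 hour 20 minutes and took more than 1 hour.

Because the number of indexed documents has not grown, I wonder why it is so slow now.

Nutch is executed with the following command: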

bin/nutch crawl -urlDir urls -solr http://localhost:8983/solr -dir nutchdb -depth 15 -topN 3000

nutch-site.xml contains:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>http.agent.name</name>
        <value>Internet Site Agent</value>
    </property>
    <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata|more|http-header)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
    <!-- Used only if plugin parse-metatags is enabled. -->
    <property>
        <name>metatags.names</name>
        <value>description;keywords;published;modified</value>
        <description> Names of the metatags to extract, separated by;.
            Use '*' to extract all metatags. Prefixes the names with 'metatag.'
            in the parse-metadata. For instance to index description and keywords,
            you need to activate the plugin index-metadata and set the value of the
            parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
        </description>
    </property>
    <property>
        <name>index.parse.md</name>
        <value>metatag.description,metatag.keywords,metatag.published,metatag.modified</value>
        <description> Comma-separated list of keys to be taken from the parse metadata to generate fields.
            Can be used e.g. for 'description' or 'keywords' provided that these values are generated
            by a parser (see parse-metatags plugin)
        </description>
    </property>       
    <property>
        <name>db.ignore.external.links</name>
        <value>true</value>
        <description>Set this to false if you start crawling your website from
            for example http://www.example.com but you would like to crawl
            xyz.example.com. Set it to true otherwise if you want to exclude external links
        </description>
    </property>
    <property>
        <name>http.content.limit</name>
        <value>10000000</value>
        <description>The length limit for downloaded content using the http
            protocol, in bytes. If this value is nonnegative (>=0), content longer
            than it will be truncated; otherwise, no truncation at all. Do not
            confuse this setting with the file.content.limit setting.
        </description>
    </property> 

    <property>
        <name>fetcher.max.crawl.delay</name>
        <value>1</value>
        <description>
            If the Crawl-Delay in robots.txt is set to greater than this value (in
            seconds) then the fetcher will skip this page, generating an error report.
            If set to -1 the fetcher will never skip such pages and will wait the
            amount of time retrieved from robots.txt Crawl-Delay, however long that
            might be.
        </description>
    </property>

    <property>
        <name>fetcher.threads.fetch</name>
        <value>10</value>
        <description>The number of FetcherThreads the fetcher should use.
        This is also determines the maximum number of requests that are
        made at once (each FetcherThread handles one connection). The total
        number of threads running in distributed mode will be the number of
        fetcher threads * number of nodes as fetcher has one map task per node.
        </description>
    </property>

    <property>
        <name>fetcher.threads.fetch</name>
        <value>10</value>
        <description>The number of FetcherThreads the fetcher should use.
            This is also determines the maximum number of requests that are
            made at once (each FetcherThread handles one connection). The total
            number of threads running in distributed mode will be the number of
            fetcher threads * number of nodes as fetcher has one map task per node.
        </description>
    </property>

    <property>
        <name>fetcher.server.delay</name>
        <value>1.0</value>
        <description>The number of seconds the fetcher will delay between
            successive requests to the same server.</description>
    </property>

    <property>
        <name>http.redirect.max</name>
        <value>0</value>
        <description>The maximum number of redirects the fetcher will follow when
            trying to fetch a page. If set to negative or 0, fetcher won't immediately
            follow redirected URLs, instead it will record them for later fetching.
        </description>
    </property>

    <property>
        <name>fetcher.threads.per.queue</name>
        <value>2</value>
        <description>This number is the maximum number of threads that
           should be allowed to access a queue at one time. Replaces
           deprecated parameter 'fetcher.threads.per.host'.
        </description>
    </property>

    <property>
        <name>link.delete.gone</name>
        <value>true</value>
        <description>Whether to delete gone pages from the web graph.</description>
   </property>

   <property>
       <name>link.loops.depth</name>
       <value>20</value>
       <description>The depth for the loops algorithm.</description>
   </property>

<!-- moreindexingfilter plugin properties -->

    <property>
      <name>moreIndexingFilter.indexMimeTypeParts</name>
      <value>false</value>
      <description>Determines whether the index-more plugin will split the mime-type
      in sub parts, this requires the type field to be multi valued. Set to true for backward
      compatibility. False will not split the mime-type.
      </description>
    </property>

    <property>
      <name>moreIndexingFilter.mapMimeTypes</name>
      <value>false</value>
      <description>Determines whether MIME-type mapping is enabled. It takes a
      plain text file with mapped MIME-types. With it the user can map both
      application/xhtml+xml and text/html to the same target MIME-type so it
      can be treated equally in an index. See conf/contenttype-mapping.txt.
      </description>
    </property>

    <!-- Fetch Schedule Configuration --> 
    <property>
      <name>db.fetch.interval.default</name>
              <!-- for now always re-fetch everything -->
      <value>10</value>
      <description>The default number of seconds between re-fetches of a page (less than 1 day).
      </description>
    </property>

    <property>
      <name>db.fetch.interval.max</name>
              <!-- for now always re-fetch everything -->
      <value>10</value>
      <description>The maximum number of seconds between re-fetches of a page
      (less than one day). After this period every page in the db will be re-tried, no
       matter what is its status.
      </description>
    </property>

    <!--property>
      <name>db.fetch.schedule.class</name>
      <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
      <description>The implementation of fetch schedule. DefaultFetchSchedule simply
      adds the original fetchInterval to the last fetch time, regardless of
      page changes.</description>
    </property>

    <property>
      <name>db.fetch.schedule.adaptive.inc_rate</name>
      <value>0.4</value>
      <description>If a page is unmodified, its fetchInterval will be
      increased by this rate. This value should not
      exceed 0.5, otherwise the algorithm becomes unstable.</description>
    </property>

    <property>
      <name>db.fetch.schedule.adaptive.dec_rate</name>
      <value>0.2</value>
      <description>If a page is modified, its fetchInterval will be
      decreased by this rate. This value should not
      exceed 0.5, otherwise the algorithm becomes unstable.</description>
    </property>

    <property>
      <name>db.fetch.schedule.adaptive.min_interval</name>
      <value>60.0</value>
      <description>Minimum fetchInterval, in seconds.</description>
    </property>

    <property>
      <name>db.fetch.schedule.adaptive.max_interval</name>
      <value>31536000.0</value>
      <description>Maximum fetchInterval, in seconds (365 days).
      NOTE: this is limited by db.fetch.interval.max. Pages with
      fetchInterval larger than db.fetch.interval.max
      will be fetched anyway.</description>
    </property>

    <property>
      <name>db.fetch.schedule.adaptive.sync_delta</name>
      <value>true</value>
      <description>If true, try to synchronize with the time of page change.
      by shifting the next fetchTime by a fraction (sync_rate) of the difference
      between the last modification time, and the last fetch time.</description>
    </property>

    <property>
      <name>db.fetch.schedule.adaptive.sync_delta_rate</name>
      <value>0.3</value>
      <description>See sync_delta for description. This value should not
      exceed 0.5, otherwise the algorithm becomes unstable.</description>
    </property>

    <property>
      <name>db.fetch.schedule.adaptive.sync_delta_rate</name>
      <value>0.3</value>
      <description>See sync_delta for description. This value should not
      exceed 0.5, otherwise the algorithm becomes unstable.</description>
    </property-->

    <property>
      <name>fetcher.threads.fetch</name>
      <value>1</value>
      <description>The number of FetcherThreads the fetcher should use.
         This is also determines the maximum number of requests that are
         made at once (each FetcherThread handles one connection). The total
         number of threads running in distributed mode will be the number of
         fetcher threads * number of nodes as fetcher has one map task per node.
      </description>
    </property>

    <property>
       <name>hadoop.tmp.dir</name>
       <value>/opt/apache-nutch/tmp/</value>
    </property>

    <!-- Boilerpipe -->
    <property>
      <name>tika.boilerpipe</name>
      <value>true</value>
    </property>
    <property>
      <name>tika.boilerpipe.extractor</name>
      <value>ArticleExtractor</value>
    </property>
</configuration>

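As you can see, I have configured Nutch to always re-fetch all documents. Because the site is small, re-fetching everything should be fine for now (the first run takes only 30 minutes...).

I noticed that about 40 new segments are created in the crawldb/segments folder every day. Naturally, the database's disk usage grows very quickly.

Is this the expected behavior? Is something wrong with the configuration?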

performance

solr

nutch

segments

1 Answer

Stack Overflow user

Accepted answer

Posted 2013-06-25 19:37:45

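Segments older than db.default.fetch.interval must be deleted from nutchdb. This interval defines when a page should be re-fetched (the nutch-site.xml in the question sets it under the name db.fetch.interval.default).

Once a page has been re-fetched, the old segments can be deleted.

If the segments are not deleted, the solrindexer step has to read too many segments and becomes very slow (in my case, an hour instead of 4 minutes).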
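
The cleanup described above could be scripted roughly as follows. This is only a sketch, not part of Nutch: it assumes the segments live in a directory such as nutchdb/segments and that each segment directory is named after its creation time in the form yyyyMMddHHmmss. The function names and the age threshold are hypothetical, and the threshold should be chosen so that every page in an old segment has already been re-fetched into a newer one.

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path

# Assumption: segment directories are named by creation time, e.g. 20130621231953.
SEGMENT_NAME_FORMAT = "%Y%m%d%H%M%S"


def find_old_segments(segments_dir, max_age, now=None):
    """Return segment directories whose timestamp name is older than max_age."""
    now = now or datetime.now()
    old = []
    for entry in sorted(Path(segments_dir).iterdir()):
        # Skip files and anything that does not look like a 14-digit timestamp.
        if not entry.is_dir() or len(entry.name) != 14 or not entry.name.isdigit():
            continue
        try:
            created = datetime.strptime(entry.name, SEGMENT_NAME_FORMAT)
        except ValueError:
            continue  # 14 digits, but not a valid date
        if now - created > max_age:
            old.append(entry)
    return old


def delete_segments(segments, dry_run=True):
    """Delete the given segment directories; with dry_run, only print them."""
    for seg in segments:
        if dry_run:
            print(f"would delete {seg}")
        else:
            shutil.rmtree(seg)
```

For example, `delete_segments(find_old_segments("nutchdb/segments", timedelta(days=1)))` would only print the candidates; passing `dry_run=False` removes them. The one-day threshold here is an illustrative choice for a crawl that re-fetches everything every 12 hours, not a value taken from the answer.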

Votes: 3

Original page content provided by Stack Overflow; translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/17238813
