I am crawling a website with Nutch 1.3. While Nutch crawls my site, I see the following exception in the log:
Malformed URL: '', skipping (java.net.MalformedURLException: no protocol:
at java.net.URL.<init>(URL.java:567)
at java.net.URL.<init>(URL.java:464)
at java.net.URL.<init>(URL.java:413)
at org.apache.nutch.crawl.Generator$Selector.
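The "no protocol" message in the trace above is what `java.net.URL` reports when it is handed a string with no scheme, including an empty string. The following is a minimal sketch that reproduces that behaviour outside Nutch. The class `SeedUrlCheck` and its methods `isParsable` and `filterSeeds` are hypothetical helpers written for illustration; they are not part of Nutch, and the sketch does not claim to show where the empty URL in this particular crawl came from.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Nutch class: checks URL strings before they are used.
public class SeedUrlCheck {

    /** Returns true when java.net.URL can parse the string. */
    public static boolean isParsable(String candidate) {
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException e) {
            // An empty or scheme-less string ends up here ("no protocol: ...").
            return false;
        }
    }

    /** Keeps only the non-blank entries that java.net.URL accepts. */
    public static List<String> filterSeeds(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && isParsable(trimmed)) {
                kept.add(trimmed);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Example input (assumed): one valid URL, two blank entries, one without a scheme.
        List<String> seeds = Arrays.asList(
                "http://example.com/", "", "   ", "example.com/no-scheme");
        System.out.println(filterSeeds(seeds));
    }
}
```

A check like this could be run over a list of URLs to spot blank or scheme-less entries. Whether the empty URL here originated in the seed list or was added to the crawl database some other way cannot be told from the excerpt.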
I am running Nutch 1.4 locally on iOS to crawl a website, and Nutch readseg dump does not return any relevant information. What am I missing?
I am trying to extract 'category' as new metadata from url. I am using replace to extract substring from the url. I am able to run the code and index the documents in Google Cloud Search. But it is not capturin
I want to inject my crawl URLs with the command bin/nutch inject, but I get this error:
'nutch' is not recognized as an internal or external command,
operable program or batch file.
Where should I enter this command? My command prompt is currently at the path C:\Users\Gaurav Kandpal\Desktop\elastic\apache-nutch-2.3-src\apache-nutch-2.3\runtime\local\b when I type it.
I am trying to index crawled data from Nutch into Solr, but I get the following error. Any help would be greatly appreciated.
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication
I am using Solr 5.4.1 and Apache Nutch 1.12. I can crawl the data, but at the final step, indexing into Solr, I get the following error.
SOLRIndexWriter
solr.server.url : URL of the SOLR instance
solr.zookeeper.hosts : URL of the Zookeeper quorum
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mappin