Not possible with wget: <a href="http://linuxgazette.net/160/misc/lg/how_to_make_wget_exclude_a_particular_link_when_mirroring.html" rel="nofollow">http://linuxgazette.net/160/misc/lg/how_to_make_wget_exclude_a_particular_link_when_mirroring.html</a>

Well, I am not sure about newer versions, though.

As for the 401 code: HTTP authentication keeps no state (cookies are not used for it), so the username and password must be sent with every request. wget tries each request without the username &amp; password first, and only sends them once the server has answered with 401.
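To make the "sent with every request" point concrete, here is a minimal sketch of the header a client has to attach each time under HTTP Basic authentication. The function name <code>basic_auth_header</code> is made up for illustration, and <code>x</code>/<code>x</code> are the placeholder credentials used elsewhere in this thread. It assumes the server uses the Basic scheme, which the question does not state.

```shell
# Sketch only: build the Authorization header that HTTP Basic auth requires
# on every request. basic_auth_header is a hypothetical helper name.
basic_auth_header() {
    # $1 = user name, $2 = password; both are placeholders here
    printf 'Authorization: Basic %s\n' "$(printf '%s:%s' "$1" "$2" | base64)"
}

basic_auth_header x x

# wget's manual also documents --auth-no-challenge, which sends Basic
# credentials up front instead of waiting for the 401. Untested here:
# wget --auth-no-challenge --http-user x --http-password x http://web.server.org/folder/
```

Whether <code>--auth-no-challenge</code> actually removes the extra 401 round trip on this particular server is an assumption that would need checking.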

Pavuk (<a href="http://www.pavuk.org" rel="noreferrer">http://www.pavuk.org</a>) looked like a promising alternative: it lets you mirror websites while excluding files based on URL patterns and filename extensions. However, pavuk 0.9.35 segfaults/dies randomly in the middle of long transfers &amp; does not appear to be actively developed (this version was built in Nov 2008).

FYI, here's how I was using it: 
<code>pavuk -mode mirror -force_reget -preserve_time -progress -Robots -auth_scheme 3 -auth_name x -auth_passwd x -dsfx 'html,bam,bai,tiff,jpg' -dont_leave_site -remove_old -cdir /path/to/root -subdir /path/to/root -skip_url_pattern '*icons*' -skip_url_pattern '*styles*' -skip_url_pattern '*images*' -skip_url_pattern '*bam*' -skip_url_pattern '*solidstats*' http://web.server.org/folder 2&gt;&amp;1 | tee "pavuk-$(date).log"</code>

In the end, <code>wget --exclude-directories</code> did the trick:

<pre><code>wget --mirror --continue --progress=dot:mega --no-parent \
--no-host-directories --cut-dirs=1 \
--http-user x --http-password x \
--exclude-directories='folder/*/folder_containing_large_data*' --reject "index.html*" \
--directory-prefix /path/to/local/mirror \
http://my.server.org/folder
</code></pre>

Since the <code>--exclude-directories</code> wildcards don't span '/', you need to form your queries quite specifically to avoid downloading entire folders.
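As a rough analogy, shell pathname expansion treats <code>*</code> the same way: it never crosses a '/'. The sketch below builds a throwaway directory tree (all directory names are hypothetical) and shows that a two-level pattern only matches two-level paths. wget does its own matching, so this illustrates the rule rather than wget's implementation.

```shell
# Sketch only: '*' in a shell glob does not span '/', which mirrors the
# behaviour described for --exclude-directories. Names are hypothetical.
demo_glob_depth() (
    tmp=$(mktemp -d) || exit 1
    cd "$tmp" || exit 1
    mkdir -p folder/a/folder_containing_large_data_1
    mkdir -p folder/a/b/folder_containing_large_data_2
    # Matches only the directory exactly one level below folder/
    printf '%s\n' folder/*/folder_containing_large_data*
    cd / && rm -rf "$tmp"
)

demo_glob_depth
# prints: folder/a/folder_containing_large_data_1
```

The deeper directory would need its own pattern with one more <code>*/</code> component.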

Mark

The <code>--reject 'pattern'</code> parameter actually worked for me with wget 1.14.

For example:

<pre><code>wget --reject rpm http://somerpmmirror.org/site/
</code></pre>

None of the <code>*.rpm</code> files were downloaded, only the index pages.

<blockquote>
 Warning: bash can unintentionally expand a file pattern if it matches a file in the working directory. Quote the pattern to avoid that:
</blockquote>

<pre><code>touch blahblah.rpm
# working
wget -R '*.rpm' ....
# working
wget -R "*.rpm" ....
# not working
wget -R *.rpm ....
</code></pre>
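To see what the shell actually hands to the program in each case, a small sketch can print the arguments instead of running wget. The helper names <code>count_args</code> and <code>demo_quoting</code> and the second file <code>other.rpm</code> are made up for illustration.

```shell
# Sketch only: show how quoting changes the arguments a command receives.
count_args() { echo "$#: $*"; }

demo_quoting() (
    tmp=$(mktemp -d) || exit 1
    cd "$tmp" || exit 1
    touch blahblah.rpm other.rpm
    count_args -R '*.rpm'   # quoted: the pattern reaches the program intact
    count_args -R *.rpm     # unquoted: the shell replaces it with file names
    cd / && rm -rf "$tmp"
)

demo_quoting
# prints:
# 2: -R *.rpm
# 3: -R blahblah.rpm other.rpm
```

In the unquoted case wget would receive a file name as the reject list, and any further matches as extra arguments, rather than the pattern itself.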

I'd like to mirror a simple password-protected web portal that serves some data, and keep the mirror up to date. Essentially the website is just a directory listing with data organised into folders &amp; I don't really care about keeping HTML files &amp; other formatting elements.
However, some file types are too large to download, so I want to ignore them.

Using <code>wget -m</code> with the <code>-R/--reject</code> flag nearly does what I want, except that every file still gets downloaded first and is only deleted afterwards if it matches the <code>-R</code> list.

Here's how I'm using <code>wget</code>:

<pre><code>wget --http-user userName --http-password password -R index.html,*tiff,*bam,*bai -m http://web.server.org/
</code></pre>

Which produces output like this, confirming that an excluded file (index.html) (a) gets downloaded, and (b) then gets deleted:

<blockquote>
 ... 
 --2012-05-23 09:38:38-- <a href="http://web.server.org/folder/" rel="noreferrer">http://web.server.org/folder/</a> 
 Reusing existing connection to web.server.org:80. 
 HTTP request sent, awaiting response... 401 Authorization Required 
 Reusing existing connection to web.server.org:80. 
 HTTP request sent, awaiting response... 200 OK 
 Length: 2677 (2.6K) [text/html] 
 Saving to: `web.server.org/folder/index.html'
 100%[======================================================================================================================>] 2,677 --.-K/s in 0s
 
 Last-modified header missing -- time-stamps turned off. 
 2012-05-23 09:38:39 (328 MB/s) - `web.server.org/folder/index.html' saved [2677/2677] 
 
 Removing web.server.org/folder/index.html since it should be rejected. 
 
 ... 
 
</blockquote>

Is there a way to force wget to reject the file before downloading it? 
Is there an alternative that I should consider?

Also, why do I get a <code>401 Authorization Required</code> error for every downloaded file, despite supplying the username &amp; password? It's as if <code>wget</code> tries to connect unauthenticated every time before trying the username/password.

Thanks, Mark

Mirror HTTP website, excluding certain files
