I want to test how my site would behave when being spidered. However, I want to exclude all URLs containing the word "page". I tried:
$ wget -r -R "*page*" --spider --no-check-certificate -w 1 http://mysite.com/
The -R flag is supposed to reject any URL matching the pattern, i.e. any URL containing the word "page". Except that it doesn't seem to work:
Spider mode enabled. Check if remote file exists.
--2014-06-10 12:34:56-- http://mysite.com/?sort=post&page=87729
Reusing existing connection to [mysite.com]:80.
HTTP request sent, awaiting response... 200 OK
How do I exclude such URLs from being spidered?
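For reference, the wget manual describes -A/-R as matching against file name suffixes and patterns, which may be why the query string is not being considered. One possibility I'm looking at, assuming a wget new enough to support it (the manual lists --reject-regex from version 1.14), is matching against the whole URL instead:

```shell
# Hypothetical alternative: --reject-regex is applied to the complete URL,
# including the query string, unlike -R which matches only the file name.
wget -r --spider --no-check-certificate -w 1 \
     --reject-regex "page" http://mysite.com/
```

I have not confirmed whether this behaves differently in --spider mode, so corrections are welcome.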