
wget has an option, -np (--no-parent), which prevents it from fetching files from any parent directory. I need something similar, but a bit more flexible. Consider:

www.foo.com/bar1/bar2/bar3/index.html

I would like to get everything, but nothing "higher" (in the tree hierarchy) than bar2. So bar2 itself should also be fetched, but not bar1.

Is there a way to make wget more selective?

Background: I'm trying to mirror a website with a similar logical structure: a starting point, then up, then down. If there is a tool other than wget better suited for such a layout, please let me know as well.

Update

Or, instead of specifying the allowed depth upwards, maybe something like "no parents, unless they match this or that URL".

Update 2

There is some structure on the server, right? You can visualize it as a tree. Normally, with "--no-parent" you start from some point A and go only down.

My wish is the ability to go up, expressed by saying either that it is allowed to go up X nodes, or (which is 100% equivalent) that it is allowed to go up to node B (where the distance B-A = X).

In all cases, the rules for going down stay as defined by the user (for example: go down only Y levels).

How to store it? That is not really the question -- by default wget recreates the server structure, so there is nothing to worry about and nothing to fix. In two words: as usual.

Update 3

The directory structure is below -- let's assume each directory contains only one file: R.html in R, and so on. This is simplified, of course, since a directory can hold more than one page.

        R 
       / \
      B   G
     / \
    C   F
   / \
  A   D
 /
E 

A (A.html) is my starting point and X = 2 (so B is the topmost node I would like to fetch). In this particular example that means fetching all pages except R.html and G.html. A.html is called the "starting point" because I have to start from it, not from B.

Update 4

Naming follows Update 3.

wget OPTIONS www.foo.com/B/C/A/A.html

The question is: what options fetch all pages from directory B and below, given that you have to start from A.html?

greenoldman
  • You want `bar2` fetched but not `bar1`? Where is `bar2` going to reside? What if two or more dirs that you don't want have identically-named subdirs, should their contents be merged? It is almost certainly easier to just get the whole damn site and then prune/move things around as you desire. – Kilian Foth Dec 15 '11 at 15:21
  • @Kilian Foth, what do you mean by "get the whole damn site"? Fetching it? In general that is overkill; it could mean fetching terabytes when only megabytes are needed. For the rest, see Update 2. – greenoldman Dec 15 '11 at 15:36
  • Not sure what you mean. The only interpretation I can come up with is that you want the `bar2` directory and all its contents. If that is not it, please clarify. – Faheem Mitha Dec 15 '11 at 18:11
  • @Faheem Mitha, "its contents" = "the entire subtree". Yes, I believe that is the only interpretation, and it is exactly what I mean. – greenoldman Dec 15 '11 at 20:58

4 Answers


I haven't tried it, but using -I and -X could give you what you want. My first attempt would be along the lines of:

wget -m -I bar1/bar2 -X "*" http://www.foo.com/bar1/bar2/bar3/index.html

Explanation of options:

-m: 
   --mirror
       Turn on options suitable for mirroring.  This option turns on recursion and time-stamping, sets
       infinite recursion depth and keeps FTP directory listings.  It is currently equivalent to -r -N -l
       inf --no-remove-listing.
-I: list
   --include-directories=list
       Specify a comma-separated list of directories you wish to follow when downloading.  Elements of
       list may contain wildcards.
-X: list
   --exclude-directories=list
       Specify a comma-separated list of directories you wish to exclude from download.  Elements of list
       may contain wildcards.
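Adapting this to the naming from Update 3/4 (hypothetical, not verified against a real server), the invocation would look like the command below; it is built as a shell string here purely for illustration:

```shell
# Hypothetical adaptation to the Update 3/4 layout: start from A.html,
# but let wget follow links anywhere inside /B, while -X "*" excludes
# every other directory.
cmd='wget -m -I /B -X "*" http://www.foo.com/B/C/A/A.html'
echo "$cmd"
```

Because -I /B keeps traversal inside the /B subtree, links climbing to R or across to G are skipped, which matches the goal of fetching everything except R.html and G.html.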
AProgrammer

I think the right answer here is the --no-parent option:

   -np
   --no-parent
       Do not ever ascend to the parent directory when retrieving recursively.
       This is a useful option, since it guarantees that only the files below
       a certain hierarchy will be downloaded.
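The rule --no-parent enforces can be sketched as a simple prefix check: a candidate link is followed only if its path stays at or below the starting directory (paths taken from the question's example):

```shell
# Sketch of the --no-parent rule as a prefix test on URL paths.
start='/bar1/bar2/'
follows() { case "$1" in "$start"*) echo yes ;; *) echo no ;; esac; }
follows /bar1/bar2/bar3/index.html   # below the start directory: followed
follows /bar1/other/page.html        # would require ascending to bar1: skipped
```

Note that wget applies this check relative to the start URL, which is why -np alone cannot start from A.html and still fetch the whole subtree under B.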
Jonathon Reinhart

You need to add a trailing / to the URL, or you won't get what you want.

If you wanted to get all the content under www.myhostname.com/somedirectory, the syntax should read:

wget -r -nH http://www.myhostname.com/somedirectory/

Try it without the trailing / and see what happens; then try it with it.

Adrian
  • It will still ascend into higher directories if pages linked therein refer to them. – EkriirkE Oct 12 '15 at 02:03
  • Thanks a lot for the hint with trailing slash! It helped me to solve the issue with irrelevant files fetched by wget from neighbouring directories (siblings). – AntonK Jan 28 '18 at 19:29

Maybe I'm missing something, but if that is what you want then

wget -c -np -r www.foo.com/bar1/bar2

works for me (using your example). Of course, with those options you'll also get the whole directory structure above it, from www.foo.com on down. If you just want bar2 at the top level, then do

wget -c -np -r -nH --cut-dirs=1 www.foo.com/bar1/bar2

-nH gets rid of the www.foo.com, and --cut-dirs=1 gets rid of bar1, so you'll get bar2 and its subdirectories downloaded to the current directory. For further information, see man wget, which is quite readable and has examples.
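The path rewriting that -nH and --cut-dirs perform can be illustrated with plain parameter expansion (wget does this internally; the expansions below are just a sketch):

```shell
# How -nH --cut-dirs=1 maps a remote path to a local one.
remote='www.foo.com/bar1/bar2/bar3/index.html'
no_host="${remote#*/}"   # -nH drops the hostname part
dest="${no_host#*/}"     # --cut-dirs=1 drops one leading directory
echo "$dest"             # bar2/bar3/index.html
```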

Faheem Mitha
  • You omitted the starting point; you **have to** follow the links. You assumed the starting point is also the top level (the trivial case of `np`), but I am looking for a general solution where the top level is above the starting point. – greenoldman Dec 16 '11 at 07:50
  • @macias: Sorry, I'm not following you. Can you illustrate with an example? – Faheem Mitha Dec 16 '11 at 08:05
  • I just added an ASCII "screenshot". I hope this will help. In this example A is the starting point. – greenoldman Dec 16 '11 at 08:19
  • @macias: So you don't want to specify the path to `B` (as per your example), but rather `A`? If so, why? Is this because you want to automate some script or for some other reason? I'm also not sure what you mean by X=2. Does that mean level 2? If you are trying to fetch directories further down in the tree, I'm not sure how you distinguish `B` from `G`. – Faheem Mitha Dec 16 '11 at 08:26
  • A is the starting point because it is the starting point -- look, I am on the client side, not the server. IOW, I do **NOT** own the server and I didn't create this structure; I have to deal with what I see. **X** is the symbol from Update 2, the "depth": how many levels you may go up. You distinguish B from G because B is B and G is not B, and you see B because it is part of the URL for A. I rephrased the question in Update 4. – greenoldman Dec 16 '11 at 08:49
  • @macias: Ok, I see what you mean. I'm not clear why you don't just truncate the url, but I assume that is because you can't see the whole directory tree? In any case, sticking `../..` at the end of the url seems to work (if you want to go up two levels), though I have not tested it carefully. – Faheem Mitha Dec 16 '11 at 13:52
  • I cannot just truncate to a directory, because there are no default HTML pages per directory; you have to know the address exactly. The second thing: a link is one-directional -- if you have A -> B and you start from B, there is no way to figure out the URL of A unless there is a link B -> A. Nevertheless, I now have a working solution; see AProgrammer's answer. – greenoldman Dec 16 '11 at 14:22