
Is there a way to separate wget's download and --convert-links functionality? For those unfamiliar with wget and/or --convert-links: in short, wget can be used to download a website, and --convert-links modifies the downloaded HTML files so the downloaded website works off-line. It does that by converting the href/src/etc. attributes to reference local files instead of the remote website.

This is the official explanation:

-k --convert-links

After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.

Each link will be changed in one of the two ways:

• The links to files that have been downloaded by Wget will be changed to refer to the file they point to as a relative link.

Example: if the downloaded file /foo/doc.html links to /bar/img.gif, also downloaded, then the link in doc.html will be modified to point to ../bar/img.gif. This kind of transformation works reliably for arbitrary combinations of directories.

• The links to files that have not been downloaded by Wget will be changed to include host name and absolute path of the location they point to.

Example: if the downloaded file /foo/doc.html links to /bar/img.gif (or to ../bar/img.gif), then the link in doc.html will be modified to point to http://hostname/bar/img.gif.

Because of this, local browsing works reliably: if a linked file was downloaded, the link will refer to its local name; if it was not downloaded, the link will refer to its full Internet address rather than presenting a broken link. The fact that the former links are converted to relative links ensures that you can move the downloaded hierarchy to another directory.

Note that only at the end of the download can Wget know which links have been downloaded. Because of that, the work done by -k will be performed at the end of all the downloads.
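The two conversion rules above can be sketched as a toy shell illustration (this mimics the behaviour described in the man page, it is not wget's actual code; the paths and host name are the man page's own example, and `realpath --relative-to` is GNU coreutils):

```shell
# Toy illustration of wget -k's two conversion rules
# (paths and host name taken from the man page example above).
doc_dir=/foo            # directory containing the downloaded doc.html
target=/bar/img.gif     # link target found inside doc.html

# Rule 1: the target WAS downloaded -> rewrite as a relative link
# (-m lets realpath work on paths that don't actually exist)
realpath -m --relative-to="$doc_dir" "$target"
# prints: ../bar/img.gif

# Rule 2: the target was NOT downloaded -> rewrite as a full URL
echo "http://hostname$target"
# prints: http://hostname/bar/img.gif
```

The hard part wget handles for you is deciding, per link, which of the two rules applies — which is exactly why it can only do this once the whole download has finished.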

If a (recursive) download gets interrupted and resumed manually, or if one fails to specify -k to begin with, how can one get sane links inside the HTML files?

It seems not even --backup-converted can make the process more robust: either wget converts the links right after downloading everything (no missing files), or you're on your own (XPath etc.).

Cristian Ciupitu
usretc

1 Answer


Since .html files are plain text, you can post-process them with sed. Suppose the files contain, for example, http://bad.url/good.part or https://bad.url/good.part and should reference good.url instead; the following rewrites them, leaving the unmodified *.html files behind as *.html.bak.

find . -type f -name '*.html' -print0 | \
  xargs -0 -r sed -i.bak -e 's%://bad\.url/%://good.url/%'

Naturally, read `man find`, `man xargs` and `man sed`.
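Note that the sed pass above only touches links that carry a host name. Root-relative links (`href="/x.html"`) would need a separate pass; here is a hypothetical sketch for HTML files sitting at the mirror root (the sample file and pattern are invented — diff against the `.bak` before trusting the result on your own tree):

```shell
# Create a sample file with a root-relative link (invented content,
# standing in for a downloaded page that wget never converted)
printf '<a href="/bar/img.gif">pic</a>\n' > doc.html

# Rewrite root-relative hrefs into paths relative to the mirror root
# (here: the current directory); the original is kept as doc.html.bak
sed -i.bak -e 's%href="/%href="./%g' doc.html

cat doc.html
# prints: <a href="./bar/img.gif">pic</a>
```

For HTML files deeper in the hierarchy the replacement would need the right number of `../` components per file — which is exactly the bookkeeping wget's `-k` normally does for you.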

terdon
waltinator
  • I found a completely different answer to this question here: https://unix.stackexchange.com/a/7378/101311 Your answer seems far simpler than theirs, and my **gut** tells me the complexity is there for a good reason. Like, there might be edge cases where your solution doesn't work? Any idea why nobody else has thought of such a simple answer before you? – Daniel Kaplan Mar 24 '22 at 23:46
  • 3
    @DanielKaplan "complexity" increases the bug surface, unnecessarily. Set up a local webserver, lie to `/etc/hosts`, re-transfer all the files, oh and remember the `--timestamping` option. Remember to undo all of this when you're done? Just to change text in an ASCII file? My "solution" has been known among the Unix/Linux community since the beginning. I regard it as a well-known general algorithm applied to a specific case. Which "edge cases"? That's why I leave the `.bak` files behind. `diff` to check changes. You can adjust the match string to get the results you want. – waltinator Mar 25 '22 at 00:43
  • 1
    "Which "edge cases"?" Well, for example, what if a link starts with a `/` instead of having a `bad.url`? – Daniel Kaplan Mar 25 '22 at 03:25
  • 1
    @DanielKaplan In that case, the script would need to detect that and adapt. Why not look at how wget does it? – 9pfs Mar 25 '22 at 03:32
  • 1
    @9pfssupportsUkraine That's a fair question. I assumed I wouldn't be able to find the logic or understand it. The last time I read C code was a decade ago and I never knew how to program specifically for linux. I'll give it a try. – Daniel Kaplan Mar 25 '22 at 04:05
  • @DanielKaplan C is basically just JavaScript with additional types for data. – 9pfs Mar 25 '22 at 04:41
  • 1
    @9pfssupportsUkraine http://bzr.savannah.gnu.org/lh/wget/trunk/view/head:/src/convert.c Looks like the logic is ~1000 lines. I may look into it later, but it might take considerable effort to translate/extract what I'm looking for. – Daniel Kaplan Mar 25 '22 at 04:52
  • @DanielKaplan Why not compile it to WebAssembly and run it with nodejs without analyzing it at all? – 9pfs Mar 25 '22 at 06:36
  • 1
    @9pfssupportsUkraine Great idea! Then I'd have the same problem, but in web assembly. – Daniel Kaplan Mar 25 '22 at 06:37
  • But you could then create a program using the `pkg` package that'd do what you want, _and_ avoid analyzing the source! – 9pfs Mar 25 '22 at 06:38
  • 1
    Unconverted URLs can contain relative paths (`../x.html`), absolute paths (`/x.html`), and URL links. Relative paths also get affected by `<base>`. This is nasty, and that's the whole point of the question. As the ending says, "or you're on your own (xpath etc)" -- thanks for proposing a downgrade from xpath to sed :) – usretc Mar 27 '22 at 06:16
  • I mean no personal offense, but after sleeping on it, I agree with @usretc. In context, this should be a comment. The majority of the answer is an exercise left to the reader. In addition, usretc points out this answer doesn't even work in a lot of situations. For these reasons, I've flagged it as "not an answer." – Daniel Kaplan Mar 28 '22 at 03:43
  • @DanielKaplan you may find the opinions vary on this, but IMHO, waltinator's Answer here is an *attempt to answer the question*. I can't speak to the validity of the answer (as I don't personally understand the question), but I would reserve "NAA" for things that are obviously further questions or comments that belong under another post instead of as an Answer. – Jeff Schaller Mar 28 '22 at 18:49
  • @JeffSchaller In an attempt to clarify the question, I have suggested an edit to the original post. Hope that helps. re: `I would reserve "NAA" for ... comments that belong under another post instead of as an Answer.` I was thinking it falls under this, but I suppose it could be edited further to make it complete; as it stands, there are myriad examples -- `href`s in the original that start with `/`, `../`, and `./` -- where this falls short of `--convert-links` functionality. – Daniel Kaplan Mar 29 '22 at 10:10
  • 1
    Thank you, @DanielKaplan. Again, it's a bit of a range in my experience, but *attempts* to answer the question, even if they fall short, can stand as Answers where others can comment on the limitations of the answer or even edit it to improve it. I personally reserve "NAA" for answers that say "thanks" to another answer or "I have this *other* problem...". – Jeff Schaller Mar 29 '22 at 11:34
  • @JeffSchaller makes sense, thanks. – Daniel Kaplan Mar 29 '22 at 14:22