
I would like to convert an online book (linked html files) into a pdf file.

I tried the two-step approach described at http://kmkeen.com/mirror/2009-02-05-14-00-00.html

  1. First, I downloaded the HTML files with:

    wget -nd -mk http://www.unknownroad.com/rtfm/gdbtut/
    

    But it also downloaded a lot of unrelated files, so I have to remove them afterwards.

  2. Then I tried to convert the downloaded HTML book into a PDF file:

    htmldoc --webpage -f gdb.pdf html/index.html html/*.html
    

    but the order of pages in the pdf file isn't correct.

What is a good way to download an online book (linked HTML files) and convert it into a PDF file?

My OS is Ubuntu 12.04.

countermode
Tim

1 Answer


As mentioned in the instructions you linked:

The default glob expansion puts the pages in alphabetical order.

The index page links to nine different documents, whose names aren't in alphabetical order. When you say htmldoc ... *.html, the shell expands the glob alphabetically, so the tool sees the files in alphabetical order and puts the pages into the document in that order. You need to list the files on the command line in the order you want htmldoc to process them.

In this specific case you can produce an ordered list of filenames as they're linked in the index with:

    awk '/http:|\.\./ {next}; /<a href.*\.html/ { gsub(/.*href="/, "") ; gsub(".html.*", ".html") ; print }' index.html | uniq

so

    htmldoc --webpage -f gdb.pdf index.html $(awk '/http:|\.\./ {next}; /<a href.*\.html/ { gsub(/.*href="/, "") ; gsub(".html.*", ".html") ; print }' index.html | uniq)

will have the effect you want.
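The same idea can be written as a small shell function, which may be easier to adapt to other books. This is only a sketch under some assumptions: the name `ordered_links` is made up for illustration, links are written as lowercase `href="..."` with double quotes, and `grep -o` is available (it is in GNU grep on Ubuntu). It also uses `awk '!seen[$0]++'` in place of `uniq`, so a repeated link is dropped even when the repeats are not adjacent.

```shell
#!/bin/sh
# ordered_links FILE
# Print the local .html link targets found in FILE, one per line, in the
# order they first appear. Absolute http(s) URLs and links that go up a
# directory (../) are skipped, and "#fragment" suffixes are cut off.
ordered_links() {
    grep -o 'href="[^"]*\.html' "$1" |
        sed 's/^href="//' |
        grep -v -e '^http:' -e '^https:' -e '\.\./' |
        awk '!seen[$0]++'
}

# Possible use (assumes htmldoc is installed and index.html is in the
# current directory); uncomment to build the PDF:
# htmldoc --webpage -f gdb.pdf index.html $(ordered_links index.html)
```

Because the unquoted `$(...)` is split on whitespace, this only works when the file names contain no spaces, which is also true of the one-liner above.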

Michael Homer
  • Thanks. About step 1: `wget -nd -mk http://www.unknownroad.com/rtfm/gdbtut/` also downloads files from elsewhere under `http://www.unknownroad.com`. How can I download only the files under `http://www.unknownroad.com/rtfm/gdbtut/`? – Tim Aug 20 '14 at 13:08
  • See wget's `--domains` and `--no-parent` options. – Michael Homer Aug 20 '14 at 21:46
  • Thanks, Michael. Can htmldoc create multilevel bookmarks in the resulting pdf file? http://unix.stackexchange.com/questions/153433/can-htmldoc-create-multi-level-pdf-bookmarks – Tim Sep 03 '14 at 01:19
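
As a rough illustration of the `--no-parent` suggestion from the comments, the original wget command could be wrapped like this. The function name `mirror_book` and the `DRY_RUN` variable are invented for this sketch. It assumes the book lives entirely under one directory, and that the URL ends in a slash so that wget treats the last component as a directory.

```shell
#!/bin/sh
# mirror_book URL
# Mirror the pages under URL without climbing to the parent directory.
#   -nd          put all files in the current directory
#   -m           mirror (recursive download with timestamping)
#   -k           convert links so they work locally
#   --no-parent  never ascend above the starting directory
# Set DRY_RUN=1 to print the command instead of running it.
mirror_book() {
    set -- wget -nd -mk --no-parent "$1"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Possible use:
# mirror_book http://www.unknownroad.com/rtfm/gdbtut/
```

Running it once with `DRY_RUN=1` shows exactly what would be executed before anything is downloaded.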