How to Wget with Subset Condition + generate CHM/... e-book?

Question

I want to generate a CHM (or other format) e-book by running wget with a subset condition: recursively download only the part of the website that sits inside the HTML class .container. Pseudocode:

Recursively wget all chapter links:

# TODO returns only index.html
wget --random-wait -r -p -nd -e robots=off -A".html" \
 -U mozilla https://wwwnc.cdc.gov/travel/yellowbook/2018/table-of-contents

Keep the contents of .container (Fig. 1) on the current main page, plus the contents of the daughter pages its links point to.
Create a CHM e-book and/or another format.

Fig. 1 Inspection of CDC Yellow Book .container

Output: just index.html

Expected output: e-book CHM and/or other format

Wget Proposals

Tim S.

wget -w5 --random-wait -r -nd -e robots=off -A".html" -U mozilla https://wwwnc.cdc.gov/travel/yellowbook/2018/table-of-contents

Output: same as with the first command.

With Rejection List

wget -w5 --random-wait -r -nd -e robots=off -A".html" \
 -U mozilla -R css https://wwwnc.cdc.gov/travel/yellowbook/2018/table-of-contents

Output: same as without the rejection list.

Another variant

wget -w5 --random-wait -r -nd -e robots=off -A".html" \
 -U mozilla https://wwwnc.cdc.gov/travel/yellowbook/2018/table-of-contents

Output: similar to before.

The tool www.html2pdf.it gives

Cannot get http://wwwnc.cdc.gov/travel/yellowbook/2016/table-of-contents: http status code 404

OS: Debian 8.7

Is this the only site that you're testing it on? My first guess would be the random wait time; perhaps it's triggering a rejection on the server end. Try setting the wait time to something like 5 seconds (which should be more than enough for a server to allow the connection) just to see if that works. – Tim S. Apr 19 '16 at 21:49
`wget` may not be flexible enough for this. AFAIK, it has no capability to only look inside specific named elements (like `div.container`). You may need to write your own web robot, e.g. in `perl` with `LWP` (aka `libwww-perl`): https://metacpan.org/release/libwww-perl – cas Apr 19 '16 at 23:36
@Masi I was pointing you in a viable direction, not volunteering to write a web bot for you. (That's a tedious, PITA job that I hate doing even when I really need the data myself. And then the web site changes and your bot breaks; fix it and repeat forever.) – cas May 29 '16 at 11:05
@cas Yes, I know it can be tedious. That is why I want an overview of what could be done here. I do not understand why such robots break so often. Any other proposal is also welcome! Actually, I do not understand why a robot is needed here at all, or why wget fails. – Léo Léopold Hertz 준영 May 29 '16 at 11:09
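
To make cas's point about `div.container` concrete, here is a rough sketch in plain shell that pulls only the links inside that element out of a page that has already been saved locally. `extract_container_links` is a hypothetical helper name, and the approach rests on loud assumptions: lowercase tags, double-quoted attributes, and roughly one tag per line so that counting `<div>` and `</div>` per line is meaningful. A real HTML parser (such as the LWP route mentioned above) would be more robust.

```shell
# extract_container_links FILE
# Hypothetical helper: prints the href values found inside the first
# <div> whose class attribute contains "container".
# Assumptions: lowercase tags, double-quoted attributes, about one tag per line.
extract_container_links() {
    awk '
        # Start capturing at the first div whose class list mentions "container".
        !inside && /<div[^>]*class="[^"]*container[^"]*"/ { inside = 1; depth = 0 }
        inside {
            line = $0
            opens  = gsub(/<div[ >]/, "&", line)  # opening div tags on this line
            closes = gsub(/<\/div>/, "&", line)   # closing div tags on this line
            depth += opens - closes
            print
            if (depth <= 0) exit                  # the container div has been closed
        }
    ' "$1" | grep -o 'href="[^"]*"' | sed 's/^href="//; s/"$//'
}
```

The resulting list could then be fed to `wget -i` one URL per line, after turning relative links into absolute ones.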

Tim S. · Answer 1 · 2016-04-19T22:46:45.433

I found your problem. The -A".html" option restricts wget to accepting only files whose names end in .html. If you remove that option, you will start to download all of the files.

wget -w5 -r -nd -e robots=off -U mozilla http://wwwnc.cdc.gov/travel/yellowbook/2016/table-of-contents

Edit: If you want to exclude js/css/etc files, then you'd be better off using -R to form a rejection list rather than including only html.
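
As a sketch of the rejection-list idea, the command can be wrapped in a small function. `mirror_book` and the default directory name `yellowbook` are hypothetical; the flags are standard GNU wget long options. The sketch assumes the chapter pages live below the same directory as the start URL (hence `--no-parent`), and it adds `--adjust-extension` so that HTML pages whose URLs lack a suffix are saved with `.html` appended.

```shell
# mirror_book URL [DEST_DIR]
# Hypothetical helper: recursive download that rejects scripts and
# stylesheets instead of accepting only *.html.
mirror_book() {
    book_url=$1
    book_dest=${2:-yellowbook}   # assumed default output directory
    wget --wait=5 --random-wait \
         --recursive --no-parent \
         --adjust-extension --convert-links \
         --execute robots=off --user-agent=mozilla \
         --reject 'js,css' \
         --directory-prefix="$book_dest" \
         "$book_url"
}
```

Usage would look like `mirror_book https://wwwnc.cdc.gov/travel/yellowbook/2018/table-of-contents`. Whether this fetches every chapter depends on how the site links its pages, which I have not verified.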

edited Apr 19 '16 at 22:46

answered Apr 19 '16 at 22:01

Tim S.

Visit the site in your regular browser. If you restrict to only html files, it will stop after index.html every time. Sounds like you need to refine your filter then, to reject what you don't want and keep the rest. – Tim S. Apr 19 '16 at 22:10
I have edited my answer to include a rejection list with `-R` rather than only accepting html files. – Tim S. Apr 19 '16 at 22:47

score 1 · Answer 2 · answered Jun 25 '17 at 09:55

I do not think you should include or exclude anything; download it all. CHM is compiled HTML, so you will need a stylesheet to replace the existing one, and the existing CSS is the obvious base to start from.

As for the JavaScript, you might want to inspect what it does, because some data might be hidden by default.

Remember, you can define what you include/exclude in your master.hhc (for your CHM).

You will need the Microsoft HTML Help Workshop to compile the CHM. I also advise using FAR for editing what you want and what you do not want.

These tools are designed for Windows. I am fairly sure they work under Wine, but I have not tested this.
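
As an illustration of the master.hhc idea, the following sketch writes one contents entry per downloaded page. `make_hhc` is a hypothetical helper, and the sitemap-style layout is my assumption of what a contents file typically looks like; compare it with a file generated by HTML Help Workshop before relying on it. Entry titles are simply taken from the file names, and names containing quotes or ampersands would need escaping.

```shell
# make_hhc DIR > master.hhc
# Hypothetical helper: writes one contents entry per *.html file in DIR.
make_hhc() {
    hhc_dir=$1
    printf '%s\n' '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">' \
                  '<HTML><HEAD></HEAD><BODY>' '<UL>'
    for page in "$hhc_dir"/*.html; do
        [ -e "$page" ] || continue      # skip the unexpanded glob if DIR has no pages
        file=$(basename "$page")
        title=${file%.html}             # assumption: file name doubles as entry title
        printf '  <LI><OBJECT type="text/sitemap">\n'
        printf '    <param name="Name" value="%s">\n' "$title"
        printf '    <param name="Local" value="%s">\n' "$file"
        printf '  </OBJECT>\n'
    done
    printf '%s\n' '</UL>' '</BODY></HTML>'
}
```

Leaving a page out of the CHM contents is then a matter of not keeping its file in the directory, or of deleting its entry from the generated file.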

thecarpy

Great to understand how to do this in Windows too! Are there any tools designed for Linux? Any format is OK for me. – Léo Léopold Hertz 준영 Jun 25 '17 at 14:07
CHM is Windows Help. You are probably better off doing this whole thing in PDF. – thecarpy Jun 25 '17 at 14:09
Check out https://stackoverflow.com/questions/391005/convert-html-css-to-pdf-with-php – thecarpy Jun 25 '17 at 14:14
Google is your friend. Try this: http://www.html2pdf.it/ (open source, uses WebKit with Node.js, platform independent). – thecarpy Jul 05 '17 at 05:11
The tool gives a 404 error when you enter the page URL there. What do you think? – Léo Léopold Hertz 준영 Jul 05 '17 at 13:53
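
One way around a remote converter failing to fetch the URL is to convert pages that are already saved locally. The sketch below assumes the separate command-line tool `wkhtmltopdf` (also WebKit-based, not mentioned elsewhere in this thread) is installed; `pages_to_pdf` is a hypothetical wrapper that passes the input pages in order, followed by the output file.

```shell
# pages_to_pdf OUT.pdf PAGE.html...
# Hypothetical wrapper around wkhtmltopdf (assumed to be installed):
# combines the given local HTML pages, in order, into one PDF.
pages_to_pdf() {
    pdf_out=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "usage: pages_to_pdf OUT.pdf PAGE.html..." >&2
        return 1
    fi
    wkhtmltopdf "$@" "$pdf_out"     # input pages first, output file last
}
```

For example, `pages_to_pdf yellowbook.pdf yellowbook/*.html` would combine the mirrored pages in glob order; a sensible chapter order would need an explicit file list.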
