7

Is there any option for curl that saves a page as text?

I mean, in the same way a page can be saved in a browser as a text file. At least Firefox has that option.

I need it for a script, I simply do something like

curl -s http://... 

But it would make things much easier if I could deal with the output without all the HTML code.

I found an option for lynx that does what I want: lynx -dump, but I'd rather use curl.

Thanks.

Albert

4 Answers

9

You can consider pandoc, a powerful tool for converting files from one markup format into another.

curl -s URL | pandoc -f html -t plain

It's simple to use:

pandoc [OPTIONS] [FILES]
  -f FORMAT, -r FORMAT  --from=FORMAT, --read=FORMAT                    
  -t FORMAT, -w FORMAT  --to=FORMAT, --write=FORMAT                     
  -o FILE               --output=FILE                                   
                        --data-dir=DIRECTORY

Type pandoc --list-input-formats and pandoc --list-output-formats to see the formats you can convert between.
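A quick way to try the same pipeline locally, with printf standing in for the curl call (this assumes pandoc is installed):

```shell
# Feed a small HTML snippet through pandoc's html-to-plain conversion,
# exactly as the curl pipeline above would for a real page.
printf '<h1>Title</h1><p>Hello <em>world</em></p>' | pandoc -f html -t plain
```

The tags and emphasis markers are stripped, leaving only the readable text.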

lezambranof
5

No. You can use lynx for this:

lynx -dump URL

UPDATE: Oops, sorry, I did not see that you already know about lynx.

I advise using lynx for this purpose. It often produces very readable output. Sometimes you need the -width option to increase the width of the output.

appomsk
  • That's disappointing... because `lynx` behaves differently under `GNU/Linux` than under `cygwin` (it adds/removes an extra blank line). That forces me to write a different script for each platform, or to check whether I'm running under `cygwin`. Thanks anyway ;) – Albert Jan 15 '16 at 15:16
  • 2
    I've just checked - lynx from xubunutu 14.04 from Virtual Box and Cygwin from the same my windows box gives the same outputs. Diff shows no diffs) – appomsk Jan 15 '16 at 15:37
  • Don't know why, but for me they show different outputs. I _resolved_ it by removing all blank lines with `sed`, so I can use the same script. – Albert Jan 16 '16 at 02:26
  • 1
    @Albert: if you are already using sed, I suggest you try `curl -s http://... | sed 's/<\/*[^>]*>//g'` to remove all (well, almost all) of the HTML tags. – ikaerom Sep 09 '18 at 19:38
  • This was great for my screen scraping with `xargs`, avoiding paging through a long-running podcast with paginated old episodes. – Sridhar Sarnobat Jun 20 '20 at 02:34
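The sed one-liner from the comment above can be tried locally; note it is a rough tag stripper, not a real HTML parser (script blocks, comments, and attribute values containing `>` will trip it up):

```shell
# Delete every <...> run; printf stands in for the curl call.
printf '<p>Hello <b>world</b></p>\n' | sed 's/<\/*[^>]*>//g'
# → Hello world
```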
3

You can still use your curl command, and pipe it into lynx. This is useful if you need to pass authentication or any specific curl parameters. For example:

curl --config auth.cfg $URL | lynx -stdin -dump -width=100

This passes the auth.cfg file parameters to curl to access the URL, and prints the HTML page as plain text (without HTML tags and escape characters).
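The -stdin flag lets lynx render HTML arriving on any pipe, so you can check the pipeline without a network call; here printf stands in for the authenticated curl command above (assumes lynx is installed):

```shell
# Render a small HTML snippet exactly as the curl | lynx pipeline would.
printf '<html><body><h1>Title</h1><p>plain text out</p></body></html>' \
  | lynx -stdin -dump -width=100
```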

Noam Manos
1

curl is a command to retrieve files from web servers, in the exact form in which the server sends them. What you want is to convert the HTML file to plain text, which is a completely different task, so you need another tool for it; that's not what curl was designed for.

raj