I used to use the following command to get all links on a web page and then grep for what I want:

curl "$URL" 2>&1 | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | egrep "$CMP-[0-9]\.[0-9]\.[0-9]$" | cut -d'-' -f3
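For context, here is what the extraction chain does, run over an invented href with a hypothetical component name in $CMP (the name contains one dash, which is why -f3 picks out the version):

```shell
CMP='my-pkg'   # hypothetical component name
html='<a href="my-pkg-1.2.3">download</a>'

printf '%s\n' "$html" |
    grep -o -E 'href="([^"#]+)"' |          # isolate href="my-pkg-1.2.3"
    cut -d'"' -f2 |                         # strip the quotes -> my-pkg-1.2.3
    grep -E "$CMP-[0-9]\.[0-9]\.[0-9]$" |   # keep only name-X.Y.Z entries
    cut -d'-' -f3                           # keep the version -> 1.2.3
# → 1.2.3
```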

It worked fine until yesterday. When I ran curl by itself, I saw it return only:

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                               Dload  Upload   Total   Spent    Left  Speed
0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

Could some update have caused the command to stop working?

EDIT 1:

Following this answer, I switched my approach to wget:

wget -q "$URL" -O - | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | egrep "$CMP-[0-9]\.[0-9]\.[0-9]$" | cut -d'-' -f3

But I still don't know why the curl approach suddenly stopped working.

Zeinab Abbasimazar

5 Answers


Warning: Using regex to parse HTML is a bad idea in most (if not all) cases, so proceed at your own discretion.


This should do it:

curl -f -L URL | grep -Eo "https?://\S+?\""

or

curl -f -L URL | grep -Eo '"(http|https)://[a-zA-Z0-9#~.*,/!?=+&_%:-]*"'
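To sanity-check the second pattern without touching the network, you can run it over a hand-written snippet (the HTML below is invented for illustration):

```shell
# Hypothetical stand-in for the page fetched by curl:
html='<p><a href="https://example.com/download">dl</a> <a href="/docs">docs</a></p>'

# The pattern keeps only absolute http(s) URLs, quotes included;
# the relative /docs link is dropped:
printf '%s\n' "$html" | grep -Eo '"(http|https)://[a-zA-Z0-9#~.*,/!?=+&_%:-]*"'
# → "https://example.com/download"
```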

Note:

  • This does not handle relative links, where only part of the full URL appears in the href attribute (what I originally, and wrongly, called "half a link"). EDIT: Gilles Quenot kindly provided a solution for relative links:
curl -Ls URL | grep -oP 'href="\K[^"]+'
  • This also doesn't "clean" anything that isn't part of the link itself (e.g. a trailing "&" character). If you want to remove that, pipe through sed or similar:
curl -f -L URL | grep -Eo "https?://\S+?\"" | sed 's/&.*//'
  • Lastly, this does not cover every way a link can appear, so some knowledge of the page's structure or of HTML is required. Since you can't/don't show an example of that structure or of the webpage itself, it is difficult to write an answer that works on it without more HTML-specific handling.

  • P.S.: This may or may not be obvious, but this also doesn't catch links/URLs that are generated dynamically (e.g. by PHP, JS, etc.), since curl only retrieves the static page.

  • P.S. (2): If you want a better way to parse HTML, use the answer from Gilles Quenot instead of my solution above (which doesn't handle every corner case, given the lack of an HTML sample); it offers more complete and better-optimized support for HTML syntax.

I am in no way recommending regex for parsing HTML, unless you know what you're doing or have very limited needs (e.g. you only want links), as in this case.
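For a quick illustration of why the relative-link variant matters, here it is on another invented snippet; unlike the absolute-URL patterns above, it also returns `/questions` (GNU grep is assumed, since -P needs PCRE support):

```shell
html='<a href="https://example.com/a">x</a> <a href="/questions">y</a>'

# \K discards the href=" prefix from the match, leaving only the URL:
printf '%s\n' "$html" | grep -oP 'href="\K[^"]+'
# → https://example.com/a
# → /questions
```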

Nordine Lotfi
    Plus one, but I do not understand your last comment. `php` is serverside and `js` clientside. Both can manipulate/insert html. So links generated from scripting languages should be seen by `curl` I think. If links are only visible after a user input like `click` or `hover`, then curl will not see them because they are injected after the curl call. – Timo Dec 04 '22 at 06:21
  • That's what I meant yeah @Timo if links are generated dynamically (eg: without being statically written on the HTML page) they won't be seen by curl at any point unless it's server-side, but most of the time, it won't work with just curl, and would require a full browser/other ways (eg: nodejs, etc). Thanks by the way :) – Nordine Lotfi Dec 04 '22 at 08:54
    To see what I mean, you can use a curses/TUI browser like `w3m`, `lynx`, or others, and look around the web at different webpages. A lot of them won't deliver their full content to you, even if you use the right cookies/user-agent, _because_ many of those links are dynamically generated, or the site checks for certain functions that the browser does not have (or because of other things, such as captchas and whatnot that do not work with those browsers). It's basically similar with `curl` too. @Timo – Nordine Lotfi Dec 04 '22 at 16:44
    A URL containing a space that is not URI-encoded is still a valid link. Better to use `curl -Ls 'https://stackoverflow.com' | grep -oP 'href="\K[^"]+'` if you insist on using regex, even though you said it's a bad habit. Then you will also get _relative_ links. – Gilles Quénot Dec 04 '22 at 22:40
    Thank you! Did not think about that at the time. I'll add this in and credit you. :) @GillesQuenot – Nordine Lotfi Dec 04 '22 at 22:44

Parsing HTML with regex is a regular discussion: this is a bad idea. Instead, use a proper parser:

mech-dump

mech-dump --links --absolute --agent-alias='Linux Mozilla' <URL>

This comes with the package www-mechanize-perl (on Debian-based distros).

(Written by Andy Lester, the author of ack and more...)

mech-dump documentation

xidel or saxon-lint

Or an HTML-aware tool like xidel or saxon-lint:

xidel -se '//a/@href' <URL>
saxon-lint --html --xpath 'string-join(//a/@href, "^M")' <URL>

^M is Control+v Enter

xmlstarlet:

curl -Ls <URL> |
    xmlstarlet format -H - 2>/dev/null |  # convert broken HTML to HTML 
    xmlstarlet sel -t -v '//a/@href' -    # parse the stream with XPath expression
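Whichever extractor produces the list, post-processing it is plain line-oriented text work, for example keeping only absolute URLs and deduplicating (the sample URLs below are invented):

```shell
# Stand-in for the output of xidel/saxon-lint/xmlstarlet:
printf '%s\n' 'https://example.com/a' '/relative' 'https://example.com/a' |
    grep -E '^https?://' |   # keep absolute links only
    sort -u                  # drop duplicates
# → https://example.com/a
```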

javascript generated web page

You can even use XPath in a puppeteer JavaScript script:

const puppeteer = require('puppeteer');

var base_url = 'https://stackoverflow.com';

(async () => {
    const browser = await puppeteer.launch({
        headless: true,
    });

    // a page object is needed before calling page.* methods
    const page = await browser.newPage();

    // viewportSize
    await page.setViewport({'width': 1440, 'height': 900});

    // UA
    await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0')

    // open main URL
    await page.goto(base_url, { waitUntil: 'networkidle2' }); 

    const xpath_expression = '//a[@href]';
    await page.waitForXPath(xpath_expression);
    const links = await page.$x(xpath_expression);
    const link_urls = await page.evaluate((...links) => {
        return links.map(e => e.href);
    }, ...links);

    await browser.close();

    link_urls.forEach((elt) => console.log(elt));

})();

Usage:

nodejs retrieve_all_links.js
Gilles Quénot
  • I find this answer very helpful, and love seeing alternatives (didn't know some of these tools, except for `xmlstarlet`), however, I find that you might find my answer dissatisfactory (which is fair) given it mainly uses regex, which is frowned upon when it comes to parsing HTML. But, while I do know the popular post on SO that portray this well, this doesn't mean that you should never in any given circumstance, use regex for parsing HTML. A lot of Perl parsers and others use regex to split tokens/tokenize XML/HTML. Besides that, yes, if it's valid HTML, and given, *you know what you're doing* – Nordine Lotfi Dec 04 '22 at 22:05
  • (cont 2) you could, as an alternative, especially if your wants/goals are limited (eg: only want links, nothing else) use regex, because of the speed efficiency compared to a full parser. I discussed this at length in the `/dev/chat` room on unix.SE, if you're curious, [here](https://chat.stackexchange.com/transcript/message/58221156#58221156). I would never use regex as a full parser, unless I was a regex master/expert, with great knowledge of valid and broken HTML syntax, but, given that's not the case, I was only giving an option/alternative, to the OP :) (since they also used regex too) – Nordine Lotfi Dec 04 '22 at 22:10
  • +1, by the way, since I was just explaining my position on this, given what you said might apply to me here. I'll edit this in on my own answer too when possible. – Nordine Lotfi Dec 04 '22 at 22:11
  • One disadvantage of your solution (apart from using regex) is that relative URLs like `/questions` will not be processed at all (it's a valid link). – Gilles Quénot Dec 04 '22 at 22:17
  • yep, I mentioned this too in my "notes" on my answer :D well aware it doesn't take these into account. I only wanted to give an alternative, even if it's subpar. But I agree, it could be better – Nordine Lotfi Dec 04 '22 at 22:19
  • Edited my answer. Feel free to give feedback if you want. Thank you – Nordine Lotfi Dec 04 '22 at 22:30
  • I just now remembered an interesting answer I upvoted once, that portrayed what I meant better (when it comes to using regex for parsing HTML): https://stackoverflow.com/questions/4231382/what-to-do-regular-expression-pattern-doesnt-match-anywhere-in-string/4234491#4234491 just thought this was interesting :) – Nordine Lotfi Jan 03 '23 at 17:27

You can use the -s argument for curl; it enables silent mode, which hides the progress meter and error messages (add -S if you still want errors reported).

Benoît Zu

I realize it's not what the OP asked, but lynx, the text browser, is an easier option, e.g.:

pages=($(lynx -dump -hiddenlinks=listonly "$1" | awk '/EpisodeDownload/{print $2}'))

The example above scrapes podcast URLs from a particular site.
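The awk filter can be checked against a hand-made sample of lynx -dump output (the URLs are invented); note the single quotes around the awk script, so that $2 is interpreted by awk rather than expanded by the shell:

```shell
# Stand-in for the link list lynx -dump produces:
dump=' 1. https://example.com/EpisodeDownload/ep1.mp3
 2. https://example.com/about'

# $2 is the second whitespace-separated field, i.e. the URL:
printf '%s\n' "$dump" | awk '/EpisodeDownload/{print $2}'
# → https://example.com/EpisodeDownload/ep1.mp3
```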

Bruce Edge

The problem is that curl writes its progress meter and diagnostic messages to STDERR, while | only passes on STDOUT. See examples here.

Two possible solutions are:

  1. Pipe STDERR to STDOUT and then pipe that to grep:
     curl -v http://vimcasts.org/episodes/archive/ 2>&1 | grep archive
  2. Use the --stderr flag with a hyphen as its argument; this tells curl to write that output to STDOUT instead:
     curl -v --stderr - http://vimcasts.org/episodes/archive/ | grep archive
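The stdout/stderr split can be demonstrated without curl at all, using a small shell function as a stand-in ("content" plays the page body on stdout, "progress" the meter on stderr):

```shell
# emit: a stand-in for curl, writing the "page" to stdout
# and the "progress meter" to stderr
emit() { echo "content"; echo "progress" >&2; }

emit 2>/dev/null | grep progress || echo 'no match'  # stderr bypassed the pipe
emit 2>&1 | grep progress                            # merged, so grep sees it
# → no match
# → progress
```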
Max