I used to use the following command to get all links on a web page and then grep for what I want:

curl "$URL" 2>&1 | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | egrep "$CMP-[0-9]\.[0-9]\.[0-9]$" | cut -d'-' -f3
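For context, here is what the extraction chain does, run over an invented href with a hypothetical component name in $CMP (the name contains one dash, which is why -f3 picks out the version):

```shell
CMP='my-pkg'   # hypothetical component name
html='<a href="my-pkg-1.2.3">download</a>'

printf '%s\n' "$html" |
    grep -o -E 'href="([^"#]+)"' |          # isolate href="my-pkg-1.2.3"
    cut -d'"' -f2 |                         # strip the quotes -> my-pkg-1.2.3
    grep -E "$CMP-[0-9]\.[0-9]\.[0-9]$" |   # keep only name-X.Y.Z entries
    cut -d'-' -f3                           # keep the version -> 1.2.3
# → 1.2.3
```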

It worked fine until yesterday. When I ran curl by itself, I saw it return only:

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                               Dload  Upload   Total   Spent    Left  Speed
0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

Could some update have caused the command to stop working?

EDIT 1:

Following this answer, I switched my approach to wget:

wget -q "$URL" -O - | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | egrep "$CMP-[0-9]\.[0-9]\.[0-9]$" | cut -d'-' -f3

But I still don't know why the curl approach suddenly stopped working.

Zeinab Abbasimazar

5 Answers


Warning: Using regex to parse HTML is a bad idea in most (if not all) cases, so proceed at your own discretion.


This should do it:

curl -f -L URL | grep -Eo "https?://\S+?\""

or

curl -f -L URL | grep -Eo '"(http|https)://[a-zA-Z0-9#~.*,/!?=+&_%:-]*"'
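To sanity-check the second pattern without touching the network, you can run it over a hand-written snippet (the HTML below is invented for illustration):

```shell
# Hypothetical stand-in for the page fetched by curl:
html='<p><a href="https://example.com/download">dl</a> <a href="/docs">docs</a></p>'

# The pattern keeps only absolute http(s) URLs, quotes included;
# the relative /docs link is dropped:
printf '%s\n' "$html" | grep -Eo '"(http|https)://[a-zA-Z0-9#~.*,/!?=+&_%:-]*"'
# → "https://example.com/download"
```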

Note:

  • This does not handle relative links, where only part of the full URL appears in the href attribute (what I originally, and wrongly, called "half a link"). EDIT: Gilles Quenot kindly provided a solution for relative links:
curl -Ls URL | grep -oP 'href="\K[^"]+'
  • This also doesn't "clean" anything that isn't part of the link itself (e.g. a trailing "&" character). If you want to remove that, pipe through sed or similar:
curl -f -L URL | grep -Eo "https?://\S+?\"" | sed 's/&.*//'
  • Lastly, this does not cover every way a link can appear, so some knowledge of the page's structure or of HTML is required. Since you can't/don't show an example of that structure or of the webpage itself, it is difficult to write an answer that works on it without more HTML-specific handling.

  • P.S.: This may or may not be obvious, but this also doesn't catch links/URLs that are generated dynamically (e.g. by PHP, JS, etc.), since curl only retrieves the static page.

  • P.S. (2): If you want a better way to parse HTML, use the answer from Gilles Quenot instead of my solution above (which doesn't handle every corner case, given the lack of an HTML sample); it offers more complete and better-optimized support for HTML syntax.

I am in no way recommending regex for parsing HTML, unless you know what you're doing or have very limited needs (e.g. you only want links), as in this case.
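For a quick illustration of why the relative-link variant matters, here it is on another invented snippet; unlike the absolute-URL patterns above, it also returns `/questions` (GNU grep is assumed, since -P needs PCRE support):

```shell
html='<a href="https://example.com/a">x</a> <a href="/questions">y</a>'

# \K discards the href=" prefix from the match, leaving only the URL:
printf '%s\n' "$html" | grep -oP 'href="\K[^"]+'
# → https://example.com/a
# → /questions
```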

Nordine Lotfi
    Plus one, but I do not understand your last comment. `php` is serverside and `js` clientside. Both can manipulate/insert html. So links generated from scripting languages should be seen by `curl` I think. If links are only visible after a user input like `click` or `hover`, then curl will not see them because they are injected after the curl call. – Timo Dec 04 '22 at 06:21
  • That's what I meant yeah @Timo if links are generated dynamically (eg: without being statically written on the HTML page) they won't be seen by curl at any point unless it's server-side, but most of the time, it won't work with just curl, and would require a full browser/other ways (eg: nodejs, etc). Thanks by the way :) – Nordine Lotfi Dec 04 '22 at 08:54
    To see what I mean, you can use a curses/TUI browser like `w3m`, `lynx`, or others, and look around the web at different webpages. A lot of them won't deliver their full content to you, even if you use the right cookies/user-agent, _because_ many of those links are dynamically generated, or the site checks for certain functions that the browser does not have (or because of other things, such as captchas and whatnot that do not work with those browsers). It's basically similar with `curl` too. @Timo – Nordine Lotfi Dec 04 '22 at 16:44
    A URL containing a space that is not URI-encoded is still a valid link. Better to use `curl -Ls 'https://stackoverflow.com' | grep -oP 'href="\K[^"]+'` if you insist on using regex, even though you said it's a bad habit. Then you will also get _relative_ links. – Gilles Quénot Dec 04 '22 at 22:40
    Thank you! Did not think about that at the time. I'll add this in and credit you. :) @GillesQuenot – Nordine Lotfi Dec 04 '22 at 22:44

Parsing HTML with regex is a regular discussion: this is a bad idea. Instead, use a proper parser:

mech-dump

mech-dump --links --absolute --agent-alias='Linux Mozilla' <URL>

This comes with the package www-mechanize-perl (on Debian-based distros).

(Written by Andy Lester, the author of ack and more...)

mech-dump documentation

xidel or saxon-lint

Or an HTML-aware tool like xidel or saxon-lint:

xidel -se '//a/@href' <URL>
saxon-lint --html --xpath 'string-join(//a/@href, "^M")' <URL>

^M is Control+v Enter

xmlstarlet:

curl -Ls <URL> |
    xmlstarlet format -H - 2>/dev/null |  # convert broken HTML to HTML 
    xmlstarlet sel -t -v '//a/@href' -    # parse the stream with XPath expression
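Whichever extractor produces the list, post-processing it is plain line-oriented text work, for example keeping only absolute URLs and deduplicating (the sample URLs below are invented):

```shell
# Stand-in for the output of xidel/saxon-lint/xmlstarlet:
printf '%s\n' 'https://example.com/a' '/relative' 'https://example.com/a' |
    grep -E '^https?://' |   # keep absolute links only
    sort -u                  # drop duplicates
# → https://example.com/a
```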

javascript generated web page

You can even use XPath in a puppeteer JavaScript script:

const puppeteer = require('puppeteer');

var base_url = 'https://stackoverflow.com';

(async () => {
    const browser = await puppeteer.launch({
        headless: true,
    });

    // a page object is needed before calling page.* methods
    const page = await browser.newPage();

    // viewportSize
    await page.setViewport({'width': 1440, 'height': 900});

    // UA
    await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0')

    // open main URL
    await page.goto(base_url, { waitUntil: 'networkidle2' }); 

    const xpath_expression = '//a[@href]';
    await page.waitForXPath(xpath_expression);
    const links = await page.$x(xpath_expression);
    const link_urls = await page.evaluate((...links) => {
        return links.map(e => e.href);
    }, ...links);

    await browser.close();

    link_urls.forEach((elt) => console.log(elt));

})();

Usage:

nodejs retrieve_all_links.js
Gilles Quénot
  • I find this answer very helpful, and love seeing alternatives (didn't know some of these tools, except for `xmlstarlet`), however, I find that you might find my answer dissatisfactory (which is fair) given it mainly uses regex, which is frowned upon when it comes to parsing HTML. But, while I do know the popular post on SO that portray this well, this doesn't mean that you should never in any given circumstance, use regex for parsing HTML. A lot of Perl parsers and others use regex to split tokens/tokenize XML/HTML. Besides that, yes, if it's valid HTML, and given, *you know what you're doing* – Nordine Lotfi Dec 04 '22 at 22:05
  • (cont 2) you could, as an alternative, especially if your wants/goals are limited (eg: only want links, nothing else) use regex, because of the speed efficiency compared to a full parser. I discussed this at length in the `/dev/chat` room on unix.SE, if you're curious, [here](https://chat.stackexchange.com/transcript/message/58221156#58221156). I would never use regex as a full parser, unless I was a regex master/expert, with great knowledge of valid and broken HTML syntax, but, given that's not the case, I was only giving an option/alternative, to the OP :) (since they also used regex too) – Nordine Lotfi Dec 04 '22 at 22:10
  • +1, by the way, since I was just explaining my position on this, given what you said might apply to me here. I'll edit this in on my own answer too when possible. – Nordine Lotfi Dec 04 '22 at 22:11
  • One disadvantage of your solution (apart from using regex) is that relative URLs like `/questions` will not be processed at all (it's a valid link). – Gilles Quénot Dec 04 '22 at 22:17
  • yep, I mentioned this too in my "notes" on my answer :D well aware it doesn't take these into account. I only wanted to give an alternative, even if it's subpar. But I agree, it could be better – Nordine Lotfi Dec 04 '22 at 22:19
  • Edited my answer. Feel free to give feedback if you want. Thank you – Nordine Lotfi Dec 04 '22 at 22:30
  • I just now remembered an interesting answer I upvoted once, that portrayed what I meant better (when it comes to using regex for parsing HTML): https://stackoverflow.com/questions/4231382/what-to-do-regular-expression-pattern-doesnt-match-anywhere-in-string/4234491#4234491 just thought this was interesting :) – Nordine Lotfi Jan 03 '23 at 17:27

You can use the -s argument for curl; it enables silent mode, which hides the progress meter and error messages (add -S if you still want errors reported).

Benoît Zu

I realize it's not what the OP asked, but lynx, the text browser, is an easier option, e.g.:

pages=($(lynx -dump -hiddenlinks=listonly "$1" | awk '/EpisodeDownload/{print $2}'))

The example above scrapes podcast URLs from a particular site.
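The awk filter can be checked against a hand-made sample of lynx -dump output (the URLs are invented); note the single quotes around the awk script, so that $2 is interpreted by awk rather than expanded by the shell:

```shell
# Stand-in for the link list lynx -dump produces:
dump=' 1. https://example.com/EpisodeDownload/ep1.mp3
 2. https://example.com/about'

# $2 is the second whitespace-separated field, i.e. the URL:
printf '%s\n' "$dump" | awk '/EpisodeDownload/{print $2}'
# → https://example.com/EpisodeDownload/ep1.mp3
```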

Bruce Edge

The problem is that curl writes its progress meter and diagnostic messages to STDERR, while | only passes on STDOUT. See examples here.

Two possible solutions are:

  1. Pipe STDERR to STDOUT and then pipe that to grep:
     curl -v http://vimcasts.org/episodes/archive/ 2>&1 | grep archive
  2. Use the --stderr flag with a hyphen as its argument; this tells curl to write that output to STDOUT instead:
     curl -v --stderr - http://vimcasts.org/episodes/archive/ | grep archive
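The stdout/stderr split can be demonstrated without curl at all, using a small shell function as a stand-in ("content" plays the page body on stdout, "progress" the meter on stderr):

```shell
# emit: a stand-in for curl, writing the "page" to stdout
# and the "progress meter" to stderr
emit() { echo "content"; echo "progress" >&2; }

emit 2>/dev/null | grep progress || echo 'no match'  # stderr bypassed the pipe
emit 2>&1 | grep progress                            # merged, so grep sees it
# → no match
# → progress
```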
Max