
I have an HTML document that, greatly simplified, looks like this:

<html>
  <body>
    <a href="...">...</a>
    <a href="...">...</a>
    <a href="...">...</a>
    ...
  </body>
</html>

I'd like to extract the URLs as line-delimited output. Enter xmllint:

$ xmllint --html --xpath //a/@href
href="..." href="..." href="..."

This returns each whole attribute, name included, and prints them space-delimited. How can I get just the values of the href attributes, one per line? I want output like this:

...
...
...

where ... is the URL found in the href attribute of each a element.

How can I format this output properly?

Naftuli Kay

2 Answers


Given input.html:

<html>
  <body>
    <a href="url1">link text 1</a>
    <a href="url2">link text 2</a>
    <a href="url3">link text 3</a>
    ...
  </body>
</html>

We can pipe xmllint's output to sed to get this result:

$ xmllint --html --xpath //a/@href input.html | sed 's/ href="\([^"]*\)"/\1\n/g'
url1
url2
url3

Explanation

With xmllint alone, we only get:

$ xmllint --html --xpath //a/@href input.html
 href="url1" href="url2" href="url3"%
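  • the trailing % is printed by the shell (zsh does this, for example) to show that the output does not end with a newline; it is not part of xmllint's output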

One benefit of Unix-like systems is Doug McIlroy's pipes: no single program has to do everything, and we are encouraged to combine programs to suit our needs.

Since xmllint's output alone is not what we want, we pipe it to a sed command, which:

  • searches for each href="URL" unit (the g flag repeats the substitution for every match on the line)
  • uses \( \) grouping to capture the URL part
  • replaces the whole match with \1\n, that is, the captured URL followed by a newline

In this way we combine xmllint and sed to obtain the desired line-delimited output, one URL per line.
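One caveat: \n in the replacement part of a sed substitution is a GNU sed extension, and other sed implementations may print a literal n instead. Where that is a concern, the same split can be sketched with grep -o and cut. The function name extract_hrefs is made up for illustration, and the printf line only mimics the xmllint output shown above rather than calling xmllint.

```shell
# Hypothetical helper (name chosen for this example): reads xmllint's
# space-delimited attribute list on stdin and prints one URL per line.
extract_hrefs() {
  # grep -o prints each href="..." match on its own line;
  # cut then keeps the text between the first pair of double quotes.
  grep -o 'href="[^"]*"' | cut -d'"' -f2
}

# Sample input imitating xmllint's output (no trailing newline).
printf ' href="url1" href="url2" href="url3"' | extract_hrefs
```

With real input, the printf would be replaced by the xmllint command from above, piped into extract_hrefs. Like the sed version, this assumes the URLs themselves contain no double quotes.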

clarity123

Have you considered using sed?

sed -n 's/.*href="\([^"]*\).*/\1/p'
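A minimal sketch of how this might be tried, assuming the command is meant to read the HTML itself rather than xmllint's output, and that each link sits on its own line as in the sample document. The wrapper name hrefs_per_line is hypothetical. Because the leading .* is greedy, a line holding several links yields only the last href on that line.

```shell
# Hypothetical wrapper (name chosen for this example) around the sed command
# above; reads HTML on stdin and prints the href value found on each line.
hrefs_per_line() {
  # -n suppresses default output; the p flag prints only lines where the
  # substitution matched, so lines without href are dropped.
  sed -n 's/.*href="\([^"]*\).*/\1/p'
}

# Sample document from the thread, one link per line.
hrefs_per_line <<'EOF'
<html>
  <body>
    <a href="url1">link text 1</a>
    <a href="url2">link text 2</a>
    <a href="url3">link text 3</a>
  </body>
</html>
EOF
```

For HTML that puts several links on one line, the xmllint-based pipeline is likely the safer choice, since xmllint parses the markup instead of relying on line layout.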

cesar