
I have a couple of hundred HTML source files. I need to extract the contents of a particular <div> element from each of these files, so I'm going to write a script to loop through each file. The element structure is like this:

<div id='the_div_id'>
  <div id='some_other_div'>
    <h3>Some content</h3>
  </div>
</div>

Can anyone suggest a method by which I can extract the div `the_div_id` and all its child elements and content from a file using the Linux command line?

conorgriffin
  • I like that the answers here went the HTML route, rather than [the XML one](https://stackoverflow.com/a/22021857/785213). I for one would _much_ rather use CSS-style selectors than XPath ones, typically, unless I'm feeling particularly masochistic that day. – TheDudeAbides Feb 06 '20 at 02:39

4 Answers


The html-xml-utils package, available in most major Linux distributions, has a number of tools that are useful when dealing with HTML and XML documents. Particularly useful for your case is hxselect, which reads from standard input and extracts elements based on CSS selectors. Your use case would look like:

hxselect '#the_div_id' <file

You might get a complaint about the input not being well formed, depending on what you feed it. This complaint goes to standard error, so it can easily be suppressed if needed. An alternative would be to use Perl's HTML::Parser module; however, I will leave that to someone whose Perl skills are less rusty than my own.
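Since the question mentions looping over a couple of hundred files, here is a minimal sketch of such a loop. The `*.div.html` output names are my own choice, and `hxnormalize -x` (also from html-xml-utils) is run first to coerce sloppy markup into well-formed XML before `hxselect` sees it:

```shell
# Loop over every HTML file in the current directory and write the
# extracted div to a sibling .div.html file. hxnormalize -x cleans up
# the input so hxselect does not reject it as not well-formed.
for f in *.html; do
  hxnormalize -x "$f" 2>/dev/null | hxselect '#the_div_id' > "${f%.html}.div.html"
done
```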

Steven D
  • `hxselect` is more picky about input format than `pup`. For instance, I get `Input is not well-formed. (Maybe try normalize?)` from `hxselect` where `pup` just parses it. – A B Jul 19 '16 at 22:32
  • @AB True, but `hxselect` is available in my distro's apt repository for me, and `pup` isn't. – starbeamrainbowlabs Mar 07 '20 at 01:02
  • Try `hxnormalize` on the file before `hxselect` – Eyal Jan 04 '21 at 05:35

Try pup, a command line tool for processing HTML. For example:

pup '#the_div_id' < file.html
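pup can also post-process the match: assuming a reasonably recent pup, the `text{}` and `json{}` display filters emit plain text or JSON instead of raw HTML:

```shell
# Text content only, tags stripped:
pup '#the_div_id text{}' < file.html
# JSON description of the matched nodes:
pup '#the_div_id json{}' < file.html
```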
Trevor Dixon

Here's an untested Perl script that extracts <div id="the_div_id"> elements and their contents using HTML::TreeBuilder.

#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;
foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($file_name);
    for my $subtree ($tree->look_down(_tag => "div", id => "the_div_id")) {
        my $html = $subtree->as_HTML;
        $html =~ s/(?<!\n)\z/\n/;
        print $html;
    }
    $tree = $tree->delete;
}

If you're allergic to Perl, Python has html.parser in its standard library.
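To illustrate that route, here is a minimal sketch using Python 3's html.parser; the `DivExtractor` class and `extract_div` helper are names of my own invention, and character references come back decoded rather than verbatim:

```python
from html.parser import HTMLParser

class DivExtractor(HTMLParser):
    """Collects the raw markup of the <div> with a given id, children included."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # <div> nesting depth inside the target; 0 = outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.chunks.append(self.get_starttag_text())
            if tag == 'div':
                self.depth += 1
        elif tag == 'div' and dict(attrs).get('id') == self.target_id:
            self.chunks.append(self.get_starttag_text())
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        if self.depth:                      # self-closing tags, e.g. <br/>
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == 'div':
            self.depth -= 1
        self.chunks.append(f'</{tag}>')     # also emits the target's own </div>

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract_div(html, div_id):
    """Return the markup of the div with id=div_id, or '' if absent."""
    parser = DivExtractor(div_id)
    parser.feed(html)
    return ''.join(parser.chunks)
```

Wrapped in a loop over `sys.argv[1:]`, this covers the same ground as the Perl script above.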

P.S. Do not try using regular expressions.

Gilles 'SO- stop being evil'

Here is an Ex one-liner to extract that part from each file:

ex -s +'bufdo!/<div.*id=.the_div_id/norm nvatdggdG"2p' +'bufdo!%p' -cqa! *.html

To save/replace in place, change `-cqa!` into `-cxa` and remove the `%p` section. To recurse into subdirectories, consider using globbing (`**/*.html`).

For each buffer/file (`bufdo`), it performs the following actions:

  • /pattern - find the pattern
  • norm - start simulating normal Vi keystrokes
    • n - jump into next pattern (required in Ex mode)
    • vatd - remove the selected outer tag section (see: jumping between html tags)
    • ggdG - remove the whole buffer (equivalent to :%d)
    • "2p - re-paste the previously deleted text

It is perhaps not very efficient, and `:bufdo` is not POSIX, but it should work.

kenorb
  • note bufdo is not POSIX http://pubs.opengroup.org/onlinepubs/9699919799/utilities/ex.html – Zombo Apr 17 '16 at 00:43