
I have a couple of hundred HTML source files. I need to extract the contents of a particular <div> element from each of these files, so I'm going to write a script to loop through each file. The element structure is like this:

<div id='the_div_id'>
  <div id='some_other_div'>
    <h3>Some content</h3>
  </div>
</div>

Can anyone suggest a method by which I can extract the div `the_div_id` and all its child elements and content from a file using the Linux command line?

conorgriffin
  • I like that the answers here went the HTML route, rather than [the XML one](https://stackoverflow.com/a/22021857/785213). I for one would _much_ rather use CSS-style selectors than XPath ones, typically, unless I'm feeling particularly masochistic that day. – TheDudeAbides Feb 06 '20 at 02:39

4 Answers


The html-xml-utils package, available in most major Linux distributions, has a number of tools that are useful when dealing with HTML and XML documents. Particularly useful for your case is hxselect, which reads from standard input and extracts elements based on CSS selectors. Your use case would look like:

hxselect '#the_div_id' <file

You might get a complaint about the input not being well formed, depending on what you feed it. This complaint goes to standard error, so it can easily be suppressed if needed. An alternative would be to use Perl's HTML::Parser module; however, I will leave that to someone whose Perl skills are less rusty than my own.
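Since the question mentions looping over a couple of hundred files, here is a minimal sketch of such a loop. The `*.div.html` output names are my own choice, and `hxnormalize -x` (also from html-xml-utils) is run first to coerce sloppy markup into well-formed XML before `hxselect` sees it:

```shell
# Loop over every HTML file in the current directory and write the
# extracted div to a sibling .div.html file. hxnormalize -x cleans up
# the input so hxselect does not reject it as not well-formed.
for f in *.html; do
  hxnormalize -x "$f" 2>/dev/null | hxselect '#the_div_id' > "${f%.html}.div.html"
done
```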

Steven D
  • `hxselect` is more picky about input format than `pup`. For instance, I get `Input is not well-formed. (Maybe try normalize?)` from `hxselect` where `pup` just parses it. – A B Jul 19 '16 at 22:32
  • @AB True, but `hxselect` is available in my distro's apt repository for me, and `pup` isn't. – starbeamrainbowlabs Mar 07 '20 at 01:02
  • Try `hxnormalize` on the file before `hxselect` – Eyal Jan 04 '21 at 05:35

Try pup, a command line tool for processing HTML. For example:

pup '#the_div_id' < file.html
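pup can also post-process the match: assuming a reasonably recent pup, the `text{}` and `json{}` display filters emit plain text or JSON instead of raw HTML:

```shell
# Text content only, tags stripped:
pup '#the_div_id text{}' < file.html
# JSON description of the matched nodes:
pup '#the_div_id json{}' < file.html
```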
Trevor Dixon

Here's an untested Perl script that extracts <div id="the_div_id"> elements and their contents using HTML::TreeBuilder.

#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;
foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($file_name);
    for my $subtree ($tree->look_down(_tag => "div", id => "the_div_id")) {
        my $html = $subtree->as_HTML;
        $html =~ s/(?<!\n)\z/\n/;
        print $html;
    }
    $tree = $tree->delete;
}

If you're allergic to Perl, Python has html.parser in its standard library.
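To illustrate that route, here is a minimal sketch using Python 3's html.parser; the `DivExtractor` class and `extract_div` helper are names of my own invention, and character references come back decoded rather than verbatim:

```python
from html.parser import HTMLParser

class DivExtractor(HTMLParser):
    """Collects the raw markup of the <div> with a given id, children included."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # <div> nesting depth inside the target; 0 = outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.chunks.append(self.get_starttag_text())
            if tag == 'div':
                self.depth += 1
        elif tag == 'div' and dict(attrs).get('id') == self.target_id:
            self.chunks.append(self.get_starttag_text())
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        if self.depth:                      # self-closing tags, e.g. <br/>
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == 'div':
            self.depth -= 1
        self.chunks.append(f'</{tag}>')     # also emits the target's own </div>

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract_div(html, div_id):
    """Return the markup of the div with id=div_id, or '' if absent."""
    parser = DivExtractor(div_id)
    parser.feed(html)
    return ''.join(parser.chunks)
```

Wrapped in a loop over `sys.argv[1:]`, this covers the same ground as the Perl script above.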

P.S. Do not try using regular expressions.

Gilles 'SO- stop being evil'

Here is an Ex one-liner to extract that part from each file:

ex -s +'bufdo!/<div.*id=.the_div_id/norm nvatdggdG"2p' +'bufdo!%p' -cqa! *.html

To save/replace in place, change `-cqa!` into `-cxa` and remove the `%p` section. To recurse into subdirectories, consider using globbing (`**/*.html`).

For each buffer/file (`bufdo`), it performs the following actions:

  • /pattern - find the pattern
  • norm - start simulating normal Vi keystrokes
    • n - jump into next pattern (required in Ex mode)
    • vatd - remove the selected outer tag section (see: jumping between html tags)
    • ggdG - remove the whole buffer (equivalent to :%d)
    • "2p - re-paste the previously deleted text

It is perhaps not very efficient, and `:bufdo` is not POSIX, but it should work.

kenorb
  • note bufdo is not POSIX http://pubs.opengroup.org/onlinepubs/9699919799/utilities/ex.html – Zombo Apr 17 '16 at 00:43