
I'm having difficulty parsing a huge XML file (about 100 GB, with large nodes). I am trying to reduce the node sizes by deleting unnecessary tags — for example, any <text> tags.

If I use native XML parsers such as xmlstarlet

xmlstarlet ed -P -d '//text' file.xml

I run into the same out-of-memory problem.

Is there a safe way (with little memory footprint) to remove all <text></text> pairs without breaking the XML structure?

muru
Googlebot
    You're probably going to have to write your own tool that processes the XML iteratively rather than loading and parsing the entire document at once. e.g. using perl's [XML::Parser](https://metacpan.org/pod/XML::Parser) or [XML::Twig](https://metacpan.org/release/XML-Twig), or python's [lxml](http://lxml.de/). You may even find that by DIY, you don't even need to reduce the size prior to whatever actual processing you really want to do. Of course, this will inevitably sacrifice speed - neither perl nor python are anywhere near as fast as C but "works slowly" is better than "fails speedily". – cas Jun 04 '22 at 06:25

2 Answers


I suggest giving xml_grep a try: it will be slow but memory-efficient. It is part of perl-XML-Twig (or xml-twig-tools), a Perl module for processing huge XML documents in tree mode. You can use -v to exclude nodes by name. See man xml_grep, and test your commands on small inputs first.

Example:

xml_grep --nowrap -v 'text' input.xml > output.xml

Or with a progress bar, since it will take a long time:

pv input.xml | xml_grep --nowrap -v 'text' > output.xml

For the general case, you could use Python, Perl, Java, Ruby (nokogiri), or any similar language with a SAX/streaming XML module.

thanasisp

The following XSLT 3.0 stylesheet will do the job:

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0">
 <xsl:mode streamable="yes" on-no-match="shallow-copy"/>
 <xsl:template match="text"/>
</xsl:transform>

Caveat: you'll need a streaming XSLT processor, which in practice probably means Saxon Enterprise Edition, which is a commercial product from my company Saxonica.

Also note that processing speed is likely to be around 2 GB/minute, depending of course on the hardware.

The alternative is to write your own code to do it, using a SAX-like API.

Michael Kay