[HINT] Preprocessing mongo XML files for use with XML::Simple

If you are a reasonable Perlista, the first thing you will do when you
have to do some modest but non-trivial munging of data locked up in XML
is to reach for XML::Simple.  The API is nearly perfect for
comprehensibility and transparency (apart from a few defaults that
could be stricter).

However, if you prototype on a small document and then run your
code on a much bigger one, you will find the drawback:
tree-building is costly, and you may spend the vast majority of your
program's time just parsing the document.  One handy solution is to
preprocess your XML: run XML::Simple's XMLin sub once, and use
Data::Dumper to write the resulting structure out to a file.
When you want to use it, you can simply "eval" the file's contents,
since they define a native Perl structure, and the remainder of your
code stays unchanged.  For me this gave a 2x to 10x speedup, depending
on the document and its size.
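
Here is a minimal sketch of that round trip, not my exact code.  The
tiny inline document, the temp file and the XMLin options (ForceArray
and KeyAttr, chosen here for predictable output) are all assumptions
for illustration; in practice you would dump once in one script and
eval in another.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);
use Data::Dumper;
use File::Temp qw(tempfile);

# Hypothetical stand-in for a big XML document.
my $xml = '<library><book><title>A</title></book>'
        . '<book><title>B</title></book></library>';

# Step 1 (done once): parse, then dump the tree as Perl source.
my $tree = XMLin($xml, ForceArray => 1, KeyAttr => []);

my ($out, $dump_file) = tempfile(SUFFIX => '.dd', UNLINK => 1);
print {$out} Dumper($tree);          # writes "$VAR1 = { ... };"
close $out or die "close failed: $!";

# Step 2 (every later run): slurp the dump and eval it.
open my $in, '<', $dump_file or die "cannot read $dump_file: $!";
my $code = do { local $/; <$in> };
close $in;

our $VAR1;                           # the dump assigns to $VAR1
my $loaded = eval $code;             # value of the assignment
die "eval failed: $@" if $@;

print $loaded->{book}[1]{title}[0], "\n";   # prints "B"
```

The "our $VAR1" line is only there because the string eval runs under
"use strict"; without it the assignment inside the dump would be
rejected.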

However, now imagine that you have some real torture-test data: 10
MB, heavily nested monstrosities of XML.  The Dumper output of the
parsed tree is now approaching 100 MB!  Slurping this in and
evaling it is now the real problem.

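Here's an idea: rather than slurping and evaling, try inlining it at
the compilation stage.  That's right: when perl is given no script
file it reads its program from standard input, which is a much more
efficient way of getting the dump compiled than a string eval.  Feed
it the dump followed by your script through a pipe (the dump assigns
to $VAR1 unless you told Dumper otherwise, so the script can just use
that variable):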

cat preprocessed_xml.dd myscript.pl | perl

It's somewhat unorthodox, but entirely functional.  Combined with
judicious use of gzip, this could be a very efficient way to get
little-changing XML documents into perl quickly — often very important
when doing dev work for which numerous iterations are required and for
which a minutes-long parse stage would adversely affect progress.
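
To make the mechanics concrete, here is a small self-contained sketch
that writes a toy dump and a toy script to a temp directory and then
runs the same cat-and-pipe command on them.  The data and the script
body are made up for illustration, and it assumes a Unix-like system
with cat on the path.  For the gzip variant, something like
"gzip -dc preprocessed_xml.dd.gz | cat - myscript.pl | perl" should do
the same job.

```perl
use strict;
use warnings;
use Data::Dumper;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $perl = $^X;                      # the perl running this demo

# Hypothetical stand-in for a large parsed tree.
my $tree = { book => [ { title => 'A' }, { title => 'B' } ] };

# The dump file is plain Perl source: "$VAR1 = { ... };"
open my $dd, '>', "$dir/preprocessed_xml.dd" or die "open: $!";
print {$dd} Dumper($tree);
close $dd or die "close: $!";

# The script relies on $VAR1 having been set by the text prepended
# to it.  "use strict" only applies from this point on, so the bare
# assignment in the dump above it still compiles.
open my $pl, '>', "$dir/myscript.pl" or die "open: $!";
print {$pl} <<'END';
use strict;
use warnings;
our $VAR1;    # declared so strict accepts the dump's variable
print scalar @{ $VAR1->{book} }, "\n";
END
close $pl or die "close: $!";

# Dump first, script second, compiled by perl as one program.
my $out = `cat "$dir/preprocessed_xml.dd" "$dir/myscript.pl" | "$perl"`;
print $out;                          # prints "2"
```

The order matters: the dump has to come first so that $VAR1 is
assigned before the script's own statements run.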

Update: It occurred to me that using Storable or a Cache::* module
might be faster yet.  At this
point, my work proceeds with tolerable speed using Data::Dumper, plus I
like using Dumper so that I can edit the output structures by hand if
need be.  But perhaps you should try those modules if you need
even better performance, or cringe at the hackishness of catenating
files piped to perl.
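
For the curious, a Storable version might look roughly like this.  I
have not benchmarked it against the Dumper approach; the data and the
temp file name are placeholders, and nstore is used instead of store
only so that the file is portable across machines.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

# Hypothetical stand-in for the tree XMLin would return.
my $tree = { book => [ { title => ['A'] }, { title => ['B'] } ] };

# Done once: serialize the tree in Storable's binary format.
my (undef, $cache_file) = tempfile(SUFFIX => '.sto', UNLINK => 1);
nstore($tree, $cache_file);

# Every later run: read it straight back, no Perl source to compile.
my $loaded = retrieve($cache_file);

print scalar @{ $loaded->{book} }, "\n";    # prints "2"
```

The trade-off is the one mentioned above: the stored file is binary,
so you lose the ability to edit the structure by hand.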
