rlucas.net: The Next Generation Rotating Header Image

October 12th, 2018:

Self-learning, fault-tolerant parsing

I was reading this piece on “Human Grade Parsers” from jckarter and it reminded me of something that’s been sticking in my brain for the past 18 months or so and refusing to go away.

The idea is for a fault-tolerant adaptive engine that I can apply to data ingestion, combining the ideas of ML and parsers.

For (what has now become) my side project, I get a lot of data files that come in, typically in CSV but also XLS, fixed width, and some oddball proprietary forms, generally with about a kilobyte of numerical data per record.  Happily, though, they mostly all should resolve to something similar: a rectangle of (mostly) floats.  (Sometimes it’s strings or dates or ints but the interesting bits are largely dollar values.)

As I (or my teammates / contractors) have banged out adhoc parsers for this mess, it gets tedious: over and over again, there is a particular step applied.  Sometimes it’s a bit of code you can copy-paste or, if feeling ambitious, try and abstract into a separate method / function.  Sometimes, it’s a bit of analysis you need to apply, hopefully only once, while writing code, usually by looking at a few of the input files: is this column the unique ID?  Is this column a zip code?  A zip+4?  A “only the first five of the zip, if there’s a +4 it’s in the next column?”  Etc., etc.  Only rarely is there something that’s truly unfamiliar or tricky.

The temptation grows to try to block out these various steps into modules like “try breaking this on commas” or “try breaking this on tabs” or more relevantly, “try using some heuristics and a fitness function to search all the possible fixed-width column boundaries to find a list of column break indices.”

Of course, writing this — even if the code itself is modular — as a series of try/catch blocks that sort of tend to fit the shape of the input file/string/stream is still tedious.  What I get to thinking is this: I should have a grammar that defines valid productions, and goes through like a recdescent parser, and tries to find the best valid production for a given input.  At various steps, there may be fitness measures, and there is an ultimate set of fitness measures (or features?) that can be observed: things like, the number of cols per record, the agreement as to number of cols per record, the average size of a record, the presence of certain characteristics (at least one unique ID column, at least one mostly uniqueish string name column), and the overall number of records per input.

The valid candidate productions from the grammar would basically be programs.  If they execute successfully on an input, they would generate an output.  Some of the productions, though, would be variadic in certain elements.  For example, the “find the column break indices” step would require a list of column breaks.

Is this just recapitulating “genetic programming?”  It’s not quite random.  But it does generate a possibly large number of candidates that I will rely upon ML type techniques to optimize for.

Am I just exercising recency bias to think about it as a parsing type problem now?  Parsers usually (in my limited experience) try to return the first or best production but the rules are kind of binary — works or doesn’t — and I will have some rules that could result in several OK but not optimal productions that will need to later be sorted and picked from.

Copious free time here.  But the allure of writing one CSV/XLS(X)/fixed-width parser to rule them all is strong…