rlucas.net: The Next Generation

tools

Self-learning, fault-tolerant parsing

I was reading this piece on “Human Grade Parsers” from jckarter and it reminded me of something that’s been sticking in my brain for the past 18 months or so and refusing to go away.

The idea is for a fault-tolerant adaptive engine that I can apply to data ingestion, combining the ideas of ML and parsers.

For (what has now become) my side project, I get a lot of data files that come in, typically in CSV but also XLS, fixed width, and some oddball proprietary forms, generally with about a kilobyte of numerical data per record.  Happily, though, they mostly all should resolve to something similar: a rectangle of (mostly) floats.  (Sometimes it’s strings or dates or ints but the interesting bits are largely dollar values.)

As I (or my teammates / contractors) have banged out ad hoc parsers for this mess, it gets tedious: the same particular steps get applied over and over again.  Sometimes it’s a bit of code you can copy-paste or, if feeling ambitious, try to abstract into a separate method / function.  Sometimes, it’s a bit of analysis you need to apply, hopefully only once, while writing code, usually by looking at a few of the input files: is this column the unique ID?  Is this column a zip code?  A zip+4?  An “only the first five of the zip, if there’s a +4 it’s in the next column?”  Etc., etc.  Only rarely is there something that’s truly unfamiliar or tricky.

The temptation grows to try to block out these various steps into modules like “try breaking this on commas” or “try breaking this on tabs” or more relevantly, “try using some heuristics and a fitness function to search all the possible fixed-width column boundaries to find a list of column break indices.”
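As a rough illustration of the fixed-width module, here is a minimal Python sketch. It does not search every possible set of boundaries with a fitness function; it swaps in a much simpler heuristic, treating any character position that is blank in every record as a gap and starting a new column wherever a gap ends. The function names, the `min_blank_fraction` parameter, and the sample records are all hypothetical.

```python
def find_column_breaks(lines, min_blank_fraction=1.0):
    """Guess the indices at which new columns start in fixed-width records.

    Assumption: columns are separated by at least one position that is a
    space in (almost) every record. This is a heuristic, not a full search.
    """
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if not lines:
        return []
    width = max(len(ln) for ln in lines)
    padded = [ln.ljust(width) for ln in lines]
    # Fraction of records that have a space at each character position.
    blank_fraction = [
        sum(ln[i] == " " for ln in padded) / len(padded) for i in range(width)
    ]
    is_gap = [frac >= min_blank_fraction for frac in blank_fraction]
    # A column starts wherever a non-gap position follows a gap position.
    return [i for i in range(1, width) if is_gap[i - 1] and not is_gap[i]]


def split_fixed_width(line, breaks):
    """Cut one record at the given break indices and trim each field."""
    edges = [0] + list(breaks) + [len(line)]
    return [line[a:b].strip() for a, b in zip(edges, edges[1:])]


# Made-up sample: a 6-wide id, a 9-wide name, and a 6-wide right-aligned amount.
sample = [
    f"{rec_id:<6}{name:<9}{amount:>6}"
    for rec_id, name, amount in [
        ("1001", "Alice", "12.50"),
        ("1002", "Bob", "300.00"),
        ("1003", "Carol", "7.25"),
    ]
]
sample_breaks = find_column_breaks(sample)
sample_rows = [split_fixed_width(ln, sample_breaks) for ln in sample]
```

Lowering `min_blank_fraction` below 1.0 would tolerate the occasional record that spills into a gap, at the cost of more false breaks.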

Of course, writing this (even if the code itself is modular) as a series of try/catch blocks that sort of tend to fit the shape of the input file/string/stream is still tedious.  What I get to thinking is this: I should have a grammar that defines valid productions, walks the input like a recursive-descent parser, and tries to find the best valid production for a given input.  At various steps, there may be fitness measures, and there is an ultimate set of fitness measures (or features?) that can be observed: the number of cols per record, the agreement as to number of cols per record, the average size of a record, the presence of certain characteristics (at least one unique ID column, at least one mostly unique-ish string name column), and the overall number of records per input.
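To make the fitness idea concrete, here is a small Python sketch that scores a few candidate ways of splitting records and ranks them. Only two of the measures above are used: agreement on the number of columns per record, and (as a stand-in for "mostly floats") the fraction of cells that parse as numbers. The candidate set, the weighting, and every name in the block are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical candidate splitters; a real engine would have many more.
CANDIDATES = {
    "comma": lambda line: line.split(","),
    "tab": lambda line: line.split("\t"),
    "pipe": lambda line: line.split("|"),
}


def column_agreement(rows):
    """Fraction of records whose column count equals the most common count."""
    counts = Counter(len(row) for row in rows)
    return counts.most_common(1)[0][1] / len(rows)


def float_fraction(rows):
    """Fraction of all cells that parse as a float (after dropping '$')."""
    cells = [cell for row in rows for cell in row]
    hits = 0
    for cell in cells:
        try:
            float(cell.strip().lstrip("$"))
            hits += 1
        except ValueError:
            pass
    return hits / len(cells) if cells else 0.0


def fitness(rows):
    """Combine the two measures; the 50/50 weighting is arbitrary."""
    if not rows:
        return 0.0
    modal_cols = Counter(len(row) for row in rows).most_common(1)[0][0]
    if modal_cols < 2:
        return 0.0  # a single column means the split did nothing useful
    return column_agreement(rows) * (0.5 + 0.5 * float_fraction(rows))


def rank_candidates(lines):
    """Return (score, name, rows) tuples, best candidate first."""
    scored = []
    for name, splitter in CANDIDATES.items():
        rows = [splitter(line) for line in lines]
        scored.append((fitness(rows), name, rows))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Made-up input: tab-separated, with commas inside the name field.
demo_lines = ["id\tname\tamount", "1\tSmith, J\t10.5", "2\tDoe, A\t20.0"]
demo_ranking = rank_candidates(demo_lines)
```

On the demo input the comma splitter still "works" (it produces rows), but it scores lower than the tab splitter because the header row disagrees on column count and no cell parses as a number.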

The valid candidate productions from the grammar would basically be programs.  If they execute successfully on an input, they would generate an output.  Some of the productions, though, would be variadic in certain elements.  For example, the “find the column break indices” step would require a list of column breaks.
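One way to picture "productions as programs" is a toy grammar whose productions are small callables, some of which take a variadic list of parameters. The Python sketch below enumerates every production of such a grammar, runs each on the input, and keeps only those that execute successfully. The two-rule grammar, the cap on the number of breaks, and all names are hypothetical; brute-force enumeration is used only because the example is tiny.

```python
from itertools import combinations


def delimited(delim):
    """Production: split each record on a fixed delimiter."""
    def run(line):
        fields = line.split(delim)
        if len(fields) < 2:
            raise ValueError(f"no {delim!r} in record")
        return fields
    return run


def fixed_width(*breaks):
    """Production: cut each record at a variadic list of break indices."""
    def run(line):
        if breaks and breaks[-1] >= len(line):
            raise ValueError("break index past end of record")
        edges = (0, *breaks, len(line))
        return [line[a:b].strip() for a, b in zip(edges, edges[1:])]
    return run


def productions(max_width, max_breaks=2):
    """Yield (description, program) for every candidate the toy grammar allows."""
    for delim in (",", "\t"):
        yield f"delimited({delim!r})", delimited(delim)
    for n in range(1, max_breaks + 1):
        for breaks in combinations(range(1, max_width), n):
            yield f"fixed_width{breaks}", fixed_width(*breaks)


def run_all(lines):
    """Run every production; keep the output of those that execute cleanly."""
    width = min(len(line) for line in lines)
    results = {}
    for description, program in productions(width):
        try:
            results[description] = [program(line) for line in lines]
        except ValueError:
            continue  # this production does not execute on this input
    return results


toy_results = run_all(["ab,cd", "ef,gh"])
```

Even for five-character records this yields eleven surviving candidates, which hints at how quickly the space grows and why some scoring step is needed to choose among them.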

Is this just recapitulating “genetic programming?”  It’s not quite random.  But it does generate a possibly large number of candidates that I will rely upon ML type techniques to optimize for.

Am I just exercising recency bias to think about it as a parsing type problem now?  Parsers usually (in my limited experience) try to return the first or best production but the rules are kind of binary — works or doesn’t — and I will have some rules that could result in several OK but not optimal productions that will need to later be sorted and picked from.

Copious free time here.  But the allure of writing one CSV/XLS(X)/fixed-width parser to rule them all is strong…

 

 

Vim 7 is Incompatible with the Vimspell Plugin

On my Cygwin environment on Win XP, Vim 7 appears to run fine with one exception: the vimspell.vim plugin. It appears that Vimspell conflicts with the new built-in spell check functionality in Vim 7. The symptom is that, as one starts to type, a massive number of doubled or missed letters start to appear (and, no, I was not drunk when I noticed this). Removing the vimspell.vim plugin fixes the problem.

However, I like to use a consistent ~/.vim directory across all my shells, so that I can store it in CVS and enjoy the same settings on every system. To do this across a heterogeneous environment of Vim 6 and Vim 7 boxen, I have made the following change to vimspell.vim:

61c61
< if exists("loaded_vimspell") || &compatible
---
> if exists("loaded_vimspell") || &compatible || v:version >= 700

This will short-circuit out of vimspell before it loads if the version is Vim 7.0 or above.

Outlook to Remind (out2rem) Converter Script v0.0.1

Update: I have fixed some stuff (time format and placement of AT keyword) and have posted v0.0.2 at the link below.

Please find here a short Perl script to dump out your Microsoft Outlook appointments in Remind format.

This should be useful to those of you who, like me, are tracking the whole plaintext / console / CLI resurgence as indicated here among other places.

If you have suggestions, please drop me an email at rlucas at tercent.com, and / or add helpful notes to the 43Folders Wiki.

Vim 7.0 Delights and Amazes with Beautified Auto-completion

Randall Lucas 2006-06-21

Vim, the text editor extraordinaire, came out with version 7.0 last month. At some point, unwittingly, I had updated my work computer — Cygwin under Windows XP — using the Cygwin setup.exe file, not expecting any major version number changes. Hence, I didn’t even realize that Vim 7 was now on my machine.

This morning, I was doing as I normally do in writing a document with many recurrences of the same word — “ctrl-N” to cycle through possible completions of the word — when an odd grey-and-purple blob appeared below my cursor, filled with words! Unsettled, I lifted my fingers from the keys — what was this colorized monstrosity?

And then I realized. Vim was giving me “tool tips.” Here, in a console window, using naught but VT100 control codes. Jaded IDE addicts will say: “sure, but my GUI IDE has had those for years.” Perhaps. But I can use my tool tips in a German cybercafe, over an SSH session from a Danger Hiptop, or over a serial line in a generic data center.

For Perl, populating the tool tips with syntactically valid items (method names, operators, etc.) will be hard, at least according to the conventional notion that “only perl can parse Perl.” But, for Ruby, Python, and, should the need arise, C or Java, adding syntax-awareness (see “:help complete-items”) should be just an exercise in glue coding.

If you manipulate text (and if you aren’t already an adept of another cult editor), then by all means get Vim!