rlucas.net: The Next Generation

June 26th, 2007:

Relational Database Problems

Many of these problems arise because RDBMSs are typically designed with the conceit of being the sources of truth within the organization (e.g. invoice #1234 exists because the database says so), but are often used to reflect external truths about the world, which are input messily and which are themselves often shifting or subject to change in ways not contemplated by the schema.

I. The entity deduplication problem.

Alice is entering information about companies, and creates a record #1001 for “Acme Widgets.” She then adds a bunch of information, including relating other tables to Company #1001. For example, she may put in a press clipping about Acme Widgets, which gets linked by a link table to Company.id==1001.

Then, Bob comes along entering information about companies, and creates a record #1234 for “Acme Widget Corporation.” He adds stuff, and includes a different press clipping, which gets linked to Company.id==1234.

Later, Charlie arrives, and notices that there are two records starting with “Acme.” He investigates a bit, and discovers that “Acme Widget Corporation” is the full legal name of the company that is familiarly called “Acme Widgets.”

How can Charlie cause the database only to reflect a single Acme which is associated correctly with both press clippings?
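A sketch of the usual fix: inside a single transaction, repoint every child row at the surviving record, then delete the duplicate. The schema below (SQLite for illustration; the table and column names are hypothetical stand-ins for the Company and link tables above) shows the shape of it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
    -- link table: press clippings per company
    CREATE TABLE company_clipping (
        company_id INTEGER REFERENCES company(id),
        clipping TEXT);
    INSERT INTO company VALUES (1001, 'Acme Widgets');
    INSERT INTO company VALUES (1234, 'Acme Widget Corporation');
    INSERT INTO company_clipping VALUES (1001, 'Alice''s press clipping');
    INSERT INTO company_clipping VALUES (1234, 'Bob''s press clipping');
""")

with conn:  # one transaction: repoint children first, then drop the duplicate
    conn.execute(
        "UPDATE company_clipping SET company_id = 1001 WHERE company_id = 1234")
    conn.execute("DELETE FROM company WHERE id = 1234")

rows = conn.execute("SELECT company_id FROM company_clipping").fetchall()
print(rows)  # [(1001,), (1001,)] -- both clippings now hang off one Acme
```

The catch is that every link table referencing company needs its own UPDATE (miss one and you orphan data), and afterwards nothing in the schema remembers that #1234 ever existed.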

II. The undo problem.

I do some stuff. Then I change some stuff. Then later, I want to see how it was before I changed it. Tough luck.
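One partial remedy, if you control the schema, is an audit trigger that copies the old row into a history table before every change. A minimal SQLite sketch (table and trigger names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE company_history (id INTEGER, name TEXT, changed_at TEXT);
    -- before any update, squirrel away the row as it was
    CREATE TRIGGER company_audit BEFORE UPDATE ON company
    BEGIN
        INSERT INTO company_history VALUES (OLD.id, OLD.name, datetime('now'));
    END;
    INSERT INTO company VALUES (1001, 'Acme Widgets');
""")

conn.execute("UPDATE company SET name = 'Acme Widget Corporation' WHERE id = 1001")
old = conn.execute("SELECT name FROM company_history").fetchone()
print(old[0])  # Acme Widgets -- how it was before I changed it
```

It only answers "what was it before," though; deletes need their own trigger, and nothing here helps you roll the live row back.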

III. Bitemporality

Let’s include financials on our companies! Acme revenues are $12 M. (Wait a year.) Now Acme revenues are $15 M. But wait: we now need a couple of slots for revenues. Oh, OK, a revenue report is associated with a year. Acme 2006 revenues were $12 M, 2007 revenues were $15 M. Wait. I was wrong. Acme 2006 revenues were $9 M in reality. OK, update that. Now the boss calls up and wants to know who it was that we told the wrong 2006 number to. But we can’t tell him, since we don’t know what we thought we knew when we thought we knew it.

(In fairness, bitemporality is somewhat easier than the other ones; you just double-time-stamp everything. But it’s still a pain in the ass.)
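The double-time-stamp idea can be sketched as an append-only table carrying both the valid time (the fiscal year the number is about) and the transaction time (when we recorded the belief). A minimal SQLite sketch, with hypothetical schema and dates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE revenue (
    company_id  INTEGER,
    fiscal_year INTEGER,  -- valid time: the year the figure describes
    amount_musd INTEGER,
    recorded_at TEXT)     -- transaction time: when we believed it""")

# Never UPDATE; only INSERT, so every past belief is preserved.
conn.executemany("INSERT INTO revenue VALUES (?, ?, ?, ?)", [
    (1001, 2006, 12, '2007-01-15'),   # what we first believed about 2006
    (1001, 2007, 15, '2008-01-15'),
    (1001, 2006,  9, '2008-03-01'),   # the correction to 2006
])

# What did we believe Acme's 2006 revenue was as of 2007-06-01?
belief = conn.execute("""
    SELECT amount_musd FROM revenue
    WHERE company_id = 1001 AND fiscal_year = 2006
      AND recorded_at <= '2007-06-01'
    ORDER BY recorded_at DESC LIMIT 1""").fetchone()
print(belief[0])  # 12: the wrong number the boss is asking about
```

Dropping the `recorded_at` filter and taking the latest row gives the current belief ($9 M), so both questions stay answerable.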

IV. The incomplete multi-table entity problem

This one is not really an RDBMS problem so much as a Web application architecture problem.

Doug wants to create a record in our database — let’s say, a “publication.” The publication must have an “editor” (n:1) and at least one “author” (n:m, m >= 1). Doug gets to inserting the publication, but is stopped because it has a null editor_id, violating the editor constraint. D’oh! OK, we can work around this: good databases permit deferring constraint checks until the end of a transaction. But to make that work with your application, you now must tightly couple the transaction lifecycle between the RDBMS and the server part of the Web application.
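SQLite, for one, supports this deferral on foreign keys; a sketch (hypothetical schema) of the insert order that would otherwise fail:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE editor (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE publication (
        id INTEGER PRIMARY KEY,
        title TEXT,
        editor_id INTEGER NOT NULL
            REFERENCES editor(id) DEFERRABLE INITIALLY DEFERRED);
""")

conn.execute("BEGIN")
# Insert the publication first, with editor_id pointing at a row that
# does not exist yet; the deferred constraint lets this through.
conn.execute("INSERT INTO publication VALUES (1, 'Widget Quarterly', 42)")
conn.execute("INSERT INTO editor VALUES (42, 'Edna')")
conn.execute("COMMIT")  # the constraint is checked here, and now it holds
```

Note the coupling the post describes: both inserts must happen on the same connection inside the same open transaction, which is exactly what stateless HTTP makes awkward.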

Let’s try again with Doug. He inserts the publication, no sweat, and now can insert an editor. Then, he adds some other stuff, like some authors. (Maybe Doug is using an AJAX-y Web front-end that lets him add new authors on the fly without going to a new “screen,” or maybe he has to navigate between modal “screens” to do this.) Because he’s added multiple different entities (a publication, an editor, some authors), he gets to feeling that what he’s done is already written to the database. He leaves without hitting save. D’oh! Two things have now happened: first, the entire graph of all the entities he’s added is in limbo, and second, the server part of the Web app is holding open a (possibly scarce) connection to the RDBMS (which may or may not, depending on how fancy the example gets, be holding locks with a SELECT FOR UPDATE…).

The first sub-problem — the limbo — seems not that bad, because Doug is used to losing all his data when he forgets to hit “save” on a desktop application, so he’s fairly well trained and will avoid this. But if the program is “DRY” (Don’t Repeat Yourself), and especially if it had modal “screens,” Doug probably hit “save” or “update” or “submit” several times on interfaces that look like the normal (non-multi-table-entity graph) ones in order to add his multiple required entities (editors and authors). Therefore, Doug has a not unreasonable expectation that he’s done his part for saving, based on the fact that when he edits or creates an author entity in isolation, he uses that same workflow and it saves OK.

The second sub-problem — the RDBMS connection — is sort of more pernicious, because now Doug isn’t just losing (or jeopardizing) data he thought he’d saved; he’s potentially contributing to a denial or degradation of service for all users. This is sort of an artifact of the way that DB connection pooling evolved from the mid-90s to today. Traditionally, DB connections have been scarce and costly to set up (network I/O in addition to whatever socket, semaphore, or lock state had to be established). Therefore, the widespread practice evolved in Web application design to use connection pooling — where some subsystem in the server layer makes a bunch of connections to the database, and then does fast handoffs of DB connections to application requests, freeing them up when each request is done. That way, you can run, say, 300 near-simultaneous requests with, say, 30 database connections.
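The pooling arrangement can be sketched with a toy pool; everything here is a hypothetical illustration (real pools add health checks, reconnects, and transaction cleanup):

```python
import queue
import sqlite3

class Pool:
    """Toy connection pool: N connections shared among many requests."""

    def __init__(self, size):
        self._q = queue.Queue()
        for _ in range(size):
            self._q.put(sqlite3.connect(":memory:", check_same_thread=False))

    def acquire(self, timeout=1.0):
        # Blocks until a connection frees up; raises queue.Empty if the
        # pool stays exhausted (say, Dougs holding stale transactions).
        return self._q.get(timeout=timeout)

    def release(self, conn):
        self._q.put(conn)

pool = Pool(size=2)
c1 = pool.acquire()
c2 = pool.acquire()
try:
    pool.acquire(timeout=0.1)   # both connections are busy
except queue.Empty:
    print("pool exhausted")     # degradation of service for user #3
pool.release(c1)
pool.release(c2)
```

The timeout is the interesting knob: it is the pool-side version of the "fairly quickly time it out and kill it" policy the next paragraph argues you need.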

Of course, if you’re pooling database connections, you pretty much have to run your entire transaction, succeed or fail, within the space of one application request (hopefully a sub-500ms affair), since transactions bind to database connections, but connections don’t bind to end-clients (stateless HTTP again). You can try and bind the DB connection to a particular client’s cookie- or token-identified session (like Seaside does, I think), but then you lose out on the ostensible benefits of pooling — now you need 300 RDBMS connections for your 300 clients, and your DB machine is choked. What’s more, if Doug leaves his transaction open, and you don’t fairly quickly time it out and kill it, you could end up needing more than 300 DB connections for your 300 clients, because you also have 300 old “Dougs” who’ve left stale transactions open — and now your DB machine is crashed.

I will update this post periodically with other relational database problems, and, I hope, solutions.