When you hear “The Blockchain,” grab your wallet

February 2nd, 2018

All the “thoughtful” executive types (VCs, C-levels, analysts) now seem to share the same low-risk opinion about crypto-currencies.  (“Superficial contrarianism,” perhaps)

It goes like this: “Well, Bitcoin itself is in a bit of a speculative frenzy [knowing chuckle, transitioning to sober contemplation] … however, we think there’s tremendous promise in The Blockchain”

A year or three ago, this seemed like a reasonable stance (give or take the adjective “tremendous”). In 2014, mail-order heroin and low-stakes gambling were basically the only uses for Bitcoin itself.

But 2017 has been the year of the ICO, and the major coins have pretty demonstrably become targets of Real Money by now. So, where are all these promising applications?

Here are some real live examples of the cockamamie schemes that are purporting to use “The Blockchain” as of January 2018. Names have been withheld to protect the fatuous:

  • Copyright (or other intellectual property) registration
  • Paying people to look at advertisements on the Internet
  • CARFAX (automobile history)
  • Employment resume / C.V. verification
  • Basically all of corporate finance (M&A, debt, equity)
  • Foreign Exchange

What do all these things have in common? They are real-world difficulties that are worth paying money to solve. What else do they have in common? They are all things that can be quite acceptably solved with pencil and paper, or at most, 1990s Oracle RDBMS-type technology.

Why would people change their behavior and pay good money to move these processes to The Blockchain? Surely these proponents have some logic: there must be some particularly compelling property of The Blockchain, right?

  • “It’s decentralized!”
  • “It is totally immutable!”
  • “It’s totally automatic [‘smart contracts!’] with no way to cheat!”

If you hear this level of glib bullshittery being slung your way as rationale for using The Blockchain for some application, better hold onto your wallet.

Thinking about markets, not technology

In a parallel to the abuse of terminology around the word “disruption,” where people misappropriate Clay Christensen’s theory and talk about technologies themselves as “disruptive,” blockchain cheerleaders talk about technological features as if they were ipso facto benefits.

But “decentralized” or “immutable” or “automatic” or whatever aren’t necessarily benefits, much less unique and transformative catalysts for market value.

Starting with The Blockchain and trying to shoehorn it into various market needs is backwards-minded and almost always going to fail. Here’s why.

The Blockchain is not a fundamental technological capability on its own. Rather, it’s a clever combination of three particular features – each of which is a well-understood [if arcane] and widely-available capability. If you need these three particular capabilities for your market application, then The Blockchain is your huckleberry:

1. Path-dependent, unchanging history,
… that works decentralized with
2. untrusted nodes and counterparties,
…and wherein you need to have
3. cheating discouraged by math, not by secrets.

What do these three things mean?

Path-dependent, unchanging history – More or less, this is a “ledger,” where the final state is dependent upon all the changes that went before. One key difference, though, is that in ledgers, you can often go back and insert transactions you forgot to record. Another is that, with ledgers, the final state doesn’t really depend on whether you cashed check #101 before #102, as long as they both got cashed before the reporting date. If you need to keep a ledger that can never be changed (fixed; or, in fairness, cooked) and where it really matters what order everything happened, you need yourself one of these. Happily, we more or less have them and we call them “append-only databases.”
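
To make “path-dependent, unchanging history” concrete, here is a minimal sketch in Python (deliberately nothing like Bitcoin’s actual block format) of a hash-chained, append-only log, where every entry’s hash depends on everything recorded before it:

import hashlib
import json

def entry_hash(prev_hash, record):
    """Each entry's hash depends on the previous hash, so order and history both matter."""
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

ledger = []          # append-only by convention: we only ever append (record, hash) pairs
prev = "0" * 64      # arbitrary genesis value

for record in ["cash check #101", "cash check #102"]:
    prev = entry_hash(prev, record)
    ledger.append((record, prev))

# Re-deriving the hashes from the start detects any edit, insertion, or reordering
# of past entries -- the "can never be changed (or cooked)" property.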

Untrusted nodes and counterparties – In other words, do you need to reliably transact while also expecting that each person you deal with is trying to screw you, possibly with the collusion of at least several others? If you do, my apologies, and you should seek better friends, but at least you have the benefit of decades of academic research on things like the “Byzantine Generals problem,” along with algorithms that can be proven to keep things on the level even when dealing with a network of partially treacherous counterparties.

Cheating discouraged by math, not by secrets – Most of the time when we want to discourage cheaters or thieves, we require some secret-ish piece of information, like a PIN number, a password, or even the pattern of notches carved into a physical metal key. As long as you can make sure the non-cheaters have the secrets (and none of the cheaters have the secrets) this works well. If you can’t be sure of this, though, you can make sure that would-be cheaters have to prove that they’ve done lots of long division and shown their work longhand on paper, which takes some amount of time no matter how clever they are. This is more or less the idea of “proof of work.”
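
For a flavor of what “proof of work” means in practice, here is a toy sketch (plain Python, nothing like Bitcoin’s real difficulty scheme): producing the nonce is slow no matter how clever you are, while checking it takes a single hash:

import hashlib
from itertools import count

def proof_of_work(message, difficulty=4):
    # Find a nonce such that sha256("message:nonce") starts with `difficulty` hex zeros.
    target = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256((message + ":" + str(nonce)).encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest

nonce, digest = proof_of_work("my batch of transactions")
# Producing (nonce, digest) took many hash attempts on average; verifying it takes exactly one.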

[Note: After circulating a draft of this post privately, I realized that I almost certainly read Tim Bray’s “I Don’t Believe in Blockchain” at some point last year and unconsciously plagiarized much of the above.  But, even on a re-read this is still mostly true and definitely how I’ve been thinking about this stuff.  So, although I promise I wrote all the words in the section above, let’s consider most of the ideas, right down to the use of specific phrases, to have come from Tim.]

Here’s the obligatory Venn diagram:

http://rlucas.net/crypto_currency_venn.pdf

Now, what kind of ice-cream sundae do you make with these three particular scoops? Bitcoin, of course. But what is it really useful for? Well, I would say, a medium-sized economy among participants who expect many of the others to be thieves, who are afraid of getting cheated by welchers yet need to trade with these snakes, and who have no way outside of their interaction to share secrets between themselves.

In other words, you have a technological mélange purpose-built for mail-order heroin. That’s basically it.

OK, snarky guy, are you saying there are absolutely no apps for The Blockchain?

Let’s try to be fair-minded here. What are some use cases that really could leverage (or perhaps even require) all of these core capabilities?

The Domain Name System et al. For example, Namecoin. You can imagine that the DNS might one day be (or is already) too big or unwieldy for ICANN to administer through its ICANN/Registry/Registrar/Registrant system. You might then imagine that it becomes much more practical to allow a blockchain-based system where a node can lay claim to a particular unused name after a proof of work, even if others upon seeing it would like to steal it away. This checks the boxes as being a good fit. However, for DNS (and for ARIN and BGP tables and many other things) simply having a means to definitively establish the truth isn’t enough; there’s also the operational side of giving up to date answers very quickly to billions or trillions of queries, for which the actual blockchain piece is useless.

Voting. If you want to have an actually fair election, where all the votes can be proven in an immutable and universally verifiable way, even with the vote-counters as adversaries, this could be a really good app. (Unfortunately it’s a dismal business.) Consider that here you expect everyone else to be a thief/welcher/vote-rigger, the scale of the problem calls for decentralization, and there are real practical issues with handing out secrets ahead of time.

Distributed Computation 1: Storage. For example, Filecoin. If you want censorship-proof storage of information, but you need to incentivize participants, Filecoin seems well thought-through. In order to serve both privacy and censor-resistance needs, you need to treat everyone as a thief, and you need to prevent anyone from changing history to be truly censorship-proof. Finally, in order to be sure someone has earned their incentive you need a proof of work. Very clever. Unfortunately, this is also a dismal business, mainly because the cost of storage keeps dropping, and the sorts of things that you really need to keep in a censor-resistant data store, at least in a rule-of-law Western society, are either truly nasty (kiddie porn, nuclear weapons secrets) or simply not that lucrative (proof of governmental wrongdoing / whistleblower leaks).

Distributed Computation 2: CPU. This isn’t built yet. But you could imagine a Filecoin-type system meant to incentivize participants to conduct computation on their own devices and send the results home – sort of a SETI@Home for the blockchain. The problem here is that to really reach venture scale, you will need to allow not just specialized processing of SETI signals, or protein folding, or various of the other specialized decomposed computation problems that are well known, but truly general computing that allows customers to reliably and securely get ad hoc compute jobs of all sorts done. This is much harder than storage; in storage, I can encrypt files or blocks and have a reasonable certainty that the actual storage nodes will never be able to read or tamper with them. In compute, this gets a lot trickier. Imagine doing, say, massively distributed OCR or video rendering. The algorithms to do the heavy lifting will require the input data in an unencrypted, unobscured form – the OCR will need to actually “see” the image in order to read the words. If you are running a malicious node, you can “see” the page I’m sending to you and you could likewise interfere with the results, without me ever knowing. It seems technologically plausible to be able to transform at least some subset of compute tasks in a way that effectively encrypts or obscures both the computation and the I/O – but to do so in a way that is efficient seems quite tricky. If this nut gets cracked, you could see something truly transformative, as it would have the functional properties of Filecoin but with potentially far more attractive unit economics.

What about applications that are partial fits, using at least 2/3 of these core capabilities?

Title plants (land ownership). In the U.S., thanks to some ancient English traditions we’ve inherited, nobody definitively knows who owns what piece of land. The way we solve for this is the quiet but giant title insurance industry, who guarantees the ownership interest. They in turn underwrite the ownership by accumulating “title plants,” which are basically databases from individual jurisdictions (counties, parishes, states, etc.) going back as far as they possibly can, showing all of the valid changes in title. A title plant is an immutable historical append-only record. It’s presently centralized (and this makes for quite a lot of jobs in county recorders’ offices around the country). But it also means that for transactions that “touch” the title to real estate, there’s a small but significant delay and hassle in recording all such transactions with the centralized keeper of the record. If title plants were maintained on blockchains, you could imagine that real estate transactions – not only sales, but mortgages, liens, etc. – could close in seconds at the swipe of a finger or the call of an API. Is this a venture scale opportunity? Certainly if you controlled it all, you could charge a handsome fee and displace the entire title insurance industry. However, it’s unclear that there is any market pressure to do this: most sales, mortgages, and liens involve many other moving parts and changing the turnaround time from one business day to a few seconds on the recordation is of questionable standalone value.

Anti-spam (and robocalls, etc.). It could be that a mix of the proof-of-work and byzantine generals capabilities might be a great recipe for collaborative approaches to stopping or reducing spam and robocalls, by imposing costs on nodes and sub-networks for bad behavior. There are lots of approaches to this problem, though, and it’s not clear that even 2/3 of the blockchain is important – proof-of-work alone might be enough (or you could also just impose tiny charges on each message without fancy crypto math).

That’s it. I’m sure there are more to be found, but not vast greenfields of more applications.

The Blockchain is more like XML than like the Web.

The Web exposed a way to connect mostly-already-existing information and transactions to millions, and shortly thereafter, billions, of people. Subsequently, and building upon that network, new kinds of interactions that were now possible became commonplace (i.e. “Web 2.0” or “social” and “collaborative” technologies).

But The Blockchain isn’t like that. The Blockchain is a technological mishmash of capabilities or features that, in the right spot, can help solve some tricky coordination problems. It doesn’t, on its own, connect people to things they largely couldn’t do before.

In this way, The Blockchain is more like XML. Does anyone remember the 1998-2003 timeframe and the hoopla about XML? At the time, a very valid point was made that markup languages (well, really just HTML) were part of a huge important wave (Web 1.0), and the thought was that this could be extended to all parts of the information economy. XML, as the sort of generalized version of HTML (pace, markup language nerds, I know “Generalized” has a special meaning to you but work with me here), was going to be transformative! After all, it was:

– Vendor agnostic! (hah)
– Text-based and human readable! (hah)
– Stream or Document parseable with off-the-shelf tools! (hah)

Well, it turns out that those things, even when they were all true, were just capabilities that were well suited for some real-world problems and not so much for others. In fact, it was mainly suited for data interchange in a B2B context, or for certain kinds of document transformations. Not particularly compelling for much else.

What was actually important then was the particular markup flavor we call HTML, and the absurd gold rush surrounding it at the time. HTML seemed approachable and it was El Dorado. Everyone and her brother were “coding” HTML and learning how to FTP GIFs to their servers. Eventually, most of those people and companies failed, but in the midst of that gold rush they created a ton of real value.

[One commenter who was early in the promulgation of XML objected a bit here.  I think this is fair.  I’m not saying HTML came from XML; rather, I’m saying that attempts to generalize the basic underpinnings of HTML/HTTP — the human-readable-ish markup for wide interop — fell short, because the tech underpinnings were less important than the entrepreneurial and speculative frenzy.]

What the real story of Bitcoin is.

The real story here is quite the opposite of the new sober conventional wisdom. It’s not the underlying blockchain technology that’s fundamental, transformative, or even particularly interesting.

What’s interesting is precisely the speculative bubble: the specific crypto-coin(s), or more importantly, the frenzy of activity by fortune-seekers drawn to this El Dorado.

The actual underlying technology is a bunch of particular features and capabilities, some of which have been around for decades, that happen to be very useful for the main original Bitcoin use-case (mailing around LSD blotter paper with the “B$” logo, or whatever).

The Blockchain of the dilettante investor’s imagination is some kind of brave new philosopher’s stone that can upend any business process. Bullshit.

The real opportunities here are going to be generated by lots of aggressive and more or less smart people feverishly trying new, heretofore “unfundable” things, financed by their own crypto-currency gains or more likely by the wild gambling of the pilers-on who missed the first wave. Some of these gold-rushers are going to create things of real value and of venture scale, if only by a stochastic process.

s3put fails with ssl.CertificateError suddenly after upgrade

September 17th, 2015

We had been using periods / dots in Amazon S3 bucket names in order to create some semblance of namespace / order. Pretty common convention.

A short while ago a cron job doing backups stopped working after some Python upgrades. Specifically, we were using s3put to upload a file to “my.dotted.bucket”. The error was:

ssl.CertificateError: hostname 'my.dotted.bucket.s3.amazonaws.com' doesn't match either of '*.s3.amazonaws.com', 's3.amazonaws.com'

It turns out that per Boto issue #2836 a recent strictifying of SSL certificate validation breaks the ability to validate the SSL cert when there are extra dots on the LHS of the wildcard. Boo.

If you don’t have the luxury of monkey-patching (or actually patching) the code that sits atop this version of boto, you can put the following section into your (possibly new) ~/.boto config file:

[s3]
calling_format = boto.s3.connection.OrdinaryCallingFormat
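
Alternatively, if you control the code that creates the connection, a sketch like this (untested here, and assuming boto 2.x) achieves the same thing per-connection rather than globally:

import boto
from boto.s3.connection import OrdinaryCallingFormat

# Path-style addressing avoids the wildcard-cert mismatch for dotted bucket names.
conn = boto.connect_s3(calling_format=OrdinaryCallingFormat())
bucket = conn.get_bucket('my.dotted.bucket')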

(Of course, expect all of the nasty MITM attacks that stricter SSL validation is meant to mitigate to come back and bite you!)

DB Transaction “BEGIN” in Django shell

September 3rd, 2015

Django provides a handy “shell” which can be invoked using the manage.py for a project, and which will usefully set up the necessary Django environment and even invoke ipython for completion, syntax highlighting, debugger, etc.

Also usefully, but very much separate from the shell functionality, Django provides a nice framework for dealing with database transactions through its ORM. One can use django.db.transaction.rollback() for example.

However, the shell by default will be invoked with autocommit, meaning that each individual SQL statement gets committed. When one is poking around freehand in the shell, this might not be for the best, so one may want to turn off autocommit and retain the option to rollback().

Unfortunately for that use case, all of the Django infrastructure for beginning database transactions is focused on how to begin a transaction in your code, where it rightly would be expected to be within a function or at least a “with” block. Hence, the docs and the code focus on using decorators, e.g. “@transaction.commit_on_success” or context managers, e.g. “with transaction.commit_on_success():“. Obviously not helpful in the shell / REPL.

If you are in your “manage.py shell” and need to do some romping around in your single-database Django app while being wrapped up in the warm fuzzy security blanket of a DB transaction lest you fat-finger something, you can get the same effect for your subsequent few commands in the shell with:

from django.db import transaction
transaction.enter_transaction_management()
transaction.managed(True)
# do stuff
imp = my_models.ImportantObject(title="Emabrassing Tpyos In the Titel")
imp.save()
# oops
transaction.rollback()
# this is too stressful, let's quit
transaction.leave_transaction_management()

Caveats: this only works in a one-database-connection setup where using the default connection does what you want; newer versions of Django may have a nice way to do this; don’t trust my random blog post with your production data!
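
For what it’s worth, on newer Django (1.6 and later), an untested sketch of the same freehand-shell idea would lean on transaction.set_autocommit rather than the older managed-transaction calls (ImportantObject being the same hypothetical model as above):

from django.db import transaction

transaction.set_autocommit(False)   # open a transaction on the default connection
imp = my_models.ImportantObject(title="Another Dubious Title")
imp.save()
transaction.rollback()              # or transaction.commit() if you like what you did
transaction.set_autocommit(True)    # back to normal autocommit behavior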

s3put just stops working with “broken pipe”

July 21st, 2015

So your cron job, which has been dutifully stuffing away into s3 your backups nightly or hourly or whatever, just stops working. s3put just breaks with the unhelpful complaint, “broken pipe.”

You can try running s3put with “--debug 2” added to your flags, and watch the lower protocol-level stuff seem to go along just fine until it barfs with the same error.

Check the size of your file. If you’ve got a backup that’s been slowly creeping up in size and is now over 5.0 GB, that’s your issue. AWS apparently has a 5 GB S3 limit for a single-part HTTP PUT.

s3put accepts a “--multipart” option, but only if it can find the necessary Python libraries including “filechunkio,” so install filechunkio and try again. With any luck, you can just add --multipart to your s3put command and it will Just Work.
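
If you’d rather do the multipart upload from Python directly, the classic boto 2.x + FileChunkIO recipe looks roughly like the sketch below; the bucket name, file path, and part size are made up, so adjust to taste.

import math, os
import boto
from filechunkio import FileChunkIO

conn = boto.connect_s3()
bucket = conn.get_bucket('my-backup-bucket')

path = '/var/backups/big-backup.tar.gz'
size = os.stat(path).st_size
chunk = 100 * 1024 * 1024                      # 100 MB parts
mp = bucket.initiate_multipart_upload(os.path.basename(path))
for i in range(int(math.ceil(size / float(chunk)))):
    offset = i * chunk
    with FileChunkIO(path, 'r', offset=offset,
                     bytes=min(chunk, size - offset)) as fp:
        mp.upload_part_from_file(fp, part_num=i + 1)
mp.complete_upload()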

Python matrix initialization gotcha

June 30th, 2015

If you want to spin up a list of lists — a poor man’s matrix — in Python, you may want to initialize it first. That way you can use indices to point directly (random access) into the matrix, with something like:

matrix[i][j][k]

without having to worry whether you’ve managed to make the matrix “big enough” through appending, looping, whatever.

If you are an idiot like me, you will skim StackOverflow and come away with the naive use of the “*” operator to create lists.

In [1]: lol = [[[None]*1]*3]*2

In [2]: lol
Out[2]: [[[None], [None], [None]], [[None], [None], [None]]]

That seems to work fine for our case — a small 3-D matrix (trivial in the third dimension I admit) initialized to None, the pseudo-undefined object of Python. Sounds good. Wait…

In [3]: lol[0][0][0] = 'asdf'

In [4]: lol
Out[4]: [[['asdf'], ['asdf'], ['asdf']], [['asdf'], ['asdf'], ['asdf']]]

Um. The * operator didn’t copy the inner lists; it just repeated references to the same list objects. So assigning into one “slot” changes the single shared list that appears in every slot, which looks like changing it everywhere.

Facepalm.
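
A quick way to convince yourself of what happened (same lol as above) is to check object identity:

lol = [[[None]*1]*3]*2

# The three inner lists are literally one object referenced three times:
lol[0][0] is lol[0][1]   # True
lol[0] is lol[1]         # True: same story one level up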

To do what you actually want to do, use the list comprehension syntax and leave the monstrosity of the * operator alone:


In [21]: lolfixed = [[[None for k in range(1)] for j in range(3)] for i in range(2)]

In [22]: lolfixed
Out[22]: [[[None], [None], [None]], [[None], [None], [None]]]

In [23]: lolfixed[0][0][0] = 'asdf'

In [24]: lolfixed
Out[24]: [[['asdf'], [None], [None]], [[None], [None], [None]]]

Danielle Morrill is mostly right about VC deal sourcing – here’s how she’s wrong.

March 13th, 2015

Danielle Morrill has put out a fascinating TechCrunch article about the art vs. science of how VCs “source deals” (find investments). It’s a rare candid peek into a side of venture capital, from a perspective that is foreign to most writing from outside the industry. Danielle is spot on with most of her article, but there are a few glaring holes in the picture she paints.

First, what’s right:

1. Old guard vs. new guard. Danielle is absolutely correct that in VC, as in most professions, the older cohort is in conflict with the newer, rising cohort. And in general, the older cohort is the group that controls the power and the economics: by definition, they’re the survivors who’ve made it to late-career stage and have done pretty well, and so they will tend to bring a status quo bias. Change always threatens the status quo — even in an industry that outwardly worships “disruption.”

2. Cargo cult or “pigeon superstition.” Danielle hits the nail on the head: most VCs — firms and individuals — refer to “pattern matching” and rely on it to a degree that is often indistinguishable from superstition. Generals are often guilty of “fighting the last war” instead of seeing new situations for what they are, and the same is true with investors.

Now, Danielle is a promoter and a hustler (and I mean both terms in the good sense), and it’s natural for her to think about the world — and to critique the VC industry — in terms of an aggressive outbound sales process. But there’s a problem with this approach. VC is not sales, and seeing only through a sales lens will give you a distorted view.

Here’s what Danielle’s article missed by a mile:

1. Deal Types, and why sourcing doesn’t suffice. From the investor’s perspective, there are two types of VC deals.

  • Type 1: Deals that will get done whether you do them or not.
  • Type 2: Deals that won’t necessarily get done if you don’t do them.

Type 1 deals are “obvious.” Given decent market conditions (e.g. not in a crash/panic), they are going to get funded by *somebody*. For example: I just heard a pitch from two Stanford grads, one of whom had already started and sold a company, and whose traffic was growing 10-12% a week in a reasonably hot sector. They want a reasonably sized and priced Series A round. That deal is going to happen, no matter what.

If you want to invest in a Type 1 deal, you have to *win* it. You probably have to “source” it to win it, but that alone won’t do. You need to pay a higher price (valuation), and/or bring more value to the table (domain expertise, industry connections, personal competence / trust).

    • Pricing: To be able to profitably pay a higher valuation, you need to have a pricing knowledge edge (which is, in more traditional areas of finance, *the* edge — why do you think Wall Street pays all that money for high-performance computing?). To get that edge, you must know something that other investors don’t about the industry, technology, or people.
    • Value-add: To bring more value to the table, you need to have something rare and desirable, and which you can *apply* from the board / investor level. That typically means you’ve previously made deep “investment” of skill, connections, and knowledge in the relevant industry, technology, or people.

Winning Type 1 deals isn’t about sourcing. It’s about front-loaded work: work spent building up knowledge, connections, reputation, skill, etc., and then demonstrating and exploiting that front-loaded work to add value.

Type 2 deals aren’t obvious. Maybe the team is somehow incomplete; maybe the sector is out of favor. Maybe there’s “hair on the deal,” as we say when things are complicated. Maybe there’s no actual company yet, as happens with spin-outs or “incubated” ideas.

If you want to invest in a Type 2 deal, you have to *build* it. It’s true you need to “source” it, but often times “it” doesn’t even look like a deal when you first learn of it. Maybe that means building up a team or negotiating a technology license from a corporate parent. Maybe it means helping make crucial first customer intros (and watching the results). Maybe it means “yak shaving,” getting some of that wooly hair off of the deal and cleaning it up. (I like that metaphor better than polishing the diamond-in-the-rough, but same idea.)

Almost always, it means building a syndicate. That starts with building consensus and credibility with the entrepreneurs and within one’s own firm. Then, it usually means building the co-investor relationship and trust needed to get the round filled out. (For Type 1 deals, it’ll tend to be easy to win over partners and co-investors. Not so Type 2.)

Winning Type 2 deals isn’t about sourcing. It’s about back-loaded work: building up a fundable entity and a syndicate to support it.

2. Pattern matching and non-obvious rationality.

The way that VCs approach “pattern matching” seems irrational when Danielle describes it. Why do those behaviors persist? It’s because they’re rational, in a non-obvious way.

If you lose money on a deal, you’ll be asked “what happened.” If your answer is “X,” and you’ve never encountered X before, there’s a narrative tidiness to the loss, especially if you resolve not to let X happen again.

If you lose money on a subsequent deal, and you also say it’s due to “X,” then things start to get problematic. Losing money on X twice starts to sound like folly. What’s the George W. Bush line? “Fool me twice … um, … you can’t get fooled again.”

If you lose money three times due to “X,” well, then, you will have real problems explaining to your upstream investors why that was a good use of their money. (Even if, in a Platonic, rational sense, it was.)

In an early-stage tech world swirling with risks, so many you can’t possibly control them all, you grab hold of a few risk factors that you *can* control, the ones which, if they bite you again, will carry outsized career / reputation / longevity consequences for you. And that gets called “pattern matching.”

(The same applies on the upside. If you attribute making money to Y once, it’s nice. But if you make money twice in a row, and claim that it was due to Y, and your early identification and exploitation of Y, you look like a prescient investing genius.)

Now, I don’t believe that the “X factors” and “Y factors” are all meaningless, or that pattern matching is a worthless idea. But even if you did believe that (overly cynical) idea, given the reasoning above, you should still consider it rational for VCs to behave exactly the same regarding “pattern recognition.”

3. Teams do work.

Although Danielle is right that sourcing, winning, and, ultimately, exiting profitable deals is the formula for individual success in VC, that framing ignores the very real role that firm “franchise” and teamwork can and should play.

Throwing ambitious investor types into the same ring and letting them fight it out like wild dogs isn’t the route to VC firm success. Well, in certain markets it probably works very well — but it’s incredibly wasteful of time and talent to have unmitigated, head-on competition between a firm’s own investors.

No. In fact, teams can and do work. Danielle’s own article manages to quote both Warren Buffett and his longtime partner, the less well-known but still mind-bogglingly successful investor Charlie Munger. Do you think Berkshire Hathaway board meetings are dominated by infighting as to whether Charlie or Warren deserves credit for the latest M&A deal? Hell, no.

Teams work in investing when, between teammates, there’s enough similarity to ease the building of mutual trust and respect, but enough difference to bring something new and useful to the shared perspective. That can be a difference in geographic, sector, or stage focus, as is classically the case. Or, I would argue, it can even be a difference in the part of the lifecycle of a VC investment that best suits a particular investor.

Let’s do a thought experiment. Let’s assume we have two partners in our firm, GedankenVC: Danielle Morrill’s clone, Danielle2, and Charlie Munger’s clone, Charlie2.

Assume there’s a hot new startup out there, let’s call it Software-Defined Uber for Shoes (SDUfS), led by a young charismatic team, who’s intent on building out a social media presence, throwing parties and events to attract energetic new employees, handing out free custom shoes around San Francisco, and otherwise making the best of their recent oversubscribed $2.5 M seed round.

Who’s going to source that deal? Danielle2 or Charlie2? (Sorry, Charlie.)

Now, fast-forward 3.5 years, and everything is amazing, if complicated. They’re on three continents, Goldman Sachs has done a private placement from their private client group, bringing equity capital raised to $600 M, and they’ve floated the first ever tranche of $750 M in Software-Defined Shoe Bonds (SDSBs). Revenues are forecast for $3 B next year, but there’s trouble going public because of regulatory uncertainty around the Argentine government’s treatment of their main on-demand shoe 3-D printing factory in suburban Buenos Aires, and the complicated capital structure. Underwriters and investors are skittish about ponying up for the IPO.

Who should be on the board of directors of SDUfS? Danielle2 or Charlie2? (No offense, Danielle.)

GedankenVC will do best if the person who can source that deal sources it, and if the person who can manage that complex cap structure to exit manages it. Teams do work.

(Disclaimer: I help “source deals” for Seattle-based B2B VC firm, Voyager Capital, but on this blog I speak only for myself.)

Avoid sequential scan in PostgreSQL link table with highly variant density

January 9th, 2015

I had a particularly knotty problem today. I have two tables:


data_file (id, filename, source, stuff)
extracted_item (id, item_name, stuff)

A data file comes in and we store mention of it in data_file. The file can have zero, or more commonly, some finite positive number of items in it. We extract some data, and store those extracted items in, you guessed it, extracted_item.

There are tens of sources, and over time, tens of thousands of data files processed. No one source accounted for more than, say, 10% of the extracted items.

Now, sometimes the same extracted item appears in more than one file. We don’t want to store it twice, so what we have is the classic “link table,” “junction table,” or “many-to-many table,” thus:

data_file_extracted_item_link (data_file_id, extracted_item_id)

There are of course indices on both data_file_id and extracted_item_id.

Now, most data files have a tiny few items (1 is the modal number of items per file), but a couple of my data sources send files with almost 1 million items per file. Soon my link table grew to over 100 million entries.

When I went to do a metrics query like this:

select count(distinct data_file.filename),
       count(data_file_extracted_item_link.*)
from data_file
left join data_file_extracted_item_link
  on data_file.id = data_file_extracted_item_link.data_file_id
where source = $1 and [SOME OTHER CONDITIONS]

I would sometimes get instant (40 ms) responses, and sometimes get minutes-long responses. It depended upon the conditions and the name of the source, or so it seemed.

EXPLAIN ANALYZE told me that sometimes the PostgreSQL planner was choosing a sequential scan (seqscan) of the link table, with its 100 million rows. This was absurd, since 1. there were indices available to scan, and 2. no source ever accounted for more than a few percent of the total link table entries.

It got to the point where it was faster by orders of magnitude for me to write two separate queries instead of using a join. And I do mean “write” — I could manually write out a new query and run it in a different psql terminal minutes before Postgres could finish the 100 million + row seqscan.

When I examined pg_stats, I was shocked to find this gem:

select tablename, attname, null_frac, avg_width, n_distinct, correlation from pg_stats where tablename='data_file_extracted_item_link';
tablename | attname | null_frac | avg_width | n_distinct | correlation
-------------------------------+-------------------+-----------+-----------+------------+-------------
data_file_extracted_item_link | extracted_item_id | 0 | 33 | 838778 | -0.0032647
data_file_extracted_item_link | data_file_id | 0 | 33 | 299 | 0.106799

What was going on? Postgres thought there were only 299 different data files represented among the 100 million rows. Therefore, when I went to look at perhaps 100 different data files from a source, the query planner sensibly thought I’d be looking at something like a third of the entire link table, and decided a seqscan was the way to go.

It turns out that this is an artifact of the way the n_distinct is estimated. For more on this, see “serious under-estimation of n_distinct for clustered distributions” http://postgresql.nabble.com/serious-under-estimation-of-n-distinct-for-clustered-distributions-td5738290.html

Make sure you have this problem, and then, if you do, you can fix it by issuing two DDL statements (be sure to put these in your DDL / migrations with adequate annotation, and be aware they are PostgreSQL-specific).

First, choose a good number for n_distinct using guidance from http://www.postgresql.org/docs/current/static/sql-altertable.html

(In a nutshell, if you don’t want to be periodically querying and adjusting this with an actual empirical number, you can choose a negative number in [-1, 0), which tells the planner that the number of distinct values is abs(number) times the row count, such that -1 => every row distinct.)

Then, you can simply

alter table data_file_extracted_item_link alter column data_file_id set (n_distinct = -0.5);
analyze data_file_extracted_item_link;

After which, things are better:

select tablename, attname, null_frac, avg_width, n_distinct, correlation from pg_stats where tablename='data_file_extracted_item_link';
tablename | attname | null_frac | avg_width | n_distinct | correlation
-------------------------------+-------------------+-----------+-----------+------------+-------------
data_file_extracted_item_link | extracted_item_id | 0 | 33 | 838778 | -0.0032647
data_file_extracted_item_link | data_file_id | 0 | 33 | -0.5 | 0.098922

and no more grody seqscan.

Postgresql speedup of length measurements: use octet_length

January 8th, 2015

I was looking at some very rough metrics of JSON blobs stored in Postgres, mainly doing some stats on the total size of the blob. What I really cared about was the amount of data coming back from an API call. The exact numbers not so much; I mainly cared if an API usually sent back 50 kilobytes of JSON but today was sending 2 bytes (or 2 megabytes) — which is about the range of sizes of blobs I was working with.

Naively, I used

SELECT source_id, sum(length(json_blob_serialized)) FROM my_table WHERE [SOME CONDITIONS] GROUP BY source_id;

But for larger (> 10k rows) aggregates, I was running into performance issues, up to minutes-long waits.

Turns out that length(text) is a slow function, or at least in the mix of locales and encodings I am dealing with, it is.

Substituting octet_length(text) was a 100x speedup. Be advised.

Finally, I wouldn’t have known this necessarily without a process-of-elimination over the aggregates I was calculating in the query, COMBINED with the use of “EXPLAIN ANALYZE VERBOSE.” Remember to add “VERBOSE” or else you won’t be given information on when, where, and how the aggregates get calculated.

“AttributeError: exp” from numpy when calling predict_proba()

November 13th, 2014

If you’ve been trying out different types of Scikit-learn classifier algorithms, and have been merrily going along calling predict(X) and predict_proba(X) on various classifiers (e.g. DecisionTreeClassifier, RandomForestClassifier), you might decide to try something else (like LogisticRegression), which will seem to work for calling predict(X) but maddeningly fails with “AttributeError: exp”

If you follow the stack trace and the error is when “np.exp” is being called within _predict_proba_lr, you might have my problem, namely, you have some un-casted booleans within your X. This causes the predict_proba method to fail with the linear models (though not with the tree-based classifiers above).

You can fix this by converting your X to floats with X.astype(float) explicitly before passing X to predict_proba. Careful; if you have values that ACTUALLY don’t cast to float intelligently this will probably do terrible things to your model.
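
Roughly, the fix looks like the sketch below; the DataFrame, its columns, and the labels are all made up for illustration:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical feature frame with a boolean column -- the problem case.
X = pd.DataFrame({"age": [25, 40, 31, 58], "is_member": [True, False, True, True]})
y = np.array([0, 1, 0, 1])

clf = LogisticRegression()
clf.fit(X.astype(float), y)                  # cast up front...
probs = clf.predict_proba(X.astype(float))   # ...so predict_proba sees floats, not object-dtype bools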

If you formed up your X as an np.array natively, you probably don’t get this behavior, since np.array’s constructor seems to convert your bools for you. But if you started with a pd.DataFrame or pd.Series, *even if you converted it to an np.array*, it will consider the bools as objects and they will bomb out in predict_proba.

(np = numpy, pd = pandas by convention)


import numpy as np
import pandas as pd
a = np.array([1,2,3,True, False])
b = pd.Series([1,2,3,True, False])
c = np.array(b)
d = c.astype(float)

## native np.array is OK, because bools were converted.
np.exp(a)
Out[64]: array([ 2.71828183, 7.3890561 , 20.08553692, 2.71828183, 1. ])

## pd.Series can usually be used where np.array, but not when exp can't handle bools:
np.exp(b)
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2883, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input>", line 1, in <module>
np.exp(b)
AttributeError: exp

## merely explicitly creating a np.array first won't solve your problem.
np.exp(c)
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2883, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input>", line 1, in <module>
np.exp(c)
AttributeError: exp

## explicit cast to float works.
np.exp(d)
Out[67]: array([ 2.71828183, 7.3890561 , 20.08553692, 2.71828183, 1. ])

Vagrant/Ansible SSH problem with older OpenSSH lacking ControlPersist

September 16th, 2014

Intro; skip to the meat below if you found this through a google search on the error about “-c ssh … ControlPersist”

Vagrant is a Ruby-based abstraction layer that manages a mixture of VirtualBox (or other VM software), SSH, and “Provisioners” like Chef, Puppet, or in our case, Ansible. It’s meant mainly for setting up development and testing environments consistently; it lets you ignore the vagaries of each dev’s local box mess by actually running and testing the software inside a well-defined, consistently configured virtual machine.

Ansible is a Python-based configuration management tool that has a much more straightforward “up and running” learning curve than its ostensible peers. Notably, it is generally “agentless,” in the sense that all of the Ansible software gets run on your local box (your devops guy’s box at world headquarters) without any part of Ansible being installed on your remote nodes, and the actual process of configuring each node is done mainly by opening up ssh connections to each box and running generic, non-Ansible software (such as that remote box’s package manager).

Vagrant can invoke Ansible as a provisioner. Ansible can also be invoked to provision “real” machines, like EC2 instances or (does anyone even have those anymore?) actual physical machines.

The holy grail of devops here would be to re-use your dev, test, and prod configs, varying them only in the necessary parts. Ansible is modular enough to do this, and so in theory, you do something very schematically like this:

core_software: x, y, z ...
some_debugging_stuff: a, b, c ...
real_live_security_stuff: m, n, o ...

dev_vm: core_software, some_debugging_stuff
prod_box: core_software, real_live_security_stuff

You can now invoke Vagrant to create a VM and provision it with Ansible to give you a “dev_vm”, while directly using Ansible to create a “prod_box” at your data center. Theoretically, you now have some assurance that the two boxen have exactly the same core software of x, y, z.

The meat of it

Ansible’s heavy reliance upon outbound SSH connections from your local box is OK but throws some kinks in the works when you try to use identical Vagrant + Ansible configurations on two machines that do not share identical software versions like SSH (say, one brand-new Mac OS X and one older Linux). Specifically, you may see this fatal error while performing a “vagrant up” or “vagrant provision”:

using -c ssh on certain older ssh versions may not support ControlPersist

If it’s not clear, that error is coming from Ansible which is sanity-checking the SSH options which it’s being asked to use by Vagrant. Test your local system with:

$ man ssh | grep ControlPersist

If that fails, you have an SSH which is too old to support the ControlPersist option, but Ansible thinks it’s being asked to use that. (ControlPersist is used by default by recent Ansible versions to speed up the reinvocation of SSH connections, since Ansible uses lots and lots and lots of them.)

Optional: to help you understand and debug this, you’ll need to get Ansible more verbose. You can do this through the Vagrantfile you’re using by giving the option:

ansible.verbose = "vvv"

The error message you get from Ansible will suggest that you set ANSIBLE_SSH_ARGS=”” as a remedy. If you try this on the command line while invoking Vagrant merely by prepending it, like ‘ANSIBLE_SSH_ARGS=”” vagrant provision’, it won’t work; the “-vvv” output from Ansible will show that it’s been invoked with a long list of ANSIBLE_SSH_ARGS including the troublesome ControlPersist.

Further Googling may suggest that you can override the ssh args either in an “ansible.cfg” file (in one of /etc/ansible/ansible.cfg, ./ansible.cfg, or ~/.ansible.cfg) or in the Vagrantfile with “ansible.raw_ssh_args=['']”. It is possible that none of these will seem to work; read on.

After much stomping around and examining of the Vagrant source as of 4ef996d at https://github.com/mitchellh/vagrant/blob/master/plugins/provisioners/ansible/provisioner.rb the problem became clear: Vagrant’s “get_ansible_ssh_args” function WILL permit you to set an empty list of ssh_args (thereby leading to Vagrant setting ANSIBLE_SSH_ARGS=”” for you), but only if there are NONE of the following set: an array (more than one) in config.ssh.private_key_path, true in config.ssh.forward_agent, or ANYTHING in the raw_ssh_args. If anything is set in any of those at that point, the ControlMaster and ControlPersist options will be set.

It’s kind of vexing because you don’t expect that setting forward_agent will cause these other things always to be set, even when you have tried explicitly to set raw_ssh_args to empty.

So in sum:

  • No ssh_args in any of the ansible.cfg files that may be looked at
  • Vagrantfile: ensure no more than one private ssh key in config.ssh.private_key_path
  • Vagrantfile: ensure config.ssh.forward_agent=false
  • Vagrantfile: ensure ansible block has ansible.raw_ssh_args=[]

(My problem was with OpenSSH 5.3p1 (Debian-3ubuntu7.1), Ansible 1.7.1, and Vagrant 1.6.3, and was specifically triggered by the config.ssh.forward_agent=true.)

That solves it for me — I am happy with it because it’s almost 100% used for local Vagrant VMs. I have yet to see how managing remote boxes works from my older Linux machine with the ControlPersist optimization removed (though remember, in the case that you’re using Ansible directly and not through Vagrant, the above fix won’t apply.)