rlucas.net: The Next Generation Rotating Header Image

bugfix

s3put fails with ssl.CertificateError suddenly after upgrade

We had been using periods / dots in Amazon S3 bucket names in order to create some semblance of namespace / order. Pretty common convention.

A short while ago a cron job doing backups stopped working after some Python upgrades. Specifically, we were using s3put to upload a file to “my.dotted.bucket“. The error was:

ssl.CertificateError: hostname 'my.dotted.bucket.s3.amazonaws.com' doesn't match either of '*.s3.amazonaws.com', 's3.amazonaws.com'

It turns out that per Boto issue #2836 a recent strictifying of SSL certificate validation breaks the ability to validate the SSL cert when there are extra dots on the LHS of the wildcard. Boo.

If you don’t have the luxury of monkey-patching (or actually patching) the code that sits atop this version of boto, you can put the following section into your (possibly new) ~/.boto config file:

[s3]
calling_format = boto.s3.connection.OrdinaryCallingFormat

(Of course, expect that all of the nasty MITM attacks that stricter SSL validation is meant to mitigate to come back and bite you!)

DB Transaction “BEGIN” in Django shell

Django provides a handy “shell” which can be invoked using the manage.py for a project, and which will usefully setup the necessary Django environment and even invoke ipython for completion, syntax highlighting, debugger, etc.

Also usefully, but very much separate from the shell functionality, Django provides a nice framework for dealing with database transactions through its ORM. One can use django.db.transaction.rollback() for example.

However, the shell by default will be invoked with autocommit, meaning that each individual SQL statement gets committed. When one is poking around freehand in the shell, this might not be for the best, so one may want to turn off autocommit and resort to the choice of being able to rollback().

Unfortunately for that use case, all of the Django infrastructure for beginning database transactions is focused on how to begin a transaction in your code, where it rightly would be expected to be within a function or at least a “with” block. Hence, the docs and the code focus on using decorators, e.g. “@transaction.commit_on_success” or context managers, e.g. “with transaction.commit_on_success():“. Obviously not helpful in the shell / REPL.

If you are in your “manage.py shell” and need to do some romping around in your single-database Django app while being wrapped up in the warm fuzzy security blanket of a DB transaction lest you fat-finger something, you can get the same effect for your subsequent few commands in the shell with:

from django.db import transaction
transaction.enter_transaction_management()
transaction.managed(True)
# do stuff
imp = my_models.ImportantObject(title="Emabrassing Tpyos In the Titel")
imp.save()
# oops
transaction.rollback()
# this is too stressful, let's quit
transaction.leave_transaction_management()

Caveats: this only works in a one-database-connection setup where using the default connection does what you want; newer versions of Django may have a nice way to do this; don’t trust my random blog post with your production data!

s3put just stops working with “broken pipe”

So your cron job, which has been dutifully stuffing away into s3 your backups nightly or hourly or whatever, just stops working. s3put just breaks with the unhelpful complaint, “broken pipe.”

You can try running s3put with “–debug 2” added to your flags, and watch the lower protocol-level stuff seem to go along just fine until it barfs with the same error.

Check the size of your file. If you’ve got a backup that’s been slowing creeping up in size and is now over 5.0 GB, that’s your issue. AWS apparently has a 5 GB s3 limit for single-part HTTP PUT.

s3put accepts a “–multipart” option, but only if it can find the necessary Python libraries including “filechunkio,” so install filechunkio and try again. With any luck, you can just add –multipart to your s3put command and it will Just Work.

Python matrix initialization gotcha

If you want to spin up a list of lists — a poor man’s matrix — in Python, you may want to initialize it first. That way you can use indices to point directly (random access) into the matrix, with something like:

matrix[i][j][k]

without having to worry whether you’ve managed to make the matrix “big enough” through appending , looping, whatever.

If you are an idiot like me, you will skim StackOverflow and come away with the naive use of the “*” operator to create lists.

In [1]: lol = [[[None]*1]*3]*2

In [2]: lol
Out[2]: [[[None], [None], [None]], [[None], [None], [None]]]

That seems to work fine for our case — a small 3-D matrix (trivial in the third dimension I admit) initialized to None, the pseudo-undefined object of Python. Sounds good. Wait…

In [3]: lol[0][0][0] = 'asdf'

In [4]: lol
Out[4]: [[['asdf'], ['asdf'], ['asdf']], [['asdf'], ['asdf'], ['asdf']]]

Um. Since the same singleton None object was assigned to each of the slots in the matrix, changing it in one place changes it everywhere.

Facepalm.

To do what you actually want to do, use the list comprehension syntax and leave the monstrosity of the * operator alone:


In [21]: lolfixed = [[[None for k in range(1)] for j in range(3)] for i in range(2)]

In [22]: lolfixed = [[[None for k in range(1)] for j in range(3)] for i in range(2)]
KeyboardInterrupt

In [22]: lolfixed
Out[22]: [[[None], [None], [None]], [[None], [None], [None]]]

In [23]: lolfixed[0][0][0] = 'asdf'

In [24]: lolfixed
Out[24]: [[['asdf'], [None], [None]], [[None], [None], [None]]]

Avoid sequential scan in PostgreSQL link table with highly variant density

I had a particularly knotty problem today. I have two tables:


data_file (id, filename, source, stuff)
extracted_item (id, item_name, stuff)

A data file comes in and we store mention of it in data_file. The file can have zero, or more commonly, some finite positive number of items in it. We extract some data, and store those extracted items in, you guessed it, extracted_item.

There are tens of sources, and over time, tens of thousands of data files processed. No one source accounted for more than, say, 10% of the extracted items.

Now, sometimes the same extracted item appears in more than one file. We don’t want to store it twice, so what we have is the classic “link table,” “junction table,” or “many-to-many table,” thus:

data_file_extracted_item_link (data_file_id, extracted_item_id)

There are of course indices on both data_file_id and extracted_item_id.

Now, most data files have a tiny few items (1 is the modal number of items per file), but a couple of my data sources send files with almost 1 million items per file. Soon my link table grew to over 100 million entries.

When I went to do a metrics query like this:

select count(distinct data_file.filename),
count(data_file_extracted_item_link.*) from data_file left join
data_file_extracted_item_link on
data_file.id=data_file_extracted_item_link.data_file_id
where source=$1 and [SOME OTHER CONDITIONS]

I would sometimes get instant (40 ms) responses, and sometimes get minutes-long responses. It depended upon the conditions and the name of the source, or so it seemed.

ANALYZE told me that sometims the Postgresql planner was choosing a sequential scan (seqscan) of the link table, with its 100 million rows. This was absurd, since 1. there were indices available to scan, and 2. no source ever accounted for more than a few percent of the total link table entries.

It got to the point where it was faster by orders of magnitude for me to write two separate queries instead of using a join. And I do mean “write” — I could manually write out a new query and run it in a different psql terminal minutes before Postgres could finish the 100 million + row seqscan.

When I examined pg_stats, I was shocked to find this gem:

select tablename, attname, null_frac, avg_width, n_distinct, correlation from pg_stats where tablename='data_file_extracted_item_link';
tablename | attname | null_frac | avg_width | n_distinct | correlation
-------------------------------+-------------------+-----------+-----------+------------+-------------
data_file_extracted_item_link | extracted_item_id | 0 | 33 | 838778 | -0.0032647
data_file_extracted_item_link | data_file_id | 0 | 33 | 299 | 0.106799

What was going on? Postgres though there were only 299 different data files represented among the 100 million rows. Therefore, when I went to look at perhaps 100 different data files from a source, the query planner sensibly thought I’d be looking at something like a third of the entire link table, and decided a seqscan was the way to go.

It turns out that this is an artifact of the way the n_distinct is estimated. For more on this, see “serious under-estimation of n_distinct for clustered distributions” http://postgresql.nabble.com/serious-under-estimation-of-n-distinct-for-clustered-distributions-td5738290.html

Make sure you have this problem, and then, if you do, you can fix it by issuing two DDL statments (be sure to put these in your DDL / migrations with adequate annotation, and be aware they are PostgreSQL-specific).

First, choose a good number for n_distinct using guidance from http://www.postgresql.org/docs/current/static/sql-altertable.html

(In a nutshell, if you don’t want to be periodically querying and adjusting this with an actual empirical number, you can choose a negative number from (-1, 0] to force the planner to guess that the sparsity is abs(number), such that -1 => 100% sparsity.)

Then, you can simply

alter table data_file_extracted_item_link alter column data_file_id set (n_distinct = -0.5);
analyze data_file_extracted_item_link;

After which, things are better:

select tablename, attname, null_frac, avg_width, n_distinct, correlation from pg_stats where tablename='data_file_extracted_item_link';
tablename | attname | null_frac | avg_width | n_distinct | correlation
-------------------------------+-------------------+-----------+-----------+------------+-------------
data_file_extracted_item_link | extracted_item_id | 0 | 33 | 838778 | -0.0032647
data_file_extracted_item_link | data_file_id | 0 | 33 | -0.5 | 0.098922

and no more grody seqscan.

Postgresql speedup of length measurements: use octet_length

I was looking at some very rough metrics of JSON blobs stored in Postgres, mainly doing some stats on the total size of the blob. What I really cared about was the amount of data coming back from an API call. The exact numbers not so much; I mainly cared if an API usually sent back 50 kilobytes of JSON but today was sending 2 bytes (or 2 megabytes) — which is about the range of sizes of blobs I was working with.

Naively, I used

SELECT source_id, sum(length(json_blob_serialized)) FROM my_table GROUP BY source_id WHERE ;

But for larger (> 10k rows) aggregates, I was running into performance issues, up to minutes-long waits.

Turns out that length(text) is a slow function, or at least in the mix of locales and encodings I am dealing with, it is.

Substituting octet_length(text) was a 100x speedup. Be advised.

Finally, I wouldn’t have known this necessarily without a process-of-elimination over the aggregates I was calculating in the query, COMBINED with the use of “EXPLAIN ANALYZE VERBOSE.” Remember to add “VERBOSE” or else you won’t be given information on when, where, and how the aggregates get calculated.

“AttributeError: exp” from numpy when calling predict_proba()

If you’ve been trying out different types of Scikit-learn classifier algorithms, and have been merrily going along calling predict(X) and predict_proba(X) on various classifiers (e.g. DecisionTreeClassifier, RandomForestClassifier), you might decide to try something else (like LogisticRegression), which will seem to work for calling predict(X) but maddenly fails with “AttributeError: exp”

If you follow the stack trace and the error is when “np.exp” is being called within _predict_proba_lr, you might have my problem, namely, you have some un-casted booleans within your X. This causes the predict_proba method to fail with linear models (though not with classifiers).

You can fix this by converting your X to floats with X.astype(float) explicitly before passing X to predict_proba. Careful; if you have values that ACTUALLY don’t cast to float intelligently this will probably do terrible things to your model.

If you formed up your X as an np.array natively, you probably don’t get this behavior, since np.array’s constructor seems to convert your bools for you. But if you started with a pd.DataFrame or pd.Series, *even if you converted it to an np.array*, it will consider the bools as objects and they will bomb out in predict_proba.

(np = numpy, pd = pandas by convention)


import numpy as np
import pandas as pd
a = np.array([1,2,3,True, False])
b = pd.Series([1,2,3,True, False])
c = np.array(b)
d = c.astype(float)

## native np.array is OK, because bools were converted.
np.exp(a)
Out[64]: array([ 2.71828183, 7.3890561 , 20.08553692, 2.71828183, 1. ])

## pd.Series can usually be used where np.array, but not when exp can't handle bools:
np.exp(b)
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2883, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
np.exp(b)
AttributeError: exp

## merely explicitly creating a np.array first won't solve your problem.
np.exp(c)
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2883, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
np.exp(c)
AttributeError: exp

## explicit cast to float works.
np.exp(d)
Out[67]: array([ 2.71828183, 7.3890561 , 20.08553692, 2.71828183, 1. ])

Vagrant/Ansible SSH problem with older OpenSSH lacking ControlPersist

Intro; skip to the meat below if you found this through a google search on the error about “-c ssh … ControlPersist”

Vagrant is a Ruby-based abstraction layer that manages a mixture of VirtualBox (or other VM software), SSH, and “Provisioners” like Chef, Puppet, or in our case, Ansible. It’s meant mainly for setting up development and testing environments consistently; it lets you ignore the vagaries of each dev’s local box mess by actually running and testing the software inside a well-defined, consistently configured virtual machine.

Ansible is a Python-based configuration management tool that has a much more straightforward “up and running” learning curve than its ostensible peers. Notably, it is generally “agentless,” in the sense that all of the Ansible software gets run on your local box (your devops guy’s box at world headquarters) without any part of Ansible being installed on your remote nodes, and the actual process of configuring each node is done mainly by opening up ssh connections to each box and running generic, non-Ansible software (such as that remote box’s package manager).

Vagrant can invoke Ansible as a provisioner. Ansible can also be invoked to provision “real” machines, like EC2 instances or (does anyone even have anymore?) actual physical machines.

The holy grail of devops here would be to re-use your dev, test, and prod configs, varying them only in the necessary parts. Ansible is modular enough to do this, and so in theory, you do something very schematically like this:

core_software: x, y, z ...
some_debugging_stuff: a, b, c ...
real_live_security_stuff: m, n, o ...

dev_vm: core_software, some_debugging_stuff
prod_box: core_software, real_live_security_stuff

You can now invoke Vagrant to create a VM and provision it with Ansible to give you a “dev_vm”, while directly using Ansible to create a “prod_box” at your data center. Theoretically, you now have some assurance that the two boxen have exactly the same core software of x, y, z.

The meat of it

Ansible’s heavy reliance upon outbound SSH connections from your local box is OK but throws some kinks in the works when you try to use identical Vagrant + Ansible configurations on two machines that do not share identical software versions like SSH (say, one brand-new Mac OS X and one older Linux). Specifically, you may see this fatal error while performing a “vagrant up” or “vagrant provision”:

using -c ssh on certain older ssh versions may not support ControlPersist

If it’s not clear, that error is coming from Ansible which is sanity-checking the SSH options which it’s being asked to use by Vagrant. Test your local system with:

$ man ssh | grep ControlPersist

If that fails, you have an SSH which is too old to support the ControlPersist option, but Ansible thinks it’s being asked to use that. (ControlPersist is used by default by recent Ansible versions to speed up the reinvocation of SSH connections, since Ansible uses lots and lots and lots of them.)

Optional: to help you understand debug this, you’ll need to get Ansible more verbose. You can do this through the Vagrantfile you’re using by giving the option:

ansible.verbose = "vvv"

The error message you get from Ansible will suggest that you set ANSIBLE_SSH_ARGS=”” as a remedy. If you try this on the command line while invoking Vagrant merely by prepending it, like ‘ANSIBLE_SSH_ARGS=”” vagrant provision’, it won’t work; the “-vvv” output from Ansible will show that it’s been invoked with a long list of ANSIBLE_SSH_ARGS including the troublesome ControlPersist.

Further Googling may suggest that you can override the ssh args either in an “ansible.cfg” file (in one of /etc/ansible/ansible.cfg, ./ansible.cfg, or ~/.ansible.cfg) or in the Vagrantfile with “ansible.raw_ssh_args=[”]”. It is possible that none of these will seem to work; read on.

After much stomping around and examining of the Vagrant source as of 4ef996d at https://github.com/mitchellh/vagrant/blob/master/plugins/provisioners/ansible/provisioner.rb the problem became clear: Vagrant’s “get_ansible_ssh_args” function WILL permit you to set an empty list of ssh_args (thereby leading to Vagrant setting ANSIBLE_SSH_ARGS=”” for you), but only if there are NONE of the following set: an array (more than one) in config.ssh.private_key_path, true in config.ssh.forward_agent, or ANYTHING in the raw_ssh_args. If anything is set in any of those at that point, the ControlMaster and ControlPersist options will be set.

It’s kind of vexing because you don’t expect that setting forward_agent will cause these other things always to be set, even when you have tried explicitly to set raw_ssh_args to empty.

So in sum:

  • No ssh_args in any of the ansible.cfg files that may be looked at
  • Vagrantfile: ensure no more than one private ssh key in config.ssh.private_key_path
  • Vagrantfile: ensure config.ssh.forward_agent=false
  • Vagrantfile: ensure ansible block has ansible.raw_ssh_args=[]

(My problem was with OpenSSH OpenSSH_5.3p1 Debian-3ubuntu7.1, Ansible ansible 1.7.1, and Vagrant Vagrant 1.6.3, and was specifically triggered by the config.ssh.forward_agent=true.)

That solves it for me — I am happy with it because it’s almost 100% used for local Vagrant VMs. I have yet to see how managing remote boxes works from my older Linux machine with the ControlPersist optimization removed (though remember, in the case that you’re using Ansible directly and not through Vagrant, the above fix won’t apply.)

Pandas merge woes on MultiIndex, solved

Perhaps you are data munging in Python, using “pandas”, and you attempt to use DataFrame.merge() in order to put two DataFrames together.

“cannot join with no level specified and no overlapping names”

This happens when you have two DataFrames with one having a MultiIndex type, which *could* play nice together (e.g. you have “year, month” on the left, and “year” on the right, *but do not have names set.*

You’ll need to explicitly set names with

leftdf.index.levels[0].name = “onename”
leftdf.index.levels[1].name = “twoname”
rightdf.index.levels[0].name = “onename”

Alternatively, you can make it work if you reindex the right-hand side by the left hand side:

rightdf2 = rightdf.reindex(index=leftdf.index, level=0) ## NOTE Assignment, does not modify rightdf in-place

CHCBP-013 describes unlimited former spouse coverage under CHCBP

(This is extremely specialized subject material. If you found it while searching, you probably understand this, but if you’re just a curious reader, you might want to skip it.)

The Continued Health Care Benefit Program is analogous to COBRA for former TRICARE-covered persons; it is mostly a transitional way to get TRICARE-like (TRICARE less military facilities) benefits self-paid after one ceases to be eligible for TRICARE itself.

Generally the CHCBP eligibility period is limited. However, in the case of former spouses of servicemembers who have not remarried before age 55, it is possible to qualify for unlimited (lifetime) eligibility. This is obvious from law and from published stuff (e.g. continuing legal education materials for the tiny subset of lawyers for whose clients this is a very important topic), but it is nearly impossible to get anything from the “horse’s mouth,” that is, the actual CHCBP plan administrator, which is Humana Military Healthcare Services (Humana Military).

If you call Humana Military at 1-800-444-5445 and ask for the rules regarding former spouse unlimited eligibility, they will punt and tell you it’s something that is decided only at the point at which the normal 3-year eligibility has expired. If you persist and ask how it’s decided, they’ll tell you that there is a questionnaire which is sent, but that they don’t have access to it and can’t send a copy. If, incredulous, you insist that some person, perhaps that person’s manager, does in fact know where the questionnaire is, since it’s obviously already written and gets sent out to people all the time, you will get transferred to a kindly manager who takes your FAX number and swears up and down that they’ll send it to you tomorrow morning since the office is about to close.

If you’re a real ornery gadfly, though, and you’re ready to start writing letters to your congressman etc., you might try sending a letter to Humana Military asking for this information and documenting your attempts to get it and the promises which were made and unfulfilled about sending it to you before. If you do all this, they will mail you a redacted copy of an actual form CHCBP-13, which had been sent to an actual former spouse whose eligibility was expiring. (Maybe, in fact, the first representative was truthful and this is not a document that they have ready, and they write up a new one every time and so the only way to send it out is to take the latest one and redact the policyholder’s personal info. I doubt it.) (For this courtesy I am grateful, O unnamed mail-replier at Humana Military, but the policy of keeping this information completely off the public Web is either backwards or sinister.)

Note that the operative requirements are:

1. A “signed statement of assurance that you have not remarried before the age of 55.”
2. Proof (the standard of which is left to the imagination) that you were “enrolled in, or covered by, an approved health benefits plan under Chapter 55, Title 10, United States Code as the dependent of a retiree at any time during the 18-month period before the date of the divorce, dissolution, or annulment (Both TRICARE and CHCPB would qualify as such a plan).” [sic — Humana’s language implies that the enrolee must have been a dependent of a “retiree” but the actual language of 10 USC 55 sec. 1078a (b) (3) talks about a “member or former member” of the armed forces]
3. “A copy of the final divorce decree or other authoritative evidence” that you are actually receiving or have a court order to receive part of the retired or retainer pay of the service member, or have a written agreement that the service member will elect to provide you an annuity. [DO NOT rely on this nor on Humana’s exact language, for this, get your ducks in a row with the statute etc. — it’s slightly complicated.]
4. The executed (signed) renewal notice page of CHCBP-013 and a check for the next premium payable to “United States Treasury.”

Given that this is a hugely important item to a few (but maybe not all that few) people out there, and that it related to the execution of very clear federal law, I’m rather shocked that there’s no mention ***anywhere on the Internet prior to today(!!!)*** of CHCBP-013 and its contents and requirements.

So, with no further ado, I present you a redacted copy of form letter CHCBP-13, which shows the actual procedural requirements for obtaining unlimited former spouse CHCBP coverage.