
Avoid sequential scan in PostgreSQL link table with highly variant density

Friday, January 9th, 2015

I had a particularly knotty problem today. I have two tables:


data_file (id, filename, source, stuff)
extracted_item (id, item_name, stuff)

A data file comes in and we store mention of it in data_file. The file can have zero, or more commonly, some finite positive number of items in it. We extract some data, and store those extracted items in, you guessed it, extracted_item.
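
For concreteness, a minimal DDL sketch of those two tables might look like this (the types, and anything beyond the columns named above, are illustrative assumptions, not the real schema):

create table data_file (
    id       bigserial primary key,
    filename text not null,
    source   text not null
    -- ... plus the other "stuff" columns
);

create table extracted_item (
    id        bigserial primary key,
    item_name text not null
    -- ... plus the other "stuff" columns
);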

There are tens of sources and, over time, tens of thousands of data files have been processed. No one source accounts for more than, say, 10% of the extracted items.

Now, sometimes the same extracted item appears in more than one file. We don’t want to store it twice, so what we have is the classic “link table,” “junction table,” or “many-to-many table,” thus:

data_file_extracted_item_link (data_file_id, extracted_item_id)

There are of course indices on both data_file_id and extracted_item_id.
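
In DDL-sketch form (names and types again illustrative):

create table data_file_extracted_item_link (
    data_file_id      bigint not null references data_file (id),
    extracted_item_id bigint not null references extracted_item (id)
);

-- one index per side of the many-to-many relationship
create index on data_file_extracted_item_link (data_file_id);
create index on data_file_extracted_item_link (extracted_item_id);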

Now, most data files have a tiny few items (1 is the modal number of items per file), but a couple of my data sources send files with almost 1 million items per file. Soon my link table grew to over 100 million entries.
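
If you want to see that skew directly, one quick sanity check is to ask which files contribute the most link rows:

-- the heaviest data files, by number of extracted items linked
select data_file_id, count(*) as n_items
from data_file_extracted_item_link
group by data_file_id
order by n_items desc
limit 10;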

When I went to do a metrics query like this:

select count(distinct data_file.filename),
       count(data_file_extracted_item_link.*)
from data_file
left join data_file_extracted_item_link
  on data_file.id = data_file_extracted_item_link.data_file_id
where source = $1 and [SOME OTHER CONDITIONS]

I would sometimes get instant (40 ms) responses, and sometimes get minutes-long responses. It depended upon the conditions and the name of the source, or so it seemed.

EXPLAIN ANALYZE told me that sometimes the PostgreSQL planner was choosing a sequential scan (seqscan) of the link table, with its 100 million rows. This was absurd, since (1) there were indices available to scan, and (2) no source ever accounted for more than a few percent of the total link table entries.
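
You can watch the planner make this choice by running the metrics query under EXPLAIN ANALYZE (substituting a real source for the bind parameter; 'SOURCE_NAME' below is a stand-in, and the other conditions are omitted):

explain analyze
select count(distinct data_file.filename),
       count(data_file_extracted_item_link.*)
from data_file
left join data_file_extracted_item_link
  on data_file.id = data_file_extracted_item_link.data_file_id
where source = 'SOURCE_NAME';

In the slow cases, the plan contains a line like “Seq Scan on data_file_extracted_item_link” where you would hope for an index scan.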

It got to the point where it was faster by orders of magnitude for me to write two separate queries instead of using a join. And I do mean “write”: I could manually type out a second query and run it in another psql terminal minutes before Postgres could finish the 100-million-plus-row seqscan.
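
Something like this, in sketch form (my reconstruction of the idea, not the literal queries):

-- query 1: find the files for the source; data_file is small, so this is fast
select id
from data_file
where source = 'SOURCE_NAME';

-- query 2: count link rows for exactly those files, going through
-- the data_file_id index instead of a seqscan
select count(*)
from data_file_extracted_item_link
where data_file_id in (/* ids from query 1 */);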

When I examined pg_stats, I was shocked to find this gem:

select tablename, attname, null_frac, avg_width, n_distinct, correlation from pg_stats where tablename='data_file_extracted_item_link';
tablename | attname | null_frac | avg_width | n_distinct | correlation
-------------------------------+-------------------+-----------+-----------+------------+-------------
data_file_extracted_item_link | extracted_item_id | 0 | 33 | 838778 | -0.0032647
data_file_extracted_item_link | data_file_id | 0 | 33 | 299 | 0.106799

What was going on? Postgres thought there were only 299 different data files represented among the 100 million rows. Therefore, when I went to look at perhaps 100 different data files from one source, the query planner sensibly thought I’d be looking at something like a third of the entire link table, and decided a seqscan was the way to go.
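
To check an estimate like that against reality, an exact count is slow on a table this size, but decisive:

select count(distinct data_file_id)
from data_file_extracted_item_link;

Given the tens of thousands of data files processed, that should come back in the tens of thousands, not 299.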

It turns out that this is an artifact of the way n_distinct is estimated. For more on this, see “serious under-estimation of n_distinct for clustered distributions”: http://postgresql.nabble.com/serious-under-estimation-of-n-distinct-for-clustered-distributions-td5738290.html

First make sure you actually have this problem; if you do, you can fix it by issuing two statements (be sure to put these in your DDL / migrations with adequate annotation, and be aware they are PostgreSQL-specific).

First, choose a good number for n_distinct using guidance from http://www.postgresql.org/docs/current/static/sql-altertable.html

(In a nutshell, if you don’t want to be periodically querying and adjusting this with an actual empirical number, you can choose a negative number in the range [-1, 0): the planner then assumes the fraction of rows that are distinct is abs(number), so -1 means every row is distinct, and the estimate scales automatically as the table grows.)
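
Concretely, with the negative form the planner estimates the distinct count as abs(n_distinct) times the current row count. For this table, at roughly 100 million rows, a setting of -0.5 works out to:

-- estimated distinct data_file_id values with n_distinct = -0.5:
--   0.5 * reltuples = 0.5 * 100,000,000 = 50,000,000 (at ~100M rows)
select reltuples
from pg_class
where relname = 'data_file_extracted_item_link';

That is a big overestimate of the true distinct count, but erring in that direction biases the planner toward the indices and away from the seqscan.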

Then, you can simply

alter table data_file_extracted_item_link alter column data_file_id set (n_distinct = -0.5);
analyze data_file_extracted_item_link;

After which, things are better:

select tablename, attname, null_frac, avg_width, n_distinct, correlation from pg_stats where tablename='data_file_extracted_item_link';
tablename | attname | null_frac | avg_width | n_distinct | correlation
-------------------------------+-------------------+-----------+-----------+------------+-------------
data_file_extracted_item_link | extracted_item_id | 0 | 33 | 838778 | -0.0032647
data_file_extracted_item_link | data_file_id | 0 | 33 | -0.5 | 0.098922

and no more grody seqscan.