I was looking at some very rough metrics on JSON blobs stored in Postgres, mainly stats on the total size of each blob. What I really cared about was the amount of data coming back from an API call. The exact numbers didn't matter much; I mainly wanted to know if an API that usually sent back 50 kilobytes of JSON was today sending 2 bytes (or 2 megabytes), which is roughly the range of blob sizes I was working with.
Naively, I used
SELECT source_id, sum(length(json_blob_serialized)) FROM my_table WHERE ... GROUP BY source_id
But for aggregates over larger sets (> 10k rows), I was running into performance problems, with waits of up to several minutes.
It turns out that length(text) is a slow function, at least with the mix of locales and encodings I am dealing with, presumably because counting characters means walking the whole string and decoding every multi-byte sequence.
Substituting octet_length(text), which just returns the byte count, was a 100x speedup. And since I was trying to measure how much data an API sent back, bytes were arguably the more relevant number anyway. Be advised.
Finally, I wouldn't necessarily have known this without a process of elimination over the aggregates I was calculating in the query, combined with "EXPLAIN ANALYZE VERBOSE." Remember to add "VERBOSE," or else you won't be shown where and how the aggregates get calculated.