Best way to count distinct indexed things in PostgreSQL


tl;dr: SELECT COUNT(*) FROM (SELECT DISTINCT my_not_unique_indexed_column FROM my_table) t;

I have a table that looks like this:

songsearch=# \d main_songtexthash
              Table "public.main_songtexthash"
  Column   |           Type           | Collation | Nullable |
-----------+--------------------------+-----------+----------+
 id        | integer                  |           | not null |
 text_hash | character varying(32)    |           | not null |
 created   | timestamp with time zone |           | not null |
 modified  | timestamp with time zone |           | not null |
 song_id   | integer                  |           | not null |
Indexes:
    "main_songtexthash_pkey" PRIMARY KEY, btree (id)
    "main_songtexthash_song_id_key" UNIQUE CONSTRAINT, btree (song_id)
    "main_songtexthash_text_hash_c2771f1f" btree (text_hash)
    "main_songtexthash_text_hash_c2771f1f_like" btree (text_hash varchar_pattern_ops)
Foreign-key constraints:
    ...snip...

And the data looks something like this:

songsearch=# select text_hash, song_id from main_songtexthash limit 10;
            text_hash             | song_id
----------------------------------+---------
 6f98e1945e64353bead9d6ab47a7f176 | 2565031
 0c6662363aa4a340fea5efa24c98db76 |  486091
 a25af539b183cbc338409c7acecc6828 |     212
 5aaf561b38c251e7d863aae61fe1363f | 2141077
 6a221df60f7cbb8a4e604f87c9e3aec0 |  245186
 d2a0b5b3b33cdf5e03a75cfbf4963a6f | 1453382
 95c395dd78679120269518b19187ca80 |  981402
 8ab19b32b3be2d592aa69e4417b732cd |  616848
 8ab19b32b3be2d592aa69e4417b732cd |  243393
 01568f1f57aeb7a97e2544978fc93b4c |     333
(10 rows)

If you look carefully, you'll notice that every song_id has a different text_hash except two of them.
Song IDs 616848 and 243393 both have the same text_hash of value 8ab19b32b3be2d592aa69e4417b732cd.

Also, if you imagine this table only has these 10 rows, you can quickly and easily conclude that there are 10 different song_id values but only 9 distinct text_hash values. However, how do you do this counting when the table is large?

The Wrong Way

songsearch=# select count(distinct text_hash) from main_songtexthash;
  count
---------
 1825983
(1 row)

And the query plan and cost analysis is:

songsearch=# explain analyze select count(distinct text_hash) from main_songtexthash;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=44942.09..44942.10 rows=1 width=8) (actual time=40029.225..40029.226 rows=1 loops=1)
   ->  Seq Scan on main_songtexthash  (cost=0.00..40233.87 rows=1883287 width=33) (actual time=0.029..193.653 rows=1879521 loops=1)
 Planning Time: 0.059 ms
 Execution Time: 40029.250 ms
(4 rows)

Oh noes! A Seq Scan! Run!

The Right Way

This is better explained in this blog post but basically, cutting to the chase, here's how you count distinct values on an indexed field:

songsearch=# select count(*) from (select distinct text_hash from main_songtexthash) t;
  count
---------
 1825983
(1 row)

And the query plan and cost analysis is:

songsearch=# explain analyze select count(*) from (select distinct text_hash from main_songtexthash) t;
                                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=193871.20..193871.21 rows=1 width=8) (actual time=4894.555..4894.556 rows=1 loops=1)
   ->  Unique  (cost=0.55..172861.54 rows=1680773 width=33) (actual time=0.075..4704.741 rows=1825983 loops=1)
         ->  Index Only Scan using main_songtexthash_text_hash_c2771f1f on main_songtexthash  (cost=0.55..168153.32 rows=1883287 width=33) (actual time=0.074..4132.822 rows=1879521 loops=1)
               Heap Fetches: 1879521
 Planning Time: 0.082 ms
 Execution Time: 4894.581 ms
(6 rows)

The exact same result, but in ~5s instead of ~40s. I'll take that, thank you very much.
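To convince yourself that the two query forms really are equivalent, here's a minimal sketch using Python's sqlite3 module with a made-up toy copy of the table (the table and column names are reused from above, the data is invented; the rewrite itself is plain SQL, though the index-only-scan win shown above is specific to the PostgreSQL planner):

```python
import sqlite3

# Toy in-memory copy of the table; data is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE main_songtexthash (song_id INTEGER PRIMARY KEY, text_hash TEXT)"
)
conn.execute("CREATE INDEX idx_text_hash ON main_songtexthash (text_hash)")
rows = [
    (616848, "8ab19b32b3be2d592aa69e4417b732cd"),
    (243393, "8ab19b32b3be2d592aa69e4417b732cd"),  # duplicate hash
    (2565031, "6f98e1945e64353bead9d6ab47a7f176"),
    (486091, "0c6662363aa4a340fea5efa24c98db76"),
]
conn.executemany("INSERT INTO main_songtexthash VALUES (?, ?)", rows)

# The naive form and the subquery form return identical counts.
(slow,) = conn.execute(
    "SELECT COUNT(DISTINCT text_hash) FROM main_songtexthash"
).fetchone()
(fast,) = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT text_hash FROM main_songtexthash) t"
).fetchone()
assert slow == fast == 3  # 4 rows, 3 distinct hashes
```

The result is the same either way; the difference in PostgreSQL is purely in which plan the optimizer picks.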

The Django Way

As a bonus: Django is smart. Here's how they do it:

>>> SongTextHash.objects.values('text_hash').distinct().count()
1825983

And the SQL it generates to make that count looks very familiar:

SELECT COUNT(*) FROM (SELECT DISTINCT "main_songtexthash"."text_hash" AS Col1 FROM "main_songtexthash") subquery

Conclusion

  • Avoid sequential scans like the plague if you care about performance (or just about not exhausting your resources).
  • Trust in Django.
