Postgres tips and tricks to kick-start the year 2021 - CYBERTEC


I decided to start out this year by looking into my notes from last year and extracting from them a small set of Postgres tips for you. This might not be, strictly speaking, fresh news… but I guess the average RAM of tech workers is about 64KB or so nowadays, so some repetition might not be such a bad thing.

Partial indexes

This is a huge one – and you’re really missing out if you’ve not heard of this feature or are not using it enough… as too often seems to be the case in my daily consulting work! Thus this was the first item on my list from last year. For larger amounts of data, partial indexes can literally be a life-saving trick when your back is already against the wall due to some unexpected increase in workload or disk usage. One of the reasons this feature is relatively unknown is that most other popular DBMS engines don’t have it at all, or they call it by another name.

The crux of the thing is super easy to remember – stop indexing the most common values! Since Postgres knows what your data looks like, it does not use the index whenever you search for a value that’s too common! As always, when you reduce indexes, you’re not just winning on the storage – you can also avoid the penalties incurred when you’re inserting or modifying the data. So: with partial indexes, when a new column contains the most common value, we don’t have to go and touch the index at all!

When does a value become “too common”? Sadly, there is no simple rule for this, as it depends on a couple of other config/data layout variables as well as the actual queries themselves – but roughly, starting at about 25-50% of the total rows, Postgres will tend to ignore the indexes.

Here is a small test table with 100 million rows to help you visualize the possible size benefits.

CREATE UNLOGGED TABLE t_event ( id int GENERATED BY DEFAULT AS IDENTITY, state text NOT NULL, data jsonb NOT NULL
); /* 1% NEW, 4% IN_PROCESSING, 95% FINISHED */
INSERT INTO t_event (state, data)
SELECT CASE WHEN random() < 0.95 THEN 'FINISHED' ELSE CASE WHEN random() < 0.8 THEN 'IN_PROCESSING' ELSE 'NEW' END END, '{}'
FROM generate_series(1, 5 * 1e7); postgres=# \dt+ List of relations Schema | Name | Type | Owner | Size | Description --------+---------+-------+----------+---------+------------- public | t_event | table | postgres | 2489 MB | (1 row) CREATE INDEX ON t_event (state); CREATE INDEX ON t_event (state)
WHERE state <> 'FINISHED';
postgres=# \di+ t_event_state_* List of relations Schema | Name | Type | Owner | Table | Size | Description --------+--------------------+-------+----------+---------+---------+------------- public | t_event_state_idx | index | postgres | t_event | 1502 MB | public | t_event_state_idx1 | index | postgres | t_event | 71 MB | (2 rows) 

Another tip on partial indexes (to finish off the topic) – usually, another perfect set of candidates are any columns that are left mostly empty, i.e., “NULL”. The thing is that unlike Oracle, PostgreSQL has default all-encompassing indexes, so that all the NULL-s are physically stored in the index!

Estimating average row/record size in bytes

Quite often, you will need to evaluate the approximate size of a number of rows of data for capacity planning reasons. In these cases, there are a couple of approaches. The simplest ones consist of looking at the table size and row count and then doing some elementary math. However, the simplest approaches can be too rough and inaccurate – most often due to bloat; another reason could be that some historical data might not have some recently added columns (for example, due to some peculiarities of when Postgres built-in compression kicks in, etc.) The fastest way to get an estimate is to use the EXPLAIN command. EXPLAIN has the information embedded on the “cost” side – but do note that the estimate may use stale statistics, and it’s generally pretty inaccurate on “toasted” columns.

explain select * from pgbench_accounts a; QUERY PLAN --------------------------------------------------------------------------- Seq Scan on pgbench_accounts a (cost=0.00..2650.66 rows=100366 width=97)
(1 row)

So what do you do if you want to estimate the storage size of some recent data accurately?

The function listed below, which answers the question raised above, is a welcome discovery for most people when they see it in action. Probably, not many people know about this due to the somewhat “hidden” function name (naming things is one of the two most difficult problems in computer science 🙂 ), and due to the fact that we’re abusing the function – which is expecting a column – by feeding in a whole row! Remember: in Postgres all tables automatically get a virtual table type with all columns that is kind of a “scalar” – if they are not, for example, unpacked with “t.*”.

select avg(pg_column_size(a)) from pgbench_accounts a; avg ---------------------- 120.9999700000000000
(1 row) 

The advantage of this function is that we can exactly specify and inspect the data that we’re interested in. It also handles the disk storage side of things like compression and row headers!

Keep in mind that the size determined still does not usually map 1-to-1 to real-life disk usage for larger amounts of data due to good old companions like “bloat” and normal “defragmentation” or “air bubbles”. So if you want super-sharp estimates, there’s no way around generating some real-life test data. But do use a TEMP or UNLOGGED table if you do so; no need to create a huge WAL spike…especially if you happen to run a bunch of replicas.

Avoiding risk on upgrading older Postgres instances to new major versions

A fun (or maybe sad) fact from the Postgres consulting trenches – most Postgres users are running some pretty darn old versions! One of many reasons behind that is commonly voiced as, “We cannot stop our business for so and so many minutes.” I say – fair enough, minutes are bad…but with Logical Replication (LR) we’re talking about seconds!!! It’s no magic, far from it – the built-in LR introduced in v10 couldn’t be any simpler! We’ve performed many such migrations. In the vast majority of cases, it worked smoothly – as intended! In some cases, there might be problems if the verification/switchover period is overly prolonged…

In short – with LR, there is really no reason to run some ancient (well, in “internet years” at least 🙂 ) versions. If my article got you curious and you want to learn more about this topic, I’d suggest starting here.

By the way, if you’re on some pre v10 instance, but higher or equal to v9.4, then logical upgrades are also possible via a 3rd party plugin called “pglogical“, which often worked quite well when I used it some years ago.

Detecting which index types are available for my column type

What if you’re unhappy with your indexing performance? Maybe there are some other more cool index types available for your column? Postgres has a bunch of those, actually…

postgres=# SELECT DISTINCT a.amname
FROM pg_type t JOIN pg_opclass o ON o.opcintype = t.oid JOIN pg_am a ON a.oid = o.opcmethod
WHERE t.typname = 'int4'; amname -------- btree hash brin
(3 rows) 

And that’s not all – after declaring some “magic” extensions:

CREATE EXTENSION btree_gin;
CREATE EXTENSION btree_gist;

the picture changes to something like that below. Quite a lot of stuff to choose from for our good old integer!

postgres=# SELECT DISTINCT a.amname
FROM pg_type t JOIN pg_opclass o ON o.opcintype = t.oid JOIN pg_am a ON a.oid = o.opcmethod
WHERE t.typname = 'int4'; amname -------- bloom brin btree gin gist hash
(6 rows)

By the way, if you’re thinking, “What the hell are all those weird index types? Why should I give them a chance?”, then I would recommend starting here and here.

Inspecting metadata at the speed of light with psql’s “gdesc”

Since we are talking about data types…how do we determine what is actually delivered to us by the Postgres engine in tabular format? What type of data is in column XYZ for example?

The thing is that sometimes you get some above-average funky query sent to you where developers are having difficulties with “you might need to add explicit type casts” complaints from the server, converting some end results of a dozen sub-selects and transformations, so that the original column and its data type have already fallen into an abyss. Or, maybe the ORM needs an exact data type specified for your query, but runtime metadata introspection might be a painful task in the programming language at hand.

So here’s a quick tip on how to employ a PostgreSQL DBA’s best friend “psql” (sorry, doggos) for that purpose:

postgres=# SELECT 'jambo' mambo, 1 a, random(), array[1,2]; mambo | a | random | array -------+---+---------------------+------- jambo | 1 | 0.28512335791514687 | {1,2}
(1 row) postgres=# \gdesc Column | Type --------+------------------ mambo | text a | integer random | double precision array | integer[]
(4 rows) 

This only works starting from v11 of Postgres.

Thanks for reading, and hopefully it got you back on track to start learning some new Postgres stuff again in 2021! Check out last years blogpost for more tips.