Tech preview: How PostgreSQL 12 handles prepared plans - Cybertec


PostgreSQL 12 is just around the corner and therefore we already want to present some of the new features we like. One important new feature gives users and devops the chance to control the behavior of the PostgreSQL optimizer. Prepared plans are always a major concern (especially people moving from Oracle seem to be most concerned) and therefore it makes sense to discuss the way plans are handled in PostgreSQL 12.

Firing up a PostgreSQL test database

To start I will create a simple table consisting of just two fields:

db12=# CREATE TABLE t_sample (id serial, name text);
CREATE TABLE

Then some data is loaded:

db12=# INSERT INTO t_sample (name) SELECT 'hans' FROM generate_series(1, 1000000);
INSERT 0 1000000 db12=# INSERT INTO t_sample (name) SELECT 'paul' FROM generate_series(1, 2);
INSERT 0 2

Note that 1 million names are identical (“hans”) and just two people are called “paul”. The distribution of data is therefore quite special, which has a major impact as you will see later in this post.

To show how plans can change depending on the setting, an index on “name” is defined as shown in the next listing:

db12=# CREATE INDEX idx_name ON t_sample (name);
CREATE INDEX

The PostgreSQL query optimizer at work

Let us run a simple query and see what happens:

db12=# explain SELECT count(*) FROM t_sample WHERE name = 'hans'; QUERY PLAN
------------------------------------------------------------------
Finalize Aggregate (cost=12656.23..12656.24 rows=1 width=8) -> Gather (cost=12656.01..12656.22 rows=2 width=8) Workers Planned: 2 -> Partial Aggregate (cost=11656.01..11656.02 rows=1 width=8) -> Parallel Seq Scan on t_sample (cost=0.00..10614.34 rows=416668 width=0) Filter: (name = 'hans'::text)
(6 rows)

In this case PostgreSQL decided to ignore the index and go for a sequential scan. It has even seen that the table is already quite large and opted for a parallel query. Still, what we see is a sequential scan. All data in the table has to be processed. Why is that? Remember: Most people in the table have the same name. It is faster to read the entire table and kick out those other ones instead of having to read almost the entire index. The planner figures (correctly) that running a sequential scan will be faster.

What you can take away from this example is that an index is not used because it exists – PostgreSQL uses indexes when they happen to make sense. If we search for a less frequent value, PostgreSQL will decide on using the index and offer us the optimal plan shown in the next listing:

db12=# explain SELECT count(*) FROM t_sample WHERE name = 'paul'; QUERY PLAN
------------------------------------------------
Aggregate (cost=4.45..4.46 rows=1 width=8) -> Index Only Scan using idx_name on t_sample (cost=0.42..4.44 rows=1 width=0) Index Cond: (name = 'paul'::text)
(3 rows)

Optimizer statistics: Fueling good performance

If you are looking for good performance, keeping an eye on optimizer statistics is definitely a good idea. The main question now is: Which data does the optimizer keep? pg_stats contains information about each column:

db12=# \x
Expanded display is on.
db12=# SELECT * FROM pg_stats WHERE tablename = 't_sample' AND attname = 'name';
-[ RECORD 1 ] --------------------------+------------
schemaname | public
tablename | t_sample
attname | name
inherited | f
null_frac | 0
avg_width | 5
n_distinct | 1
most_common_vals | {hans}
most_common_freqs | {1}
histogram_bounds |
correlation | 1
most_common_elems |
most_common_elem_freqs |
elem_count_histogram |

PostgreSQL keeps track over the percentage of NULL entries in the table (null_frac). The average width of a column, the estimated number of distinct values (are all different, are all values the same). Then PostgreSQL keeps a list of the most frequent entries as well as their likelihood. The histogram_bounds column will contain the statistical distribution of data. In our example you will only find entries in this field if you are looking for the “id” column. There are only two names so keeping a histogram is basically pointless. The correlation column will tell us about the physical order of rows on disk. This field can be pretty important because it helps the optimizer to estimate the amount of I/O.

Preparing plans manually

If you send a query to PostgreSQL it is usually planned when the query is sent. However, if you explicitly want to prepare a query, you can make use of the PREPARE / EXECUTE commands. Here is how it works:

db12=# PREPARE myplan(text) AS SELECT count(*) FROM t_sample WHERE name = $1;
PREPARE
db12=# EXECUTE myplan('paul');
count
------- 2
(1 row)

As you can see the following query will give us an indexscan:

db12=# explain EXECUTE myplan('paul'); QUERY PLAN
-------------------------------------------------
Aggregate (cost=4.45..4.46 rows=1 width=8) -> Index Only Scan using idx_name on t_sample (cost=0.42..4.44 rows=1 width=0) Index Cond: (name = 'paul'::text)
(3 rows)

If we fall back to the more common value we will again get a parallel sequential scan:

db12=# explain EXECUTE myplan('hans'); QUERY PLAN
-------------------------------------------------------------------
Finalize Aggregate (cost=12656.23..12656.24 rows=1 width=8) -> Gather (cost=12656.01..12656.22 rows=2 width=8) Workers Planned: 2 -> Partial Aggregate (cost=11656.01..11656.02 rows=1 width=8) -> Parallel Seq Scan on t_sample (cost=0.00..10614.34 rows=416668 width=0) Filter: (name = 'hans'::text)
(6 rows)

Why is that the case? In PostgreSQL life is not so straight forward. Even if we prepare explicitly we will still get a “fresh plan” before a generic plan is created. What is a generic plan? A generic plan is made assuming some constant parameters. The idea is to keep the plan and execute it multiple times in the hope that overall performance goes up due to lower planning overhead. Up to PostgreSQL 11 this process has been a bit “obscure” to most people.

Here is how the “obscure” thing works in detail. There are two ways PostgreSQL can choose for executing a prepared statement:

  • It could create a new plan for every execution that considers the current parameter value. That will lead to the best possible plan, but having to plan the query for every execution can remove most of the benefit for an OLTP application, which is to avoid having to plan the same statement over and over again.
  •  It could create a “generic plan” that does not take the parameter values into account. That will avoid re-planning the statement every time, but it can lead to problems during execution if the best plan depends heavily on the parameter values.

By default, PostgreSQL chooses a “middle road”: it will generate a “custom plan” during the first 5 executions of the prepared statement that takes the parameter values into account. From the sixth execution on, it will check if the generic plan would have performed as well (by comparing the estimated execution costs of the custom and the generic plan). If it thinks that the generic plan would have done just as well, the prepared statement will always use the generic plan from that point on.

PostgreSQL 12 introduces a new variable, which allows users to control the behavior more explicitly. Let us try the same thing again and enforce a generic plan: