PostgreSQL 11: something for everyone


This article brought to you by LWN subscribers

Subscribers to LWN.net made this article — and everything that surrounds it — possible. If you appreciate our content, please buy a subscription and make the next set of articles possible.

PostgreSQL 11 had its third beta release on August 9; a fourth beta (or possibly a release candidate) is scheduled for mid-September. While the final release of the relational database-management system (currently slated for late September) will have something new for many users, its development cycle was notable for being a period when the community hit its stride in two strategic areas: partitioning and parallelism.

Partitioning and parallelism are touchstones for major relational database systems. Proprietary database vendors manage to extract a premium from a minority of users by upselling features in these areas. While PostgreSQL has had some of these "high-tier" items for many years (e.g., CREATE INDEX CONCURRENTLY, advanced replication functionality), the upcoming release expands the number considerably. I may be biased as a PostgreSQL major contributor and committer, but it seems to me that the belief that community-run database system projects are not competitive with their proprietary cousins when it comes to scaling enterprise workloads has become just about untenable.

Partitioning

Partitioning is the ability to subdivide a parent table into a number of smaller tables that can be managed individually, while preserving the appearance of a single table to application code. Individual partitions can be indexed, migrated to a different filesystem/tablespace, or dropped altogether by the administrator. There may be significant benefits for performance, especially with things like reporting queries that must access most or all of the data in an entire partition. The partition key is often based on a natural way of dividing the data, such as partitioning sales data by country or year.

Partitioning now supports unique indexes, foreign keys, and automatic creation of new, equivalent indexes as new partitions are created. In general, partitions now have most of the capabilities of ordinary tables. Note that unique indexes must include the partition key; global indexes, where a single B-Tree structure covers each and every partition may appear as an enhancement in some future version.

Support for hash-based partition keys has been added. This allows partitions to be defined in terms of a hash key, relieving users from managing partitions when there is no natural way to uniformly separate rows into partitions.

UPDATE statements can move a row across partition boundaries in v11. This occurs when the update happens to affect a column that participates in defining the boundaries. Frequently doing so might defeat the purpose of partitioning, but making it at least possible to do so could make all the difference in a pinch.

"Dynamic" partition elimination can take place via partition pruning in v11. In version 10, the query planner inferred which partitions could safely be excluded from consideration during query optimization. In version 11, dynamic pruning will also take into consideration information that is only available during query execution, thereby eliminating more unnecessary partitions in some cases.

Last but not least, partitioning can take advantage of cases where aggregation has a GROUP BY clause that contains the partition key (i.e., partition-wise aggregate), and cases where two tables that happen to have exactly identical partitioning schemes are joined (i.e., partition-wise join). The general idea is that query execution can divide and conquer by joining and aggregating individual partitions rather than naively operating through the parent table. These optimizations are governed by the settings enable_partitionwise_aggregate and enable_partitionwise_join, both of which are set to "off" by default.

Parallelism

Parallelism in PostgreSQL has come a long way since it was introduced in PostgreSQL 9.6 just two years ago. Version 11 promises to fill many of the remaining gaps in the system's parallel capabilities.

CREATE INDEX can now scan a table in parallel, which will take place for B-Tree index builds whenever the new strategy is thought favorable, subject to various resource limits. Forming new index tuples occurs in worker processes, as does almost all sorting work. (The duration of a CREATE INDEX for B-Tree indexes is largely a matter of how quickly a batch sort operation can be set up and executed.)

Though as the feature's primary author I can make no claim of objectivity, I suspect that parallel CREATE INDEX will be a boon to applications that have not previously seen a benefit from parallelism. CREATE INDEX is an intensive batch operation that is often CPU bound; it is a natural target for parallel processing. That isn't necessarily the case for any SELECT query in certain applications. There may be no SELECT statement that processes enough rows at a time to ever benefit from parallelism. For my part, the feature is the culmination of years of work on low-level sorting optimizations.

Parallel B-Tree index builds will now usually be 2-3x faster than the equivalent serial index builds. Hat-tip to my co-authors, Rushabh Lathia and Heikki Linnakangas.

Parallel hash joins

A hash join is a join operation that is executed by building an ad-hoc hash table in memory for one side of the join (typically the smaller of the two inputs), and subsequently probing the hash table for a matching tuple when scanning the other side. It is possible that the hash table will be spilled to disk. Hash joins are often seen in query plans when there is an equi-join with an inherent need to process a large number of rows, especially when one side of the join has far fewer input rows than the other. When the join qualification is selective, such that most of the hash table probes don't find a match to output a joined tuple on, hash join inputs are usually from some other already-filtered scan on both sides, so the inputs are generally not correlated.

PostgreSQL 11 introduces "parallel-aware hash joins". While it was possible for a parallel sequential scan to feed a hash join in previous releases, the hash table on the inner side of the join had to be independently built within each parallel worker process. While this wasn't necessarily a bad strategy (it's still used when the hash table/inner side input is a tiny lookup table), it was quite limiting much of the time. Hash joins can now store their hash table in shared memory and can build it in parallel. The feature's primary author, Thomas Munro, wrote a blog post on this work that explains how it was developed and puts it in context.

This enhancement greatly increases the scalability of hash joins. Query plans with a hash join that builds a large hash table in parallel are now an excellent general showcase for PostgreSQL's parallel query capabilities. Being able to build the hash table in parallel often enables executing underlying "partial" scans in parallel, so parallel-aware hash joins enable greater parallelism within the query plan as a whole. Both sides of a hash join can be fed by a parallel sequential scan.

JIT compilation of queries

After over two years of work, Andres Freund committed a series of patches to add just-in-time (JIT) compilation to the PostgreSQL executor in March. JIT compilation makes use of the LLVM compiler framework. The general idea is to replace static machine code with specialized versions that are JIT compiled, especially for queries where comparatively small amounts of code bottleneck their execution.

This builds on an enhancement added to PostgreSQL 10 by Freund, with assistance from Tom Lane, which overhauled expression evaluation and target-list projection. That is, the method by which the executor evaluated expressions to filter rows (such as WHERE A = 15), and the method by which the executor projected expressions from rows to the next level up in the plan tree (such as RETURNING lower(name)) were each redesigned.

The v10 improvement essentially flattened the representation of expression state and changed how expression evaluation was performed. A compact, linear representation based on opcode dispatch was adopted, superseding an earlier tree-based representation. This helped performance in its own right by avoiding the need to do a tree walk that visits each node recursively. Individual nodes are trivially simple much of the time, so fixed overheads (e.g., call-stack space) can be important. Instruction cache misses are often the dominant cost incurred by recursive evaluation.

The follow-up v11 work builds on the flattened, composable representation by adding support for JIT-compiling highly specialized variants of expression evaluation and tuple deforming code, all without compromising the mathematically rigorous abstraction of set processing over relations. "Stock" LLVM intermediate representation (IR) bitcode files are generated for most standard operators (the underlying functions are compiled to IR bitcode) when PostgreSQL is initially built (--with-llvm builds only).

When a query is JIT compiled by PostgreSQL during query evaluation, the LLVM optimizer generates optimized IR based on the stock unoptimized IR. LLVM is well suited to JIT compiling optimized IR from IR generated by static languages such as C. Non-recursive expression evaluation allows less performance-critical code to be shared between interpreted and compiled plans, which was a crucial maintainability consideration for Freund.

The LLVM optimization process has access to the full context of which operators are applied to which columns, the shape of tuples (e.g., if a fixed column offset can be used), whether or not constant folding may be possible, and other such minutiae. A lot of branches can be entirely removed in the final generated machine code. The performance improvements that Freund has been able to show [PDF] using queries from a fair-use implementation of the industry standard TPC-H benchmark have been impressive.

Procedures

A traditional complaint from those moving from other systems to PostgreSQL has been the lack of stored procedures. A v11 feature from Peter Eisentraut goes some way toward addressing these concerns.

Procedures are similar to functions, in that they are routines that consist of reusable code that can be called from SQL. Unlike functions, procedures cannot return a value when called and cannot appear as part of a SELECT statement — a new CALL command has been added to execute procedures. The CALL command will begin and end transactions on the caller's behalf when the called procedural code requires it. Traditional functions, in contrast, are always subordinate to whatever specific SQL statement happens to execute them. However, CALL cannot be used inside an existing transaction block in most cases.

Transaction control features (COMMIT and ROLLBACK interfaces) were added to all of the standard procedural languages as part of the v11 feature: PL/pgSQL, PL/Python, PL/Perl, and PL/Tcl. It's clear to me that a certain proportion of application developers are allergic to putting business logic in the database system, despite the fact that batch processing logic is often a great deal faster when implemented that way. Perhaps the fact that clients are now able to cede full control of transaction scope to procedural code running on the database server (code that can be written in a familiar scripting language) will help application developers to get over the understandable aversion. Hope springs eternal.

INCLUDE indexes ("covering" indexes)

A personal favorite and insider's pick, INCLUDE indexes are those that have some number of extra included columns. These extra columns do not participate in comparisons as rows are located by index scans — their values are just an additional payload that don't affect unique enforcement for unique indexes or constraints. (It's possible to use this with a secondary/non-unique index as well, but that will be far less common.)

Judicious use of INCLUDE indexes can increase the number of index-only scans, and thus decrease the aggregate amount of I/O used to service queries. This enhancement allows users to add extra columns to a primary key or unique constraint without affecting when or how duplicate violation errors occur. An additional, quasi-redundant index need never be defined. A single all-encompassing index may now be able to service practically all queries with an index-only scan, at least if we make certain reasonable assumptions about the access patterns and rate of churn.

Regrettably, INCLUDE indexes failed to make their way into several previous releases. The community finally found a way forward on certain concerns about the on-disk representation in the final CommitFest of the v11 cycle. Thanks in large part to the hard work of the feature's primary author, Anastasia Lubennikova, administrators will have a powerful new tool for maximizing the use of index-only scans in specific, though important, situations.

Instant ADD COLUMN with a default value

Schema migrations are a challenge in many production environments. Depending on the exact details, a utility statement that adds a new column to a table may either be more or less instantaneous, or may take at least as long as it takes to rewrite everything. This is a world of difference. (Lock acquisition is another complicating factor, though often doesn't need to be given much consideration.)

A rewrite that wasn't strictly necessary could happen anyway in previous versions. In particular, there was typically no good reason for a non-NULL default value within ALTER TABLE ... ADD COLUMN to force a rewrite, but the extra metadata accounting required to avoid one wasn't in place. PostgreSQL 11 is able to store metadata about the default value, which is retrieved for columns in on-disk tuples without an explicit representation (those rows from before the column was added to the table) during subsequent SELECT statements. In other words, the "NULL default requires no rewrite" case was extended to include defaults consisting of any expression with a value that can be precalculated up front. This is not limited to simple scalar values — only rare cases where a rewrite is truly necessary (e.g., DEFAULT random()) will require a rewrite.

Where the rubber meets the road

General purpose database systems must support a wide variety of use cases, especially since users seldom know what will be important to their application well ahead of time. I expect that progress on general performance and scalability will continue or pick up in future releases, though the true rate of progress depends in no small part on the participation of the wider user community. Many enhancements require that some excruciating tradeoff be made. As the progenitor of PostgreSQL, Dr. Michael Stonebraker put it in his 2015 ACM Turing Award lecture: "If you are going to solve real-world problems, you got to talk to real people to know what they are. The rubber meets the road as opposed to the rubber meets the sky."

Prognostication is a tricky business, but there are some early signs on how version 12 will shape up. Even more operations will be parallelized. A patch for parallel VACUUM is in the works for version 12, written by Masahiko Sawada. We expect to see progress on the zheap project, a longer term effort led by Robert Haas and Amit Kapila to address problems with VACUUM by completely re-architecting how garbage collection of old row versions takes place.

An index skip scan patch has also been proposed, which enables B-Tree index scans that omit the leading column from a particular available index when the column happens to have few distinct values. A similar patch by Alexander Korotkov to recognize that the ordering provided by a B-Tree index can be made to work by performing many small sorts on an omitted trailing column may finally leave patch purgatory in the v12 cycle. There is a duality here: the first enhancement helps the case where we want to do an index scan using the B-Tree sort order "(B, C)", but only have an index on "(A,B,C)", whereas the second enhancement helps the case where we want to do an index scan with the sort order "(A,B,C)", but only have an index on "(A,B)". The former enhancement "skips", while the latter enhancement "extends".

I strongly suspect that both patches will help real-world performance for online transaction processing (OLTP) and web applications quite noticeably, especially by making the indexes that are available work well enough without expert database administrator intervention. Early feedback from users may validate that intuition, as well as other intuitions held by other contributors.

Another area that's ripe for further improvements, which would likewise benefit from feedback from more sophisticated users, is the query planner. Remaining gaps in the optimizer's capabilities include an inability to push down aggregates below joins and an inability to produce plans that are optimal for certain kinds of star schema queries.

While building a feature complete relational database system that deals gracefully with real-world production performance and scalability challenges is an enormous undertaking, PostgreSQL now provides all the major pieces.

I would like to thank Joe Conway for his helpful review of this article.

[Peter Geoghegan works for Crunchy Data.] (Log in to post comments)