Today we’re going to look at performance tuning PostgreSQL. PostgreSQL is an advanced open source SQL database. Setting up the database is just the first step. With every table you add and every query you run, your databases will require maintenance and updates to ensure that they continue to run efficiently.
Although PostgreSQL does make a lot of optimizations for you, there are some things that it needs to know from you, the database admin, in order to run effectively.
When it comes to performance tuning an application, one rule applies: don’t optimize early. Start first by implementing your database and schema. Then when you’ve got your optimal setup, you can start to monitor your SQL queries with tools like Retrace. This is important because performance tuning is all about trade-offs. You might accept a slow-running query in one part of your application that isn’t used so often in order to get a lightning-fast response time on a frequently performed query.
Today I’m going to walk you through some different optimizations. First, we’ll go through some of the basic setup options available:
- Hardware updates—Changes you can make to the physical server
- Configuration—Modifications to the out-of-the-box PostgreSQL configuration
- Vacuuming—Ways in which vacuum settings can improve performance
Once your system is set up, we’ll see how you can analyze and improve your schema:
- Analyze your query performance—How to analyze individual queries
- Analyze your logs—How to get information out of your system for analysis
- Index for query performance—How adding indexes helps the database to optimize its query plan
Part 1: Best Practices and Setup
In the first part of this article, we’re going to go through how you can alter your basic PostgreSQL setup for faster performance. In part 2, I’ll cover how to optimize your system specifics, such as query optimizations. Let’s get to it!
The obvious place to start when it comes to optimizing performance is looking at the hardware of the system itself. Some of the most important factors to your database are the available memory, CPU and disk space, and performance.
When it comes to hardware updates, you should consider the following:
Updating your memory. Memory is what your system uses to effectively cache data it needs often. It’s important to keep memory in mind when you’re optimizing your queries later. The bigger your cache, the less your database will need to go out to disk—which can be painful for performance. PostgreSQL tries to hold the most frequently accessed data in memory to make performance improvements based on how your queries are performed and the configuration that you give it. But we’ll return to memory-based performance optimization later.
Separating the application from the database. On the note of increasing memory, if you’re running your database application on the same server as your application, you may want to consider isolating it. It’ll be difficult to truly analyze and improve your database performance if a separate component is affecting your metrics and environment.
Database configuration: The what, why, and how
PostgreSQL comes with a set of default configurations. This default configuration is set up for compatibility and is essentially a best guess that tries to suit all possible use cases of PostgreSQL. Luckily for you, that means there are some potential quick wins if you start to tweak the default configuration.
Before we dive in and go through some of the most common configuration optimizations you can make, it’s worth pointing out that there are several tools available—such as PGTune—that try to do some of this configuration legwork for you. Their effectiveness varies, but it might be worth experimenting to the delta between the output of these tools and your current database configuration.
Database configurations in PostgreSQL are made directly in the configuration file (postgresql.conf), or through running an ALTER SYSTEM command. If you want to see all the current configurations of your database now, simply run the SHOW command as follows:
This command will list out all of the existing configuration and their settings. It’s also important to note that different configurations will apply only in certain conditions, such as a database restart. Some configuration changes will require a server restart; others will require a reload of the configuration.
When you’ve made configuration changes to your database, you can see details such as whether a restart is required for your configuration update by running the following command:
SELECT * FROM pg_settings WHERE pending_restart = true;
You will probably want to modify a lot of different configurations to get the most out of your PostgreSQL database. Let’s go through some of the main configurations that you can change to get more performance from your system.
Connections are how your applications communicate with your database. Each connection will require some communication chatter and setup in order to establish. However, once established, queries can be sent down the wire to your database. Your PostgreSQL database will come with a default number of connections, which you can alter. You want to make sure that your applications aren’t unnecessarily connecting, as this can be costly and can impact performance. Memory allocations are performed on a per-connection basis, so you’ll want to balance between allocating memory and the number of connections.
A checkpoint is a periodic action that stores information about your system. By default, a checkpoint will run after a number of segments, but depending on your system, you may want to increase this value. We’ll talk later in the article about how you can log out your checkpoint data, but checkpoint configuration is important because it can be quite an expensive operation. It’s often thought that the default configuration is too aggressive and performs checkpoints too often, so you might want to increase this value to make checkpoints less frequent.
As I mentioned earlier, memory allocation and management is a big part of performance-tuning PostgreSQL. If your system is doing a lot of complex sorts, increasing the sort memory can help the database optimize its configuration for your setup. This allows PostgreSQL to cache more data in memory while it performs its sorting, as opposed to making expensive calls to the disk.
This setting essentially is the amount of time that your optimizer should spend reading memory before reaching out to your disk. You should alter this setting only when you’ve done other plan-based optimizations that we’ll cover soon, such as vacuuming, indexing, or altering your queries and schema.
These are just some of the optimizations you can make for database configurations, but there are plenty more. Now that you know how to tweak your database setup, let’s look at another area for investigation: vacuuming.
The next area we’re looking at is vacuuming. A vacuum is a scan that marks tuples as no longer needed so that they can be overwritten. Failure to do this can mean that you have dead tuples lingering in your system. These dead tuples are often called bloat. Bloat originates mostly from records that are being deleted, updated, or inserted.
PostgreSQL sets up vacuuming by default, but just as you can configure other settings, you can also set your vacuuming settings. You can even set vacuuming settings on a per-table basis for more fine-grained optimization.
To perform a vacuum, you simply run the command:
If you want to see the history of previously run vacuums, you can do so by executing the following:
SELECT * FROM pg_stat_user_tables
Generally speaking, regular vacuuming can’t be done enough. More frequent vacuuming keeps bloat to a minimum and ensures that database performance stays high. Although autovacuuming is already set up, you might want to alter the settings to be more aggressive.
Upfront system configuration will have a limit on performance increases. Although you can do some configuration upfront, a lot of your configuration decisions will be driven by the data and analysis you have of your system. You’ll want to optimize how it’s set up, know how it’s being queried, and optimize based on these use cases. PostgreSQL and the community give us some handy tools for making these optimizations.
Before we continue, it’s important to understand the life cycle of a query. The life cycle is what happens from an initial request to the database right up to the response it sends. PostgreSQL doesn’t just dip its hand into a big bag of memory. When you have a lot of data, crude fetching of your data can lead to performance drops. If you’re scanning your database sequentially (often called a table scan) for your data, your performance will scale linearly— more rows, slower performance. But we can do better than that.
So what is the life cycle of a query? Initially, there is a transmission of the query string to PostgreSQL. The query string is then parsed and a plan is created. Plans are really important. Plans are the steps that PostgreSQL is going to take to find requested data. How you set up your database configuration, your schema, and your indexes (more on indexes soon) will all affect how performant these plans are. So, it’ll be important that we understand plans and how to optimize them. Finally, the database will perform the plan and retrieve the data.
A discrepancy can occur between the database plan that PostgreSQL intends to use to get your data and the way it actually fetches your data. This is because PostgreSQL bases its plan on infrequently updated metrics and statistics. Some data has to be updated periodically so that the statistics used for the plans are up to date.
This is where ANALYZE comes in. Running the ANALYZE command updates these statistics so that Postgres has a fresh set of data about how to create its plans. So, if you’re updating the tables or schema or adding indexes, remember to run an ANALYZE command after so the changes will take effect.
One of the next obvious areas to go to is query optimization. The queries that you are running might be inefficient for many reasons, so we’ll need a way to understand what’s going on with these queries—which is where the EXPLAIN command comes in.
EXPLAIN gives you a precise breakdown of how PostgreSQL will execute your query. The plan is based on statistics about the table, and it takes into account things like database indexes to find the most efficient path to your data. But EXPLAIN will only guess and give you a plan that it thinks it will execute. You can run EXPLAIN ANALYZE to get not only the information about the predicted plan, but also an update on how the query performed.
When you’ve got the plan that the database is trying to perform, you can then begin to analyze it. Several tools exist to make life easier, such as PEV, which allows you to visualize and understand the output of your EXPLAIN command.
Let’s assume that you have a running database and you’re looking to debug slow performance in the application. One way you’ll want to do this is through logs. Logs are short information statements that the application leaves behind when it performs an action. You can then gather up these actions and analyze them in a log management tool like Retrace to get a picture of how the system is behaving.
By default, your application doesn’t log all data. This is because increases in logging also have an effect on database performance. So, while we go through how you can change log settings to analyze performance, remember that the log settings themselves can actually affect performance.
Logs are emitted to a file in your system, depending on your configuration. When you’ve found where your logs are at, you can use a tool such as Retrace to analyze these logs. Retrace will tell you stats such as the most frequently run queries, how long the queries take on average, and so on. These aggregate metrics can give you a better understanding of where you may have performance bottlenecks within your system.
So, now that we see how great logs are for understanding system performance, how do we go about setting them up or changing them?
This is the format of your database logs. For tools like PGBadger, they’ll need rich data and will also need to know the data format. Your log line prefix tells PostgreSQL what format to emit its log data in. If you want to do any meaningful log analysis, you’ll need to set this value to be compatible with the tooling you are using to analyze your logs.
This is the logging level you’re operating at. Log levels usually refer to the level of detail you want in your logs. For instance, do you want absolutely all logs, or only logs of a certain critical type? log_statement has a few different settings, including:
- ddl, which logs only structural changes to the database
- mod, which logs modifications to existing data (basically everything except SELECT)
- all, which logs … everything
As we discussed in the configuration settings, checkpoints in PostgreSQL are periodic actions that store data about your system. These log checkpoints can, if excessive, lead to performance degradations. If you suspect this might be the case, by enabling log checkpoints you’ll be able to see rich data about those checkpoints, such as how often they’re running and what could be triggering them.
You also might want to know information about connections. If you’ve only got one application connecting to your database, but you’re seeing many concurrent connections, something could be wrong. Too many connections flooding your database could also mean that requests are failing to reach the database and could be affecting your application’s end users.
Now that we’ve got lots of data about our system, what tools do we have for improving the structure of our system to be more performant?
In a world without indexes, every request to the database would lead to a full scan of the entirety of the table to find the relevant results. With a large data set, this can be extremely slow—and that’s where database indexing comes in. Just as in a book, indexes give information to your database engine about roughly where in the system the data you’re looking for lives. To index our system properly, we need an understanding of our data and how we’re trying to access it. Which is why it’s important we have observability and monitoring tools, like Retrace setup to help us. Indexes don’t come for free, however; if you updated the content of a book, you’ll have to update the index each time you make a change to the content.
PostgreSQL has a few different types of index. The difference between these is that they all use a different algorithm. The indexes available are B-tree (the default index), Hash, GiST, SP-GiST, and GIN. PostgreSQL will create implicit indexes when you create a primary key or a unique key constraint. However, you will have to add the rest of your indexes manually.
The basic syntax for an INDEX is:
CREATE INDEX index_name ON table_name;
However, you shouldn’t use an index in some use cases—for example, when the overhead of using the index exceeds the benefit from the algorithm, such as on a small table. But tables that have large batch updates performed might also see performance issues. It may make sense to remove indexes on these tables temporarily while it’s updating, before restoring indexes.
There you have it: lightning-fast PostgreSQL
Hopefully that gives you an understanding of how to start performance tuning your Postgres database. Now you should be able to analyze your queries with EXPLAIN, pull out your logs for further insight, and modify your indexes for faster performance. Plus, you should be able to tweak your database configuration to match use so you can get the most out of it.
Just remember, tuning your database will take time and practice. So be patient and stay curious to find out more about your PostgreSQL system to get the best performance results.