Standby in production: scaling an application in the second largest classified site in the world.

By Konstantin Evteev

  • 4.5 Gbit/sec TX, 2 Gbit/sec RX excluding static content;
  • about a million queries per minute to the backend;
  • 270 TB of images;
  • more than 20 TB in Postgres on 100 nodes;
  • 7–8k TPS on most nodes;
  • the largest node: 20k TPS, 5 TB.

A few words about standby and its history in general

  • Scaling: you can route part or all of your read queries to one or more standbys.
  • 2005, PostgreSQL 8.0: point-in-time recovery makes it possible to create handmade replication scripts and keep a warm reserve.
  • 2008, PostgreSQL 8.3: administration of the previously released warm standby feature (2006) becomes easier. That's why some people still run PostgreSQL 8.3 in production.
  • 2010, PostgreSQL 9.0: since this version a PostgreSQL standby has been able to serve read queries (hot standby); moreover, the streaming replication feature has simplified setting up replication and decreased the possible replication lag.
  • 2011, PostgreSQL 9.1: synchronous replication makes it easier to meet the SLA for applications working with critical data (for example, information about financial transactions). Previously, to avoid data loss, you had to wait until your data was successfully replicated to the standby or save the data in two databases manually.
  • 2013, PostgreSQL 9.3: a standby can follow a timeline switch, which makes failover procedures easier.
  • 2014, PostgreSQL 9.4: replication slots make it safer to set up a standby without an archive, but as a result a standby failure can lead to a primary failure (the primary keeps WAL for an inactive slot until it runs out of disk space). Logical decoding becomes the foundation for built-in logical replication.
  • 2016, PostgreSQL 9.6: multiple synchronous standbys make it possible to build more reliable clusters. remote_apply makes it possible to serve read queries from a standby without stale reads.

Problems and solutions in replication based horizontal scale-out

When you use an asynchronous standby, it can fall behind the primary. That's why different rules for routing read queries to the primary or to a standby suit different types of applications, for example applications with:

  • a specific profile of queries, with no or only a small number of errors caused by stale reads;
  • and so on.
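
To apply such routing rules you first need to know how far each standby is behind. Below is a minimal sketch of checking the lag from the primary side; it is not from the article, the host and database names are illustrative, and it uses PostgreSQL 10+ names (on 9.x the equivalents are pg_current_xlog_location(), pg_xlog_location_diff() and the replay_location column).

#!/bin/bash
# Sketch: report each standby's replay lag in bytes, as seen from the primary.
# PostgreSQL 10+ function/column names; host and database are illustrative.
psql -h primary -d test -At -F' ' -c "
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;"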

Avito’s implementation to avoid stale reads from replica

We successfully used the logical routing technique (number 2: routing based on business-specific logic) when we faced the following spikes. In the picture you can see a diagram with the load average on a 28-physical-core machine (2x Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz). When we started using Hyper-Threading, we ran benchmarks and found out that most of our queries scaled well with Hyper-Threading enabled. But then the distribution of queries changed, and the result can be seen on the diagram: when the load average got close to the number of physical cores, the spikes happened.

  • we routed all read queries for an entity that had received a mutating request within the last 7.5 minutes to the primary server;
  • we could route read queries to replicas that did not fall behind the primary by more than a 5-minute interval.
  • Decreasing the time interval used for routing queries increases the number of queries sent to the standby. At the same time, it increases the probability that the standby's lag becomes greater than the routing interval. If this happens, all queries are sent to the primary server, and it is highly likely that the primary won't be able to process all these requests. This leads to one more conclusion: when you use 2 nodes (a primary and a standby in production), you need at least a 3rd node (a 2nd standby) for reserve purposes, and there should be a disaster recovery plan for all parts of this complex infrastructure. A sketch of this routing rule is shown below.

Cases highlighting possible problems when using a standby under a high request rate: applying DDL, receiving WAL files from the archive, and handling some issues with the technique of using a few standbys in production and routing queries between them

  • standby;
  • two tables: the items and the options.
2018-01-12 16:54:40.208 MSK pid=20949,user=user_3,db=test,host=127.0.0.1:55763 LOG: process 20949 still waiting for AccessShareLock on relation 10000 of database 9000 after 5000.055 ms
2018-01-12 16:54:40.208 MSK pid=20949,user=user_3,db=test,host=127.0.0.1:55763 DETAIL: Process holding the lock: 46091. Wait queue: 18639, 20949, 53445, 20770, 10799, 47217, 37659, 6727, 37662, 25742, 20771,
  • Then switch traffic to the second standby (which has successfully replayed the WAL with the DDL statement).
  • Start replication on the first standby again and wait till the ALTER command has been replayed on it (a sketch of this flow follows the list).
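
A sketch of that flow is shown below. It is not the exact Avito tooling: host names and the ALTER statement itself are illustrative, and it uses PostgreSQL 10+ function names (on 9.x the equivalents are pg_xlog_replay_pause(), pg_xlog_replay_resume() and pg_last_xlog_replay_location()).

#!/bin/bash
# Sketch of applying DDL with two standbys in the read pool.

# 1. Take the first standby out of the read pool and pause WAL replay on it,
#    so the replayed ALTER does not have to wait for (or cancel) its long
#    read queries.
psql -h standby1 -c "SELECT pg_wal_replay_pause();"

# 2. Run the DDL on the primary; the second standby replays it as usual.
psql -h primary -d test -c "ALTER TABLE items ADD COLUMN new_option text;"
TARGET_LSN=$(psql -h primary -At -c "SELECT pg_current_wal_lsn();")

# 3. Switch read traffic to the second standby, then resume replay on the
#    first one and wait until the ALTER has been replayed there.
psql -h standby1 -c "SELECT pg_wal_replay_resume();"
until [ "$(psql -h standby1 -At -c \
      "SELECT pg_last_wal_replay_lsn() >= '${TARGET_LSN}'::pg_lsn;")" = "t" ]; do
  sleep 1
done

# 4. The first standby is consistent again and can be returned to the pool.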
  • The archive is not a single point of failure. WAL is normally archived to two archives, or at least to one of them (in case the other one has crashed or become unavailable). Running a WAL synchronization procedure after each backup is the way to deal with temporary unavailability of one of the two archive servers (a sketch of such an archive_command follows this list).
  • With the help of the new archive schema we expanded the interval available for PITR and the number of backups. We use the archives in turn: for example, the 1st backup goes to the 1st archive, the 2nd backup to the 2nd archive, the 3rd backup to the 1st archive, the 4th backup to the 2nd archive.
  • all the data has been transferred either to the standby or to the archive; in other words, the archive and the standby are in different states.
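
A minimal sketch of such an archive_command is shown below. It is an assumption about how the scheme could look rather than Avito's actual tooling: the host names archive1/archive2, the /wal_archive path, and the use of rsync over ssh are illustrative; the only point it demonstrates is "succeed if at least one archive accepted the segment".

#!/bin/bash
# wal_archive.sh %p %f
# Ship a WAL segment to two archive hosts; report success to PostgreSQL
# if at least one copy landed. A separate job later copies missing segments
# between the archives (the WAL synchronization procedure mentioned above).
WAL_PATH="$1"
WAL_NAME="$2"
STATUS=1
for HOST in archive1 archive2; do
  if rsync -a "${WAL_PATH}" "${HOST}:/wal_archive/${WAL_NAME}"; then
    STATUS=0
  fi
done
exit ${STATUS}

# postgresql.conf:
#   archive_mode = on
#   archive_command = '/usr/local/bin/wal_archive.sh %p %f'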

Standbys pool

The possible lag of a standby and the other problems I have described above made us use a standbys pool in Avito's infrastructure. A standby serves read traffic only while its lag stays within limits; the membership check works roughly like this:

if master
    then false
if lag > max
    then create file and return false
if lag > min and file exists
    then return false
if lag < min and file exists
    then remove file and return true
else
    true
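
Below is a minimal runnable sketch of that check, written as a health-check script that a load balancer could call on each standby. The thresholds, the marker-file path, and the exit-code convention (0 = keep the node in the pool) are assumptions, not Avito's exact implementation; the marker file is what gives the check its hysteresis, so a node that fell far behind rejoins the pool only after its lag drops below the lower threshold.

#!/bin/bash
# Pool membership check with hysteresis (exit 0 = serve read traffic).
MIN_LAG=30                      # seconds: rejoin the pool below this lag
MAX_LAG=300                     # seconds: leave the pool above this lag
MARKER=/tmp/standby_is_lagging  # the "file" from the pseudocode above

# "if master then false": the primary is not part of the read pool.
if [ "$(psql -At -c 'SELECT pg_is_in_recovery();')" != "t" ]; then
  exit 1
fi

# Note: on an idle primary this measure keeps growing even with no real lag.
LAG=$(psql -At -c \
  "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)::int;")

if [ "${LAG}" -gt "${MAX_LAG}" ]; then
  touch "${MARKER}"; exit 1        # too far behind: drop out of the pool
elif [ -e "${MARKER}" ] && [ "${LAG}" -gt "${MIN_LAG}" ]; then
  exit 1                           # still catching up: stay out
elif [ -e "${MARKER}" ]; then
  rm -f "${MARKER}"; exit 0        # caught up: rejoin the pool
else
  exit 0                           # healthy standby: serve read queries
fi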

Logical replication as an alternative tool to scale application

It is outside the scope of the current topic, but I want to mention that logical replication exists as an alternative to binary replication. With its help, the following Avito issues are successfully solved: the growth of data volume and of the number of requests to it, the scaling and distribution of the load, the delivery of data to the DWH and to the search subsystems, inter-base and inter-system data synchronization, etc. Michael Tyurin (ex chief architect) is the author of the majority of the core solutions mentioned above, implemented with the help of SkyTools.

Conclusions

  • analytical queries;
  • read queries when stale reads are not a problem.
Performing Upgrade
------------------
Analyzing all rows in the new cluster                       ok
Freezing all rows on the new cluster                        ok
Deleting files from new pg_clog                             ok
Copying old pg_clog to new server                           ok
Setting next transaction ID and epoch for new cluster       ok
Deleting files from new pg_multixact/offsets                ok
Copying old pg_multixact/offsets to new server              ok
Deleting files from new pg_multixact/members                ok
Copying old pg_multixact/members to new server              ok
Setting next multixact ID and offset for new cluster        ok
Resetting WAL archives                                      ok
connection to database failed: FATAL: database "template1" does not exist
could not connect to new postmaster started with the command:
"/home/test/inst/pg9.6/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "new/" -o "-p 50432 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/home/test/tmp/u'" start
Failure, exiting

Test script:

#!/bin/bash
PGOLD=~/inst/pg9.4/bin
PGNEW=~/inst/pg9.6/bin

${PGOLD}/pg_ctl init -s -D old -o "--lc-messages=C -T pg_catalog.english"
${PGNEW}/pg_ctl init -s -D new -o "--lc-messages=C -T pg_catalog.english"
echo vacuum_defer_cleanup_age=10000 >> new/postgresql.auto.conf

# move txid to 3 000 000 000 in old cluster as in production
${PGOLD}/pg_ctl start -w -D old -o "--port=54321 --unix_socket_directories=/tmp"
${PGOLD}/vacuumdb -h /tmp -p 54321 --all --analyze
${PGOLD}/vacuumdb -h /tmp -p 54321 --all --freeze
${PGOLD}/pg_ctl stop -D old -m smart
#
${PGOLD}/pg_resetxlog -x 3000000000 old
dd if=/dev/zero of=old/pg_clog/0B2D bs=262144 count=1
#
# move txid in new cluster bigger than vacuum_defer_cleanup_age may fix problem
# ${PGNEW}/pg_ctl start -w -D new -o "--port=54321 --unix_socket_directories=/tmp"
# echo "select txid_current();" | ${PGNEW}/pgbench -h /tmp -p 54321 -n -P 5 -t 100000 -f- postgres
# ${PGNEW}/pg_ctl stop -D new -m smart

${PGNEW}/pg_upgrade -k -d old/ -D new/ -b ${PGOLD}/ -B ${PGNEW}/
# rm -r new old pg_upgrade_* analyze_new_cluster.sh delete_old_cluster.sh