The dangers of streaming across versions of glibc: A cautionary tale


The following is a real critical problem that we ran into here at TripAdvisor, but have not yet figured out a clear way to mitigate.

TL;DR:
Streaming replicas—and by extension, base backups—can become dangerously broken when the source and target machines run slightly different versions of glibc. In particular, differences in strcoll and strcoll_l leave "corrupt" indexes on the slave: indexes that are sorted out of order with respect to the strcoll running on the slave. Because postgres is unaware of the discrepancy, it uses these "corrupt" indexes to perform merge joins; merge joins rely heavily on the assumption that the indexes are sorted, so all results of the join past the first poison-pill entry are silently not returned. Additionally, if the slave becomes master, the "corrupt" indexes will in some cases be unable to enforce uniqueness, and will quietly allow duplicate values.

Context:
We were doing a hardware upgrade on a large internal machine a couple of months ago. We followed a common procedure here: stand up the new HA pair as streaming replicas of the old system, then fail over to the new pair. All systems involved were running 9.1.9 (though, as we'll see, that is not relevant), built from source.

Immediately after the failover we saw some weird cases with some small indexes. We thought it was because the streaming replication failover had gone poorly (and because we weren't running the latest version of postgres on that machine), so we rebuilt them and moved on. Then, last week, an important query stopped being optimized as a hash join and turned into a merge join. From that query I generated a simple, single-column join between two tables. That query returns 50 million rows with merge joins disabled and 0 rows with them enabled. Rebuilding the index fixed the issue, but this was an important table, so we did some digging.

Using some query optimizer coercion, I was able to show that 1. the "corrupt" index had the same number of rows as the table, and 2. the index returned rows in a different, but nearly identical, ordering to the one you would get by explicitly sorting the column. Taking a pair of rows that were out of place, I managed to narrow the issue down. Luckily, we were able to find the old server sitting on the floor. The simplest form of the issue is:

SELECT 'M' > 'ஐ';

Root cause:
Depending on your charset, the first character might look like an ASCII 'M'. It is not. The two characters in question are the UTF-8 representations of http://www.fileformat.info/info/unicode/char/ff2d/index.htm and http://www.fileformat.info/info/unicode/char/0b90/index.htm respectively. Across different machines running the same version of postgres, in databases with identical character encodings and collations ('en_US.UTF-8'), that select will return different results if the version of glibc is different. This holds whether one pg instance is a pg_basebackup of the other, or whether you run an initdb and then immediately start postgres.

Digging further led me to: master:src/backend/utils/adt/varlena.c:1494,1497. These are the lines where postgres calls strcoll_l and strcoll in order to sort strings in a locale-aware manner.

In the simplest case, the attached C file returns inconsistent results across glibc versions and environments. It simply sets the collation to 'en_US.UTF-8' and then compares two one-character strings (the characters above). Depending on the version of glibc you are running, you may see a positive, negative, or zero output. I have observed:

Old Server (CentOS 5.8, kernel 2.6.18-308.24.1.el5):
0 -> glibc-2.5-81.el5_8.7
1 -> glibc-devel-2.5-81.el5_8.7, statically linked
0 -> Source builds of glibc (2.5, 2.6, 2.10)

New Server (CentOS 6.4, kernel 2.6.32-358.el6.x86_64):
-1 -> glibc-2.12-1.132.el6.x86_64

Dev server (CentOS 6.4, kernel 2.6.32-358.el6.x86_64):
-1 -> glibc-2.12-1.107.el6_4.5.x86_64
-1 -> Source builds (2.12.2, 2.16, 2.18)
15 -> Source build (HEAD)

Laptop (Ubuntu, kernel 3.11.0-12-generic):
-1 -> 2.17-93ubuntu4
15 -> 2.17-93ubuntu4, statically linked

Mac OS (for comparison only):
62365 -> OSX 10.8, 10.9

From my digging, I have been unable to figure out why glibc returns different results in different situations. It is probably worth starting a discussion on their mailing lists as well.

Regardless, the reality is that there are different versions of glibc out in the wild, and they do not sort consistently across versions/environments. Streaming replicas rely on the assumption that the sort order within a collation is consistent across machines. If the machines differ ever so slightly, then a single pair of rows with poison-pill characters can render an entire index invalid. We were lucky that the first discrepancy was at the beginning of the index. If it had been 85% of the way through, we probably would never have noticed that merge joins were broken on that machine.

We are still discussing internally how we plan to prevent this in the future, but we'd like to open this up to the community for wider discussion.

- Matt K, TripAdvisor

Attachment Content-Type Size
localetest.c application/octet-stream 387 bytes

On Wed, Aug 6, 2014 at 09:24:17PM +0000, Matthew Kelly wrote:
> The following is a real critical problem that we ran into here at TripAdvisor, but have yet figured out a clear way to mitigate.
>
> TL;DR:
> Streaming replicas—and by extension, base backups—can become dangerously broken when the source and target machines run slightly different versions of glibc. Particularly, differences in strcoll and strcoll_l leave "corrupt" indexes on the slave. These indexes are sorted out of order with respect to the strcoll running on the slave. Because postgres is unaware of the discrepancy is uses these "corrupt" indexes to perform merge joins; merges rely heavily on the assumption that the indexes are sorted and this causes all the results of the join past the first poison pill entry to not be returned. Additionally, if the slave becomes master, the "corrupt" indexes will in cases be unable to enforce uniqueness, but quietly allow duplicate values.

No surprise; I have been expecting to hear about such breakage, and am surprised we hear about it so rarely. We really have no way of testing for breakage either. :-(

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

On Wed, Aug 6, 2014 at 5:11 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> No surprise; I have been expecting to hear about such breakage, and am surprised we hear about it so rarely. We really have no way of testing for breakage either. :-(

I guess that Trip Advisor were using some particular collation that had a chance of changing. Sorting rules for English text (so, say, en_US.UTF-8) are highly unlikely to change. That might be much less true for other locales.

Unicode Technical Standard #10 states:

"""
Collation order is not fixed.

Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language that require changes; and finally, new characters added to the Unicode Standard will interleave with the previously-defined ones. This means that collations must be carefully versioned.

"""

So, the reality is that we only have ourselves to blame. :-(

LC_IDENTIFICATION serves this purpose on glibc. Here is what en_US
looks like on my machine:

"""
escape_char /
comment_char %
% Locale for English locale in the USA
% Contributed by Ulrich Drepper <drepper(at)redhat(dot)com>, 2000

LC_IDENTIFICATION
title "English locale for the USA"
source "Free Software Foundation, Inc."
address "59 Temple Place - Suite 330, Boston, MA 02111-1307, USA"
contact ""
email "bug-glibc-locales(at)gnu(dot)org"
tel ""
fax ""
language "English"
territory "USA"
revision "1.0"
date "2000-06-24"
%
category "en_US:2000";LC_IDENTIFICATION
category "en_US:2000";LC_CTYPE
category "en_US:2000";LC_COLLATE
category "en_US:2000";LC_TIME
category "en_US:2000";LC_NUMERIC
category "en_US:2000";LC_MONETARY
category "en_US:2000";LC_MESSAGES
category "en_US:2000";LC_PAPER
category "en_US:2000";LC_NAME
category "en_US:2000";LC_ADDRESS
category "en_US:2000";LC_TELEPHONE
*** SNIP ***

"""

This is a GNU extension [1]. If the OS adds a new version of a collation, that probably accidentally works a lot of the time, because the collation rule added or removed was fairly esoteric anyway; such is the nature of these things. If it was something that came up a lot, it would surely have been settled by standardization years ago.

If OS vendors are not going to give us a standard API for versioning, we're hosed. I thought about suggesting that we hash a strxfrm() blob for about 2 minutes, before realizing that that's a stupid idea. Glibc would be a good start.

[1] https://www.gnu.org/software/autoconf/manual/autoconf-2.63/html_node/Special-Shell-Variables.html

--
Regards,

Peter Geoghegan

> Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language that require changes; and finally, new characters added to the Unicode Standard will interleave with the previously-defined ones. This means that collations must be carefully versioned.

Another idea could be having our own collation data to isolate any changes from the outside world. I vaguely recall this has been discussed before.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

On Wed, Aug 6, 2014 at 6:30 PM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
> Another idea could be having our own collation data to isolate any changes from outside world. I vaguley recall this had been discussed before.

That's probably the best solution. It would not be the first time that we decided to stop relying on the operating system's facilities due to various problems (e.g. we used to use the C standard library qsort() until about 2006). The only problem is that it's a lot of work. One possible solution that has been proposed is to adopt ICU [1]. That might allow us to say "this is the official way that PostgreSQL 9.6 sorts Japanese; you may use the old way if you want, but it's incompatible with the new way". ICU would give us a standard versioning interface [2]. They seem to take this seriously, and are aware of our considerations around B-Tree indexes on text.

[1] https://wiki.postgresql.org/wiki/Todo:ICU
[2] http://userguide.icu-project.org/collation/architecture#TOC-Versioning

--
Regards,

Peter Geoghegan

We are currently running with the en_US.UTF-8 collation. It was a decision made long ago, and seeing as we never actually rely on the sort order of internationalized strings (other than for stability, apparently), we have never had any motivation to change this practice.

Some way of versioning collations that is not tied to glibc seems immensely appealing. Without a good way of testing the differences between glibc sort versions, the only safe thing to do at the moment seems to be to guarantee that all streaming replicas run from the exact same OS image. Which is fine until you want to upgrade your OS, and then you need to do a dump-restore instead of being able to upgrade in a rolling fashion.

To Bruce's point, the way I was able to test for this issue in a particular index was (approximately):

-- Assuming textfield is what the index is on, this causes the query planner
-- to scan the index and give the position in the index.
CREATE TABLE index_order AS (SELECT textfield, dense_rank() OVER (ORDER BY textfield) AS i_order FROM table);
-- No index here, so postgres must sort.
CREATE TABLE both_orders AS (SELECT textfield, i_order, dense_rank() OVER (ORDER BY textfield) AS sort_order FROM index_order);
-- If this doesn't return zero, you have a problem.
SELECT count(*) FROM both_orders WHERE i_order <> sort_order;

This method is really slow on a big table, and I'm not going to promise it always works, but that is how we found the root cause.

- Matt K

On Aug 6, 2014, at 9:46 PM, Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com> wrote:

> On Wed, Aug 6, 2014 at 6:30 PM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>> Another idea could be having our own collation data to isolate any changes from outside world. I vaguley recall this had been discussed before.
>
> That's probably the best solution. It would not be the first time that we decided to stop relying on the operating system's facilities due to various problems (e.g. we used to use the C standard library qsort() until about 2006). The only problem is that it's a lot of work. One possible solution that has been proposed is to adopt ICU [1]. That might allow us to say "this is the official way that PostgreSQL 9.6 sorts Japanese; you may use the old way if you want, but it's incompatible with the new way". ICU would give us a standard versioning interface [2]. They seem to take this seriously, and are aware of our considerations around B-Tree indexes on text.
>
> [1] https://wiki.postgresql.org/wiki/Todo:ICU
> [2] http://userguide.icu-project.org/collation/architecture#TOC-Versioning
> --
> Regards,
> Peter Geoghegan

On Thu, Aug 7, 2014 at 03:07:04PM +0000, Matthew Kelly wrote:
> We are currently running with the en_US.UTF-8 collation. It was a decision made long ago, and seeing as we never actually rely on the sort order of internationalized strings (other than for stability, apparently), we have never had any motivation to change this practice.
>
> Some way of versioning collations, which is not tied to glibc seems immensely appealing. Without a good way of testing the differences between glibc sort versions, it seems the only safe thing to do at the moment is to guarantee all streaming replica's run from the exact same OS image. Which is fine until you want to upgrade your OS, and need to do a dump-restore instead of being able to do that in a rolling fashion.
>
> To Bruce's point the way I was able to test for this issue in a particular index was (approximately):
> --Assuming textfield is what the index is on, this causes the query planner to scan the index and give the position in the index.
> CREATE TABLE index_order (SELECT textfield, dense_rank() OVER (ORDER BY textfield) as i_order FROM table);
> --No index here, postgres must sort
> CREATE TABLE both_order as (SELECT textfield, i_order, dense_rank() OVER (ORDER BY textfield) as sort_order FROM index_order);
> -- If this doesn't return zero, you have a problem
> SELECT count(*) FROM both_orders WHERE i_order <> sort_order;
>
> This method is really slow on a big table, and I'm not going to promise it always works, but that is how we found the root cause.

We could walk the index looking for inconsistent btree splits, e.g. the split doesn't match the ordering returned by the existing collation functions.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

On Thu, Aug 7, 2014 at 9:46 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> We could walk the index looking for inconsistent btree splits, e.g. the split doesn't match the ordering returned by the existing collation functions.

I'm not sure I follow. I don't think that a tool like my btreecheck tool will necessarily be able to catch anything like this on a standby. Maybe it will, but that isn't guaranteed. For example, the difference in collation rules in question might just not have cropped up yet, but it's still a ticking time-bomb. Or, there are only differences affecting values on internal pages. Things break down very quickly.

In general, once there is an undetected inconsistency in collations between replicas, that means that the varlena B-Tree support function number 1 can violate various invariants that all operator classes must obey. I doubt we want to get into the business of working backwards from a broken state of affairs like that to figure out there is a problem. Rather, I really do think we're compelled to offer better versioning of collations, using a versioning interface like glibc's LC_IDENTIFICATION. There is no other way to properly fix the problem. This is a problem that is well understood, and anticipated by the Unicode consortium.

--
Regards,

Peter Geoghegan