PostgreSQL 12


PostgreSQL provides a simple linear distance operator <-> (linear distance). We will use this to find points that are closest to a given location.

PostgreSQL provides a simple linear distance operator the data, and performing no optimizations and having no indexes, we see the following execution plan:

time psql -qtAc " 
EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT name, location FROM geonames ORDER BY location <-> '(29.9691,-95.6972)' LIMIT 5; 
" <-- closing quote 
 QUERY PLAN ----------------------------------------------------------------------------------------------------------- Limit (cost=418749.15..418749.73 rows=5 width=38) (actual time=2553.970..2555.673 rows=5 loops=1) Buffers: shared hit=100 read=272836 -> Gather Merge (cost=418749.15..1580358.21 rows=9955954 width=38) (actual time=2553.969..2555.669 rows=5 loops=1) Workers Planned: 2 Workers Launched: 2 Buffers: shared hit=100 read=272836 -> Sort (cost=417749.12..430194.06 rows=4977977 width=38) (actual time=2548.220..2548.221 rows=4 loops=3) Sort Key: ((location <-> '(29.9691,-95.6972)'::point)) Sort Method: top-N heapsort Memory: 25kB Worker 0: Sort Method: top-N heapsort Memory: 26kB Worker 1: Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=100 read=272836 -> Parallel Seq Scan on geonames (cost=0.00..335066.71 rows=4977977 width=38) (actual time=0.040..1637.884 rows=3982382 loops=3) Buffers: shared hit=6 read=272836 Planning Time: 0.493 ms Execution Time: 2555.737 ms real 0m2.595s user 0m0.011s sys 0m0.015s 

and here are the results: (the same results for all requests, so we’ll omit them later.)

name location
Cypress (29.96911,-95.69717)
Cypress Pointe Baptist Church (29.9732,-95.6873)
Cypress Post Office (29.9743,-95.67953)
Hot Wells (29.95689,-95.68189)
Dry Creek Airport (29.98571,-95.68597)

So, 418749.73 is the OPTIMIZER cost to beat, and it took two and a half seconds (2555.673) to execute that query. This is actually a very good result, using PostgreSQL with no optimizations at all against an 11 million row table. This is also why we selected a larger data set, as there would be very minimal difference using indexes against less than 10 millions rows. Parallel sequential scans are fantastic, but that’s another article.

Adding GiST index

We begin the optimization process by adding a GiST index. Because our example query has a

LIMIT 

clause of 5 items, we have a very high selectivity. This will encourage the planner to use an index, so we’ll provide one that works fairly well with geometry data.

time psql -qtAc "CREATE INDEX idx_gist_geonames_location ON geonames USING gist(location);" 

The act of creating the index has a bit of an expense.

CREATE INDEX real 3m1.988s user 0m0.011s sys 0m0.014s 

And then run the same query again.

time psql -qtAc " 
EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT name, location FROM geonames ORDER BY location <-> '(29.9691,-95.6972)' LIMIT 5; 
" 
 QUERY PLAN ---------------------------------------------------------------------------------- Limit (cost=0.42..1.16 rows=5 width=38) (actual time=0.797..0.881 rows=5 loops=1) Buffers: shared hit=5 read=15 -> Index Scan using idx_gist_geonames_location on geonames (cost=0.42..1773715.32 rows=11947145 width=38) (actual time=0.796..0.879 rows=5 loops=1) Order By: (location <-> '(29.9691,-95.6972)'::point) Buffers: shared hit=5 read=15 Planning Time: 0.768 ms Execution Time: 0.939 ms real 0m0.033s user 0m0.011s sys 0m0.013s 

In this case, we see some pretty dramatic improvement. The estimated cost of the query is only 1.16! Compare that to the original cost of the unoptimized query at 418749.73. The actual time taken was .939 seconds (nine tenths of a second), which compares to the 2.5 seconds of the original query. This result took less time to plan, got a dramatically better estimate, and took about a third of the runtime.

Let’s see if we can do better.

Adding an SP-GiST index

time psql -qtAc "CREATE INDEX idx_spgist_geonames_location ON geonames USING spgist(location);" 
CREATE INDEX real 1m25.205s user 0m0.010s sys 0m0.015s 

And then we run the same query again.

time psql -qtAc " 
EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT name, location FROM geonames ORDER BY location <-> '(29.9691,-95.6972)' LIMIT 5; 
" 
 QUERY PLAN ----------------------------------------------------------------------------------- Limit (cost=0.42..1.09 rows=5 width=38) (actual time=0.066..0.323 rows=5 loops=1) Buffers: shared hit=47 -> Index Scan using idx_spgist_geonames_location on geonames (cost=0.42..1598071.32 rows=11947145 width=38) (actual time=0.065..0.320 rows=5 loops=1) Order By: (location <-> '(29.9691,-95.6972)'::point) Buffers: shared hit=47 Planning Time: 0.122 ms Execution Time: 0.358 ms (7 rows) real 0m0.040s user 0m0.011s sys 0m0.015s 

Wow! Now using an SP-GiST index, the query cost only 1.09, and executed in .358 seconds (a third of a second).

Let’s examine some things about the indexes themselves, and see how they stack up to each other on disk.

Index comparisons

indexname creation time estimate query time indexsize plan time
unindexed 0S 418749.73 2555.673 0 .493
idx_gist_geonames_location 3M 1S 1.16 .939 s 868 MB .786
idx_spgist_geonames_location 1M 25S 1.09 .358 s 523 MB .122

Conclusions

So, we see that SP-GiST is twice the speed of GiST in execution, 8x faster to plan, and about 60% of the size on disk. And (relevant to this article) it also supports KNN index searching as of PostgreSQL 12. For this type of operation, we have a clear winner.

Setting up the data

For this article, we are going to use the data provided by the GeoNames Gazetteer.
This work is licensed under a Creative Commons Attribution 4.0 License
The Data is provided “as is” without warranty or any representation of accuracy, timeliness or completeness.

Create the structure

We start the process by creating a working directory and a little bit of ETL.

 cd mkdir spgist cd spgist psql -qtAc " 
CREATE TABLE IF NOT EXISTS geonames ( geonameid integer primary key ,name text ,asciiname text ,alternatenames text ,latitude numeric(13,5) ,longitude numeric(13,5) ,feature_class text ,feature_code text ,country text ,cc2 text ,admin1 text ,admin2 bigint ,admin3 bigint ,admin4 bigint ,population bigint ,elevation bigint ,dem bigint ,timezone text ,modification date ); COMMENT ON COLUMN geonames.geonameid IS ' integer id of record in geonames database'; COMMENT ON COLUMN geonames.name IS ' name of geographical point (utf8) varchar(200)'; COMMENT ON COLUMN geonames.asciiname IS ' name of geographical point in plain ascii characters, varchar(200)'; COMMENT ON COLUMN geonames.alternatenames IS ' alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)'; COMMENT ON COLUMN geonames.latitude IS ' latitude in decimal degrees (wgs84)'; COMMENT ON COLUMN geonames.longitude IS ' longitude in decimal degrees (wgs84)'; COMMENT ON COLUMN geonames.feature_class IS ' http://www.geonames.org/export/codes.html, char(1)'; COMMENT ON COLUMN geonames.feature_code IS ' http://www.geonames.org/export/codes.html, varchar(10)'; COMMENT ON COLUMN geonames.country IS ' ISO-3166 2-letter country code, 2 characters'; COMMENT ON COLUMN geonames.cc2 IS ' alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters'; COMMENT ON COLUMN geonames.admin1 IS ' fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)'; COMMENT ON COLUMN geonames.admin2 IS ' code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) '; COMMENT ON COLUMN geonames.admin3 IS ' code for third level administrative division, varchar(20)'; COMMENT ON COLUMN geonames.admin4 IS ' code for fourth level administrative division, varchar(20)'; COMMENT ON COLUMN geonames.population IS ' bigint (8 byte int) '; COMMENT ON COLUMN geonames.elevation IS ' in meters, integer'; COMMENT ON COLUMN geonames.dem IS ' digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.'; COMMENT ON COLUMN geonames.timezone IS ' the iana timezone id (see file timeZone.txt) varchar(40)'; COMMENT ON COLUMN geonames.modification IS ' date of last modification in yyyy-MM-dd format'; 
" 

ETL

wget http://download.geonames.org/export/dump/allCountries.zip unzip allCountries.zip IFS=$'\n' for line in $(<allCountries.txt) do echo -n "$line" | psql -qtAc 
 "COPY geonames FROM STDIN WITH CSV DELIMITER E'\t';" 
errors.txt done 

Clean up and Set up

Everything else we do from inside psql:

psql 
 CREATE EXTENSION postgis; CREATE EXTENSION postgis_topology; ALTER TABLE geonames ADD COLUMN location point; UPDATE geonames SET location = ('(' || latitude || ', ' || longitude || ')')::point; DELETE FROM geonames WHERE latitude IS NULL or longitude IS NULL; CLUSTER geonames USING geonames_pkey;