20180920_DBTS_PGStrom_EN

By Kohei KaiGai

  1. NVME and GPU accelerates PostgreSQL beyond the limitation 〜Our challenge to 10GB/s of query execution performance〜 HeteroDB,Inc Chief Architect & CEO KaiGai Kohei <kaigai@heterodb.com>
  2. Here are mysterious benchmark results. Benchmark conditions:
      PostgreSQL v11beta3 + PG-Strom v2.1devel on a single-node server system
      13 queries of the Star Schema Benchmark against the 1055GB data set
     [Chart: Star Schema Benchmark for PostgreSQL 11beta3 + PG-Strom v2.1devel — query execution throughput (MB/s) for Q1_1..Q4_3; PG-Strom v2.1devel reaches max 13.5GB/s in query execution throughput on single-node PostgreSQL]
  3. about HeteroDB,Inc
     Corporate overview
      Name: HeteroDB,Inc
      Established: 4th-Jul-2017
      Headcount: 2 (KaiGai and Kashiwagi)
      Location: Shinagawa, Tokyo, Japan
      Businesses: sales of an accelerated database product; technical consulting in the GPU & DB region
     With heterogeneous-computing technology in the database area, we provide a useful, fast and cost-effective data analytics platform for all the people who need the power of analytics.
     CEO Profile
      KaiGai Kohei – He has contributed to PostgreSQL and Linux kernel development in the OSS community for more than ten years, especially on security and database federation features of PostgreSQL.
      Award of “Genius Programmer” by the IPA MITOH program (2007)
      Top-5 posters finalist at GPU Technology Conference 2017
  4. Core technology – PG-Strom
     PG-Strom: an extension module for PostgreSQL, to accelerate SQL workloads with the thousands of cores and wide-band memory of the GPU — GPU big-data analytics, rapid mass data loading from the storage device, machine-learning & statistics.
     Features of RDBMS:
      High-availability / Clustering
      DB administration and backup
      Transaction control
      BI and visualization
      We can use the products that support PostgreSQL as-is.
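     Setup is a standard PostgreSQL extension install. A minimal sketch; per the PG-Strom documentation the module must be preloaded at server start, and the exact library path may differ per installation:

         # postgresql.conf
         shared_preload_libraries = 'pg_strom'

         -- then, inside the target database:
         CREATE EXTENSION pg_strom;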
  5. Characteristics of GPU (1/3)
     Highly parallel computing processor with thousands of cores and hundreds of GB/s of memory bandwidth on a single chip.
     CPU: like a passenger vehicle; easy to use, but limited transport capacity.
     GPU: like a high-speed railway; a little troublesome to get in or out, but capable of mass transportation.

         Model              Intel Xeon Platinum 8180M   NVIDIA Tesla V100
         Architecture       Skylake-SP                  Volta
         # of cores         28 (functional)             5120 (simple)
         Performance (FP32) 2.24 TFLOPS (with AVX2)     15.0 TFLOPS
         Memory capacity    max 1.5TB (DDR4)            16GB (HBM2)
         Memory bandwidth   127.8GB/s                   900GB/s
         TDP                205W                        300W
  6. Characteristics of GPU (2/3) – Reduction algorithm
     Calculation of the total sum of an array by GPU: Σ i=0..N-1 item[i] is computed in log2 N steps, using hardware-supported synchronization between the cores.
     [Diagram: item[0]..item[15] pairwise-summed across steps 1..4]
     The same logic is used internally to implement aggregate functions:
         SELECT count(X), sum(Y), avg(Z) FROM my_table;
  7. Characteristics of GPU (3/3)
     Over 10 years of history in HPC, then massive popularization in machine learning — computer graphics, simulation, machine-learning; NVIDIA Tesla V100; supercomputers (TITECH TSUBAME3.0).
     Today’s topic: how can I/O workloads be accelerated by the GPU, a computing accelerator?
  8. How does PostgreSQL utilize the GPU? 〜Architecture of PG-Strom〜
  9. Construction of query execution plan in PostgreSQL (1/2)
     [Plan tree: Scan t0 + Scan t1 → Join t0,t1; + Scan t2 → Join (t0,t1),t2 → GROUP BY cat → ORDER BY score → LIMIT 100]
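     For reference, a query of roughly this shape would produce the plan tree above (table, column and join-key names are illustrative):

         SELECT cat, sum(score) AS score
           FROM t0
           JOIN t1 ON t0.aid = t1.aid
           JOIN t2 ON t0.bid = t2.bid
          GROUP BY cat
          ORDER BY score DESC
          LIMIT 100;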
  10. Construction of query execution plan in PostgreSQL (2/2)
      Multiple algorithms compete, and the planner chooses by “cost”.
      Scan t0 (statistics: nrows 1.2M, width 80, index: none) + Scan t1 (statistics: nrows 4000, width 120, index: t1.id) → Join t0,t1
      Candidates: HashJoin cost=4000, MergeJoin cost=12000, NestLoop cost=99999, Parallel Hash Join cost=3000 (built-in execution paths of PostgreSQL; parallel paths since v9.6) — versus GpuJoin cost=2500, proposed by an extension (possible since PostgreSQL v9.5). WINNER: GpuJoin t0,t1.
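      Because the competition is purely cost-based, the GPU paths can be hidden to compare plans. A minimal sketch, assuming the pg_strom.enabled GUC of this PG-Strom generation:

          EXPLAIN SELECT * FROM t0 JOIN t1 ON t0.id = t1.id;  -- GpuJoin expected to win on cost
          SET pg_strom.enabled = off;                         -- hide GPU paths from the planner
          EXPLAIN SELECT * FROM t0 JOIN t1 ON t0.id = t1.id;  -- falls back to Hash Join etc.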
  11. Interactions between PostgreSQL and PG-Strom with CustomScan
      As long as the results are consistent, the implementation is flexible.
      Plan: GroupAgg (key: cat) ← CustomScan (GpuJoin) ← SeqScan on t0, SeqScan on t1
      The CustomScan callbacks map onto PG-Strom as follows:
       (*BeginCustomScan) → ExecInitGpuJoin(...): initialize the GPU context; kick asynchronous JIT compilation of the auto-generated GPU program
       (*ExecCustomScan) → ExecGpuJoin(...): read records from t0 and t1 and copy them to the DMA buffer; kick asynchronous GPU tasks; fetch results from the completed GPU tasks, then pass them to the next step (GroupAgg)
       (*EndCustomScan) → ExecEndGpuJoin(...): wait for completion of the asynchronous tasks (if any); release GPU resources
  12. Auto generation of GPU code from SQL – Example of WHERE-clause
      E.g., on-demand transformation of the numeric formula in a WHERE-clause into CUDA C code, handed to the run-time compiler for parallel execution:

          QUERY: SELECT cat, count(*), avg(x) FROM t0
                  WHERE x BETWEEN y AND y + 20.0 GROUP BY cat;

          STATIC_FUNCTION(bool)
          gpupreagg_qual_eval(kern_context *kcxt,
                              kern_data_store *kds,
                              size_t kds_index)
          {
              /* reference to input data */
              pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
              pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);
              pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);
              /* SQL expression in CUDA source code */
              return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
                           pgfn_float8le(kcxt, KVAR_3,
                                         pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
          }
  13. EXPLAIN shows the query execution plan

      postgres=# EXPLAIN ANALYZE
                 SELECT cat,count(*),sum(ax) FROM tbl NATURAL JOIN t1
                  WHERE cid % 100 < 50 GROUP BY cat;
                                            QUERY PLAN
      ---------------------------------------------------------------------------------------------------
       GroupAggregate  (cost=203498.81..203501.80 rows=26 width=20) (actual time=1511.622..1511.632 rows=26 loops=1)
         Group Key: tbl.cat
         ->  Sort  (cost=203498.81..203499.26 rows=182 width=20) (actual time=1511.612..1511.613 rows=26 loops=1)
               Sort Key: tbl.cat
               Sort Method: quicksort  Memory: 27kB
               ->  Custom Scan (GpuPreAgg)  (cost=203489.25..203491.98 rows=182 width=20) (actual time=1511.554..1511.562 rows=26 loops=1)
                     Reduction: Local
                     Combined GpuJoin: enabled
                     ->  Custom Scan (GpuJoin) on tbl  (cost=13455.86..220069.26 rows=1797115 width=12) (never executed)
                           Outer Scan: tbl  (cost=12729.55..264113.41 rows=6665208 width=8) (actual time=50.726..1101.414 rows=19995540 loops=1)
                           Outer Scan Filter: ((cid % 100) < 50)
                           Rows Removed by Outer Scan Filter: 10047462
                           Depth 1: GpuHashJoin  (plan nrows: 6665208...1797115, actual nrows: 9948078...2473997)
                                    HashKeys: tbl.aid
                                    JoinQuals: (tbl.aid = t1.aid)
                                    KDS-Hash (size plan: 11.54MB, exec: 7125.12KB)
                           ->  Seq Scan on t1  (cost=0.00..2031.00 rows=100000 width=12) (actual time=0.016..15.407 rows=100000 loops=1)
       Planning Time: 0.721 ms
       Execution Time: 1595.815 ms
      (19 rows)

      What is happening here?
  14. GpuScan + GpuJoin + GpuPreAgg Combined Kernel (1/3)
      SELECT cat, count(*), avg(x) FROM t0 JOIN t1 ON t0.id = t1.id WHERE y LIKE '%abc%' GROUP BY cat;
      The query breaks down into three stages: SCAN (WHERE y LIKE '%abc%'), JOIN (t0 JOIN t1 ON t0.id = t1.id), and GROUP BY aggregation (count(*), avg(x) GROUP BY cat) → results.
      PostgreSQL executes them as SeqScan → HashJoin → Agg; PG-Strom replaces them with GpuScan → GpuJoin → GpuPreAgg + Agg.
  15. GpuScan + GpuJoin + GpuPreAgg Combined Kernel (2/3)
      Naive replacement of each logic step causes a ping-pong of data transfers between CPU and GPU: Storage → DMA buffer → GpuScan kernel → DMA buffer → GpuJoin kernel → DMA buffer → GpuPreAgg kernel → DMA buffer → Agg (PostgreSQL) → results.
  16. GpuScan + GpuJoin + GpuPreAgg Combined Kernel (3/3)
      Save the data transfers by exchanging data in GPU device memory: Storage → DMA buffer → combined GPU kernel for SCAN + JOIN + GROUP BY (GpuScan kernel → GPU buffer → GpuJoin kernel → GPU buffer → GpuPreAgg kernel) → DMA buffer → Agg (PostgreSQL) → results.
      Usually, the amount of data written back from the GPU (small) is much less than the amount sent to the GPU (large).
  17. Re-definition of the GPU’s role 〜How GPU accelerates I/O workloads〜
  18. A usual composition of an x86_64 server: CPU, RAM, GPU, SSD, HDD, N/W
  19. Data flow to process a massive amount of data
      Normal data flow: SSD → RAM → CPU → GPU over PCIe (PostgreSQL data blocks).
      All records, including junk, must be loaded onto RAM once, because software cannot check the necessity of a row prior to data loading; unless a record travels over the PCIe bus into CPU/RAM, software cannot judge whether it is “junk”. So the amount of I/O traffic over the PCIe bus tends to be large.
  20. Core Feature: SSD-to-GPU Direct SQL (v2.0)
      NVIDIA GPUDirect RDMA allows loading the data blocks on NVME-SSD to the GPU using peer-to-peer DMA over the PCIe bus, bypassing CPU/RAM.
      SQL runs on the GPU (WHERE-clause, JOIN, GROUP BY) to reduce the data size, so only a small amount of data reaches CPU/RAM.
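      A minimal sketch of how this surfaces to the user. The GUC name pg_strom.nvme_strom_enabled is an assumption about this PG-Strom generation; the “NVMe-Strom: enabled” plan line is taken from slide 39:

          SET pg_strom.nvme_strom_enabled = on;  -- assumed GUC; when off, reads fall back to the filesystem path
          EXPLAIN SELECT count(*) FROM lineorder WHERE lo_quantity < 25;
          -- the Custom Scan (GpuJoin/GpuPreAgg) nodes are expected to report “NVMe-Strom: enabled”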
  21. Benchmark Results – single-node version
      [Chart: Star Schema Benchmark on NVMe-SSD + md-raid0, Q1_1..Q4_3 — PgSQL9.6 (SSDx3): about 1,900〜2,172 MB/s per query; PGStrom2.0 (SSDx3): about 5,479〜6,283 MB/s, close to the H/W spec of 3xSSD]
      SSD-to-GPU Direct SQL pulls out an awesome performance, close to the H/W spec.
       Measured with the Star Schema Benchmark, a set of typical batch / reporting workloads.
       CPU: Intel Xeon E5-2650v4, RAM: 128GB, GPU: NVIDIA Tesla P40, SSD: Intel 750 (400GB; SeqRead 2.2GB/s) x3
       Size of the dataset is 353GB (sf: 401), to ensure an I/O-bound workload.
  22. Element technology – GPUDirect RDMA (1/2)
      ▌P2P data transfer technology between the GPU and other PCIe devices, bypassing the CPU
       Originally designed for multi-node MPI over Infiniband
       Provides Linux kernel driver infrastructure for other PCIe devices, including NVME-SSDs
      Copyright (c) NVIDIA corporation, 2015
  23. Element technology – GPUDirect RDMA (2/2)
      GPUDirect RDMA enables mapping GPU device memory into the physical address space of the host system (e.g., GPU device memory in the PCIe BAR1 area at 0xe0000000, RAM at 0xf0000000).
      Once a “physical address of GPU device memory” appears, we can use it as the source or destination address of DMA with PCIe devices such as NVMe-SSDs or Infiniband HBAs.
      [Diagram: DMA request — SRC: 1200th sector, LEN: 40 sectors, DST: 0xe0200000]
  24. SSD-to-GPU Direct SQL – Software Stack (v2.0)
      Application Software — SQL Interface
      Database Software Layer: PostgreSQL + pg_strom extension [developed by HeteroDB], calling read(2) / ioctl(2)
      Operating System Software Layer: filesystem (ext4, xfs), nvme driver (inbox), nvme_strom kernel module [developed by HeteroDB], NVIDIA kernel driver, NVIDIA GPUDirect RDMA, NVIDIA CUDA Toolkit [developed by others]
      Hardware Layer: commodity x86_64 hardware, NVMe SSD drives, Tesla GPU
      Two I/O paths: one based on normal filesystem I/O, and one based on SSD-to-GPU Direct SQL Execution
  25. Run faster, beyond the limitation
  26. Approach① – Faster NVME-SSD (1/2)
      Intel DC P4600 (2.0TB, HHHL): SeqRead 3200MB/s, SeqWrite 1575MB/s; RandRead 610k IOPS, RandWrite 196k IOPS; Interface: PCIe 3.0 (x4)
      Can we pull out the maximum performance of them?
  27. Approach① – Faster NVME-SSD (2/2)
      Broadwell-EP is capable of up to 7.1GB/s of P2P DMA routing performance.
  28. Approach② – The latest CPU generation
      Supermicro 1019GP-TT — CPU: Xeon Gold 6126T (2.6GHz, 12C); RAM: 192GB (32GB DDR4-2666 x6); GPU: NVIDIA Tesla P40 (3840C, 24GB) x1; SSD: Intel SSD DC P4600 (2.0TB, HHHL) x3; HDD: 2.0TB (SATA, 7.2krpm) x6; N/W: 10Gb ethernet x2 ports
  29. Approach② – The latest CPU generation
      Skylake-SP improved the P2P DMA routing performance to 8.5GB/s.
      [Diagram: SSD-1/2/3 (md-raid0) and GPU, with routing by the Xeon Gold 6126T CPU]
  30. Consideration for the hardware configuration
      ① SSD and GPU are connected to the same PCIe-switch (PLX): OK
      ② CPU controls the PCIe-bus, and SSD and GPU are directly connected to the same CPU: workable
      ③ SSD and GPU are connected to different CPUs (across QPI): not supported
      A pair of SSD and GPU must be under a particular CPU or PLX (PCIe-switch); PLX is more preferable than CPU.
      Which kind of hardware can provide an optimal data path with a PCIe-switch?
  31. Simple solution) HPC servers optimized for RDMA
      NVIDIA GPUDirect RDMA was originally designed for multi-node MPI, so some HPC servers are optimized for P2P DMA between GPU and Infiniband HBA (e.g., Supermicro SYS-4029GP-TRT2).
  32. Practical solution) Utilization of I/O Expansion Boxes
      Reduce the traffic on the host PCIe-bus by detaching the SSD-to-GPU data flow into the box: each NEC ExpEther 40G (4slot) box carries a PCIe switch with GPU and SSD slots (GPU-0/GPU-1, SSD-0..3), and only small pre-processed data (= less traffic) crosses the HBAs to the host CPU / RAM.
  33. Optimization of the Storage Path with I/O Expansion Boxes
  34. System configuration with I/O expansion boxes (v2.1)
      Each PCIe I/O expansion box holds NVMe SSDs (PostgreSQL tables / data blocks) and a GPU behind an internal PCIe switch; SSD-to-GPU P2P DMA moves the large data inside the box, and only pre-processed small data crosses PCIe-over-Ethernet (NIC / HBA) to the host system (x86_64 server).
      Each box delivers a few GB/s of SQL execution performance, and the boxes are visible as leaves of partitioned child-tables on PostgreSQL — enhancement of capacity & performance, with simplified DB operations and APP development thanks to the simple single-node PostgreSQL configuration.
  35. Table partitioning considering the hardware (1/2) (v2.1)
      Associate partition leaves with tablespaces, and tablespaces with I/O expansion boxes: lineorder is split into lineorder_p0 / lineorder_p1 / lineorder_p2 (remainder = 0 / 1 / 2) on tablespaces nvme0 / nvme1 / nvme2, while the small dimension tables (customer, date, supplier, parts) stay on the host. A DDL sketch follows below.
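      A minimal sketch of the tablespace side; the directory paths are hypothetical:

          -- one tablespace per I/O expansion box (paths are illustrative)
          CREATE TABLESPACE nvme0 LOCATION '/opt/nvme0/pgdata';
          CREATE TABLESPACE nvme1 LOCATION '/opt/nvme1/pgdata';
          CREATE TABLESPACE nvme2 LOCATION '/opt/nvme2/pgdata';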
  36. Table partitioning considering the hardware (2/2) (v2.1)
      New in PostgreSQL v11: data distribution by hash partitioning. On INSERT, the key is hashed (hash = f(key)) and the row is routed to the leaf with the matching remainder (hash % 3 = 0 → lineorder_p0 on nvme0, hash % 3 = 2 → lineorder_p2 on nvme2, and so on), so 1053GB of raw data spreads into three partial tables of 351GB each.
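      A sketch of this layout in PostgreSQL v11 syntax; the column list and the choice of lo_orderkey as the partition key are illustrative:

          CREATE TABLE lineorder (
              lo_orderkey   bigint,
              lo_orderdate  integer,
              lo_quantity   numeric
              -- ... remaining SSBM columns ...
          ) PARTITION BY HASH (lo_orderkey);

          CREATE TABLE lineorder_p0 PARTITION OF lineorder
              FOR VALUES WITH (MODULUS 3, REMAINDER 0) TABLESPACE nvme0;
          CREATE TABLE lineorder_p1 PARTITION OF lineorder
              FOR VALUES WITH (MODULUS 3, REMAINDER 1) TABLESPACE nvme1;
          CREATE TABLE lineorder_p2 PARTITION OF lineorder
              FOR VALUES WITH (MODULUS 3, REMAINDER 2) TABLESPACE nvme2;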
  37. Partition-wise GpuJoin/GpuPreAgg (1/3) (v2.1)
      New in PostgreSQL v11: parallel scan of the partition leaves — Scan on each of lineorder_p0/p1/p2 → Gather → Join with the dimension tables → Agg → query results.
      Problem: the massive number of records makes the Gather step hard.
  38. Partition-wise GpuJoin/GpuPreAgg (2/3) (v2.1)
      Preferable: gather the partition leaves after JOIN / GROUP BY — per-leaf Scan → Join → PreAgg, then Gather → Agg → query results, so only pre-aggregated rows pass through Gather. (See the GUC sketch below.)
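      On stock PostgreSQL v11 the analogous push-down below Append is controlled by planner GUCs, off by default; a sketch, independent of PG-Strom’s own partition-wise logic (enable_partitionwise_join additionally requires both join relations to share the partitioning scheme):

          SET enable_partitionwise_aggregate = on;   -- PostgreSQL v11, default off
          SET enable_partitionwise_join = on;        -- PostgreSQL v11, default off
          EXPLAIN SELECT lo_orderdate, sum(lo_extendedprice)
                    FROM lineorder
                   GROUP BY lo_orderdate;
          -- partial aggregation per leaf is expected under Append, finalized after Gather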
  39. Partition-wise GpuJoin/GpuPreAgg (3/3) (v2.1)

      ssbm=# EXPLAIN SELECT sum(lo_extendedprice*lo_discount) as revenue
             FROM lineorder,date1
             WHERE lo_orderdate = d_datekey
               AND d_year = 1993
               AND lo_discount between 1 and 3
               AND lo_quantity < 25;
                                        QUERY PLAN
      ------------------------------------------------------------------------------
       Aggregate
         ->  Gather
               Workers Planned: 9
               ->  Parallel Append
                     ->  Parallel Custom Scan (GpuPreAgg)
                           Reduction: NoGroup
                           Combined GpuJoin: enabled
                           GPU Preference: GPU2 (Tesla P40)
                           ->  Parallel Custom Scan (GpuJoin) on lineorder_p2
                                 Outer Scan: lineorder_p2
                                 Outer Scan Filter: ((lo_discount >= '1'::numeric) AND
                                                    (lo_discount <= '3'::numeric) AND
                                                    (lo_quantity < '25'::numeric))
                                 Depth 1: GpuHashJoin (nrows 102760469...45490403)
                                          HashKeys: lineorder_p2.lo_orderdate
                                          JoinQuals: (lineorder_p2.lo_orderdate = date1.d_datekey)
                                          KDS-Hash (size: 66.03KB)
                                 GPU Preference: GPU2 (Tesla P40)
                                 NVMe-Strom: enabled
                                 ->  Seq Scan on date1
                                       Filter: (d_year = 1993)
                     ->  Parallel Custom Scan (GpuPreAgg)
                           Reduction: NoGroup
                           Combined GpuJoin: enabled
                           GPU Preference: GPU1 (Tesla P40)
                     :
      The portion shown in full is executed on the 3rd I/O expansion box.
  40. Distance between SSD and GPU (1/2) (v2.1)
      GPU selection is based on the distance from the SSD that PG-Strom tries to scan: pairing a leaf (e.g., lineorder_p1 on nvme1) with the GPU in the same box is good; pairing it with a GPU in another box is not good. (Leaves lineorder_p0/p1/p2, remainder = 0/1/2, on tablespaces nvme0/nvme1/nvme2; dimension tables customer, date, supplier, parts.)
  41. Distance between SSD and GPU (2/2) (v2.1)
      Auto selection of the optimal GPU according to the distance between PCIe devices:

      $ pg_ctl restart
        :
      LOG: - PCIe[0000:80]
      LOG:   - PCIe(0000:80:02.0)
      LOG:     - PCIe(0000:83:00.0)
      LOG:       - PCIe(0000:84:00.0)
      LOG:         - PCIe(0000:85:00.0) nvme0 (INTEL SSDPEDKE020T7)
      LOG:       - PCIe(0000:84:01.0)
      LOG:         - PCIe(0000:86:00.0) GPU0 (Tesla P40)
      LOG:       - PCIe(0000:84:02.0)
      LOG:         - PCIe(0000:87:00.0) nvme1 (INTEL SSDPEDKE020T7)
      LOG:   - PCIe(0000:80:03.0)
      LOG:     - PCIe(0000:c0:00.0)
      LOG:       - PCIe(0000:c1:00.0)
      LOG:         - PCIe(0000:c2:00.0) nvme2 (INTEL SSDPEDKE020T7)
      LOG:       - PCIe(0000:c1:01.0)
      LOG:         - PCIe(0000:c3:00.0) GPU1 (Tesla P40)
      LOG:       - PCIe(0000:c1:02.0)
      LOG:         - PCIe(0000:c4:00.0) nvme3 (INTEL SSDPEDKE020T7)
      LOG:   - PCIe(0000:80:03.2)
      LOG:     - PCIe(0000:e0:00.0)
      LOG:       - PCIe(0000:e1:00.0)
      LOG:         - PCIe(0000:e2:00.0) nvme4 (INTEL SSDPEDKE020T7)
      LOG:       - PCIe(0000:e1:01.0)
      LOG:         - PCIe(0000:e3:00.0) GPU2 (Tesla P40)
      LOG:       - PCIe(0000:e1:02.0)
      LOG:         - PCIe(0000:e4:00.0) nvme5 (INTEL SSDPEDKE020T7)
      LOG: GPU<->SSD Distance Matrix
      LOG:          GPU0   GPU1   GPU2
      LOG:  nvme0   ( 3)     7      7
      LOG:  nvme5     7      7    ( 3)
      LOG:  nvme4     7      7    ( 3)
      LOG:  nvme2     7    ( 3)     7
      LOG:  nvme1   ( 3)     7      7
      LOG:  nvme3     7    ( 3)     7
  42. Benchmark (1/3) – System configuration (v2.1)
      NEC Express5800/R120h-2m x1 — CPU: Intel Xeon E5-2603 v4 (6C, 1.7GHz); RAM: 64GB; OS: Red Hat Enterprise Linux 7 (kernel: 3.10.0-862.9.1.el7.x86_64), CUDA-9.2.148 + driver 396.44; DB: PostgreSQL 11beta3 + PG-Strom v2.1devel
      NEC ExpEther 40G (4slots) x3 — I/F: PCIe 3.0 x8 (x16 physical), 4 slots with internal PCIe switch; N/W: 40Gb-ethernet
      Intel DC P4600 (2.0TB; HHHL) x6 — SeqRead: 3200MB/s, SeqWrite: 1575MB/s; RandRead: 610k IOPS, RandWrite: 196k IOPS; I/F: PCIe 3.0 x4
      NVIDIA Tesla P40 x3 — # of cores: 3840 (1.3GHz); device RAM: 24GB (347GB/s, GDDR5); CC: 6.1 (Pascal, GP102); I/F: PCIe 3.0 x16
      SPECIAL THANKS FOR [sponsor logo]
  43. Benchmark (2/3) – Result of query execution performance (v2.1)
       13 SSBM queries against a 1055GB database in total (i.e., 351GB per I/O expansion box)
       Raw I/O data transfer without SQL execution reached up to 9GB/s. In other words, SQL execution was faster than a simple storage read with raw I/O.
      [Chart: Star Schema Benchmark for PgSQL v11beta3 / PG-Strom v2.1devel on NEC ExpEther x3, Q1_1..Q4_3 — PG-Strom v2.1devel: 11,498〜13,536 MB/s per query; PostgreSQL v11beta3: 1,789〜2,831 MB/s; the raw-I/O limitation is marked as a reference line]
      Max 13.5GB/s of query execution performance with 3x I/O expansion boxes!!
  44. Benchmark (3/3) – Density of I/O per expansion box (v2.1)
      [Chart: I/O throughput per NVMe drive (nvme0n1..nvme5n1) during SQL execution]
       During SQL execution, the raw-I/O performance was 5000〜5100MB/s per expansion box, i.e., 2600MB/s per NVME-SSD.
       The workload was balanced across the I/O expansion boxes, so performance scaling can be expected with more boxes.
       4.5GB/s x8 = 36GB/s is expected with an 8-box configuration — close to commercial DWH solutions.
  45. Conclusion
  46. Expected usage – Log data processing and analysis
      A data management, analytics and machine-learning platform for log data that grows daily (manufacturing, logistics, mobile, home electronics), built on GPU + NVME-SSD.
      Why PG-Strom?
       It supports nearly 100TB on a single node by adding I/O expansion boxes.
       It can summarize the raw log data as-is, at rates beyond the apparent maximum performance of the H/W.
       Users can continue to use the familiar SQL statements and applications.
      A summarization query of the kind envisioned here is sketched below.
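      An illustrative daily summarization over growing log data; the table and columns are hypothetical:

          -- hypothetical M2M event log: device_log(ts, device_id, temperature, ...)
          SELECT date_trunc('day', ts) AS day,
                 device_id,
                 count(*)         AS events,
                 avg(temperature) AS avg_temp
            FROM device_log
           WHERE ts >= now() - interval '7 days'
           GROUP BY day, device_id
           ORDER BY day, device_id;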
  47. Conclusion
       PG-Strom: an extension module for PostgreSQL that accelerates SQL execution by GPU. It pulls out the maximum potential of the hardware to summarize and analyze terabyte-class (and larger) data.
       Core feature: SSD-to-GPU Direct SQL. It transfers the data blocks on NVME-SSD directly to the GPU by P2P DMA, and runs SQL workloads on the GPU prior to data loading onto the host system. By reducing the data to be processed, it improves the performance of I/O-bound jobs.
       Multiple GPU/SSD configuration with I/O expansion boxes. To avoid saturating the CPU that serves as the PCIe root complex, P2P DMA packets are exchanged close to the storage devices, inside I/O expansion boxes that mount a PCIe switch. This not only mitigates CPU load, but also allows enhancement of database capacity and performance on demand. We measured 13.5GB/s of SQL execution with 3x expansion boxes; more performance is expected according to the investment in hardware.
       Expected use scenario: database systems that store massive logs, including M2M. Simple operations with a single-node PostgreSQL, and continuity of skill-sets with the familiar SQL statements and applications. Adoption targets: small〜middle Hadoop clusters, or entry-class DWH solutions.