Announcing BlazingSQL — A GPU SQL Engine for RAPIDS Open-Source Software from NVIDIA

By Rodrigo Aramburu

We have two major announcements:

  1. Over the past months, we have been contributing heavily to the recently announced RAPIDS open-source software.
  2. We have built a free-to-use version of BlazingDB’s query execution engine on top of the RAPIDS open-source software. It’s called BlazingSQL.

BlazingSQL is built on open-source projects, is free to use, and provides a clear benefit: you can query datasets from your enterprise Data Lakes directly into GPU memory as a GPU DataFrame (GDF).

The GPU DataFrame (GDF) is a project that aims to support interoperability between GPU applications and to define a common GPU in-memory data layer. When we learned that NVIDIA and Anaconda were both looking for ways to expand the compute capability of the GDF, we wanted in.

The BlazingDB team helped build critical open-source libraries inside the RAPIDS open-source software and then layered on a series of modules from BlazingDB, our enterprise product, to provide Data Lake integration and enable SQL queries on the software.

With BlazingSQL, Python developers can execute SQL queries directly on flat files inside distributed file systems and receive the results as a GPU DataFrame (GDF). With the GDF, users can then use PyGDF or Dask_GDF, which provide a simple interface similar to the Pandas DataFrame.

Finally, cuML and cuDNN are popular GPU-accelerated machine learning and deep learning libraries that can consume GDFs. The GPU DataFrame and this ecosystem let developers run complete machine learning workloads inside GPU memory, reducing both the cost of data exchange between different tools and the transfer overhead over the PCIe bus.

This is a very new project for us. Over the next few weeks and months, we will start launching demos and installable binaries once the latest version of BlazingSQL is ready to show off complete workloads. The planned releases are:

  1. BlazingSQL 0.1: Use the PyBlazing connection to execute SQL queries on GDFs that are loaded by the PyGDF API.
    Next couple of weeks (Oct. 11th to Oct. 25th)
  2. BlazingSQL 0.2: Integrate BlazingDB’s FileSystem API, adding the ability to directly query flat files inside distributed file systems.
    Few weeks after that (Oct. 25th to Nov. 8th)
  3. BlazingSQL 0.3: Integrate the distributed scheduler so SQL queries are fanned out across multiple GPUs and servers.
    More weeks after that (Nov. 8th to Nov. 30th)
  4. BlazingSQL 0.4: Integrate the distributed, multi-layered cache. Queries that are larger than GPU or CPU memory use a cascading caching mechanism that keeps the most readily available data in GPU memory, the next most in CPU memory, and the rest on SSD/NVMe.
    Between today and the future, ideally in 2018.
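
To make the cascading cache idea from the 0.4 item more concrete, here is a minimal, CPU-only sketch in plain Python. It is not BlazingSQL code: the class name `TieredCache`, the tier names, the least-recently-used eviction policy, and the choice to count capacity in entries rather than bytes are all assumptions made for illustration. It only shows the general pattern of data spilling from a fast tier into progressively slower ones.

```python
from collections import OrderedDict


class TieredCache:
    """Toy cascading cache (hypothetical, not BlazingSQL's implementation).

    Entries evicted from a faster tier spill into the next, slower tier.
    """

    def __init__(self, tiers):
        # tiers: list of (name, capacity) ordered fastest first,
        # e.g. GPU memory, then CPU memory, then SSD/NVMe.
        self.names = [name for name, _ in tiers]
        self.capacities = [capacity for _, capacity in tiers]
        self.stores = [OrderedDict() for _ in tiers]

    def _insert(self, level, key, value):
        store = self.stores[level]
        store[key] = value
        store.move_to_end(key)  # mark as most recently used in this tier
        if len(store) > self.capacities[level]:
            # Evict the least recently used entry (assumed policy).
            old_key, old_value = store.popitem(last=False)
            if level + 1 < len(self.stores):
                self._insert(level + 1, old_key, old_value)  # cascade down
            # Entries evicted from the last tier are dropped in this sketch.

    def put(self, key, value):
        # Remove any stale copy so a key lives in exactly one tier.
        for store in self.stores:
            store.pop(key, None)
        self._insert(0, key, value)

    def get(self, key):
        # Search fastest tier first; promote a hit back to the fastest tier.
        for store in self.stores:
            if key in store:
                value = store.pop(key)
                self._insert(0, key, value)
                return value
        return None

    def location(self, key):
        # Report which tier currently holds the key, or None if absent.
        for name, store in zip(self.names, self.stores):
            if key in store:
                return name
        return None


# Capacities are entry counts here, purely for demonstration.
cache = TieredCache([("gpu", 2), ("cpu", 2), ("ssd", 4)])
for chunk_id in range(5):
    cache.put(chunk_id, f"chunk-{chunk_id}")
```

After the loop, the two newest chunks sit in the "gpu" tier, the two before them in "cpu", and the oldest in "ssd". Calling `cache.get(0)` promotes chunk 0 back to the fastest tier and pushes older entries one level down. A real engine would size tiers in bytes and move data between devices, which this sketch does not attempt.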