
BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.


wmalpica/blazingsql

 
 


A lightweight, GPU accelerated, SQL engine built on the RAPIDS.ai ecosystem.

Getting Started | Documentation | Examples | Contributing | License | Blog

BlazingSQL is a GPU accelerated SQL engine built on top of the RAPIDS ecosystem. RAPIDS is based on the Apache Arrow columnar memory format, and cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.

BlazingSQL is a SQL interface for cuDF, with various features to support large scale data science workflows and enterprise datasets.

  • Query Data Stored Externally - a single line of code can register remote storage solutions, such as Amazon S3.
  • Simple SQL - run a SQL query and get the results back as GPU DataFrames (GDFs).
  • Interoperable - GDFs are immediately accessible to any RAPIDS library for data science workloads.

Check out our 5-minute quick start notebook on Google Colab to try BlazingSQL.

Getting Started

Please reference our docs to find out how to install BlazingSQL.

Querying a CSV file in Amazon S3 with BlazingSQL:

from blazingsql import BlazingContext

# Create the BlazingSQL context
bc = BlazingContext()

# Register the S3 bucket under the name 'dir_name'
bc.s3('dir_name', bucket_name='bucket_name', access_key_id='access_key', secret_key='secret_key')

# Create Table from CSV
bc.create_table('taxi', '/dir_name/taxi.csv')

# Query
result_gdf = bc.sql('SELECT count(*) FROM taxi GROUP BY year(key)')

# Print GDF
print(result_gdf)
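
If you run the same kind of query against several tables, the pattern can be wrapped in a small helper. The sketch below assumes only the bc.sql call shown above. The helper name count_by_year and its identifier check are hypothetical and not part of BlazingSQL.

```python
def count_by_year(bc, table, column):
    """Run the yearly count query from the example above.

    `bc` is expected to be a BlazingContext, but any object with a
    compatible `sql(query)` method works, which keeps the helper easy to test.
    This helper is illustrative only; it is not part of the BlazingSQL API.
    """
    # Only accept plain identifiers so that arbitrary text cannot be
    # spliced into the SQL string.
    for name in (table, column):
        if not name.isidentifier():
            raise ValueError(f"not a plain identifier: {name!r}")
    query = f"SELECT count(*) FROM {table} GROUP BY year({column})"
    return bc.sql(query)
```

With the context from the example above, count_by_year(bc, 'taxi', 'key') issues the same query and returns whatever bc.sql returns.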

Examples

Documentation

You can find our full documentation at the following site

Quick Start

To see all the ways you can get started with BlazingSQL, check out our Getting Started page.

Install Using Conda

BlazingSQL can be installed with conda (miniconda, or the full Anaconda distribution) from the blazingsql channel:

Note: BlazingSQL is supported only on Linux, and with Python version 3.6 or 3.7.
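
A quick pre-flight check can catch an unsupported platform before conda starts resolving packages. This is a minimal sketch based only on the note above (Linux, Python 3.6 or 3.7). The function name is_supported is hypothetical.

```python
import platform
import sys

# Python versions listed in the note above.
SUPPORTED_PYTHON = {(3, 6), (3, 7)}


def is_supported(system, version):
    """Return True if the OS name and Python version match the note above.

    `system` is a name as returned by platform.system();
    `version` is a tuple such as sys.version_info.
    """
    return system == "Linux" and tuple(version[:2]) in SUPPORTED_PYTHON


if __name__ == "__main__":
    if is_supported(platform.system(), sys.version_info):
        print("Platform and Python version match the supported combinations.")
    else:
        print("Unsupported: BlazingSQL needs Linux and Python 3.6 or 3.7.")
```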

Stable Version

For CUDA 9.2 and Python 3.7:

conda install -c blazingsql/label/cuda9.2 -c blazingsql -c rapidsai -c nvidia -c conda-forge -c defaults blazingsql python=3.7 cudatoolkit=9.2

For CUDA 10.0 and Python 3.7:

conda install -c blazingsql/label/cuda10.0 -c blazingsql -c rapidsai -c nvidia -c conda-forge -c defaults blazingsql python=3.7 cudatoolkit=10.0

Nightly Version

For CUDA 9.2 and Python 3.7:

conda install -c blazingsql-nightly/label/cuda9.2 -c blazingsql-nightly -c rapidsai-nightly -c conda-forge -c defaults blazingsql python=3.7

For CUDA 10.0 and Python 3.7:

conda install -c blazingsql-nightly/label/cuda10.0 -c blazingsql-nightly -c rapidsai-nightly -c conda-forge -c defaults blazingsql python=3.7
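
The four commands above differ only in the CUDA version and in whether the nightly channels are used. For scripted setups, a sketch like the following can assemble the matching command. It reproduces only the commands listed above; the function name conda_install_command is hypothetical, and other CUDA versions are rejected rather than guessed.

```python
# CUDA versions for which an install command is listed above.
SUPPORTED_CUDA = ("9.2", "10.0")


def conda_install_command(cuda, nightly=False, python="3.7"):
    """Build the conda install command listed above for a CUDA version."""
    if cuda not in SUPPORTED_CUDA:
        raise ValueError(f"no documented command for CUDA {cuda}")
    suffix = "-nightly" if nightly else ""
    channels = [
        f"blazingsql{suffix}/label/cuda{cuda}",
        f"blazingsql{suffix}",
        f"rapidsai{suffix}",
    ]
    # The stable commands above also list the nvidia channel.
    if not nightly:
        channels.append("nvidia")
    channels += ["conda-forge", "defaults"]

    parts = ["conda", "install"]
    for channel in channels:
        parts += ["-c", channel]
    parts += ["blazingsql", f"python={python}"]
    # Only the stable commands above pin cudatoolkit explicitly.
    if not nightly:
        parts.append(f"cudatoolkit={cuda}")
    return " ".join(parts)


if __name__ == "__main__":
    print(conda_install_command("10.0"))
```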

Build/Install from Source (Conda Environment)

This is the recommended way of building all of the BlazingSQL components and dependencies from source. It ensures that all the dependencies are available to the build process.

Stable Version

Install build dependencies

For CUDA 9.2 and Python 3.7:

conda create -n bsql python=3.7
conda activate bsql
conda install --yes -c conda-forge openjdk=8.0 maven cmake gtest gmock rapidjson cppzmq cython=0.29 jpype1 netifaces pyhive
conda install --yes -c conda-forge -c blazingsql bsql-toolchain
conda install --yes -c rapidsai -c nvidia -c conda-forge -c defaults cudf=0.12 dask-cudf=0.12 dask-cuda=0.12 cudatoolkit=9.2

For CUDA 10.0 and Python 3.7:

conda create -n bsql python=3.7
conda activate bsql
conda install --yes -c conda-forge openjdk=8.0 maven cmake gtest gmock rapidjson cppzmq cython=0.29 jpype1 netifaces pyhive
conda install --yes -c conda-forge -c blazingsql bsql-toolchain
conda install --yes -c rapidsai -c nvidia -c conda-forge -c defaults cudf=0.12 dask-cudf=0.12 dask-cuda=0.12 cudatoolkit=10.0

Build

The build process will check out the BlazingSQL repository, then build it and install it into the conda environment.

cd $CONDA_PREFIX
git clone https://github.com/BlazingDB/blazingsql.git
cd blazingsql
git checkout master
export CUDACXX=/usr/local/cuda/bin/nvcc
conda/recipes/blazingsql/build.sh

$CONDA_PREFIX now has a folder for the blazingsql repository.

Nightly Version

Install build dependencies

For CUDA 9.2:

conda create -n bsql python=3.7
conda activate bsql
conda install --yes -c conda-forge openjdk=8.0 maven cmake gtest gmock rapidjson cppzmq cython=0.29 jpype1 netifaces pyhive
conda install --yes -c conda-forge -c blazingsql-nightly bsql-toolchain
conda install --yes -c rapidsai-nightly -c nvidia -c conda-forge -c defaults libcudf=0.12 cudf=0.12 dask-cudf=0.12 dask-cuda=0.12 cudatoolkit=9.2

For CUDA 10.0:

conda create -n bsql python=3.7
conda activate bsql
conda install --yes -c conda-forge openjdk=8.0 maven cmake gtest gmock rapidjson cppzmq cython=0.29 jpype1 netifaces pyhive
conda install --yes -c conda-forge -c blazingsql-nightly bsql-toolchain
conda install --yes -c rapidsai-nightly -c nvidia -c conda-forge -c defaults libcudf=0.12 cudf=0.12 dask-cudf=0.12 dask-cuda=0.12 cudatoolkit=10.0

Build

The build process will check out the BlazingSQL repository, then build it and install it into the conda environment.

cd $CONDA_PREFIX
git clone https://github.com/BlazingDB/blazingsql.git
cd blazingsql
export CUDACXX=/usr/local/cuda/bin/nvcc
./build.sh

NOTE: Run ./build.sh -h to see more build options.

$CONDA_PREFIX now has a folder for the blazingsql repository.
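
After a source build, it can be useful to confirm that the packages are visible to the Python interpreter in the bsql environment. The sketch below uses only the standard library. The helper name missing_modules is hypothetical, and the check confirms only that each module can be located, not that it imports and runs correctly on your GPU.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # blazingsql is the package built above; cudf is one of its dependencies.
    missing = missing_modules(["blazingsql", "cudf"])
    if missing:
        print("Not found in this environment: " + ", ".join(missing))
    else:
        print("blazingsql and cudf can be located in this environment.")
```

Run it with the bsql environment activated, so that the interpreter being checked is the one the build installed into.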

Community

Contributing

Have questions or feedback? Post a new GitHub issue.

Please see our guide for contributing to BlazingSQL.

Contact

Feel free to join our Slack chat room: RAPIDS Slack Channel

You may also email us at [email protected] or find out more details on the BlazingSQL site.

License

Apache License 2.0

RAPIDS AI - Open GPU Data Science

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

Apache Arrow on GPU

The GPU version of Apache Arrow is a common API that enables efficient interchange of tabular data between processes running on the GPU. End-to-end computation on the GPU avoids unnecessary copying and converting of data off the GPU, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads. As the name implies, cuDF uses the Apache Arrow columnar data format on the GPU. Currently, a subset of the features in Apache Arrow is supported.
