RaySQL: DataFusion on Ray
This is an experimental research project to evaluate the concept of performing distributed SQL queries from Python, using Ray and DataFusion.
Example
See examples/tips.py.
import ray
from raysql.context import RaySqlContext
from raysql.worker import Worker
# Start our cluster
ray.init()
# Create two remote workers
workers = [Worker.remote() for _ in range(2)]
# Create a context, register the CSV file as a table, and run a query
ctx = RaySqlContext(workers)
ctx.register_csv('tips', 'tips.csv', True)
result_set = ctx.sql('select sex, smoker, avg(tip/total_bill) as tip_pct from tips group by sex, smoker')
print(result_set)
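To make the semantics of the query above concrete, here is a minimal plain-Python sketch of the same aggregation: the average of tip/total_bill per (sex, smoker) group. It does not use RaySQL or Ray. The rows are made up for illustration and are not taken from tips.csv, and the helper name `tip_pct_by_group` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows using the column names referenced by the query.
SAMPLE_CSV = """total_bill,tip,sex,smoker
10.0,2.0,Female,No
20.0,2.0,Female,No
40.0,4.0,Male,Yes
"""


def tip_pct_by_group(csv_text):
    """Return {(sex, smoker): avg(tip / total_bill)} for the given CSV text."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["sex"], row["smoker"])
        totals[key] += float(row["tip"]) / float(row["total_bill"])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


result = tip_pct_by_group(SAMPLE_CSV)
for (sex, smoker), tip_pct in sorted(result.items()):
    print(sex, smoker, round(tip_pct, 4))
```

A distributed engine would compute partial sums and counts per group on each worker and merge them afterwards; this single-process version only shows what the final numbers mean.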
Status
- RaySQL can run 21 of the 22 TPC-H benchmark queries (query 15 needs DDL and that is not yet supported).
Features
- Mature SQL support (CTEs, joins, subqueries, etc.) thanks to DataFusion
- Support for CSV and Parquet files
Limitations
- Currently requires a shared file system
Performance
This chart shows the relative performance of RaySQL compared to other open-source distributed SQL frameworks.
Performance is looking pretty respectable!
Building
# prepare development environment (used to build wheel / install in development)
python3 -m venv venv
# activate the venv
source venv/bin/activate
# update pip itself if necessary
python -m pip install -U pip
# install dependencies (for Python 3.8+)
python -m pip install -r requirements-in.txt
Whenever Rust code changes (your changes or via git pull):
# make sure you activate the venv using "source venv/bin/activate" first
maturin develop
python -m pytest
Benchmarking
Create a release build when running benchmarks, then use pip to install the wheel.
maturin develop --release
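When comparing builds, it can help to time a query several times and look at the spread rather than at a single run. The sketch below is a generic, standard-library-only timing helper and is not part of RaySQL. `time_runs` is a hypothetical name, and the callable passed to it stands in for whatever runs the query under test (for example, a function wrapping the `ctx.sql(...)` call from the example).

```python
import statistics
import time


def time_runs(fn, repeats=3):
    """Call fn() `repeats` times and return the elapsed seconds of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in workload; replace with a function that runs the query under test.
timings = time_runs(lambda: sum(range(100_000)), repeats=5)
print(f"min={min(timings):.6f}s median={statistics.median(timings):.6f}s")
```

Reporting the minimum and median together gives a rough sense of both the best case and the typical run.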
How to update dependencies
To change test dependencies, change the requirements.in and run:
# install pip-tools (this only needs to be done once); consider running it in the venv
python -m pip install pip-tools
python -m piptools compile --generate-hashes -o requirements-310.txt
To update dependencies, run the same command with -U:
python -m piptools compile -U --generate-hashes -o requirements-310.txt