GPU Acceleration for Spatial Join¶
SedonaDB supports GPU-accelerated spatial joins with NVIDIA GPUs. The key innovation is using NVIDIA RT cores for spatial query acceleration, which significantly reduces candidate search cost and predicate evaluation.
Prerequisites: GPU acceleration requires a SedonaDB build with GPU support, NVIDIA CUDA 12+ on a GPU with compute capability 7.5 or higher. The feature works with or without dedicated RT cores. When RT cores are not available, NVIDA OptiX emulates ray tracing on CUDA cores.
Why This Matters¶
Spatial joins are often bottlenecked by candidate generation and exact refinement. SedonaDB's GPU path targets both phases:
- RT-core acceleration for high-throughput spatial filtering.
- A geometry-aware refinement pipeline, including a heavily optimized point-in-polygon (PIP) path.
Filtering and Refining Stages¶
The GPU-based join follows SedonaDB's two-stage execution model:
- Filtering stage
- Runs on NVIDIA RT cores [1].
-
Quickly generates candidate geometry pairs that intersect.
-
Refining stage
- Runs exact predicate checks on candidates.
- For point-polygon geometry pairs, refinement is heavily optimized and accelerated with RT-core-backed ray-tracing techniques [2].
- For other spatial join patterns, refinement runs on CUDA-core kernels. We will gradually expand the RT acceleration coverage in future releases.
Supported Predicates and Fallback Behavior¶
GPU spatial join currently supports relation predicates:
ST_IntersectsST_ContainsST_WithinST_CoversST_CoveredByST_TouchesST_Equals
Not currently supported on the GPU path:
- Distance predicates (for example,
ST_DWithin) - KNN / KNN join predicates
- GeometryCollection
When gpu.fallback_to_cpu = true (default), unsupported predicates fall back to CPU spatial join.
When gpu.fallback_to_cpu = false, unsupported predicates fail the query.
Install from Source with the GPU Feature¶
Build from source
For building the Python package from source, you should enable the GPU at build time and configure the CUDA
environment before running MATURIN_PEP517_ARGS="--features='gpu,s2geography,pyo3/extension-module' pip install.
Common environment variables used by the GPU build:
CUDA_HOME: points to your CUDA toolkit root.-
CMAKE_CUDA_ARCHITECTURES: CUDA SM targets (default falls back to86if not set; You can specify multiple arcs, e.g., "86;89"). Change this according to your GPU models. Otherwise, it requires JIT to compile kernels, making the first-time run very slow! -
LIBGPUSPATIAL_LOGGING_LEVEL: Logging level, includingTRACE,DEBUG,INFO,WARN,ERROR,CRITICAL. GPUSPATIAL_PROFILING=ON: Profiling mode (ON,OFF) to compile profiling instrumentation.
Using Docker
A prebuilt, GPU-enabled image is published to Docker Hub at apache/sedona under the sedonadb-latest tag (built for both linux/amd64 and linux/arm64). It bundles SedonaDB with the GPU feature enabled and starts a JupyterLab instance.
Make sure you have installed Docker, the NVIDIA Driver, and the NVIDIA Container Toolkit (sudo nvidia-ctk runtime configure --runtime=docker), then pull and run the image:
docker run -it --rm --gpus all -p 8888:8888 apache/sedona:sedonadb-latest
Open the JupyterLab URL printed in the terminal to start working.
The prebuilt image is compiled for the Turing, Ampere, and Ada Lovelace architectures (CMAKE_CUDA_ARCHITECTURES=75;86;89), covering common NVIDIA GPUs such as the T4, A10G, and L4. On a GPU outside this set the kernels are JIT-compiled on first use, which can make the first run slow. To target your own GPU, build the image from the repository root:
docker build -f docker/sedonadb-gpu.dockerfile --build-arg CMAKE_CUDA_ARCHITECTURES="<your GPU compute capability>" -t sedonadb-gpu .
docker run -it --rm --gpus all -p 8888:8888 sedonadb-gpu
Note: This Docker image is currently source-built directly from the repository's active codebase and is not tied to a released SedonaDB GPU wheel yet. This is to support development and early testing before an official SedonaDB GPU wheel is published.
Enable GPU Join with SQL SET¶
The GPU join is disabled by default even if you enabled the GPU feature at build time. To enable GPU acceleration for spatial joins, set the gpu.enable option to true:
import sedonadb
ctx = sedonadb.connect()
ctx.sql("SET gpu.enable = true").execute()
<sedonadb.dataframe.DataFrame object at 0x70b85c1b7850>
Performance Tuning and Special Cautions¶
To keep the GPU efficiently utilized, use larger execution batches:
ctx.sql("SET datafusion.execution.batch_size = 100000").execute()
Important guidance:
- A large batch size (for example,
100000) is often necessary for good GPU throughput. - For highest GPU performance, spilling should be disabled (for example, set
sd.options.memory_limit = "unlimited"before running queries). - Small joins may not amortize GPU overhead and can be slower than CPU execution.
- Start with defaults for other
gpu.*options, then tune based on measured workload behavior.
GPU Options¶
The following session options are available under the gpu. prefix:
gpu.enable(bool, defaultfalse): enable GPU spatial join.gpu.concat_build(bool, defaulttrue): concatenate geometry buffers before GPU processing.gpu.device_id(int, default0): CUDA device ID.gpu.fallback_to_cpu(bool, defaulttrue): use CPU path when GPU path is unavailable/unsupported.gpu.use_memory_pool(bool, defaulttrue): use CUDA memory pool.gpu.memory_pool_init_percentage(int, default50): initial CUDA memory pool size as a percentage of total GPU memory.gpu.pipeline_batches(int, default1): overlap parsing and refinement across batches.gpu.compress_bvh(bool, defaultfalse): compress BVH to reduce memory usage (can reduce performance).
Example:
import sedonadb
ctx = sedonadb.connect()
ctx.sql("SET gpu.enable = true")
ctx.sql("SET gpu.device_id = 0")
ctx.sql("SET gpu.use_memory_pool = true")
ctx.sql("SET gpu.memory_pool_init_percentage = 60")
ctx.sql("SET gpu.pipeline_batches = 2")
ctx.sql("SET gpu.compress_bvh = false")
ctx.sql("SET datafusion.execution.batch_size = 100000").execute()
Example¶
Check GPU environment¶
!nvidia-smi
Tue May 26 20:49:41 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:31:00.0 Off | 0 |
| N/A 41C P0 28W / 72W | 11610MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 20603 C ...a3/envs/sedona/bin/python3.11 11602MiB |
+-----------------------------------------------------------------------------------------+
Install required packages for the example¶
!pip install huggingface_hub rasterio pyogrio
Download datasets¶
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="apache-sedona/spatialbench",
repo_type="dataset",
local_dir="hf-data",
allow_patterns=["v0.1.0/sf1/zone/*", "v0.1.0/sf1/trip/*"],
)
Downloading (incomplete total...): 0.00B [00:00, ?B/s]
Fetching 8 files: 0%| | 0/8 [00:00<?, ?it/s]
'/mnt/data/sedona-db/docs/hf-data'
Create sedonadb context and load datasets¶
import sedonadb
ctx = sedonadb.connect()
ctx.options.memory_limit = "unlimited"
ctx.sql("""
CREATE EXTERNAL TABLE zone
STORED AS PARQUET
LOCATION 'hf-data/v0.1.0/sf1/zone/'
""")
ctx.sql("""
CREATE EXTERNAL TABLE trip
STORED AS PARQUET
LOCATION 'hf-data/v0.1.0/sf1/trip/'
""").execute()
0
Define query¶
# Define the query once to ensure both executions use the exact same logic
query = """
SELECT COUNT(*) AS cross_zone_trip_count
FROM trip t
JOIN zone pickup_zone
ON ST_Within(ST_GeomFromWKB(t.t_pickuploc), ST_GeomFromWKB(pickup_zone.z_boundary))
JOIN zone dropoff_zone
ON ST_Within(ST_GeomFromWKB(t.t_dropoffloc), ST_GeomFromWKB(dropoff_zone.z_boundary))
WHERE pickup_zone.z_zonekey != dropoff_zone.z_zonekey
"""
First, we will run this query using the standard CPU execution to establish our baseline correctness and get a sense of the standard execution time.
import time
print("Executing on CPU (Baseline)...")
ctx.sql("SET gpu.enable = false") # Disable the GPU feature
ctx.sql("SET datafusion.execution.batch_size = 8192") # Default configuration
runs = 5
cpu_times = []
cpu_count = 0
for i in range(runs):
start_time = time.time()
cpu_df = ctx.sql(query).to_pandas()
elapsed = time.time() - start_time
cpu_times.append(elapsed)
cpu_count = cpu_df.iloc[0, 0]
print(f" Run {i + 1}: {elapsed:.4f} seconds")
cpu_avg_time = sum(cpu_times) / runs
print("-" * 30)
print(f"CPU Average Time: {cpu_avg_time:.4f} seconds | Rows: {cpu_count}")
Executing on CPU (Baseline)...
Run 1: 7.7795 seconds
Run 2: 6.8287 seconds
Run 3: 7.0688 seconds
Run 4: 7.2434 seconds
Run 5: 7.1739 seconds
------------------------------
CPU Average Time: 7.2189 seconds | Rows: 176391
Next, we enable GPU execution. We also increase the batch size, which is a crucial optimization step to ensure we are feeding enough data to the GPU to maximize its parallel processing capabilities.
print("Executing on GPU (Accelerated)...")
ctx.sql("SET gpu.enable = true")
ctx.sql("SET datafusion.execution.batch_size = 1000000") # Optimized for GPU throughput
runs = 5
gpu_times = []
gpu_count = 0
for i in range(runs):
start_time = time.time()
gpu_df = ctx.sql(query).to_pandas()
elapsed = time.time() - start_time
gpu_times.append(elapsed)
gpu_count = gpu_df.iloc[0, 0]
print(f" Run {i + 1}: {elapsed:.4f} seconds")
gpu_avg_time = sum(gpu_times) / runs
print("-" * 30)
print(f"GPU Average Time: {gpu_avg_time:.4f} seconds | Rows: {gpu_count}")
Executing on GPU (Accelerated)...
Run 1: 3.2582 seconds
Run 2: 3.3126 seconds
Run 3: 3.0703 seconds
Run 4: 3.2604 seconds
Run 5: 3.1418 seconds
------------------------------
GPU Average Time: 3.2087 seconds | Rows: 176391
Compare Results¶
# Validate correctness
match_status = "✅ Match" if cpu_count == gpu_count else "❌ Mismatch"
# Calculate speedup based on the new averages
speedup = cpu_avg_time / gpu_avg_time if gpu_avg_time > 0 else 0
print("--- Query Validation (Based on Averages) ---")
print(f"Row Count Validated : {gpu_count} ({match_status})")
print(f"Average CPU Time : {cpu_avg_time:.4f}s")
print(f"Average GPU Time : {gpu_avg_time:.4f}s")
print(f"Calculated Speedup : {speedup:.2f}x")
--- Query Validation (Based on Averages) ---
Row Count Validated : 176391 (✅ Match)
Average CPU Time : 7.2189s
Average GPU Time : 3.2087s
Calculated Speedup : 2.25x
References¶
- Geng, Liang, Rubao Lee, and Xiaodong Zhang. "Librts: A spatial indexing library by ray tracing." Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. 2025.
- Geng, Liang, Rubao Lee, and Xiaodong Zhang. "Rayjoin: Fast and precise spatial join." Proceedings of the 38th ACM International Conference on Supercomputing. 2024.