SedonaDB 0.3.0 Release
The Apache Sedona community is excited to announce the release of SedonaDB version 0.3.0!
SedonaDB is the first open-source, single-node analytical database engine that treats spatial data as a first-class citizen. It is developed as a subproject of Apache Sedona. This release consists of 187 resolved issues including 36 new functions from 18 contributors.
Apache Sedona powers large-scale geospatial processing on distributed engines like Spark (SedonaSpark), Flink (SedonaFlink), and Snowflake (SedonaSnow). SedonaDB extends the Sedona ecosystem with a single-node engine optimized for small-to-medium data analytics, delivering the simplicity and speed that distributed systems often cannot.
Release Highlights¶
Weβre excited to have so many things to highlight in this release!
- Larger-than-memory spatial join
- Item/row-level CRS support
- Parameterized SQL queries
- Parquet reader and writer improvements
- GDAL/OGR write support
- Improved spatial function coverage and documentation
- R DataFrame API
# pip install "apache-sedona[db]"
import sedona.db
sd = sedona.db.connect()
sd.options.interactive = True
Out of Core Spatial Join¶
The first release of SedonaDB almost four months ago included an extremely flexible and performant spatial join. As a refresher, a spatial join finds the interaction between two tables based on some spatial relationship (e.g., intersects or within distance). For example, if youβre thinking of moving and you love pizza, SedonaDB can find you the neighborhood with the most pizza restaurants about as fast as your internet connection/hard drive can load Overture Maps Divisions and Places data.
overture = "s3://overturemaps-us-west-2/release/2026-02-18.0"
options = {"aws.skip_signature": True, "aws.region": "us-west-2"}
sd.read_parquet(f"{overture}/theme=places/type=place/", options=options).to_view(
"places"
)
sd.read_parquet(
f"{overture}/theme=divisions/type=division_area/", options=options
).to_view("divisions")
sd.sql("""
SELECT
get_field(names, 'primary') AS name,
geometry
FROM places
WHERE get_field(categories, 'primary') = 'pizza_restaurant'
""").to_view("pizza_restaurants")
sd.sql("""
SELECT divisions.id, get_field(divisions.names, 'primary') AS name, COUNT(*) AS n
FROM divisions
INNER JOIN pizza_restaurants ON ST_Contains(divisions.geometry, pizza_restaurants.geometry)
WHERE divisions.subtype = 'neighborhood'
GROUP BY divisions.id, get_field(divisions.names, 'primary')
ORDER BY n DESC
""").limit(5)
# > ββββββββββββββββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββ¬ββββββββ
# > β id β name β n β
# > β utf8 β utf8 β int64 β
# > ββββββββββββββββββββββββββββββββββββββββͺβββββββββββββββββββββββββββββͺββββββββ‘
# > β 1373e423-efdc-471a-ae51-3e9d6c4231b2 β UPZs de BogotΓ‘ β 860 β
# > ββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββββΌββββββββ€
# > β 74e87dad-3b11-4fc5-bedd-cb51a617e141 β Distrito-sede de Guarulhos β 388 β
# > ββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββββΌββββββββ€
# > β 76077e78-c0c0-48f4-9683-1a16a56dfaf9 β Roma I β 346 β
# > ββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββββΌββββββββ€
# > β ee274c81-5b62-4e80-be67-e613b9219b17 β Roma VII β 289 β
# > ββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββββΌββββββββ€
# > β bf762f01-cad0-46be-b47b-76435478035c β Birmingham β 263 β
# > ββββββββββββββββββββββββββββββββββββββββ΄βββββββββββββββββββββββββββββ΄ββββββββ
The way that this join algorithm worked was approximately (1) read all the batches in one side of the spatial join into memory, (2) build a spatial index in memory, and (3) query it in parallel for every batch in the other side of the join, streaming the result. This is how most spatial data frame libraries do a join, although SedonaDB was able to leverage DataFusionβs infrastructure for parallel streaming execution and Apache Arrowβs compact representation of an array of geometries to efficiently perform most joins that the average user could think of.
Even though our initial join was fast, it had a hard limit: all of the uncompressed batches of the build side AND the spatial index had to fit in memory. For a middle-of-the-road laptop with 16 GB of memory, this is somewhere around 200 million points, not accounting for any space required for non-geometry columns: plenty for most analyses, but not for large datasets or resource-limited environments like containers or cloud deployments.
We also had a unique resource at our disposal: the Apache Sedona community! Sedona Spark has been able to perform distributed spatial joins for many years, and its authors (Jia Yu, Kristin Cowalcijk, and Feng Zhang) are all active SedonaDB contributors. After our 0.2 release we asked the question: can we use the same ideas behind Sedona Sparkβs spatial join to allow SedonaDB to joinβ¦anything?
The result is a mostly rewritten implementation of the spatial join (which includes the k-nearest neighbours or KNN join) that allows SedonaDB users to be truly fearless: if you can write the query, SedonaDB can finish the job. This is a complex enough feature that will be the subject of a dedicated future blog post, but as a quick example: SedonaDB can identify duplicate (or nearly duplicate) points in a 130 million point dataset in about 30 seconds with only 3 GB of memory!
import sedona.db
sd = sedona.db.connect()
sd.options.memory_limit = "3g"
sd.options.memory_pool_type = "fair"
url = "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/microsoft-buildings_point.parquet"
sd.read_parquet(url).to_view("buildings")
sd.sql("""
SELECT ROW_NUMBER() OVER () AS building_id, geometry
FROM buildings
""").to_parquet("buildings_idx.parquet")
sd.sql("""
SELECT
l.building_id,
r.building_id AS nearest_building_id,
l.geometry AS geometry,
ST_Distance(l.geometry, r.geometry) AS dist
FROM "buildings_idx.parquet" AS l
JOIN "buildings_idx.parquet" AS r
ON l.building_id <> r.building_id
AND ST_DWithin(l.geometry, r.geometry, 1e-10)
""").to_memtable().to_view("duplicates", overwrite=True)
sd.view("duplicates").count()
# > 1017476
For some perspective, DuckDB 1.5.0 (beta) needs about 12 GB of memory to make this work and GeoPandas needs at least 28 GB.
βThank youβ seems like a completely inadequate thing to say given the amount of work it took to make this happen, so we will just tell you that @Kontinuation designed and implemented nearly all of this and hope that you are appropriately impressed.
Item/Row-level CRS Support¶
Since our first release SedonaDB has supported coordinate reference systems (CRS) following the idiom of GeoParquet, GeoPandas, and other spatial data frame libraries: every column can optionally declare a CRS that defines how the coordinates of each value relate to the surface of the earth[^1]. This is efficient because it amortizes the cost of considering the CRS to once per query or once per batch; however, this pattern made it awkward to interact with systems like PostGIS and Sedona Spark that allow each item to declare its own CRS. Per-item CRSes have other uses, too, such as performing computations in meter-based local CRSes or reading in data from multiple sources and using SedonaDB to transform to a common CRS.
Item level CRS support was added to SedonaDB in version 0.2; however, in the 0.3 release we added support throughout the stack such that columns of this type work in all functions. For example, if you want to transform some input into a local CRS, SedonaDB can get you there:
import pandas as pd
cities = pd.DataFrame(
{
"name": ["Ottawa", "Vancouver", "Toronto"],
"utm_code": ["EPSG:32618", "EPSG:32610", "EPSG:32617"],
"srid": [32618, 32610, 32617],
"longitude": [-75.7019612, -123.1235901, -79.38945855491194],
"latitude": [45.4186427, 49.2753624, 43.66464454743429],
}
)
sd.create_data_frame(cities).to_view("cities", overwrite=True)
sd.sql("""
SELECT
name,
ST_Transform(
ST_Point(longitude, latitude, 4326),
utm_code
) AS geom
FROM cities
""").to_view("cities_item_crs")
sd.view("cities_item_crs")
βββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β name β geom β
β utf8 β struct β
βββββββββββββͺββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ‘
β Ottawa β {item: POINT(445079.1082827766 5029697.641232558), crs: EPSG:32618} β
βββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Vancouver β {item: POINT(491010.24691805616 5458074.5942144925), crs: EPSG:32610} β
βββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Toronto β {item: POINT(629849.6255738225 4835886.955149793), crs: EPSG:32617} β
βββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Crucially, it can also get you back!
sd.sql("SELECT name, ST_Transform(geom, 4326) FROM cities_item_crs")
βββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββ
β name β st_transform(cities_item_crs.geom,Int64(4326)) β
β utf8 β geometry β
βββββββββββββͺβββββββββββββββββββββββββββββββββββββββββββββββββ‘
β Ottawa β POINT(-75.7019612 45.4186427) β
βββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββ€
β Vancouver β POINT(-123.1235901 49.275362399999985) β
βββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββ€
β Toronto β POINT(-79.38945855491194 43.664644547434285) β
βββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββ
The main route to creating item-level CRSes is ST_Transform() and
ST_SetSRID(), as well as the last argument of ST_Point(),
ST_GeomFromWKT(), ST_GeomFromWKB(). The new functions
ST_GeomFromEWKT() and ST_GeomFromEWKB() always return item-level
CRSes (and SRIDs are preserved when exporting via ST_AsEWKT() and
ST_AsEWKB()). Item-level CRSes should be preserved across calls to all
other functions (just like type-level CRSes are in previous versions).
Weβd love to hear more use cases to ensure our support covers the full
range of use cases in the wild!
Parameterized queries¶
While the SedonaDB Python bindings expose a limited DataFrame API, to date the primary way to interact with SedonaDB in any meaningful way is SQL. SQL is well-supported by LLMs and allows us to leverage DataFusionβs excellent SQL parser; however, it made interacting with SedonaDB in a programmatic environment awkward and completely reliant on serializing query parameters to SQL.
In SedonaDB 0.3.0, we added parameterized query support in R and Python along with converters for most geometry-like and/or CRS-like objects. For example, if you have some contextual information as a GeoPandas DataFrame, you can now insert that into any SQL query without worrying about setting the CRS or serializing to SQL.
import geopandas
url_countries = "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_countries.fgb"
countries = geopandas.read_file(url_countries)
canada = countries.geometry[countries.name == "Canada"]
url_cities = "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_cities.fgb"
sd.read_pyogrio(url_cities).to_view("cities", overwrite=True)
sd.sql("SELECT * FROM cities WHERE ST_Contains($1, wkb_geometry)", params=(canada,))
βββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββ
β name β wkb_geometry β
β utf8 β geometry β
βββββββββββββͺββββββββββββββββββββββββββββββββββββββββββββββ‘
β Ottawa β POINT(-75.7019612 45.4186427) β
βββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββ€
β Vancouver β POINT(-123.1235901 49.2753624) β
βββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββ€
β Toronto β POINT(-79.38945855491194 43.66464454743429) β
βββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββ
Parquet Improvements¶
When SedonaDB was first launched, the Parquet Geometry and Geography types had just been added to the Parquet specification and support in the ecosystem was sparse. As of SedonaDB 0.3.0, Parquet files that define geometry and geography columns are supported out of the box! Functionally this means that SedonaDB no longer needs an extra (bbox) column and GeoParquet metadata to query a Parquet file efficiently: with the built-in geometry and geography types came embedded statistics to reduce file size for many applications.
# A Parquet file with no GeoParquet metadata
url = "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_countries.parquet"
sd.read_parquet(url).schema
SedonaSchema with 3 fields:
name: utf8<Utf8View>
continent: utf8<Utf8View>
geometry: geometry<WkbView(ogc:crs84)>
Even though SedonaDB can now read such files, it cannot yet write them (keep an eye out for this feature in 0.4.0!). For more on spatial types in Parquet, see the recent deep dive on the Parquet blog.
In addition to read improvements, we also improved the interface for writing Parquet files. In particular, writing friendly GeoParquet 1.1 files that work nicely with SedonaDB, GeoPandas, and other engineβs spatial filter pushdown. This works nicely with SedonaDBβs GDAL/OGR read support to produce optimized GeoParquet files from just about anything:
url = "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_water-point.fgb"
sd.read_pyogrio(url).to_parquet(
"optimized.parquet",
geoparquet_version="1.1",
max_row_group_size=100_000,
sort_by="wkb_geometry",
)
GDAL/OGR Write Support¶
In SedonaDB 0.2.0 we added the ability to read a wide variety of spatial file formats like Shapefile, FlatGeoBuf, and GeoPackage powered by the fantastic GDAL/OGR and the pyogrio packageβs Arrow interface. In 0.3.0 we added write support! Because both reads and writes are streaming, this makes SedonaDB an excellent choice for arbitrary read/write/convert workflows that scale to any size of input.
sd.read_parquet("optimized.parquet").to_pyogrio("water-point.fgb")
Improved spatial function coverage¶
Since the 0.2.0 release we have been fortunate to work with contributors to
add 36 new ST_ and RS_ functions to our growing catalogue. Users of
rs_bandpath, rs_convexhull, rs_crs, rs_envelope, rs_georeference,
rs_numbands, rs_rastertoworldcoord, rs_rastertoworldcoordx,
rs_rastertoworldcoordy, rs_rotation, rs_setcrs, rs_setsrid, rs_srid,
rs_worldtorastercoord, rs_worldtorastercoordx, rs_worldtorastercoordy,
st_affine, st_asewkb, st_asgeojson, st_concavehull, st_force2d,
st_force3d, st_force3dm, st_force4d, st_geomfromewkb, st_geomfromewkt,
st_geomfromwkbunchecked, st_interiorringn, st_linemerge, st_nrings,
st_numinteriorrings, st_numpoints, st_rotate, st_rotatex, st_rotatey,
and st_scale will be pleased to know that these functions are now
available in SedonaDB workflows. This brings the total ST_ and RS_
function count to 143, all of which are tested against PostGIS for
compatibility with the full matrix of geometry types and XY, XYZ, XYM,
and XYZM wherever possible.
In the process of adding some of these new functions, we discovered that
GEOS-based functions that return large geometries (e.g., ST_Buffer()
with parameters or ST_Union()) could be much faster. In 0.3.0 we
improved the process of appending large GEOS geometries to our output
arrays during execution which resulted in some benchmarks becoming up to
70% faster!
Thank you to 2010YOUY01, Abeeujah, jesspav, Kontinuation, petern48, prantogg, and yutannihilation for these contributions!
R GDAL/OGR read support¶
Early testers of SedonaDB for R expressed excitement over the potential
but noted that to be useful it would have to provide a way to read spatial
formats like Shapefile, FlatGeoBuf, and GeoPackage without going through
an sf DataFrame first. In SedonaDB 0.3.0, we added sd_read_sf(), which
leverages the GDAL ArrowArrayStream feature that is experimentally
available from the sf package.
library(sedonadb)
# Works with files
fgb <- system.file("files/natural-earth_cities.fgb", package = "sedonadb")
sd_read_sf(fgb)
<sedonab_dataframe: NA x 2>
ββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββ
β name β wkb_geometry β
β utf8 β geometry β
ββββββββββββββͺββββββββββββββββββββββββββββββββββββββββββββββ‘
β Wellington β POINT(174.77720094690068 -41.2920679923151) β
ββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββ€
β Auckland β POINT(174.763027 -36.8480549) β
ββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββ€
β Melbourne β POINT(144.9730704 -37.8180855) β
ββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββ€
β Canberra β POINT(149.1290262 -35.2830285) β
ββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββ€
β Sydney β POINT(151.2125477744749 -33.87137339218338) β
ββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββ€
β Suva β POINT(178.4417073 -18.1330159) β
ββββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββ
Preview of up to 6 row(s)
R GDAL/OGR support is more limited than the same feature in Python (e.g., SQL filters are not automatically pushed down into the data source and instead must be specified in the read call); however, it hopefully removes some friction and unlocks R support for a wider variety of use cases!
R DataFrame API¶
Last but not least, we added the building blocks for a dplyr backend to SedonaDB, including basic expression translation and support for most geometry-like objects in the r-spatial ecosystem. This API is experimental and feedback is welcome!
library(sedonadb)
url <- "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_water-point.parquet"
df <- sd_read_parquet(url)
df |>
sd_group_by(FEAT_CODE) |>
sd_summarise(
n = n(),
geometry = st_collect_agg(geometry)
) |>
sd_arrange(desc(n))
<sedonab_dataframe: NA x 3>
βββββββββββββ¬ββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β FEAT_CODE β n β geometry β
β utf8 β int64 β geometry β
βββββββββββββͺββββββββͺβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ‘
β WARK60 β 44406 β MULTIPOINT Z((534300.0192 4986233.3817 82.3999999999941β¦ β
βββββββββββββΌββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β WARA60 β 193 β MULTIPOINT Z((291828.02660000045 4851581.779999999 18.3β¦ β
βββββββββββββΌββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β WAFA60 β 90 β MULTIPOINT Z((552953.3191 4974220.8818 0.95799999999871β¦ β
βββββββββββββΌββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β WADM60 β 1 β MULTIPOINT Z((421205.36319999956 4983412.968 22.8999999β¦ β
βββββββββββββ΄ββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Preview of up to 6 row(s)
Thank you to e-kotov for the detailed feature request and motivating use case!
Whatβs Next?¶
During the 0.3.0 development cycle we merged the first steps of two exciting contributions: A GPU-accelerated spatial join and support for reading point cloud formats LAS and LAZ. In 0.4.0 we hope to have these components polished and ready to go! Finally, we hope to broaden our support for the Geography data type throughout the engine and accelerate its current implementation.
Contributors¶
git shortlog -sn apache-sedona-db-0.3.0.dev..apache-sedona-db-0.3.0
50 Dewey Dunnington
45 Kristin Cowalcijk
19 Hiroaki Yutani
6 Balthasar Teuscher
5 jp
4 Peter Nguyen
4 Yongting You
3 Feng Zhang
3 Liang Geng
2 Abeeujah
2 Matthew Powers
1 Isaac Corley
1 Jia Yu
1 John Bampton
1 Mayank Aggarwal
1 Mehak3010
1 Pranav Toggi
1 eitsupi
[^1]: It is significantly less common, but SedonaDB also supports non-Earth CRSes.