Skip to content

Python API Reference

sedonadb.context

SedonaContext

Context for executing queries using Sedona

This object keeps track of state such as registered functions, registered tables, and available memory. This is similar to a Spark SessionContext or a database connection.

Runtime configuration (memory limits, spill directory, pool type) can be set via options before executing the first query. Once the first query runs, the internal execution context is created and runtime options become read-only.

Examples:

>>> sd = sedona.db.connect()
>>> sd.options.interactive = True
>>> sd.sql("SELECT 1 as one")
┌───────┐
│  one  │
│ int64 │
╞═══════╡
│     1 │
└───────┘

Configuring memory limits:

>>> sd = sedona.db.connect()
>>> sd.options.memory_limit = "4gb"
>>> sd.options.memory_pool_type = "fair"

funcs cached property

funcs: Functions

Access Python wrappers for SedonaDB functions

col

col(name: str, qualifier: Optional[str] = None) -> Expr

Reference a column by name.

Parameters:

  • name (str) –

    The column name to reference.

  • qualifier (Optional[str], default: None ) –

    An optional table qualifier (e.g. "t" for t.x). Useful when the same column name appears in multiple input tables of a join. Defaults to None, which leaves the column unqualified and lets the planner resolve against the surrounding schema.

Examples:

>>> sd = sedona.db.connect()
>>> sd.col("x")
Expr(x)
>>> sd.col("x", "t")
Expr(t.x)

create_data_frame

create_data_frame(obj: Any, schema: Any = None) -> DataFrame

Create a DataFrame from an in-memory or protocol-enabled object.

Converts supported Python objects into a SedonaDB DataFrame so you can run SQL and spatial operations on them.

Parameters:

  • obj (Any) –

    A supported object: - pandas DataFrame - GeoPandas DataFrame - Polars DataFrame - pyarrow Table

  • schema (Any, default: None ) –

    Optional object implementing __arrow_schema__ for providing an Arrow schema.

Returns:

  • DataFrame ( DataFrame ) –

    A SedonaDB DataFrame.

Examples:

>>> import pandas as pd
>>> sd = sedona.db.connect()
>>> sd.create_data_frame(pd.DataFrame({"x": [1, 2]})).head(1).show()
┌───────┐
   x   
 int64 
╞═══════╡
     1 
└───────┘

drop_view

drop_view(name: str) -> None

Remove a named view

Parameters:

  • name (str) –

    The name of the view

Examples:

>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geom").to_view("foofy")
>>> sd.drop_view("foofy")

lit

lit(value: Any) -> Literal

Create a literal (constant) expression

Creates a Literal object around value, or returns value if it is already a Literal. This is the primary function that should be used to wrap an arbitrary Python object a constant to prepare it as input to any SedonaDB logical expression context (e.g., parameterized SQL).

Literal values can be created from a variety of Python objects whose representation as a scalar constant is unambiguous. Any object that is accepted by pyarrow.array([...]) is supported in addition to:

  • Shapely geometries become SedonaDB geometry objects.
  • GeoSeries objects of length 1 become SedonaDB geometries with CRS preserved.
  • GeoDataFrame objects with a single column and single row become SedonaDB geometries with CRS preserved.
  • Pandas DataFrame objects with a single column and single row are converted using pa.array().
  • SedonaDB DataFrame objects that evaluate to a single column and row become a scalar value according to the single represented value.
  • pyproj CRS objects become PROJJSON strings (e.g., so they may be used in ST_SetCRS(), ST_Point(), or ST_GeomFromWKT()).

read_format

read_format(
    spec: ExternalFormatSpec,
    table_paths: Union[str, Path, Iterable[str]],
    check_extension: bool = False,
    partitioning: Union[str, Iterable[str], None] = None,
) -> DataFrame

Read one or more paths using a Python-defined ExternalFormatSpec.

This is the plugin entry point: a format-specific package (e.g. sedonadb-zarr) defines an ExternalFormatSpec subclass and the user reads through it via this method. Built-in formats have their own dedicated readers (read_parquet, read_pyogrio).

Format-specific options are passed via the spec itself using spec.with_options({...}), which returns a configured copy. Unlike read_pyogrio, this method has no options= keyword — each spec class documents its own supported keys.

Parameters:

  • spec (ExternalFormatSpec) –

    An ExternalFormatSpec instance describing how to open the underlying source.

  • table_paths (Union[str, Path, Iterable[str]]) –

    A str, Path, or iterable of paths/URLs.

  • check_extension (bool, default: False ) –

    When True, error if a non-collection path doesn't end in the spec's extension. Defaults to False.

  • partitioning (Union[str, Iterable[str], None], default: None ) –

    Optional list of column names for hive-style partitioning. When reading from a directory with paths like /col=value/file.ext, partition column names are auto-discovered by default (partitioning=None). Explicitly specify column names (e.g., ["col"]) to override auto-discovery, or pass an empty list [] to disable partitioning entirely.

Examples:

>>> import sedonadb_zarr
>>> sd = sedona.db.connect()
>>> spec = sedonadb_zarr.ZarrFormatSpec().with_options(
...     {"arrays": ["temperature"]}
... )
>>> sd.read_format(spec, "file:///path/to/foo.zarr").show()

read_parquet

read_parquet(
    table_paths: Union[str, Path, Iterable[str]],
    options: Optional[Dict[str, Any]] = None,
    geometry_columns: Optional[Union[str, Dict[str, Any]]] = None,
    validate: bool = False,
    partitioning: Union[str, Iterable[str], None] = None,
) -> DataFrame

Create a DataFrame from one or more Parquet files

Parameters:

  • table_paths (Union[str, Path, Iterable[str]]) –

    A str, Path, or iterable of paths containing URLs to Parquet files.

  • options (Optional[Dict[str, Any]], default: None ) –

    Optional dictionary of options to pass to the Parquet reader. For S3 access, use {"aws.skip_signature": True, "aws.region": "us-west-2"} for anonymous access to public buckets.

  • geometry_columns (Optional[Union[str, Dict[str, Any]]], default: None ) –

    Optional JSON string or dict mapping column name to GeoParquet column metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark binary WKB columns as geometry columns or correct metadata such as the column CRS.

    Supported keys: - encoding: "WKB" (required) - crs: (e.g., "EPSG:4326") - edges: "planar" (default) or "spherical" - ...other supported keys See the specification for details: https://geoparquet.org/releases/v1.1.0/

    Useful for: - Legacy Parquet files with Binary columns containing WKB payloads. - Overriding GeoParquet metadata when fields like crs are missing.

    Precedence: - GeoParquet metadata is used to infer geometry columns first. - geometry_columns then overrides the auto-inferred schema: - If a column is not geometry in metadata but appears in geometry_columns, it is treated as a geometry column. - If a column is geometry in metadata and also appears in geometry_columns, the provided metadata replaces the inferred metadata for that column. Missing optional fields are treated as absent/defaults.

    Example: - For geo.parquet(geo1: geometry, geo2: geometry, geo3: binary), read_parquet("geo.parquet", geometry_columns='{"geo2": {"encoding": "WKB"}, "geo3": {"encoding": "WKB"}}') overrides geo2 metadata and treats geo3 as a geometry column. - If geo inferred from metadata has: - geo: {"encoding": "wkb", "crs": "EPSG:4326", ..} and geometry_columns provides: - geo: {"encoding": "wkb", "crs": "EPSG:3857"} then the result is (full overwrite): - geo: {"encoding": "wkb", "crs": "EPSG:3857", ..} (other fields are defaulted)

    Safety: - Columns specified here can optionally be validated according to the validate option (e.g., WKB encoding checks). If validation is not enabled, inconsistent data may cause undefined behavior.

  • validate (bool, default: False ) –

    When set to True, geometry column contents are validated against their metadata. Metadata can come from the source Parquet file or the user-provided geometry_columns option. Only supported properties are validated; unsupported properties are ignored. If validation fails, execution stops with an error.

    Currently the only property that is validated is the WKB of input geometry columns.

  • partitioning (Union[str, Iterable[str], None], default: None ) –

    Optional list of column names for hive-style partitioning. When reading from a directory with paths like /col=value/file.parquet, partition column names are auto-discovered by default (partitioning=None). Explicitly specify column names (e.g., ["col"]) to override auto-discovery, or pass an empty list [] to disable partitioning entirely.

Examples:

>>> sd = sedona.db.connect()
>>> url = "https://github.com/apache/sedona-testing/raw/refs/heads/main/data/parquet/geoparquet-1.1.0.parquet"
>>> sd.read_parquet(url)
<sedonadb.dataframe.DataFrame object at ...>

read_pyogrio

read_pyogrio(
    table_paths: Union[str, Path, Iterable[str]],
    options: Optional[Dict[str, Any]] = None,
    extension: str = "",
    partitioning: Union[str, Iterable[str], None] = None,
) -> DataFrame

Read spatial file formats using GDAL/OGR via pyogrio

Creates a DataFrame from one or more paths or URLs to a file supported by pyogrio, which is the same package that powers geopandas.read_file() by default. Some common formats that can be opened using GDAL/OGR are FlatGeoBuf, GeoPackage, Shapefile, GeoJSON, and many, many more. See https://gdal.org/en/stable/drivers/vector/index.html for a list of available vector drivers.

Like read_parquet(), globs and directories can be specified in addition to individual file paths. Paths ending in .zip are automatically prepended with /vsizip/ (i.e., are automatically unzipped by GDAL). HTTP(s) URLs are supported via /vsicurl/.

Parameters:

  • table_paths (Union[str, Path, Iterable[str]]) –

    A str, Path, or iterable of paths containing URLs or paths. Globs (i.e., path/*.gpkg), directories, and zipped versions of otherwise readable files are supported.

  • options (Optional[Dict[str, Any]], default: None ) –

    An optional mapping of key/value pairs passed to pyogrio/GDAL. Supports pyogrio keyword arguments (e.g., layer, where, sql, max_features) as well as GDAL driver-specific dataset open options. Additionally, path_suffix can append a subpath to the resolved GDAL source (e.g., {"path_suffix": "data.gdb"} for a GDB stored inside a .zip file).

  • extension (str, default: '' ) –

    An optional file extension (e.g., "fgb") used when table_paths specifies one or more directories or a glob that does not enforce a file extension.

  • partitioning (Union[str, Iterable[str], None], default: None ) –

    Optional list of column names for hive-style partitioning. When reading from a directory with paths like /col=value/file.fgb, partition column names are auto-discovered by default (partitioning=None). Explicitly specify column names (e.g., ["col"]) to override auto-discovery, or pass an empty list [] to disable partitioning entirely.

Examples:

>>> import geopandas
>>> import tempfile
>>> sd = sedona.db.connect()
>>> df = geopandas.GeoDataFrame({
...     "geometry": geopandas.GeoSeries.from_wkt(["POINT (0 1)"], crs=3857)
... })
>>>
>>> with tempfile.TemporaryDirectory() as td:
...     df.to_file(f"{td}/df.fgb")
...     sd.read_pyogrio(f"{td}/df.fgb").show()
...
┌──────────────┐
 wkb_geometry 
   geometry   
╞══════════════╡
 POINT(0 1)   
└──────────────┘

register

register(component: Any, **kwargs: Any) -> None

Register an extension component

The following types of components are currently supported:

  • Python UDFs annotated with arrow_aggregate_udf or arrow_udf
  • An ExternalFormatSpec implementing a custom datasource type
  • An object implementing sedonadb_extension(ctx, **kwargs), which is called with this context and any keyword arguments passed.

The extension interface is experimental and may change.

Parameters:

  • component (Any) –

    A Python object implementing one of the above protocols.

  • **kwargs (Any, default: {} ) –

    Extension-specific options, supported for specific types of components.

Examples:

>>> import pyarrow as pa
>>> from sedonadb import udf
>>> sd = sedona.db.connect()
>>> @udf.arrow_udf(pa.int64(), [udf.STRING])
... def char_count(arg0):
...     arg0 = pa.array(arg0.to_array())
...
...     return pa.array(
...         (len(item) for item in arg0.to_pylist()),
...         pa.int64()
...     )
...
>>> sd.register(char_count)
>>> sd.sql("SELECT char_count('abcde') as col").show()
┌───────┐
  col  
 int64 
╞═══════╡
     5 
└───────┘

sql

sql(sql: str, *, params: Union[List, Tuple, Dict, None] = None) -> DataFrame

Create a DataFrame by executing SQL

Parses a SQL string into a logical plan and returns a DataFrame that can be used to request results or further modify the query.

Parameters:

  • sql (str) –

    A single SQL statement.

  • params (Union[List, Tuple, Dict, None], default: None ) –

    An optional specification of parameters to bind if sql contains placeholders (e.g., $1 or $my_param). Use a list or tuple to replace positional parameters or a dictionary to replace named parameters. This is shorthand for .sql(...).with_params(...) that is syntax-compatible with DuckDB. See lit() for a list of supported Python objects.

Examples:

>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) AS geom").show()
┌────────────┐
│    geom    │
│  geometry  │
╞════════════╡
│ POINT(0 1) │
└────────────┘
>>> sd.sql("SELECT ST_Point($1, $2) AS geom", params=(0, 1)).show()
┌────────────┐
│    geom    │
│  geometry  │
╞════════════╡
│ POINT(0 1) │
└────────────┘
>>> sd.sql("SELECT ST_Point($x, $y) AS geom", params={"x": 0, "y": 1}).show()
┌────────────┐
│    geom    │
│  geometry  │
╞════════════╡
│ POINT(0 1) │
└────────────┘

view

view(name: str) -> DataFrame

Create a DataFrame from a named view

Refer to a named view registered with this context.

Parameters:

  • name (str) –

    The name of the view

Examples:

>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geom").to_view("foofy")
>>> sd.view("foofy").show()
┌────────────┐
│    geom    │
│  geometry  │
╞════════════╡
│ POINT(0 1) │
└────────────┘
>>> sd.drop_view("foofy")

configure_gdal

configure_gdal(
    preset: Optional[
        Literal["auto", "rasterio", "pyogrio", "conda", "homebrew", "system"]
    ] = None,
    *,
    shared_library: Optional[Union[str, Path]] = None,
    verbose: bool = False,
) -> None

Configure GDAL source

SedonaDB loads GDAL dynamically at runtime. This is normally configured on package load but may need additional configuration (particularly if the automatic configuration fails).

This function may be called at any time; however, once a GDAL-backed operation has been performed, subsequent configuration has no effect.

Parameters:

  • preset (Optional[Literal['auto', 'rasterio', 'pyogrio', 'conda', 'homebrew', 'system']], default: None ) –

    One of: - None: Use a custom shared_library path. - auto: Try all presets in the order rasterio, pyogrio, conda, homebrew, system and warn if none succeeded. - pyogrio: Attempt to use the GDAL shared library bundled with pyogrio. This aligns the GDAL version with the one used by read_pyogrio() / geopandas.read_file(). - rasterio: Attempt to use the GDAL shared library bundled with rasterio. - conda: Attempt to load libgdal installed via conda install libgdal. - homebrew: Attempt to load libgdal installed via brew install gdal. - system: Attempt to load libgdal from a directory already on LD_LIBRARY_PATH (Linux), DYLD_LIBRARY_PATH (macOS), or PATH (Windows).

  • shared_library (Optional[Union[str, Path]], default: None ) –

    Path to a GDAL shared library.

  • verbose (bool, default: False ) –

    If True, print information about the configuration process.

Examples:

>>> sedona.db.configure_gdal("auto")

configure_proj

configure_proj(
    preset: Literal[
        "auto", "pyproj", "homebrew", "conda", "system", None
    ] = None,
    *,
    shared_library: Union[str, Path] = None,
    database_path: Union[str, Path] = None,
    search_path: Union[str, Path] = None,
    verbose: bool = False,
)

Configure PROJ source

SedonaDB loads PROJ dynamically to ensure aligned results and configuration against other Python and/or system libraries. This is normally configured on package load but may need additional configuration (particularly if the automatic configuration fails).

This function may be called at any time; however, once ST_Transform has been called, subsequent configuration has no effect.

Parameters:

  • preset (Literal['auto', 'pyproj', 'homebrew', 'conda', 'system', None], default: None ) –

    One of: - None: Use custom values of shared_library and/or other keyword arguments. - auto: Try all presets in the order pyproj, conda, homebrew, system and warn if none succeeded. - pyproj: Attempt to use shared libraries bundled with pyproj. This aligns transformations with those performed by geopandas and is the option that is tried first. - conda: Attempt to load libproj and data files installed via conda install proj. - homebrew: Attempt to load libproj and data files installed via brew install proj. Note that the Homebrew install also includes proj-data grid files and may be able to perform more accurate transforms by default/without network capability. - system: Attempt to load libproj from a directory already on LD_LIBRARY_PATH (linux), DYLD_LIBRARY_PATH (MacOS), or PATH (Windows). This should find the version of PROJ installed by a Linux system package manager.

  • shared_library (Union[str, Path], default: None ) –

    Path to a PROJ shared library.

  • database_path (Union[str, Path], default: None ) –

    Path to the PROJ database (proj.db).

  • search_path (Union[str, Path], default: None ) –

    Path to the directory containing PROJ data files.

  • verbose (bool, default: False ) –

    If True, print information about the configuration process.

Examples:

>>> sedona.db.configure_proj("auto")

connect

connect() -> SedonaContext

Create a new SedonaContext

Runtime configuration (memory limits, spill directory, pool type) can be set via options on the returned context before executing the first query::

sd = sedona.db.connect()
sd.options.memory_limit = "4gb"
sd.options.memory_pool_type = "fair"
sd.options.temp_dir = "/tmp/sedona-spill"

gdal_version

gdal_version() -> Optional[str]

Return the GDAL release version string, or None if GDAL is not loaded.

This function triggers lazy GDAL initialization if configure_gdal() was previously called but the library has not yet been loaded. If GDAL cannot be loaded, None is returned instead of raising an error.

Returns:

Examples:

>>> import sedonadb
>>> sedonadb.gdal_version()  # doctest: +SKIP
'3.8.4'

sedonadb.dataframe

DataFrame

Representation of a (lazy) collection of columns

This object is usually constructed from sd = sedona.db.connect() by importing an object with sd.create_data_frame(), reading a file with sd.read_parquet()/sd.read_pyogrio(), or executing SQL with sd.sql(). Once created, a DataFrame can be modified using the Python API (e.g., .select(), .filter(), .sort(), .limit()) or by creating a temporary view with .to_view("name") and querying the resulting view using sd.sql(). The Python API aims to provide a minimal subset of functionality derived primarily from Ibis and DuckDB's relational APIs.

Examples:

>>> sd = sedona.db.connect()
>>> sd.options.interactive = True
>>> df = sd.sql("SELECT * FROM (VALUES (1, 'a'), (2, 'b'), (3, 'c')) AS t(x, y)")
>>> df.limit(2)
┌───────┬──────┐
   x      y  
 int64  utf8 
╞═══════╪══════╡
     1  a    
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
     2  b    
└───────┴──────┘

Columns can be specified as part of the Python API in several ways:

>>> df.select("y", z=df.x + 1)
┌──────┬───────┐
   y     z   
 utf8  int64 
╞══════╪═══════╡
 a         2 
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
 b         3 
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
 c         4 
└──────┴───────┘
>>> df.select("y", z=df["x"] + 1)
┌──────┬───────┐
   y     z   
 utf8  int64 
╞══════╪═══════╡
 a         2 
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
 b         3 
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
 c         4 
└──────┴───────┘
>>> df.select("y", z=sd.col("x") + 1)
┌──────┬───────┐
   y     z   
 utf8  int64 
╞══════╪═══════╡
 a         2 
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
 b         3 
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
 c         4 
└──────┴───────┘

Literals can be specified explicitly using `sd.lit(obj)` but can also
be used directly in function calls or comparisons.

>>> df.filter(df.x > 1)
┌───────┬──────┐
   x      y  
 int64  utf8 
╞═══════╪══════╡
     2  b    
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
     3  c    
└───────┴──────┘
>>> df.filter(df.x > sd.lit(1))
┌───────┬──────┐
   x      y  
 int64  utf8 
╞═══════╪══════╡
     2  b    
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
     3  c    
└───────┴──────┘

Supported literals include any Python object supported by pyarrow
in addition to GeoPandas, Shapely, and PyProj objects for use in
geometry functions.

Functions are available from `sd.funcs`. Assigning this to a local
variable generally leads to better autocomplete but is not necessary.
In addition to standard DataFusion scalar and aggregate functions,
a number of spatial functions are provided. See the
[SedonaDB SQL Reference](https://sedona.apache.org/sedonadb/latest/reference/sql/)
for a list of supported functions.

>>> f = sd.funcs
>>> df.select(geometry=f.st_point(df.x, df.x + 1))
┌────────────┐
  geometry  
  geometry  
╞════════════╡
 POINT(1 2) 
├╌╌╌╌╌╌╌╌╌╌╌╌┤
 POINT(2 3) 
├╌╌╌╌╌╌╌╌╌╌╌╌┤
 POINT(3 4) 
└────────────┘

columns property

columns: list[str]

Return a list of column names

schema property

schema

Return the column names and data types

Examples:

>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT 1 as one")
>>> df.schema
SedonaSchema with 1 field:
  one: non-nullable int64<Int64>
>>> df.schema.field(0)
SedonaField one: non-nullable int64<Int64>
>>> df.schema.field(0).name, df.schema.field(0).type
('one', SedonaType int64<Int64>)

__arrow_c_schema__

__arrow_c_schema__()

ArrowSchema PyCapsule interface

Returns a PyCapsule wrapping an Arrow C Schema for interoperability with libraries that understand Arrow C data types. See the Arrow PyCapsule interface for more details.

__arrow_c_stream__

__arrow_c_stream__(requested_schema: Any = None)

ArrowArrayStream Stream PyCapsule interface

Returns a PyCapsule wrapping an Arrow C ArrayStream for interoperability with libraries that understand Arrow C data types. See the Arrow PyCapsule interface for more details.

Parameters:

  • requested_schema (Any, default: None ) –

    A PyCapsule representing the desired output schema.

__dir__

__dir__()

List attributes of this object

This is primarily intended to power autocomplete, so columns are placed first.

__getattr__

__getattr__(name)

Syntactic sugar for column access

Allows columns to be accessed like t.geometry for columns that do not collide with existing attributes. Programmatic usage should use sd.col() or t["col"].

__getitem__

__getitem__(key: Union[str, int]) -> Expr

Reference a single column by name or position.

Returns an Expr referencing the requested column. The return-type is always Expr (no DataFrame ⏐ Expr union) so that IDEs and type-aware tools can resolve df["x"].<method> cleanly.

For row filtering use df.filter(predicate). For multi-column projection use df.select(*cols).

Parameters:

  • key (Union[str, int]) –

    A column name (str) or a 0-based column position (int, negative indices count from the end).

Raises:

  • KeyError

    A string key that does not match any column. The error message lists the available columns.

  • IndexError

    An integer index outside the column range.

  • TypeError

    Any other key type, including bool, slice, list, and Expr. The message points at select or filter for those use cases.

Examples:

>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES (1, 10), (2, 20)) AS t(x, y)").alias("t")
>>> df["x"]
Expr(t.x)
>>> df[1]
Expr(t.y)
>>> df[-1]
Expr(t.y)

agg

agg(*exprs: Expr, **named_exprs: Expr) -> DataFrame

Aggregate the entire DataFrame to a single row.

Aggregate expressions can be passed positionally or as keyword arguments. With keyword arguments the keyword becomes the output column name — df.agg(total=sd.funcs.sum(sd.col("x"))) is shorthand for df.agg(sd.funcs.sum(sd.col("x")).alias("total")). The two forms can be mixed in a single call.

Parameters:

  • *exprs (Expr, default: () ) –

    Positional aggregate expressions.

  • **named_exprs (Expr, default: {} ) –

    Keyword aggregate expressions; each keyword is applied as the output alias of the corresponding expression.

Examples:

>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES (1), (2), (3), (4)) AS t(x)")
>>> df.agg(sd.funcs.sum(sd.col("x")).alias("total")).show()
┌───────┐
│ total │
│ int64 │
╞═══════╡
│    10 │
└───────┘
>>> df.agg(total=sd.funcs.sum(sd.col("x"))).show()
┌───────┐
│ total │
│ int64 │
╞═══════╡
│    10 │
└───────┘

alias

alias(name: str) -> DataFrame

Qualify all columns of this DataFrame with a given name

Returns a DataFrame where all columns are qualified to disambiguate references in join expressions. This is the equivalent of aliasing a subquery in SQL ((SELECT * FROM df) AS name).

arrow

arrow(*, simplify: bool = False) -> RecordBatchReader

Alias of to_arrow_reader()

count

count() -> int

Compute the number of rows in this DataFrame

Examples:

>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES ('one'), ('two'), ('three')) AS t(val)")
>>> df.count()
3

cross_join

cross_join(other: DataFrame) -> DataFrame

Cartesian product of two DataFrames.

Returns a DataFrame containing every pair of rows from self and other; the row count is the product of the two input row counts. Both sides' columns are kept verbatim — disambiguate with df.alias(...) on either side if column names collide.

Parameters:

  • other (DataFrame) –

    The right-hand DataFrame.

Examples:

>>> sd = sedona.db.connect()
>>> left = sd.sql("SELECT * FROM (VALUES (1), (2)) AS t(x)")
>>> right = sd.sql("SELECT * FROM (VALUES ('a'), ('b')) AS t(y)")
>>> left.cross_join(right).sort("x", "y").show()
┌───────┬──────┐
│   x   ┆   y  │
│ int64 ┆ utf8 │
╞═══════╪══════╡
│     1 ┆ a    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│     1 ┆ b    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│     2 ┆ a    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│     2 ┆ b    │
└───────┴──────┘

drop

drop(*cols: str) -> DataFrame

Drop the named columns.

Returns a new lazy DataFrame with each named column removed. Only column-name strings are accepted; expression arguments are rejected because "drop a computed expression" has no meaning at the schema level. Unknown column names raise a SedonaError at plan-build time, with the list of valid field names included in the message.

Parameters:

  • *cols (str, default: () ) –

    One or more column names to drop. At least one is required.

Examples:

>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT 1 AS a, 2 AS b, 3 AS c")
>>> df.drop("b").show()
┌───────┬───────┐
│   a      c   │
│ int64  int64 │
╞═══════╪═══════╡
│     1      3 │
└───────┴───────┘

execute

execute() -> None

Execute the plan represented by this DataFrame

This will execute the query without collecting results into memory, which is useful for executing SQL statements like SET, CREATE VIEW, and CREATE EXTERNAL TABLE.

Note that this is functionally similar to .count() except it does not apply any optimizations (e.g., does not use statistics to avoid reading data to calculate a count).

Examples:

>>> sd = sedona.db.connect()
>>> sd.sql("CREATE OR REPLACE VIEW temp_view AS SELECT 1 as one").execute()
0
>>> sd.view("temp_view").show()
┌───────┐
│  one  │
│ int64 │
╞═══════╡
│     1 │
└───────┘

explain

explain(type: str = 'standard', format: str = 'indent') -> DataFrame

Return the execution plan for this DataFrame as a DataFrame

Retrieves the logical and physical execution plans that will be used to compute this DataFrame. This is useful for understanding query performance and optimization.

Parameters:

  • type (str, default: 'standard' ) –

    The type of explain plan to generate. Supported values are: "standard" (default) - shows logical and physical plans, "extended" - includes additional query optimization details, "analyze" - executes the plan and reports actual metrics.

  • format (str, default: 'indent' ) –

    The format to use for displaying the plan. Supported formats are "indent" (default), "tree", "pgjson" and "graphviz".

Returns:

  • DataFrame

    A DataFrame containing the execution plan information with columns

  • DataFrame

    'plan_type' and 'plan'.

Examples:

>>> import sedonadb
>>> con = sedonadb.connect()
>>> df = con.sql("SELECT 1 as one")
>>> df.explain().show()
┌───────────────┬─────────────────────────────────┐
   plan_type                  plan              
      utf8                    utf8              
╞═══════════════╪═════════════════════════════════╡
 logical_plan   Projection: Int64(1) AS one     
                  EmptyRelation: rows=1         
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
 physical_plan  ProjectionExec: expr=[1 as one] 
                  PlaceholderRowExec            
                                                
└───────────────┴─────────────────────────────────┘

filter

filter(*exprs: Expr) -> DataFrame

Filter rows by one or more boolean expressions.

Multiple expressions are combined with logical AND, so df.filter(a, b) is equivalent to df.filter(a & b) and to df.filter(a).filter(b) (the planner sees one conjunction in the first two forms and two filter nodes in the third).

Only Expr arguments are accepted. Strings are not interpreted as SQL predicates (that is a separate feature). Bare Literal values are also rejected — filter(lit(True)) is almost certainly a typo; if you really mean a constant predicate, wrap a column expression like col("flag") == lit(True).

Parameters:

  • *exprs (Expr, default: () ) –

    One or more boolean sedonadb.expr.Expr predicates. At least one argument is required.

Examples:

>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES (1), (2), (3), (4)) AS t(x)")
>>> df.filter(df.x > 2).show()
┌───────┐
│   x   │
│ int64 │
╞═══════╡
│     3 │
├╌╌╌╌╌╌╌┤
│     4 │
└───────┘

group_by

group_by(*keys: Union[str, Expr]) -> GroupedDataFrame

Group rows by one or more keys for aggregation.

Returns a GroupedDataFrame whose .agg(...) method runs the aggregation. Strings are auto-promoted to column references (same pattern as sort); arbitrary Expr values are accepted as computed group keys.

Parameters:

  • *keys (Union[str, Expr], default: () ) –

    One or more str column names or Expr group keys. At least one is required.

Examples:

>>> sd = sedona.db.connect()
>>> df = sd.sql(
...     "SELECT * FROM (VALUES ('a', 1), ('a', 2), ('b', 3)) AS t(k, v)"
... )
>>> df.group_by("k").agg(total=sd.funcs.sum(sd.col("v"))).sort("k").show()
┌──────┬───────┐
│   k  ┆ total │
│ utf8 ┆ int64 │
╞══════╪═══════╡
│ a    ┆     3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b    ┆     3 │
└──────┴───────┘

head

head(n: int = 5) -> DataFrame

Limit result to the first n rows

Note that this is non-deterministic for many queries.

Parameters:

  • n (int, default: 5 ) –

    The number of rows to return

Examples:

>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES ('one'), ('two'), ('three')) AS t(val)")
>>> df.head(1).show()
┌──────┐
│  val │
│ utf8 │
╞══════╡
│ one  │
└──────┘

join

join(
    other: DataFrame,
    on: Union[str, List[str], Expr, List[Expr]],
    how: Literal[
        "inner",
        "left",
        "right",
        "outer",
        "full",
        "left_semi",
        "semi",
        "left_anti",
        "anti",
        "right_semi",
        "right_anti",
    ] = "inner",
) -> DataFrame

Join two DataFrames.

on accepts either common column names or arbitrary boolean predicates:

  • Column names (str or list[str]): the named column(s) must exist on both sides. Result has a single copy of each join key — matching pandas / Polars / PySpark output shape.
  • Predicate expressions (Expr or list[Expr]): each Expr is a boolean predicate combining columns from both sides (e.g. left.k == right.k, or f.st_intersects(left.g, right.g)). Result keeps both sides' columns verbatim — disambiguate with df.alias(...) on either side.

Parameters:

  • other (DataFrame) –

    The right-hand DataFrame to join against.

  • on (Union[str, List[str], Expr, List[Expr]]) –

    Join key(s). A column name (str), a list of column names, a single boolean Expr, or a list of boolean Exprs combined with logical AND.

  • how (Literal['inner', 'left', 'right', 'outer', 'full', 'left_semi', 'semi', 'left_anti', 'anti', 'right_semi', 'right_anti'], default: 'inner' ) –

    Join type. Canonical: "inner" (default), "left", "right", "outer", "left_semi", "left_anti", "right_semi", "right_anti". PySpark aliases also accepted: "full" (= outer), "semi" (= left_semi), "anti" (= left_anti).

Examples:

>>> sd = sedona.db.connect()
>>> left = sd.sql(
...     "SELECT * FROM (VALUES (1, 'a'), (2, 'b')) AS t(k, v)"
... )
>>> right = sd.sql(
...     "SELECT * FROM (VALUES (1, 'x'), (2, 'y'), (3, 'z')) AS t(k, w)"
... )
>>> left.join(right, on="k").sort("k").show()
┌───────┬──────┬──────┐
│   k   ┆   v  ┆   w  │
│ int64 ┆ utf8 ┆ utf8 │
╞═══════╪══════╪══════╡
│     1 ┆ a    ┆ x    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│     2 ┆ b    ┆ y    │
└───────┴──────┴──────┘

limit

limit(n: Optional[int], /, *, offset: int = 0) -> DataFrame

Limit result to n rows starting at offset

Note that this is non-deterministic for many queries.

Parameters:

  • n (Optional[int]) –

    The number of rows to return

  • offset (int, default: 0 ) –

    The number of rows to skip (optional)

Examples:

>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES ('one'), ('two'), ('three')) AS t(val)")
>>> df.limit(1).show()
┌──────┐
│  val │
│ utf8 │
╞══════╡
│ one  │
└──────┘

>>> df.limit(1, offset=2).show()
┌───────┐
│  val  │
│  utf8 │
╞═══════╡
│ three │
└───────┘

select

select(
    *exprs: Union[Expr, str, Literal], **kwargs: Union[Expr, str, Literal]
) -> DataFrame

Project a set of columns or expressions.

Returns a new lazy DataFrame whose columns are exactly the projection. Column-name strings are converted to column references via sedonadb.expr.col internally, so df.select("x", "y") and df.select(col("x"), col("y")) produce the same plan. Literals produced by sedonadb.expr.lit() are also accepted; use lit(value).alias(name) to give the literal column a name.

Keyword arguments provide a shorthand for aliasing: the key becomes the output column name and the value is the expression. For example, df.select(z=df.x + 1) is equivalent to df.select((df.x + 1).alias("z")).

Parameters:

  • *exprs (Union[Expr, str, Literal], default: () ) –

    Zero or more positional arguments. Each argument is either a column name (str), a sedonadb.expr.Expr, or a sedonadb.expr.Literal.

  • **kwargs (Union[Expr, str, Literal], default: {} ) –

    Zero or more keyword arguments where each key is the desired output column name and each value is a column name (str), an Expr, or a Literal.

Note

At least one positional or keyword argument is required.

Examples:

>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT 1 AS a, 2 AS b")
>>> df.select("a", (df.b + 1).alias("b_plus_1")).show()
┌───────┬──────────┐
│   a   ┆ b_plus_1 │
│ int64 ┆   int64  │
╞═══════╪══════════╡
│     1 ┆        3 │
└───────┴──────────┘

>>> df.select("a", b_plus_1=df.b + 1).show()
┌───────┬──────────┐
│   a   ┆ b_plus_1 │
│ int64 ┆   int64  │
╞═══════╪══════════╡
│     1 ┆        3 │
└───────┴──────────┘

show

show(
    limit: Optional[int] = 10, width: Optional[int] = None, ascii: bool = False
) -> str

Print the first limit rows to the console

Parameters:

  • limit (Optional[int], default: 10 ) –

    The number of rows to display. Using None will display the entire table which may result in very large output.

  • width (Optional[int], default: None ) –

    The number of characters to use to display the output. If None, uses Options.width or detects the value from the current terminal if available. The default width is 100 characters if a width is not set by another mechanism.

  • ascii (bool, default: False ) –

    Use True to disable UTF-8 characters in the output.

Examples:

>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geometry").show()
┌────────────┐
│  geometry  │
│  geometry  │
╞════════════╡
│ POINT(0 1) │
└────────────┘

sort

sort(*keys: Union[str, Expr, SortExpr]) -> DataFrame

Sort rows by one or more keys.

Each argument is either a column name (str), an Expr, or a SortExpr (built via Expr.asc() / Expr.desc() or sedonadb.expr.sort_expr(...)). Strings and bare Exprs auto-promote to ascending sort keys with nulls placed last; for descending order or null-first placement, use the explicit SortExpr forms.

Null placement defaults to "nulls last" for both ascending and descending sorts (overriding DataFusion's SQL-style nulls-first- on-descending). Override via sort_expr(expr, asc=..., nulls_first=...) for the SQL behavior.

Parameters:

  • *keys (Union[str, Expr, SortExpr], default: () ) –

    One or more sort keys, in order of priority. At least one is required.

Examples:

>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES (3), (1), (2)) AS t(x)")
>>> df.sort("x").show()
┌───────┐
│   x   │
│ int64 │
╞═══════╡
│     1 │
├╌╌╌╌╌╌╌┤
│     2 │
├╌╌╌╌╌╌╌┤
│     3 │
└───────┘
>>> df.sort(sd.col("x").desc()).show()
┌───────┐
│   x   │
│ int64 │
╞═══════╡
│     3 │
├╌╌╌╌╌╌╌┤
│     2 │
├╌╌╌╌╌╌╌┤
│     1 │
└───────┘

to_arrow_reader

to_arrow_reader(*, simplify: bool = False) -> RecordBatchReader

Execute and stream results as a PyArrow RecordBatchReader

Executes the logical plan represented by this object and returns a PyArrow RecordBatchReader. This requires that pyarrow is installed.

Parameters:

  • simplify (bool, default: False ) –

    Use True to simplify Arrow storage types at the export boundary, for example Utf8View to Utf8 and BinaryView to Binary.

Examples:

>>> sd = sedona.db.connect()
>>> reader = sd.sql(
...     "SELECT ST_Point(0, 1) as geometry"
... ).to_arrow_reader()
>>> reader.read_all()
pyarrow.Table
geometry: extension<geoarrow.wkb<WkbType>> not null
----
geometry: [[01010000000000000000000000000000000000F03F]]

to_arrow_table

to_arrow_table(schema: Any = None) -> Table

Execute and collect results as a PyArrow Table

Executes the logical plan represented by this object and returns a PyArrow Table. This requires that pyarrow is installed.

Parameters:

  • schema (Any, default: None ) –

    The requested output schema or None to use the inferred schema.

Examples:

>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geometry").to_arrow_table()
pyarrow.Table
geometry: extension<geoarrow.wkb<WkbType>> not null
----
geometry: [[01010000000000000000000000000000000000F03F]]

to_memtable

to_memtable() -> DataFrame

Collect a data frame into a memtable

Executes the logical plan represented by this object and returns a DataFrame representing it.

Does not guarantee ordering of rows. Use to_arrow_table() if ordering is needed.

Examples:

>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geom").to_memtable().show()
┌────────────┐
│    geom    │
│  geometry  │
╞════════════╡
│ POINT(0 1) │
└────────────┘

to_pandas

to_pandas(geometry: Optional[str] = None) -> Union[DataFrame, GeoDataFrame]

Execute and collect results as a pandas DataFrame or GeoDataFrame

If this data frame contains geometry columns, collect results as a single geopandas.GeoDataFrame. Otherwise, collect results as a pandas.DataFrame.

Parameters:

  • geometry (Optional[str], default: None ) –

    If specified, the name of the column to use for the default geometry column. If not specified, this is inferred as the column named "geometry", the column named "geography", or the first column with a spatial data type (in that order).

Examples:

>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geometry").to_pandas()
      geometry
0  POINT (0 1)

to_parquet

to_parquet(
    path: Union[str, Path],
    *,
    options: Optional[Dict[str, Any]] = None,
    partition_by: Optional[Union[str, Iterable[str]]] = None,
    sort_by: Optional[Union[str, Iterable[str]]] = None,
    single_file_output: Optional[bool] = None,
    geoparquet_version: Literal["1.0", "1.1", "2.0", "none", None] = None,
    overwrite_bbox_columns: Optional[bool] = None,
    max_row_group_size: Optional[int] = None,
    compression: Optional[str] = None,
)

Write this DataFrame to one or more (Geo)Parquet files

For input that contains geometry columns, GeoParquet metadata is written such that suitable readers can recreate Geometry/Geography types when reading the output and potentially read fewer row groups when only a subset of the file is needed for a given query.

Parameters:

  • path (Union[str, Path]) –

    A filename or directory to which parquet file(s) should be written.

  • options (Optional[Dict[str, Any]], default: None ) –

    Key/value options to be used when constructing a parquet writer. Common options are exposed as other arguments to to_parquet(); however, this argument allows setting any DataFusion Parquet writer option. If an option is specified here and by an argument to this function, the value specified as a keyword argument takes precedence.

  • partition_by (Optional[Union[str, Iterable[str]]], default: None ) –

    A vector of column names to partition by. If non-empty, applies hive-style partitioning to the output.

  • sort_by (Optional[Union[str, Iterable[str]]], default: None ) –

    A vector of column names to sort by. Currently only ascending sort is supported.

  • single_file_output (Optional[bool], default: None ) –

    Use True or False to force writing a single Parquet file vs. writing one file per partition to a directory. By default, a single file is written if partition_by is unspecified and path ends with .parquet.

  • geoparquet_version (Literal['1.0', '1.1', '2.0', 'none', None], default: None ) –

    GeoParquet metadata version to write if output contains one or more geometry columns. The default (1.0) is the most widely supported and will result in geometry columns being recognized in many readers; however, only includes statistics at the file level.

    Use GeoParquet 1.1 to compute an additional bounding box column for every geometry column in the output: some readers can use these columns to prune row groups when files contain an effective spatial ordering. The extra columns will appear just before their geometry column and will be named "[geom_col_name]_bbox" for all geometry columns except "geometry", whose bounding box column name is just "bbox".

    Use GeoParquet 2.0 to write compatible GeoParquet metadata with Parquet-native Geometry and/or Geography data types; use "none" to omit GeoParquet metadata completely.

  • overwrite_bbox_columns (Optional[bool], default: None ) –

    Use True to overwrite any bounding box columns that already exist in the input. This is useful in a read -> modify -> write scenario to ensure these columns are up-to-date. If False (the default), an error will be raised if a bbox column already exists.

  • max_row_group_size (Optional[int], default: None ) –

    Target maximum number of rows in each row group. Defaults to the global configuration value (1M rows).

  • compression (Optional[str], default: None ) –

    Sets the Parquet compression codec. Valid values are: uncompressed, snappy, gzip(level), brotli(level), lz4, zstd(level), and lz4_raw. Defaults to the global configuration value (zstd(3)).

Examples:

>>> import tempfile
>>> sd = sedona.db.connect()
>>> td = tempfile.TemporaryDirectory()
>>> url = "https://github.com/apache/sedona-testing/raw/refs/heads/main/data/parquet/geoparquet-1.1.0.parquet"
>>> sd.read_parquet(url).to_parquet(f"{td.name}/tmp.parquet")

to_pyogrio

to_pyogrio(
    path: Union[str, Path, BytesIO],
    *,
    driver: Optional[str] = None,
    geometry_type: Optional[str] = None,
    geometry_name: Optional[str] = None,
    crs: Optional[str] = None,
    append: bool = False,
    **kwargs: Dict[str, Any],
)

Write using GDAL/OGR via pyogrio

Writes this DataFrame batchwise to a file using GDAL/OGR using the implementation provided by the pyogrio package. This is the same backend used by GeoPandas and this function is a light wrapper around pyogrio.raw.write_arrow() that fills in default values using information available to the DataFrame (e.g., geometry column and CRS).

Parameters:

  • path (Union[str, Path, BytesIO]) –

    An output path or BytesIO output buffer.

  • driver (Optional[str], default: None ) –

    An explicit GDAL OGR driver. Usually inferred from path but must be provided if path is a BytesIO. Not all drivers support writing to BytesIO.

  • geometry_type (Optional[str], default: None ) –

    A GeoJSON-style geometry type or None to provide an inferred default value (which may be "Unknown"). This is required to write some types of output (e.g. Shapefiles) and may provide files that are more efficiently read.

  • geometry_name (Optional[str], default: None ) –

    The column to write as the primary geometry column. If None, the name of the geometry column will be inferred.

  • crs (Optional[str], default: None ) –

    An optional string overriding the CRS of geometry_name.

  • append (bool, default: False ) –

    Use True to append to the file for drivers that support appending.

  • kwargs (Dict[str, Any], default: {} ) –

    Extra arguments passed to pyogrio.raw.write_arrow().

Examples:

>>> import tempfile
>>> sd = sedona.db.connect()
>>> td = tempfile.TemporaryDirectory()
>>> sd.sql("SELECT ST_Point(0, 1, 3857)").to_pyogrio(f"{td.name}/tmp.fgb")
>>> sd.read_pyogrio(f"{td.name}/tmp.fgb").show()
┌──────────────┐
 wkb_geometry 
   geometry   
╞══════════════╡
 POINT(0 1)   
└──────────────┘

to_view

to_view(name: str, overwrite: bool = False)

Create a view based on the query represented by this object

Registers this logical plan as a named view with the underlying context such that it can be referred to in SQL.

Parameters:

  • name (str) –

    The name to which this query should be referred

  • overwrite (bool, default: False ) –

    Use True to overwrite an existing view of this name

Examples:

>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geom").to_view("foofy")
>>> sd.view("foofy").show()
┌────────────┐
│    geom    │
│  geometry  │
╞════════════╡
│ POINT(0 1) │
└────────────┘

with_params

with_params(*args: List[Any], **kwargs: Dict[str, Any])

Replace unbound parameters in this query

For DataFrames that represent a logical plan that contains parameters (e.g., a SQL query of SELECT $1 + 2), replace parameters with concrete values. See lit() for a list of supported Python objects.

Parameters:

  • args (List[Any], default: () ) –

    Values to bind to positional parameters (e.g., $1, $2, $3)

  • kwargs (Dict[str, Any], default: {} ) –

    Values to bind to named parameters (e.g., $my_param). Note that positional and named parameters cannot currently be mixed (i.e., parameters must be all positional or all named).

Examples:

>>> sd = sedona.db.connect()
>>> sd.sql("SELECT $1 + 2 AS c").with_params(100).show()
┌───────┐
│   c   │
│ int64 │
╞═══════╡
│   102 │
└───────┘
>>> sd.sql("SELECT $my_param + 2 AS c").with_params(my_param=100).show()
┌───────┐
│   c   │
│ int64 │
╞═══════╡
│   102 │
└───────┘

GroupedDataFrame

A DataFrame partitioned by one or more group keys.

Produced by DataFrame.group_by(...). The class exists as a step in the chain to simplify aggregation expressions.

agg

agg(*exprs: Expr, **named_exprs: Expr) -> DataFrame

Aggregate within each group.

Same signature as DataFrame.agg: positional aggregate Exprs and/or keyword aggregates where the keyword is the output column name.

Parameters:

  • *exprs (Expr, default: () ) –

    Positional aggregate expressions.

  • **named_exprs (Expr, default: {} ) –

    Keyword aggregate expressions; each keyword becomes the output alias.

sedonadb.testing

ArrowSQLCache

A YAML-file-backed cache for Arrow-based query results.

Each entry stores a pa.Table as base64-encoded Arrow IPC. Queries are sorted alphabetically when written for stable git diffs. Results are nested under results.<engine_name> in the YAML output.

Leading comment lines (e.g., a license header) are preserved across rewrites.

BigQuery

Bases: DBEngine

A BigQuery implementation of the DBEngine using ADBC

Uses the ADBC BigQuery driver. Authentication uses Application Default Credentials (ADC) by default — run gcloud auth application-default login once to set that up. Set the following environment variables to configure the connection:

  • SEDONADB_BIGQUERY_TEST_PROJECT_ID: GCP project identifier. Defaults to "sedonadb-testing".
  • SEDONADB_BIGQUERY_TEST_DATASET_ID: Dataset identifier. In general data is not scanned for these tests because doing so would incur cost.
  • SEDONADB_BIGQUERY_TEST_CREDENTIALS_FILE: (optional) Path to a service account JSON key file. When omitted, ADC is used instead.

Unless modifying these tests, the cached results should allow these tests to run without an active connection (and should allow tests to run locally much faster as opening a connection to BigQuery is slow).

close

close()

Close the connection and flush any new cache entries to disk

set_cache_dir classmethod

set_cache_dir(cache_dir: Path | str) -> None

Set the directory containing bigquery_cache.yml.

Call this before any BigQuery instances are created to configure where the cache file is located. This is typically called from a conftest.py in the test directory.

DBEngine

Engine-agnostic catalog and SQL engine

Represents a connection to an engine, abstracting the details of registering a few common types of inputs and generating a few common types of outputs. This is intended for general testing and benchmarking usage and should not be used for anything other than that purpose. Notably, generated SQL is not hardened against injection and table creators always drop any existing table of that name.

assert_query_result

assert_query_result(query: str, expected, **kwargs) -> DBEngine

Assert a SQL query result matches an expected target

A wrapper around execute_and_collect() and assert_result() that captures the most common usage of the DBEngine.

assert_result

assert_result(result, expected, **kwargs) -> DBEngine

Assert a result against an expected target

Supported expected targets include:

  • A pyarrow.Table (compared using ==)
  • A geopandas.GeoDataFrame (compared using geopandas.testing)
  • A pandas.DataFrame (for non-spatial results; compared using pandas.testing)
  • A list of tuples where all values have been converted to strings. For geometry results, these strings are converted to WKT using geoarrow.pyarrow (which ensures a consistent WKT output format).
  • A tuple of strings as the string output of a single row
  • A string as the string output of a single column of a single row
  • A bool for a single boolean value
  • An int or float for single numeric values (optionally with a numeric_epsilon)
  • bytes for single binary values

Using Arrow table equality is the most strict (ensures exact type equality and byte-for-byte value equality); however, string output is most useful for checking logical value quality among engines. GeoPandas/Pandas expected targets generate the most useful assertion failures and are probably the best option for general usage.

close

close()

Close the connection - base implementation does nothing

create_or_skip classmethod

create_or_skip(*args, **kwargs) -> DBEngine

Create this engine or call pytest.skip()

This is the constructor that should be used in tests to ensure that integration style tests don't cause failure for contributors working on Python-only behaviour.

If SEDONADB_PYTHON_NO_SKIP_TESTS is set, this function will never skip to avoid accidentally skipping tests on CI.

create_table_arrow

create_table_arrow(name, obj) -> DBEngine

Copy an Arrow readable into an engine's native table format

create_table_pandas

create_table_pandas(name, obj) -> DBEngine

Copy a GeoPandas or Pandas table into an engine's native table format

create_table_parquet

create_table_parquet(name, paths) -> DBEngine

Scan one or more Parquet files and bring them an the engine's native table format

This is needed for engines that can't lazily scan Parquet (e.g., PostGIS) or engines that have an optimized internal format (e.g., DuckDB). The ability of engines to push down a scan into their own table format is variable.

create_view_parquet

create_view_parquet(name, paths) -> DBEngine

Create a named view of Parquet files without scanning them

This is usually the best option for a benchmark if both engines support pushing down a spatial filter into the Parquet files in question. This is not supported by the PostGIS engine.

execute_and_collect

execute_and_collect(query)

Execute a query and collect results to the driver

The output type here is engine-specific (use other methods to resolve the result into concrete output formats). Current engines typically collect results as Arrow; however, result_to_table() is required to guarantee that geometry results are encoded as GeoArrow.

This is typically the execution step that should be benchmarked (although the end-to-end time that includes data loading can also be a useful number for some result types)

geography_numeric_epsilon

geography_numeric_epsilon() -> float

Relative numeric_epsilon to use when comparing results to the SedonaDB value

install_hint classmethod

install_hint() -> str

A short install hint printed when skipping tests due to failed construction

name classmethod

name() -> str

This engine's name

A short string used to identify this engine in error messages and work around differences in behaviour.

result_to_pandas

result_to_pandas(result) -> DataFrame

Convert a query result into a pandas.DataFrame or geopandas.GeoDataFrame

result_to_table

result_to_table(result) -> Table

Convert a query result into a PyArrow Table

result_to_tuples

result_to_tuples(result, *, wkt_precision=None, **kwargs) -> List[Tuple[str]]

Convert a query result into row tuples

This option strips away fine-grained type information but is helpful for generally asserting a query result or verifying results between engines that have (e.g.) differing integer handling.

val_or_null

val_or_null(arg: Any) -> str

Format SQL expression for a value or NULL

DuckDB

Bases: DBEngine

A DuckDB implementation of the DBEngine using DuckDB Python

DuckDBSingleThread

Bases: DuckDB

DuckDB configured for single-threaded execution

PostGIS

Bases: DBEngine

A PostGIS implementation of the DBEngine using ADBC

The default constructor uses the URI of the container provided in the source repository's compose.yml; however, a custom URI can be provided as well to connect to non-docker PostGIS.

close

close()

Close the connection

PostGISSingleThread

Bases: PostGIS

PostGIS configured for single-threaded (no parallel workers) execution

SedonaDB

Bases: DBEngine

A SedonaDB implementation of the DBEngine using the Python bindings

SedonaDBSingleThread

Bases: SedonaDB

SedonaDB configured for single-threaded execution

geog_or_null

geog_or_null(arg)

Format SQL expression for a geography object or NULL

geom_or_null

geom_or_null(arg, srid=None)

Format SQL expression for a geometry object or NULL

random_geometry

random_geometry(*args, **kwargs) -> DataFrame

Generate a DataFrame with random geometries for testing purposes by calling sd_random_geometry() on an isolated SedonaDB session.

skip_if_not_exists

skip_if_not_exists(path: Path)

Skip a test using pytest.skip() if path does not exist

If SEDONADB_PYTHON_NO_SKIP_TESTS is set, this function will never skip to avoid accidentally skipping tests on CI.

val_or_null

val_or_null(arg)

Format SQL expression for a value or NULL

Use an engine-specific method when formatting bytes as there is no engine-agnostic way to to represent bytes as a SQL literal.

This is not secure (i.e., does not prevent SQL injection of any kind) and should only be used for testing.

sedonadb.dbapi

connect

connect(**kwargs: Mapping[str, Any]) -> Connection

Connect to Sedona via Python DBAPI

Creates a DBAPI-compatible connection as a thin wrapper around the ADBC Python driver manager's DBAPI compatibility layer. Support for DBAPI is experimental.

Parameters:

  • kwargs (Mapping[str, Any], default: {} ) –

    Extra keyword arguments passed to adbc_driver_manager.dbapi.Connection().

Examples:

>>> con = sedona.dbapi.connect()
>>> with con.cursor() as cur:
...     _ = cur.execute("SELECT 1 as one")
...     cur.fetchall()
[(1,)]

sedonadb.udf

BINARY module-attribute

BINARY: TypeMatcher = 'binary'

Match any binary argument (i.e., binary, binary view, large binary, fixed-size binary)

BOOLEAN module-attribute

BOOLEAN: TypeMatcher = 'boolean'

Match a boolean argument

GEOGRAPHY module-attribute

GEOGRAPHY: TypeMatcher = 'geography'

Match a geography argument

GEOMETRY module-attribute

GEOMETRY: TypeMatcher = 'geometry'

Match a geometry argument

NUMERIC module-attribute

NUMERIC: TypeMatcher = 'numeric'

Match any numeric argument

STRING module-attribute

STRING: TypeMatcher = 'string'

Match any string argument (i.e., string, string view, large string)

AggregateUdfImpl

Aggregate user-defined function wrapper.

Returned by arrow_aggregate_udf. Holds the user's class plus the type schemas; on registration, builds a factory that produces a per-accumulator _AccumulatorWrapper bridging the user's class to the Rust Accumulator trait.

ScalarUdfImpl

Scalar user-defined function wrapper

This class is a wrapper class used as the return value for user-defined function constructors. This wrapper allows the UDF to be registered with a SedonaDB context or any context that accepts DataFusion Python Scalar UDFs. This object is not intended to be used to call a UDF.

TypeMatcher

Bases: str

Helper class to mark type matchers that can be used as the input_types for user-defined functions

Note that the internal storage of the type matcher (currently a string) is arbitrary and may change in a future release. Use the constants provided by the udf module.

arrow_aggregate_udf

arrow_aggregate_udf(
    return_type: Any,
    input_types: List[Union[TypeMatcher, Any]],
    state_types: List[Any],
    *,
    volatility: Literal["immutable", "stable", "volatile"] = "immutable",
    name: Optional[str] = None,
)

Decorator for Python-implemented aggregate UDFs.

Decorates a class whose instances act as a stateful accumulator. Each grouped/global aggregation builds one instance per partial-state slot and then merges them.

Warning

SedonaDB Python UDFs are experimental and this interface may change based on user feedback.

The decorated class must define:

  • __init__(self): build a fresh accumulator (no arguments).
  • update(self, *arrays): receives one pa.Array per declared input column, splatted positionally (single input today; the array can contain nulls).
  • state(self) -> tuple: serialize the accumulator into a tuple of Python values matching state_types in order.
  • merge(self, *arrays): receives one pa.Array per element of state_types, splatted positionally. Each array has N rows for N partial states being merged in.
  • evaluate(self): return the final scalar Python value (or None for SQL NULL) matching return_type.

Parameters:

  • return_type (Any) –

    A pyarrow data type for the final aggregate result. Must be a concrete pyarrow type — TypeMatcher constants are not accepted here.

  • input_types (List[Union[TypeMatcher, Any]]) –

    One or more types describing the columns the aggregate consumes. Each entry is either a TypeMatcher constant (udf.NUMERIC, udf.GEOMETRY, …) or a concrete pyarrow type.

  • state_types (List[Any]) –

    A list of concrete pyarrow types describing the serialized state. The length must match the tuple returned by state().

  • volatility (Literal['immutable', 'stable', 'volatile'], default: 'immutable' ) –

    "immutable" (default), "stable", or "volatile".

  • name (Optional[str], default: None ) –

    SQL-visible name. Defaults to the decorated class name converted from CamelCase to snake_case (so a MyMean class is callable as my_mean).

Examples:

>>> import pyarrow as pa
>>> import pandas as pd
>>> import sedonadb
>>> from sedonadb import udf
>>> sd = sedonadb.connect()
>>>
>>> @udf.arrow_aggregate_udf(
...     return_type=pa.float64(),
...     input_types=[udf.NUMERIC],
...     state_types=[pa.float64(), pa.int64()],
... )
... class my_mean:
...     def __init__(self):
...         self.total = 0.0
...         self.count = 0
...     def update(self, batch):
...         for v in batch:
...             if v.is_valid:
...                 self.total += float(v.as_py())
...                 self.count += 1
...     def state(self):
...         return (self.total, self.count)
...     def merge(self, totals, counts):
...         for i in range(len(totals)):
...             self.total += totals[i].as_py()
...             self.count += counts[i].as_py()
...     def evaluate(self):
...         return None if self.count == 0 else self.total / self.count
...
>>> sd.register(my_mean)
>>> sd.create_data_frame(
...     pd.DataFrame({"k": ["a", "a", "b"], "v": [1.0, 3.0, 7.0]})
... ).to_view("t", overwrite=True)
>>> sd.sql("SELECT k, my_mean(v) AS m FROM t GROUP BY k ORDER BY k").show()
┌──────┬─────────┐
   k      m    
 utf8  float64 
╞══════╪═════════╡
 a         2.0 
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
 b         7.0 
└──────┴─────────┘

arrow_udf

arrow_udf(
    return_type: Any,
    input_types: List[Union[TypeMatcher, Any]] = None,
    volatility: Literal["immutable", "stable", "volatile"] = "immutable",
    name: Optional[str] = None,
)

Generic Arrow-based user-defined scalar function decorator

This decorator may be used to annotate a function that accepts arguments as Arrow array wrappers implementing the Arrow PyCapsule Interface. The annotated function must return a value of a consistent length of the appropriate type.

Warning

SedonaDB will call the provided function from multiple threads. Attempts to modify shared state from the body of the function may crash or cause unusual behaviour.

SedonaDB Python UDFs are experimental and this interface may change based on user feedback.

Parameters:

  • return_type (Any) –

    One of - A data type (e.g., pyarrow.DataType, arro3.core.DataType, nanoarrow.Schema) if this function returns the same type regardless of its inputs. - A function of arg_types (list of data types) and scalar_args (list of optional scalars) that returns a data type. This function is also responsible for returning None if this function does not apply to the input types.

  • input_types (List[Union[TypeMatcher, Any]], default: None ) –

    One of - A list where each member is a data type or a TypeMatcher. The udf.GEOMETRY and udf.GEOGRAPHY type matchers are the most useful because otherwise the function will only match spatial data types whose coordinate reference system (CRS) also matches (i.e., based on simple equality). Using these type matchers will also ensure input CRS consistency and will automatically propagate input CRSes into the output. - None, indicating that this function can accept any number of arguments of any type. Usually this is paired with a functional return_type that dynamically computes a return type or returns None if the number or types of arguments do not match.

  • volatility (Literal['immutable', 'stable', 'volatile'], default: 'immutable' ) –

    Use "immutable" for functions whose output is always consistent for the same inputs (even between queries); use "stable" for functions whose output is always consistent for the same inputs but only within the same query, and use "volatile" for functions that generate random or otherwise non-deterministic output.

  • name (Optional[str], default: None ) –

    An optional name for the UDF. If not given, it will be derived from the name of the provided function.

Examples:

>>> import pyarrow as pa
>>> from sedonadb import udf
>>> sd = sedona.db.connect()

The simplest scalar UDF only specifies return types. This implies that
the function can handle input of any type.

>>> @udf.arrow_udf(pa.string())
... def some_udf(arg0, arg1):
...     arg0, arg1 = (
...         pa.array(arg0.to_array()).to_pylist(),
...         pa.array(arg1.to_array()).to_pylist(),
...     )
...     return pa.array(
...         (f"{item0} / {item1}" for item0, item1 in zip(arg0, arg1)),
...         pa.string(),
...     )
...
>>> sd.register(some_udf)
>>> sd.sql("SELECT some_udf(123, 'abc') as col").show()
┌───────────┐
    col    
    utf8   
╞═══════════╡
 123 / abc 
└───────────┘

Use the `TypeMatcher` constants where possible to specify input.
This ensures that the function can handle the usual range of input
types that might exist for a given input.

>>> @udf.arrow_udf(pa.int64(), [udf.STRING])
... def char_count(arg0):
...     arg0 = pa.array(arg0.to_array())
...
...     return pa.array(
...         (len(item) for item in arg0.to_pylist()),
...         pa.int64()
...     )
...
>>> sd.register(char_count)
>>> sd.sql("SELECT char_count('abcde') as col").show()
┌───────┐
  col  
 int64 
╞═══════╡
     5 
└───────┘

In this case, the type matcher ensures we can also use the function
for string view input which is the usual type SedonaDB emits when
reading Parquet files.

>>> sd.sql("SELECT char_count(arrow_cast('abcde', 'Utf8View')) as col").show()
┌───────┐
  col  
 int64 
╞═══════╡
     5 
└───────┘

Geometry UDFs are best written using Shapely because pyproj (including its use
in GeoPandas) is not thread safe and can crash when attempting to look up
CRSes when importing an Arrow array. The UDF framework supports returning
geometry storage to make this possible. Coordinate reference system metadata
is propagated automatically from the input.

>>> import shapely
>>> import geoarrow.pyarrow as ga
>>> @udf.arrow_udf(ga.wkb(), [udf.GEOMETRY, udf.NUMERIC])
... def shapely_udf(geom, distance):
...     geom_wkb = pa.array(geom.storage.to_array())
...     distance = pa.array(distance.to_array())
...     geom = shapely.from_wkb(geom_wkb)
...     result_shapely = shapely.buffer(geom, distance)
...     return pa.array(shapely.to_wkb(result_shapely))
...
>>>
>>> sd.register(shapely_udf)
>>> sd.sql("SELECT ST_SRID(shapely_udf(ST_Point(0, 0), 2.0)) as col").show()
┌────────┐
   col  
 uint32 
╞════════╡
      0 
└────────┘

>>> sd.sql("SELECT ST_SRID(shapely_udf(ST_Point(0, 0, 3857), 2.0)) as col").show()
┌────────┐
   col  
 uint32 
╞════════╡
   3857 
└────────┘

Annotated functions may also declare keyword arguments `return_type` and/or `num_rows`,
which will be passed the appropriate value by the UDF framework. This facilitates writing
generic UDFs and/or UDFs with no arguments.

>>> import numpy as np
>>> def random_impl(return_type, num_rows):
...     pa_type = pa.field(return_type).type
...     return pa.array(np.random.random(num_rows), pa_type)
...
>>> @udf.arrow_udf(pa.float32(), [])
... def random_f32(*, return_type=None, num_rows=None):
...     return random_impl(return_type, num_rows)
...
>>> @udf.arrow_udf(pa.float64(), [])
... def random_f64(*, return_type=None, num_rows=None):
...     return random_impl(return_type, num_rows)
...
>>> np.random.seed(487)
>>> sd.register(random_f32)
>>> sd.register(random_f64)
>>> sd.sql("SELECT random_f32() AS f32, random_f64() as f64;").show()
┌────────────┬─────────────────────┐
     f32             f64         
   float32         float64       
╞════════════╪═════════════════════╡
 0.35385555  0.24793247139474195 
└────────────┴─────────────────────┘