Python API Reference¶
sedonadb.context ¶
SedonaContext ¶
Context for executing queries using Sedona
This object keeps track of state such as registered functions, registered tables, and available memory. This is similar to Spark's SparkSession or a database connection.
Examples:
>>> sd = sedona.db.connect()
>>> sd.options.interactive = True
>>> sd.sql("SELECT 1 as one")
┌───────┐
│ one │
│ int64 │
╞═══════╡
│ 1 │
└───────┘
create_data_frame ¶
Create a DataFrame from an in-memory or protocol-enabled object.
Converts supported Python objects into a SedonaDB DataFrame so you can run SQL and spatial operations on them.
Parameters:
- obj (Any) – A supported object: a pandas DataFrame, GeoPandas GeoDataFrame, Polars DataFrame, or pyarrow Table.
- schema (Any, default: None) – Optional object implementing __arrow_c_schema__ for providing an Arrow schema.
Returns:
- DataFrame – A SedonaDB DataFrame.
Examples:
>>> import pandas as pd
>>> sd = sedona.db.connect()
>>> sd.create_data_frame(pd.DataFrame({"x": [1, 2]})).head(1).show()
┌───────┐
│ x │
│ int64 │
╞═══════╡
│ 1 │
└───────┘
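A GeoPandas GeoDataFrame can be imported the same way; a minimal sketch (assuming geopandas is installed):
>>> import geopandas as gpd
>>> gdf = gpd.GeoDataFrame(geometry=gpd.points_from_xy([0], [1]), crs="EPSG:4326")
>>> sd.create_data_frame(gdf)
<sedonadb.dataframe.DataFrame object at ...>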
drop_view ¶
drop_view(name: str) -> None
Remove a named view
Parameters:
- name (str) – The name of the view.
Examples:
>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geom").to_view("foofy")
>>> sd.drop_view("foofy")
read_parquet ¶
read_parquet(
table_paths: Union[str, Path, Iterable[str]],
options: Optional[Dict[str, Any]] = None,
) -> DataFrame
Create a DataFrame from one or more Parquet files
Parameters:
- table_paths (Union[str, Path, Iterable[str]]) – A str, Path, or iterable of paths containing URLs to Parquet files.
- options (Optional[Dict[str, Any]], default: None) – Optional dictionary of options to pass to the Parquet reader. For S3 access, use {"aws.skip_signature": True, "aws.region": "us-west-2"} for anonymous access to public buckets.
Examples:
>>> sd = sedona.db.connect()
>>> url = "https://github.com/apache/sedona-testing/raw/refs/heads/main/data/parquet/geoparquet-1.1.0.parquet"
>>> sd.read_parquet(url)
<sedonadb.dataframe.DataFrame object at ...>
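The options documented above enable anonymous access to public S3 buckets; a minimal sketch with a hypothetical bucket path:
>>> sd.read_parquet(
...     "s3://example-bucket/data.parquet",
...     options={"aws.skip_signature": True, "aws.region": "us-west-2"},
... )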
sql ¶
Create a DataFrame by executing SQL
Parses a SQL string into a logical plan and returns a DataFrame that can be used to request results or further modify the query.
Parameters:
- sql (str) – A single SQL statement.
Examples:
>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geom")
<sedonadb.dataframe.DataFrame object at ...>
view ¶
Create a DataFrame from a named view
Refer to a named view registered with this context.
Parameters:
- name (str) – The name of the view.
Examples:
>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geom").to_view("foofy")
>>> sd.view("foofy").show()
┌────────────┐
│ geom │
│ geometry │
╞════════════╡
│ POINT(0 1) │
└────────────┘
>>> sd.drop_view("foofy")
configure_proj ¶
configure_proj(
preset: Literal[
"auto", "pyproj", "homebrew", "conda", "system", None
] = None,
*,
shared_library: Union[str, Path] = None,
database_path: Union[str, Path] = None,
search_path: Union[str, Path] = None,
verbose: bool = False,
)
Configure PROJ source
SedonaDB loads PROJ dynamically so that its results and configuration stay aligned with other Python and/or system libraries. This is normally configured on package load but may need additional configuration (particularly if the automatic configuration fails).
This function may be called at any time; however, once ST_Transform has been called, subsequent configuration has no effect.
Parameters:
- preset (Literal['auto', 'pyproj', 'homebrew', 'conda', 'system', None], default: None) – One of:
  - None: Use custom values of shared_library and/or the other keyword arguments.
  - auto: Try all presets in the order pyproj, conda, homebrew, system and warn if none succeeds.
  - pyproj: Attempt to use the shared libraries bundled with pyproj. This aligns transformations with those performed by geopandas and is the preset tried first.
  - conda: Attempt to load libproj and data files installed via conda install proj.
  - homebrew: Attempt to load libproj and data files installed via brew install proj. Note that the Homebrew install also includes proj-data grid files and may be able to perform more accurate transforms by default/without network capability.
  - system: Attempt to load libproj from a directory already on LD_LIBRARY_PATH (Linux), DYLD_LIBRARY_PATH (macOS), or PATH (Windows). This should find the version of PROJ installed by a Linux system package manager.
- shared_library (Union[str, Path], default: None) – Path to a PROJ shared library.
- database_path (Union[str, Path], default: None) – Path to the PROJ database (proj.db).
- search_path (Union[str, Path], default: None) – Path to the directory containing PROJ data files.
- verbose (bool, default: False) – If True, print information about the configuration process.
Examples:
>>> sedona.db.configure_proj("auto")
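When no preset fits, explicit paths can be supplied instead; a minimal sketch with hypothetical paths:
>>> sedona.db.configure_proj(
...     shared_library="/opt/proj/lib/libproj.dylib",
...     database_path="/opt/proj/share/proj/proj.db",
...     search_path="/opt/proj/share/proj",
...     verbose=True,
... )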
sedonadb.dataframe ¶
DataFrame ¶
Representation of a (lazy) collection of columns
This object is usually constructed from a SedonaContext by importing an object, reading a file, or executing SQL.
schema property ¶
Return the column names and data types
Examples:
>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT 1 as one")
>>> df.schema
SedonaSchema with 1 field:
one: non-nullable int64<Int64>
>>> df.schema.field(0)
SedonaField one: non-nullable int64<Int64>
>>> df.schema.field(0).name, df.schema.field(0).type
('one', SedonaType int64<Int64>)
__arrow_c_schema__ ¶
__arrow_c_schema__()
ArrowSchema PyCapsule interface
Returns a PyCapsule wrapping an Arrow C Schema for interoperability with libraries that understand Arrow C data types. See the Arrow PyCapsule interface for more details.
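For example, a sufficiently recent pyarrow can consume this capsule directly via pa.schema(); a sketch (assumes a pyarrow version with PyCapsule interface support):
>>> import pyarrow as pa
>>> sd = sedona.db.connect()
>>> pa.schema(sd.sql("SELECT 1 as one"))
one: int64 not null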
__arrow_c_stream__ ¶
__arrow_c_stream__(requested_schema: Any = None)
ArrowArrayStream PyCapsule interface
Returns a PyCapsule wrapping an Arrow C ArrayStream for interoperability with libraries that understand Arrow C data types. See the Arrow PyCapsule interface for more details.
Parameters:
- requested_schema (Any, default: None) – A PyCapsule representing the desired output schema.
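Likewise, the stream capsule lets Arrow-aware libraries collect the whole DataFrame; a sketch using pa.table() (assumes a pyarrow version with PyCapsule interface support):
>>> import pyarrow as pa
>>> sd = sedona.db.connect()
>>> pa.table(sd.sql("SELECT 1 as one"))
pyarrow.Table
one: int64 not null
----
one: [[1]]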
count ¶
count() -> int
Compute the number of rows in this DataFrame
Examples:
>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES ('one'), ('two'), ('three')) AS t(val)")
>>> df.count()
3
execute ¶
execute() -> None
Execute the plan represented by this DataFrame
This will execute the query without collecting results into memory, which is useful for executing SQL statements like SET, CREATE VIEW, and CREATE EXTERNAL TABLE.
Note that this is functionally similar to .count(), except it does not apply any optimizations (e.g., it does not use statistics to avoid reading data to calculate a count).
Examples:
>>> sd = sedona.db.connect()
>>> sd.sql("CREATE OR REPLACE VIEW temp_view AS SELECT 1 as one").execute()
0
>>> sd.view("temp_view").show()
┌───────┐
│ one │
│ int64 │
╞═══════╡
│ 1 │
└───────┘
explain ¶
Return the execution plan for this DataFrame as a DataFrame
Retrieves the logical and physical execution plans that will be used to compute this DataFrame. This is useful for understanding query performance and optimization.
Parameters:
- type (str, default: 'standard') – The type of explain plan to generate. Supported values are: "standard" (default) - shows logical and physical plans; "extended" - includes additional query optimization details; "analyze" - executes the plan and reports actual metrics.
- format (str, default: 'indent') – The format to use for displaying the plan. Supported formats are "indent" (default), "tree", "pgjson", and "graphviz".
Returns:
- DataFrame – A DataFrame containing the execution plan information with columns 'plan_type' and 'plan'.
Examples:
>>> import sedonadb
>>> con = sedonadb.connect()
>>> df = con.sql("SELECT 1 as one")
>>> df.explain().show()
┌───────────────┬─────────────────────────────────┐
│ plan_type ┆ plan │
│ utf8 ┆ utf8 │
╞═══════════════╪═════════════════════════════════╡
│ logical_plan ┆ Projection: Int64(1) AS one │
│ ┆ EmptyRelation │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ physical_plan ┆ ProjectionExec: expr=[1 as one] │
│ ┆ PlaceholderRowExec │
│ ┆ │
└───────────────┴─────────────────────────────────┘
head ¶
Limit result to the first n rows
Note that this is non-deterministic for many queries.
Parameters:
- n (int, default: 5) – The number of rows to return.
Examples:
>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES ('one'), ('two'), ('three')) AS t(val)")
>>> df.head(1).show()
┌──────┐
│ val │
│ utf8 │
╞══════╡
│ one │
└──────┘
limit ¶
Limit result to n rows starting at offset
Note that this is non-deterministic for many queries.
Parameters:
- n (Optional[int]) – The number of rows to return.
- offset (int, default: 0) – The number of rows to skip (optional).
Examples:
>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES ('one'), ('two'), ('three')) AS t(val)")
>>> df.limit(1).show()
┌──────┐
│ val │
│ utf8 │
╞══════╡
│ one │
└──────┘
>>> df.limit(1, offset=2).show()
┌───────┐
│ val │
│ utf8 │
╞═══════╡
│ three │
└───────┘
show ¶
Print the first limit rows to the console
Parameters:
- limit (Optional[int], default: 10) – The number of rows to display. Using None will display the entire table, which may result in very large output.
- width (Optional[int], default: None) – The number of characters to use to display the output. If None, uses Options.width or detects the value from the current terminal if available. The default width is 100 characters if a width is not set by another mechanism.
- ascii (bool, default: False) – Use True to disable UTF-8 characters in the output.
Examples:
>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geometry").show()
┌────────────┐
│ geometry │
│ geometry │
╞════════════╡
│ POINT(0 1) │
└────────────┘
to_arrow_table ¶
to_arrow_table(schema: Any = None) -> Table
Execute and collect results as a PyArrow Table
Executes the logical plan represented by this object and returns a PyArrow Table. This requires that pyarrow is installed.
Parameters:
- schema (Any, default: None) – The requested output schema, or None to use the inferred schema.
Examples:
>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geometry").to_arrow_table()
pyarrow.Table
geometry: extension<geoarrow.wkb<WkbType>> not null
----
geometry: [[01010000000000000000000000000000000000F03F]]
to_memtable ¶
to_memtable() -> DataFrame
Collect a data frame into a memtable
Executes the logical plan represented by this object and returns a DataFrame representing it.
Does not guarantee ordering of rows. Use to_arrow_table() if ordering is needed.
Examples:
>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geom").to_memtable().show()
┌────────────┐
│ geom │
│ geometry │
╞════════════╡
│ POINT(0 1) │
└────────────┘
to_pandas ¶
to_pandas(geometry: Optional[str] = None) -> Union[DataFrame, GeoDataFrame]
Execute and collect results as a pandas DataFrame or GeoDataFrame
If this data frame contains geometry columns, collect results as a single geopandas.GeoDataFrame. Otherwise, collect results as a pandas.DataFrame.
Parameters:
- geometry (Optional[str], default: None) – If specified, the name of the column to use for the default geometry column. If not specified, this is inferred as the column named "geometry", the column named "geography", or the first column with a spatial data type (in that order).
Examples:
>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geometry").to_pandas()
geometry
0 POINT (0 1)
to_parquet ¶
to_parquet(
path: Union[str, Path],
*,
partition_by: Optional[Union[str, Iterable[str]]] = None,
sort_by: Optional[Union[str, Iterable[str]]] = None,
single_file_output: Optional[bool] = None,
)
Write this DataFrame to one or more (Geo)Parquet files
For input that contains geometry columns, GeoParquet metadata is written such that suitable readers can recreate Geometry/Geography types when reading the output.
Parameters:
- path (Union[str, Path]) – A filename or directory to which Parquet file(s) should be written.
- partition_by (Optional[Union[str, Iterable[str]]], default: None) – A vector of column names to partition by. If non-empty, applies hive-style partitioning to the output.
- sort_by (Optional[Union[str, Iterable[str]]], default: None) – A vector of column names to sort by. Currently only ascending sort is supported.
- single_file_output (Optional[bool], default: None) – Use True or False to force writing a single Parquet file vs. writing one file per partition to a directory. By default, a single file is written if partition_by is unspecified and path ends with .parquet.
Examples:
>>> import tempfile
>>> sd = sedona.db.connect()
>>> td = tempfile.TemporaryDirectory()
>>> url = "https://github.com/apache/sedona-testing/raw/refs/heads/main/data/parquet/geoparquet-1.1.0.parquet"
>>> sd.read_parquet(url).to_parquet(f"{td.name}/tmp.parquet")
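Passing partition_by writes hive-style partitioned output to a directory; a minimal sketch:
>>> import tempfile
>>> sd = sedona.db.connect()
>>> td = tempfile.TemporaryDirectory()
>>> df = sd.sql("SELECT * FROM (VALUES ('a', 1), ('b', 2)) AS t(grp, val)")
>>> df.to_parquet(td.name, partition_by="grp")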
to_view ¶
Create a view based on the query represented by this object
Registers this logical plan as a named view with the underlying context such that it can be referred to in SQL.
Parameters:
- name (str) – The name under which this query should be registered.
- overwrite (bool, default: False) – Use True to overwrite an existing view of this name.
Examples:
>>> sd = sedona.db.connect()
>>> sd.sql("SELECT ST_Point(0, 1) as geom").to_view("foofy")
>>> sd.view("foofy").show()
┌────────────┐
│ geom │
│ geometry │
╞════════════╡
│ POINT(0 1) │
└────────────┘
sedonadb.testing ¶
DBEngine ¶
Engine-agnostic catalog and SQL engine
Represents a connection to an engine, abstracting the details of registering a few common types of inputs and generating a few common types of outputs. This is intended for general testing and benchmarking usage and should not be used for anything other than that purpose. Notably, generated SQL is not hardened against injection and table creators always drop any existing table of that name.
assert_query_result ¶
Assert a SQL query result matches an expected target
A wrapper around execute_and_collect() and assert_result() that captures the most common usage of the DBEngine.
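A hypothetical sketch of that usage (the choice of the PostGIS engine and the exact argument form are assumptions based on the methods documented below):
>>> from sedonadb.testing import PostGIS
>>> eng = PostGIS.create_or_skip()
>>> eng.assert_query_result("SELECT 1", "1")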
assert_result ¶
assert_result(result, expected, **kwargs) -> DBEngine
Assert a result against an expected target
Supported expected targets include:
- A pyarrow.Table (compared using ==)
- A geopandas.GeoDataFrame (compared using geopandas.testing)
- A pandas.DataFrame (for non-spatial results; compared using pandas.testing)
- A list of tuples where all values have been converted to strings. For geometry results, these strings are converted to WKT using geoarrow.pyarrow (which ensures a consistent WKT output format).
- A tuple of strings as the string output of a single row
- A string as the string output of a single column of a single row
- A bool for a single boolean value
- An int or float for single numeric values (optionally with a numeric_epsilon)
- bytes for single binary values
Using Arrow table equality is the most strict (it ensures exact type equality and byte-for-byte value equality); however, string output is most useful for checking logical value equality among engines. GeoPandas/Pandas expected targets generate the most useful assertion failures and are probably the best option for general usage.
create_or_skip classmethod ¶
create_or_skip(*args, **kwargs) -> DBEngine
Create this engine or call pytest.skip()
This is the constructor that should be used in tests to ensure that integration-style tests don't cause failures for contributors working on Python-only behaviour.
If SEDONADB_PYTHON_NO_SKIP_TESTS is set, this function will never skip to avoid accidentally skipping tests on CI.
create_table_arrow ¶
create_table_arrow(name, obj) -> DBEngine
Copy an Arrow readable into an engine's native table format
create_table_pandas ¶
create_table_pandas(name, obj) -> DBEngine
Copy a GeoPandas or Pandas table into an engine's native table format
create_table_parquet ¶
create_table_parquet(name, paths) -> DBEngine
Scan one or more Parquet files and bring them into the engine's native table format
This is needed for engines that can't lazily scan Parquet (e.g., PostGIS) or engines that have an optimized internal format (e.g., DuckDB). The ability of engines to push down a scan into their own table format is variable.
create_view_parquet ¶
create_view_parquet(name, paths) -> DBEngine
Create a named view of Parquet files without scanning them
This is usually the best option for a benchmark if both engines support pushing down a spatial filter into the Parquet files in question. This is not supported by the PostGIS engine.
execute_and_collect ¶
execute_and_collect(query)
Execute a query and collect results to the driver
The output type here is engine-specific (use other methods to resolve the result into concrete output formats). Current engines typically collect results as Arrow; however, result_to_table() is required to guarantee that geometry results are encoded as GeoArrow.
This is typically the execution step that should be benchmarked (although the end-to-end time that includes data loading can also be a useful number for some result types).
install_hint classmethod ¶
install_hint() -> str
A short install hint printed when skipping tests due to failed construction
name classmethod ¶
name() -> str
This engine's name
A short string used to identify this engine in error messages and work around differences in behaviour.
result_to_pandas ¶
result_to_pandas(result) -> DataFrame
Convert a query result into a pandas.DataFrame or geopandas.GeoDataFrame
result_to_tuples ¶
Convert a query result into row tuples
This option strips away fine-grained type information but is helpful for generally asserting a query result or verifying results between engines that have (e.g.) differing integer handling.
PostGIS ¶
A DBEngine implementation backed by a PostGIS connection
sedonadb.dbapi ¶
connect ¶
Connect to Sedona via Python DBAPI
Creates a DBAPI-compatible connection as a thin wrapper around the ADBC Python driver manager's DBAPI compatibility layer. Support for DBAPI is experimental.
Parameters:
- kwargs (Mapping[str, Any], default: {}) – Extra keyword arguments passed to adbc_driver_manager.dbapi.Connection().
Examples:
>>> con = sedona.dbapi.connect()
>>> with con.cursor() as cur:
... cur.execute("SELECT 1 as one")
... cur.fetchall()
[(1,)]