GeoPandas API for Apache Sedona¶
The GeoPandas API for Apache Sedona provides a familiar GeoPandas interface that scales your geospatial analysis beyond single-node limitations. This API combines the intuitive GeoPandas DataFrame syntax with the distributed processing power of Apache Sedona on Apache Spark, enabling you to work with planetary-scale datasets using the same code patterns you already know.
Overview¶
What is the GeoPandas API for Apache Sedona?¶
The GeoPandas API for Apache Sedona is a compatibility layer that allows you to use GeoPandas-style operations on distributed geospatial data. Instead of being limited to single-node processing, your GeoPandas code can leverage the full power of Apache Spark clusters for large-scale geospatial analysis.
Key Benefits¶
- Familiar API: Use the same GeoPandas syntax and methods you're already familiar with
- Distributed Processing: Scale beyond single-node limitations to handle large datasets
- Lazy Evaluation: Benefit from Apache Sedona's query optimization and lazy execution
- Performance: Leverage distributed computing for complex geospatial operations
- Seamless Migration: Minimal code changes required to migrate existing GeoPandas workflows
Setup¶
The GeoPandas API for Apache Sedona automatically handles SparkSession management through PySpark's pandas-on-Spark integration. You have three options for setup:
Option 1: Automatic SparkSession (Recommended)¶
The GeoPandas API automatically uses the default SparkSession from PySpark:
from sedona.spark.geopandas import GeoDataFrame, read_parquet
# No explicit SparkSession setup needed - uses default session
# The API automatically handles Sedona context initialization
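As a minimal sketch of this mode (assuming the GeoDataFrame constructor mirrors GeoPandas, which is the compatibility layer's stated goal), the first operation you run creates the default SparkSession behind the scenes:
from shapely.geometry import Point
from sedona.spark.geopandas import GeoDataFrame
# No SparkSession setup: constructing and inspecting a GeoDataFrame
# starts the default session automatically
gdf = GeoDataFrame({"name": ["a", "b"]}, geometry=[Point(0, 0), Point(1, 1)])
print(gdf.head())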
Option 2: Manual SparkSession Setup¶
If you need to configure a custom SparkSession or are working in an environment where you need explicit control:
from sedona.spark.geopandas import GeoDataFrame, read_parquet
from sedona.spark import SedonaContext
# Create and configure SparkSession
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
# The GeoPandas API will use this configured session
Option 3: Using Existing SparkSession¶
If you already have a SparkSession (e.g., in Databricks, EMR, or other managed environments):
from sedona.spark.geopandas import GeoDataFrame, read_parquet
from sedona.spark import SedonaContext
# Use existing SparkSession (e.g., 'spark' in Databricks)
sedona = SedonaContext.create(spark) # 'spark' is the existing session
How SparkSession Management Works¶
The GeoPandas API leverages PySpark's pandas-on-Spark functionality, which automatically manages the SparkSession lifecycle:
- Default Session: When you import `sedona.spark.geopandas`, it automatically uses PySpark's default session via `pyspark.pandas.utils.default_session()`
- Automatic Sedona Registration: The API automatically registers Sedona's spatial functions and optimizations with the SparkSession when needed
- Transparent Integration: All GeoPandas operations are translated to Spark SQL operations under the hood, using the configured SparkSession
- No Manual Context Management: Unlike traditional Sedona usage, you don't need to explicitly call `SedonaContext.create()` unless you need custom configuration
This design makes the API more user-friendly by hiding the complexity of SparkSession management while still providing the full power of distributed processing.
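If you need a handle on that session, for example to inspect its configuration, PySpark exposes it directly; a small sketch:
from pyspark.pandas.utils import default_session
# Retrieve the SparkSession that GeoPandas operations run on
spark = default_session()
print(spark.version)
print(spark.sparkContext.appName)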
S3 Configuration¶
When working with S3 data, the GeoPandas API uses Spark's built-in S3 support rather than external libraries like s3fs. Configure anonymous access to public S3 buckets using Spark configuration:
from sedona.spark import SedonaContext
# For anonymous access to public S3 buckets
config = (
SedonaContext.builder()
.config(
"spark.hadoop.fs.s3a.bucket.bucket-name.aws.credentials.provider",
"org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
)
.getOrCreate()
)
sedona = SedonaContext.create(config)
For authenticated S3 access, use appropriate AWS credential providers:
# For IAM roles (recommended for EC2/EMR)
config = (
SedonaContext.builder()
.config(
"spark.hadoop.fs.s3a.aws.credentials.provider",
"com.amazonaws.auth.InstanceProfileCredentialsProvider",
)
.getOrCreate()
)
# For access keys (not recommended for production)
config = (
SedonaContext.builder()
.config("spark.hadoop.fs.s3a.access.key", "your-access-key")
.config("spark.hadoop.fs.s3a.secret.key", "your-secret-key")
.getOrCreate()
)
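If you would rather defer to the standard AWS credential chain (environment variables, the shared credentials file, or an instance profile), hadoop-aws bundles a provider for that as well; a sketch:
# For the standard AWS credential chain
config = (
    SedonaContext.builder()
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
    )
    .getOrCreate()
)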
Basic Usage¶
Importing the API¶
Instead of importing GeoPandas directly, import from the Sedona GeoPandas module:
# Traditional GeoPandas import
# import geopandas as gpd
# Sedona GeoPandas API import
import sedona.spark.geopandas as gpd
# or
from sedona.spark.geopandas import GeoDataFrame, read_parquet
Reading Data¶
The API supports reading from various geospatial formats, including Parquet files from cloud storage. For S3 access with anonymous credentials, configure Spark to use anonymous AWS credentials:
from sedona.spark import SedonaContext
# Configure Spark for anonymous S3 access
config = (
SedonaContext.builder()
.config(
"spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider",
"org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
)
.getOrCreate()
)
sedona = SedonaContext.create(config)
# Load GeoParquet file directly from S3
s3_path = "s3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet"
nyc_buildings = gpd.read_parquet(s3_path)
# Display basic information
print(f"Dataset shape: {nyc_buildings.shape}")
print(f"Columns: {nyc_buildings.columns.tolist()}")
nyc_buildings.head()
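Reading and writing are symmetric: `read_file()` covers other geospatial formats and `to_parquet()` writes results back out. A short sketch with hypothetical paths:
# Read other geospatial formats (e.g., GeoJSON, Shapefile)
roads = gpd.read_file("roads.geojson")  # hypothetical local file
# Write a (possibly filtered) result back to GeoParquet
nyc_buildings.to_parquet("s3://your-bucket/nyc_buildings_out/")  # hypothetical bucket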
Spatial Filtering¶
Use spatial indexing and filtering methods. Note that the `cx` coordinate indexer is not yet implemented in the current version, so filter with spatial predicates instead:
from shapely.geometry import box
# Define bounding box for Central Park
central_park_bbox = box(
    -73.973, 40.764,  # bottom-left corner (longitude, latitude)
    -73.951, 40.789,  # top-right corner (longitude, latitude)
)
# Filter buildings within the bounding box using spatial index
# Note: This requires collecting data to driver for spatial filtering
# For large datasets, consider using spatial joins instead
buildings_sample = nyc_buildings.sample(frac=0.01)  # sample for demonstration (pandas-on-Spark supports frac=, not n=)
central_park_buildings = buildings_sample[
buildings_sample.geometry.intersects(central_park_bbox)
]
# Display results
print(
central_park_buildings[["BUILD_ID", "PROP_ADDR", "height_val", "geometry"]].head()
)
Alternative approach for large datasets using spatial joins:
# Create a GeoDataFrame with the bounding box
bbox_gdf = gpd.GeoDataFrame({"id": [1]}, geometry=[central_park_bbox], crs="EPSG:4326")
# Use spatial join to filter buildings within the bounding box
central_park_buildings = nyc_buildings.sjoin(bbox_gdf, predicate="intersects")
Advanced Operations¶
Spatial Joins¶
Perform spatial joins using the same syntax as GeoPandas:
# Load two datasets
left_df = gpd.read_parquet("s3://bucket/left_data.parquet")
right_df = gpd.read_parquet("s3://bucket/right_data.parquet")
# Spatial join with distance predicate
result = left_df.sjoin(right_df, predicate="dwithin", distance=50)
# Other spatial predicates
intersects_result = left_df.sjoin(right_df, predicate="intersects")
contains_result = left_df.sjoin(right_df, predicate="contains")
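If the API follows the full GeoPandas `sjoin()` signature, the `how` parameter ("inner", "left", "right") should also control the join type; treat this as an assumption to verify against the API reference:
# Keep every left row, with nulls where nothing matches
# (assumes how= is supported, as in upstream GeoPandas)
left_all = left_df.sjoin(right_df, how="left", predicate="intersects")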
Coordinate Reference System Operations¶
Transform geometries between different coordinate reference systems:
# Set initial CRS
buildings = gpd.read_parquet("buildings.parquet")
buildings = buildings.set_crs("EPSG:4326")
# Transform to a projected CRS for area calculations
# (EPSG:3857 distorts areas away from the equator; see the UTM sketch below)
buildings_projected = buildings.to_crs("EPSG:3857")
# Calculate areas
buildings_projected["area"] = buildings_projected.geometry.area
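Because Web Mercator inflates areas away from the equator, a local UTM zone gives more accurate measurements. `estimate_utm_crs()` appears in the supported-operations list below; a sketch:
# Project to the estimated local UTM zone for accurate measurements
utm_crs = buildings.geometry.estimate_utm_crs()
buildings_utm = buildings.to_crs(utm_crs)
buildings_utm["area_m2"] = buildings_utm.geometry.area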
Geometric Operations¶
Apply geometric transformations and analysis:
# Buffer operations
buffered = buildings.geometry.buffer(100) # 100 meter buffer
# Geometric properties
buildings["is_valid"] = buildings.geometry.is_valid
buildings["is_simple"] = buildings.geometry.is_simple
buildings["bounds"] = buildings.geometry.bounds
# Distance calculations
from shapely.geometry import Point
reference_point = Point(-73.9857, 40.7484) # Times Square
buildings["distance_to_times_square"] = buildings.geometry.distance(reference_point)
# Area and length calculations (requires projected CRS)
buildings_projected = buildings.to_crs("EPSG:3857") # Web Mercator
buildings_projected["area"] = buildings_projected.geometry.area
buildings_projected["perimeter"] = buildings_projected.geometry.length
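Other per-geometry properties from the supported-operations list work the same way, for example:
# Derived geometries
buildings["centroid"] = buildings.geometry.centroid
buildings["envelope"] = buildings.geometry.envelope
# Dataset-wide extent as (minx, miny, maxx, maxy)
print(buildings.total_bounds)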
Performance Considerations¶
Use Traditional GeoPandas when:¶
- Working with small datasets (< 1GB)
- Simple operations on local data
- Complete coverage of the GeoPandas API is required
- Single-node processing is sufficient
Use GeoPandas API for Apache Sedona when:¶
- Working with large datasets (> 1GB)
- Complex geospatial analyses
- Distributed processing is needed
- Data is stored in cloud storage (S3, HDFS, etc.)
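The two modes also combine well: run the heavy filtering and joining distributed, then hand the (now small) result to traditional GeoPandas via the supported to_geopandas() for plotting or local post-processing. A sketch:
# Distributed filter first, then materialize the small result locally
buildings = gpd.read_parquet("s3://bucket/left_data.parquet")
local = buildings[buildings.geometry.area > 1000].to_geopandas()
local.plot()  # now a regular geopandas.GeoDataFrame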
Supported Operations¶
The GeoPandas API for Apache Sedona has implemented 39 GeoSeries functions and 10 GeoDataFrame functions, covering the most commonly used GeoPandas operations:
Data I/O¶
- `read_parquet()` - Read GeoParquet files
- `read_file()` - Read various geospatial formats
- `to_parquet()` - Write to Parquet format
Spatial Operations¶
- `sjoin()` - Spatial joins with various predicates
- `buffer()` - Geometric buffering
- `distance()` - Distance calculations
- `intersects()`, `contains()`, `within()` - Spatial predicates
- `intersection()` - Geometric intersection
- `make_valid()` - Geometry validation and repair
- `sindex` - Spatial indexing (limited functionality)
CRS Operations¶
- `set_crs()` - Set the coordinate reference system
- `to_crs()` - Transform between CRSs
- `crs` - Access CRS information
Geometric Properties¶
- `area`, `length`, `bounds` - Geometric measurements
- `is_valid`, `is_simple`, `is_empty` - Geometric validation
- `centroid`, `envelope`, `boundary` - Geometric properties
- `x`, `y`, `z`, `has_z` - Coordinate access
- `total_bounds`, `estimate_utm_crs` - Bounds and CRS utilities
Data Conversion¶
- `to_geopandas()` - Convert to traditional GeoPandas
- `to_wkb()`, `to_wkt()` - Convert to WKB/WKT formats
- `from_xy()` - Create geometries from coordinates
- `geom_type` - Get geometry types
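A short conversion sketch, assuming `GeoSeries` mirrors the upstream GeoPandas signatures:
# Build geometries from coordinate columns, then round-trip through WKT
points = gpd.GeoSeries.from_xy([-73.99, -73.97], [40.73, 40.76], crs="EPSG:4326")
print(points.to_wkt())
print(points.geom_type)
local = points.to_geopandas()  # materialize as a plain geopandas GeoSeries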
Complete Workflow Example¶
import sedona.spark.geopandas as gpd
from sedona.spark import SedonaContext
# Configure Spark for anonymous S3 access
config = (
SedonaContext.builder()
.config(
"spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider",
"org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
)
.getOrCreate()
)
sedona = SedonaContext.create(config)
# Load data
DATA_DIR = "s3://wherobots-examples/data/geopandas_blog/"
overture_size = "1M"
postal_codes_path = DATA_DIR + "postal-code/"
overture_path = DATA_DIR + overture_size + "/" + "overture-buildings/"
postal_codes = gpd.read_parquet(postal_codes_path)
buildings = gpd.read_parquet(overture_path)
# Spatial analysis
buildings = buildings.set_crs("EPSG:4326")
buildings_projected = buildings.to_crs("EPSG:3857")
# Calculate areas and filter
buildings_projected["area"] = buildings_projected.geometry.area
large_buildings = buildings_projected[buildings_projected["area"] > 1000]
# Ensure both layers share a CRS before joining
# (assuming the postal codes are stored in EPSG:4326)
postal_codes = postal_codes.set_crs("EPSG:4326").to_crs("EPSG:3857")
result = large_buildings.sjoin(postal_codes, predicate="intersects")
# Aggregate by postal code
summary = (
result.groupby("postal_code")
.agg({"area": "sum", "BUILD_ID": "count"})
.rename(columns={"BUILD_ID": "building_count"})
)
print(summary.head())
Resources and Contributing¶
For detailed and up-to-date API documentation, including complete method signatures, parameters, and examples, see:
📚 GeoPandas API Documentation
The GeoPandas API for Apache Sedona is an open-source project. Contributions are welcome through the GitHub issue tracker for reporting bugs, requesting features, or contributing code. For more information on contributing, see the Contributor Guide.