Skip to content

Install Sedona R

Introduction

apache.sedona (cran.r-project.org/package=apache.sedona) is a sparklyr-based R interface for Apache Sedona. It presents what Apache Sedona has to offer through idiomatic frameworks and constructs in R (e.g., one can build spatial Spark SQL queries using Sedona UDFs in conjunction with a wide range of dplyr expressions), hence making Apache Sedona highly friendly for R users.

Generally speaking, when working with Apache Sedona, one choose between the following two modes:

While the former option enables more fine-grained control over low-level implementation details (e.g., which index to build for spatial queries, which data structure to use for spatial partitioning, etc), the latter is simpler and leads to a straightforward integration with dplyr, sparklyr, and other sparklyr extensions (e.g., one can build ML feature extractors with Sedona UDFs and connect them with ML pipelines using ml_*() family of functions in sparklyr, hence creating ML workflows capable of understanding spatial data).

Because data from spatial RDDs can be imported into Spark dataframes as geometry columns and vice versa, one can switch between the abovementioned two modes fairly easily.

At the moment apache.sedona consists of the following components:

  • R interface for Spatial-RDD-related functionalities
    • Reading/writing spatial data in WKT, WKB, and GeoJSON formats
    • Shapefile reader
    • Spatial partition, index, join, KNN query, and range query operations
    • Visualization routines
  • dplyr-integration for Sedona spatial UDTs and UDFs
    • See SQL APIs for the list of available UDFs
  • Functions importing data from spatial RDDs to Spark dataframes and vice versa

Connect to Spark

To ensure Sedona serialization routines, UDTs, and UDFs are properly registered when creating a Spark session, one simply needs to attach apache.sedona before instantiating a Spark conneciton. apache.sedona will take care of the rest. For example,

library(sparklyr)
library(apache.sedona)

spark_home <- "/usr/lib/spark"  # NOTE: replace this with your $SPARK_HOME directory
sc <- spark_connect(master = "yarn", spark_home = spark_home)

will create a Sedona-capable Spark connection in YARN client mode, and

library(sparklyr)
library(apache.sedona)

sc <- spark_connect(master = "local")

will create a Sedona-capable Spark connection to an Apache Spark instance running locally.

In sparklyr, one can easily inspect the Spark connection object to sanity-check it has been properly initialized with all Sedona-related dependencies, e.g.,

print(sc$extensions$packages)
## [1] "org.apache.sedona:sedona-core-3.0_2.12:1.2.1-incubating"
## [2] "org.apache.sedona:sedona-sql-3.0_2.12:1.2.1-incubating"
## [3] "org.apache.sedona:sedona-viz-3.0_2.12:1.2.1-incubating"
## [4] "org.datasyslab:geotools-wrapper:1.3.0-27.2"
## [5] "org.datasyslab:sernetcdf:0.1.0"
## [6] "org.locationtech.jts:jts-core:1.18.0"
## [7] "org.wololo:jts2geojson:0.14.3"

and

spark_session(sc) %>%
  invoke("%>%", list("conf"), list("get", "spark.kryo.registrator")) %>%
  print()
## [1] "org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator"

For more information about connecting to Spark with sparklyr, see https://therinspark.com/connections.html and ?sparklyr::spark_connect. Also see Initiate Spark Context and Initiate Spark Session for minimum and recommended dependencies for Apache Sedona.


Last update: November 23, 2021 06:53:29