Raster data in GeoTiff and ArcInfoAsciiGrid formats can be read into Spark.
Using the RasterUDT
Raster data in GeoTiff and ArcInfoAsciiGrid formats can be loaded directly into Spark using sparklyr::spark_read_binary and the Sedona constructors RS_FromGeoTiff and RS_FromArcInfoAsciiGrid.
library(dplyr)
library(sparklyr)
library(apache.sedona)
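# Start a local Spark session
sc <- spark_connect(master = "local")
# Read every file in the folder as binary; the raw bytes land in the `content` column
data_tbl <- spark_read_binary(sc, dir = here::here("/../core/src/test/resources/raster/"), name = "data")
# Build a raster column from the raw GeoTiff bytes
raster <-
  data_tbl %>%
  mutate(raster = RS_FromGeoTiff(content))
raster
raster %>% sdf_schema()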
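ArcInfo ASCII grids can presumably be loaded the same way by swapping the constructor. The following is a minimal sketch, not taken from the original example: it assumes the `sc` connection opened above, and `path/to/asc_grids/` is a hypothetical folder containing only `.asc` files.

```r
library(dplyr)
library(sparklyr)
library(apache.sedona)

# Assumption: `sc` is the Spark connection created earlier.
# Assumption: "path/to/asc_grids/" is a placeholder folder of ArcInfo ASCII grid files.
asc_tbl <- spark_read_binary(sc, dir = "path/to/asc_grids/", name = "asc_data")

# Build a raster column from the raw bytes, this time with the ASCII grid constructor
asc_raster <-
  asc_tbl %>%
  mutate(raster = RS_FromArcInfoAsciiGrid(content))

asc_raster %>% sdf_schema()
```

Keeping GeoTiff and ASCII grid files in separate folders avoids passing bytes of one format to the constructor of the other.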
Once the data is loaded, raster functions are available in dplyr workflows. Functions that take a raster: Raster argument, such as RS_Value, RS_Values, and RS_Envelope, are meant to be used with data loaded with this reader. Functions that take a Band: Array[Double] argument work with data loaded using the Sedona Geotiff DataFrame loader (see below).
For example, getting the number of bands:
raster %>%
  mutate(
    nbands = RS_NumBands(raster)
  ) %>%
  select(path, nbands) %>%
  collect() %>%
  mutate(path = path %>% basename())
Or getting the envelope:
raster %>%
  mutate(
    env = RS_Envelope(raster) %>% st_astext()
  ) %>%
  select(path, env) %>%
  collect() %>%
  mutate(path = path %>% basename())
Or getting values at specific points:
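A point lookup might look like the sketch below. It assumes the `raster` table built above, that `RS_Value` accepts a raster and a point geometry in this Sedona version, and that the point is expressed in the raster's coordinate reference system. The coordinates are illustrative placeholders, not values confirmed to fall inside the test rasters.

```r
library(dplyr)
library(sparklyr)
library(apache.sedona)

# Assumption: `raster` is the Spark table with a `raster` column created earlier.
# Assumption: the coordinates below are placeholders in the raster's CRS.
point_vals <-
  raster %>%
  mutate(
    val = RS_Value(raster, ST_Point(-13077301.685, 4002565.802))
  ) %>%
  select(path, val) %>%
  collect() %>%
  mutate(path = path %>% basename())

point_vals
```

Rasters that do not cover the point should be expected to return a missing value rather than a number.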
Using the Sedona Geotiff Dataframe Loader
The Sedona Geotiff Dataframe Loader reads data from a GeoTiff file (or a folder containing multiple files) into a Spark DataFrame. The resulting data is stored in a single nested column. It can be unnested using SQL (results are collected)…:
data_tbl <- spark_read_geotiff(sc, path = here::here("../core/src/test/resources/raster/"), name = "data", options = list(dropInvalid = TRUE))
data_tbl
## Using a direct SQL query: results are collected directly
sc %>%
  DBI::dbGetQuery("SELECT
    image.geometry as Geom,
    image.height as height,
    image.width as width,
    image.nBands as bands
    FROM data")
… or using {sparklyr.nested} (results stay in Spark until collection):
library(sparklyr.nested)
data_tbl %>% sdf_schema_json(parse_json = TRUE) %>% lobstr::tree()
data_tbl %>%
  sdf_unnest(image) %>%
  glimpse()
res <-
  data_tbl %>%
  sdf_unnest(image) %>%
  mutate(
    mult = RS_MultiplyFactor(data, 2L)
  ) %>%
  select(data, mult) %>%
  collect()
res$data[[1]][750:760]
res$mult[[1]][750:760]
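Instead of comparing the two slices by eye, the collected columns can be checked locally. The helper below is a hypothetical convenience function, not part of apache.sedona; it assumes each element of the collected columns is a numeric vector or a list of numbers, and is demonstrated on toy values.

```r
# Hypothetical helper: TRUE when `scaled` equals `original * factor` element-wise.
# unlist() covers the case where array columns are collected as lists of numbers.
check_scaled <- function(original, scaled, factor) {
  isTRUE(all.equal(unlist(scaled), unlist(original) * factor))
}

# Toy stand-ins for one row of the `data` and `mult` columns
toy_data <- c(1, 2.5, 0)
toy_mult <- c(2, 5, 0)

check_scaled(toy_data, toy_mult, 2)
```

With the collected result from above, the equivalent call would be `check_scaled(res$data[[1]], res$mult[[1]], 2)`.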
Writing data back: