Spatially aggregate airports per country
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
In [1]:
Copied!
import os
import geopandas as gpd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, when, explode, hex
from sedona.spark import *
from utilities import getConfig
import os
import geopandas as gpd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, when, explode, hex
from sedona.spark import *
from utilities import getConfig
Setup Sedona environment¶
In [2]:
Copied!
config = (
SedonaContext.builder()
.config(
"spark.jars.packages",
"org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.6.0,"
"org.datasyslab:geotools-wrapper:1.6.0-28.2,"
"uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.4",
)
.getOrCreate()
)
sedona = SedonaContext.create(config)
sc = sedona.sparkContext
sc.setSystemProperty("sedona.global.charset", "utf8")
config = (
SedonaContext.builder()
.config(
"spark.jars.packages",
"org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.6.0,"
"org.datasyslab:geotools-wrapper:1.6.0-28.2,"
"uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.4",
)
.getOrCreate()
)
sedona = SedonaContext.create(config)
sc = sedona.sparkContext
sc.setSystemProperty("sedona.global.charset", "utf8")
:: loading settings :: url = jar:file:/home/jovyan/spark-3.4.2-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/jovyan/.ivy2/cache The jars for the packages stored in: /home/jovyan/.ivy2/jars org.apache.sedona#sedona-spark-shaded-3.4_2.12 added as a dependency org.datasyslab#geotools-wrapper added as a dependency uk.co.gresearch.spark#spark-extension_2.12 added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-42c276ec-386f-421b-9fd0-00abbab81649;1.0 confs: [default] found org.apache.sedona#sedona-spark-shaded-3.4_2.12;1.6.0 in central found org.datasyslab#geotools-wrapper;1.6.0-28.2 in central found uk.co.gresearch.spark#spark-extension_2.12;2.11.0-3.4 in central found com.github.scopt#scopt_2.12;4.1.0 in central downloading https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.4_2.12/1.6.0/sedona-spark-shaded-3.4_2.12-1.6.0.jar ... [SUCCESSFUL ] org.apache.sedona#sedona-spark-shaded-3.4_2.12;1.6.0!sedona-spark-shaded-3.4_2.12.jar (3668ms) :: resolution report :: resolve 1816ms :: artifacts dl 3681ms :: modules in use: com.github.scopt#scopt_2.12;4.1.0 from central in [default] org.apache.sedona#sedona-spark-shaded-3.4_2.12;1.6.0 from central in [default] org.datasyslab#geotools-wrapper;1.6.0-28.2 from central in [default] uk.co.gresearch.spark#spark-extension_2.12;2.11.0-3.4 from central in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 4 | 1 | 1 | 0 || 4 | 1 | --------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent-42c276ec-386f-421b-9fd0-00abbab81649 confs: [default] 1 artifacts copied, 3 already retrieved (21486kB/106ms) 24/05/22 18:02:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Read countries shapefile into a Sedona DataFrame¶
Data link: https://www.naturalearthdata.com/downloads/50m-cultural-vectors/
In [3]:
Copied!
countries = ShapefileReader.readToGeometryRDD(
sc, "data/ne_50m_admin_0_countries_lakes/"
)
countries_df = Adapter.toDf(countries, sedona)
countries_df.createOrReplaceTempView("country")
countries_df.printSchema()
countries = ShapefileReader.readToGeometryRDD(
sc, "data/ne_50m_admin_0_countries_lakes/"
)
countries_df = Adapter.toDf(countries, sedona)
countries_df.createOrReplaceTempView("country")
countries_df.printSchema()
root |-- geometry: geometry (nullable = true) |-- featurecla: string (nullable = true) |-- scalerank: string (nullable = true) |-- LABELRANK: string (nullable = true) |-- SOVEREIGNT: string (nullable = true) |-- SOV_A3: string (nullable = true) |-- ADM0_DIF: string (nullable = true) |-- LEVEL: string (nullable = true) |-- TYPE: string (nullable = true) |-- ADMIN: string (nullable = true) |-- ADM0_A3: string (nullable = true) |-- GEOU_DIF: string (nullable = true) |-- GEOUNIT: string (nullable = true) |-- GU_A3: string (nullable = true) |-- SU_DIF: string (nullable = true) |-- SUBUNIT: string (nullable = true) |-- SU_A3: string (nullable = true) |-- BRK_DIFF: string (nullable = true) |-- NAME: string (nullable = true) |-- NAME_LONG: string (nullable = true) |-- BRK_A3: string (nullable = true) |-- BRK_NAME: string (nullable = true) |-- BRK_GROUP: string (nullable = true) |-- ABBREV: string (nullable = true) |-- POSTAL: string (nullable = true) |-- FORMAL_EN: string (nullable = true) |-- FORMAL_FR: string (nullable = true) |-- NAME_CIAWF: string (nullable = true) |-- NOTE_ADM0: string (nullable = true) |-- NOTE_BRK: string (nullable = true) |-- NAME_SORT: string (nullable = true) |-- NAME_ALT: string (nullable = true) |-- MAPCOLOR7: string (nullable = true) |-- MAPCOLOR8: string (nullable = true) |-- MAPCOLOR9: string (nullable = true) |-- MAPCOLOR13: string (nullable = true) |-- POP_EST: string (nullable = true) |-- POP_RANK: string (nullable = true) |-- GDP_MD_EST: string (nullable = true) |-- POP_YEAR: string (nullable = true) |-- LASTCENSUS: string (nullable = true) |-- GDP_YEAR: string (nullable = true) |-- ECONOMY: string (nullable = true) |-- INCOME_GRP: string (nullable = true) |-- WIKIPEDIA: string (nullable = true) |-- FIPS_10_: string (nullable = true) |-- ISO_A2: string (nullable = true) |-- ISO_A3: string (nullable = true) |-- ISO_A3_EH: string (nullable = true) |-- ISO_N3: string (nullable = true) |-- UN_A3: string (nullable = true) |-- WB_A2: string (nullable = true) |-- WB_A3: string (nullable = true) |-- WOE_ID: string (nullable = true) |-- WOE_ID_EH: string (nullable = true) |-- WOE_NOTE: string (nullable = true) |-- ADM0_A3_IS: string (nullable = true) |-- ADM0_A3_US: string (nullable = true) |-- ADM0_A3_UN: string (nullable = true) |-- ADM0_A3_WB: string (nullable = true) |-- CONTINENT: string (nullable = true) |-- REGION_UN: string (nullable = true) |-- SUBREGION: string (nullable = true) |-- REGION_WB: string (nullable = true) |-- NAME_LEN: string (nullable = true) |-- LONG_LEN: string (nullable = true) |-- ABBREV_LEN: string (nullable = true) |-- TINY: string (nullable = true) |-- HOMEPART: string (nullable = true) |-- MIN_ZOOM: string (nullable = true) |-- MIN_LABEL: string (nullable = true) |-- MAX_LABEL: string (nullable = true) |-- NE_ID: string (nullable = true) |-- WIKIDATAID: string (nullable = true) |-- NAME_AR: string (nullable = true) |-- NAME_BN: string (nullable = true) |-- NAME_DE: string (nullable = true) |-- NAME_EN: string (nullable = true) |-- NAME_ES: string (nullable = true) |-- NAME_FR: string (nullable = true) |-- NAME_EL: string (nullable = true) |-- NAME_HI: string (nullable = true) |-- NAME_HU: string (nullable = true) |-- NAME_ID: string (nullable = true) |-- NAME_IT: string (nullable = true) |-- NAME_JA: string (nullable = true) |-- NAME_KO: string (nullable = true) |-- NAME_NL: string (nullable = true) |-- NAME_PL: string (nullable = true) |-- NAME_PT: string (nullable = true) |-- NAME_RU: string (nullable = true) |-- NAME_SV: string (nullable = true) |-- NAME_TR: string (nullable = true) |-- NAME_VI: string (nullable = true) |-- NAME_ZH: string (nullable = true)
24/05/22 18:02:51 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Read airports shapefile into a Sedona DataFrame¶
Data link: https://www.naturalearthdata.com/downloads/50m-cultural-vectors/
In [4]:
Copied!
airports = ShapefileReader.readToGeometryRDD(sc, "data/ne_50m_airports/")
airports_df = Adapter.toDf(airports, sedona)
airports_df.createOrReplaceTempView("airport")
airports_df.printSchema()
airports = ShapefileReader.readToGeometryRDD(sc, "data/ne_50m_airports/")
airports_df = Adapter.toDf(airports, sedona)
airports_df.createOrReplaceTempView("airport")
airports_df.printSchema()
root |-- geometry: geometry (nullable = true) |-- scalerank: string (nullable = true) |-- featurecla: string (nullable = true) |-- type: string (nullable = true) |-- name: string (nullable = true) |-- abbrev: string (nullable = true) |-- location: string (nullable = true) |-- gps_code: string (nullable = true) |-- iata_code: string (nullable = true) |-- wikipedia: string (nullable = true) |-- natlscale: string (nullable = true)
Run Spatial Join using SQL API¶
In [5]:
Copied!
result = sedona.sql(
"SELECT c.geometry as country_geom, c.NAME_EN, a.geometry as airport_geom, a.name FROM country c, airport a WHERE ST_Contains(c.geometry, a.geometry)"
)
result = sedona.sql(
"SELECT c.geometry as country_geom, c.NAME_EN, a.geometry as airport_geom, a.name FROM country c, airport a WHERE ST_Contains(c.geometry, a.geometry)"
)
Run Spatial Join using RDD API¶
In [6]:
Copied!
airports_rdd = Adapter.toSpatialRdd(airports_df, "geometry")
# Drop the duplicate name column in countries_df
countries_df = countries_df.drop("NAME")
countries_rdd = Adapter.toSpatialRdd(countries_df, "geometry")
airports_rdd.analyze()
countries_rdd.analyze()
# 4 is the num partitions used in spatial partitioning. This is an optional parameter
airports_rdd.spatialPartitioning(GridType.KDBTREE, 4)
countries_rdd.spatialPartitioning(airports_rdd.getPartitioner())
buildOnSpatialPartitionedRDD = True
usingIndex = True
considerBoundaryIntersection = True
airports_rdd.buildIndex(IndexType.QUADTREE, buildOnSpatialPartitionedRDD)
result_pair_rdd = JoinQueryRaw.SpatialJoinQueryFlat(
airports_rdd, countries_rdd, usingIndex, considerBoundaryIntersection
)
result2 = Adapter.toDf(
result_pair_rdd, countries_rdd.fieldNames, airports.fieldNames, sedona
)
result2.createOrReplaceTempView("join_result_with_all_cols")
# Select the columns needed in the join
result2 = sedona.sql(
"SELECT leftgeometry as country_geom, NAME_EN, rightgeometry as airport_geom, name FROM join_result_with_all_cols"
)
airports_rdd = Adapter.toSpatialRdd(airports_df, "geometry")
# Drop the duplicate name column in countries_df
countries_df = countries_df.drop("NAME")
countries_rdd = Adapter.toSpatialRdd(countries_df, "geometry")
airports_rdd.analyze()
countries_rdd.analyze()
# 4 is the num partitions used in spatial partitioning. This is an optional parameter
airports_rdd.spatialPartitioning(GridType.KDBTREE, 4)
countries_rdd.spatialPartitioning(airports_rdd.getPartitioner())
buildOnSpatialPartitionedRDD = True
usingIndex = True
considerBoundaryIntersection = True
airports_rdd.buildIndex(IndexType.QUADTREE, buildOnSpatialPartitionedRDD)
result_pair_rdd = JoinQueryRaw.SpatialJoinQueryFlat(
airports_rdd, countries_rdd, usingIndex, considerBoundaryIntersection
)
result2 = Adapter.toDf(
result_pair_rdd, countries_rdd.fieldNames, airports.fieldNames, sedona
)
result2.createOrReplaceTempView("join_result_with_all_cols")
# Select the columns needed in the join
result2 = sedona.sql(
"SELECT leftgeometry as country_geom, NAME_EN, rightgeometry as airport_geom, name FROM join_result_with_all_cols"
)
Print spatial join results¶
In [7]:
Copied!
# The result of SQL API
result.show()
# The result of RDD API
result2.show()
# The result of SQL API
result.show()
# The result of RDD API
result2.show()
24/05/22 18:03:06 WARN JoinQuery: UseIndex is true, but no index exists. Will build index on the fly.
+--------------------+--------------------+--------------------+--------------------+ | country_geom| NAME_EN| airport_geom| name| +--------------------+--------------------+--------------------+--------------------+ |MULTIPOLYGON (((1...|Taiwan ...|POINT (121.231370...|Taoyuan ...| |MULTIPOLYGON (((5...|Netherlands ...|POINT (4.76437693...|Schiphol ...| |POLYGON ((103.969...|Singapore ...|POINT (103.986413...|Singapore Changi ...| |MULTIPOLYGON (((-...|United Kingdom ...|POINT (-0.4531566...|London Heathrow ...| |MULTIPOLYGON (((-...|United States of ...|POINT (-149.98172...|Anchorage Int'l ...| |MULTIPOLYGON (((-...|United States of ...|POINT (-84.425397...|Hartsfield-Jackso...| |MULTIPOLYGON (((1...|People's Republic...|POINT (116.588174...|Beijing Capital ...| |MULTIPOLYGON (((-...|Colombia ...|POINT (-74.143371...|Eldorado Int'l ...| |MULTIPOLYGON (((6...|India ...|POINT (72.8745639...|Chhatrapati Shiva...| |MULTIPOLYGON (((-...|United States of ...|POINT (-71.016406...|Gen E L Logan Int...| |MULTIPOLYGON (((-...|United States of ...|POINT (-76.668642...|Baltimore-Washing...| |POLYGON ((36.8713...|Egypt ...|POINT (31.3997430...|Cairo Int'l ...| |POLYGON ((-2.2196...|Morocco ...|POINT (-7.6632188...|Casablanca-Anfa ...| |MULTIPOLYGON (((-...|Venezuela ...|POINT (-67.005748...|Simon Bolivar Int...| |MULTIPOLYGON (((2...|South Africa ...|POINT (18.5976565...|Cape Town Int'l ...| |MULTIPOLYGON (((1...|People's Republic...|POINT (103.956136...|Chengdushuang Liu...| |MULTIPOLYGON (((6...|India ...|POINT (77.0878362...|Indira Gandhi Int...| |MULTIPOLYGON (((-...|United States of ...|POINT (-104.67379...|Denver Int'l ...| |MULTIPOLYGON (((-...|United States of ...|POINT (-97.040371...|Dallas-Ft. Worth ...| |MULTIPOLYGON (((1...|Thailand ...|POINT (100.602578...|Don Muang Int'l ...| +--------------------+--------------------+--------------------+--------------------+ only showing top 20 rows
+--------------------+--------------------+--------------------+--------------------+ | country_geom| NAME_EN| airport_geom| name| +--------------------+--------------------+--------------------+--------------------+ |MULTIPOLYGON (((-...|United States of ...|POINT (-80.145258...|Fort Lauderdale H...| |MULTIPOLYGON (((-...|United States of ...|POINT (-80.278971...|Miami Int'l ...| |MULTIPOLYGON (((-...|United States of ...|POINT (-95.333704...|George Bush Inter...| |MULTIPOLYGON (((-...|United States of ...|POINT (-90.256693...|New Orleans Int'l...| |MULTIPOLYGON (((-...|United States of ...|POINT (-81.307371...|Orlando Int'l ...| |MULTIPOLYGON (((-...|United States of ...|POINT (-82.534824...|Tampa Int'l ...| |MULTIPOLYGON (((-...|United States of ...|POINT (-112.01363...|Sky Harbor Int'l ...| |MULTIPOLYGON (((-...|United States of ...|POINT (-118.40246...|Los Angeles Int'l...| |MULTIPOLYGON (((-...|United States of ...|POINT (-116.97547...|General Abelardo ...| |MULTIPOLYGON (((-...|United States of ...|POINT (-97.040371...|Dallas-Ft. Worth ...| |MULTIPOLYGON (((-...|United States of ...|POINT (-84.425397...|Hartsfield-Jackso...| |POLYGON ((-69.965...|Peru ...|POINT (-77.107565...|Jorge Chavez ...| |MULTIPOLYGON (((-...|Panama ...|POINT (-79.387134...|Tocumen Int'l ...| |POLYGON ((-83.157...|Nicaragua ...|POINT (-86.171284...|Augusto Cesar San...| |MULTIPOLYGON (((-...|Mexico ...|POINT (-96.183570...|Gen. Heriberto Ja...| |MULTIPOLYGON (((-...|Mexico ...|POINT (-106.27001...|General Rafael Bu...| |MULTIPOLYGON (((-...|Mexico ...|POINT (-99.754508...|General Juan N Al...| |MULTIPOLYGON (((-...|Mexico ...|POINT (-99.570649...|Jose Maria Morelo...| |MULTIPOLYGON (((-...|Mexico ...|POINT (-98.375759...|Puebla ...| |MULTIPOLYGON (((-...|Mexico ...|POINT (-99.082607...|Lic Benito Juarez...| +--------------------+--------------------+--------------------+--------------------+ only showing top 20 rows
Group airports by country¶
In [8]:
Copied!
# result.createOrReplaceTempView("result")
result2.createOrReplaceTempView("result")
groupedresult = sedona.sql(
"SELECT c.NAME_EN, c.country_geom, count(*) as AirportCount FROM result c GROUP BY c.NAME_EN, c.country_geom"
)
groupedresult.show()
groupedresult.createOrReplaceTempView("grouped_result")
# result.createOrReplaceTempView("result")
result2.createOrReplaceTempView("result")
groupedresult = sedona.sql(
"SELECT c.NAME_EN, c.country_geom, count(*) as AirportCount FROM result c GROUP BY c.NAME_EN, c.country_geom"
)
groupedresult.show()
groupedresult.createOrReplaceTempView("grouped_result")
+--------------------+--------------------+------------+ | NAME_EN| country_geom|AirportCount| +--------------------+--------------------+------------+ |Cuba ...|MULTIPOLYGON (((-...| 1| |Mexico ...|MULTIPOLYGON (((-...| 12| |Panama ...|MULTIPOLYGON (((-...| 1| |Nicaragua ...|POLYGON ((-83.157...| 1| |Honduras ...|MULTIPOLYGON (((-...| 1| |Colombia ...|MULTIPOLYGON (((-...| 4| |United States of ...|MULTIPOLYGON (((-...| 35| |Ecuador ...|MULTIPOLYGON (((-...| 1| |The Bahamas ...|MULTIPOLYGON (((-...| 1| |Peru ...|POLYGON ((-69.965...| 1| |Guatemala ...|POLYGON ((-92.235...| 1| |Canada ...|MULTIPOLYGON (((-...| 15| |Venezuela ...|MULTIPOLYGON (((-...| 3| |Argentina ...|MULTIPOLYGON (((-...| 3| |Bolivia ...|MULTIPOLYGON (((-...| 2| |Paraguay ...|POLYGON ((-58.159...| 1| |Benin ...|POLYGON ((1.62265...| 1| |Guinea ...|POLYGON ((-10.283...| 1| |Chile ...|MULTIPOLYGON (((-...| 5| |Nigeria ...|MULTIPOLYGON (((7...| 3| +--------------------+--------------------+------------+ only showing top 20 rows
Visualize the number of airports in each country¶
Visualize using SedonaKepler¶
In [9]:
Copied!
sedona_kepler_map = SedonaKepler.create_map(
df=groupedresult, name="AirportCount", config=getConfig()
)
sedona_kepler_map
sedona_kepler_map = SedonaKepler.create_map(
df=groupedresult, name="AirportCount", config=getConfig()
)
sedona_kepler_map
User Guide: https://docs.kepler.gl/docs/keplergl-jupyter
KeplerGl(config={'version': 'v1', 'config': {'visState': {'filters': [], 'layers': [{'id': 'ikzru0t', 'type': …
Visualize using SedonaPyDeck¶
The above visualization is generated by a pre-set config informing SedonaKepler that the map to be rendered has to be a choropleth map with choropleth of the AirportCount
column value.
This can be also be achieved using SedonaPyDeck and its create_choropleth_map
API.
In [10]:
Copied!
sedona_pydeck_map = SedonaPyDeck.create_choropleth_map(
df=groupedresult, plot_col="AirportCount"
)
sedona_pydeck_map
sedona_pydeck_map = SedonaPyDeck.create_choropleth_map(
df=groupedresult, plot_col="AirportCount"
)
sedona_pydeck_map
Out[10]: