Self-contained Spark projects¶
A self-contained project allows you to create multiple Scala / Java files and write complex logic in one place. To use GeoSpark in your self-contained Spark project, you just need to add GeoSpark as a dependency in your POM.xml or build.sbt.
Quick start¶
- To add GeoSpark as a dependency, please read GeoSpark Maven Central coordinates (an illustrative build.sbt snippet is shown after this list)
- Use GeoSpark Template project to start: GeoSpark Template Project
- Compile your project using SBT or Maven. Make sure you obtain the fat jar which packages all dependencies.
- Submit your compiled fat jar to your Spark cluster. Make sure you are in the root folder of the Spark distribution. Then run the following command:
./bin/spark-submit --master spark://YOUR-IP:7077 /Path/To/YourJar.jar
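For reference, a hedged example of what the GeoSpark dependency declarations might look like in build.sbt. The artifact names and version below are assumptions; take the exact coordinates for your Spark and Scala versions from the GeoSpark Maven Central coordinates page.

```scala
// Illustrative only: verify group/artifact IDs and the version on the
// GeoSpark Maven Central coordinates page before copying.
libraryDependencies ++= Seq(
  "org.datasyslab" % "geospark" % "1.3.1",
  "org.datasyslab" % "geospark-sql_2.3" % "1.3.1"
)
```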
Note
The detailed explanation of spark-submit is available on the Spark website.
How to use GeoSpark in an IDE¶
Select an IDE¶
To develop a complex GeoSpark project, we suggest you use IntelliJ IDEA. It supports both JVM languages used by GeoSpark (Scala and Java) and both common dependency management systems (Maven and SBT).
Eclipse is also fine if you just want to use Java and Maven.
Open GeoSpark template project¶
Select the GeoSpark template project you want from GeoSpark Template Project. In this tutorial, we use the GeoSparkSQL Scala project as an example.
Open the folder that contains the build.sbt file in your IDE. The IDE may take a while to index dependencies and source code.
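Once indexing finishes, you can optionally verify that all dependencies resolve from the command line (a rough sketch, assuming SBT is installed and you are in the project root):
sbt compile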
Try GeoSpark SQL functions¶
In your IDE, run the ScalaExample.scala file.
You don't need to change anything in this file; the IDE will run all SQL queries in this example in local mode.
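If you want to experiment beyond the bundled example, the basic pattern it follows looks roughly like the sketch below. This is illustrative only: the session settings mirror the template, and the query is a made-up one-liner on a WKT literal rather than real data.

```scala
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator
import org.datasyslab.geosparksql.utils.GeoSparkSQLRegistrator

// Build a local SparkSession with the same settings the template uses, then
// register GeoSpark's spatial SQL functions (ST_*) so they can be called from SQL.
val sparkSession = SparkSession.builder()
  .config("spark.serializer", classOf[KryoSerializer].getName)
  .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
  .master("local[*]")
  .appName("GeoSparkSQL-demo")
  .getOrCreate()
GeoSparkSQLRegistrator.registerAll(sparkSession)

// A trivial spatial query on a WKT literal; real code would load data into a table first.
val spatialDf = sparkSession.sql("SELECT ST_GeomFromWKT('POINT(1.0 1.0)') AS geom")
spatialDf.show()
```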
Package the project¶
To run this project in cluster mode, you have to package it into a JAR and then run it using the spark-submit command.
Before packaging this project, you always need to check the following:

- Remove the hardcoded master, master("local[*]"). This hardcoded master is only needed when you run this project in an IDE.

```scala
var sparkSession: SparkSession = SparkSession.builder()
  .config("spark.serializer", classOf[KryoSerializer].getName)
  .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
  .master("local[*]") // remove this line before packaging for a cluster
  .appName("GeoSparkSQL-demo").getOrCreate()
```
- In build.sbt (or POM.xml), set the Spark dependency scope to provided instead of compile. The compile scope is only needed when you run this project in an IDE.

```scala
// Change the scope from "compile" to "provided" before packaging:
"org.apache.spark" %% "spark-core" % SparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % SparkVersion % "provided",
```
Warning
Forgetting to change the dependency scope will lead to a very big fat JAR and dependency conflicts when calling spark-submit. For more details, please visit Maven Dependency Scope.
- Make sure your downloaded Spark binary distribution is the same version as the Spark version used in your build.sbt or POM.xml.
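With those checks done, build the fat JAR from the project root. As a rough sketch, assuming the template is already configured with the sbt-assembly plugin (for the Scala/SBT project) or an equivalent Maven assembly/shade setup (for the Java/Maven project):
sbt clean assembly
or, for a Maven project:
mvn clean package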
Submit the compiled jar¶
- Go to the ./target/scala-2.11 folder and find the jar called GeoSparkSQLScalaTemplate-0.1.0.jar. Note that this JAR is normally larger than 1 MB. (If you use POM.xml, the jar is under the ./target folder.)
- Submit this JAR using spark-submit.
- Local mode:
./bin/spark-submit /Path/To/YourJar.jar
- Cluster mode:
./bin/spark-submit --master spark://YOUR-IP:7077 /Path/To/YourJar.jar