Advanced tutorial: Tune your GeoSpark RDD application

Before getting into this advanced tutorial, please make sure that you have tried several GeoSpark functions on your local machine.

Pick a proper GeoSpark version

GeoSpark version numbers have three levels: X.X.X (e.g., 0.8.1). In addition, GeoSpark supports Spark 1.X in its Spark1.X version.

The first level means that this version contains a major structural redesign, which may bring big changes in APIs and performance. Hopefully, we can see these big changes in the GeoSpark 1.X version.

The second level (e.g., 0.8) indicates that this version contains significant performance enhancements, big new features and API changes. An existing GeoSpark user who wants to pick this version needs to be careful about the API changes. Before you move to this version, please read the GeoSpark release notes and make sure you are ready to accept the API changes.

The third level (e.g., 0.8.1) indicates that this version only contains bug fixes, some small new features and slight performance enhancements. This version will not contain any API changes, so moving to it is safe. We strongly recommend that all GeoSpark users who stay on the same second level move to the latest version within that level.
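The three-level rule above can be sketched as a small helper that reports which kind of change an upgrade implies. This is only an illustration of the rule as described here: the class and method names are hypothetical and are not part of GeoSpark, and it assumes plain X.X.X version strings without suffixes.

```java
/** Hypothetical helper (not part of GeoSpark): classifies an upgrade by the first level that differs. */
final class GeoSparkVersionLevels {

    enum Change { NONE, BUG_FIX, API_CHANGE, REDESIGN }

    /** Splits "0.8.1" into {0, 8, 1}; rejects anything that is not X.X.X. */
    static int[] parse(String version) {
        String[] parts = version.trim().split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected X.X.X but got: " + version);
        }
        int[] levels = new int[3];
        for (int i = 0; i < 3; i++) {
            levels[i] = Integer.parseInt(parts[i]);
        }
        return levels;
    }

    /** Reports the most significant level that changes between two versions. */
    static Change classify(String from, String to) {
        int[] a = parse(from);
        int[] b = parse(to);
        if (a[0] != b[0]) return Change.REDESIGN;   // first level: structural redesign
        if (a[1] != b[1]) return Change.API_CHANGE; // second level: read the release notes first
        if (a[2] != b[2]) return Change.BUG_FIX;    // third level: no API changes, safe to move
        return Change.NONE;
    }
}
```

For example, `GeoSparkVersionLevels.classify("0.8.0", "0.8.1")` yields `BUG_FIX`, which is the case where upgrading is considered safe.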

Choose a proper Spatial RDD constructor

GeoSpark provides a number of constructors for each SpatialRDD (PointRDD, PolygonRDD and LineStringRDD). In general, you have two options to start with.

Initialize a SpatialRDD from your data source such as HDFS and S3. A typical example is as follows:

```
public PointRDD(JavaSparkContext sparkContext, String InputLocation, Integer Offset, FileDataSplitter splitter, boolean carryInputData, Integer partitions, StorageLevel newLevel)
```

Initialize a SpatialRDD from an existing RDD. A typical example is as follows:
```
public PointRDD(JavaRDD<Point> rawSpatialRDD, StorageLevel newLevel)
```

You may notice that these constructors all take a "StorageLevel" parameter. It tells Apache Spark to cache the "rawSpatialRDD", one attribute of SpatialRDD. GeoSpark does this because it calculates the dataset boundary and approximate total count using several Apache Spark "Action"s, and caching avoids recomputing the RDD for each of them. This information is useful for Spatial Join Query and Distance Join Query.

However, in some cases you may already know your datasets well. If so, you can provide this information manually by calling this kind of Spatial RDD constructor:

```
public PointRDD(JavaSparkContext sparkContext, String InputLocation, Integer Offset, FileDataSplitter splitter, boolean carryInputData, Integer partitions, Envelope datasetBoundary, Integer approximateTotalCount)
```

Manually providing the dataset boundary and approximate total count helps GeoSpark avoid several slow "Action"s during initialization.

Cache the Spatial RDD that is repeatedly used

Each SpatialRDD (PointRDD, PolygonRDD and LineStringRDD) possesses four RDD attributes. They are:

rawSpatialRDD: The RDD generated by SpatialRDD constructors.
spatialPartitionedRDD: The RDD generated by spatially partitioning a rawSpatialRDD. Note that this RDD contains replicated spatial objects.
indexedRawRDD: The RDD generated by indexing a rawSpatialRDD.
indexedRDD: The RDD generated by indexing a spatialPartitionedRDD. Note that this RDD contains replicated spatial objects.

These four RDDs do not co-exist, so you do not need to worry about memory usage. Each of them is used by different queries:

Spatial Range Query / KNN Query, no index: rawSpatialRDD is used.
Spatial Range Query / KNN Query, use index: indexedRawRDD is used.
Spatial Join Query / Distance Join Query, no index: spatialPartitionedRDD is used.
Spatial Join Query / Distance Join Query, use index: indexedRDD is used.

Therefore, if you run one of the queries above many times, you should cache the associated RDD in memory. There are several possible use cases:

In Spatial Data Mining such as Spatial Autocorrelation and Spatial Co-location Pattern Mining, you may need to use Spatial Join / Spatial Self-join iteratively in order to calculate the adjacency matrix. If so, please cache the spatialPartitionedRDD/indexedRDD, which is queried many times.
In Spark RDD sharing applications such as Livy and Spark Job Server, many users may run Spatial Range Query / KNN Query on the same Spatial RDD with different query predicates. In that case, you should cache the rawSpatialRDD/indexedRawRDD.
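The mapping from query type to RDD attribute listed above can be written down as a small lookup, which may help when deciding what to cache. This is a minimal sketch: the class, enum and method names are hypothetical and not part of GeoSpark, and only the four attribute names come from this tutorial.

```java
/** Hypothetical lookup (not part of GeoSpark): which SpatialRDD attribute a query reads. */
final class SpatialRddCacheGuide {

    enum Query { RANGE, KNN, SPATIAL_JOIN, DISTANCE_JOIN }

    /** Returns the name of the RDD attribute worth caching for a repeated query. */
    static String rddUsedBy(Query query, boolean useIndex) {
        switch (query) {
            case RANGE:
            case KNN:
                // Range and KNN queries work on the unpartitioned data.
                return useIndex ? "indexedRawRDD" : "rawSpatialRDD";
            case SPATIAL_JOIN:
            case DISTANCE_JOIN:
                // Join queries work on the spatially partitioned data.
                return useIndex ? "indexedRDD" : "spatialPartitionedRDD";
            default:
                throw new IllegalArgumentException("Unknown query: " + query);
        }
    }
}
```

Once you know which attribute a repeated query reads, you can persist that RDD with Spark's own `persist(StorageLevel)` method, for example with `StorageLevel.MEMORY_ONLY()`. That step is left out of the sketch because it needs a running Spark context.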

Be aware of Spatial RDD partitions

Users sometimes report slow execution times. As a first step, you should always consider increasing the number of SpatialRDD partitions to 2 to 8 times the original number. You can set this when you initialize a SpatialRDD. This may significantly improve performance.
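One way to apply this advice is to benchmark a few partition counts within the 2x to 8x range rather than guessing a single value. The helper below is a hypothetical sketch, not part of GeoSpark, and the choice of factors 2, 4 and 8 is an assumption made for illustration.

```java
/** Hypothetical helper (not part of GeoSpark): partition counts to try when tuning. */
final class PartitionSweep {

    /** Returns 2x, 4x and 8x the original partition count (factors are an illustrative choice). */
    static int[] candidates(int originalPartitions) {
        if (originalPartitions <= 0) {
            throw new IllegalArgumentException("Partition count must be positive");
        }
        int[] factors = {2, 4, 8};
        int[] result = new int[factors.length];
        for (int i = 0; i < factors.length; i++) {
            // multiplyExact fails loudly instead of silently overflowing.
            result[i] = Math.multiplyExact(originalPartitions, factors[i]);
        }
        return result;
    }
}
```

Each candidate can then be passed as the `partitions` argument of a SpatialRDD constructor and timed against your actual queries.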

After that, you may consider tuning other Apache Spark parameters. For example, you may use the Kryo serializer or change the fraction of memory that is used for caching RDDs.
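As a rough sketch of what such tuning could look like, the snippet below collects two standard Spark properties in a plain map. The class name is hypothetical and the value 0.6 is only a placeholder to be tuned for your workload. `spark.serializer` and `spark.memory.storageFraction` are Spark configuration keys. Older Spark 1.x releases with legacy memory management used `spark.storage.memoryFraction` instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for Spark settings one might try after tuning partitions. */
final class GeoSparkTuningConf {

    static Map<String, String> suggestedSettings() {
        Map<String, String> conf = new LinkedHashMap<>();
        // Switch from Java serialization to Kryo.
        conf.put("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        // Share of unified memory protected for cached RDDs; 0.6 is a placeholder, not a recommendation.
        conf.put("spark.memory.storageFraction", "0.6");
        return conf;
    }
}
```

Each entry can be applied with `SparkConf.set(key, value)` before the Spark context is created, or passed on the command line as `--conf key=value` to `spark-submit`.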

Last update: September 15, 2020 23:40:05