sedona.spark.stats.hotspot_detection package

Submodules

sedona.spark.stats.hotspot_detection.getis_ord module

Getis Ord functions. From the 1992 paper by Getis & Ord.

Getis, A., & Ord, J. K. (1992). The analysis of spatial association by use of distance statistics. Geographical Analysis, 24(3), 189-206. https://doi.org/10.1111/j.1538-4632.1992.tb00261.x

sedona.spark.stats.hotspot_detection.getis_ord.g_local(dataframe: DataFrame, x: str, weights: str = 'weights', permutations: int = 0, star: bool = False, island_weight: float = 0.0) → DataFrame[source]

Performs the Gi or Gi* statistic on the x column of the dataframe.

Weights should be the neighbors of this row. The members of the weights should be comprised of structs containing a value column and a neighbor column. The neighbor column should be the contents of the neighbors with the same types as the parent row (minus neighbors). You can use wherobots.weighing.add_distance_band_column to achieve this. To calculate the Gi* statistic, ensure the focal observation is in the neighbors array (i.e. the row is in the weights column) and star=true. Significance is calculated with a z score. Permutation tests are not yet implemented and thus island weight does nothing. The following columns will be added: G, E[G], V[G], Z, P.

Parameters:

dataframe – the dataframe to perform the G statistic on
x – The column name we want to perform hotspot analysis on
weights – The column name containing the neighbors array. The neighbor column should be the contents of the neighbors with the same types as the parent row (minus neighbors). You can use wherobots.weighing.add_distance_band_column to achieve this.
permutations – Not used. Permutation tests are not supported yet. The number of permutations to use for the significance test.
star – Whether the focal observation is in the neighbors array. If true this calculates Gi*, otherwise Gi
island_weight – Not used. The weight for the simulated neighbor used for records without a neighbor in perm tests

Returns:

A dataframe with the original columns plus the columns G, E[G], V[G], Z, P.

Module contents

Detecting across a region where a variable’s value is significantly different from other values nearby.