DataFrame
Overview¶
Sedona's stats module provides Scala and Python functions for conducting geospatial statistical analysis on dataframes with spatial columns. The stats module is built on top of the core module and provides a set of functions that can be used to perform spatial analysis on these dataframes. The stats module is designed to be used with the core module and the viz module to provide a complete set of geospatial analysis tools.
Using DBSCAN¶
The DBSCAN function is provided at org.apache.sedona.stats.clustering.DBSCAN.dbscan
in scala/java and sedona.stats.clustering.dbscan.dbscan
in python.
The function annotates a dataframe with a cluster label for each data record using the DBSCAN algorithm.
The dataframe should contain at least one GeometryType
column. Rows must be unique. If one
geometry column is present it will be used automatically. If two are present, the one named
'geometry' will be used. If more than one are present and none are named 'geometry', the
column name must be provided. The new column will be named 'cluster'.
Parameters¶
names in parentheses are python variable names
- dataframe - dataframe to cluster. Must contain at least one GeometryType column
- epsilon - minimum distance parameter of DBSCAN algorithm
- minPts (min_pts) - minimum number of points parameter of DBSCAN algorithm
- geometry - name of the geometry column
- includeOutliers (include_outliers) - whether to include outliers in the output. Default is false
- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
The output is the input DataFrame with the cluster label added to each row. Outlier will have a cluster value of -1 if included.
Using Local Outlier Factor (LOF)¶
The LOF function is provided at org.apache.sedona.stats.outlierDetection.LocalOutlierFactor.localOutlierFactor
in scala/java and sedona.stats.outlier_detection.local_outlier_factor.local_outlier_factor
in python.
The function annotates a dataframe with a column containing the local outlier factor for each data record.
The dataframe should contain at least one GeometryType
column. Rows must be unique. If one
geometry column is present it will be used automatically. If two are present, the one named
'geometry' will be used. If more than one are present and neither is named 'geometry', the
column name must be provided.
Parameters¶
names in parentheses are python variable names
- dataframe - dataframe containing the point geometries
- k - number of nearest neighbors that will be considered for the LOF calculation
- geometry - name of the geometry column
- handleTies (handle_ties) - whether to handle ties in the k-distance calculation. Default is false
- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
The output is the input DataFrame with the lof added to each row.
Using Getis-Ord Gi(*)¶
The G Local function is provided at org.apache.sedona.stats.hotspotDetection.GetisOrd.gLocal
in scala/java and sedona.stats.hotspot_detection.getis_ord.g_local
in python.
Performs the Gi or Gi* statistic on the x column of the dataframe.
Weights should be the neighbors of this row. The members of the weights should be comprised
of structs containing a value column and a neighbor column. The neighbor column should be the
contents of the neighbors with the same types as the parent row (minus neighbors). Reference the Using the Distance
Weighting Function header for instructions on generating this column. To calculate the Gi*
statistic, ensure the focal observation is in the neighbors array (i.e. the row is in the
weights column) and star=true
. Significance is calculated with a z score.
Parameters¶
- dataframe - the dataframe to perform the G statistic on
- x - The column name we want to perform hotspot analysis on
- weights - The column name containing the neighbors array. The neighbor column should be the contents of the neighbors with the same types as the parent row (minus neighbors). You can use
Weighting
class functions to achieve this. - star - Whether the focal observation is in the neighbors array. If true this calculates Gi*, otherwise Gi
The output is the input DataFrame with the following columns added: G, E[G], V[G], Z, P.
Using the Distance Weighting Function¶
The Weighting functions are provided at org.apache.sedona.stats.Weighting
in scala/java and sedona.stats.weighting
in python.
The function generates a column containing an array of structs containing a value column and a neighbor column.
The generic addDistanceBandColumn
(add_distance_band_column
in python) function annotates a dataframe with a weights column containing the other records within the threshold and their weight.
The dataframe should contain at least one GeometryType
column. Rows must be unique. If one
geometry column is present it will be used automatically. If two are present, the one named
'geometry' will be used. If more than one are present and neither is named 'geometry', the
column name must be provided. The new column will be named 'cluster'.
Parameters¶
addDistanceBandColumn¶
names in parentheses are python variable names
- dataframe - DataFrame with geometry column
- threshold - Distance threshold for considering neighbors
- binary - whether to use binary weights or inverse distance weights for neighbors (dist^alpha)
- alpha - alpha to use for inverse distance weights ignored when binary is true
- includeZeroDistanceNeighbors (include_zero_distance_neighbors) - whether to include neighbors that are 0 distance. If 0 distance neighbors are included and binary is false, values are infinity as per the floating point spec (divide by 0)
- includeSelf (include_self) - whether to include self in the list of neighbors
- selfWeight (self_weight) - the value to use for the self weight
- geometry - name of the geometry column
- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
addBinaryDistanceBandColumn¶
names in parentheses are python variable names
- dataframe - DataFrame with geometry column
- threshold - Distance threshold for considering neighbors
- includeZeroDistanceNeighbors (include_zero_distance_neighbors) - whether to include neighbors that are 0 distance. If 0 distance neighbors are included and binary is false, values are infinity as per the floating point spec (divide by 0)
- includeSelf (include_self) - whether to include self in the list of neighbors
- selfWeight (self_weight) - the value to use for the self weight
- geometry - name of the geometry column
- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
In both cases the output is the input DataFrame with the weights column added to each row.
Moran I¶
Moran I is the spatial autocorrelation algorithm, which is using spatial location and non-spatial attribute. When the value is close to the 1 it means that there is spatial correlation, when it is close to 0 then the correlation does not exist and data is randomly distributed. When the MoranI autocorrelation value is close to -1 it means that there is negative correlation. Negative correlation means that close values has dissimilar values.
You can see spatial correlation values on the figure below
- on the left there is negative correlation (-1)
- in the middle correlation is positive (1)
- on the right the correlation is close to zero and data is random.
Moran statistics can be used as the Scala/Java and Python functions. As the input function requires weight DataFrame. You can create the weight DataFrame using Apache Sedona weighting functions. You need to keep in mind that your input has to have id column that uniquely identifies the feature and value field. The required minimal schema for the MoranI Apache Sedona function is:
|-- id: integer (nullable = true)
|-- value: double (nullable = true)
|-- weights: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- neighbor: struct (nullable = false)
| | | |-- id: integer (nullable = true)
| | | |-- value: double (nullable = true)
| | |-- value: double (nullable = true)
You can manipulate the value column name and id using function parameters.
To use the Apache Sedona weight functions you need to pass the id column and value column to kept parameters.
val weights = Weighting.addDistanceBandColumn(
positiveCorrelationFrame,
1.0,
savedAttributes = Seq("id", "value")
)
val moranResult = Moran.getGlobal(weights, idColumn = "id")
// result fields
moranResult.getPNorm
moranResult.getI
moranResult.getZNorm
from sedona.spark.stats.autocorrelation.moran import Moran
from sedona.spark.stats.weighting import add_binary_distance_band_column
result = add_binary_distance_band_column(
df,
1.0,
saved_attributes=["id", "value"]
)
moran_i_result = Moran.get_global(result)
## result fields
moran_i_result.p_norm
moran_i_result.i
moran_i_result.z_norm
In the result you get the Z norm, P norm and Moran I value.
The full signatures of the functions
def getGlobal(
dataframe: DataFrame,
twoTailed: Boolean = true,
idColumn: String = ID_COLUMN,
valueColumnName: String = VALUE_COLUMN): MoranResult
// java interface
public interface MoranResult {
public double getI();
public double getPNorm();
public double getZNorm();
}
def get_global(
df: DataFrame,
two_tailed: bool = True,
id_column: str = "id",
value_column: str = "value",
) -> MoranResult
@dataclass
class MoranResult:
i: float
p_norm: float
z_norm: float