Introduction
Hot Spot Analysis is a common method for identifying spatially significant clusters of high and low values among features. While feature values may be easily mapped using histogram distributions to identify areas or points with high or low values, it is difficult to tell whether these values are spatially significant.
In ArcGIS, the Hotspot Analysis tool calculates the Getis-Ord Gi* statistic (Figure 1) for each feature in a dataset. It then outputs a z-score and p-value for each feature, allowing one to map the spatial distribution of significant values to identify local hotspots. To be a statistically significant hotspot, a feature must not only have a high or low value, but must be surrounded by features with similarly high or low values.
Figure 1: Equation for the Getis-Ord Gi* statistic.
As we shall see, significance for the Getis-Ord Gi* statistic depends largely on the calculation of a local sum for a feature and its neighbors. This local sum is compared proportionally to the sum of all features to identify regions where the observed local sum differs significantly from the expected local sum, that is, where the difference is too large to be the result of random chance.
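For readers without access to the figure, the statistic is commonly written as follows (this is the form given in the ArcGIS tool documentation; the symbols are the usual ones, where x_j is the attribute value of feature j, w_ij is the spatial weight between features i and j, and n is the total number of features):

```latex
G_i^{*} = \frac{\sum_{j=1}^{n} w_{ij} x_j - \bar{X} \sum_{j=1}^{n} w_{ij}}
               {S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{ij}^{2} - \left( \sum_{j=1}^{n} w_{ij} \right)^{2}}{n - 1}}},
\qquad
\bar{X} = \frac{1}{n} \sum_{j=1}^{n} x_j,
\qquad
S = \sqrt{\frac{1}{n} \sum_{j=1}^{n} x_j^{2} - \bar{X}^{2}}
```

The first term of the numerator is the observed local sum and the second is the local sum expected if values were distributed at random. With a fixed distance band, w_ij is 1 for features within the threshold distance of feature i (including i itself) and 0 otherwise, so the choice of threshold decides which features enter the local sum.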
But how does the Hotspot Analysis tool determine what is “local”?
When the threshold distance parameter is left blank, the tool computes a default threshold, the Euclidean distance that ensures every feature has at least one neighbor, and applies it to the entire dataset. An implication of this is that Hotspot Analysis is extremely sensitive to the spatial distribution of features, making some data inappropriate for this type of analysis.
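To make the default concrete, here is a minimal sketch in Python with NumPy. It assumes that "the distance that ensures every feature has at least one neighbor" amounts to the largest nearest-neighbor distance in the dataset. The function name and the coordinates are hypothetical, and the tool's internal implementation may differ in its details.

```python
import numpy as np

def default_threshold(coords):
    """Smallest distance band at which every point has at least one neighbor.

    Assumption for illustration: this equals the largest nearest-neighbor distance.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances (n x n); fine for a few hundred points.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)   # a point does not count as its own neighbor here
    nearest = dist.min(axis=1)       # distance from each point to its nearest neighbor
    return float(nearest.max())      # the most isolated point sets the threshold

# Hypothetical coordinates in metres: a tight cluster plus one isolated point.
points = [(0, 0), (100, 0), (0, 100), (100, 100), (5000, 0)]
print(default_threshold(points))
```

In this toy example the four clustered points sit 100 m from their nearest neighbors, but the single isolated point pushes the threshold for the whole dataset to 4,900 m.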
In this tutorial, we will explore how manual changes in threshold distance affect our view of what is significant using clay chemistry data from the Valley of Oaxaca, Mexico. I will argue that because clay sample locations were researcher-selected rather than the product of natural processes, Hotspot Analysis is not appropriate for this data.
Data, Analysis, Results
The Valley of Oaxaca is a broad Y-shaped valley located in the central highlands of Oaxaca, Mexico, and the heart of ancient Zapotec civilization. Over the past several years, researchers at the OSU Archaeometry Laboratory have collected over 300 clay samples from across the Valley for elemental analysis using Instrumental Neutron Activation Analysis (INAA) to establish a comparative basis for the geochemical fingerprinting of ancient pottery from the region.
Figure 2: Clay Sampling locations relative to regional geology in the Valley of Oaxaca, Mexico.
Our ability to identify pottery produced in different parts of the Valley is due in large part to the geologic complexity of the region. In each area of the Valley, clay composition is strongly conditioned by parent material, sometimes representing the product of mixed sediments from multiple sources. For this exercise, we will focus on two elements, lanthanum (La) and cesium (Cs), found in elevated concentrations in clays derived from gneiss and rhyolite respectively. La concentrations tend to be high along the western side of the valley (Figure 3), while Cs concentrations are high in the east (Figure 4).
Figure 3: La concentrations in samples from the Oaxaca Clay Survey
Figure 4: Cs concentrations in samples from the Oaxaca Clay Survey
A hotspot analysis of La concentrations from across the Valley using an unspecified (default setting) threshold distance (Figure 5) identifies significantly high hotspots of La along the central portion of the western edge of the valley – as expected – as well as significant cold spots along the eastern edge of the southern arm of the valley and along the northern side of the eastern arm. A similar analysis of Cs values identifies Cs cold spots along the western edge of the valley and Cs hot spots near the southern and eastern sides of the eastern arm (Figure 6).
Figure 5: Hotspot analysis of La concentrations using an unspecified threshold distance.
Figure 6: Hotspot analysis of Cs concentrations using an unspecified threshold distance.
Broadly speaking, these patterns are in accord with our previous understanding of regional clay composition, but to what degree are the significance levels identified by hotspot analysis dependent upon our unspecified threshold distance?
To address this question, we conducted a series of three additional hotspot analyses of Cs concentrations using specified threshold distances of 100 m (Figure 7), 1,000 m (Figure 8), and 10,000 m (Figure 9). Results of these analyses show that the number of features identified as significant hot or cold spots increases as a function of threshold distance. Using a threshold of 100 m, only a few points in the far eastern portion of the Valley are identified as significant hotspots. This pattern largely holds true using a 1,000 m threshold with a few additional points in the same limited area defined as significant hotspots. However, using a threshold of 10,000 m, the majority of features on the eastern side of the Valley are classified as significant hotspots, while the majority of those in the west are classified as significant cold spots. When nearly everything is significant, what is significant? Clearly 10,000 m is an improper choice of threshold distance, but this raises the threefold issue of how the default threshold is determined, whether this distance best describes the data, and whether the assumptions of this tool are met by the data.
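The same experiment can be sketched outside ArcGIS. The code below is a simplified illustration, not the tool's implementation: it assumes binary fixed-distance-band weights that include the feature itself, uses synthetic points with an east-west gradient rather than the Oaxaca data, and counts features with |z| > 1.96 (roughly the 95% level, two-sided) with no correction for multiple testing. All names and values are hypothetical.

```python
import numpy as np

def gi_star(coords, values, threshold):
    """Gi* z-scores with binary fixed-distance-band weights (self included)."""
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(values, dtype=float)
    n = x.size
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    w = (dist <= threshold).astype(float)       # w_ij = 1 inside the band, else 0
    mean = x.mean()
    s = np.sqrt((x ** 2).mean() - mean ** 2)    # population standard deviation
    w_sum = w.sum(axis=1)
    numerator = w @ x - mean * w_sum            # observed minus expected local sum
    denominator = s * np.sqrt((n * (w ** 2).sum(axis=1) - w_sum ** 2) / (n - 1))
    with np.errstate(divide="ignore", invalid="ignore"):
        z = numerator / denominator
    # Undefined when a band covers every feature (zero denominator).
    return np.where(denominator > 0, z, np.nan)

# Synthetic stand-in data: 300 points in a 20 km box, values rising toward the east.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 20_000, size=(300, 2))
values = coords[:, 0] / 20_000 + rng.normal(0, 0.3, size=300)

for d in (100, 1_000, 10_000):
    z = gi_star(coords, values, d)
    n_sig = int(np.sum(np.abs(z) > 1.96))
    print(f"threshold {d:>6} m: {n_sig} of {len(z)} features with |z| > 1.96")
```

When the band is so small that a feature has no neighbors but itself, the statistic reduces to the feature's ordinary standardized value, so at that scale the result says nothing about spatial clustering.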
Figure 7: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 100 m.
Figure 8: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 1,000 m.
Figure 9: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 10,000 m.
Discussion
Results of a series of hotspot analyses of La and Cs concentrations in clay samples from the Valley of Oaxaca largely confirm what we already knew – La concentrations are high in clays along the western side of the valley and Cs concentrations are highest in clays in the eastern side of the valley. Hotspot analysis allows us to identify spatially significant clusters of points with higher or lower than expected values, but as we have seen, what gets flagged as significant is strongly dependent upon one’s choice of threshold distance.
By default, ArcGIS selects a threshold distance equal to the minimum distance required to ensure that all features have at least one neighbor. This means that the default minimum distance is contingent upon the spatial distribution of features, and is likely driven by features at the edge of the dataset and/or spatial outliers.
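One rough way to check whether an outlier is driving the default is to compare the largest nearest-neighbor distance with a typical one and to look up which feature produces the maximum. The sketch below assumes the default equals the largest nearest-neighbor distance. The function names, the report fields, and the coordinates are hypothetical.

```python
import numpy as np

def nearest_neighbor_distances(coords):
    """Distance from each point to its closest other point."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)   # ignore each point's zero distance to itself
    return dist.min(axis=1)

def threshold_report(coords):
    """Summarize how far the default threshold sits from a typical spacing."""
    nn = nearest_neighbor_distances(coords)
    return {
        "default_threshold": float(nn.max()),      # set by the most isolated feature
        "median_nn_distance": float(np.median(nn)),
        "driving_feature": int(nn.argmax()),       # index of the most isolated feature
    }

# Hypothetical coordinates in metres: a tight cluster plus one isolated point.
points = [(0, 0), (100, 0), (0, 100), (100, 100), (5000, 0)]
report = threshold_report(points)
print(report)
```

A default threshold many times larger than the median nearest-neighbor distance suggests that a few isolated features, rather than the bulk of the data, are defining what counts as "local".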
This raises the larger issue of what factors drive the spatial distribution of features in one’s dataset. Are the locations of individual features the product of the physical or social environment? Is their spatial distribution critical to your understanding of the data? Or are the locations researcher-selected according to some regular, stratified, or arbitrary sampling design that is not a product of the environment, but of how one chose to sample that environment?
In the case of the Oaxaca Clay Survey, clay sampling locations were selected opportunistically from subsurface horizons along roadcuts and streambanks, with greater sampling intensity near key archaeological sites and sparser sampling in areas with few important sites. Because the sampling locations were arbitrarily selected, their overall distribution is not of interest. Nor are the inter-point distances used to calculate the Getis-Ord Gi* statistic.
Because the spatial distribution of clay sampling locations tells us more about researcher behavior than clay formation processes, and Hotspot Analysis uses this information in its calculation of z-scores and significance levels, Hotspot Analysis is not appropriate for this data.
Conclusion
Before using Hotspot Analysis, one should always verify (1) that the spatial locations of features of interest are the product of the physical or social environment rather than researcher-selection; and (2) that the default local threshold value is not driven by spatial outliers.