EDITOR’S NOTE: The data used for this analysis contained spatially duplicated values, meaning there are many instances of two separate measurements taken from the same site. These values might be temporally different (the same site measured more than once on different dates), distinct but adjacent (different samples from the same field, close enough to have the same lat/long within rounding error), or true duplicates. The data were not cleaned for these errors prior to analysis.
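As a rough illustration of how such co-located measurements could be flagged before analysis, here is a minimal Python sketch. The record keys ("lat", "lon", "date", "om"), the rounding precision, and the sample values are all made up for the example and do not come from the study data.

```python
from collections import defaultdict


def find_colocated(records, decimals=4):
    """Group records that share a location after rounding coordinates.

    `records` is a list of dicts with hypothetical keys "lat" and "lon";
    `decimals` is an assumed rounding precision, not one from the study.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (round(rec["lat"], decimals), round(rec["lon"], decimals))
        groups[key].append(rec)
    # Keep only locations that hold more than one measurement.
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


# Invented sample records: the first two round to the same location.
samples = [
    {"lat": 44.56461, "lon": -123.26204, "date": "2001-06-01", "om": 3.1},
    {"lat": 44.56462, "lon": -123.26203, "date": "2004-07-15", "om": 2.8},
    {"lat": 45.10000, "lon": -120.50000, "date": "2001-06-03", "om": 0.0},
]
duplicates = find_colocated(samples)
```

Each flagged group would still need a human decision: repeat visits, adjacent samples, and true duplicates call for different handling.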
Introduction
This week I fiddled around with the two Mapping Clusters tools in the Spatial Statistics toolbox. Here’s a rundown of each tool:
Cluster and Outlier Analysis
Given a set of weighted features, identifies statistically significant hot spots, cold spots, and spatial outliers using the Anselin Local Moran’s I statistic.
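To make the idea concrete, here is a simplified Python sketch of a local Moran's I calculation. It assumes row-standardized fixed-distance neighbors and skips the significance testing that the ArcGIS tool performs, so it is an illustration of the statistic rather than a reimplementation of the tool. The function name and the `radius` parameter are hypothetical.

```python
import math


def local_morans_i(points, values, radius):
    """Return (I_i, label) per point using row-standardized distance-band weights.

    Labels: HH / LL are clusters, HL / LH are spatial outliers,
    "none" means the point has no neighbors within `radius`.
    No significance test is applied here.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    m2 = sum(d * d for d in dev) / n  # variance used to scale the statistic
    results = []
    for i, (xi, yi) in enumerate(points):
        nbrs = [j for j, (xj, yj) in enumerate(points)
                if j != i and math.hypot(xi - xj, yi - yj) <= radius]
        if not nbrs or m2 == 0:
            results.append((0.0, "none"))
            continue
        # Spatial lag: average deviation of the neighbors (self excluded).
        lag = sum(dev[j] for j in nbrs) / len(nbrs)
        i_local = dev[i] * lag / m2
        label = ("H" if dev[i] > 0 else "L") + ("H" if lag > 0 else "L")
        results.append((i_local, label))
    return results


# Toy transect: one low value (index 2) sits inside a run of high values.
pts = [(float(x), 0.0) for x in range(10)]
vals = [10, 10, 1, 10, 10, 0, 0, 0, 0, 0]
moran = local_morans_i(pts, vals, radius=1.5)
```

In this toy transect the low value at index 2 comes out as "LH" (low, surrounded by high) with a negative I, while its high neighbors on the ends form "HH" clusters.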
Hot Spot Analysis
Given a set of weighted features, identifies statistically significant hot spots and cold spots using the Getis-Ord Gi* statistic.
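For comparison, the standard Gi* z-score can be sketched in a few lines of Python. The binary fixed-distance weights and the `radius` parameter are assumptions made for this example; the ArcGIS tool offers several ways to define neighbors.

```python
import math


def gi_star(points, values, radius):
    """Getis-Ord Gi* z-score per point with binary fixed-distance weights.

    A neighbor is any point within `radius` of the target, including the
    target itself (that inclusion is what the * in Gi* denotes).
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for xi, yi in points:
        w = [1.0 if math.hypot(xi - xj, yi - yj) <= radius else 0.0
             for xj, yj in points]
        sum_w = sum(w)
        sum_w2 = sum(wj * wj for wj in w)
        numerator = sum(wj * v for wj, v in zip(w, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        # A zero denominator means no variation or every point is a neighbor.
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


# Toy transect: high values on the left, zeros on the right.
pts = [(float(x), 0.0) for x in range(10)]
vals = [10, 10, 10, 1, 1, 1, 1, 0, 0, 0]
z = gi_star(pts, vals, radius=1.5)
```

Because each point's own value is pooled with its neighbors' values, a single odd value inside a cluster tends to be averaged into the surrounding signal rather than flagged separately.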
Methodology
I wanted to challenge each of these tools to see if they could cope with a common natural resources data phenomenon: zero-value inflation. The problem of a large proportion of zero values is common with data obtained from ecological studies involving counts of abundance, presence–absence, or occupancy rates. The zero values are real and cannot simply be tossed out, but because 0.0 is a single value, it results in a major spike in the data distribution. This can adversely affect many common modeling techniques. The graphs below are from my data set of organic matter in Oregon soil pedons and have this zero-value complication:
At left, the full data set with an enormous spike in the first bar (caused by all the zero values). At right, the same data set with the almost 400 zero-value data points removed. It’s still right skewed, but looks like a more reasonable distribution spread.
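A quick way to quantify this kind of spike is to compute the share of zeros and the skewness with and without them. The sketch below uses only the Python standard library; the function names and the input list are invented for illustration and are not the soil data.

```python
def skewness(values):
    """Population (Fisher-Pearson) skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def zero_inflation_summary(values):
    """Report the fraction of zeros and skewness with and without them."""
    nonzero = [v for v in values if v != 0]
    return {
        "zero_fraction": 1 - len(nonzero) / len(values),
        "skew_all": skewness(values),
        "skew_nonzero": skewness(nonzero) if nonzero else 0.0,
    }


# Invented values: six zeros plus a right-skewed tail.
summary = zero_inflation_summary([0, 0, 0, 0, 0, 0, 2, 3, 4, 11])
```

A positive skewness in both entries would match the description above: removing the zeros removes the spike but leaves a right-skewed distribution.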
I ran each data set through both cluster analysis tools and visually compared results. There were noticeable differences between all four maps. In this “spot the difference” exercise, some major changes included a hot spot in the northeast that appears in the full data only, a Willamette Valley that ranges anywhere from cold to hot depending on the map, and a stronger east-west trend in the no-zero data sets.
Discussion
Clearly, having a large number of zero values in one’s dataset will skew cluster analysis results. This was to be expected. In my data, the zero values were enough to overwhelm the generally “warm” signature in the Willamette Valley. Also, the values in the northeast were “significantly large” when balanced against a large zero data set, but “not significant” in the more balanced no-zero data set.
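The mechanism can be illustrated with a toy calculation: both statistics compare local values against a global baseline, and a pile of zeros drags that baseline down. The sketch below uses invented numbers, not the soil data, and a plain standardized score as a stand-in for the tools' z-scores.

```python
import math


def standardized(value, values):
    """Standardized score of `value` against the mean and population
    standard deviation of `values`."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return (value - mean) / std if std > 0 else 0.0


nonzero = [2, 3, 4, 5, 6]      # invented non-zero measurements
full = [0] * 5 + nonzero       # same measurements plus five zeros

z_full = standardized(6, full)         # baseline pulled down by the zeros
z_nonzero = standardized(6, nonzero)   # baseline from non-zero values only
```

In this toy case the value 6 scores higher against the zero-inflated baseline than against the no-zero one, which is the same direction of effect as the northeast hot spot appearing only in the full data set. Other data could behave differently, since the zeros also inflate the spread.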
Regarding the Hot Spot Analysis versus the Cluster and Outlier Analysis, for this data set the Cluster and Outlier tool resulted in more stringent (smaller) hot and cold spots. Also, outlier values of cold in hot spots and hot in cold spots can be seen in the Cluster and Outlier results, but they vanish and in fact take the same label as the surrounding data in the Hot Spot Analysis (see below).
At left, a close-up of southwest Oregon in the Hot Spot analysis of the no-zero dataset. At right, the Cluster and Outlier analysis of the same area. Note the two dots in the middle that are classified as “hot” in the left image but “low, surrounded by high” in the right image.
Conclusion
I would be very wary of using these tools with zero-inflated and skewed data. This study was crude in that I completely removed zero values rather than recasting them as low (but not zero) values. Nonetheless, there were drastic differences between all four maps in a large-sample dataset (more than 1,000 points).
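One possible way to recast zeros, sketched below in Python, is to replace each zero with a fraction of the smallest positive value in the data. The function name and the default fraction of one half are arbitrary assumptions for illustration; this was not tested in the study, and the right substitute value would depend on the measurement method.

```python
def recast_zeros(values, fraction=0.5):
    """Replace zeros with `fraction` times the smallest positive value.

    `fraction` is a hypothetical tuning choice, not a recommended constant.
    """
    positives = [v for v in values if v > 0]
    if not positives:
        raise ValueError("no positive values to derive a substitute from")
    floor = min(positives) * fraction
    return [floor if v == 0 else v for v in values]


recast = recast_zeros([0, 2, 4, 0])  # zeros become 1.0, half the smallest positive
```

Note that all recast zeros still share one value, so the spike in the distribution moves rather than disappears.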
Between Hot Spot and Cluster and Outlier, for this data I would prefer to use the Cluster and Outlier tool. The same general hot spot pattern is maintained between the two. It seems there is a trade-off between smoothness and information regarding outliers within hot and cold spots. For scientific mapping purposes it would seem very important to be aware of those outliers.