EDITOR’S NOTE: The data used for this analysis contained spatially duplicated values, meaning there are many instances of two separate measurements taken from the same site. These values might be temporally distinct (the same site measured more than once on different dates), spatially distinct but adjacent (different samples from the same field, close enough to share the same lat/long within rounding error), or true duplicates. The data was not cleaned for these errors prior to analysis.
Introduction
This week I fiddled around with the two Mapping Clusters tools in the Spatial Statistics toolbox. Here’s a rundown of each tool:
Cluster and Outlier Analysis
Given a set of weighted features, identifies statistically significant hot spots, cold spots, and spatial outliers using the Anselin Local Moran’s I statistic.
Hot Spot Analysis
Given a set of weighted features, identifies statistically significant hot spots and cold spots using the Getis-Ord Gi* statistic.
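In case anyone wants to script this, here’s roughly how both tools are called from Python with arcpy. This is an untested sketch: “pedons”, “OM_PCT”, the geodatabase path, and the 10 km distance band are placeholder names and values, not the actual ones from my analysis.

```python
import arcpy

arcpy.env.workspace = r"C:\data\oregon_soils.gdb"  # placeholder geodatabase

# Hot Spot Analysis (Getis-Ord Gi*): hot and cold spots only.
arcpy.stats.HotSpots(
    "pedons", "OM_PCT", "pedons_hotspots",
    "FIXED_DISTANCE_BAND", "EUCLIDEAN_DISTANCE", "NONE",
    10000)  # distance band in map units; placeholder value

# Cluster and Outlier Analysis (Anselin Local Moran's I): adds outlier
# classes (high surrounded by low, low surrounded by high).
arcpy.stats.ClustersOutliers(
    "pedons", "OM_PCT", "pedons_clusters",
    "FIXED_DISTANCE_BAND", "EUCLIDEAN_DISTANCE", "NONE",
    10000)
```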
Methodology
I wanted to challenge each of these tools to see if they could cope with a common natural resources data phenomenon: zero-value inflation. The problem of a large proportion of zero values is common with data obtained from ecological studies involving counts of abundance, presence–absence, or occupancy rates. The zero values are real and cannot simply be tossed out, but because 0.0 is a singular value, it results in a major spike in the data distribution. This can adversely affect many common modeling techniques. The graphs below are from my data set of organic matter in Oregon soil pedons and have this zero-value complication:
At left, the full data set with an enormous spike in the first bar (caused by all the zero values). At right, the same data set with the almost 400 zero-value data points removed. It’s still right-skewed, but shows a more reasonable spread.
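Producing the no-zero subset is a one-liner with the Select tool (same placeholder names as in the sketch above):

```python
import arcpy

arcpy.env.workspace = r"C:\data\oregon_soils.gdb"  # placeholder geodatabase

# Keep only the non-zero organic-matter records (drops the ~400 zero values).
arcpy.analysis.Select("pedons", "pedons_nozero", "OM_PCT > 0")
```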
I ran each data set through both cluster analysis tools and visually compared the results. There were noticeable differences between all four maps. In this “spot the difference” exercise, some major changes included: a hot spot in the northeast that appears only in the full data; the Willamette Valley shifting from cold to hot nearly everywhere; and a stronger east-west trend in the no-zero maps.
Discussion
Clearly, having a large number of zero values in one’s dataset will skew cluster analysis results. This was to be expected. In my data, the zero values were enough to overwhelm the generally “warm” signature in the Willamette Valley. Also, the values in the northeast were “significantly large” when balanced against a zero-heavy data set, but “not significant” in the more balanced no-zero data set.
Comparing Hot Spot Analysis with Cluster and Outlier Analysis, for this data set the Cluster and Outlier tool produced more stringent (smaller) hot and cold spots. Also, cold outliers within hot spots and hot outliers within cold spots can be seen in the Cluster and Outlier results, but they vanish, in fact taking the same label as the surrounding data, in the Hot Spot Analysis (see below).
At left, a close-up of southwest Oregon in the Hot Spot Analysis of the no-zero dataset. At right, the Cluster and Outlier Analysis of the same area. Note the two dots in the middle that are classified as “hot” in the left image but “low, surrounded by high” in the right image.
Conclusion
I would be very wary of using these tools with zero-inflated and skewed data. This study was crude in that I completely removed zero values rather than re-casting them as low (but non-zero) values. Nonetheless, there were drastic differences between all four maps in a large-sample (more than 1,000 points) dataset.
Between Hot Spot Analysis and Cluster and Outlier Analysis, for this data I would prefer the Cluster and Outlier tool. The same general hot spot pattern is maintained between the two, and there seems to be a trade-off between smoothness and information about outliers within hot and cold spots. For scientific mapping purposes it seems very important to be aware of those outliers.
This is really fascinating! Thank you for fully exploring that spatial problem. It’s interesting to see how the results vary with and without the zero values. Eventually, I’d like to find a way to visualize genotypic data of my humpback whales and some of the loci are coded as zeros since they didn’t amplify in these specific regions of the genome. The information that you present here might help when I figure out how to incorporate genotypes into GIS.
Thanks!
Hi Max!
I love the way you systematically compare these tools with and without zeros! Some comments:
1) Skewed data is fine, because the statistics these tools compute are asymptotically normal… you will want to make sure that all features have at least a few neighbors (8 is a general rule of thumb) and that no feature has almost all other features as neighbors.
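(A quick way to check that rule of thumb is the Calculate Distance Band from Neighbor Count tool. Untested sketch below; “pedons” is a placeholder feature class, and I’m assuming the derived outputs come back in minimum/average/maximum order:)

```python
import arcpy

# Distance statistics to each feature's 8th nearest neighbor.
result = arcpy.stats.CalculateDistanceBand("pedons", 8, "EUCLIDEAN_DISTANCE")
min_d, avg_d, max_d = (float(result.getOutput(i)) for i in range(3))

# A fixed band of at least max_d guarantees every feature has 8+ neighbors.
print(f"Minimum band for 8 neighbors everywhere: {max_d:.0f} map units")
```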
2) To compare apples to apples, you will want to make sure that your scale of analysis is the same for both tools. When you don’t have any other (theoretical, common sense, etc.) justification for an appropriate scale of analysis, the Incremental Spatial Autocorrelation tool can tell you whether the spatial processes promoting the pattern you are seeing are pronounced at any specific distances. I generally select the very first peak distance.
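(For example, something like this untested sketch, same placeholder names as above; the empty strings just accept the tool defaults:)

```python
import arcpy

# Scan 10 distance bands; z-score peaks in the output table suggest distances
# where spatial clustering is most pronounced.
arcpy.stats.IncrementalSpatialAutocorrelation(
    "pedons", "OM_PCT", 10,
    "", "", "", "",    # beginning distance, increment, method, standardization: defaults
    "isa_results")     # output table of distance vs. z-score
```

Then take the first peak distance from that table and pass it as the fixed distance band to both tools, so they answer the same question at the same scale.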
3) I like Fixed Distance best for these tools (though other people prefer an adaptive conceptualization: 8 nearest neighbors). The reason I like fixed distance is that it ensures the scale of your analysis is the same across the study area. In my mind, when we change the scale of our analysis (use a larger or smaller distance band) we are changing our question. Example: I want to know if access to physicians is good in a county… I look at the number of physicians and the number of people, see that the ratio is similar to the nation as a whole, and conclude people in the county have good access to physicians. Next I map physicians and people using census tracts and see that all of the physicians are in the northern part of the county and all of the people are in the southern part… I conclude people in the county do not have good access to physicians. Both answers are correct for the scale of their analysis.
4) To include zeros or not to include zeros? That is the question! 🙂 AND it very much depends on the question that you are asking. Let’s use the Hot Spot Analysis tool as an example. This tool compares the local mean (the value associated with a feature and its neighbors) to the global mean (all features in the dataset) and decides if the difference between the local and global means is large enough to be statistically significant (this is based on the variance and number of observations, for starters). Now consider this… if truly the most common value is zero, then the variance is small… and for your analysis you have lots of observations… in this case, anything that is NOT zero is statistically unlikely… these will appear as hot or cold spots. This is the correct answer to your question: where are there unexpectedly high and low values given the overall distribution of values across my study area and the fact that almost all values are zero? Compare this to the case where you remove zeros… the question is now different… you are asking: of the locations where there is at least a trace amount, where are the unexpectedly high or low concentrations? There will be fewer hot and cold spots, but this information is still very interesting!
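(To make that concrete, here is a tiny toy version of the Gi* calculation, my own numpy sketch rather than the tool itself: 96 zeros and four non-zero samples on a line, with binary neighbors-within-one-step weights.)

```python
import numpy as np

x = np.zeros(100)
x[48:52] = 5.0  # four non-zero samples in a sea of zeros
n = x.size
xbar = x.mean()
s = np.sqrt((x**2).mean() - xbar**2)  # small, because almost everything is zero

def gi_star(i, radius=1):
    """Gi* z-score (Ord & Getis form) with binary weights; the target's own value counts."""
    w = np.zeros(n)
    w[max(0, i - radius):i + radius + 1] = 1.0
    wsum = w.sum()
    num = (w * x).sum() - xbar * wsum
    den = s * np.sqrt((n * (w**2).sum() - wsum**2) / (n - 1))
    return num / den

print(gi_star(50))  # ~8.6: the non-zero patch is a screaming hot spot
print(gi_star(10))  # ~-0.4: zero neighborhoods sit near the global mean
```

With the zeros removed, both the global mean and the variance grow, so only genuinely extreme neighborhoods reach significance, which is the second question above.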
5) You indicate that you have sampled some locations more than others (meaning… same place, but more samples). When you have duplicate samples, you have more confidence in the result. Unfortunately, at present only GWR allows you to utilize this added information about the confidence you have in a result. For Hot Spot or Cluster and Outlier analysis I would probably average nearby samples. One strategy (sketched in code after this list) is to:
A) MAKE A BACKUP OF YOUR INPUT DATA (the next step WILL CHANGE YOUR GEOMETRY FOREVER).
B) Run the Integrate tool on your samples to snap nearby samples together.
C) Run the Spatial Join tool for coincident (snapped) points and compute the mean.
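(In arcpy, the whole A/B/C workflow looks roughly like this untested sketch; “pedons”, “OM_PCT”, and the 10-meter tolerance are placeholders:)

```python
import arcpy

arcpy.env.workspace = r"C:\data\oregon_soils.gdb"  # placeholder geodatabase

# A) Back up first: Integrate permanently moves geometry.
arcpy.management.CopyFeatures("pedons", "pedons_backup")

# B) Snap samples within the tolerance to a shared location.
arcpy.management.Integrate("pedons", "10 Meters")

# C) Join the snapped points to themselves, averaging OM_PCT across
#    coincident points via a Mean merge rule.
fms = arcpy.FieldMappings()
fm = arcpy.FieldMap()
fm.addInputField("pedons", "OM_PCT")
fm.mergeRule = "Mean"
fms.addFieldMap(fm)
arcpy.analysis.SpatialJoin("pedons", "pedons", "pedons_mean",
                           "JOIN_ONE_TO_ONE", "KEEP_ALL", fms,
                           "ARE_IDENTICAL_TO")

# Keep one point per snapped location.
arcpy.management.DeleteIdentical("pedons_mean", ["Shape"])
```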
6) I mentioned this in another post, but it is worth repeating… When you use the mapping cluster tools on sampled data, you do not really need to be concerned that you have sampled some places more intensely than others (the bigger concern is that your samples are not representative of the overall distribution of values). When some places are sampled more than others, Gi* (for example) has MORE information at hand to compute a result… and this is good. When there are few samples, Gi* will still compute a result and it will do its very best, but it will have less information to work with.
Again, I really like the way you think and the way you have approached your analyses here! Very best wishes with your research,
Lauren