This tool is not relevant to my study, which examines spatial variations of DOM (dissolved organic matter) and nutrients in streams, but I wanted to learn how it works, so I tried it anyway. The result tells me how clustered my sampling sites are relative to the study site extent. I could visually see that my sampling sites were clustered within the study site, and the result supported that observation. Note that I did not use all of my sampling sites.

Here are some notes:

I projected the point shapefile FIRST. This determines the units of the results. (I was not sure about this, but I confirmed it during the discussion on Monday.)

I used EUCLIDEAN_DISTANCE.

I had to check “Generate Report” to create a graphical result, which can be accessed from the “Results” window -> HTML Report File: NearestNeighbor_Results. I could save this report as a PDF.
All the results also appear in the log/results window. Personally, it helps me to disable background processing so that an actual log window gives me a text summary of the results.

Average Nearest Neighbor Summary
Observed Mean Distance: 4318.596812
Expected Mean Distance: 9383.554326
Nearest Neighbor Ratio: 0.460230
z-score: -4.732046
p-value: 0.000002
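For reference, the same run can be scripted with arcpy. A minimal sketch, assuming a hypothetical path to my projected shapefile (parameter order per the ArcGIS 10 tool signature):

```python
import arcpy

# Hypothetical path; projecting the shapefile first is what determines
# the distance units reported below.
sites = "C:/data/sampling_sites_projected.shp"

# Average Nearest Neighbor with Euclidean distance and the HTML report.
# Omitting the optional Area argument lets the tool derive the study
# area from the extent of the feature class.
arcpy.AverageNearestNeighbor_stats(sites, "EUCLIDEAN_DISTANCE", "GENERATE_REPORT")

# Observed/expected mean distances, ratio, z-score, and p-value appear
# in the geoprocessing messages (the same text as the Results window).
print(arcpy.GetMessages())
```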


Unfortunately, I could not easily find a corresponding tool in SSN and STARS. I will have to look into their documentation.

Incremental spatial autocorrelation (ISA) uses Moran’s I to test for spatial autocorrelation within distance bands. The analysis is run on a given attribute (e.g., percent cover or elevation); a scripted sketch follows the interpretation notes below.

Interpretation

  • ISA returns a z-score and p-value for each distance band
  • A significant p-value indicates spatial clumping
  • A non-significant p-value indicates random processes at work
  • The tool flags significant peaks in the z-score
  • A higher z-score indicates more spatial clumping
  • The distance of the first peak in the z-score is usually used for further analysis
  • This makes ISA useful for determining the appropriate scale for further analyses, such as:
      • Hot Spot Analysis
      • density tools that ask for a radius
      • determining whether a subsample should be taken to remove autocorrelation
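Since ISA essentially re-runs Global Moran’s I over increasing distance bands, a rough approximation can be scripted with the Spatial Autocorrelation tool. A minimal sketch, with hypothetical layer and field names and hypothetical distance bands:

```python
import arcpy

points = "C:/data/samples_projected.shp"  # hypothetical projected point layer
field = "PCT_COVER"                       # hypothetical attribute (e.g., percent cover)

# Re-run Global Moran's I over increasing fixed distance bands and watch
# where the z-score peaks; the first significant peak suggests a scale
# for further analysis.
for band in range(1000, 11000, 1000):  # 1 km to 10 km in 1 km steps
    arcpy.SpatialAutocorrelation_stats(points, field, "NO_REPORT",
                                       "FIXED_DISTANCE_BAND",
                                       "EUCLIDEAN_DISTANCE", "ROW", str(band))
    print("Distance band {0} m:".format(band))
    print(arcpy.GetMessages())
```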

[Figures: AGST_Coos_lowres; AGST_ISA_graph]


EDITOR’S NOTE: The data used for this analysis contained spatially duplicated values, meaning there are many instances of two separate measurements taken from the same site. These values might be temporally different (the same site measured more than once on different dates), distinct but adjacent (different samples from the same field, close enough to have the same lat/long within rounding error), or true duplicates. The data were not cleaned for these errors prior to analysis.

Introduction

This week I fiddled around with the two Mapping Clusters tools in the Spatial Statistics toolbox. Here’s a rundown of each tool:

Cluster and Outlier Analysis

Given a set of weighted features, identifies statistically significant hot spots, cold spots, and spatial outliers using the Anselin Local Moran’s I statistic.

Hot Spot Analysis

Given a set of weighted features, identifies statistically significant hot spots and cold spots using the Getis-Ord Gi* statistic.

Methodology

I wanted to challenge each of these tools to see if they could cope with a common natural resources data phenomenon: zero-value inflation. The problem of a large proportion of zero values is common with data obtained from ecological studies involving counts of abundance, presence–absence, or occupancy rates. The zero values are real and cannot simply be tossed out, but because 0.0 is a singular value, it results in a major spike in the data distribution. This can adversely affect many common modeling techniques. The graphs below are from my data set of organic matter in Oregon soil pedons and have this zero-value complication:

[Figure: histograms of the full and zero-removed datasets]

At left, the full data set, with an enormous spike in the first bar caused by all the zero values. At right, the same data set with the almost 400 zero-value data points removed. It is still right-skewed, but looks like a more reasonable distribution.

I ran each data set through both cluster analysis tools and visually compared the results. There were noticeable differences between all four maps. In this “spot the difference” exercise, some major changes included: a hot spot in the northeast present in the full data only; the Willamette Valley shifting everywhere from cold to hot; and a stronger east-west trend in the no-zero data sets.
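For reference, the four runs can be scripted. A hedged sketch, assuming a hypothetical organic-matter field named OM and inverse-distance weighting (the exact parameters would need to match my actual runs):

```python
import arcpy

pedons = "C:/data/oregon_pedons.shp"  # hypothetical path to the full dataset
nozero = "C:/data/pedons_nozero.shp"

# Build the no-zero dataset by selecting out the ~400 zero-value points.
arcpy.Select_analysis(pedons, nozero, '"OM" > 0')

for data, tag in [(pedons, "full"), (nozero, "nozero")]:
    # Hot Spot Analysis (Getis-Ord Gi*)
    arcpy.HotSpots_stats(data, "OM", "C:/data/hotspot_{0}.shp".format(tag),
                         "INVERSE_DISTANCE", "EUCLIDEAN_DISTANCE", "NONE")
    # Cluster and Outlier Analysis (Anselin Local Moran's I)
    arcpy.ClustersOutliers_stats(data, "OM", "C:/data/moran_{0}.shp".format(tag),
                                 "INVERSE_DISTANCE", "EUCLIDEAN_DISTANCE", "NONE")
```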

Discussion

Clearly, having a large number of zero values in one’s dataset will skew cluster analysis results. This was to be expected. In my data, the zero values were enough to overwhelm the generally “warm” signature in the Willamette Valley. Also, the values in the northeast were “significantly large” when balanced against a large zero data set, but “not significant” in the more balanced no-zero data set.

Regarding Hot Spot Analysis versus Cluster and Outlier Analysis, for this data set the Cluster and Outlier tool produced more stringent (smaller) hot and cold spots. Also, outliers of cold within hot spots and hot within cold spots can be seen in the Cluster and Outlier results, but they vanish, and in fact take the same label as the surrounding data, in the Hot Spot Analysis (see below).

[Figure: close-up comparison of the two analyses in southwest Oregon]

At left, a close-up of southwest Oregon in the Hot Spot analysis of the no-zero dataset. At right, the Cluster and Outlier analysis of the same area. Note the two dots in the middle that are classified as “hot” in the left image but “low, surrounded by high” in the right image.

Conclusion

I would be very wary of using these tools with zero-inflated and skewed data. This study was crude in that I completely removed zero values rather than re-casting them as low (but not zero) values. Nonetheless, there were drastic differences between all four maps in a large-sample (more than 1,000 points) dataset.

Between Hot Spot Analysis and Cluster and Outlier Analysis, for this data I would prefer the Cluster and Outlier results. The same general hot spot pattern is maintained between the two, so it seems there is a trade-off between smoothness and information about outliers within hot and cold spots. For scientific mapping purposes, it would seem very important to be aware of those outliers.


The main problem I face with my humpback whale sighting data is that field efforts were not random, so sightings reflect locations of predictable habitat use rather than encounters along survey transects. When asked to run a nearest neighbor analysis, Julia and I thought it might be neat to run the identical analysis at three different spatial scales to see how the results differ. I made an independent shapefile for each spatial scale and ran the analysis for 1) all of southeastern Alaska, 2) just Glacier Bay and Icy Strait, and 3) Point Adolphus.

These were the results for the nearest neighbor (NN) analyses:

SEAK (largest extent)

  • Expected NN: 3795.7 m
  • Observed NN: 485.9 m

Glacier Bay/Icy Strait (medium extent)

  • Expected NN: 1174.3 m
  • Observed NN: 353.6 m

Point Adolphus (smallest extent)

  • Expected NN: 366.2 m
  • Observed NN: 137.4 m

We can see that as we move down to a smaller spatial scale, the expected value becomes more similar to the observed value. This is expected, since the geography of the entire SEAK extent is no longer getting between groups of whales. Also, the distribution of whales becomes more even.

Next, I ran a hot spot analysis of humpback whale sightings in Glacier Bay/Icy Strait. A layer of bathymetry was downloaded from the GEBCO “British Oceanographic Data Centre” and I extracted raster values to each humpback whale sighting (first figure below; green dots are deeper, red are shallower). In the second figure below, red dots indicate significant clusters of sightings at greater depths and blue dots indicate significant clusters at shallower depths.

Depths at humpback whale sightings.
Results from my hot spot analysis.
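A hedged sketch of that workflow with arcpy (file names are placeholders; ExtractValuesToPoints requires the Spatial Analyst extension):

```python
import arcpy
from arcpy.sa import ExtractValuesToPoints

arcpy.CheckOutExtension("Spatial")

# Attach the GEBCO bathymetry value at each whale sighting; the output
# points carry the depth in a new RASTERVALU field.
ExtractValuesToPoints("C:/data/sightings_gbis.shp",
                      "C:/data/gebco_bathymetry.tif",
                      "C:/data/sightings_depth.shp")

# Hot Spot Analysis (Getis-Ord Gi*) on the extracted depths: high
# z-scores flag clusters of deep sightings, low z-scores shallow ones.
arcpy.HotSpots_stats("C:/data/sightings_depth.shp", "RASTERVALU",
                     "C:/data/sightings_hotspots.shp",
                     "FIXED_DISTANCE_BAND", "EUCLIDEAN_DISTANCE", "NONE")
```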

Notes from today’s class discussion: How do we actually calculate the expected nearest neighbor value? This analysis is scale dependent. We must consider the spatial extent that goes into the calculation. Kate discovered that there is an option for defining the area of this analysis when you run the tool. The default is to use the extent of your feature class. Since I created a new feature class for each of the three analyses above, those different spatial scales are included in the output value for expected nearest neighbor distance. We must be sure to keep this methodology in mind when doing these calculations.
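To make that concrete: the expected mean distance under complete spatial randomness is 0.5 / sqrt(n / A), so the study area A enters the calculation directly. A quick sketch with made-up numbers shows how a smaller extent shrinks the expected distance:

```python
import math

def expected_nn(n_points, area):
    """Expected mean nearest-neighbor distance under complete spatial
    randomness: 0.5 / sqrt(n / A)."""
    return 0.5 / math.sqrt(n_points / float(area))

# The same 200 hypothetical sightings over three hypothetical extents (m^2):
for label, area in [("SEAK", 1.2e11),
                    ("Glacier Bay/Icy Strait", 1.1e9),
                    ("Point Adolphus", 1.1e8)]:
    print("{0}: expected NN = {1:.1f} m".format(label, expected_nn(200, area)))
```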

A potential next step for me is to run my hot spot analysis on more localized scales. The default spatial extent for the analysis I ran appears to be too expansive to really show fine-scale patterns. I also want to start looking at hot spot analyses based on mitochondrial DNA haplotypes.


As I was exploring the Spatial Statistics Resources web page, I quickly realized most of the spatial statistical tools offered by ESRI are not applicable to my project. My project explores spatial and temporal variations of water quality (dissolved organic carbon sources, to be precise) in rivers of the Willamette River Basin. Those ESRI spatial statistical tools are not applicable to my project because 1) my points do not represent actual observations of organisms or diseases but rather water quality sampling locations that I selected, and 2) not only Euclidean distance but also in-stream distance, flow direction, and the stream network affect statistical significance.

I found add-in toolboxes for SSN & STARS and FLoWS that address the two issues mentioned above. These toolboxes were developed by the U.S. Forest Service (USFS). Unfortunately, the currently available toolboxes are for ArcGIS 9.3, but the USFS states it plans to publish new toolboxes for ArcGIS 10 later this year.

http://blogs.esri.com/esri/arcgis/2013/01/29/ssn-stars-tools-for-spatial-statistical-modeling-on-stream-networks/

Things I would like to accomplish by the next class period are to 1) download those two toolboxes and 2) see if they work with ArcGIS 10. Note that I am not planning on publishing data modified with toolboxes developed for ArcGIS 9.3; rather, these goals will help me explore what kinds of tools are available through these toolboxes and learn the concepts behind the tools I am interested in using.

As a general introduction to what I can expect from spatial statistics I searched for a webpage that would define what spatial statistics are, what kinds of questions they can answer, and how they are different from a-spatial statistics.  I found a document entitled “Understanding Spatial Statistics in ArcGIS 9” (http://www.utsa.edu/lrsg/Teaching/EES6513/ESRI_ws_SpatialStatsSlides.pdf) that answers these questions.

The document begins by answering the question “What are spatial statistics?”  The author defines them as “exploratory tools that help you measure spatial processes, spatial distributions, and spatial relationships.”

There are two categories of spatial measurements:

1) Identifying characteristics of a distribution. This first category of measurements is descriptive and answers questions like: where is the center, and how are the features distributed around it?

2) Quantifying geographic pattern, i.e., are the data random, clustered, or evenly dispersed?

Spatial statistics are different from a-spatial or non-spatial statistics in that spatial statistics include some measure of space in their mathematics. In most cases, neighboring observations are considered in the statistics for a focal observation or a global measurement.

The document describes a few examples of problems or questions addressed using spatial statistics available in ArcGIS:

1) How does the distribution of Dengue Fever for a village in India change during the first three weeks after the outbreak?

2) Does bobcat movement between preferred habitat areas coincide with natural land features such as valleys, rivers, or ridgelines?

3) Are there persistent areas in the United States where people are either dying earlier, or living longer, than the average American?

Areas of interest

1. Using ModelBuilder to manage data downloaded from the Internet.

http://www.arcgis.com/home/item.html?id=7180ba6e9d8845128eaadf70a4b6bf7e

This tutorial piqued my interest because my data will come from a variety of sources. I will likely encounter a variety of formatting, labeling, and quality differences among datasets so standardizing the process would be beneficial. This tutorial illustrates some of the pertinent considerations, such as no spaces in field names, when importing data into ArcGIS as well as how to use ModelBuilder to plan and automate tasks.

2. Using R in ArcGIS 10.

http://www.arcgis.com/home/item.html?id=a5736544d97a4544aa47d06baf910f6d

Extending ArcGIS with R – presentation from the 2010 Users Conference

http://www.arcgis.com/home/item.html?id=547085ee428f4141b2cacb338f8f61a3

Since ArcGIS can experience limited functionality working with large datasets and spatial statistics needs can extend beyond its capabilities, being able to integrate with software that is more capable, such as R, could be very useful.

To Do:

  1. I am still early in my thesis development but one of the things that I would like to investigate is habitat use of melon-headed whales around French Polynesia and compare that to habitat use around other islands. I would like to continue to investigate the spatial statistics tools that are out there and see what the best approach will be for my project.
  2. I am also interested in looking at spatial distributions of small cetaceans in the Pacific and test for relationships between these distributions and the presence or absence of melon-headed whales. So again, investigation into the spatial statistics relevant to this type of study is on the to do list.

Regression analysis can help you dive deeper into spatial relationships and the factors behind spatial patterns. At a slightly more advanced level, regression analysis can help you make predictions based on your data. The ArcGIS Resource Center has a very nice page called “Regression Analysis Basics” that gives users an introduction to both regression and the related tools available. It covers the different components of models, such as dependent and independent variables and regression coefficients. One of my favorite parts of the page is the table “Common regression problems, consequences, and solutions”. This lists problems and links to solutions that could potentially make your regression model stronger. Even if your skill set is beyond the basics of regression analysis, this page is a good refresher and an introduction to how Arc can aid in telling a story.

Another helpful page is titled “What they don’t tell you about regression analysis”. Whatever you are trying to model is likely a complex phenomenon (especially in this class) and may not have a simple set of answers. Models often need revision and Arc has created a step-by-step protocol for increasing the validity of your analysis and model; this page guides you through six questions/check-marks that you’ll want to pass before you can have confidence in your model.

In my data, for example, I have several layers that could potentially help me identify where wetlands lie within the valley; examples include elevation, hydrology (stream and flood inundation), vegetation, and soils. Often, GIS users simply stack these layers together and create polygons based on areas that contain all, or a majority of layers. This technique may be based in ecologically sound logic, but does not address the strength between layers or the degree to which one or more layers may influence (both positively and negatively) others.
A regression analysis using known areas of wetland as the dependent variable and a variety of GIS layers as explanatory variables could help me predict places where wetlands are located but may not have been mapped. Or, even better, it could help me predict where wetlands were in the past. The two pages listed above are useful in guiding me through the individual decisions I need to make when building a model, for example, choosing between Ordinary Least Squares and Geographically Weighted Regression.
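As a rough sketch of the first of those decisions, the OLS tool could be fed known wetland presence as the dependent variable. All file and field names below are hypothetical placeholders:

```python
import arcpy

# Hypothetical layer of valley cells with a wetland indicator and the
# candidate explanatory variables already joined as fields.
cells = "C:/data/valley_cells.shp"

# OLS requires a unique integer ID field.
arcpy.OrdinaryLeastSquares_stats(
    cells, "UNIQUE_ID", "C:/data/wetland_ols.shp",
    "WETLAND",                               # dependent: known wetland presence
    "ELEV;FLOOD_FREQ;VEG_CLASS;SOIL_DRAIN",  # candidate explanatory fields
    "C:/data/ols_coef.dbf", "C:/data/ols_diag.dbf")

# If the diagnostics indicate non-stationarity, Geographically Weighted
# Regression would be the natural follow-up.
```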

Take a look at the two introduction pages and consider whether your data could be used in a regression analysis and whether the tools available in the Spatial Statistics toolbox could be useful. You could even just bring three different variables (e.g., hydrology, soils, and elevation) to try out.
There are three resources to explore further if you’re interested in using your data to perform regression analysis:

  1. Lauren Scott’s presentation on regression analysis
  2. The seminar on regression analysis titled “Beyond Where: Using Regression Analysis to Explore Why”
  3. The regression analysis tutorial (the same used in Scott et al.’s presentation) where you can “Learn how to build a properly specified OLS model and improve that model using GWR, interpret regression results and diagnostics, and potentially use the results of regression analysis to design targeted interventions”


For my “first take” on the Spatial Statistics Resources blog, I learned more about the mathematical statistics contained within the tools of the Spatial Statistics toolbox. I quickly realized that the tools can be grouped by common mathematical principle. For example, all hot spot identification is found using something called the Getis-Ord Gi* statistic. Looking at the Desktop 10 Help website list of sample applications, most tools are listed with an associated mathematical statistic (usually listed in parentheses). For example:

Question: Is the data spatially correlated?

Tool: Spatial Autocorrelation (Global Moran’s I)

Some of the mathematical concepts I am fairly well acquainted with, like ordinary least squares. Others I had never heard of. The Getis-Ord statistic is one I’d never encountered before. I used one of my primary research tools, the internet, and found the statistic was developed in the mid-nineties by the method’s namesake statisticians.

Link to the 1995 paper on the Getis-Ord statistic
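For reference, the statistic (as given in ESRI’s help; transcribed here, so any error is mine) is

$$
G_i^* = \frac{\sum_{j=1}^{n} w_{i,j} x_j - \bar{X} \sum_{j=1}^{n} w_{i,j}}
{S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{i,j}^2 - \left( \sum_{j=1}^{n} w_{i,j} \right)^2}{n-1}}},
\qquad
\bar{X} = \frac{1}{n} \sum_{j=1}^{n} x_j,
\quad
S = \sqrt{\frac{1}{n} \sum_{j=1}^{n} x_j^2 - \bar{X}^2}
$$

where x_j is the attribute value of feature j, w_{i,j} is the spatial weight between features i and j, and the resulting G_i* is itself a z-score.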

But one need not always consult the internet at large. ESRI provides some explanation of each tool in various articles scattered around the Spatial Statistics folder from Desktop 10.0 Help. I’ve begun assembling a list with the link to each math principle/tool/statistics below. I would like to learn about these statistics, what their strengths and weaknesses are, and especially when it is not appropriate to use them (what are the assumptions?).

List of Mathematical Principles/Statistics Underlying the Suite of Available Spatial Statistics

Analyzing Patterns:

How Multi-Distance Spatial Cluster Analysis (Ripley’s K-function) works

How Spatial Autocorrelation (Global Moran’s I) works

How High/Low Clustering (Getis-Ord General G) works

Mapping Clusters:

How Hot Spot Analysis (Getis-Ord Gi*) works

How Cluster and Outlier Analysis (Anselin Local Moran’s I) works

Measuring Geographic Distributions:

How Directional Distribution (Standard Deviational Ellipse) works

Modeling Spatial Relationships:

Geographically Weighted Regression (GWR) (Spatial Statistics)

Ordinary Least Squares (OLS) (Spatial Statistics)


The class today discussed topics of interest within the ArcGIS Spatial Statistics toolbox using the Spatial Statistics Blog as a starting point (http://blogs.esri.com/esri/arcgis/2010/07/13/spatial-statistics-resources/).   Most students looked for concepts or tools that would be useful to their specific research needs.  For me, I was interested in the discussion surrounding modeling spatial relationships and analyzing patterns and how this might apply to the humpback whale data I am using for my own project.

Of particular interest was the “Conceptualization of Spatial Relationships” (http://help.arcgis.com/en/arcgisdesktop/10.0/help/#/Modeling_spatial_relationships/005p00000005000000/) webpage.  This concept is important for most of the tools used in the Spatial Stats toolbox and is critical for data in which there is some degree of locational uncertainty – what is the best spatial conceptualization for your data so that the tool output makes sense with your data?

Other interesting points made in class today included the discussion on regression and measuring geographic distributions.