For this week's assignment, we were tasked with exploring our datasets using some basic exploratory spatial statistics tools from ArcGIS (average nearest neighbor and/or spatial autocorrelation, and hot spot analysis). Since my underlying problem is to interpret subsurface geologic characteristics throughout the Northern Gulf of Mexico in order to fill spatial gaps, I need to understand both the distribution of my sampling points (n=13625) and the spatial distribution of the subsurface geologic characteristics associated with each sampling point, such as average porosity, initial temperature (°F), and initial pressure (psi). To get at the distribution of my sampling points, I used average nearest neighbor to identify whether they tended to be clustered, random, or dispersed. Results (Table 1) showed that my sampling points were significantly clustered, which agrees with the pattern observed visually (Figure 1).

Table 1. Resulting z score and p value from average nearest neighbor and spatial autocorrelation tests for entire sampling data and subsampled datasets
Figure 1. Location of sampling points (boreholes; n=13625) throughout the Northern Gulf of Mexico

 

However, since I know my sampling points are clustered, how could I be sure that the spatial pattern of the subsurface geologic characteristics wouldn’t just reflect the clustered sampling distribution? To check, I subsampled my data points, first to a smaller geographic area, the Mississippi Canyon Outer Continental Shelf (OCS) lease block (n=397; Figure 2A), and then further subsampled those points (using the Create Random Points tool in ArcGIS) to select sets of data points (n=50) with clustered, random, and dispersed spatial distributions (determined using average nearest neighbor; Figure 2B). Then, I ran the spatial autocorrelation tool for each subsample (all Mississippi Canyon, and the clustered, random, and dispersed samples within Mississippi Canyon), which showed that regardless of the distribution of my sampling points, the values for temperature, pressure, and porosity are significantly clustered (Table 1).

The next spatial statistics tool we were asked to test was the hot spot analysis tool. I ran this tool on the temperature values for the Mississippi Canyon (n=367) data subsample to identify whether there are significant spatial clusters of high and low temperature values. Results show (Figure 2C) that there are significant clusters of high temperatures (red dots and blue triangles) and low temperatures (blue dots and blue triangles).

Now, the next step is to begin exploring the relationships between different subsurface geologic characteristics and different environmental conditions, such as water depth, subsurface depth, geologic age, etc., to identify any correlations that can be used to help fill in spatial gaps of subsurface geologic characteristics throughout the Northern Gulf of Mexico.

Figure 2. Location of the subsampled data points in Mississippi Canyon (n=367; A), the subselected data samples with clustered, random, and dispersed distributions (n=50; B), and the results of the hot spot analysis tool for temperatures from the Mississippi Canyon subsample dataset (n=367; C).

One of my spatial problems is examining the spatial distribution of mitigated wetlands in the Willamette Valley to assess the quality of the locations chosen for restoration. The data set I used to test the hot spot tool is a point file of wetland mitigation sites (i.e., sites that have been restored or created to compensate for intentional disturbance elsewhere).

The mitigation data look clustered when examined visually, and average nearest neighbor confirms this hypothesis.

It seems intuitive that wetlands would be clustered towards streams, so I ran average nearest neighbor on the valley’s streams to examine their spatial distribution. This showed that the streams are less clustered than the mitigated wetlands, indicating that there are other factors that explain the locations of mitigated wetland sites.

Categorical data is largely unusable in the spatial statistics toolbox. However, I wanted to examine the spatial distribution of mitigated wetlands compared to historic vegetation cover. To work around the categorical data, I first created a layer that contained only historic wetland vegetation; I then ran the “Near” tool to calculate the distance between each mitigated wetland and the nearest historic wetland polygon. Lastly, I ran the hot spot analysis on this distance.
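To make the workaround concrete, here is a minimal sketch of the kind of distance the Near step produces: the planar distance from a point to the closest polygon, taken as zero when the point falls inside one. This is not the ArcGIS implementation. The function names and coordinates are hypothetical, polygons are assumed to be simple rings without holes, and coordinates are assumed to be in a projected (planar) system.

```python
import math

def point_segment_distance(p, a, b):
    """Shortest planar distance from point p to the segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: both ends are the same vertex
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def point_in_polygon(p, ring):
    """Ray-casting test; ring is a list of (x, y) vertices (no holes)."""
    px, py = p
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def near_distance(p, polygons):
    """Distance from p to the nearest polygon; 0.0 if p lies inside one."""
    best = math.inf
    for ring in polygons:
        if point_in_polygon(p, ring):
            return 0.0
        for i in range(len(ring)):
            d = point_segment_distance(p, ring[i], ring[(i + 1) % len(ring)])
            best = min(best, d)
    return best

# Hypothetical historic wetland polygon and mitigation sites
historic = [[(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]]
sites = [(5.0, 5.0), (15.0, 5.0), (13.0, 14.0)]
distances = [near_distance(s, historic) for s in sites]
print(distances)
```

The resulting distances are a continuous attribute on each point, which is what lets a categorical layer (vegetation class) feed into a tool that expects numeric values.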

Red indicates increased distance from a historic wetland. The results show that since most of the valley was once floodplain wetlands, most sites are situated on historic wetlands; an area near Portland, however, shows a hot spot of mitigated wetlands that are located farther from historic wetland vegetation.

 


This tool is not relevant to my study that examines spatial variations of DOM (dissolved organic matter) and nutrients in streams, but I wanted to learn how it works, so I tried this tool. The result will tell me how clustered my sampling sites are based on the study site extent. I could visually see my sampling sites were clustered within the study site, and the result supported my visual observation. Note, I did not use all of my sampling sites.

Here are some notes:

I projected the point shapefile FIRST. The projection determines the units of my results. (I was not sure about this, but I found it was true during the discussion on Monday.)

I used EUCLIDEAN_DISTANCE.

I had to check “Generate Report” to create a graphical result, which can be accessed from the “Results” window –> HTML Report File: NearestNeighbor_Results. I could save this report as a PDF.
All the results are in the log/results window. It personally helps me to disable background processing and have an actual log window to obtain a text summary of the results.

Average Nearest Neighbor Summary
Observed Mean Distance: 4318.596812
Expected Mean Distance: 9383.554326
Nearest Neighbor Ratio: 0.460230
z-score: -4.732046
p-value: 0.000002
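
As a sanity check on how the numbers in this summary relate to each other, here is a small sketch. It assumes the nearest neighbor ratio is simply observed divided by expected mean distance, and that the reported p-value is a two-tailed probability from the standard normal distribution. The function names are my own.

```python
import math

def nearest_neighbor_ratio(observed_mean, expected_mean):
    """Ratio < 1 suggests clustering, > 1 suggests dispersion."""
    return observed_mean / expected_mean

def two_tailed_p_value(z_score):
    """Two-tailed p-value for a z-score under the standard normal."""
    return math.erfc(abs(z_score) / math.sqrt(2.0))

# Values copied from the summary above
observed = 4318.596812
expected = 9383.554326
z = -4.732046

ratio = nearest_neighbor_ratio(observed, expected)
p = two_tailed_p_value(z)

print(f"ratio = {ratio:.6f}")    # prints ratio = 0.460230
print(f"p-value = {p:.6f}")      # prints p-value = 0.000002
```

A ratio below 1 with a large negative z-score is the pattern reported for clustered points, which matches what I saw visually.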

 

Unfortunately, I could not easily find a corresponding tool in SSN and STARS. I will have to look into their documentation.

Incremental spatial autocorrelation (ISA) uses Moran’s I to test for spatial autocorrelation within a series of distance bands. The analysis is run on a given parameter (e.g., percent cover, elevation).

Interpretation

  • ISA returns a z-score and p-value for each distance band
      • A significant p-value (with a positive z-score) indicates spatial clumping
      • A non-significant p-value is consistent with random processes at work
  • The output indicates significant peaks in z-score
      • Higher z-score indicates more spatial clumping
      • Distance of first peak in z-score is usually used for further analysis
  • Useful for determining the appropriate scale for further analysis
      • Hot Spot Analysis
      • Density tools which ask for a radius
  • Can help determine whether a subsample should be taken to remove autocorrelation
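
To illustrate the idea of computing Moran’s I over a series of distance bands, here is a minimal sketch in plain Python. It assumes binary fixed-distance weights (two points are neighbors if they are within the band) and returns only the index itself. It does not compute the z-scores and p-values the ArcGIS tool reports. All names and sample values are hypothetical.

```python
import math

def morans_i(points, values, band):
    """Global Moran's I with binary weights for pairs within `band`."""
    n = len(points)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0   # sum of w_ij * dev_i * dev_j
    s0 = 0.0      # sum of all weights
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.hypot(points[i][0] - points[j][0],
                           points[i][1] - points[j][1])
            if d <= band:
                cross += dev[i] * dev[j]
                s0 += 1.0
    m2 = sum(d * d for d in dev)
    if s0 == 0.0 or m2 == 0.0:
        return None  # no neighbors at this band, or no variation in values
    return (n / s0) * (cross / m2)

def incremental_morans_i(points, values, start, step, count):
    """Moran's I at `count` distance bands beginning at `start`."""
    bands = [start + k * step for k in range(count)]
    return [(b, morans_i(points, values, b)) for b in bands]

# Hypothetical transect: values increase steadily along a line
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
vals = [1.0, 2.0, 3.0, 4.0]
for band, i_value in incremental_morans_i(pts, vals, 1.0, 1.0, 3):
    print(band, i_value)
```

In this toy example the index is positive at the shortest band (neighbors have similar values) and drops as the band widens to include every pair, which is the kind of distance dependence the ISA graph is meant to reveal.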

[Figures: AGST_Coos_lowres; AGST_ISA_graph]

 

 

EDITOR’S NOTE: The data used for this analysis contained spatially duplicated values, meaning there are many instances of two separate measurements taken from the same site. These values might be temporally different (same site measured more than once on different dates), distinct but adjacent (different samples from the same field, close enough to have the same lat/long within rounding error), or true duplicates. The data were not cleaned for these errors prior to analysis.

Introduction

This week I fiddled around with the two Mapping Clusters tools in the Spatial Statistics toolbox. Here’s a rundown of each tool:

Cluster and Outlier Analysis

Given a set of weighted features, identifies statistically significant hot spots, cold spots, and spatial outliers using the Anselin Local Moran’s I statistic.

Hot Spot Analysis

Given a set of weighted features, identifies statistically significant hot spots and cold spots using the Getis-Ord Gi* statistic.
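
For anyone curious what the Gi* statistic looks like in practice, here is a minimal sketch. It assumes binary fixed-distance weights and includes each feature in its own neighborhood (the "star" in Gi*). It returns z-scores only, with no significance testing or multiple-testing correction, so it is an illustration of the formula rather than a stand-in for the tool. Names and sample data are hypothetical.

```python
import math

def getis_ord_gi_star(points, values, band):
    """Gi* z-score per point, binary weights within `band` (self included)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for i in range(n):
        w_sum = 0.0    # sum of weights (equals sum of squared weights here)
        wx_sum = 0.0   # weighted sum of neighboring values
        for j in range(n):
            d = math.hypot(points[i][0] - points[j][0],
                           points[i][1] - points[j][1])
            if d <= band:          # j == i passes, so the point counts itself
                w_sum += 1.0
                wx_sum += values[j]
        denom = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores.append((wx_sum - mean * w_sum) / denom if denom > 0 else 0.0)
    return scores

# Hypothetical transect with high values at one end
pts = [(float(x), 0.0) for x in range(5)]
vals = [10.0, 10.0, 0.0, 0.0, 0.0]
print([round(g, 3) for g in getis_ord_gi_star(pts, vals, 1.0)])
# prints [2.0, 1.333, -0.333, -2.0, -1.333]
```

Because each score is built from a neighborhood sum, a feature’s own value is always blended with its neighbors’ values.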

Methodology

I wanted to challenge each of these tools to see if they could cope with a common natural resources data phenomenon: zero-value inflation. The problem of a large proportion of zero values is common with data obtained from ecological studies involving counts of abundance, presence–absence, or occupancy rates. The zero values are real and cannot simply be tossed out, but because 0.0 is a singular value, it results in a major spike in the data distribution. This can adversely affect many common modeling techniques. The graphs below are from my data set of organic matter in Oregon soil pedons and have this zero-value complication:

                                                                             

At left, the full data set with an enormous spike in the first bar (caused by all the zero values). At right, the same data set with the almost 400 zero-value data points removed. It’s still right skewed, but looks like a more reasonable distribution spread.

I ran each data set through both cluster analysis tools and visually compared the results. There were noticeable differences between all four maps. In this “spot the difference” exercise, some major changes included a hot spot in the northeast in the full data only, a Willamette Valley that ranges everywhere from cold to hot, and a bigger east-west trend in the no-zero data sets.

Discussion

Clearly, having a large number of zero-values in one’s dataset will skew cluster analysis results. This was to be expected. In my data, the zero values were enough to overwhelm the generally “warm” signature in the Willamette Valley. Also, the values in the north-east were “significantly large” when balanced against a large zero data set, but “not significant” in the more balanced no-zero data set.

Regarding the Hot Spot Analysis versus the Cluster and Outlier Analysis, for this data set the Cluster and Outlier tool resulted in more stringent (smaller) hot and cold spots. Also, outlier values of cold in hot spots and hot in cold spots can be seen in the Cluster and Outlier results, but they vanish and in fact take the same label as the surrounding data in the Hot Spot Analysis (see below).

                                                          

At left, a close-up of southwest Oregon in the Hot Spot analysis of the no-zero dataset. At right, the Cluster and Outlier analysis of the same area. Note the two dots in the middle that are classified as “hot” in the left image but “low, surrounded by high” in the right image.
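
The “low, surrounded by high” label comes from the sign logic of the local Moran’s I statistic, and a small sketch may help show why. This one assumes binary fixed-distance weights, excludes each point from its own neighborhood, and uses a simple variance scaling that may differ from the ArcGIS formula. It assigns labels from signs alone, with no significance test. Names and data are hypothetical.

```python
import math

def local_morans_i(points, values, band):
    """Return (I_i, label) per point using neighbors within `band`."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    m2 = sum(d * d for d in dev) / n   # variance used for scaling
    results = []
    for i in range(n):
        lag = 0.0   # sum of neighbor deviations, excluding the point itself
        for j in range(n):
            if j == i:
                continue
            d = math.hypot(points[i][0] - points[j][0],
                           points[i][1] - points[j][1])
            if d <= band:
                lag += dev[j]
        i_local = dev[i] * lag / m2 if m2 > 0 else 0.0
        if dev[i] > 0 and lag > 0:
            label = "HH"   # high value among high neighbors
        elif dev[i] < 0 and lag < 0:
            label = "LL"   # low value among low neighbors
        elif dev[i] > 0 and lag < 0:
            label = "HL"   # high outlier surrounded by low
        elif dev[i] < 0 and lag > 0:
            label = "LH"   # low outlier surrounded by high
        else:
            label = "none"
        results.append((i_local, label))
    return results

# Hypothetical transect: one low value sitting inside a run of high values
pts = [(float(x), 0.0) for x in range(10)]
vals = [10.0, 10.0, 1.0, 10.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0]
for x, (i_local, label) in enumerate(local_morans_i(pts, vals, 1.0)):
    print(x, round(i_local, 3), label)
```

The point with value 1.0 gets a negative local I and the “LH” label because its own deviation and its neighbors’ deviations have opposite signs. A statistic built from a neighborhood sum that includes the point itself would instead be pulled toward the surrounding high values, which is consistent with the smoothing seen in the Hot Spot map.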

Conclusion

I would be very wary of using these tools with zero-inflated and skewed data. This study was crude in that I completely removed zero values rather than re-casting them as low (but not zero) values. Nonetheless, there were drastic differences between all four maps in a large-sample (more than 1,000 points) dataset.

Between Hot Spot or Cluster and Outlier, for this data I would prefer to use the Cluster and Outlier data set. The same general hot spot pattern is maintained between the two. It seems there is a trade-off between smoothness and information regarding outliers within hot and cold spots. For scientific mapping purposes it would seem very important to be aware of those outliers.

 

The following is the abstract of the paper I presented earlier this month at AAG:

The specific geography of individual wine growing regions has long been understood to be a significant factor in predicting both a region’s success in producing high quality grapes, and the resulting demand for wines produced from that region’s fruit. In the American wine industry, American Viticultural Areas (AVAs) are increasingly being used to designate a uniqueness and specificity of place. This process is often predicated on the argument that these areas represent a certain degree of physiographic uniformity or homogeneity. This is particularly the case with regard to the phenomenon of sub-AVAs, wherein smaller areas within large, spatially heterogeneous AVAs seek to differentiate themselves based on the physiographic features that are purportedly unique to those smaller subregions. In many cases, there is a strong correlation between soil classes and AVA boundaries, whereas in other cases the correlation is not as strong. This suggests that there are factors other than physiographic homogeneity contributing to the designation of these sub-AVAs. This study employs GIS and spatial analysis to examine and potentially correlate the soil classes of Oregon’s northern Willamette Valley with the sub-AVAs in that area. In doing so, this study presents maps and statistical results in order to provide a quantitative summary of the geographic context of vineyards in this region with respect to both the soil classes present and the federally designated AVA boundaries in which they are located.

 

About my data and my spatial problem:

The data set that I am working with is a legacy Natural Resources Conservation Service (NRCS) data set detailing soil classes throughout Oregon’s Willamette Valley. Using metes and bounds descriptions provided by the United States Department of the Treasury’s Alcohol and Tobacco Tax and Trade Bureau (TTB), the Federal entity tasked with approving AVA designation petitions, I have generated a series of polygons representing the Willamette Valley AVAs (Willamette Valley and its 6 sub-AVAs: Chehalem Mountains, Ribbon Ridge, Dundee Hills, Yamhill-Carlton, McMinnville, and Eola-Amity Hills). I also have a handful of raster data layers (slope, aspect, landform, lithology, and PRISM) that I am using to calculate zonal statistics. Many spatial statistical methods are designed around the use of point data. This poses a problem for me because all of my data is in either a vector polygon or raster format. I am interested in exploring which methods/tools within the Spatial Statistics toolbox are most appropriate for use with my data. I am also interested in getting feedback from others in this course so as to make my research more robust, defensible, and statistically sound.

-Doug

The main problem that I face with my humpback whale sighting data is that field efforts were not random and sightings reflect locations of predictable habitat use rather than sightings along survey transects. When asked to run a nearest neighbor analysis, Julia and I thought it might be neat to run the identical analysis at three different spatial scales in order to see how the results differ. I made three independent shapefiles, one for each spatial scale, and ran the analysis for 1) all of southeastern Alaska (SEAK), 2) just Glacier Bay and Icy Strait, and 3) Point Adolphus.

These were the results for the nearest neighbor (NN) analyses:

SEAK (largest extent)

  • Expected NN: 3795.7 m
  • Observed NN: 485.9 m

Glacier Bay/Icy Strait (medium extent)

  • Expected NN: 1174.3 m
  • Observed NN: 353.6 m

Point Adolphus (smallest extent)

  • Expected NN: 366.2 m
  • Observed NN: 137.4 m

We can see that as we get down to a smaller spatial scale, the expected value becomes more similar to the observed value. This is expected, since the geography of the entire SEAK extent is no longer getting between groups of whales. Also, the distribution of whales becomes more even at smaller scales.

Next, I ran a hot spot analysis of humpback whale sightings in Glacier Bay/Icy Strait. A layer of bathymetry was downloaded from the GEBCO “British Oceanographic Data Center” and I extracted raster values to each humpback whale sighting (first figure, below – green dots are deeper, red are more shallow). In the second figure below, red dots indicate significant clusters of high values (depths) and blue indicate significant clusters of whales at shallower depths.

Depths at humpback whale sightings.
Results from my hot spot analysis.

Notes from today’s class discussion: How do we actually calculate the expected nearest neighbor value? This analysis is scale dependent. We must consider the spatial extent that goes into the calculation. Kate discovered that there is an option for defining the area of this analysis when you run the tool. The default is to use the extent of your feature class. Since I created a new feature class for each of the three analyses above, those different spatial scales are included in the output value for expected nearest neighbor distance. We must be sure to keep this methodology in mind when doing these calculations.
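
To make the scale dependence concrete, here is a minimal sketch of the average nearest neighbor calculation with the study area passed in explicitly. It assumes the commonly documented formulas: expected mean distance = 0.5 / sqrt(n / A) and standard error = 0.26136 / sqrt(n² / A). It uses planar coordinates and a brute-force neighbor search. The function names and sample points are hypothetical, and `bounding_box_area` is only a stand-in for "the extent of your feature class".

```python
import math

def bounding_box_area(points):
    """Area of the axis-aligned rectangle enclosing the points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

def average_nearest_neighbor(points, area):
    """Observed/expected mean NN distance, ratio, and z-score."""
    n = len(points)
    if n < 2 or area <= 0:
        raise ValueError("need at least two points and a positive area")
    total = 0.0
    for i in range(n):
        nearest = math.inf
        for j in range(n):
            if i != j:
                d = math.hypot(points[i][0] - points[j][0],
                               points[i][1] - points[j][1])
                nearest = min(nearest, d)
        total += nearest
    observed = total / n
    expected = 0.5 / math.sqrt(n / area)          # depends on area
    std_error = 0.26136 / math.sqrt(n * n / area)  # depends on area
    return {
        "observed": observed,
        "expected": expected,
        "ratio": observed / expected,
        "z": (observed - expected) / std_error,
    }

# Same four hypothetical sightings, evaluated against two study areas
sightings = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
for area in (4.0, 16.0):
    result = average_nearest_neighbor(sightings, area)
    print(area, result["expected"], result["ratio"])
```

The observed distance never changes, but quadrupling the area doubles the expected distance and halves the ratio. The same set of points can therefore look more clustered simply because a larger extent was used.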

A potential next step for me is to run my hot spot analysis on more localized scales. The default spatial extent for the analysis I ran appears to be too expansive to really show fine-scale patterns. I also want to start looking at hot spot analyses based on mitochondrial DNA haplotypes.

 

For students in GEO 599: If you have any questions about the tools in the Spatial Statistics Toolbox, please reply to this blog post or email Dr. Lauren Scott at Esri directly at Lscott@esri.com.

In addition, you may contact Lauren, the creator of all the tools, through the Esri forums. She is very helpful.

  1. Go to Esri forums at http://forums.arcgis.com/
  2. Sign in using your global account
  3. Search for Lauren Scott
  4. Click on her profile
  5. You will get an option to contact her directly

For the spatial statistics forum, see: http://forums.arcgis.com/forums/110-Spatial-Statistics?sort=lastpost&order=desc

For Dr. Lauren Scott’s threads, search on “Lauren Scott”

For my research, I use annual Landsat satellite images to view the Willamette River to examine disturbances and loss in the valley’s wetlands. I also have several other critical data sets including a LiDAR inundation raster based on a 2 year flood return interval and several shapefiles showing the location of mitigation wetlands. One of the spatial problems I’d like to investigate in this class involves relating the data sets to each other; one of the ecological questions I’m asking through my research involves investigating the spatial distribution of wetlands created and restored through mitigation versus those destroyed and disturbed. For example, do the two differ in their proximity to the river and its tributaries? Is one more clumped/distributed than the other?

Utilizing the spatial statistics toolbox, specifically regression and mapping trends/clusters, may help me answer some of these questions.

Some of my annual satellite imagery viewed in Tasseled Cap Index:

My spatial problem:

Under what forest and climatic conditions do endemic mountain pine beetle populations switch to epidemic populations?

The null hypothesis is that conditions are random and alternative hypotheses include: 1) population eruptions are simply cyclic or periodic; 2) there are specific environmental condition triggers; 3) some combination of alternative hypotheses 1 and 2.

Mountain pine beetle survival is dependent on availability of susceptible hosts and suitable temperature range, with the primary limitation being minimum temperature.  In an endemic state, mountain pine beetles may kill several trees in a dispersed pattern, while in an epidemic state, nearly continuous, widespread host tree mortality is observed.  Population eruptions exhibit both temporal and spatial patterns over the landscape making spatial statistics a useful analysis tool.

There are two parts to the study: 1) outbreak detection and monitoring using 40 years of Landsat satellite imagery and 2) analysis of relationships between outbreak initiation and spread and forest and climate conditions.

Independent variables include:

Host availability at time of outbreak:

  • Host density
  • Host age

Topography:

  • Elevation
  • Slope
  • Aspect

Climate at and prior to time of outbreak:

  • Min and max temperature for various time periods
  • Precipitation for various time periods

Forest structure:

  • Composition
  • Management
  • Disturbance history

Each of these variables could potentially be related to outbreak timing and position through geographically weighted regression.