In the geological sciences, spatial statistical analysis of gas distribution and migration through subsurface systems has been applied in only a limited number of studies, across a variety of systems: CO2 storage, hydrocarbon exploration, landfills, and natural gas storage facilities.  My study seeks to evaluate the source and mechanism of potential contaminants in groundwater systems associated with engineered-subsurface resource activities (e.g., hydrocarbon development, CO2 storage, EGS stimulation) using currently available datasets.

Specifically, this project seeks to apply geostatistical techniques in combination with spatial analysis of key datasets from a single geologic basin to evaluate the source and mechanism of gas and other potential contaminants in groundwater systems.  This project hypothesizes that larger-scale patterns in shallow methane concentrations in groundwater aquifers can be correlated to both primary migration pathways (such as wellbores or fracture networks) and the underlying volume of in situ hydrocarbon.  The general approach to this study is to identify, standardize, and integrate preexisting data from the study basin for use in geostatistical, relational, and probabilistic evaluation and interpretation.

The box model diagram below conceptually simplifies the primary systems interacting in the subsurface.  Datasets key to characterizing the flux of gas in and out of these systems, i) sources, ii) pathways, and iii) receptors, will require spatial characterization and statistical analysis in order to support predictions of areas of likely high-flux to receptors versus low-flux to receptors in relation to both natural and anthropogenic processes.

simplified box model 4 2013

My objective was to see if the displacement of the birds showed particular patterns. For this, I decided to analyze the distribution of speed and rotation angles in space. Speed at a particular point is calculated as distance to previous point over time taken to move between points. Rotation angle refers to the angle between two consecutive movement lines (i.e., lines joining point A to B and B to C).
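As a minimal sketch of the two metrics described above (not the code used for this analysis), the following Python computes them from a track. The function name `movement_metrics`, the `(x, y, t)` tuple layout, and the use of projected coordinates with numeric timestamps are assumptions made for illustration. The rotation angle is taken here as the signed change in heading between consecutive movement lines, which is one common reading of the definition above.

```python
import math


def movement_metrics(track):
    """Return (speeds, turn_angles) for a track of (x, y, t) fixes.

    speeds[i] is the distance from fix i to fix i+1 divided by the elapsed time.
    turn_angles[i] is the signed heading change, in degrees within [-180, 180),
    between the line joining fixes i and i+1 and the line joining i+1 and i+2.
    """
    speeds, headings = [], []
    for (x0, y0, t0), (x1, y1, t1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
        headings.append(math.degrees(math.atan2(y1 - y0, x1 - x0)))

    # Wrap each heading difference so that left and right turns stay comparable.
    turn_angles = [
        ((h1 - h0 + 180.0) % 360.0) - 180.0
        for h0, h1 in zip(headings, headings[1:])
    ]
    return speeds, turn_angles


# Hypothetical three-fix track: move east, then turn left and move north.
example_track = [(0.0, 0.0, 0.0), (10.0, 0.0, 5.0), (10.0, 10.0, 10.0)]
example_speeds, example_turns = movement_metrics(example_track)
```

With coordinates in meters and time in seconds, the speeds come out in meters per second; the resulting values can then be attached to the points as attributes for the spatial statistics tools.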
I first tried the Spatial Autocorrelation function, which indicated a clustered distribution of the values.

Example of output of the Spatial Autocorrelation tool applied to rotation angles.

These results weren’t meaningful for me though, as I was interested in the variability within the observations. Studies on different animal species have shown that the analysis of variability within movement patterns can be used to infer behavioral patterns. I expected the birds would show varying speeds and rotation angles in response to the habitat where they were living (e.g., move slower inside the forest and quicker between forest patches; straighter movement lines in non-forest habitat). Thus, I decided to apply the Incremental Spatial Autocorrelation function, as this tool would indicate if the spatial clustering of values varied in the study area.

The results show mixed responses from each bird, with no clear interpretation for the observed patterns.

Example of output of the Incremental Spatial Autocorrelation tool applied to speed.
Example of output of the Incremental Spatial Autocorrelation tool applied to rotation angles.

Most of the birds have non-significant z-scores, and those with significant z-scores show no clear relationship to any environmental factor.  Hot spot analyses don’t show a particular concentration of values at any point either.

Example of output of the Hot Spots tool as applied to rotation angles.

 

In conclusion, speed and rotation angles are either A) not affected by the arrangement of forest or B) poor indicators of behavioral changes associated with space use.

 

 

For this week’s assignment, we were tasked with beginning to explore our datasets with some basic exploratory spatial statistics tools from ArcGIS (average nearest neighbor and/or spatial autocorrelation, and hot spot analysis). Since my underlying problem is to interpret subsurface geologic characteristics throughout the Northern Gulf of Mexico to fill spatial gaps, I need to understand both the distribution of my sampling points (n=13625) and the spatial distribution of the subsurface geologic characteristics associated with each sampling point, such as average porosity, initial temperature (°F), and initial pressure (psi). Therefore, to get at the initial distribution of my sampling points, I used average nearest neighbor to identify whether my sampling points tended to be clustered, random, or dispersed. Results (Table 1) showed that my sampling points were significantly clustered, which agrees with the patterns observed visually (Figure 1).

Table 1. Resulting z score and p value from average nearest neighbor and spatial autocorrelation tests for entire sampling data and subsampled datasets
Figure 1. Location of sampling points (boreholes; n=13625) throughout the Northern Gulf of Mexico

 

However, since I know my sampling points are clustered, how could I be sure that the spatial pattern of the subsurface geologic characteristics wouldn’t just reflect the clustered sampling distribution? To check, I decided to subsample my data points, first to a smaller geographic area, the Mississippi Canyon Outer Continental Shelf (OCS) lease block (n=397; Figure 2A), and then to further subsample those points (using the Create Random Points tool in ArcGIS) to select sets of data points (n=50) with clustered, random, and dispersed spatial distributions (determined using average nearest neighbor; Figure 2B). Then, I ran the spatial autocorrelation tool for each subsample (all Mississippi Canyon, and the clustered, random, and dispersed samples within Mississippi Canyon), which showed that regardless of the distribution of my sampling points, the values for temperature, pressure, and porosity are significantly clustered (Table 1). The next spatial statistics tool we were asked to test with our dataset was the hot spot analysis tool. I ran this tool on temperature values for the Mississippi Canyon (n=367) data subsample to identify whether there are significant spatial clusters of high and low temperature values. Results show (Figure 2C) that there are significant clusters of high temperatures (red dots and blue triangles) and low temperatures (blue dots and blue triangles). Now, the next step is to begin exploring the relationships between different subsurface geologic characteristics and different environmental conditions, such as water depth, subsurface depth, geologic age, etc., to identify any correlations that can be used to help fill in spatial gaps of subsurface geologic characteristics throughout the Northern Gulf of Mexico.

Figure 2. Location of the subsampled data points in Mississippi Canyon (n=367; A), the subselected data samples with clustered, random, and dispersed distributions (n=50; B), and the results of the hot spot analysis tool for temperatures from the Mississippi Canyon subsample dataset (n=367; C)

One of my spatial problems is examining the spatial distribution of mitigated wetlands in the Willamette Valley to assess the quality of the locations chosen for restoration. The data set I used to test the hot spot tool is a point file of wetland mitigation sites (i.e., sites that have been restored or created based on intentional disturbance elsewhere).

The mitigation data look clustered when examined visually, and average nearest neighbor confirms this hypothesis.

It seems intuitive that wetlands would be clustered towards streams, so I ran average nearest neighbor on the valley’s streams to examine their spatial distribution. This showed that the streams are less clustered than the mitigated wetlands, indicating there are other factors that explain the locations of mitigated wetland sites.

Categorical data is largely unusable in the spatial statistics toolbox. However, I wanted to examine the spatial distribution of mitigated wetlands compared to historic vegetation cover. In order to work around the categorical data, I first created a layer that only contained historic wetland vegetation; I then ran the “near” tool to calculate distance between the mitigated wetlands and the historic wetland polygons. Lastly, I ran the hot spot analysis on this distance.
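The distance step of this workaround can be sketched outside ArcGIS. The Python below is only an illustration of the idea, not the “near” tool itself: the function names, the representation of each polygon as a list of `(x, y)` vertices without holes, and the choice to report a distance of zero for points that fall inside a polygon are all assumptions.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def point_in_polygon(p, ring):
    """Ray-casting test for a simple polygon given as a list of vertices."""
    px, py = p
    inside = False
    for i in range(len(ring)):
        (x1, y1), (x2, y2) = ring[i], ring[(i + 1) % len(ring)]
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def near_distance(p, polygons):
    """Distance from p to the nearest polygon (0.0 if p lies inside one)."""
    best = math.inf
    for ring in polygons:
        if point_in_polygon(p, ring):
            return 0.0
        for i in range(len(ring)):
            edge_dist = point_segment_distance(p, ring[i], ring[(i + 1) % len(ring)])
            best = min(best, edge_dist)
    return best


# Hypothetical data: one square "historic wetland" and three mitigation sites.
historic_wetlands = [[(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]]
mitigation_sites = [(1.0, 1.0), (5.0, 1.0), (3.0, 3.0)]
distances = [near_distance(site, historic_wetlands) for site in mitigation_sites]
```

The resulting distance becomes a continuous attribute on each point, which is what makes it usable as the analysis field in a hot spot analysis where the original categorical vegetation class was not.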

Red indicates increased distance from a historic wetland. The results show that since most of the valley was once floodplain wetlands, most sites are situated on historic wetlands; an area near Portland, however, shows a hot spot of mitigated wetlands that are located further from historic wetland vegetation.

 


The Average Nearest Neighbor tool is not relevant to my study, which examines spatial variations of DOM (dissolved organic matter) and nutrients in streams, but I wanted to learn how it works, so I tried it. The result tells me how clustered my sampling sites are relative to the study site extent. I could see visually that my sampling sites were clustered within the study site, and the result supported my visual observation. Note that I did not use all of my sampling sites.

Here are some notes:

I projected the point shapefile FIRST. This determines the units of my results. (I was not sure about this, but I found it was true during the discussion on Monday.)

I used EUCLIDEAN_DISTANCE.

I had to check “Generate Report” to create a graphical result, which can be accessed from the “Results” window –> HTML Report File: NearestNeighbor_Results. I could save this report as a PDF.
All the results are in the log/results window. It personally helps me to disable background processing and have an actual log window to obtain a text summary of the results.

Average Nearest Neighbor Summary
Observed Mean Distance: 4318.596812
Expected Mean Distance: 9383.554326
Nearest Neighbor Ratio: 0.460230
z-score: -4.732046
p-value: 0.000002
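To make the summary above easier to read, here is a rough sketch of how these quantities relate to each other. It follows the formulas given in Esri's documentation for Average Nearest Neighbor as I understand them (expected distance 0.5 / sqrt(n / A), standard error 0.26136 / sqrt(n² / A)). The function name, the brute-force neighbor search, and the toy points are assumptions for illustration, and the study area A has to be supplied by the caller.

```python
import math


def average_nearest_neighbor(points, area):
    """Return (observed_mean, expected_mean, ratio, z_score) for 2-D points.

    points: list of (x, y) in projected units; area: study area in squared units.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")

    # Observed mean distance: average distance from each point to its nearest neighbor.
    nearest = []
    for i, (xi, yi) in enumerate(points):
        nearest.append(min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i
        ))
    observed = sum(nearest) / n

    # Expected mean distance and standard error for a random pattern in `area`.
    expected = 0.5 / math.sqrt(n / area)
    std_err = 0.26136 / math.sqrt(n * n / area)

    return observed, expected, observed / expected, (observed - expected) / std_err


# Toy example (hypothetical): four points on the corners of a unit square, area 4.
toy = average_nearest_neighbor([(0, 0), (1, 0), (0, 1), (1, 1)], area=4.0)

# The ratio reported above is simply observed / expected.
reported_ratio = 4318.596812 / 9383.554326
```

A ratio below 1 with a negative z-score, as in my results, points toward clustering; a ratio above 1 with a positive z-score points toward dispersion.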

 

Unfortunately, I could not easily find a corresponding tool in SSN and STARS. I have to look into their documents.

Incremental spatial autocorrelation (ISA) uses Moran’s I to test for spatial autocorrelation within distance bands. Analysis is run on a given parameter (e.g., percent cover, elevation).

Interpretation

  • ISA returns z-scores and p-values
  • Significant p-value indicates spatial clumping
  • Non-significant p-value indicates random processes at work
  • Output indicates significant peaks in z-score
  • Higher z-score indicates more spatial clumping
  • Distance of first peak in z-score usually used for further analysis
  • Useful for determining the appropriate scale for further analysis, e.g.:
      • Hot Spot Analysis
      • Density tools which ask for a radius
  • Determine whether a subsample should be taken to remove autocorrelation
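The idea behind ISA can be sketched in a few lines of Python: compute Moran’s I and a z-score for a series of distance bands and look for peaks. This is a simplified illustration, not the ArcGIS implementation. The function names are hypothetical, the weights are binary (1 if two points are within the distance band, 0 otherwise), and the z-score uses the normality-assumption variance, which may differ from the variance formula ArcGIS uses.

```python
import math


def morans_i_z(points, values, distance_band):
    """Global Moran's I and z-score with binary fixed-distance-band weights.

    Returns None if no pair of points falls within the band or the values are constant.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    dev_sq_sum = sum(d * d for d in dev)

    neighbor_counts = [0] * n
    cross = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dist = math.hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
            if dist <= distance_band:
                neighbor_counts[i] += 1
                cross += dev[i] * dev[j]

    s0 = float(sum(neighbor_counts))  # sum of all weights
    if s0 == 0 or dev_sq_sum == 0:
        return None

    i_stat = (n / s0) * cross / dev_sq_sum
    expected = -1.0 / (n - 1)

    # For symmetric binary weights: S1 = 2 * S0 and S2 = 4 * sum(row_count ** 2).
    s1 = 2.0 * s0
    s2 = 4.0 * sum(c * c for c in neighbor_counts)
    variance = (n * n * s1 - n * s2 + 3.0 * s0 * s0) / ((n * n - 1) * s0 * s0) - expected ** 2
    z = (i_stat - expected) / math.sqrt(variance) if variance > 0 else math.nan
    return i_stat, z


def incremental_spatial_autocorrelation(points, values, distance_bands):
    """Return a list of (distance, morans_i, z_score), one entry per usable band."""
    results = []
    for band in distance_bands:
        stats = morans_i_z(points, values, band)
        if stats is not None:
            results.append((band, stats[0], stats[1]))
    return results


# Hypothetical data: a 5 x 5 grid whose values increase from west to east.
grid_points = [(x, y) for x in range(5) for y in range(5)]
grid_values = [float(x) for x, _ in grid_points]
isa_results = incremental_spatial_autocorrelation(grid_points, grid_values, [1.0, 2.0, 3.0])
```

Plotting the z-score against distance and reading off the first peak gives a candidate distance band for tools such as Hot Spot Analysis.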

Figures: AGST_Coos_lowres; AGST_ISA_graph

 

 

EDITOR’S NOTE: The data used for this analysis contained spatially duplicated values, meaning there are many instances of two separate measurements taken from the same site. These values might be temporally different (same site measured more than once on different dates), distinct but adjacent (different samples from the same field, close enough to have the same lat/long within rounding error), or true duplicates. The data were not cleaned for these errors prior to analysis.

Introduction

This week I fiddled around with the two Mapping Clusters tools in the Spatial Statistics toolbox. Here’s a rundown of each tool:

Cluster and Outlier Analysis

Given a set of weighted features, identifies statistically significant hot spots, cold spots, and spatial outliers using the Anselin Local Moran’s I statistic.

Hot Spot Analysis

Given a set of weighted features, identifies statistically significant hot spots and cold spots using the Getis-Ord Gi* statistic.
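For readers who want to see what the Gi* statistic is doing, here is a small sketch in Python. It follows the Getis-Ord Gi* formula as given in Esri's documentation, as far as I understand it, but it is not the ArcGIS tool: the function name is hypothetical, the weights are binary within a fixed distance band (each feature counts as its own neighbor), and no multiple-testing correction is applied.

```python
import math


def getis_ord_gi_star(points, values, distance_band):
    """Return a Gi* z-score for every feature using binary distance-band weights."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    if std == 0:
        raise ValueError("values are constant; Gi* is undefined")

    scores = []
    for xi, yi in points:
        # Values of all features within the band, including the feature itself.
        local = [
            v for (xj, yj), v in zip(points, values)
            if math.hypot(xi - xj, yi - yj) <= distance_band
        ]
        w_sum = len(local)  # with binary weights, sum(w) equals sum(w ** 2)
        numerator = sum(local) - mean * w_sum
        denominator = std * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


# Hypothetical transect: three high values at one end, low values elsewhere.
transect_points = [(float(i), 0.0) for i in range(10)]
transect_values = [10.0, 10.0, 10.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
gi_scores = getis_ord_gi_star(transect_points, transect_values, distance_band=1.0)
```

Because each score compares a local sum against the global mean and standard deviation, a feature with a low value can still receive a high score if its neighbors are high. Cluster and Outlier Analysis instead reports such features as outliers.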

Methodology

I wanted to challenge each of these tools to see if they could cope with a common natural resources data phenomenon: zero-value inflation. The problem of a large proportion of zero values is common with data obtained from ecological studies involving counts of abundance, presence–absence, or occupancy rates. The zero values are real and cannot simply be tossed out, but because 0.0 is a singular value, it results in a major spike in the data distribution. This can adversely affect many common modeling techniques. The graphs below are from my data set of organic matter in Oregon soil pedons and have this zero-value complication:

                                                                             

At left, the full data set with an enormous spike in the first bar (caused by all the zero values). At right, the same data set with the almost 400 zero-value data points removed. It’s still right skewed, but looks like a more reasonable distribution spread.

I ran each data set through both cluster analysis tools and visually compared results. There were noticeable differences between all four maps. In this “spot the difference” exercise, some major changes included a hot spot in the northeast that appears in the full data only, a Willamette Valley that ranges everywhere from cold to hot depending on the map, and a bigger east-west trend in the no-zero data sets.

Discussion

Clearly, having a large number of zero-values in one’s dataset will skew cluster analysis results. This was to be expected. In my data, the zero values were enough to overwhelm the generally “warm” signature in the Willamette Valley. Also, the values in the north-east were “significantly large” when balanced against a large zero data set, but “not significant” in the more balanced no-zero data set.

Regarding the Hot Spot Analysis versus the Cluster and Outlier Analysis, for this data set the Cluster and Outlier tool resulted in more stringent (smaller) hot and cold spots. Also, outlier values of cold in hot spots and hot in cold spots can be seen in the Cluster and Outlier results, but vanish and in fact take the same label as the surrounding data in the Hot Spot Analysis (see below).

                                                          

At left, a close-up of southwest Oregon in the Hot Spot analysis of the no-zero dataset. At right, the Cluster and Outlier analysis of the same area. Note the two dots in the middle that are classified as “hot” in the left image but “low, surrounded by high” in the right image.

Conclusion

I would be very wary of using these tools with zero-inflated and skewed data. This study was crude in that I completely removed zero values rather than re-casting them as low (but not zero) values. Nonetheless, there were drastic differences between all four maps in a large-sample (more than 1,000 points) dataset.

Between Hot Spot Analysis and Cluster and Outlier Analysis, for this data I would prefer to use Cluster and Outlier Analysis. The same general hot spot pattern is maintained between the two. It seems there is a trade-off between smoothness and information regarding outliers within hot and cold spots. For scientific mapping purposes it would seem very important to be aware of those outliers.

 

The following is the abstract of the paper I presented earlier this month at AAG:

The specific geography of individual wine growing regions has long been understood to be a significant factor in predicting both a region’s success in producing high quality grapes, and the resulting demand for wines produced from that region’s fruit. In the American wine industry, American Viticultural Areas (AVAs) are increasingly being used to designate a uniqueness and specificity of place. This process is often predicated on the argument that these areas represent a certain degree of physiographic uniformity or homogeneity. This is particularly the case with regard to the phenomenon of sub-AVAs, wherein smaller areas within large, spatially heterogeneous AVAs seek to differentiate themselves based on the physiographic features that are purportedly unique to those smaller subregions. In many cases, there is a strong correlation between soil classes and AVA boundaries, whereas in other cases the correlation is not as strong. This suggests that there are factors other than physiographic homogeneity contributing to the designation of these sub-AVAs. This study employs GIS and spatial analysis to examine and potentially correlate the soil classes of Oregon’s northern Willamette Valley with the sub-AVAs in that area. In doing so, this study presents maps and statistical results in order to provide a quantitative summary of the geographic context of vineyards in this region with respect to both the soil classes present and the federally designated AVA boundaries in which they are located.

 

About my data and my spatial problem:

The data set that I am working with is a legacy Natural Resources Conservation Service (NRCS) data set detailing soil classes throughout Oregon’s Willamette Valley. Using metes and bounds descriptions provided by the United States Department of the Treasury’s Alcohol and Tobacco Tax and Trade Bureau (TTB), the federal entity tasked with approving AVA designation petitions, I have generated a series of polygons representing the Willamette Valley AVAs (Willamette Valley and its 6 sub-AVAs: Chehalem Mountains, Ribbon Ridge, Dundee Hills, Yamhill-Carlton, McMinnville, and Eola-Amity Hills). I also have a handful of raster data layers (slope, aspect, landform, lithology, and PRISM) that I am using to calculate zonal statistics. Many spatial statistical methods are designed around the use of point data. This poses a problem for me because all of my data is in either a vector polygon or raster format. I am interested in exploring which methods/tools within the Spatial Statistics toolbox are most appropriate for use with my data. I am also interested in getting feedback from others in this course so as to make my research more robust, defensible, and statistically sound.

-Doug

The main problem that I face with my humpback whale sighting data is that field efforts were not random and sightings reflect locations of predictable habitat use rather than sightings along survey transects. When asked to run a nearest neighbor analysis, Julia and I thought it might be neat to run the identical analysis at three different spatial scales in order to see how the results differ. I made three independent shapefiles for each spatial scale and ran the analysis for 1) all of southeastern Alaska, 2) just Glacier Bay and Icy Strait and 3) Point Adolphus.

These were the results for the nearest neighbor (NN) analyses:

SEAK (largest extent)

  • Expected NN: 3795.7 m
  • Observed NN: 485.9 m

Glacier Bay/Icy Strait (medium extent)

  • Expected NN: 1174.3 m
  • Observed NN: 353.6 m

Point Adolphus (smallest extent)

  • Expected NN: 366.2 m
  • Observed NN: 137.4 m

We can see that as we get down to a smaller spatial scale, the expected value becomes more similar to the observed. This is expected since the geography of the entire SEAK extent is no longer getting between groups of whales. Also, the distribution of whales is becoming more evenly distributed.
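One way to put a number on “more similar” is the nearest neighbor ratio (observed divided by expected) at each extent. The short Python sketch below uses only the values listed above; the dictionary layout and the function name are just for illustration.

```python
# Observed and expected mean nearest neighbor distances (meters) from the results above.
nn_results = {
    "SEAK": {"observed": 485.9, "expected": 3795.7},
    "Glacier Bay/Icy Strait": {"observed": 353.6, "expected": 1174.3},
    "Point Adolphus": {"observed": 137.4, "expected": 366.2},
}


def nn_ratio(observed, expected):
    """Nearest neighbor ratio: values below 1 suggest clustering."""
    return observed / expected


ratios = {
    extent: nn_ratio(vals["observed"], vals["expected"])
    for extent, vals in nn_results.items()
}

for extent, ratio in ratios.items():
    print(f"{extent}: ratio = {ratio:.3f}")
```

The ratio rises toward 1 as the extent shrinks, but stays well below 1 at all three scales, so the sightings still look clustered even at Point Adolphus.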

Next, I ran a hot spot analysis of humpback whale sightings in Glacier Bay/Icy Strait. I downloaded a bathymetry layer from GEBCO (British Oceanographic Data Centre) and extracted raster values to each humpback whale sighting (first figure, below: green dots are deeper, red are more shallow). In the second figure below, red dots indicate significant clusters of high values (greater depths) and blue dots indicate significant clusters of whales at shallower depths.

Depths at humpback whale sightings.
Results from my hot spot analysis.

Notes from today’s class discussion: How do we actually calculate the expected nearest neighbor value? This analysis is scale dependent. We must consider the spatial extent that goes into the calculation. Kate discovered that there is an option for defining the area of this analysis when you run the tool. The default is to use the extent of your feature class. Since I created a new feature class for each of the three analyses above, those different spatial scales are included in the output value for expected nearest neighbor distance. We must be sure to keep this methodology in mind when doing these calculations.
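To illustrate the scale dependence discussed in class, here is a small sketch of how the expected nearest neighbor distance responds to the study area. It assumes the expected distance is 0.5 / sqrt(n / A), as in Esri's documentation for the tool as I understand it, and that the default area is the bounding rectangle of the features. The function names and example numbers are hypothetical.

```python
import math


def bounding_box_area(points):
    """Area of the axis-aligned rectangle enclosing all (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def expected_nn_distance(n_points, area):
    """Expected mean nearest neighbor distance for n random points in `area`."""
    return 0.5 / math.sqrt(n_points / area)


# Same number of sightings, two different study areas (square meters).
small_extent = expected_nn_distance(100, 1.0e6)
large_extent = expected_nn_distance(100, 4.0e6)

# Area taken from the points themselves when no area is supplied.
example_area = bounding_box_area([(0.0, 0.0), (2.0, 3.0), (1.0, 1.0)])
```

Quadrupling the area doubles the expected distance for the same number of points, which is why the same sightings can look far more clustered when they are analyzed within a larger extent.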

A potential next step for me is to run my hot spot analysis on more localized scales. The default spatial extent for the analysis I ran appears to be too expansive to really show fine-scale patterns. I also want to start looking at hot spot analyses based on mitochondrial DNA haplotypes.

 

For students in GEO 599: If you have any questions about the tools in the Spatial Statistics Toolbox, please reply to this blog post or email Dr. Lauren Scott at Esri directly at Lscott@esri.com.

In addition, you may contact Dr. Scott, the creator of the tools, through the Esri forums. Lauren is very helpful.

  1. Go to Esri forums at http://forums.arcgis.com/
  2. Sign in using your global account
  3. Search for Lauren Scott
  4. Click on her profile
  5. You will get an option to contact her directly

For the spatial statistics forum, see: http://forums.arcgis.com/forums/110-Spatial-Statistics?sort=lastpost&order=desc

For Dr. Lauren Scott’s threads, search on “Lauren Scott”