The goal of this effort was to assess whether spatial statistical tools provided within ArcGIS 10.1 could be used to evaluate geospatial relationships and differentiate between natural and anthropogenic subsurface leakage pathways. Evidence for gas migration within the subsurface is well documented through petroleum system and CO2 sequestration related studies. In addition, there are a number of examples of paleo- and modern natural gas seepage from deep subsurface systems to the near-surface in both onshore and offshore environments. Gas/fluid migration is key to the documentation of fluid flux relative to both resource/storage potential estimates and improving constraints on potential environmental impacts associated with gas migration from subsurface to groundwater systems and the atmosphere. Migration of gas from subsurface systems is particularly visible in the marine environment, where gas plumes into the water column have been observed and studied by a number of researchers worldwide.
Migration of gas from subsurface systems to near-surface systems has been the target of relatively few investigations, although a better understanding of these processes is of increasing interest due to the focus on potential storage of CO2 in geologic media and on methane recovery using hydrofracking techniques. To date, many of the studies evaluating the flux and migration of gas from subsurface to near-surface systems have focused on anthropogenic concerns such as potential leakage related to CO2 storage, landfills, and natural gas storage facilities. Given the variability of the natural geologic system and the limited amount of existing field data and studies examining these systems, questions about the timing and nature of gas migration in many regions persist. Evaluating available data to assess whether interactions between subsurface and near-surface systems are related to naturally occurring mechanisms or are the result of anthropogenic activities offers additional challenges. However, improving understanding of mid- to large-scale (field- and regional-scale) controls on gas flux will help further constrain the relationships between geologic media and the nature and timing of gas flux in a variety of subsurface systems. Through the evaluation of field datasets utilizing laboratory, geospatial, and geostatistical analytical methods, patterns related to gas occurrences and concentrations in subsurface systems can be identified.

simplified box model 4 2013

To support these evaluations in a test scenario for CO2 geologic storage potential in the Appalachian Basin’s Oriskany Formation, we utilized geostatistical tools to assess a variety of parameters associated with the reservoir characterization of the Oriskany Formation, including thickness, porosity, subsurface depth to the top of the formation, pressure, temperature, and a variety of pore-filling media relationships such as brine density, salinity, etc. The data leveraged for this analysis were largely based upon borehole geophysical log interpretations from West Virginia, Ohio, and Pennsylvania. The distribution of these data was irregular, and not every data point had values for all of the reservoir parameters being evaluated. The variability of the wellbore data points is highlighted in the figure below.

map

To develop a complete dataset to assist with CO2 storage estimations, each parameter needs a value (at least a minimum, maximum, and average) in every grid cell, including areas of low or even no data density. As a result, the following tools and approaches were evaluated within ArcGIS for this dataset:
1. Assign the minimum, maximum, and average values for the entire dataset to all empty grids
2. Inverse Distance Weighting (IDW), evaluated to fill in the empty grids
3. Kriging, also evaluated as a tool to fill in the empty grids
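As a rough sketch of what the IDW option does under the hood, the pure-NumPy version below interpolates an empty grid cell as a distance-weighted average of surrounding wells. The well locations and porosity values are hypothetical, and ArcGIS’s implementation adds options (power, search radius, barriers) not shown here.

```python
import numpy as np

def idw_interpolate(xy_known, values, xy_targets, power=2.0):
    """Inverse Distance Weighting: estimate each target location as a
    distance-weighted average of the known sample values."""
    estimates = []
    for tx, ty in xy_targets:
        d = np.hypot(xy_known[:, 0] - tx, xy_known[:, 1] - ty)
        if np.any(d == 0):                   # target coincides with a sample
            estimates.append(values[d == 0][0])
            continue
        w = 1.0 / d ** power                 # closer wells weigh more
        estimates.append(np.sum(w * values) / np.sum(w))
    return np.array(estimates)

# Hypothetical porosity samples (% porosity) at four irregular well locations
wells = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
porosity = np.array([8.0, 12.0, 10.0, 14.0])

# Fill an empty grid cell at the center of the four wells
print(idw_interpolate(wells, porosity, [(5.0, 5.0)]))  # → [11.] (the mean, by symmetry)
```

Raising `power` makes the interpolation honor nearby wells more strongly, which is one reason IDW can track local trends in the raw data more closely than a smoothed kriging surface.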
Example of these results for porosity data:

porosity

trends

The graph above shows the relative distribution and compares average % porosity values for the raw data points against values interpolated using ArcGIS’ kriging and inverse distance weighting (IDW) approaches. While the two offer generally similar results, for this dataset the IDW approach fills data gaps in a manner that more closely follows the trends of the actual average porosity data than the kriging interpolation algorithm does.
Summary and preliminary conclusions:
1. Recommend using IDW interpolations to fill gaps
a. Best fit to the original well data for the parameters
b. Represents natural fluctuations in the data range
c. Correlates parameters across spatial distances
2. Intent is to represent “field-” to basin-scale features and variability, not reservoir- or borehole-scale features.
3. Help inform technology development, risk evaluation, knowledge gaps
4. For extrapolation to more complex systems and problems dealing with 3D, multi-system components:
a. Basic geostatistical tools are sufficient for preliminary analyses and filling in gaps
b. But more complex analyses will require other software/approaches

3D

After tinkering with several different tools and datasets, my end goal for this class became to digitally reproduce my manual methodology for mapping wetlands.

My methodology is manual (rather than digital) because temporal and spatial context are very important when mapping wetlands in the Willamette Valley using Landsat satellite imagery.

For example, riparian wetlands along the Willamette River have the same spectral signal as poplar farms and orchards in the same vicinity. If using a single date to classify wetlands, a digital classification may confuse land use types. However, looking backward and forward in time, I can tell which areas are stable wetlands and which are rotated crops. Additionally, I have other layers such as streams, soils, elevation, and floodplain to see whether it is logical for a wetland to exist in a certain area.

bgwtime

 

When I classify wetlands I have several different data layers available to aid in decision making:

Spectral

  • (40) annual Landsat tasseled cap images from 1972-2012
“Ancillary”
  • Lidar inundation  raster
  • Streams
  • Soils
  • Others
My options for classifying wetlands using my satellite imagery include:
  • Spectral-only supervised classification, using the time series as additional channels
  • Object-oriented classification
  • Incorporating ancillary layers with the spectral data
  • A mix of the above

Arc seemed like the perfect application to attempt to incorporate my spectral Landsat data with my other layers.

At first I thought using one of the regression tools on a mapped sample of my study area could allow me to classify the rest of my study area using OLS predictions. However, when I looked at my data, the relationships between datasets did not appear robust enough.

spm

 

Instead, I decided to do a binary raster overlay to find possible locations of wetlands. I selected a small study area focused around South Corvallis because it had a good mix of both riparian forest stands and hardwood orchards.

I included spectral data from a 2011 tasseled cap image using the indices brightness, greenness, and wetness. I calculated the spectral stats for BGW using areas of known riparian wetlands. A new binary raster was created for B, G, and W; pixels with values within the mean +/- one standard deviation were given a “1” and all other pixels were given a “0”.

bgw
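The thresholding step above amounts to a single vectorized comparison per index. In this sketch the band values and wetland statistics are made-up placeholders; the same recipe would be repeated for each of B, G, and W.

```python
import numpy as np

# Hypothetical brightness values for a small 3x3 window
band = np.array([[0.10, 0.23, 0.31],
                 [0.25, 0.28, 0.55],
                 [0.05, 0.27, 0.26]])

# Stats computed from pixels of known riparian wetlands (assumed values here)
wetland_mean, wetland_sd = 0.26, 0.04

# Pixels inside mean +/- one standard deviation become 1, all others 0
lo, hi = wetland_mean - wetland_sd, wetland_mean + wetland_sd
binary = ((band >= lo) & (band <= hi)).astype(np.uint8)
print(binary)
```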

 

I also included a 1 meter Lidar two-year inundation floodplain map; areas in the floodplain were reclassified as “1” and all other space as “0”. In addition, I used a 100 m distance raster of the streams in the area.

lidar streamraster

All layers were given equal weight in the raster overlay. The result was a raster with little to no error of omission but high error of commission.
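With every layer already binary, the equal-weight overlay reduces to an elementwise sum. The 2x3 layers below are invented for illustration, and requiring every criterion to be met (score equal to the layer count) is just one possible decision rule.

```python
import numpy as np

# Hypothetical binary layers: thresholded brightness/greenness/wetness plus
# the reclassified floodplain and stream-distance rasters
brightness = np.array([[1, 0, 1], [1, 1, 0]])
greenness  = np.array([[1, 1, 1], [0, 1, 0]])
wetness    = np.array([[1, 0, 1], [1, 1, 0]])
floodplain = np.array([[1, 1, 1], [1, 0, 0]])
near_strm  = np.array([[1, 0, 1], [1, 1, 1]])

layers = [brightness, greenness, wetness, floodplain, near_strm]

# Equal weights: sum the layers; a pixel satisfying every criterion scores
# len(layers). A permissive rule like this keeps omission error low but,
# as noted above, commission error can still be high.
score = sum(layers)
candidate_wetland = (score == len(layers)).astype(np.uint8)
print(candidate_wetland)
```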

(Areas of yellow indicate manually mapped and validated riparian wetlands; green is result of raster overlay).

results

 

Just for comparison, I decided to run a spectral classification in ENVI using my manually mapped wetlands as regions of interest (i.e. training sites).

The result presented increased error of omission but decreased error of commission.

specclass

 

You can run a spectral classification in Arc, but the process is not streamlined and can become convoluted. Additionally, ENVI automatically extrapolates the classification of an image based on training data; Arc is clunky when it comes to prediction and extrapolation.

predict

 

My final thoughts on the Arc spatial statistics toolbox are that:

  • Arc spatial stats are (mostly) useless when it comes to categorical/non-continuous data
    • Especially useless when it comes to mixing continuous with non-continuous
  • Raster tools are not multi-channel friendly (I had to process each band separately)
  • Classification and prediction are convoluted, multistep processes in Arc
    • Operations that ENVI, R, etc. do flawlessly
  • I should probably expand to other software for serious analysis
    • eCognition, habitat suitability mappers, etc.

After considerable experimentation with a variety of ArcGIS’s Spatial Statistics tools, including Hot Spot Analysis, Cluster Analysis, Spatial Autocorrelation, Geographically Weighted Regression, and Ordinary Least Squares, I think I may have found a viable method for analyzing my SSURGO Soils dataset. For my final class presentation for this course, I employed the Grouping Analysis tool to explore the spatial patterns and clusters of high clay content within the sub-AVAs of the northern Willamette Valley. The visual correspondence between the resulting groups and the soil orders (i.e. taxonomy) was surprisingly accurate.

Reading through the literature on ESRI’s webpage about Grouping Analysis, I learned that one should start the Grouping Analysis using one variable, incrementally adding more with each subsequent run of the analysis. Following suit, I have experimented with both the addition of more variables as well as the total number of ideal groups for a given data set. While the soils present in each of the sub-AVAs are incredibly heterogeneous and diverse, they do share some similarities, particularly with regard to clay content and soil taxonomy.

Northern_Willamette_Valley_AVA_Soil_Taxonomy_20130612

Northern_Willamette_Valley_AVA_Grouping_Analysis_20130612

The results published here reflect an analysis using the variables of percent clay content, Ksat, and Available Water Storage at 25 cm, 50 cm, and 150 cm, with the data parsed into 5 groups. I also took advantage of the “Evaluate Optimal Number of Groups” parameter within the toolbox, which generates additional statistics meant to identify the number of groups that will most readily separate one’s data set into distinct groups.

In addition, I generated Output Report Files with each run so that I could explore the statistical results in more depth. I’ve attached these for those of you who are interested in seeing what the results look like. I find it interesting that for almost all of my AVA data sets save for one, the resulting reports suggest that 15 is the optimal number of groups. I’m not sure if this is because 15 is the maximum number of groups that the tool can generate, or if this is a result of the particular variables I am using as inputs.

chehalem_grouping5

dundee_grouping5

eola_amity_grouping5

ribbon_ridge_grouping5

yamhill_carlton_grouping5

Additional variables that I plan on adding include percent sand, percent silt, bulk density, percent organic matter, and parent material. I am also considering incorporating raster data sets of slope, aspect, landform, vegetation zone, precipitation, minimum temperature, and maximum temperature. Performing multiple iterations of the Grouping Analysis will help me to identify a suitable combination of these variables, as well as the optimal number of groups. Once those have been identified, I plan on performing the same analysis on each AVA, and then on buffered polygons of the AVAs at distances of 500m, 1000m, 1500m, 2000m, 2500m, and 3000m. In so doing, I hope to identify the degree to which different sub-AVAs in the northern Willamette Valley differ from directly adjacent landscapes. This will allow me to articulate which sub-AVAs best correspond to the underlying soil classes in those areas.

My goal for this class was to learn which tools available in ArcGIS’s Spatial Statistics toolbox might be useful in helping to answer my research questions regarding spatial-temporal relationships between mountain pine beetle outbreak/spread and coinciding environmental characteristics such as topography, climate, and forest attributes.  I quickly learned that the spatial statistics tools in ArcMap are not suitable for my data.  All tools expect vector data as inputs, and I am working with raster image data.  I can do conversions from raster to vector, but it really doesn’t make sense computationally or theoretically.  Also, I found that my image data needed a lot more pre-processing to get it to a point where it could be analyzed than I originally thought.  I spent the majority of the class time preparing my data, which included georegistration, disturbance detection, and disturbance causal agent identification.  Once I had completed these steps for a pilot image scene, I was able to do a quick look at outbreak elevation histograms through time, and also at annual climate prior to and during the outbreak.

The remainder of this post will walk through the steps I took to process the imagery and present some initial findings on mountain pine beetle outbreak and environment relationships.

The first step in image preprocessing was to properly georegister mis-registered images.  Please click on the images below to see an example of a before and after image registration correction.

fiximg1 fiximggood

Figure 1. Satellite image before and after georegistration correction (click images to see animation)

The registration process was coded and automated using the R “Raster” package.  The first step was to sample the image with points and extract small image chips or windows from around the point in the reference image and the misregistered image (see figure below).

 

sample

Figure 2. Sample reference and misregistered image and extract image chips from around the sample points.

Within the registration program, for each sample point, the misregistered image chip is “slid” over the reference image chip at all defined x and y offset distances, and cross correlation coefficients are calculated for each shift based on the intersecting pixel values.  Please click on the figure below to see an example of the moving window cross correlation.

 

itpfind  Click image to see animation

Figure 3.  Animation example of moving window cross correlation analysis between reference and misregistered image chips for iterative xy offset shifts.

A cross correlation surface is produced for each sample point.  The apex of each conical peak represents the offset that is needed to correct the misregistered image.  A 2nd order polynomial equation is created from all legitimate peaks and used to warp the misregistered image into its correct geographic position.
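The chip-matching step can be sketched in a few lines. The original program was written in R with the Raster package, so this NumPy version is only an illustration: it searches integer x/y offsets for a pair of chips, scores each shift with a correlation coefficient over the overlapping pixels, and returns the peak; the polynomial warp is omitted.

```python
import numpy as np

def cross_corr(ref, mis, dx, dy):
    """Correlation of overlapping pixels when ref[y, x] is compared
    against mis[y + dy, x + dx]."""
    h, w = ref.shape
    r = ref[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    m = mis[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
    return np.corrcoef(r.ravel(), m.ravel())[0, 1]

def best_offset(ref, mis, max_shift=3):
    """Exhaustively score all shifts and return the (dx, dy) peak."""
    shifts = range(-max_shift, max_shift + 1)
    scores = {(dx, dy): cross_corr(ref, mis, dx, dy)
              for dy in shifts for dx in shifts}
    return max(scores, key=scores.get)

# Synthetic demo: a random "reference" chip and a copy shifted 2 rows
# down and 1 column right stands in for the misregistered chip
rng = np.random.default_rng(0)
ref = rng.random((12, 12))
mis = np.roll(ref, shift=(2, 1), axis=(0, 1))

print(best_offset(ref, mis))  # → (1, 2): the x and y shift that realigns the chip
```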

 

ccc_surfaces

Figure 4. Example cross correlation surfaces showing the best offset matches between the reference image and misregistered image chips

With all images in their correct geographic position, the “LandTrendr” change detection program was applied to the long time series of Landsat imagery to identify pixels that have been disturbed.  A discriminant analysis of empirical variables related to spectral, shape, and topographic characteristics of identified disturbances was conducted to predict disturbance agent from a training set of labeled disturbances.  The figure below depicts a mountain pine beetle outbreak identified in central Colorado.  Click on the image to see the outbreak start and progression (colors represent magnitude of change to forest: low to high, blue to red).

 

test click image to see animation

Figure 5: Mountain pine beetle outbreak spread as captured by annual Landsat satellite imagery

From the outbreak progression video above, you can see that the outbreak appears to move up slope.  To find out if this is truly the case I extracted elevation values for all insect affected pixels per year and plotted the histogram of both elevation value density and frequency.  Please click on the images below to see the shift in elevation and area affected as the outbreak progresses.

histdense histcount Click images to see animation

Figure 6. Animated histograms depicting the progression of the mountain pine beetle outbreak up slope.

I was also curious about what the annual climate was doing before and during the outbreak.  I extracted PRISM climate data from 1985 to 2008 in the region of the outbreak and plotted it with the count of insect-disturbed pixels.  The figure shows that maximum and minimum annual temperature begin to increase 1 to 2 standard deviations above the mean about 5 years before the outbreak really takes off.  This corresponds with a 3-4 year drop in annual precipitation.  These conditions could have drought-stressed the forests and provided a highly productive climate for the beetle to reproduce multiple times in a season and avoid freeze-kill.

 

climate

Figure 7.  Graph showing annual deviation from mean for PRISM climate variables and insect-disturbed pixels for 23 years.

In closing, I found the ArcMap spatial statistics unable to work with my raster format data, but was able to make a lot of progress in data preparation and analysis and exploration of satellite image detected insect outbreaks and corresponding environmental factors.

 

 

 

First of all, I would like to say that I learned about the Ordinary Least Squares (OLS) and Geographically Weighted Regression (GWR) tools from the ArcGIS 9.3, 10.0, and 10.1 help documentation. This final report is based on the knowledge I gained through those help menus.

My learning goal in this class was to learn ArcGIS Spatial Statistics Tools that help answer my project questions. For my project, I am examining whether or not various types of dissolved organic matter (DOM) sources are controlled by factors such as land use types, stream orders, other water quality parameters, and seasonal flow patterns. At the beginning of the quarter, I was going to use SSN & STARS and FLoWS, which are add-in spatial statistical tools for stream networks; however, they did not work because they are built for ArcGIS 9.3.

I moved onto GWR, which is one of the Spatial Statistics Tools. These are the steps and tools I had to follow and use to run GWR.

  1. OLS to select independent variables
  2. Spatial Autocorrelation (Moran’s I) on residuals to make sure that I did not miss any significant independent variables
  3. GWR
  4. Spatial Autocorrelation (Moran’s I)

Because I have not gotten to the point of identifying different DOM sources, I used dissolved organic carbon (DOC) concentrations as the dependent variable. For independent variables, I examined land use (forest, urban, and agriculture), stream order, absorbance at 254 nm, and water quality parameters such as dissolved organic nitrogen (DON), nitrate (NO3-), ammonium (NH4+), and ortho-phosphate (PO43-).

The data used were collected at the end of the dry season in 2012. Thus, the temporal variation was not considered here. I did not have time to create catchment and watershed layers for each sampling location, so I eyeballed the percentage of each land use for the purpose of completing this class project. I used data from 21 sampling sites.

My study sites

  1. OLS and Moran’s I

Plotting scatter plots in Excel (I could have used the ArcGIS scatter plot tool) before running OLS helped me understand my data. Once I plotted them, it was evident that absorbance at 254 nm was highly related to DOC concentrations. Absorbance at 254 nm is used to describe the aromaticity of DOM. It makes sense that these two variables were correlated, because highly aromatic compounds mean high DOC (note: high DOC does not mean high aromaticity). My samples must have been aromatic at the end of the dry season in 2012. Thus, I decided to use absorbance at 254 nm as the first independent variable. OLS generates two forms of output: a report and a residual layer.

As I expected, the result was good. About 88% of the dependent variable’s variance was explained by absorbance at 254 nm, judging by the Adjusted R-Squared. An asterisk on the probability (p-value) indicates that the particular independent variable is statistically significant (p < 0.05). Another value to look at is the corrected Akaike Information Criterion (AICc). A low AICc indicates a good model fit with the observed data.

OLS result 1

According to Wikipedia (http://en.wikipedia.org/wiki/Akaike_information_criterion),

AIC = 2k – 2ln(L)

where k is the number of parameters and L is the maximized value of the likelihood function.

AICc = AIC + 2k(k+1)/(n-k-1)

where n is the sample size.
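The two formulas translate directly into code. As a sanity check, the small-sample correction term below uses the post's n = 21 sampling sites and an assumed k = 2 parameters, which reproduces the 0.67 value quoted later in the post.

```python
def aicc(k, n, log_l):
    """AICc = AIC + small-sample correction, with AIC = 2k - 2 ln(L)."""
    aic = 2 * k - 2 * log_l
    return aic + 2 * k * (k + 1) / (n - k - 1)

# Correction term alone: 2k(k+1)/(n-k-1) with n = 21, k = 2 (assumed)
correction = 2 * 2 * (2 + 1) / (21 - 2 - 1)
print(round(correction, 2))  # → 0.67
```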

Another output of OLS is a residual layer. Red color indicates the observed dependent variable value is higher than the corresponding predicted value. Blue color indicates the observed dependent variable value is lower than the predicted value.

OLS residual layer 1

Before moving on to GWR, we must run Moran’s I on the residuals to see if they are clustered. If they are, an important independent variable is still missing from the OLS model. The result of Moran’s I indicated the residuals were not clustered; however, I decided to add one more independent variable for practice purposes.
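For intuition, global Moran's I is a spatially weighted correlation of each value with its neighbors. This minimal sketch uses inverse-distance weights (one of several weighting conceptualizations the ArcGIS tool offers) and a made-up trend surface in place of real residuals:

```python
import numpy as np

def morans_i(values, coords):
    """Global Moran's I with inverse-distance weights:
    I = (n / sum(w)) * (z' W z) / (z' z), z = deviations from the mean."""
    x = np.asarray(values, dtype=float)
    z = x - x.mean()
    n = len(x)
    w = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                d = np.hypot(*(np.asarray(coords[i]) - np.asarray(coords[j])))
                w[i, j] = 1.0 / d
    return (n / w.sum()) * (z @ w @ z) / (z @ z)

# A smooth north-south trend on a 4x4 grid: strongly clustered values
coords = [(i, j) for i in range(4) for j in range(4)]
trend = [i for i, j in coords]

print(morans_i(trend, coords) > 0)  # → True: positive I means clustering
```

Values near zero (more precisely, near -1/(n-1)) would instead indicate no spatial autocorrelation, which is the outcome you want for OLS residuals.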

If the residuals had been clustered, the next thing to do would be to check scatter plots to see if there are other independent variables that are correlated with the dependent variable. I found DON and DOC were correlated, so I decided to add DON as the second independent variable. I expected to see a correlation between DOC and the percentage of each land use type in each catchment; however, there was no strong relationship between them.

As I was running OLS, I realized I had to set the numeric field type to DOUBLE.

DOC vs. DON

DOC vs. Land Use Types

OLS was run again, and the report showed an improvement in the OLS model: the adjusted R2 value increased to 0.947 and the AICc decreased to -7.67. The term 2k(k+1)/(n-k-1) in the AICc equation is a positive value (i.e., 0.67). When I plugged in the other values, L turned out to be about 477. The L with one independent variable (absorbance at 254 nm) was about 0.097.

I decided to use both absorbance at 254 nm and DON because I am satisfied with improved values of adjusted R2 and AICc. And the sign of coefficients are both positive as expected from previous scatter plots.

I ran Moran’s I on the new residuals and found no clustering. Now it was time for me to check other statistical results from OLS.

OLS result 2

There are two statistical terms that tell whether each independent variable is statistically significant: Probability and Robust Probability. If the Koenker statistic is significant (look for the asterisk), there are regional variations and only the robust probability can be trusted. My report showed a significant Koenker statistic. This makes sense because I have regional variations among sampling sites. Clustered sampling sites around the H.J. Andrews Experimental Forest represent undisturbed forest. In Corvallis, clustered sampling sites represent urban and agricultural land use.

Additionally, I have to check if those independent variables are redundant in describing the dependent variable by examining Variance Inflation Factor (VIF). This value needs to be less than 7.5. If it is greater than 7.5, I have to remove the corresponding independent variable. Absorbance at 254 nm and DON were not redundant as their VIFs were about 3.1.
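VIF can be computed by regressing each predictor on the others: VIF_j = 1/(1 - R²_j). The sketch below uses synthetic predictors (not my water quality data) to show how a nearly collinear pair gets flagged against the 7.5 rule of thumb:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        # Regress column j on the remaining columns plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(1)
a = rng.random(50)
b = rng.random(50)            # independent of a -> VIFs near 1
c = a + 0.1 * rng.random(50)  # nearly collinear with a -> huge VIFs

print([round(v, 1) for v in vif(np.column_stack([a, b]))])   # both near 1.0
print(all(v > 7.5 for v in vif(np.column_stack([a, c]))))    # → True: redundant pair
```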

Lastly, the residuals should be normally distributed; non-normally distributed residuals indicate that the model is biased. You can create a histogram or check whether the Jarque–Bera statistic is significant in the report output (a significant value indicates non-normal residuals, i.e., a biased model). My model was not biased.

The histogram of residuals

2. GWR

Although Koenker Statistics from OLS suggested there was a regional variance, the result of a regional model of GWR did not improve from a global model of OLS.

GWR result

GWR generates a layer for each independent variable. Standard Error of Coefficient describes which part of the study area is significant for the particular independent variable. As you can see, there is no variation in Standard Error of Coefficient for either absorbance at 254 nm or DON. That is probably why the AICc and adjusted R2 of the OLS and GWR models turned out to be the same.

Abs 254 nm, Std. Error of Coefficient

DON, Std. Error of Coefficient

 

3. Next steps

a) Investigate why the GWR result did not improve on the OLS result even though the Koenker statistic suggested it should.

b) Create a layer with stream networks (using Generate Network Spatial Weights?), so I can use it as a spatial weighting for GWR to improve GWR model.

c) Consider interpolating my point features first. My inputs were all point features; I wonder if it would be beneficial to interpolate first.

d) Find out if it is a problem to run OLS and GWR when my sampling sites are clustered.

 

Where did the time go ?! Seems like we just began this term and it is already over. While I feel pretty unproductive right now, especially after seeing the outstanding work my fellow grad students have accomplished, I am extremely appreciative of the opportunity to explore the statistical tools available in ArcGIS and elsewhere that this course has provided. I found the freedom to explore our own data sets and the advice and encouragement provided by Julia and everyone else in the class incredibly rewarding. Thanks Julia for a great class!

I began the term with a pretty undefined thesis project (it’s still fairly fuzzy, but at least I can start to make out a faint outline now). My data set is small and limited to a single survey season. As depicted in Figure 1, I have GPS track logs and the encounter locations for 20 cetacean surveys conducted in the Marquesas Islands of French Polynesia.

Figure 1. Map of the study area in the Marquesas Islands of French Polynesia showing the extent of surveys and the location of cetaceans encountered during small boat surveys in 2012.

My primary focus is on a poorly understood species,  melon-headed whales (Peponocephala electra).

Melon-headed whales courtesy of ARKive, http://www.arkive.org/melon-headed-whale/peponocephala-electra/image-G85523.html.

These members of the dolphin family form very large “herds” (50  to as many as 1000 individuals) and have been observed congregating near the shore of Nuku Hiva in very specific locations on a regular, daily basis (Figure 2).

Figure 2. Encounter locations for melon-headed whales in the Marquesas Islands. Many of these locations are identical to those documented by Gannier in the mid-90s.

I spent most of this term finding, accessing, downloading, importing, reclassifying, converting, re-downloading, cussing at, etc., etc.,  environmental and bathymetric data. Using this data and other environmental data such as information on currents and sea surface height, I hope to investigate the differences and similarities between melon-headed whale encounter locations in order to 1) characterize these resting/socializing areas and 2) develop a model to predict possible resting/socializing locations in areas that have not been surveyed.

I was able to explore some of the tools in the Spatial Statistics Toolbox but for this data many of the tools are not applicable. For example, Ordinary Least Squares and Geographically Weighted Regression assume that there is linearity in the data. My data does not show linearity, even after transformation, and my response variable is not continuous. Running the Average Nearest Neighbor Tool produced the results that one would predict after looking at the maps provided above – the encounter locations are more clustered than predicted (Figure 3).

Figure 3. Results of a Nearest Neighbor analysis for melon-headed whales. As predicted, encounter locations appear to be more clustered than expected by chance.

All of these results brought me to a point where I needed to take a step back and reexamine my data and my objectives. I felt like I was attempting to ask questions that just aren’t going to be answered by my data. My main question involves the characteristics of the encounter locations that define melon-headed whale resting locations. To get at this question, I plan on defining encounter locations spatially, i.e. delineate polygons of a certain size around encounter locations, and statistically examine the similarities and differences between the polygons using the environmental and oceanographic data mentioned above. I will continue to explore the tools available in ArcMap as well as the plethora of non-ArcGIS tools to answer this question.

I set out this term to delve into a vegetation dataset.  It consists of nearly 2000 veg surveys in seven salt marshes across the Pacific Northwest.  My goal is to be able to predict how vegetation will change with shifts in climate and increasing sea level.  Given the diversity of wildlife that utilize estuaries in various stages of their life cycle, understanding how habitat will respond is critical to developing conservation plans.  To achieve this, I broke the problem into three stages:

1: Identify vegetation communities from field data

2: Create a habitat suitability model for each community under current conditions

3: Use habitat suitability model to project for changes in climate and sea level to estimate community response

 

Due to the large number of species identified (42), I first needed to reduce the dataset to only the most common species. I chose to use only the species found in more than 5% of all the survey plots, which left me with 16 species. I then explored how to best combine these species into communities.  By modeling communities rather than species, I am assuming each species within a community will respond the same way. Given that salt marsh vegetation is generally stratified by elevation, this is a reasonable assumption to begin with, but one that I will need to revisit in the future.  To determine the communities, I used canonical correspondence analysis (CCA), which can be thought of as a Principal Component Analysis for categorical data.  I defined the niche of the communities using 5 environmental variables: elevation (standardized for tidal range), mean daily flooding frequency, distance to channel, distance to bay, and channel density.  The resulting CCA graph:

cca_results

I then used a script in R to determine the optimum number of clusters given the CCA results by minimizing the within-cluster sum of squares.  Using the following graph, and my own interpretation of the CCA results, I settled on using 5 communities.
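The elbow approach behind that graph can be sketched with a plain k-means and a within-cluster sum of squares (WSS) curve. This Python version uses two-cluster toy data standing in for the real CCA scores, so it is only an illustration of the method, not the original R script:

```python
import numpy as np

def kmeans_wss(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns the total within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

# Toy 2-D "ordination scores": two tight clusters around (0, 0) and (3, 3)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(3, 0.2, (30, 2))])

# WSS for k = 1..5; the elbow sits where adding clusters stops paying off
wss = [kmeans_wss(X, k) for k in range(1, 6)]
print(wss[0] > 10 * wss[1])  # → True: a large drop from 1 to 2, then it flattens
```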

k_means_cluster_graph

This figure shows the survey plot locations, coded by community. Notice the differences in complexity across the sites (Bandon has many while Grays Harbor and Nisqually have fewer).

MaxEnt Predictors_Community_Locations

 

To create a continuous prediction of communities and develop a model to project climate responses, I chose to use the MaxEnt habitat suitability modeling tool.  Essentially, MaxEnt compares where a species (or community) occurs against the environment (background).  It creates response curves by extracting patterns while maximizing entropy (randomness).  MaxEnt can take continuous and categorical data as input, and the number of model parameters (few parameters = smoother response curves) can be controlled through the regularization value (1 is default).   You can also control which ‘features’ are used to create the response curves (linear, quadratic, product, hinge, threshold).  In an attempt to create a parsimonious model, I only used linear and hinge features, but left regularization set to 1.  Results from MaxEnt are logistically scaled (0 to 1).  Because I am modeling multiple communities in the same area, I needed a method for determining which community is predicted.  The simplest is to choose the community with the highest predicted value.  This hasn’t been done in the literature, due to issues with how presence data is usually collected. But because this dataset comes from standardized field surveys, and I’m using the same predictor layers for all communities, I’m presuming that using the maximum value is legitimate.  In addition to the 5 physical predictor layers from the CCA, I added 3 climatic layers to the model: annual precipitation, maximum temperature in August, and minimum temperature in July, each a 30-year average from the PRISM dataset.  Here are the predicted communities from MaxEnt:
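Picking the community with the highest predicted value is a per-pixel argmax across the stacked suitability rasters. The values below are invented for illustration (three communities over a 2x3 window):

```python
import numpy as np

# Hypothetical logistic-scale MaxEnt outputs, one layer per community
suitability = np.stack([
    np.array([[0.8, 0.2, 0.1], [0.4, 0.3, 0.6]]),   # community 0
    np.array([[0.3, 0.7, 0.2], [0.5, 0.1, 0.5]]),   # community 1
    np.array([[0.1, 0.4, 0.9], [0.2, 0.8, 0.3]]),   # community 2
])

# Classify each cell as the community with the highest predicted value
predicted = suitability.argmax(axis=0)
print(predicted.tolist())  # → [[0, 1, 2], [1, 2, 0]]
```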

MaxEnt_maximum_classification_2
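The maximum-value classification described above can be sketched in a few lines. The community names and prediction values below are made up, standing in for the per-community MaxEnt output surfaces:

```python
import numpy as np

# Hypothetical logistic-scale predictions for three communities on one 2x2 grid
preds = {
    "Gp1": np.array([[0.8, 0.2], [0.1, 0.6]]),
    "Gp2": np.array([[0.1, 0.7], [0.3, 0.3]]),
    "Gp3": np.array([[0.4, 0.5], [0.9, 0.2]]),
}

names = list(preds)                              # fixed community order
stack = np.stack([preds[n] for n in names])      # shape: (community, row, col)
classified = np.array(names)[stack.argmax(axis=0)]  # community with highest value
print(classified)                                # [['Gp1' 'Gp2'] ['Gp3' 'Gp1']]
```

In practice the `preds` dictionary would be loaded from the MaxEnt output rasters, all on the same grid, before taking the cell-wise argmax.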

 

I used two methods to assess the potential error in using the maximum predicted value for the community classification. First, I counted the number of communities at each location with a predicted value greater than 50%.  In the figure below, yellow indicates areas where no community has a predicted value over 50%, while green represents areas with one community over 50%.  The areas with higher community richness (2 or 3) are relatively small, which gives me more confidence in the maximum-value approach.

MaxEnt_community_richness

Second, I counted the number of communities within 25% of the maximum predicted value [max value – (max value * 0.25)].  This gives an indication of the separation in predicted values across communities. Here, yellow indicates areas where a single community is well separated from the other predicted communities, while green indicates areas where 2 communities have close predictions. Given the large proportion of yellow and green, I am again confident in using the maximum predicted value for community classification.

MaxEnt_community_richness_prevalence
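Both confidence checks reduce to counts over the stacked per-community predictions. A minimal numpy sketch, with made-up values standing in for the MaxEnt surfaces:

```python
import numpy as np

stack = np.array([            # shape: (community, row, col); toy values
    [[0.8, 0.2], [0.1, 0.6]],
    [[0.1, 0.7], [0.3, 0.3]],
    [[0.4, 0.6], [0.9, 0.2]],
])

# Check 1: number of communities with a predicted value over 50% in each cell
richness = (stack > 0.5).sum(axis=0)

# Check 2: number of communities within 25% of the cell maximum,
# i.e. values >= max value - (max value * 0.25)
cutoff = stack.max(axis=0) * (1 - 0.25)
near_max = (stack >= cutoff).sum(axis=0)

print(richness)   # 0 = the yellow class in the richness map
print(near_max)   # 1 = a single, clearly separated community
```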

Here are the ROC AUC curves. AUC is a measure of model fit, with 1 being perfect and 0.5 random.  All models except Gp2 show relatively good fit (over 0.75 is usually deemed a worthwhile model).  The species within Gp2 are the most common generalists, and I would not have expected MaxEnt to model this community very well.  As I pursue this further, I will likely split up Gp2 in an effort to produce better community classifications.

AUC curves

I have several next steps for continuing to develop this model.  First, I would like to include vegetation data from 7 California salt marshes in order to better capture the environmental variation along the coast.  Developing elevation response models for each site is necessary in order to project this model under climate-change and sea-level-rise scenarios.  I would also like to explore additional environmental layers, such as soil type and distance to the ocean mouth (a salinity proxy), to further refine the defined niche.

Humpback whales feed in the temperate high latitudes along the North Pacific Rim from California to Japan during the spring, summer, and fall and migrate in the winter to the near-tropical waters of Mexico, Hawaii, Japan, and the Philippines to give birth and mate (Calambokidis et al. 2001; Calambokidis et al. 2008).  Although whales show strong site fidelity to feeding and breeding grounds, genetic analysis of maternally inherited DNA (mitochondrial DNA or mtDNA) reveals greater mixing of individuals on the feeding grounds (Baker et al. 2008).  This mixing makes it difficult to determine regional population structure and may complicate management decisions.  For example, should the feeding grounds be managed as one population unit or is there evidence to suggest that more than one management unit is present?  If more than one, are they affected differently by coastal anthropogenic activities, and therefore, require population specific management strategies?

With this in mind, I decided to explore the spatial pattern of humpback whales from the Western and Northern Gulf of Alaska, a subset of data collected during the SPLASH Project (Structure of Populations, Levels of Abundance, and Status of Humpbacks; http://www.cascadiaresearch.org/splash/splash.htm).  Specifically, I am interested in the following questions:

  1. Do whales form clusters? Do whales that are more closely related (have the same mtDNA haplotype) cluster together?
  2. Are there spatial patterns in whale distribution based on depth?  Do more closely related whales cluster together based on depth?
  3. Are there spatial patterns in whale distribution based on slope?  Do more closely related whales cluster together based on slope?

The bathymetry layer, GEBCO_08 Grid, version 20091120 (http://www.gebco.net), was used for the depth and slope analyses in questions 2 and 3.  Depth values were extracted using the ArcGIS 10.1 Extract Values to Points tool within the Spatial Analyst Toolbox.  Slope values were derived from the bathymetry data using the ArcGIS 10.1 Slope tool, then extracted with the same Extract Values to Points tool.
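Outside ArcGIS, the same Slope-then-Extract workflow can be approximated with numpy. The toy depth grid, cell size, and point locations below are assumptions for illustration, not the GEBCO data:

```python
import numpy as np

depth = np.array([[-10., -20.,  -30.],
                  [-10., -40.,  -70.],
                  [-10., -60., -110.]])   # toy bathymetry grid (metres)
cellsize = 100.0                          # assumed grid spacing in metres

# Slope as the magnitude of the surface gradient, converted to degrees
ddy, ddx = np.gradient(depth, cellsize)   # derivatives along rows and columns
slope_deg = np.degrees(np.arctan(np.hypot(ddx, ddy)))

# "Extract Values to Points": look up the cell under each sighting,
# given here directly as (row, col) indices for simplicity
points = [(0, 0), (1, 1), (2, 2)]
for r, c in points:
    print(depth[r, c], round(slope_deg[r, c], 2))
```

The ArcGIS Slope tool uses a 3x3 neighborhood (Horn's method), so its values will differ slightly from this central-difference sketch, but the idea is the same.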

**Results presented here are strictly for the purposes of exploring the functionality of the ArcGIS tools found in the Spatial Statistics Toolbox.  They should be considered preliminary and should not be reproduced elsewhere.**

 

Part 1: Average Nearest-Neighbor Analysis

This tool is based on the null hypothesis of complete spatial randomness and calculates a nearest neighbor index based on the average distance from each feature to its nearest neighboring feature.  The nearest-neighbor ratio is calculated as the Observed Mean Distance divided by the Expected Mean Distance and has a value of 1 under complete spatial randomness.  Values greater than 1 indicate a dispersed pattern, while values less than 1 indicate a clustered pattern.
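As a sketch, the tool's calculation can be reproduced directly, using the expected-distance and standard-error formulas given in the ArcGIS documentation, with a user-supplied fixed study area; the points and area below are invented:

```python
import numpy as np

pts = np.array([[0., 0.], [1., 0.], [0., 1.], [10., 10.], [11., 10.]])
area = 400.0                                   # fixed study area, rather than the
                                               # minimum enclosing rectangle

d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
np.fill_diagonal(d, np.inf)                    # ignore self-distances
obs = d.min(axis=1).mean()                     # observed mean NN distance

n = len(pts)
exp = 0.5 / np.sqrt(n / area)                  # expected under complete randomness
se = 0.26136 / np.sqrt(n * n / area)
ratio = obs / exp                              # < 1 suggests clustering
z = (obs - exp) / se
print(round(ratio, 3), round(z, 3))
```

Because `area` is an explicit input here, rerunning with the same fixed study area for every haplotype (as suggested below) is trivial.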

 

Clustering in whales

 

| Haplotype | n   | Obs Mean Dist (m) | Exp Mean Dist (m) | N-N Ratio | Z-score | p-value | Pattern   |
|-----------|-----|-------------------|-------------------|-----------|---------|---------|-----------|
| All       | 788 | 1913.54           | 19537.25          | 0.0979    | -48.443 | 0.00000 | Clustered |
| A-        | 202 | 5962.24           | 33738.56          | 0.1767    | -22.385 | 0.00000 | Clustered |
| A+        | 220 | 8753.78           | 34046.85          | 0.2571    | -21.080 | 0.00000 | Clustered |
| A3        | 91  | 7404.13           | 34929.24          | 0.2120    | -14.381 | 0.00000 | Clustered |
| E1        | 73  | 16789.95          | 50701.04          | 0.3312    | -10.932 | 0.00000 | Clustered |
| E3        | 46  | 12843.30          | 34918.80          | 0.3679    | -8.203  | 0.00000 | Clustered |
| F2        | 83  | 14402.10          | 51259.82          | 0.2810    | -12.532 | 0.00000 | Clustered |

The output of this tool indicates that whales, regardless of mtDNA haplotype, are significantly clustered in the Western and Northern Gulf of Alaska.  This result is not entirely surprising, given that humpback whales tend to form small groups on the feeding grounds.  However, the results of this tool are very sensitive to changes in the study area, so it is best run with a fixed study area.  That was not done for the current analysis; instead, the area of the minimum enclosing rectangle around the input features was used, and this area varied for each haplotype.

Based on the results it seems the average nearest neighbor tool may not be the most appropriate tool for discovering spatial patterns in humpback whales.  However, it would be worth running the tool again using a fixed study area before discarding its utility for this data set completely.

Alternatively, it would be worth conducting a refined nearest-neighbor analysis for each mtDNA haplotype, in which the test statistic is the complete distribution function of all observed nearest-neighbor distances (not just the mean nearest-neighbor distance), evaluated at a specified distance, to test for complete spatial randomness.  This method is not currently available within the ArcGIS Spatial Statistics Toolbox and would need to be conducted in another software package such as R.

 

Part 2: Hot Spot Analysis

This tool uses the Getis-Ord Gi* statistic to identify statistically significant hot spots (clusters of high values) and cold spots (clusters of low values) given a set of weighted features.   For each feature in the data set, a Gi* statistic is returned as a z-score.  The larger the positive z-score, the more intense the clustering of high values (hot spot).  The smaller the negative z-score, the more intense the clustering of the low values (cold spot).
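To make the z-score interpretation concrete, here is a hand-rolled Gi* using binary distance-band weights that include the focal feature itself (the defining difference between Gi* and Gi). The coordinates, attribute values, and distance band are toy numbers, not the whale data:

```python
import numpy as np

xy = np.array([[0., 0.], [1., 0.], [0., 1.], [10., 10.], [11., 10.]])
x = np.array([100., 90., 95., 10., 5.])        # e.g. depth at each point
band = 2.0                                     # neighborhood distance threshold

d = np.linalg.norm(xy[:, None] - xy[None, :], axis=2)
w = (d <= band).astype(float)                  # binary weights; w[i, i] = 1 (Gi*)

n = len(x)
xbar, s = x.mean(), x.std()                    # global mean and population std
z = np.empty(n)
for i in range(n):
    sw, sw2 = w[i].sum(), (w[i] ** 2).sum()
    num = (w[i] * x).sum() - xbar * sw         # local sum vs. its expectation
    den = s * np.sqrt((n * sw2 - sw ** 2) / (n - 1))
    z[i] = num / den
print(np.round(z, 2))   # large positive = hot spot, large negative = cold spot
```

With these toy values, the three high-valued points near the origin come out as a hot cluster and the two low-valued points as a cold one.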

HotspotScale

Figure 1. The output scale for the hot spot analysis tool.  When interpreting the results, it is useful to remember that a feature may be mapped bright red not because its value is particularly large, but because it is part of a spatial cluster of high values.  Conversely, a feature may be mapped bright blue not because its value is particularly small, but because it is part of a spatial cluster of low values.  Thus, the more positive the z-score, the hotter the hot spot (darker red), and the more negative the z-score, the colder the cold spot (darker blue).

 

Spatial patterns in whale distribution based on depth

 

Hotspot_Depth_HapsOnly

Figure 2. Results of hot spot analysis for all whales (n=799) based on depth (m), no mtDNA considered.

 

The output from this tool shows the presence of several hot and cold spots regardless of mtDNA haplotype (Figure 2). The hot spots (red) indicate that whales in these areas occur at shallower depths and the results are statistically significant.   There are also several statistically significant cold spots (blue) where whales are found at deeper depths, often beyond the continental shelf.

 

Hotspot_Depth_ByHap

Figure 3. Results of hot spot analysis by haplotype based on depth (m).

 

The output by haplotype also shows the presence of several hot and cold spots, although the location of each varies by haplotype (Figure 3).  The A+ and A- haplotypes show statistically significant hot spots in the Northern Gulf of Alaska, while the E1 and F2 haplotypes show a less intense, though still significant, cluster in the same region.  The E1 haplotype also shows a significant hot spot in the Western Gulf of Alaska.  These hot spots reflect whales clustering by haplotype at shallower depths.  The A3 and E3 haplotypes show relatively little clustering: no hot spots and a very small cold spot in the western region.  In general, for all haplotypes, cold spots are located in the western region or beyond the continental shelf, where whales cluster at deeper depths.

 

Spatial patterns in whale distribution based on slope

 

Hotspot_Slope_HapsOnly

Figure 4. Results of hot spot analysis for all whales (n=799) based on slope (degrees), no mtDNA considered.

 

The output from this tool shows the presence of several hot and cold spots regardless of mtDNA haplotype (Figure 4). The hot spots (red) indicate that whales in these areas occur at steeper slopes and the result is statistically significant.   There are also several statistically significant cold spots (blue) where whales are found at flatter slopes.

 

Hotspot_Haplotype_Slope

Figure 5. Results of hot spot analysis by haplotype based on slope (degrees).

 

The output by haplotype also shows the presence of several hot and cold spots, although the location of each varies by haplotype (Figure 5).  The A+, A-, A3 and F2 haplotypes show statistically significant hot spots in the Northern Gulf of Alaska while the A3, F2, and E3 (to a lesser extent) haplotypes also show hot spots in the western region.  These hot spots reflect whales clustering by haplotype at steeper slopes.  The A+, A-, A3, and F2 haplotypes have statistically significant cold spots in the northern region, while a cold spot for the E1 haplotype occurs in the western region.   These cold spots reflect whales clustering by haplotype at flatter slopes.

Reflecting on my results, I initially thought perhaps the hot/cold spot patterns found might be influenced by the uneven sampling effort and differences in sample size.  However, on 23 May 2013 Lauren Scott from Esri commented on this very subject in response to a posting by Jen Bauer (http://blogs.oregonstate.edu/geo599spatialstatistics/2013/04/24/discerning-variables-spatial-patterns-within-a-clustered-dataset/#comment-1393).  Lauren stated that even if sampling is uneven (e.g., many samples are taken from some areas, while fewer samples are taken at others), the impact to the results of a hot spot analysis will be minimal.  She provided the following for further clarification.   In areas with many samples, the tool will have more information to compute its result.  The tool will “compare the local mean based on lots of samples to the global mean based on ALL samples for the entire study area and decide if the difference is statistically significant or not”.  In areas with fewer samples, “the local mean will be computed from only a few observations/samples… the tool will compare the local mean (based on only a few pieces of information) to the global mean (based on ALL samples) and determine if the difference is significant”.  Thus, my concern seems to be unwarranted.

 

In general, the hot spot tool seems to be more useful than the average nearest neighbor tool for the humpback whale data set used here.  Statistically significant clustering of whales occurs with and without consideration of mtDNA for both depth and slope.  Although preliminary, the results from this tool highlight areas for further investigation using additional spatial analysis techniques.

 

Challenges discovered with the ArcGIS Spatial Statistics Toolbox

My biggest challenges using the ArcGIS Spatial Statistics Toolbox are twofold.  First, many of the tools require a numeric variable (either continuous or discrete) and do not support categorical variables, such as mtDNA haplotype, “out of the box”.  Thus, in order to look for spatial patterns in haplotypes, I had to split the data up by haplotype, create a separate feature class for each haplotype, and run the tool several times to get my results.  Given that I was working with a small data set, the repetition was relatively painless, but it would be useful to have this process automated (perhaps using ModelBuilder or Python scripting).  Not only would this speed up processing, it would also eliminate human-induced error.  Second, the hot spot analysis only accepts one variable at a time.  What if one suspected that the spatial pattern of humpback whales (with or without mtDNA consideration) is related to depth and another environmental variable (e.g., sea surface temperature, productivity, or currents)?  I believe this type of analysis would need to be conducted in another software package such as R.
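The split-by-haplotype repetition could be automated with a few lines of scripting. The records and the grouping step below are a hypothetical, software-agnostic sketch; in practice each group would be written out to its own feature class before calling the ArcGIS tool on it:

```python
from collections import defaultdict

# Hypothetical sighting records with a categorical haplotype field
records = [
    {"haplotype": "A-", "x": 1.0, "y": 2.0},
    {"haplotype": "A+", "x": 3.0, "y": 4.0},
    {"haplotype": "A-", "x": 5.0, "y": 6.0},
]

groups = defaultdict(list)
for rec in records:
    groups[rec["haplotype"]].append(rec)     # one bucket per haplotype

for hap, recs in groups.items():
    # here one would export recs to a feature class and run the hot spot tool;
    # this sketch just reports the group sizes
    print(hap, len(recs))
```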

~~~~~~~~~~~~~~~~~~~~~

Baker, C. S., D. Steel, J. Calambokidis, J. Barlow, A. M. Burdin, P. J. Clapham, E. Falcone, J. K. B. Ford, C. M. Gabriele, U. González-Peral, R. LeDuc, D. Matilla, T. J. Quinn, L. Rojas-Bracho, J. M. Straley, B. L. Taylor, J. Urbán Ramírez, M. Vant, P. R. Wade, D. Weller, B. H. Witteveen, K. Wynne, and M. Yamaguchi. 2008. geneSPLASH: an initial, ocean-wide survey of mitochondrial (mt) DNA diversity and population structure among humpback whales on the North Pacific. Final Report for contract 2006-0093-008, submitted to National Fish and Wildlife Foundation.

Calambokidis, J., E.A. Falcone, T. J. Quinn, A. M. Burdin, P. J. Clapham, J. K. B. Ford, C. M. Gabriele, R. LeDuc, D. Mattila, L. Rojas-Bracho, J. M. Straley, B. L. Taylor, J. Urbán, D. Weller, B. H. Witteveen, M. Yamaguchi, A. Bendlin, D. Camacho, K. Flynn, A. Havron, J. Huggins, N. Maloney, J. Barlow, and P. R. Wade. 2008. SPLASH: Structure of Populations, Levels of Abundance and Status of Humpback Whales in the North Pacific. Final report for Contract AB133F-03-RP-00078 from U.S. Dept of Commerce.

 Calambokidis, J., G.H. Steiger, J. M. Straley, L. M., Herman, S. Cerchio, D. R. Salden, U. R.  Jorge, J. K. Jacobsen, O. V. Ziegesar, K. C. Balcomb, C. M. Gabriele, M. E. Dahlheim, S. Uchida, G. Ellis, Y. Mlyamura, P. de guevara Paloma Ladrón, M. Yamaguchi, F. Sato, S. A. Mizroch, L. Schlender, K. Rasmussen, J. Barlow, and T. J. Q. Ii. 2001. Movements and population structure of humpback whales in the North Pacific. Marine Mammal Science. 17:769–794.

 

South Sister Watershed. Log Drive layer created by Rebecca Miller (see Miller, 2010).

I spent the majority of this class in the preliminary stages of my research.  While working through the nearest neighbor and hot spot analyses following the “Spatial Pattern Analysis of Dengue Fever” exercise was helpful, I did not analyze my own data.  Nonetheless, I do intend to use spatial statistical tools to analyze the data I collect for my summer research, described below.

This summer, I’ll be working with the BLM to assess stream temperature in South Sister Creek.  In 2009, the BLM placed several stream enhancement structures (~90 according to some reports) in South Sister, a 3rd- to 4th-order tributary to the Smith River in the Oregon Coast Range, nested in the Umpqua Basin.  The creek serves as spawning ground for Oregon Coastal coho salmon (ESA listed), steelhead trout, and Pacific lamprey.  Over time, the creek has been degraded by human use.  Recent work verified that the creek was used as a log drive to transport logs during the 19th and 20th centuries, which may have contributed to the simplification of the stream (Miller, 2010).  Stream cleaning, a restoration practice that removed log jams from the stream, also reportedly occurred along the stream in the 1980s (Bureau of Land Management, 2009).

In 2006, 2011, and 2012 the BLM collected data from 9 temperature gages along the stream and its tributaries during the summer months (from mid or late June through September).  This summer, those same sites will be monitored (see Figure 1).  The BLM is interested in whether or not its efforts made a detectable difference in stream temperature.  However, the data collection record is not long enough to detect a change from a temporal perspective.

Nonetheless, there are interesting questions to be asked regarding the relative influence of the enhancement structures, riparian and topographic shade, and substrate on stream temperature at fine spatial scales.  Previous research has indicated the importance of several variables on stream temperature, with surface and groundwater inputs, riparian shade/solar inputs, discharge, and hyporheic exchange exerting moderate to high influence (Poole & Berman 2001).  Large wood in streams has been linked to increased channel and stream-bed complexity, which may be indicative of hyporheic exchange (Arrigoni et al. 2008).

Specifically, the following questions are being asked:

  1. What is the spatial and temporal variability of temperature at a small scale?
  2. How well are the current stationary data loggers (i.e., HOBO Tidbits) representing the ambient fine-scale patterns of stream temperature?
  3. Which factors dominate or explain the most spatial variation in stream temperature at a fine scale?
    1. Log jam density
    2. Presence or absence of alluvial substrate (a proxy for hyporheic exchange)
    3. Topographic shading
    4. Riparian shading

Figure 2 WRS protocol

Over the past few weeks I’ve been working on a research design to help answer these questions.  To map the heterogeneity and relative influence of log jam density, shade, and substrate on the spatial distribution of stream temperature, the study employs a stratified sampling approach designed to capture as many combinations as possible of the variables outlined in Figure 2 at varying elevations of the stream.  LiDAR data and a shade modeling program will be used to identify areas that are most likely to have 10-2 shade (both topographic and vegetation).  An analysis of log jam density (from previously gathered GPS points taken at the right bank of each in-stream log jam) will be undertaken in ArcMap to divide the study area into reaches that are densely, moderately, and sparsely populated by jams.  Data for the final stratification layer, streambed substrate, will be gathered during the site layout and field mapping phase of the protocol, due to the lack of a priori, spatially explicit knowledge of stream-bed condition (alluvial or bedrock).  The field map will be made during the first survey and used as a reference for locating temperature sampling sites and processing data.
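The stratification can be enumerated up front, so it is clear which combinations remain to be covered in the field. The strata levels below are illustrative guesses, since Figure 2 is not reproduced here:

```python
from itertools import product

# Hypothetical levels for each stratification variable
jam_density = ["dense", "moderate", "sparse"]
substrate = ["alluvial", "bedrock"]
shade = ["high", "low"]

# Every combination of strata defines a target class for site selection
strata = list(product(jam_density, substrate, shade))
print(len(strata))          # 3 * 2 * 2 = 12 combinations to try to cover
for combo in strata[:3]:
    print(combo)
```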

Once my data are collected, I would like to explore the use of Geographically Weighted Regression; however, much of my data will be categorical, which may mean GWR has to be avoided.  I welcome any comments on what types of analysis I should consider that might also inform my study design.
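One common workaround when a regression tool balks at categorical predictors is dummy (one-hot) coding, so each category becomes a 0/1 column. A toy example with an assumed substrate variable:

```python
import numpy as np

substrate = ["alluvial", "bedrock", "alluvial", "bedrock"]  # made-up field data
levels = sorted(set(substrate))                  # ['alluvial', 'bedrock']

# Drop the first level so the dummies are not perfectly collinear
# with a regression intercept
dummies = np.array([[1.0 if s == lev else 0.0 for lev in levels[1:]]
                    for s in substrate])
print(dummies.ravel())       # 0/1 indicator column for 'bedrock'
```

Whether dummy-coded predictors behave sensibly in GWR specifically is a separate question (local subsamples can lack some categories entirely), so this is only a starting point.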


 

Arrigoni, A. S., Poole, G. C., Mertes, L. A. K., O’Daniel, S. J., Woessner, W. W., & Thomas, S. A. (2008). Buffered, lagged, or cooled? Disentangling hyporheic influences on temperature cycles in stream channels. Water Resources Research, 44(9). doi:10.1029/2007WR006480

Bureau of Land Management. (2009). South Sisters and Jeff Creek Stream Enhancement Project; Phase IV. Coos Bay: Bureau of Land Management.

Miller, R. (2010). Is the Past Present? Historical Splash-dam Mapping and Stream Disturbance Detection in the Oregon Coastal Province. Oregon State University. Retrieved 2013.

Poole, G. C., & Berman, C. H. (2001). An Ecological Perspective on In-Stream Temperature: Natural Heat Dynamics and Mechanisms of Human-Caused Thermal Degradation. Environmental Management, 27(6), 787–802.

 

 

  • R: code to assess spatial independence of residuals of a model

This code is useful to check if the assumption of (spatial) independence between residuals of a model is met.

Useful because it can be applied to the residuals of any model we run

It generates a plot of bubbles located according to the X-Y coordinates of the corresponding residuals. The size of each bubble represents the magnitude of the residual, while the color indicates whether it is positive or negative. If the residuals are spatially independent, no pattern should be observed in the arrangement of the bubbles. NOTE: requires the "sp" package.

library(sp)                                      # Load the sp package
Residuals <- resid(model)                        # Vector of residuals from the model
mydata <- data.frame(Residuals, RawData$XCoords, RawData$YCoords)  # Data frame with the residuals and their coordinates (from the dataset on which the model was run)
colnames(mydata) <- c("Residuals", "X", "Y")     # Assign names to the columns
coordinates(mydata) <- c("X", "Y")               # Identify which columns hold the coordinates

bubble(mydata, "Residuals", col = c("black", "grey"), xlab = "X-coordinates", ylab = "Y-coordinates")  # Plot the bubbles

exBubbles

 

  • ArcGIS: Iterator option in Model Builder

This option lets you run the same function over every file in a folder, freeing the user from repetitive tasks.

ex_iterator