I set out this term to delve into a vegetation dataset.  It consists of nearly 2000 vegetation surveys in seven salt marshes across the Pacific Northwest.  My goal is to be able to predict how vegetation will change with shifts in climate and increasing sea level.  Given the diversity of wildlife that utilize estuaries in various stages of their life cycle, understanding how habitat will respond is critical to developing conservation plans.  To achieve this, I broke the problem into three stages:

1: Identify vegetation communities from field data

2: Create a habitat suitability model for each community under current conditions

3: Use the habitat suitability model to project community response under changes in climate and sea level


Due to the large number of species identified (42), I first needed to reduce the dataset to only the most common species. I chose to use only the species found in more than 5% of all the survey plots, which left me with 16 species. I then explored how best to combine these species into communities.  By modeling communities rather than species, I am assuming each species within a community will respond the same way. Given that salt marsh vegetation is generally stratified by elevation, this is a reasonable assumption to begin with, but one that I will need to revisit in the future.

To determine the communities, I used canonical correspondence analysis (CCA), which can be thought of as a Principal Component Analysis for categorical data.  I defined the niche of the communities using 5 environmental variables: elevation (standardized for tidal range), mean daily flooding frequency, distance to channel, distance to bay, and channel density.  The resulting CCA graph:

[Figure: CCA ordination results]
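As a minimal sketch of this workflow (assuming a plots-by-species cover matrix veg and an environmental data frame env, with hypothetical column names), the filtering and ordination could be done in R with the vegan package:

```r
# Minimal sketch: filter to common species, then run a CCA with vegan.
# 'veg' is a hypothetical plots-by-species matrix of cover values; 'env' a
# data frame of the five environmental variables (hypothetical column names).
library(vegan)

# Keep only species present in more than 5% of survey plots
freq <- colSums(veg > 0) / nrow(veg)
veg_common <- veg[, freq > 0.05]

# Constrained ordination against the environmental variables
cca_fit <- cca(veg_common ~ elev_std + flood_freq + dist_channel +
                 dist_bay + channel_density, data = env)
plot(cca_fit)
```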

I then used a script in R to determine the optimum number of clusters given the CCA results by minimizing the within-cluster sum of squares.  Using the following graph, and my own interpretation of the CCA results, I settled on 5 communities.

[Figure: within-cluster sum of squares by number of k-means clusters]
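The cluster search itself is a standard elbow analysis; a minimal sketch in R, assuming the cca_fit object from the sketch above:

```r
# Minimal sketch of the elbow method: total within-cluster sum of squares
# from k-means for k = 1..10, computed on the CCA site (plot) scores.
library(vegan)

scores_cca <- scores(cca_fit, display = "sites")
wss <- sapply(1:10, function(k) {
  kmeans(scores_cca, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```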

This figure shows the survey plot locations, coded by community. Notice the differences in complexity across the sites: Bandon has many communities, while Grays Harbor and Nisqually have fewer.

[Figure: MaxEnt predictors and community locations by site]


To create a continuous prediction of communities and develop a model to project climate responses, I chose to use the MaxEnt habitat suitability modeling tool.  Essentially, MaxEnt compares where a species (or community) occurs against the environment (background).  It creates response curves by extracting patterns while maximizing entropy (randomness).  MaxEnt can take continuous and categorical data as input, and the number of model parameters (fewer parameters = smoother response curves) can be controlled through the regularization value (1 is the default).  You can also control which 'features' are used to create the response curves (linear, quadratic, product, hinge, threshold).  In an attempt to create a parsimonious model, I used only linear and hinge features, but left regularization set to 1.  Results from MaxEnt are logistically scaled (0 to 1).

Because I am modeling multiple communities in the same area, I needed a method for determining which community is predicted.  The simplest is to choose the community with the highest predicted value.  This hasn't been done in the literature, due to issues with how presence data are usually collected.  But because this dataset comes from standardized field surveys, and I'm using the same predictor layers for all communities, I'm presuming that using the maximum value is legitimate.  In addition to the 5 physical predictor layers from the CCA, I added 3 climatic layers to the model: annual precipitation, maximum temperature in August, and minimum temperature in July, each a 30-year average from the PRISM dataset.  Here are the predicted communities from MaxEnt:

[Figure: MaxEnt maximum-value community classification]
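As a rough sketch of this maximum-value classification (assuming each community's logistic output has been exported as a GeoTIFF; the folder and file pattern are hypothetical), the raster package makes it nearly a one-liner:

```r
# Minimal sketch: classify each cell by the community with the highest
# MaxEnt logistic prediction.
library(raster)

# One layer per community (hypothetical output folder)
comm_stack <- stack(list.files("maxent_outputs", pattern = "\\.tif$",
                               full.names = TRUE))

# which.max() returns, per cell, the index of the layer with the largest
# value, i.e., the predicted community
community_map <- which.max(comm_stack)
plot(community_map)
```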


I used two methods to assess the potential error in using the maximum predicted value for the community classification. First, I counted the number of communities in each location with a predicted value greater than 50%.  In the figure below, yellow indicates areas where no community has a predicted value over 50%, while green represents areas with one community over 50%.  The areas with higher community richness (2 or 3) are relatively small, which gives me more confidence in this method.

[Figure: community richness, i.e., number of communities with a predicted value over 50%]
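This first check is easy to reproduce; a minimal sketch, reusing the comm_stack object from the classification sketch above:

```r
# Count, per cell, how many communities have a predicted value above 0.5.
# The comparison gives a 0/1 stack; sum() adds across layers cell-wise.
richness_50 <- sum(comm_stack > 0.5)
plot(richness_50)  # 0 = no community > 50%, 1 = one community, etc.
```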

Second, I determined the number of communities within 25% of the maximum predicted value [i.e., above max value - (max value * 0.25)].  This gives an indication of the separation in predicted values across communities. Here, yellow indicates areas where a single community is well separated from the other predicted communities, and green marks areas where two communities have close predictions. Given the large proportion of yellow and green, I again have confidence in using the maximum predicted value for community classification.

[Figure: number of communities within 25% of the maximum predicted value]
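A minimal sketch of this second check, again reusing comm_stack:

```r
# Count the communities whose prediction falls within 25% of the cell-wise
# maximum predicted value, i.e., at or above 75% of the maximum.
max_val <- max(comm_stack)              # cell-wise maximum across layers
within_25 <- sum(comm_stack >= max_val * 0.75)
plot(within_25)  # 1 = clear winner, 2+ = closely competing communities
```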

Here are the ROC AUC curves. AUC is a measure of model fit, with 1 being perfect and 0.5 random.  All models except Gp2 show relatively good fit (an AUC over 0.75 is usually deemed worthwhile).  The species within Gp2 are the most common generalists, and I would not have expected MaxEnt to model this community very well.  As I pursue this further, I will likely split up Gp2 in an effort to produce better community classifications.

[Figure: ROC AUC curves for each community model]
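For reference, the same AUC statistic can be computed outside of MaxEnt; here is a minimal sketch with the pROC package, assuming hypothetical vectors of predicted values at presence and background points:

```r
# Minimal sketch: AUC from predictions at presence vs. background points.
# 'pred_presence' and 'pred_background' are hypothetical numeric vectors.
library(pROC)

labels <- c(rep(1, length(pred_presence)), rep(0, length(pred_background)))
preds  <- c(pred_presence, pred_background)
auc(roc(labels, preds))  # 1 = perfect discrimination, 0.5 = random
```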

I have several 'next steps' to continue developing this model.  First, I would like to include vegetation data from 7 California salt marshes in order to better capture the environmental variation along the coast.  Developing elevation response models for each site is necessary in order to project this model under climate change and sea-level rise scenarios.  I would also like to explore additional environmental layers, such as soil type and distance to the ocean mouth (a salinity proxy), to further refine the defined niche.

Incremental spatial autocorrelation (ISA) uses Moran's I to test for spatial autocorrelation within distance bands. The analysis is run on a given parameter (e.g., percent cover or elevation).

Interpretation

  • ISA returns z-scores and p-values
  • A significant p-value indicates spatial clumping
  • A non-significant p-value indicates random processes at work
  • ISA identifies significant peaks in the z-score
  • A higher z-score indicates more spatial clumping
  • The distance of the first z-score peak is usually used for further analysis
  • Useful for determining the appropriate scale for further analysis, such as:
      • Hot Spot Analysis
      • Density tools that ask for a radius
      • Determining whether a subsample should be taken to remove autocorrelation

[Figures: AGST map at Coos; AGST ISA z-score graph]
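ISA itself is an ArcGIS tool, but the banded Moran's I behind it can be sketched in R with the spdep package (assuming a coordinate matrix pts and a variable of interest cover, both hypothetical names):

```r
# Minimal sketch: Moran's I z-score across a series of distance bands.
library(spdep)

bands <- seq(10, 200, by = 10)  # band upper limits, in map units
z_scores <- sapply(bands, function(d) {
  nb <- dnearneigh(pts, 0, d)             # neighbors within distance d
  lw <- nb2listw(nb, style = "W", zero.policy = TRUE)
  moran.test(cover, lw, zero.policy = TRUE)$statistic  # standard deviate (z)
})
plot(bands, z_scores, type = "b",
     xlab = "Distance band", ylab = "Moran's I z-score")
```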


Coastal salt marshes are at great risk from a large number of factors, especially climate change and sea-level rise.  In an effort with the USGS, I'm working to determine how different salt marshes along the Pacific Coast will respond to changes in sea level.  Part of our approach is collecting fine-scale, baseline field data in the form of RTK GPS elevation points and vegetation surveys.  Through analysis of data from a range of sites (up to 15 along the coast), I hope to better characterize plant habitat requirements, with an ultimate goal of producing improved community response projections under sea-level rise scenarios.  In this class, and in Jim Graham's Spatial Modeling/Big Data class, I will be working with the elevation & veg data to characterize spatial relationships of plant species against plot-level factors (inundation frequency, distance to channel, elevation) and site factors (temperature, salinity, tidal range).  I have hundreds of vegetation plots per site, with about 2000 survey plots completed across our PNW sites.

Currently, I'm still in data processing mode, combining databases and gathering environmental data; field data collection wrapped up in January. The inundation data still needs to be developed, first by kriging the elevation data into DEMs and then using site-specific waterlogger data to determine flooding frequency. The waterlogger data itself needs to be corrected for barometric pressure and elevation. Marsh channels need to be digitized before a distance-to-channel raster can be created. There's a lot of work still to be done to get the data in shape for analysis; however, by focusing on one or two sites, I'll be able to explore the spatial statistics toolbox and push forward with this project.
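As a minimal sketch of the kriging step (assuming a SpatialPointsDataFrame elev_pts with an elev field and a prediction grid grd, both hypothetical names), the gstat package handles the variogram fitting and interpolation:

```r
# Minimal sketch: ordinary kriging of RTK GPS elevation points into a DEM.
library(sp)
library(gstat)

# Fit a spherical variogram model to the elevation points
vg <- variogram(elev ~ 1, elev_pts)
vg_fit <- fit.variogram(vg, model = vgm("Sph"))

# Krige onto the prediction grid and map the result
dem <- krige(elev ~ 1, elev_pts, newdata = grd, model = vg_fit)
spplot(dem["var1.pred"])
```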

The two posts I found that I may benefit most from are:

  1. Supplemental Spatial Statistics Toolbox: http://www.arcgis.com/home/item.html?id=694e0f97355740d7bba6b8b356c0b925

The tools for incremental spatial autocorrelation and exploratory regression analysis seem like they would be useful for investigating spatial relationships and identifying important response variables for spatial models.

  2. Integrating R and ArcGIS: http://www.arcgis.com/home/item.html?id=a5736544d97a4544aa47d06baf910f6d

I've spent much more time in R running spatial models than in Arc, having to bring model outputs into Arc for mapping after the analysis is complete.  For more complex models, this is probably still the most efficient method, but for simpler analyses it may be easier to run the analysis and produce the maps in Arc.

I also found the regression analysis pages very useful as a reference, in addition to the page on 'Finding a Meaningful Model' (http://www.esri.com/news/arcuser/0111/findmodel.html).  The tutorials for hot spot analysis, regression analysis, and model builder seem like they would be worthwhile to run through and of general benefit to others in the class.

Kevin Buffington