The Research Question and Background

Soil microbes are critically important for soil function because they provide the functional backbone for many soil ecosystem services, such as the carbon and nitrogen cycles. The structure and composition of microbial communities can have a large influence on the rates at which these ecosystem services are performed; therefore, determining the spatial distribution of these microbial communities is necessary to establish accurate estimates of soil functionality across a landscape. However, this can be very difficult because microbial communities are extremely diverse: one gram of soil can contain over a billion microbes representing anywhere from 4,000 to 50,000 species. Two different forces potentially govern the spatial distribution of microbial communities. The first theory, popularized by Baas Becking (1934), states that “Everything is everywhere, but the environment selects.” This idea is in line with classical ecological niche theory, in which species distributions are dictated by their niches. The opposing paradigm is the idea that microbes are limited more by dispersal potential than by environmental selection. This reflects neutral theory, which holds that species distributions are stochastic and that, at any given location, species composition is determined by the composition of neighboring areas.
Until recently, these theories had not really been tested in microbial communities. However, with the advent of high-throughput sequencing in the 1990s, an in situ view of microbial communities became possible. In a recent study, O’Brien et al. (2016) determined that at a fine scale (1 cm²) community composition was extremely heterogeneous and correlated weakly with environmental factors; however, as the scale and scope increased, patterns in microbial community distribution began to emerge and were closely linked to environmental parameters. This study illustrates that the spatial scale of a study of microbial communities will influence its results.
In my study, I want to know which principles govern microbial distributions across a regional landscape. Are microbial communities across a regional landscape influenced by environmental parameters, or is their distribution dictated more by dispersal limitation? Previous work at a similar scale has determined that bacterial diversity is linked more closely to environmental parameters than to dispersal limitation. Specifically, studies from all over the world have found the most influential environmental factor to be soil pH (Fierer and Jackson 2006, Griffiths et al. 2011).
The Dataset
During the spring of 2015, 113 soil samples were collected across the state of Oregon to capture the spatial pattern of microbial community structure across a regional landscape. Sampling locations were stratified by Common Resource Areas (CRAs), which are geographically associated land resource regions based on geology, climate, water, soils, biological resources, and land use. This stratification was used to help capture the spatial heterogeneity of the soil across a regional landscape. A subset of these locations (90 out of 113) had previously been sampled by the NRCS in a soil series survey, in which an extensive lab work-up was conducted to catalog each soil’s physical and chemical properties.
Hypothesis
The physical and chemical properties of the soil will have a greater influence on microbial communities than spatial variability. In essence, at the regional landscape scale, the factors dictating microbial community composition will be linked more closely to environmental factors in the soil than to the geographic distance between two samples.
If this is true, then we need to be able to quantify the landscape-scale heterogeneity of these environmental parameters. In this blog post I will use the 90 sample points from the NRCS to see how well interpolation methods perform when examining the spatial heterogeneity of edaphic properties, using soil pH as a proxy. To examine the accuracy of these interpolation methods, we use the SSURGO database of soil pH as our ground truth. This will allow me to compare the interpolation techniques and determine their strengths and weaknesses.
Approaches
The first method was spatially extrapolating each pH point value over its relative CRA polygon. In theory, since these CRA polygons were delineated by similar environmental factors, edaphic properties such as soil pH should be conserved within each polygon.
The Inverse Distance Weighted (IDW) interpolation method was also attempted. This method determines the value at any unknown point by taking a weighted average of all known points, where the weights are determined by their distances to the unknown point.
The last interpolation method used was ordinary kriging. Like IDW, kriging considers only the distances between points; however, unlike IDW, kriging incorporates the fact that beyond a certain distance there is no more useful information to explain the variation at a given point, as determined through the variogram. Kriging therefore uses a variogram model to determine the maximum distance from any given point that information should come from. The figure below shows the relationship between semivariance and the distance separating pairs of points. The fitted line is the model used in the ordinary kriging interpolation and determines how the weights for the average are assigned.

Variogram
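For readers who want to reproduce the IDW and kriging steps, here is a minimal sketch using the gstat package in R. It assumes 'pts' (a SpatialPointsDataFrame of the 90 NRCS points with a pH column) and 'grd' (a prediction grid over Oregon) in matching projections; the spherical variogram model is just one plausible choice, not necessarily the one used here.

library(sp)
library(gstat)

## IDW: weighted average of known points, weights = 1 / distance^idp
ph_idw <- idw(pH ~ 1, locations = pts, newdata = grd, idp = 2)

## Ordinary kriging: fit a variogram model, then krige with it
v     <- variogram(pH ~ 1, pts)        # empirical semivariances by lag distance
v_fit <- fit.variogram(v, vgm("Sph"))  # fitted model (the line in the figure above)
ph_ok <- krige(pH ~ 1, pts, grd, model = v_fit)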

After interpolation, these maps were compared to the SSURGO map to test their performance. This comparison was done by finding the absolute difference at each pixel and summing these differences to produce a variance score. Since the SSURGO map was not complete for all of Oregon, each of the interpolated maps was clipped to the same spatial extent as the SSURGO map.
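A minimal sketch of this comparison with the raster package, assuming 'interp' and 'ssurgo' (hypothetical names) are pH rasters on the same grid:

library(raster)

interp_clip <- crop(interp, ssurgo)       # clip to the SSURGO extent
diff_abs    <- abs(interp_clip - ssurgo)  # absolute pH difference per pixel
cellStats(diff_abs, sum)                  # summed differences: the "variance score"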
Results

The map shown below displays soil pH values extrapolated over their respective CRAs. You can see that the Willamette Valley is quite homogeneous in its soil pH values, while eastern Oregon is considerably more heterogeneous. The problem with this map is that it draws sharp boundaries between CRA polygons, whereas in reality pH values should vary continuously as you move from one CRA to another.

CRA

The IDW map showed the same pattern in the Willamette Valley; however, the heterogeneity of eastern Oregon was blended together. The IDW interpolation also developed pockets of similar pH values that look more like artifacts of the IDW algorithm than real characteristics of the landscape.

IDW_plot
Like IDW, ordinary kriging also blended the heterogeneity of eastern Oregon together; however, it did so without creating many of the artificial pockets that the IDW method produced.

kriging
The main quantitative difference between the SSURGO map and the interpolated maps can be seen in western Oregon: the SSURGO map shows a much higher level of heterogeneity than the interpolated maps, as seen in the Willamette Valley. It also shows a large amount of heterogeneity in southeastern Oregon that was captured only in the CRA map.

SSURGO_map

As expected, the IDW map performed the least favorably (a variance score of 2676), while the kriging and CRA maps had very similar results (variance scores of 2440 and 2414, respectively). The kriging result is somewhat surprising, since the kriging map homogenized the soil pH of eastern Oregon, where heterogeneity was present in both the CRA map and the SSURGO map. Below is a map of the difference in pH between the CRA map and the SSURGO values. You can see that the key areas where the CRA approach failed were areas of high heterogeneity in the SSURGO map.

Difference map

Significance
Through this analysis we have determined that there is a large amount of heterogeneity in edaphic properties across this landscape. There may be significantly correlated parameters that have not been captured in the SSURGO database; if so, it may be inappropriate to interpolate these values using just our sampling points.
Finally, if it is determined that geographic distance is the best proxy for microbial community distribution at this scale, an interpolated map of the community needs to be presented in a way that helps the audience understand where the map is likely to succeed and where we are quite sure it is wrong.
What I Learned: Software
The majority of the project was conducted in the R statistical environment, for several reasons. First, I was already familiar with the programming language. Moreover, R is quite powerful in its data manipulation capability and its ability to execute complex statistics with its base functions. Because R is also completely free and available to anyone, it allows me to conduct my analyses without relying on paid programs. The main drawback of R is its rather steep learning curve, which made ArcGIS preferable in some respects. I found ArcGIS useful in its ability to display results and data with minimal effort; it can present a large amount of information quickly and in a user-friendly way. However, when attempting statistical interpolation methods in ArcGIS, I felt limited in its customizability and was left ignorant of the underlying processes it performed during interpolation.
R, on the other hand, requires more work upfront to create and display raster and vector files; however, once the data are in, manipulating and analyzing them is quite straightforward. When conducting spatial analysis, R is transparent: each function comes with documentation about exactly what happens during its execution. It also requires users to have a working understanding of the basic methodology behind a spatial analysis, which is always a good thing.
All in all, both programs have their niches and best uses; however, one should not feel limited to just one. Exploiting the strengths of each program and bypassing their relative weaknesses by shifting between platforms should be everyone’s goal when conducting any sort of analysis, including spatial analysis.

What I Learned: Spatial Statistics
Before joining this class I had a very limited understanding of spatial statistics. Through the class exercises and a lot of reading, I developed a comfortable understanding of the basic ideas behind several different spatial statistics methods, including IDW, kriging, geographically weighted regression, PCA ordination, and boosted regression trees. However, the most important lesson I learned was how to critically interpret the results of these analyses, such as geographically weighted regression, in ways that make intuitive sense.
Works Cited
Fierer, Noah, and Robert B. Jackson. “The Diversity and Biogeography of Soil Bacterial Communities.” Proceedings of the National Academy of Sciences of the United States of America 103, no. 3 (January 17, 2006): 626–31. doi:10.1073/pnas.0507535103.

Griffiths, Robert I., Bruce C. Thomson, Phillip James, Thomas Bell, Mark Bailey, and Andrew S. Whiteley. “The Bacterial Biogeography of British Soils.” Environmental Microbiology 13, no. 6 (June 1, 2011): 1642–54. doi:10.1111/j.1462-2920.2011.02480.x.

O’Brien, Sarah L., Sean M. Gibbons, Sarah M. Owens, Jarrad Hampton-Marcell, Eric R. Johnston, Julie D. Jastrow, Jack A. Gilbert, Folker Meyer, and Dionysios A. Antonopoulos. “Spatial Scale Drives Patterns in Soil Bacterial Diversity.” Environmental Microbiology, March 1, 2016. doi:10.1111/1462-2920.13231.

Exploring Giant Gourami Distribution Models

The research question:

In the context of predicting the impacts of the rapidly expanding aquaculture industry and understanding the impacts of natural and human changes on the landscape, the overarching research question I am interested in is: what is the natural distribution of the air-breathing giant gourami (Osphronemus goramy) in Southeast Asia, and how is it predicted to change over time with respect to 1) the biophysical properties of the landscape, 2) human impacts, and 3) future climate projections?

Specific to this class, the research question I explored was: what is the distribution of the giant gourami in SE Asia in the year 2000, based on four environmental variables (NDVI, precipitation, surface temperature, and river flow accumulation) and human population density?

Background: The giant gourami inhabits fresh to brackish water in slow-moving areas like swamps, lakes, and large rivers.  Given its unique ability to breathe air, this fish can survive in poorly oxygenated or even anoxic water.  Already popular in the aquarium trade and eaten in some regions, this fish is farmed in SE Asia.  I expect that with climate change, increased urbanization, and the changing hydrologic profile of the system due to potential dams, this fish may become more suitable for aquaculture than other species because of its ability to live in ‘poorer’ environmental conditions.

The Dataset:

gourami_distPoint_data

My dataset consists of points of fish presence/pseudo-absence across SE Asia (image above), characterized by associated environmental variables of interest.  The ‘presence’ portion of my dataset was pulled from fishbase.com and consists of 47 occurrence points for giant gourami, collected primarily between the 1980s and 1990s through an unknown sampling protocol.  Clipping to my study region (SE Asia) left me with 36 presence points.  Background data, or pseudo-absence points, were generated randomly along rivers and streams of SE Asia in ArcMap.  Environmental data for all points were taken from the freely available satellite datasets listed below on Google Earth Engine, using a 1 km buffer around each point and filtered to the date range Feb-Jun 2000 when possible (for consistency in the dataset).

  • NDVI: 32-day Landsat 7
  • Precip: CHIRPS (Climate Hazards Group InfraRed Precipitation with Station data) at 0.05deg resolution.
  • Flow: HydroSHEDS flow accumulation 15-arc seconds (Feb 11-22, 2000)
  • Temp: NCEP/NCAR reanalysis 2.5deg resolution
  • Population: WorldPop project, estimated mean population density 2010 & 2016 at a 100x100m resolution

gourami_BRT_rasters
gouramiVar_histos_ALL

Hypotheses:

Based on my point data and the variables used, my hypotheses are below (the third is untested):

  1. Giant gourami are distributed across a wide range of habitat types, including relatively warmer surface temperatures, a range of flows, and a range of precipitation.
  2. The distribution of the giant gourami is not affected by human population density.
  3. Giant gourami distributions will not change much given predicted climate warming (untested)

Approaches:

This was an exploratory analysis that employed a species distribution model, boosted regression trees (BRT).  This is an ensemble tree-based species distribution model that iteratively grows small/simple regression trees based on the residuals from all previous trees and handles non-parametric data well. BRTs consist of two components, regression trees and boosting, and ultimately help identify the variables in the dataset that best predict where the species is expected, based on the presence/absence data.  (You can view my boosted regression tutorial for more on this analysis approach.)

Results:

After building my dataset in Arc and Google Earth Engine, I was able to produce BRT results in RStudio for several combinations of learning rates and tree complexities, with and without the population data as a variable.  Preliminary analysis indicates that precipitation and temperature contribute the most in determining where giant gourami are expected, based on the data.  Digging into my two hypotheses, I explored the contribution of the environmental variables to giant gourami distribution and the effect of population density.

Precipitation was identified by the BRT models as the strongest contributor in determining the likelihood of finding a giant gourami in my study area.

R output:

> gourami.tc3.lr005$contributions
          var   rel.inf
  mean_Precip 60.511405
    mean_temp 25.182547
  pop_density 10.984079
    NDVI_mean  1.673674
         flow  1.64829

Exploring the spatial relationship between precipitation trends and my presence/pseudo-absence point data, it is clear that the presence points are clustered in regions with lower rainfall.  To explore this relationship further, it might be helpful to shrink the extent of my study area to a smaller region more representative of the conditions associated with the presence points.
precip_SEasia_gourami

As expected, population density did not appear to have an effect on model performance.  The output below displays models with population density as a variable in the left column and without it on the right; model deviance in cross-validation does not change when population density is removed as an explanatory variable.

gourami_TC3-4_lr005_PoP-noPoP

Exploring this relationship visually on a map, this result makes sense, as the point data are distributed across a wide range of population densities.

pop_seasia_points

Significance:

The results of this exploratory analysis revealed some interesting patterns in giant gourami distribution, but they are strongly limited by the point data used.  Points were only present in Malaysia due to data availability, so this distribution pattern may be more a function of sampling effort than actual distribution.  If the analysis were limited to Malaysia only, it might provide a better representation of the data.

Understanding the spatio-temporal patterns that govern the ranges of species like the giant gourami allows resource managers to help meet increasing demands while mitigating environmental harm.  Protein consumption is expected to increase to 45 kg per capita by 2020, a 25% increase from 1997, and fish consumption is no outlier.  The growing aquaculture industry provides roughly half of the global fish supply (FAO, 2014).  The giant gourami is one of roughly 400 air-breathing fish species, which are ideal candidates for aquaculture.  This prospect presents an opportunity for increased protein production in a changing climate, but it also increases the threat of invasion and outcompetition.

 

Lessons Learned:

In this analysis I learned about integrating Arc, Google Earth Engine, and R for building my dataset, as well as running a boosted regression tree model.  Through the process, I tracked my progress, workflow, and challenges.  This allowed me to identify areas where I could have been more efficient.  For example, I generated my background/pseudo-absence points in Arc and then brought them into Earth Engine, when I could have done the whole process in Earth Engine.

A note on Earth Engine: I chose this platform for its ability to access large global datasets across wide-ranging time frames quickly, without the need to download data in order to manipulate or extract it.  Earth Engine runs on JavaScript, which is a hurdle to overcome for someone new to programming.

My data analysis was done in RStudio, through which I learned to run a BRT model with the ‘gbm’ package, some simple histogram plotting functions, and how to import and view raster images with the ‘raster’ and ‘rgdal’ packages.

In terms of the statistics that I explored for this analysis, I have listed some of my take home points below:

  • Boosted Regression Tree strengths and limitations
    • Deals with non-parametric data well
    • Is able to deal with small sample sizes
    • Pseudo-absence points present issues in interpreting results since they are not true absence points
    • Sample bias is potentially a problem
    • Is not a spatially explicit analysis so issues like spatial autocorrelation are not dealt with
  • Hot Spot Analysis (conducted on a different dataset)
    • A quick and easy way to identify patterns at different scales for data points that are independent of the researcher and span a large temporal and spatial range.
    • Could be an approach to hypothesis generation.

 

 

Spatial Autocorrelation (Moran’s I): This tool measures spatial autocorrelation using feature locations and feature values simultaneously.  The spatial autocorrelation tool is both multidimensional and multi-directional.  The Moran’s I index will be a value between -1 and 1: positive spatial autocorrelation indicates values that are clustered, negative autocorrelation indicates dispersion, and values near zero indicate randomness. The tool generates a z-score and p-value, which help evaluate the significance of the Moran’s index.
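For reference, a standard formulation of the global Moran's I statistic (the ESRI figure below is based on this form) is

\[
I \;=\; \frac{n}{S_0}\,
\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i-\bar{x})(x_j-\bar{x})}
     {\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad
S_0 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij},
\]

where \(w_{ij}\) is the spatial weight between features \(i\) and \(j\). The z-score reported by the tool is \(z = (I - \mathrm{E}[I])/\sqrt{\mathrm{V}[I]}\), with \(\mathrm{E}[I] = -1/(n-1)\).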

Morans_Calculations

Figure 1: Calculations used for the Moran’s I tool. (ESRI image)

The output of the Moran’s I tool can be found in the Results section of ArcGIS.  Upon opening the HTML report for the Moran’s I results, you will see a graph showing how the tool evaluated the data and whether the data are dispersed, random, or clustered.  The report also includes the Moran’s index value, z-score, and p-value, along with a scale for the significance of the p-value and the critical value for the z-score.

MoransI_SampleOutput

Figure 2: Sample output for the Moran’s I tool. (ESRI image)

 

Data and Analysis

Before conducting this test, I sampled the SST and the CHL-a values at each of the feature locations (sea turtle locations) using the Extract Multi Values to Points tool. This tool “Extracts cell values at locations specified in a point feature class from one or more rasters, and records the values to the attribute table of the point feature class.”

SeaTurtlePointData

Figure 3: Sea turtle locations in the Gulf of Mexico, derived from http://seamap.env.duke.edu/

 

Chl_image_GOM

Figure 4: Chlorophyll-a (mg/m^3) data for January 2005 within the Gulf of Mexico, derived from NOAA.

SST_image_GOM

Figure 5: Sea Surface Temperature (Celsius) data for January 2005 within the Gulf of Mexico, derived from NOAA.

I tested the spatial autocorrelation of chlorophyll-a and sea surface temperature at each feature location.  Inverse distance was used to conceptualize spatial relationships, and Euclidean distance was used as the distance method.  I selected a 500 km distance band (smaller distances produced neighborhoods too small for the study site).
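For comparison, an equivalent test can be sketched in R with the spdep package; this sketch uses a fixed 500 km distance band rather than ArcGIS's inverse-distance weighting, and assumes 'turtles' is a SpatialPointsDataFrame with SST and CHL columns (hypothetical names) in a projected CRS with kilometer units:

library(spdep)

nb <- dnearneigh(coordinates(turtles), d1 = 0, d2 = 500)  # neighbors within 500 km
lw <- nb2listw(nb, style = "W")                           # row-standardized weights

moran.test(turtles$SST, lw)  # Moran's I, z-score, and p-value for SST
moran.test(turtles$CHL, lw)  # and for chlorophyll-a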

Results:

Sea Surface Temperature: The results of the spatial autocorrelation tool suggest that the pattern of sea surface temperature at each feature location is clustered. The Moran’s index was 0.402514, the z-score was 2.608211, and the p-value was 0.009102.  Since the z-score was greater than the 2.58 critical value, there is a less than 1 percent likelihood that the clustered pattern is a result of random chance.

SeasurfaceTemp_Morans_resullts1Of2 SeasurfaceTemp_Morans_resullts2Of2

Figure 6: Sea surface temperature results for Moran’s I tool.

Chlorophyll-a: The results of the spatial autocorrelation tool suggest that the pattern of chlorophyll at each feature location is clustered.  The Moran’s index was 0.346961, the z-score was 2.216243, and the p-value was 0.026675.  The z-score was less than 2.58 but greater than 1.96, suggesting that there is a less than 5 percent likelihood that the clustered pattern is a result of random chance.

Chlorophyll_Morans_resullts1Of2 Chlorophyll_Morans_resullts2Of2

Figure 7: Results of the spatial autocorrelation Moran’s I for chlorophyll-a at the leatherback sea turtle locations.

What does this mean:

As suggested by the hotspot analysis, the data are clustered.  The spatial autocorrelation tool indicates that the sea surface temperature and chlorophyll values at the sea turtle locations are clustered.  Conducting an ordinary least squares analysis may provide more information about which factors contribute most to the clustered pattern.

Introduction:  Leatherback sea turtles are endangered worldwide and are the most endangered sea turtle.  Their habitat is often impeded by the oil and gas industry in the Gulf of Mexico; in 2010, the Deepwater Horizon oil spill affected much of the biota when roughly 200 million gallons of oil spilled into the Gulf.  My research question is focused on the extent to which these turtles utilize the Gulf of Mexico.  Leatherback sea turtles feed almost exclusively on jellyfish, so I will be assessing their range by looking at the correlation between leatherback sea turtle point data and sea surface temperature and chlorophyll as proxies for where jellyfish may occur.

Research Question: What is the correlation between the location of leatherback sea turtles, sea surface temperature, and chlorophyll?

Approach: I used the hotspot analysis tool to explore the sea turtle point data.  The hotspot analysis tool identifies where statistically significant hot spots or clusters of sea turtles are located within the Gulf of Mexico.

Sea turtle data:

SeaTurtlePointData

 

Results:  The results of the hotspot analysis (shown below) suggest that the turtle locations are significantly clustered off the coast of Louisiana and Texas in approximately 2,000 m to 3,000 m of water.  However, the results of this analysis appear to be quite deceptive: upon taking measurements of turtles in the hotspot cluster, it appears that they may be more dispersed.  Further analysis is needed to determine additional patterns in the sea turtle locations.

hotspotAnalysis_5June

Boosted regression trees (BRT) are an ensemble tree-based species distribution model that iteratively grows small/simple trees based on the residuals from all previous trees.  The model is run on a random subset of your data, and ALL predictor variables are considered to produce the best splits at each node.  The tree complexity determines how deep each individual tree will be grown; anticipated interactions can be captured by setting the appropriate tree complexity.  The learning rate determines the overall weight of each individual tree.  This is an advanced form of regression that consists of two main components: regression trees and boosting.

elith08_fig1

 

  1. Regression trees: partition the predictor space into rectangles, using a series of rules to identify regions with the most homogeneous response to the predictors, and fit a constant to each region.
  2. Boosting: an iterative procedure that reduces the deviance by fitting each new tree to the residuals of the previous tree(s)
    • Each individual tree informs subsequent trees and thus the final model
    • The boosting component makes boosted regression distinct from other tree-based methods.

 

Objective of this analysis:

Identify the biophysical parameters associated with giant gourami distribution in SE Asia, starting with a set of 47 global occurrence points.  This was an exploratory exercise to learn about the physical variables that might be important for the distribution of the giant gourami. 

My Data:

I pulled the 47 occurrence points for the giant gourami from fishbase.com and used them as the basis for the dataset.

ArcMap: generate points and convert to KML for import into Google Earth Engine

//Get coordinates for occurrence points for species of interest (Giant Gourami) from fishbase.com

//create random points to use as ‘pseudo absence’ points

//generate random points within study region and only in rivers

Google Earth Engine: ‘gather’ biophysical data for points from satellite imagery–used for ease of access to spatial layers

//load in image layers of interest (NDVI, CHIRPS, Population density, Flow, Surface Temp.)

//export to CSV for analysis in R Studio

 

R code for running the BRT model:

I ran the model using the gbm package in R, based on a tutorial by Jane Elith and John Leathwick (https://cran.r-project.org/web/packages/dismo/vignettes/brt.pdf)

 

source("BRT/brt.functions.R")
install.packages('gbm')
library(gbm)

# define the dataset
gourami <- read.csv("BRT/gourami/gourami_data.csv")
# data consists of 39 presence and 305 pseudo-absence points (344 total)
# 5 predictor variables

gourami.tc3.lr005 <- gbm.step(data = gourami,
    gbm.x = 3:7,             # columns in the dataset where the predictor variables are located
    gbm.y = 2,               # column in the dataset where presence/absence data is located (0/1)
    family = "bernoulli",
    tree.complexity = 3,     # tree depth determines the number of layers in each tree
    learning.rate = 0.005,   # weight of each tree in the overall model
    bag.fraction = 0.75)     # fraction of the dataset used to build/train the model

 

The three main parameters to pay attention to at this point are tree complexity, learning rate, and bag fraction. The fraction of the dataset not used to build each tree is held out for evaluating it, and gbm.step also performs cross-validation for model evaluation. These three parameters can be varied to determine the ‘best’ model.
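As a hedged sketch, the comparison made later in this tutorial could be scripted as a loop over the settings; the cross-validated deviance stored by gbm.step() is one reasonable yardstick for 'best':

for (tc in c(3, 4)) {              # tree complexities compared below
  for (lr in c(0.01, 0.005)) {     # learning rates compared below
    m <- gbm.step(data = gourami, gbm.x = 3:7, gbm.y = 2,
                  family = "bernoulli", tree.complexity = tc,
                  learning.rate = lr, bag.fraction = 0.75)
    print(c(tc, lr, m$cv.statistics$deviance.mean))  # lower CV deviance is better
  }
}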

Results

Initial output for model with tree complexity of 3, learning rate of 0.005, and bag fraction of 0.75.  Several warning messages were displayed with this particular model, which are not addressed in this tutorial:

mean total deviance = 0.707

mean residual deviance = 0.145

estimated cv deviance = 0.259 ; se = 0.058

training data correlation = 0.916

cv correlation =  0.838 ; se = 0.043

training data ROC score = 0.989

cv ROC score = 0.958 ; se = 0.02

elapsed time –  0.13 minutes

 

Warning messages:

1: glm.fit: algorithm did not converge

2: glm.fit: fitted probabilities numerically 0 or 1

 

Relative contributions of each variable in determining where the species is expected.  For this model, precipitation has the strongest pull:

> gourami.tc3.lr005$contributions
          var   rel.inf
  mean_Precip 61.223530
    mean_temp 25.156042
  pop_density 10.299844
    NDVI_mean  1.685350
         flow  1.635234

 

Interactions:

         NDVI  Precip   flow    Temp    Pop
NDVI        0   29.62   0.11    0.07   0.08
Precip      0       0  17.00  317.66  84.51
flow        0       0      0       0   0.93
Temp        0       0      0       0   3.29
Pop         0       0      0       0      0

Partial dependence plots visualize the effect of a single variable on the model response, holding all other variables constant.  Model results vary the most with precipitation, as seen in the top left plot.  Mean temperature and population density also appear to play a role in giant gourami distribution based on these plots, though their effects may be more apparent if you zoom in on the upper temperature threshold or the lower population density range.
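Plots like those below can be produced with gbm.plot() from the same tutorial functions (also provided by the dismo package); n.plots here is simply the number of predictors:

gbm.plot(gourami.tc3.lr005, n.plots = 5, write.title = FALSE)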

gourami_tc3_lr005_plots
Model comparison, varying tree complexity and learning rate to evaluate the best settings. The top row illustrates model fit for a tree complexity of 3 with a learning rate of 0.01 (left) and 0.005 (right); the bottom row illustrates model fit for a tree complexity of 4 with a learning rate of 0.01 (left) and 0.005 (right):

Gourami_BRT_Rplot_3x4d3-4_lr0.1-0.005_wNOpop

It appears that a model with a tree complexity of 3 and a learning rate of 0.005 performs the best. This model indicates that precipitation has the largest effect on the distribution of giant gourami in SE Asia, based on the initial 34 occurrence points.  

Model critique: BRTs are not a spatially explicit model and thus rely only on the relationship between the variables and the sample points.  Additionally, due to the complex nature of the model (often outputting thousands of trees), the results can be difficult to interpret or explain.

 

Introduction

In my last post, I discussed the importance of quantifying social contacts to predict microbe transmission between individuals in our study herd of African Buffalo. This post deals with the other half of the equation: environmental transmission. My overarching hypothesis is that spatial and social behaviors are important for transmission of host-associated microbes. The goal of this pilot project is to establish a method for quantifying habitat overlap between individuals in the herd so that I can (eventually) compare habitat overlap with microbe community similarity and infection risk.

Methods/Results

Workflow:

I used the following outline to determine pairwise habitat overlap between a “test” pair of buffalo:

  1. Project GPS points in ArcMap
  2. Clip GPS points to the enclosure boundary layer
  3. Use GME to generate kernel density estimates and 95% isopleths
  4. Back in Arc, use the Identity tool to calculate the area of overlap between the two individuals
  5. Compute percent overlap in Excel

Geospatial Modelling Environment (GME):

GME is free geospatial software that combines the computational power of R with the functionality of ESRI ArcGIS to drive geospatial analyses. A few of the functions that are especially useful for analyzing animal movement are: kernel density, minimum convex polygons, movement parameters (step length, turn angle, etc.), converting locations to paths and paths to points, and generating simulated random walks. For this analysis, I used GME to generate KDEs and estimate home ranges using 95% isopleths.

Kernel density estimates and isopleth lines

Kernel densities can be conceptualized as 3-dimensional surfaces based on the density of points. Isopleth lines can be drawn in GME from a raster dataset (such as a kernel density) and can be set to contain a specified proportion of the density surface. For this analysis, I calculated 95% isopleths based on the kernel densities of GPS points; in real life, this means the area in which an animal spends the majority of its time.
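As a point of comparison, the same kernel density and overlap estimates can be sketched entirely in R with the adehabitatHR package (an alternative to the GME workflow, not what was done here; 'locs' and its columns are hypothetical):

library(sp)
library(adehabitatHR)

## locs: data.frame with columns id, x, y in a projected CRS (meters)
pts <- SpatialPointsDataFrame(coords = locs[, c("x", "y")],
                              data = data.frame(id = locs$id))

ud   <- kernelUD(pts[, "id"], h = "href")  # one kernel density per animal
hr95 <- getverticeshr(ud, percent = 95)    # 95% isopleth home ranges

## Pairwise overlap of the 95% home ranges (proportion of one animal's
## range overlapped by another's), analogous to the Identity-tool step
kerneloverlaphr(ud, method = "HR", percent = 95)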

Using the identity tool to compute overlap

After generating home range estimates for two individuals, I uploaded the resulting shapefiles to ArcMap and used the Identity tool to overlap the home ranges.  To use the Identity tool, you input one identity feature and one or more input features. Arc then computes a geometric intersection between the input features and identity features and merges their attributes in the output.

overlap

The map above shows 95% isopleths for animal 1 (yellow) and animal 13 (purple), and the area of overlap computed using the Identity tool (red line). I exported the output table to Excel, where I calculated percent overlap between animals.

Conclusion

Overall, this method seems like it will be great for estimating habitat overlap. A few things that I’m concerned about are:

(a) Habitat use may be so similar for all animals that overlap cannot be related to differences in microbe transmission.

(b) Habitat use may correlate very strongly with contacts, in which case it will be difficult to control for the effects of contacts on microbe transmission.

(c) Percent overlap can be different for each individual in a pair. In my example, #13 overlapped #1 by ~80%, while #1 overlapped #13 by 90%.

I just want to be aware of these potential issues and start thinking about how to deal with them as they arise. Any suggestions would be appreciated, as always!

Introduction 

My research goal is to determine the effects of social and spatial behavior on disease transmission and nasal microbiome similarity. The overarching hypothesis of this project is that the composition of microbial communities depends on host spatial and social behavior because microbe dispersal is limited by host movement. My research group studies a herd of African buffalo in Kruger National Park that is GPS collared and lives in a 900-hectare predator-free enclosure. In conjunction with the GPS collars, each animal has a contact collar that records the duration of contact and the ID of any animal that comes within ~1.5 meters. However, for this class I focused on the GPS collars in an effort to test whether they are sufficient for inferring contact networks.

 

The purpose of this step in my research was to establish a method for creating distance matrices between animals at different time points using only GPS data. I wanted to determine whether this is a viable option for inferring contact networks that could be used in lieu of the contact collars. If the method proved effective, my plan was to iterate through multiple time points to determine average distances between animals over time.

Results from this pilot study showed that the method is feasible; however, GPS collars would be less effective for inferring a contact network than the contact collars because temporal resolution is sacrificed.

 

Data Manipulation in R

The raw data from the buffalo GPS collars are in the form of individual text files, one per individual per time point. Step one was to generate a CSV file that included all individuals at one time point. This was done in R using the following code:

## Load necessary packages
library(reshape2)
library(ggplot2)

## Set working directory and list the collar files
setwd("C:\\Users\\couchc\\Documents\\GPS_captures_1215\\GPS_capture_12_15")
SAVE_PATH <- getwd()
BASE_PATH <- getwd()

CSV.list <- list.files(BASE_PATH, pattern = glob2rx("*.csv"))

all.csv <- read.csv(CSV.list[1], header = TRUE)

names(all.csv)

buffalo <- data.frame(ID = all.csv$title,
                      X = all.csv$LATITUDE,
                      Y = all.csv$LONGITUDE,
                      Day = all.csv$MM.DD.YY,
                      Hour = all.csv$HH.MM.SS,
                      s.Date = all.csv$s.date,
                      s.Hour = all.csv$s.hour)

## Now we want to melt and cast based on data within the frame;
## the coordinates are kept as id variables so dcast can use them below
melted.all.csv <- melt(buffalo, id = c("ID", "X", "Y"))
names(melted.all.csv)

casted.all.csv <- dcast(melted.all.csv, X + Y ~ ID)
# Where our previous data was organized with ID as a column,
# now each ID is its own column, and we can see what value an ID took at a given location.

length(buffalo[, 1])
ggplot(buffalo[buffalo$ID == buffalo$ID[1], ], aes(x = X, y = Y)) +
  geom_point()  # GPS fixes are irregular points, so geom_point() replaces geom_raster()

 

Distance Matrix

Because the GPS collars are not synchronized, the 30-minute time intervals can be offset quite a bit, and there are quite a few missing intervals. To correct this, I rounded times to the nearest 30 minutes in Excel. I then imported the new Excel file back into R to generate a “test” distance matrix for a single time point using the following code:

-Show code and output once I’m able to run it on a lab computer (My computer doesn’t have the processing power to do it in a reasonable time).
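In the meantime, here is a minimal sketch of the intended step, with made-up IDs and projected coordinates (meters) standing in for one rounded 30-minute time point:

## snapshot: one row per animal at a single rounded time point (hypothetical values)
snapshot <- data.frame(ID = c("B01", "B13", "B22"),
                       X  = c(355200, 355420, 356010),
                       Y  = c(7281100, 7281260, 7280950))

d    <- dist(snapshot[, c("X", "Y")])  # pairwise Euclidean distances
dmat <- as.matrix(d)
dimnames(dmat) <- list(snapshot$ID, snapshot$ID)
dmat                                   # pairwise distance matrix, in meters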

 

Conclusion

This method showed some interesting results: inter-individual distances can be calculated in R using GPS collar data, and if the process were iterated over multiple time points, average pairwise distances could be computed between each pair of individuals. A Mantel test could then be used to test for correlation between the resulting distance matrix and a matrix of microbial community similarity. The problem with rounding time points to the nearest 30 minutes is that it doesn’t guarantee that the buffalo actually come within the distance given by the matrix. Since some of the collars are offset by up to 15 minutes, the animals could be in the recorded locations at different times and never actually come into contact with each other. The benefit of using GPS collars to infer social behavior is that they give actual distances between individuals, which adds an element of spatial resolution not provided by the contact collars, which only record when another animal is within a specific distance. However, since we are looking for the best way to predict contact transmission of microbes in this pilot project, we are not interested in distances past a certain threshold, and a threshold is easily captured by the contact collars. Although the distance matrix could prove useful for other behavior studies, we would need much higher temporal resolution in our GPS data to build distance matrices precise enough to infer contact networks; that process is more easily done using contact collars.

 

Overall, this pilot project was useful because it demonstrated some of the potential benefits and drawbacks of using GPS intervals to infer contact networks. Although I do not plan to use the method described in this post for my dissertation research, it may be beneficial in future studies of animal behavior or herd structure. The exercise was very useful for improving my confidence in R — with lots of help from my classmates of course!

Relationships that vary in space and time: A challenge for simple linear regression

Regression is a fundamental analytical tool for any researcher interested in relating cause and effect. A basic assumption of regression is that of stationarity, the principle that the relationship between a predictor variable and its response is constant across its sample space; that a relationship is true in all regions where it is being applied. This is a particularly poor assumption in spatially based analyses, where we know that interactions may exist between known and often unknown factors in how a response variable relates to a given explanatory variable. While this is a challenge for simple linear regression, it is also what generally makes spatial problems interesting: the fact that relationships are not constant across space and time.

Spatially weighted regression relaxes the assumption of stationarity: where simple linear regression develops a single relationship to describe a phenomenon, spatially weighted regression allows the relationship to vary spatially. Unhinging the relationship between an explanatory variable and its response spatially creates a set of local coefficients for each instance where an explanatory variable is offered. This is done through the use of a weighting function. Whereas in simple linear regression each data point assumes equal weight in the final relationship, a weighting function gives greater weight to values closer to the point where a local regression is calculated.


Spatial weighting function (1)

Fig 1: A spatial weighting function weights data points closer to a regression point more heavily. In this way, bandwidths can vary across a feature space, such that two local regression values may be constructed from different numbers of data points.


SWR_drawing

Fig 2: Spatially weighted regression allows the relationship between a response and an explanatory variable to vary across a study region.

 

NDVI and weed density: A case study in spatially weighted regression

Normalized difference vegetation index (NDVI) has been used as a proxy for phenology in many remote sensing and cropping systems studies. NDVI is calculated from red and near-infrared reflectance (as their normalized difference) and is generally related to the amount of green, photosynthetically active tissue. In principle, weeds and crops should be distinguishable based on how their NDVI response varies in time.


Question: Can NDVI be used to predict weed density? Does the relationship between NDVI and weed density vary spatially? 


NDVI_swr

Fig 3: A general hypothesis for how weeds and crop may differ in their NDVI response. Weeds may mature earlier or later than a crop, but this relationship may also vary spatially.

Here, spatially weighted regression is offered as a method for distinguishing relationships between weed density and NDVI. Allowing the relationship between NDVI and weed density to vary spatially may allow one to classify areas of the field based on the nature of these relationships.


NDVI_gif

Fig 4: NDVI over the course of the growing season at a 17 acre field located NE of Pendleton, OR, Summer 2015. Note that the field does not increase or decrease in NDVI evenly, but rather peak NDVI passes as a wave across the field. [ Note: GIF does not appear animated in preview. Click on the GIF directly if animation is not working ]

An important first step in deciding if a data set is suitable for spatially weighted regression is to look at the residuals of the linear relationship you are choosing to model. Here we examine the following function for predicting weed density:

NDVI_form (1)

This function uses 3 samples of NDVI over the course of the growing season, centered around the time of peak NDVI. The purpose of using these 3 times was to emphasize periods where weeds and crop would differ in their relative NDVI response.
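A minimal sketch of the global model, assuming 'weeds' is a data frame holding weed density and the three NDVI samples (the column names here are hypothetical):

m <- lm(weed_density ~ ndvi_early + ndvi_peak + ndvi_late, data = weeds)
summary(m)$r.squared   # the multiple R2 reported below (~0.04 here)

## Mapping the residuals is the diagnostic: spatially clustered residuals
## suggest the relationship varies across the field
weeds$resid <- resid(m)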


PlantHS_obs

Fig 5: Weed densities were calculated based on linear transects made prior to harvest. Weed hot spots were calculated using the Getis-Ord Gi* statistic in ArcMap. Weed hot spots from a prior analysis were used as input for predicting weed density in this exercise.


PlantHS_glm_pred

Fig 6: Predicted weed density for a multiple regression model based on 3 measurements of NDVI surrounding peak NDVI.  Multiple R2:  0.044, P-value < 0.001

 


PlantHS_glm_resid

Fig 7: Residuals from a multiple regression model based on 3 measurements of NDVI surrounding peak NDVI.  Both high and low residuals cluster spatially, indicating that the relationship between NDVI and weed density may vary spatially and may be a good candidate for geographically weighted regression.
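A hedged sketch of that next step with the spgwr package ('weeds' here would be a SpatialPointsDataFrame with the same hypothetical columns as above); the print method reports the quasi-global R2 quoted below:

library(spgwr)

f   <- weed_density ~ ndvi_early + ndvi_peak + ndvi_late
bw  <- gwr.sel(f, data = weeds, adapt = TRUE)              # adaptive bandwidth by CV
fit <- gwr(f, data = weeds, adapt = bw, hatmatrix = TRUE)  # local regressions + diagnostics

fit                  # summary, including the quasi-global R2
head(fit$SDF@data)   # local coefficients, one set per location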


PlantHS_swr_pred

Fig 8: Predicted weed density from a spatially weighted regression. Quasi-global R2: 0.92.

By using a spatially weighted regression, we’re able to account for 92 percent of the variance in the distribution of weeds in this field. Unlike a standard regression, the result of this process is a collection of local regression formulas; in this sense, the result cannot easily be extrapolated to predict weed distributions in future data sets. However, these coefficients do offer us the opportunity to look for spatial patterns that may yield additional information as to the nature of these local relationships.


 


Slope_coef

Fig 10: Map classified by the slope coefficients of the local regressions.

Question:

Where are the statistically significant regions of high and low Alaska Skate catch rate from commercial longliners in the Bering Sea?

AK_skate_lgl_rate

Name of the tool or approach used.

ArcGIS- Hot Spot Analysis

Brief description of steps you followed to complete the analysis.

The dataset that I used to complete this analysis includes over 200 species of fish and four gear types, so I chose to limit the scope of this exercise to one species and one gear type.  I narrowed the dataset by selecting Alaska skates caught by longliners and ran a Hot Spot Analysis on this selection.  Within the Hot Spot tool in Arc, I selected catch “RATE” as the input field for comparison, retained the default selections for the spatial relationship (“FIXED_DISTANCE_BAND”) and distance method (“EUCLIDEAN_DISTANCE”), and ran the analysis.
Narrowing the temporal scope of the analysis, I then ran it again on two subsets of the data: one on points from 1998-2007 and a second on points from 2008-2014.

Brief description of results obtained.

AK_skate-lgl_hotspot
Alaska Skate catch rate hot spot analysis 1998-2014

The original analysis includes AK skate catch rates on Bering Sea longliners from 1998-2014 and reveals some interesting patterns about fishing behavior as well as where skate catch rates are higher.  There is a clear line of cold spots along the southern edge of the oceanic shelf that runs north of the Aleutian Islands.  A few distinct hot spots are also revealed in the western Aleutians near Adak and Atka islands.

AK_skate-lgl_hotspot98-07
Hot spot analysis on Skate catch rate from 1998-2007.
AK_skate-lgl_hotspo08-14t
Hot spot analysis of Skate catch rate 2008-2014

Changing the temporal scope of the analysis revealed some very different patterns from those over the full time period.  A number of hypotheses can be developed for further exploration as to why the patterns change.

Critique of the method – what was useful, what was not?

For this dataset, I think this tool is a quick and easy way to identify patterns at different scales and could be an approach to hypothesis generation.  This is a dataset in which the sample points are independent of the researcher and span a large temporal and spatial range; these characteristics seem to make it well suited to this tool.

Introduction:

Hotspot analysis is a method of statistical analysis for spatially distributed data whereby both the magnitude of a point and its relationship to the points around it are taken into consideration to identify statistically significant extreme values in a data set. While a single spatial data point may represent an extreme value, in the context of weed management what often matters is being able to identify aggregates of values which, taken as a whole, represent an extreme condition. Where one large weed may have some impact, many medium and smaller weeds are likely to have a disproportionate impact. This has implications for weed management and weed science: of the variety of techniques available for mapping the spatial distribution of invasive weeds, most rely on expert knowledge to decide what constitutes a “weedy” versus a “non-weedy” condition. This consideration is made independently of the geographic context of a value, usually by visual evaluation of the distribution of measured values and the choice of some cutoff, real or perceived, in that distribution. These cutoffs are often subjective and typically yield poor results when applied to independently generated data sets. Hotspot analysis offers a road forward, where statistically significant extreme regions can be identified and various mapping techniques compared by how well they predict these extreme regions.

My goal for this project was to see if hotspot analysis provided a better technique for mapping weed populations in wheat fields at the time of harvest. Previously, we visually evaluated the histograms and used some knowledge of how the weeds were distributed to determine cutoffs for “weedy” and “non-weedy” values. Here I wanted to see if I could generate more accurate maps of weed density relative to a reference data set.

Sources of data:

Combine

Fig. 1: Sampling for weed density was done at the time of harvest using a spectrometer in the grain stream of a combine, and with a visual evaluator watching for green weeds during harvest.
In Summer 2015 we harvested a field winter wheat NE of Pendleton, OR using combine outfitted with a spectrometer in the grain. A ratio of red to near infrared reflectance was taken, along with visual observations made by an observer from inside the cab of the combine. Prior to harvest, we walked the field on gridded transects, recording observations of all weeds at the species level. These data were all re-gridded to a common 7m X 7m grid using bilinear interpolation, and brought into ArcGIS for analysis.

Classification:

In ArcGIS, hotspot analysis was conducted using the Hotspot Analysis tool. The Getis-Ord Gi* statistic identifies statistically significant hot and cold spots in whatever spatially referenced value you use as the feature class for classification. For the purposes of this exercise, significantly positive Getis-Ord Gi* statistics were considered “weedy”, while non-significant and negative scores were considered “non-weedy”. For the histogram method, cutoffs based on a visual evaluation of the histogram, and on knowledge of what should represent a weedy condition, were used to distinguish between “weedy” and “non-weedy” values, and the associated maps were classified accordingly.
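Outside ArcGIS, the same classification can be sketched with the spdep package in R; 'grid' and its columns are hypothetical, and the 1.96 cutoff mirrors the "significantly positive" rule above:

library(spdep)

coords <- as.matrix(grid[, c("x", "y")])
nb <- dnearneigh(coords, d1 = 0, d2 = 15)      # neighbors within ~2 cells on the 7 m grid
lw <- nb2listw(include.self(nb), style = "B")  # include self to get Gi* rather than Gi

gi_star <- localG(grid$weed_density, lw)       # z-scores of the Getis-Ord Gi* statistic
grid$class <- ifelse(as.numeric(gi_star) > 1.96, "weedy", "non-weedy")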

 

Combine_hotspot_Classification
Combine_histogram_Classification

Fig 2: Ground reference data set classified using Hotspot analysis and histogram classification. All green values were considered positive for weeds.

Spectral_histogram_Classification
Spectral_Hotspot_Classification

Fig 3: Spectral assessment of weeds distribution classified using Hotspot analysis and histogram classification. All green values were considered positive for weeds.

 

Greenwalking_histogram_Classification
Greenwalking_hotspot_Classification

Fig 4: Visual assessment of weeds distribution classified using Hotspot analysis and histogram classification.  All green values were considered positive for weeds.

Comparing techniques
Classified maps were then compared for accuracy using a confusion matrix approach. Accuracy here is defined as the rate of true positives and true negatives divided by the total number of samples. Classified maps were also compared for their precision and sensitivity, where precision is the rate of true positives among the predicted positives, and sensitivity is the rate of true positives among the condition positives. Precision addresses the question: if I predict it positive, how often am I correct? Sensitivity addresses the question: when the reference data says positive, how often did my prediction find it?
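A minimal sketch of these calculations, assuming two vectors of "weedy"/"non-weedy" labels (hypothetical names) for a classified map and its ground reference:

cm <- table(predicted = map_class, reference = ground_class)

TP <- cm["weedy", "weedy"];     FN <- cm["non-weedy", "weedy"]
FP <- cm["weedy", "non-weedy"]; TN <- cm["non-weedy", "non-weedy"]

accuracy    <- (TP + TN) / sum(cm)  # true positives + true negatives over all samples
precision   <- TP / (TP + FP)       # of predicted positives, how many are correct
sensitivity <- TP / (TP + FN)       # of reference positives, how many were found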

ERROR_MATRIX

Fig 5: Error assessment comparing spectral assessment of weediness to ground reference data using the histogram classification.

RESULTS

Fig 6: Results of the error assessments for all mapping techniques compared with their ground reference data.

Results:

These results show that hotspot analysis was a superior method for image classification compared with a visual evaluation of the distribution, probably because hotspot analysis takes into consideration the neighborhood of values in which a measurement resides. Hotspot analysis increased the accuracy of both the spectral and the on-combine visual assessments. There were mixed results regarding its impacts on sensitivity and precision, however: hotspot classification increased the precision of the visual assessment, but at a cost to sensitivity; conversely, it increased the sensitivity of the spectral assessment, but at a cost to its precision. It may be that precision comes at a cost to sensitivity with this method of comparing classifications. In general, however, the most substantial gains were in accuracy, which is the most important factor for this mapping effort.