Exploring Giant Gourami Distribution Models

The research question:

In the context of predicting the impacts of the rapidly expanding aquaculture industry and understanding the impacts of natural and human changes on the landscape, one of the overarching research question that I am interested in is: What is the natural distribution of the the air-breathing giant gourami (Osphronemus goramy) in South East Asia and how is it predicted to change over time with respect to 1) the biophysical properties of the landscape, 2) human impacts, and 3) climate projections into the future.

Specific to this class, the research question that I explored was, what is the distribution of the giant gourami in SE Asia in the year 2000 based on 4 environmental variable (NDVI, precipitation, surface temperature, and river flow accumulation) and human population density.

Background: Giant gourami inhabits regions characterized by fresh to brackish water and in slow-moving areas like swamps, lakes, and large rivers.  Given its unique ability to breather air, this fish can survive in poorly oxygenated water to anoxic areas.  Already a popular fish in the aquarium trade and eaten in some regions, this fish is farmed in SE Asia.  I expect that with climate change, increased urbanization, and the changing hydrologic profile of the system due to potential dams that this fish may become more suitable than others for its ability to live in ‘poorer’ environmental conditions.  

The Dataset:

gourami_distPoint_dataMy dataset consists of points of fish presence/pseudo-absence across SE Asia (image above) characterized by associated environmental variables of interest.  The ‘presence’ portion of my dataset was pulled from fishbase.com, consisting of 47 occurrence points for giant gourami collected between primarily 1980s-1990s through an unknown sample protocol.  Clipping to my study region, SE Asia left me with 36 presence points.  Background data, or pseudo-absence points were generated randomly along rivers and streams of SE Asia in ArcMap.  Environmental data for all points were taken from freely available satellite datasets listed below on Google Earth Engine in a 1km buffer around the point data and filtered to the date range of Feb-Jun 2000 when possible (for retaining consistency in the dataset).

  • NDVI: 32-day Landsat 7
  • Precip: CHIRPS (Climate Hazards Group InfraRed Precipitation with Station data) at 0.05deg resolution.
  • Flow: HydroSHEDS flow accumulation 15-arc seconds (Feb 11-22, 2000)
  • Temp: NCEP/NCAR reanalysis 2.5deg resolution
  • Population: WorldPop project- Estimated mean population density 2010 & 2016 at a 100x100m resolutiongourami_BRT_rastersgouramiVar_histos_ALL

Hypotheses:

Based on my point data and the variables used, my hypotheses are below (with the third un-tested):

  1. Giant gourami are distributed in a wide range of habitat types, including relative warmer surface temperatures, a range of flows, and in areas with range of precipitation.
  2. This distribution of the Giant gourami is not affected by human population density.
  3. Giant gourami distributions will not change much given predicted climate warming (untested)

Approaches:

This was an exploratory analysis that employed the species distribution model, boosted regression trees (BRT).  This is an ensemble tree-based species distribution model that iteratively grows small/simple regression trees based on the residuals from all previous trees to explore non-parametric data. BRTs consist of two components: regression trees and boosting and ultimately help to identify the variables in the dataset that best predict where the species is expected based on the presence/absence data.  (you can view my the boosted regression tutorial for more on this analysis approach).

Results:

After building my dataset in Arc and through Google Earth Engine, I was able to produced BRT results in R studio for several combinations of learning rates and tree complexities with and without the population data as a variable.  Preliminary analysis indicates that precipitation and temperature contribute the most in determining where giant gourami are expected, based on the data.  Digging into my two hypotheses, I explored the contribution of temperature on giant gourami distribution and population density.

Precipitation was identified by the BRT models as the strongest contributor in determining in the likelihood of finding a giant gourami in my study area.  

  • R output:
  • > gourami.tc3.lr005$contributions
  • var:   rel.inf
  • mean_Precip: 60.511405
  • mean_temp: 25.182547
  • pop_density: 10.984079
  • NDVI_mean: 1.673674
  • flow: 1.64829

Exploring the spatial relationship between precipitation trends and my presence/pseud-oabsence point data, it is clear that the presence points are clustered in regions with lower rainfall.  As a means to explore this relationship further, it might be helpful to shrink the extent of my study area to represent a smaller area, more representative of the conditions associated with the presence points.
precip_SEasia_gourami

Population density did not appear to have an effect on model performance as expected.  As shown in the model output below, which displays model output for models with population density as a variable in the left column and without on the right.  Model deviance in cross-validation does not change when population density is removed as an explanatory variable.

gourami_TC3-4_lr005_PoP-noPoPExploring this relationship visually on a map, this result makes sense as the point data are distributed across a wide range of population densities.

pop_seasia_points

Significance:

The results of this exploratory analysis revealed some interesting patterns in giant gourami distribution, but is limited in a big way by the point data used.  Points were only present in Malaysia due to availability, so this distribution pattern is possibly more a function of sample effort than actual distribution.  If the analysis were limited to Malaysia only, it may provide a better representation of the data.  

Understanding the spatio-temporal patterns that govern the range of species like the giant gourami allow resource managers to help meet increasing demands and at the same time mitigate environmental harm.  Protein consumption is expected to increase to 45kg per capita by 2020, a 25% increase from 1997—the fish consumption rate is no outlier.  The growing aquaculture industry provides roughly half of the global fish supply (FAO, 2014).  The giant gourami are 1 species of ~400 air breathing fish present–ideal candidates for aquaculture.  This prospect presents an opportunity for increased protein production in a changing climate, but also increases threat of invasion/outcompetition.

 

Lessons Learned:

In this analysis I learned about integrating Arc, Google Earth Engine, and R for building my dataset as well as running a boosted regression tree model.  Through the process, I tracked my progress, workflow, and challenges.  This allowed me to identify areas where I could have been more efficient.  For example, I generated my background/psuedo-absence points in Arc and then brought them into Earth Engine, when I could have done the whole process in Earth Engine.

A note on Earth Engine: the reason that I chose to use this platform was for its ability to access large global datasets across wide ranging time frames quickly and without the need to download to manipulate or extract data.  Earth Engine functions through javascript, which is a hurdle to overcome for someone new to programming.

My data analysis was done in R studio, through which I learned to run a BRT model with the ‘gbm’ package, I learned some simple histogram plotting functions, and how to import/view raster images with the packages ‘raster’ and ‘rgdal’.

In terms of the statistics that I explored for this analysis, I have listed some of my take home points below:

  • Boosted Regression Tree strengths and limitations
    • Deals with non-parametric data well
    • Is able to deal with small sample sizes
    • Pseudo-absence points present issues in interpreting results since they are not true absence points
    • Sample bias is potentially a problem
    • Is not a spatially explicit analysis so issues like spatial autocorrelation are not dealt with
  • Hot Spot Analysis (conducted on a different dataset)
    • A quick and easy way to identify patterns at different scales for data points that are independent of the researcher and span a large temporal and spatial range.  
    • could be an approach to hypothesis generation.  

 

Boosted regression trees (BRT) are an ensemble tree-based species distribution model that iteratively grows small/simple trees based on the residuals from all previous trees.  The model is run on a random subset of your data, and ALL predictor variables are considered to produce the best splits at each node.  The tree-complexity determines how deep each individual tree will be grown to.  Anticipated interactions can be captured by setting the appropriate tree complexity.  The learning rate determines the overall weight of each individual tree.  This is an advanced form of regression methods which consists of two components. -The two elith08_fig1main components of BRT: regression trees and boosting.

 

  1. Regression trees:  Partitions the predictor space into rectangles, using a series of rules to identify regions with the most homogenous response to the predictor and fits a constant to each region.
  2. Boosting: An iterative procedure that reduces the deviance by accounting residuals of previous tree(s) by fitting another tree
    • Each individual tree inform subsequent trees and thus the final model
    • The boosting component makes boosted regression distinct from other tree-based methods.

 

Objective of this analysis:

Identify the biophysical parameters associated with giant gourami distribution in SE Asia, starting with a set of 47 global occurrence points.  This was an exploratory exercise to learn about the physical variables that might be important for the distribution of the giant gourami. 

My Data:

Pulled the 47 occurrence points for the giant gourami from fishbase.com and were used as the basis for the dataset.

ArcMap: generate points and convert to KML for import into Google Earth Engine

//Get coordinates for occurrence points for species of interest (Giant Gourami) from fishbase.com

//create random points to use as ‘pseudo absence’ points

//generate random points within study region and only in rivers

Google Earth Engine: ‘gather’ biophysical data for points from satellite imagery–used for ease of access to spatial layers

//load in image layers of interest (NDVI, CHIRPS, Population density, Flow, Surface Temp.)

//export to CSV for analysis in R Studio

 

R code for running the BRT model:

I ran the model using the gbm package in R, based on a tutorial by Jane Elith and John Leathwick (https://cran.r-project.org/web/packages/dismo/vignettes/brt.pdf)

 

>>source(“BRT/brt.functions.R”)

>>install.packages(‘gbm’)

>>library(gbm)

# define the dataset

>>gourami <- read.csv(“BRT/gourami/gourami_data.csv”)

# data consists of 39 presence and 305 pseudo absence (344)

# 5 predictor variables

>>gourami.tc3.lr005 <- gbm.step(data=gourami,

    gbm.x = 3:7, #columns in the dataset where the response variables are located

   gbm.y = 2, #column in the dataset where presence/absence data is located (0/1)

   family = “bernoulli”,

    tree.complexity = 3, #tree depth determines the number of layers in each tree

  learning.rate = 0.005, #weight of each tree in the overall model

    bag.fraction = 0.75) #fraction of the dataset used to build/train the model

 

The three main parameters to pay attention to at this point are tree complexity, learning rate, and bag fraction. The remaining fraction of the dataset not used in the bag fraction is then used for cross validation for model evaluation. These three parameters can be varied to determine the ‘best’ model.

Results

Initial output for model with tree complexity of 3, learning rate of 0.005, and bag fraction of 0.75.  Several warning messages were displayed with this particular model, which are not addressed in this tutorial:

mean total deviance = 0.707

mean residual deviance = 0.145

estimated cv deviance = 0.259 ; se = 0.058

training data correlation = 0.916

cv correlation =  0.838 ; se = 0.043

training data ROC score = 0.989

cv ROC score = 0.958 ; se = 0.02

elapsed time –  0.13 minutes

 

Warning messages:

1: glm.fit: algorithm did not converge

2: glm.fit: fitted probabilities numerically 0 or 1

 

Relative contributions of each variable in determining where the species is expected.  For this model, precipitation has the strongest pull:

> gourami.tc3.lr005$contributions

                   var   rel.inf

mean_Precip: 61.223530

mean_temp: 25.156042

pop_density: 10.299844

NDVI_mean:  1.685350

Flow:   1.635234

 

Interactions:

NDVI Precip flow Temp Pop
NDVI 0 29.62 0.11 0.07 0.08
Precip 0 0 17.00 317.66 84.51
flow 0 0 0 0 0.93
Temp 0 0 0 0 3.29
Pop 0 0 0 0 0

Partial dependence plots visualize the effect of a single variable on model response, holding all other variables constant.  Model results vary the most with precipitation as seen in the top left plot.  Mean temperature and population density appear to also play a role in giant gourami distribution based on these plots, but may be more apparent if you zoom in on the upper temperature threshold or the lower population density range.

gourami_tc3_lr005_plots
Model comparison, varying tree complexity and learning rate to evaluate the best setting. Top row illustrates model fit for a tree complexity of 3 with a learning rate of 0.01 (Left) and 0.005 (right).  The bottom row illustrates model fit for a tree complexity of 4 with learning rate 0.01(L) and 0.005(R) :Gourami_BRT_Rplot_3x4d3-4_lr0.1-0.005_wNOpop

It appears that a model with a tree complexity of 3 and a learning rate of 0.005 performs the best. This model indicates that precipitation has the largest effect on the distribution of giant gourami in SE Asia, based on the initial 34 occurrence points.  

Model critique: BRTs are not a spatially explicit model and thus relies only on the relationship between the variables and the sample points.  Additionally, due to the complex nature of the model (often outputting thousands of trees), the results can be difficult to interpret or explain.

 

   Question:

Where are the statistically significant regions of high and low Alaska Skate catch rate from commercial longliners in the Bering Sea?

AK_skate_lgl_rate

Name of the tool or approach used.

ArcGIS- Hot Spot Analysis

Brief description of steps you followed to complete the analysis.

The dataset that I used to complete this analysis includes over 200 species of fish and four gear types, so I chose to limit the scope of this exercise to one species in one gear type.  So I narrowed the dataset by selecting Alaska Skates caught by longliners, and ran a Hot Spot Analysis on this selection.  Within the Hot Spot tool in Arc, I selected catch “RATE” as the input field for comparison, retained the default selections for the spatial relationship (“FIXED_DISTANCE_BAND) and distance method (“EUCLIDEAN_DISTANCE”), and ran the analysis.  
Narrowing the temporal scope of the analysis, I ran it again on two subsets of the data.  One on points from 1998-2007 and the second on points from 2008-2014.

Brief description of results obtained.

AK_skate-lgl_hotspot
Alaska Skate catch rate hot spot analysis 1998-2014

The original analysis includes AK skate catch rates on Bering Sea longliners from 1998-2014 and reveals some interesting patterns about fishing behavior as well as where skate catch rates are higher.  There is a clear line of cold spots along the southern edge of the oceanic shelf that runs north of the Aleutian Islands.  A few distinct hotspots are also revealed in the western Aleutians near Adak and Atka islands

AK_skate-lgl_hotspot98-07
Hot spot analysis on Skate catch rate from 1998-2007.
AK_skate-lgl_hotspo08-14t
Hot spot analysis of Skate catch rate 2008-2014

Changing the temporal scope of the analysis revealed some very different patterns than the full time period.  There are a number of hypotheses that can be developed for further exploration as to why the patterns change.  

Critique of the method – what was useful, what was not?

For this dataset, I think that this tool is a quick and easy way to identify patterns at different scales and could be an approach to hypothesis generation.  This dataset is one that the sample points are independent of the researcher and span a large temporal and spatial range.  It seems that for this tool these characteristics in the dataset make it more applicable.

My spatial problem

A description of the research question that you are exploring.  Global change is occurring from the continuing variability in the climate, the ecological responses to the climate drivers, and the socioeconomic and political response.  These changes alter the landscape in predictable and unforeseen ways, simultaneously causing modifications in interactions between the landscape and all the biological communities.  Our reliance on natural resources such as fish, highlights the coupled impacts of these changes between the human and natural system.  Exploring the impacts of this coupled system this term, the spatial problem that Ill address is to understand how habitat suitability models differ for the giant gourami (Osphronemus goramy) in the Mekong Basin between models that are based on the physical landscape and those that incorporate human impacts.  I will use the environmental indicators surface temperature, salinity, and turbidity to map the potential habitat for the giant gourami with an additional layer informed by indicators of human impacts such as land use, population, and proximity to industry to evaluate the differences.

The giant gourami is an air-breathing fish, native to this region, and grown commercially as a food fish as well as for the aquarium market throughout SE Asia (Lefevre et al., 2014).  The fish inhabits freshwater, brackish, benthopelagic environments in swamps, lakes, and rivers among vegetation, found in medium to large rivers and stagnant water bodies.  People around the world rely on fish as a primary source of protein and income, and the growing aquaculture industry provides roughly half of the global fish supply (FAO, 2014).  However, to meet the demands of a rapidly growing population (exceeding 7 billion by 2020), a rising middle class, and an increasingly urban population (65% by 2020), protein consumption is expected to increase to 45kg per capita by 2020, a 25% increase from 1997—the fish consumption rate is no outlier.

 

gourami

A description of the dataset.

  1. Boundary data for the IUCN defined habitat range for the giant gourami in the Mekong Basin species (shown below).  This status code for the species and is listed in this dataset as “Probably Extant.”  However, this particular status code is listed as “discontinued for reasons of ambiguity.”  So it is my hope that this analysis will provide insight into the IUCN-defined habitat and assess how it has changed through time by assessing the parameters used to develop the IUCN data and evaluate additional landscape variables.
  2. Point data on fish occurence 1930-1982 that Ill use to develop a baseline habitat suitability index: http://www.fishbase.org/Map/OccurrenceMapList.php?genus=Osphronemus&species=goramy&dsource=darwin_all_v2
  3. Current surface temperature from http://www.worldclim.org/
  4. Rivers of SE Asia from http://www.naturalearthdata.com/
  5. Vegetation classifications from NDVI data- time series- http://glam1.gsfc.nasa.gov/
  6. Additional information to evaluate the human impacts is available through the Mekong River Basin Study from the World Resources Institute: http://www.wri.org/resources/data-sets/mekong-river-basin-study

IUCN_Gourami

Hypotheses: I expect that the potential habitat for the giant gourami has increased over time and with increased human impacts do to the physiological resilience of the species.  This fish inhabits regions characterized by fresh to brackish water and in slow-moving areas like swamps, lakes, and large rivers.  Given its unique ability to breather air, this fish can survive in poorly oxygenated water to anoxic areas.  I expect that with climate change, increased urbanization, and the changing hydrologic profile of the system due to potential dams that this fish may become more suitable than others for its ability to live in ‘poorer’ environmental conditions.

Approaches: I hope to use python or modelbuilder to iterate through the available datasets to assess the changing habitat based on a habitat suitability index for the giant gourami.  There is also a time-series tool in Arc that I would like to explore.

Expected outcome: I hope to develop a habitat suitability index for the giant gourami and compare habitat suitability models for the potential habitat based on the changing physical landscape and increasing human impacts.  If the data are available, I hope to create a simple time-series animation for each model.

Significance: Fish production from aquaculture is poised to absorb an increasing amount of this demand for meat, offering techniques that offset some of the environmental costs of production.  Depending on the species and farming conditions, fish production can achieve some of the lowest feed-conversion ratios of any type of terrestrial animal meat production.  If farmed responsibly, some species of the diverse group of air-breathing fish such as the giant gourami present an advantage in aquaculture for their unique ability to breathe air.  However, it is critical to understand the impact of increased production levels on the natural range of the species order to mitigate the unwanted invasions or overloading of the natural environment.  A study to assess the spatio-temporal patterns of the habitat suitability of potential aquaculture species will allow for managers to make informed decisions about aquaculture siting and resource allocation.

Your level of preparation: In terms of my experience with the tools available for this type of analysis, I am starting to develop my comfort with ArcInfo, ModelBuilder, and Python for GIS.  However, am no expert.  I have also been exposed to some statistical applications of R, but am again not an expert.

 

FAO. (2014). The State of World Fisheries and Aquaculture 2014. Rome, Italy.

Lefevre, S., Wang, T., Jensen, a., Cong, N. V., Huong, D. T. T., Phuong, N. T., & Bayley, M. (2014). Air-breathing fishes in aquaculture. What can we learn from physiology? Journal of Fish Biology, 84, 705–731. doi:10.1111/jfb.12302