Boosted regression trees (BRT) are an ensemble, tree-based species distribution model that iteratively grows small, simple trees fit to the residuals from all previous trees. Each tree is fit to a random subset of the data, and all predictor variables are considered when choosing the best split at each node. The tree complexity determines how deep each individual tree is grown, so anticipated interactions can be captured by setting an appropriate tree complexity. The learning rate determines the weight of each individual tree in the overall model. BRT is an advanced form of regression built from two main components: regression trees and boosting.
- Regression trees: partition the predictor space into rectangles, using a series of rules to identify the regions with the most homogeneous response to the predictors, and fit a constant to each region.
- Boosting: an iterative procedure that reduces the deviance by fitting each new tree to the residuals of the previous trees.
- Each individual tree informs the subsequent trees, and thus the final model.
- The boosting component is what makes boosted regression trees distinct from other tree-based methods (see the sketch below).
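To make the boosting idea concrete, here is a minimal conceptual sketch in R. It is not the gbm implementation used later; the simulated data and the rpart stumps are illustrative assumptions only. Each small tree is fit to the residuals of the current ensemble and added with a small weight (the learning rate).

# Conceptual boosting sketch: repeatedly fit a simple tree to the residuals
# of the current ensemble and add it with a small weight (the learning rate)
library(rpart)

set.seed(1)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)   # toy response

lr      <- 0.1                  # learning rate: weight of each tree
n.trees <- 100
pred    <- rep(mean(dat$y), n)  # start from a constant fit

for (i in seq_len(n.trees)) {
  dat$resid <- dat$y - pred                                      # residuals of the current ensemble
  stump <- rpart(resid ~ x, data = dat,
                 control = rpart.control(maxdepth = 1, cp = 0))  # a small, simple tree
  pred <- pred + lr * predict(stump, dat)                        # shrink and add to the ensemble
}

plot(dat$x, dat$y)                           # observations
points(dat$x, pred, col = "red", pch = 16)   # boosted fit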
Objective of this analysis:
Identify the biophysical parameters associated with giant gourami distribution in SE Asia, starting with a set of 47 global occurrence points. This was an exploratory exercise to learn about the physical variables that might be important for the distribution of the giant gourami.
My Data:
The 47 occurrence points for the giant gourami were pulled from fishbase.com and used as the basis for the dataset.
ArcMap: generate points and convert to KML for import into Google Earth Engine
//Get coordinates for occurrence points for species of interest (Giant Gourami) from fishbase.com
//create random points to use as ‘pseudo absence’ points
//generate random points within study region and only in rivers
Google Earth Engine: 'gather' biophysical data for the points from satellite imagery (used for ease of access to spatial layers)
//load in image layers of interest (NDVI, CHIRPS, Population density, Flow, Surface Temp.)
//export to CSV for analysis in R Studio
R code for running the BRT model:
I ran the model using the gbm package in R, following the BRT tutorial by Jane Elith and John Leathwick (https://cran.r-project.org/web/packages/dismo/vignettes/brt.pdf); the gbm.step function used below comes from the brt.functions.R code that accompanies that tutorial.
source("BRT/brt.functions.R")   # provides gbm.step and the other BRT helper functions
install.packages('gbm')
library(gbm)
# define the dataset
gourami <- read.csv("BRT/gourami/gourami_data.csv")
# data consist of 39 presence and 305 pseudo-absence points (344 total)
# 5 predictor variables
gourami.tc3.lr005 <- gbm.step(data = gourami,
    gbm.x = 3:7,            # columns in the dataset where the predictor variables are located
    gbm.y = 2,              # column in the dataset where the presence/absence data (0/1) are located
    family = "bernoulli",
    tree.complexity = 3,    # number of splits in each tree, which sets tree depth and interaction order
    learning.rate = 0.005,  # weight of each tree in the overall model
    bag.fraction = 0.75)    # fraction of the data randomly drawn to fit each tree
The three main parameters to pay attention to at this point are tree complexity, learning rate, and bag fraction. The bag fraction is the proportion of the data randomly drawn to fit each tree, while gbm.step also performs k-fold cross-validation (10 folds by default) to evaluate the model. These three parameters can be varied to determine the 'best' model.
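The cross-validated performance that gbm.step reports is also stored on the fitted object, so it can be inspected directly. A minimal sketch, assuming the gourami.tc3.lr005 object fit above (element names follow the dismo/brt.functions output and may differ slightly between versions):

# cross-validated performance stored by gbm.step
names(gourami.tc3.lr005$cv.statistics)                # see what is stored
gourami.tc3.lr005$cv.statistics$deviance.mean         # mean CV deviance (reported below)
gourami.tc3.lr005$cv.statistics$discrimination.mean   # mean CV ROC score (reported below)
gourami.tc3.lr005$gbm.call$best.trees                 # number of trees in the final model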
Results
Initial output for the model with a tree complexity of 3, a learning rate of 0.005, and a bag fraction of 0.75. Several warning messages were displayed with this particular model; they are not addressed in this tutorial:
mean total deviance = 0.707
mean residual deviance = 0.145
estimated cv deviance = 0.259 ; se = 0.058
training data correlation = 0.916
cv correlation = 0.838 ; se = 0.043
training data ROC score = 0.989
cv ROC score = 0.958 ; se = 0.02
elapsed time – 0.13 minutes
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1
Relative contribution of each variable in determining where the species is expected to occur. For this model, precipitation has by far the largest relative influence:
> gourami.tc3.lr005$contributions
        var   rel.inf
mean_Precip 61.223530
  mean_temp 25.156042
pop_density 10.299844
  NDVI_mean  1.685350
       Flow  1.635234
Interactions:
|        | NDVI | Precip | flow  | Temp   | Pop   |
| NDVI   | 0    | 29.62  | 0.11  | 0.07   | 0.08  |
| Precip | 0    | 0      | 17.00 | 317.66 | 84.51 |
| flow   | 0    | 0      | 0     | 0      | 0.93  |
| Temp   | 0    | 0      | 0     | 0      | 3.29  |
| Pop    | 0    | 0      | 0     | 0      | 0     |
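These pairwise interaction sizes come from an interaction scan of the fitted model. A minimal sketch, assuming the gbm.interactions helper from the sourced tutorial code (dismo/brt.functions) is available:

# quantify the pairwise interactions fitted by the BRT model
find.int <- gbm.interactions(gourami.tc3.lr005)
find.int$interactions   # matrix of interaction sizes (tabulated above)
find.int$rank.list      # the strongest interactions, ranked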
Partial dependence plots visualize the effect of a single variable on the model response, holding all other variables constant. Model results vary the most with precipitation, as seen in the top-left plot. Mean temperature and population density also appear to play a role in giant gourami distribution based on these plots, but their effects may be more apparent if you zoom in on the upper temperature range or the lower population-density range.
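Plots like these can be generated with the gbm.plot helper from the same tutorial code; a minimal sketch (the panel layout is just one reasonable choice):

# partial dependence plots: fitted function for each predictor,
# holding all other predictors constant
gbm.plot(gourami.tc3.lr005,
         n.plots = 5,            # one panel per predictor variable
         plot.layout = c(2, 3),  # arrange the panels in a 2 x 3 grid
         write.title = FALSE)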
Model comparison, varying tree complexity and learning rate to find the best settings. The top row illustrates model fit for a tree complexity of 3 with a learning rate of 0.01 (left) and 0.005 (right); the bottom row illustrates model fit for a tree complexity of 4 with a learning rate of 0.01 (left) and 0.005 (right):
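A sketch of how the alternative models can be fit for this comparison, assuming the same dataset and column layout as above (the loop and the printed summary are illustrative, not the exact code used):

# refit the model over a small grid of settings and compare CV performance
for (tc in c(3, 4)) {
  for (lr in c(0.01, 0.005)) {
    m <- gbm.step(data = gourami, gbm.x = 3:7, gbm.y = 2,
                  family = "bernoulli",
                  tree.complexity = tc,
                  learning.rate = lr,
                  bag.fraction = 0.75)
    cat("tc =", tc, " lr =", lr,
        " cv deviance =", m$cv.statistics$deviance.mean, "\n")
  }
}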
It appears that a model with a tree complexity of 3 and a learning rate of 0.005 performs the best. This model indicates that precipitation has the largest effect on the distribution of giant gourami in SE Asia, based on the initial 34 occurrence points.
Model critique: BRT is not a spatially explicit model and thus relies only on the relationships between the predictor variables and the sample points. Additionally, due to the complex nature of the model (often consisting of thousands of trees), the results can be difficult to interpret or explain.