{"id":1960,"date":"2016-06-05T13:36:58","date_gmt":"2016-06-05T20:36:58","guid":{"rendered":"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/?p=1960"},"modified":"2016-06-05T13:40:49","modified_gmt":"2016-06-05T20:40:49","slug":"boosted-regression-trees-giant-gourami-distribution","status":"publish","type":"post","link":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/2016\/06\/05\/boosted-regression-trees-giant-gourami-distribution\/","title":{"rendered":"Boosted Regression Trees for Giant Gourami Distribution"},"content":{"rendered":"<p><b>Boosted regression trees (BRT)<\/b><span style=\"font-weight: 400\"> are an ensemble tree-based method, used here as a species distribution model, that iteratively grows small, simple trees, each fitted to the residuals from all previous trees. \u00a0Each tree is fitted to a random subset of your data, and ALL predictor variables are considered to produce the best splits at each node. \u00a0The tree complexity determines how deep each individual tree is grown. \u00a0Anticipated interactions can be captured by setting the appropriate tree complexity. \u00a0The learning rate determines the weight of each individual tree in the final model. \u00a0BRT is an advanced form of regression that consists of two components. 
The two <a href=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2016\/06\/elith08_fig1.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-1961 alignright\" src=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2016\/06\/elith08_fig1-215x300.jpg\" alt=\"elith08_fig1\" width=\"336\" height=\"469\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2016\/06\/elith08_fig1-215x300.jpg 215w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2016\/06\/elith08_fig1.jpg 379w\" sizes=\"auto, (max-width: 336px) 100vw, 336px\" \/><\/a>main components of BRT are regression trees and boosting.<\/span><\/p>\n<p>&nbsp;<\/p>\n<ol>\n<li style=\"font-weight: 400\"><b>Regression trees<\/b><span style=\"font-weight: 400\">: \u00a0Partition the predictor space into rectangles, using a series of rules to identify regions with the most homogeneous response to the predictors, and fit a constant to each region. <\/span><\/li>\n<li style=\"font-weight: 400\"><b>Boosting<\/b><span style=\"font-weight: 400\">: An iterative procedure that reduces the deviance by fitting each new tree to the residuals of the previous tree(s). <\/span><\/li>\n<\/ol>\n<ul>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Each individual tree informs subsequent trees and thus the final model.<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">The boosting component makes boosted regression distinct from other tree-based methods.<\/span><\/li>\n<\/ul>\n<\/ul>\n<p>&nbsp;<\/p>\n<p><b>Objective of this analysis:<\/b><\/p>\n<p><span style=\"font-weight: 400\">Identify the biophysical parameters associated with giant gourami distribution in SE Asia, starting with a set of 47 global occurrence points. 
\u00a0This was an exploratory exercise to learn about the physical variables that might be important for the distribution of the giant gourami.\u00a0<\/span><\/p>\n<p><b>My Data:<\/b><\/p>\n<p><span style=\"font-weight: 400\">I pulled the 47 occurrence points for the giant gourami from fishbase.com and used them as the basis for the dataset.<\/span><\/p>\n<p><b>ArcMap<\/b><span style=\"font-weight: 400\">: generate points and convert to KML for import into Google Earth Engine<\/span><\/p>\n<p><span style=\"font-weight: 400\">\/\/Get coordinates for occurrence points for species of interest (Giant Gourami) from fishbase.com<\/span><\/p>\n<p><span style=\"font-weight: 400\">\/\/create <\/span><b>random<\/b><span style=\"font-weight: 400\"> points to use as \u2018pseudo absence\u2019 points<\/span><\/p>\n<p><span style=\"font-weight: 400\">\/\/generate random points within study region and only in rivers<\/span><\/p>\n<p><b>Google Earth Engine<\/b><span style=\"font-weight: 400\">: \u2018gather\u2019 biophysical data for points from satellite imagery&#8211;used for ease of access to spatial layers<\/span><\/p>\n<p><span style=\"font-weight: 400\">\/\/load in image layers of interest (NDVI, CHIRPS, Population density, Flow, Surface Temp.)<\/span><\/p>\n<p><span style=\"font-weight: 400\">\/\/export to CSV for analysis in RStudio<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><b>R code<\/b><span style=\"font-weight: 400\"> for running the BRT model:<\/span><\/p>\n<p><span style=\"font-weight: 400\">I ran the model using the gbm package in R, with the gbm.step function loaded from brt.functions.R (it is not part of gbm itself), based on a tutorial by Jane Elith and John Leathwick (<\/span><a href=\"https:\/\/cran.r-project.org\/web\/packages\/dismo\/vignettes\/brt.pdf\"><span style=\"font-weight: 400\">https:\/\/cran.r-project.org\/web\/packages\/dismo\/vignettes\/brt.pdf<\/span><\/a><span style=\"font-weight: 400\">)<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">&gt;&gt;source(&quot;BRT\/brt.functions.R&quot;)<\/span><\/p>\n<p><span 
style=\"font-weight: 400\">&gt;&gt;install.packages(&#8216;gbm&#8217;)<\/span><\/p>\n<p><span style=\"font-weight: 400\">&gt;&gt;library(gbm)<\/span><\/p>\n<p><span style=\"font-weight: 400\"># define the dataset<\/span><\/p>\n<p><span style=\"font-weight: 400\">&gt;&gt;gourami &lt;- read.csv(&#8220;BRT\/gourami\/gourami_data.csv&#8221;)<\/span><\/p>\n<p><span style=\"font-weight: 400\"># data consists of 39 presence and 305 pseudo absence (344)<\/span><\/p>\n<p><span style=\"font-weight: 400\"># 5 predictor variables <\/span><\/p>\n<p><span style=\"font-weight: 400\">&gt;&gt;gourami.tc3.lr005 &lt;- gbm.step(data=gourami, <\/span><\/p>\n<p><span style=\"font-weight: 400\"> \u00a0\u00a0\u00a0<\/span> <span style=\"font-weight: 400\">gbm.x = 3:7, #columns in the dataset where the response variables are located<\/span><\/p>\n<p><span style=\"font-weight: 400\"> \u00a0\u00a0<\/span> <span style=\"font-weight: 400\">gbm.y = 2, #column in the dataset where presence\/absence data is located (0\/1)<\/span><\/p>\n<p><span style=\"font-weight: 400\"> \u00a0\u00a0<\/span> <span style=\"font-weight: 400\">family = &#8220;bernoulli&#8221;,<\/span><\/p>\n<p><span style=\"font-weight: 400\"> \u00a0\u00a0\u00a0<\/span> <strong><i>tree.complexity = 3, #tree depth determines the number of layers in each tree<\/i><\/strong><\/p>\n<p><strong><i> \u00a0<\/i> <i>learning.rate = 0.005, #weight of each tree in the overall model<\/i><\/strong><\/p>\n<p><strong><i> \u00a0\u00a0\u00a0<\/i> <i>bag.fraction = 0.75<\/i>) #fraction of the dataset used to build\/train the model<\/strong><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">The three main parameters to pay attention to at this point are tree complexity, learning rate, and bag fraction. The remaining fraction of the dataset not used in the bag fraction is then used for cross validation for model evaluation. 
These three parameters can be varied to determine the \u2018best\u2019 model.<\/span><\/p>\n<p><b>Results<\/b><\/p>\n<p><span style=\"font-weight: 400\">Initial output for model with tree complexity of 3, learning rate of 0.005, and bag fraction of 0.75. \u00a0Several warning messages were displayed with this particular model, which are not addressed in this tutorial:<\/span><\/p>\n<p><i><span style=\"font-weight: 400\">mean total deviance = 0.707 <\/span><\/i><\/p>\n<p><i><span style=\"font-weight: 400\">mean residual deviance = 0.145 <\/span><\/i><\/p>\n<p><i><span style=\"font-weight: 400\">estimated cv deviance = 0.259 ; se = 0.058 <\/span><\/i><\/p>\n<p><i><span style=\"font-weight: 400\">training data correlation = 0.916 <\/span><\/i><\/p>\n<p><i><span style=\"font-weight: 400\">cv correlation = \u00a00.838 ; se = 0.043 <\/span><\/i><\/p>\n<p><i><span style=\"font-weight: 400\">training data ROC score = 0.989 <\/span><\/i><\/p>\n<p><i><span style=\"font-weight: 400\">cv ROC score = 0.958 ; se = 0.02 <\/span><\/i><\/p>\n<p><i><span style=\"font-weight: 400\">elapsed time &#8211; \u00a00.13 minutes <\/span><\/i><\/p>\n<p>&nbsp;<\/p>\n<p><i><span style=\"font-weight: 400\">Warning messages:<\/span><\/i><\/p>\n<p><i><span style=\"font-weight: 400\">1: glm.fit: algorithm did not converge <\/span><\/i><\/p>\n<p><i><span style=\"font-weight: 400\">2: glm.fit: fitted probabilities numerically 0 or 1 <\/span><\/i><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">Relative <\/span><b>contributions <\/b><span style=\"font-weight: 400\">of each variable in determining where the species is expected. 
\u00a0For this model, precipitation has the strongest pull:<\/span><\/p>\n<p><span style=\"font-weight: 400\">&gt; gourami.tc3.lr005$contributions<\/span><\/p>\n<p><span style=\"font-weight: 400\"> \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0var \u00a0\u00a0rel.inf<\/span><\/p>\n<p><span style=\"font-weight: 400\">mean_Precip: 61.223530<\/span><\/p>\n<p><span style=\"font-weight: 400\">mean_temp: 25.156042<\/span><\/p>\n<p><span style=\"font-weight: 400\">pop_density: 10.299844<\/span><\/p>\n<p><span style=\"font-weight: 400\">NDVI_mean: \u00a01.685350<\/span><\/p>\n<p><span style=\"font-weight: 400\">Flow: \u00a0\u00a01.635234<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><b>Interactions<\/b><span style=\"font-weight: 400\">:<\/span><\/p>\n<table style=\"height: 244px\" width=\"450\">\n<tbody>\n<tr>\n<td><\/td>\n<td><span style=\"font-weight: 400\">NDVI<\/span><\/td>\n<td><span style=\"font-weight: 400\">Precip <\/span><\/td>\n<td><span style=\"font-weight: 400\"> flow <\/span><\/td>\n<td><span style=\"font-weight: 400\">Temp <\/span><\/td>\n<td><span style=\"font-weight: 400\">Pop<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400\">NDVI<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">29.62<\/span><\/td>\n<td><span style=\"font-weight: 400\">0.11<\/span><\/td>\n<td><span style=\"font-weight: 400\">0.07<\/span><\/td>\n<td><span style=\"font-weight: 400\">0.08<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400\">Precip<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">17.00<\/span><\/td>\n<td><span style=\"font-weight: 400\">317.66<\/span><\/td>\n<td><span style=\"font-weight: 400\">84.51<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400\">flow<\/span><\/td>\n<td><span style=\"font-weight: 
400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">0.93<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400\">Temp<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">3.29<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400\">Pop<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<td><span style=\"font-weight: 400\">0<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><b>Partial dependence plots <\/b><span style=\"font-weight: 400\">visualize the effect of a single variable on the model response after accounting for the average effects of all other variables. \u00a0Model results vary the most with precipitation, as seen in the top left plot. \u00a0Mean temperature and population density also appear to play a role in giant gourami distribution based on these plots, but their effects may be more apparent if you zoom in on the upper temperature threshold or the lower population density range. 
<\/span><\/p>\n<p><a href=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2016\/06\/gourami_tc3_lr005_plots-e1465158809100.jpeg\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1964 aligncenter\" src=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2016\/06\/gourami_tc3_lr005_plots-e1465158809100-300x162.jpeg\" alt=\"gourami_tc3_lr005_plots\" width=\"720\" height=\"389\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2016\/06\/gourami_tc3_lr005_plots-e1465158809100-300x162.jpeg 300w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2016\/06\/gourami_tc3_lr005_plots-e1465158809100-768x415.jpeg 768w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2016\/06\/gourami_tc3_lr005_plots-e1465158809100.jpeg 941w\" sizes=\"auto, (max-width: 720px) 100vw, 720px\" \/><\/a><a href=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2016\/06\/gourami_tc3_lr005_plots.jpeg\"><br \/>\n<\/a><b>Model comparison<\/b><span style=\"font-weight: 400\">, varying tree complexity and learning rate to evaluate the best setting. Top row illustrates model fit for a tree complexity of 3 with a learning rate of 0.01 (Left) and 0.005 (right). 
\u00a0The bottom row illustrates model fit for a tree complexity of 4 with a learning rate of 0.01 (left) and 0.005 (right):<a href=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2016\/06\/Gourami_BRT_Rplot_3x4d3-4_lr0.1-0.005_wNOpop.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1962 aligncenter\" src=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2016\/06\/Gourami_BRT_Rplot_3x4d3-4_lr0.1-0.005_wNOpop-300x257.png\" alt=\"Gourami_BRT_Rplot_3x4d3-4_lr0.1-0.005_wNOpop\" width=\"770\" height=\"660\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2016\/06\/Gourami_BRT_Rplot_3x4d3-4_lr0.1-0.005_wNOpop-300x257.png 300w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2016\/06\/Gourami_BRT_Rplot_3x4d3-4_lr0.1-0.005_wNOpop.png 635w\" sizes=\"auto, (max-width: 770px) 100vw, 770px\" \/><\/a><\/span><\/p>\n<p><span style=\"font-weight: 400\">It appears that a model with a tree complexity of 3 and a learning rate of 0.005 performs best. This model indicates that precipitation has the largest effect on the distribution of giant gourami in SE Asia, based on the initial 34 occurrence points. \u00a0<\/span><\/p>\n<p><b>Model critique<\/b><span style=\"font-weight: 400\">: BRTs are not spatially explicit models and thus rely only on the relationships between the predictor variables and the sample points. \u00a0Additionally, because the model is complex (often comprising thousands of trees), the results can be difficult to interpret or explain.<\/span><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Boosted regression trees (BRT) are an ensemble tree-based species distribution model that iteratively grows small\/simple trees based on the residuals from all previous trees. \u00a0The model is run on a random subset of your data, and ALL predictor variables are considered to produce the best splits at each node. 
\u00a0The tree-complexity determines how deep each&hellip; <a href=\"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/2016\/06\/05\/boosted-regression-trees-giant-gourami-distribution\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":7725,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[661527,1],"tags":[],"class_list":["post-1960","post","type-post","status-publish","format-standard","hentry","category-tutorials-2016","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/posts\/1960","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/users\/7725"}],"replies":[{"embeddable":true,"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/comments?post=1960"}],"version-history":[{"count":5,"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/posts\/1960\/revisions"}],"predecessor-version":[{"id":1968,"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/posts\/1960\/revisions\/1968"}],"wp:attachment":[{"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/media?parent=1960"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/categories?post=1960"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/tags?post=1960"}],"curies"
:[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}