Introduction
I have done a cluster analysis on the fire regimes produced by the MC2 dynamic global vegetation model (DGVM) run across the Pacific Northwest (PNW) over the period 1895-2100 (see my previous Cluster Analysis in R post for details). Clusters were based on fire return interval (FRI), carbon consumed by fire, and fraction of cell burned. MC2 is a process-based model with modules for biogeochemistry, fire, and biogeography. Process-based models are generally complicated and carry high computational costs. I wanted to explore the question "How well could a statistical model reproduce the results from a process-based model?"
A complete answer to that question would take much work, of course, as MC2 produces data on vegetation types, carbon pools, carbon fluxes, water flows, and, of course, fire. But this investigation provides a starting point and has the potential to show that at least one aspect of the model could have a statistical alternative to process-based computation.
Method description
Overview
Inputs to MC2 include monthly weather (maximum temperature, minimum temperature, precipitation, and mean dew point temperature), as well as elevation and soil depth. I chose to use these inputs with a set of logistic regression models (described below) to predict the fire regime cluster of each cell in the results from my cluster analysis. To reduce the number of regression variables, I summarized the monthly variables into "summer" (April-September), "winter" (October-January), and annual means.
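The exact summarization code is not shown in this post, but the idea can be sketched in a few lines of R. This sketch assumes a data frame named climate with one row per cell, year, and month, and hypothetical columns cell, month, tmax, tmin, ppt, and tdew (these names are made up for illustration):
# Assign each month to a season; February and March fall outside both seasons
season = ifelse(climate$month %in% 4:9, "summer",
         ifelse(climate$month %in% c(10:12, 1), "winter", NA))
# Mean of each weather variable by cell and season (rows outside both seasons are dropped)
seasonalMeans = aggregate(
  climate[, c("tmax", "tmin", "ppt", "tdew")],
  by = list(cell = climate$cell, season = season),
  FUN = mean
)
# Annual means by cell
annualMeans = aggregate(
  climate[, c("tmax", "tmin", "ppt", "tdew")],
  by = list(cell = climate$cell),
  FUN = mean
)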
Logistic regression
Conceptually, logistic regression differs from linear regression in that logistic regression is designed to predict the probability of an outcome or occurrence in a binary data set. It is commonly used in models predicting the presence or absence of a phenomenon (e.g. fire occurrence, species occurrence) over a spatial domain. Logistic regression uses one or more continuous or binary explanatory variables to calculate the probability of the binary response variable.
The formula used in logistic regression is:
e^t / (e^t + 1)
which is equivalent to
1 / (1 + e^-t)
where t is the logit, a linear function of the explanatory values. The graph of the function is S-shaped.
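A quick way to see this shape is to plot the function over a range of logit values:
t = seq(-6, 6, 0.1)
plot(t, 1 / (1 + exp(-t)), type = 'l') # S-shaped curve rising from 0 to 1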
An example in R should make things clearer:
Make a dataset with an independent variable and a dependent binary variable:
myDS = data.frame(
  iv = 1:50,                    # independent variable
  dv = c(rep(0,25), rep(1,25))  # dependent binary variable
)
plot(myDS$iv,myDS$dv)
Your graph should look like this:
Scramble the dependent variable in positions 21 through 30 to make the plot messier, like real life:
myDS$dv[21:30] = sample(1:0,10,replace=TRUE)
plot(myDS$iv,myDS$dv)
Your graph should look similar to this one, with the regions of 1 and 0 values overlapping:
Now run a logistic regression on the data and look at the summary of the result:
tstGlm = glm(dv ~ iv, data = myDS, family = binomial)
summary(tstGlm)
Your summary should look something like this:
Call:
glm(formula = dv ~ iv, family = binomial, data = myDS)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.55618  -0.15882  -0.01953   0.15884   2.01358

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -9.0868     3.0071  -3.022  0.00251 **
iv            0.3429     0.1113   3.080  0.00207 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 69.235  on 49  degrees of freedom
Residual deviance: 19.165  on 48  degrees of freedom
AIC: 23.165

Number of Fisher Scoring iterations: 7
There is a good explanation of these values in this video: https://www.youtube.com/watch?v=xl5dZo_BSJk. A good reference for the formulas used can be found at http://documentation.statsoft.com/portals/0/formula%20guide/Logistic%20Regression%20Formula%20Guide.pdf
But the short explanation is:
Call: The call that was made to the glm function
Deviance residuals: Not really useful in a binary response regression
Coefficients:
- Intercept: The intercept, or constant coefficient, used in the logit linear function
- Estimate: These are the coefficients used with each variable in the logit (in this case we have one variable in the logit)
- Std. Error: The standard error associated with each coefficient.
- Z Value: The z value for the coefficient. This is the Estimate divided by the Std. Error.
- Pr(>z): The p value associated with each z value.
Null deviance: The deviance of a model that contains only the constant coefficient
Residual deviance: The deviance of the fitted model; the drop from the null deviance indicates how much the explanatory variables improve the fit.
AIC: Akaike information criterion. This is a measure of the quality of the model based on goodness of fit and model simplicity. For models created using the same number of data points, a model with a lower AIC is considered better than one with a higher AIC.
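You can check several of these values directly from the fitted model (your numbers may differ slightly from the output shown above because of the random scrambling):
coefTable = summary(tstGlm)$coefficients
coefTable[, "Estimate"] / coefTable[, "Std. Error"] # reproduces the z value column
tstGlm$aic                                          # the AIC reported in the summary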
Make a plot of the regression function along with the original points. You have to pass predict() a data frame whose column has the same name as the explanatory variable used to fit the regression:
tmpData = data.frame(iv = seq(0,50,0.2))
plot(seq(0,50,0.2), predict(tstGlm, newdata = tmpData, type = 'response'), type='l')
points(myDS$iv,myDS$dv)
Notice how the logistic curve, which predicts the probability of a positive outcome, moves from a y value of 0 to a y value of 1 where the dependent values from the original dataset are mixed between 1 and 0. From this, it is easy to see how logistic regression predicts probabilities for binary data.
Logistic regression can also be used with multiple explanatory variables. Each explanatory variable gets a term in the logit, and each term’s coefficient determines its influence on the final prediction.
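For example, a two-variable version of the earlier model looks like this (iv2 is a second, made-up explanatory variable added purely for illustration):
myDS$iv2 = rnorm(50) # made-up second explanatory variable
multiGlm = glm(dv ~ iv + iv2, data = myDS, family = binomial)
summary(multiGlm)    # one coefficient per term in the logit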
Stepwise Logistic Regression
Finding the best combination of explanatory values to use can present a challenge. Often, there are more variables than desired, and using too many can result in an overfitted model. One method used to determine which variables to use is stepwise regression. There are two approaches to this method: forward and backward.
In forward stepwise regression, the algorithm starts with the constant coefficient. Then it tests all of the explanatory values to find the one that, by itself, has the greatest explanatory power. It computes the AIC for the model and then tests the remaining explanatory variables to find which adds the greatest explanatory power. It then computes the AIC with the added variable. If the AIC is lower (better) with the new explanatory variable, it repeats the process. If the AIC is greater, it discards the new variable and stops.
Backward stepwise regression takes the opposite approach. It starts with all the variables and removes the one with the least explanatory value and then calculates AIC. It stops at the point when removing a variable would raise the AIC.
One approach to finding a model with stepwise regression is to do both forward and backward regression and see if one returns a better result than the other. Additionally, the results can be modified by removing any explanatory variables that have a high p-value to see if that yields a better model. I used both forward and backward regression and, in two cases, found model improvement by removing an explanatory variable with a high p-value.
A problem with stepwise regression is that it does not guarantee the best possible regression function. To find that, one has to test all possible combinations of explanatory variables, but that is generally a very compute-intensive proposition. So, even though it is easy to find criticisms of stepwise regression, it is commonly used.
Without going into too much detail, here is some R code that performs stepwise regression:
# Generate some independent, explanatory variables. Note how some are more “scrambled” than others
indepVs = data.frame(
  v1 = c(1:50),
  v2 = c(1:50),
  v3 = c(1:50),
  v4 = c(1:50),
  v5 = c(1:50),
  v6 = c(1:50)
)
indepVs$v2[23:27] = sample(c(23:27),5)
indepVs$v3[21:30] = sample(c(21:30),10)
indepVs$v4[16:35] = sample(c(16:35),20)
indepVs$v5[11:40] = sample(c(11:40),30)
indepVs$v6[1:50] = sample(c(1:50),50)
# The dependent variable, scrambled in the middle
depV = rep(0,50)
depV[26:50] = 1
depV[21:30] = sample(c(0:1),10,replace=TRUE)
# Generate a full model (using all the explanatory variables)
fullGlm = glm(
  depV ~ v1 + v2 + v3 + v4 + v5 + v6,
  data = indepVs,
  family = binomial()
)
# Generate an empty model
emptyGlm = glm(depV ~ 1, data = indepVs, family = binomial)
# Do forward stepwise regression
forwardGlm = step(
  emptyGlm,
  scope = list(lower = formula(emptyGlm), upper = formula(fullGlm)),
  direction = 'forward'
)
# Do a backward stepwise regression
backGlm = step(fullGlm)
Plotting the results and examining the resulting models using summary() will give you insights into how the process is working.
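As a rough sketch of the manual follow-up described earlier, the two stepwise results can be compared with AIC(), and a term with a high p-value can be dropped with update() to see whether the model improves (v5 here is just an example; use whichever variable the summary flags as non-significant):
AIC(forwardGlm, backGlm)                 # compare the forward and backward results
summary(backGlm)                         # look for terms with high p-values
reducedGlm = update(backGlm, . ~ . - v5) # drop the example term
AIC(backGlm, reducedGlm)                 # did removing it lower the AIC?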
Geographically weighted logistic regression
I did some searching for geographically weighted logistic regression and found a couple of papers that used the method. I did not come across any easily implemented solutions for it (which does not mean one does not exist). I think doing geographically weighted logistic regression could have a lot of value, especially over large and varied geographic areas, like the one I’m studying.
I ran a logistic regression on a single EPA Level III ecoregion (the Blue Mountains) within my study area and compared the results to those from the model produced using the whole study region; the ecoregion-specific model had a higher predictive value within the ecoregion. This is an indication that geographically weighted regression could provide advantages.
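A minimal sketch of that comparison, assuming a data frame cellData with a hypothetical ecoregion column, a binary inRegime column for one fire regime, and made-up explanatory variable names (a simple 0.5 probability threshold is used here for illustration rather than the highest-probability rule described below):
blueMtn = cellData[cellData$ecoregion == "Blue Mountains", ]
# One model fit to the whole study area, one fit only to the ecoregion
fullAreaGlm = glm(inRegime ~ tmax_ann + ppt_ann + elev, data = cellData, family = binomial)
blueMtnGlm  = glm(inRegime ~ tmax_ann + ppt_ann + elev, data = blueMtn,  family = binomial)
# Fraction of ecoregion cells each model predicts correctly
mean((predict(fullAreaGlm, newdata = blueMtn, type = "response") > 0.5) == blueMtn$inRegime)
mean((predict(blueMtnGlm,  newdata = blueMtn, type = "response") > 0.5) == blueMtn$inRegime)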
The Analysis
I used logistic regression as part of a process to predict which fire regime each cell in my dataset would have. My explanatory values were derived from the input dataset I used for modeling: annual, summer, and winter means for maximum temperature, minimum temperature, precipitation, and mean dew point temperature; elevation; and soil depth. For each of my four fire regimes, I generated a logistic regression predicting whether or not a cell was in that fire regime. Based on recommendations from https://www.medcalc.org/manual/logistic_regression.php, I used this formula to determine the number of data points to use in each regression:
10 * num_exp_vars / min(frac_pos_cases, frac_neg_cases)
where num_exp_vars is the number of explanatory variables, frac_pos_cases is the fraction of positive (presence) cells in the dataset, and frac_neg_cases is the fraction of negative (absence) cells.
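As a quick example of the calculation in R: with 14 explanatory variables (the twelve seasonal and annual climate means plus elevation and soil depth) and, say, 20% of cells in a given fire regime (a made-up fraction for illustration), the formula calls for 700 data points:
numPointsNeeded = function(numExpVars, fracPosCases) {
  10 * numExpVars / min(fracPosCases, 1 - fracPosCases)
}
numPointsNeeded(14, 0.2) # 10 * 14 / 0.2 = 700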
This gave me four logistic regression functions. For each cell in the study area, I ran each of the logistic functions and recorded the probability returned. The predicted fire regime for a cell was assigned based on the logistic function that produced the highest probability. I repeated this entire procedure for the Blue Mountain Ecoregion only.
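A sketch of that assignment step, assuming the four fitted models are stored in a list and cellData holds the explanatory variables for every cell (all names are hypothetical):
regimeGlms = list(regime1Glm, regime2Glm, regime3Glm, regime4Glm)
# One column of probabilities per fire regime, one row per cell
probs = sapply(regimeGlms, function(m) predict(m, newdata = cellData, type = "response"))
# Predicted regime = the regime whose logistic function gives the highest probability
predictedRegime = apply(probs, 1, which.max)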
Results
Across the full study area, this process correctly predicted the fire regime with 61% accuracy (keep in mind that random assignment would be expected to return 25% accuracy). Within the Blue Mountain Ecoregion, the process predicted the fire regime with 46% accuracy using the logistic functions derived from the entire study area, and with 74% accuracy using the logistic functions derived for the Blue Mountain Ecoregion alone. This is a strong indication that geographic influences exist and that geographically weighted logistic regression has the potential to improve results.
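The accuracy figures are simply the fraction of cells whose predicted regime matches the regime assigned by the cluster analysis; with the hypothetical names from the sketch above, that is:
mean(predictedRegime == cellData$clusterRegime) # fraction of cells predicted correctly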
Method Critique
I was very pleased with the results of this method, and frankly, a little surprised at how successful the method was at predicting fire regime. This investigation leads me to believe that statistical models should be considered as alternatives to process-based models for some vegetation modeling tasks, especially considering the lower computational overhead of statistical models.
The biggest downside to the method is the manual procedures involved in finding the best logistic function. This process could be automated with some programming, however, as could a geographically weighted version of logistic regression.