Tim Sheehan

GEO 584: Advanced Spatial Statistics and GIScience

Final Blog Post

Introduction

The MC2 dynamic global vegetation model

Dynamic global vegetation models (DGVMs) simulate vegetation dynamics over regional to global scales. DGVMs are process-based models, meaning they simulate the physical and chemical processes that drive vegetation growth. MC2 (Bachelet et al., 2015) is one such model. MC2 models processes on a regular latitude/longitude grid using a monthly time step. Each grid cell is treated independently; that is, processes in one cell do not affect processes in any other cell. MC2 inputs include soil characteristics, elevation, and monthly means for minimum temperature, maximum temperature, precipitation, and dew point temperature.

Unlike many DGVMs, MC2 includes a fire module that models the effects of fire on vegetation. The current version of MC2 simulates a fire whenever fuel conditions exceed a defined threshold. In other words, ignitions are assumed. This has the potential to overestimate fire occurrence in areas where the fuel threshold is frequently exceeded, to underestimate fire occurrence in areas where the threshold is never or rarely exceeded, and to model only severe fires. Other aspects of model results, especially vegetation occurrence, age, and density as well as carbon dynamics, are driven in part by fire, so correctly modeling fire is important to correctly modeling other processes.

In an effort to make MC2 fire modeling more realistic, I have implemented a stochastic ignitions algorithm in the MC2 fire model. Rather than having a single fuel threshold that determines fire occurrence, the stochastic model uses a probabilistic method (Monte Carlo) to determine if an ignition source is present in a model grid cell. If an ignition source is present, then a second probabilistic method, based on fuel conditions, is used to determine whether the ignition leads to a fire occurrence. If a fire occurs, it is modeled in the same way fires are modeled with assumed ignitions.
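A minimal sketch of this two-stage check for a single grid cell and time step is shown below. The parameter names and the fuel-to-probability mapping are illustrative placeholders, not the values or functions used in the actual MC2 implementation.

## Illustrative two-stage stochastic ignition check for one grid cell and one time step.
## ignition_prob and fire_prob_from_fuel are hypothetical placeholders.

fire_prob_from_fuel = function(fuel_condition) {
  ## Placeholder mapping from a 0-1 fuel condition index to the probability
  ## that an ignition leads to a fire
  fuel_condition ^ 2
}

stochastic_fire_occurs = function(fuel_condition, ignition_prob = 0.05) {
  ## First draw: is an ignition source present in the cell?
  if (runif(1) >= ignition_prob) return(FALSE)
  ## Second draw: does the ignition lead to a fire, given fuel conditions?
  runif(1) < fire_prob_from_fuel(fuel_condition)
}

## Example: estimate fire probability for dry fuels over 1000 Monte Carlo trials
mean(replicate(1000, stochastic_fire_occurs(fuel_condition = 0.9)))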

A fire regime is the pattern, frequency, and intensity of wildfire in an area and is inextricably linked to an area’s ecology. Despite the multiple factors that define a fire regime, aspects of DGVM fire results, including those from MC2, are commonly presented in terms of individual metrics over time, for example an increase in fire frequency in the 21st century versus the 20th century (e.g. Sheehan et al., 2015).

EPA level III ecoregions

EPA ecoregions (https://archive.epa.gov/wed/ecoregions/web/html/ecoregions-2.html) are geographic regions for North America based on similarities of biotic and abiotic phenomena affecting ecosystem integrity and quality. EPA ecoregions are defined at four levels (I through IV), with higher levels covering smaller areas. This study uses Level III ecoregions, which range in area from approximately 24,000 to 100,000 km².

Research questions and hypotheses

Research questions for this study are:

  • How consistent are fire regimes within ecoregions?
  • How do fire regimes change under projected climate?
  • Could fire regimes produced by MC2 be predicted by a statistical model?

To gain insight into these questions, I tested the following hypotheses:

  1. Fire-free cells will decrease in the 21st century compared to the 20th due to increasing temperatures.
  2. Fire regimes will group within individual ecoregions over the 20th century, but less so in the 21st century due to changes in fire regime as a result of projected changes in climate between the two centuries.
  3. Logistic regression will be able to predict the fire regimes produced by MC2 due to embedded statistical relationships between fire and input variables in the model.
  4. Logistic regression performed on an ecoregional basis will have greater predictive power than one performed over the entire region due to local geographical relationships within each ecoregion.

Methods

Study area, resolution, and data

The spatial domain for this study covers the Pacific Northwest (PNW) (Fig. 1), defined as the portion of the conterminous United States north of 42° latitude and west of -111° longitude. The area was mapped to a 2.5 arc minute (approximately 4 km x 4 km) latitude/longitude grid. Modeling was done over the period 1895-2100. I also repeated the analysis using only the Blue Mountain Ecoregion (Fig. 1C), located within the PNW.

SheehanFinalBlogFig1

Figure 1: A) Index map of PNW, B) digital elevation model (DEM) of PNW, and C) EPA Level III ecoregions.

Inputs for MC2 include static soil and elevation data as well as monthly climate data. Climate data consist of minimum temperature, maximum temperature, precipitation, and dew point temperature. Inputs for 1895-2010 were from the PRISM (http://www.prism.oregonstate.edu/) 4 km dataset, which is created by interpolating observed data. Inputs for 2011-2100 came from the Coupled Model Intercomparison Project Phase 5 (CMIP5, http://cmip-pcmdi.llnl.gov/cmip5/) Community Climate System Model 4 (CCSM4, http://www.cesm.ucar.edu/models/ccsm4.0/) run using the Representative Concentration Pathway (RCP) 8.5 (https://en.wikipedia.org/wiki/Representative_Concentration_Pathways) CO2 concentration data. RCP 8.5 is based on a “business as usual” scenario of CO2 production through the end of the 21st century. The 2011-2100 climate data were downscaled to the 2.5 arc minute resolution using the Multivariate Adaptive Constructed Analogs method (MACA; Abatzoglou and Brown, 2012; http://maca.northwestknowledge.net/). MC2 outputs include annual values for vegetation type, carbon pools and fluxes, hydrological data, and fire occurrence data including fraction of grid cell burned and carbon consumed by fire. MC2 input and output data were divided into two datasets based on time period: 20th century (1901-2000) and 21st century (2001-2100).

k-means cluster analysis

For this study, fire regime was defined as a combination of mean annual carbon consumed by fire, mean annual fraction of cell burned, and fire return interval (FRI) over the century. (Note that FRI is usually calculated over periods longer than a century, especially in ecosystems rarely exposed to fire.) For cluster analysis, the two datasets were combined and normalized to avoid uneven influences caused by differing data value ranges. I used k-means clustering in R to produce four fire regime clusters (details in my blog post on K-means clustering in R: http://blogs.oregonstate.edu/geo599spatialstatistics/2016/05/21/cluster-analysis-r/).
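The core of that workflow can be sketched in a few lines of R. Here fireDF is a hypothetical data frame holding one row per grid cell and century; its column names are illustrative stand-ins, not the variable names in my actual files.

## Min-max normalize each fire regime variable, then cluster into four regimes.
## fireDF and its column names are hypothetical placeholders for the combined dataset.
normalize01 = function(x) (x - min(x)) / (max(x) - min(x))

fireNorm = as.data.frame(lapply(fireDF[, c("consumed", "fracBurned", "fri")], normalize01))

set.seed(1)  # k-means starting centers are random
fireClust = kmeans(fireNorm, centers = 4, iter.max = 500)

fireDF$regime = fireClust$cluster  # fire regime cluster assigned to each cell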

Logistic regression

I did stepwise logistic regression modeling for each of the fire regime clusters (details in my blog post on logistic regression in R: http://blogs.oregonstate.edu/geo599spatialstatistics/2016/05/27/logistic-regression-r/). Explanatory variables were taken from inputs to the MC2 model runs (Fig. 2). Climate variables were summarized annually, by “summer” (April-September), and by “winter” (October-January), and the mean of each summary was then taken over each century. Additional explanatory variables were soil depth and elevation. For each cluster, presence was membership in the cluster and absence was non-membership. The process resulted in four logistic functions, one for each fire regime cluster.
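A sketch of one cluster’s regression is below. The data frame cells and its column names are hypothetical stand-ins for my actual data, and the list of explanatory variables is abbreviated.

## Stepwise logistic regression for membership in fire regime cluster 1.
## `cells` and its column names are illustrative placeholders.
cells$inCluster = as.integer(cells$regime == 1)  # presence/absence response

fullGlm = glm(inCluster ~ tmaxAnn + tmaxSum + tmaxWin + pptAnn + pptSum + pptWin +
                elev + soilDepth,
              data = cells, family = binomial)
emptyGlm = glm(inCluster ~ 1, data = cells, family = binomial)

## Forward stepwise selection by AIC (backward is simply step(fullGlm))
clust1Glm = step(
  emptyGlm,
  scope = list(lower = formula(emptyGlm), upper = formula(fullGlm)),
  direction = 'forward'
)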

Tournament cluster prediction

I used a tournament method to predict the fire regime cluster. For each grid cell, each cluster’s logistic regression model was used to calculate the probability that the grid cell would be in that cluster. The cluster with the highest membership probability was chosen as the predicted cluster. I constructed confusion matrices and contingency tables of predicted fire regimes versus actual fire regimes. I also produced maps showing the predicted fire regime cluster, the actual fire regime cluster, and the Euclidean distance between cluster centers for the predicted and actual fire regime clusters.
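A sketch of the tournament step is below, assuming clusterGlms is a list of the four fitted logistic models and cells again holds the explanatory variables plus each cell’s actual regime (both names are illustrative placeholders).

## Tournament prediction: each model gives a membership probability; the highest wins.
probs = sapply(clusterGlms, function(m) predict(m, newdata = cells, type = 'response'))

cells$predRegime = apply(probs, 1, which.max)

## Confusion matrix and overall fraction of correctly predicted cells
confMat = table(predicted = cells$predRegime, actual = cells$regime)
sum(diag(confMat)) / sum(confMat)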

Results

Changes in input variables over time

Climate models consistently project a warming climate (Sheehan et al., 2015) in the Northwest during the 21st century, but precipitation changes vary across models. The results of the CCSM4 run with the RCP 8.5 CO2 values indicate that temperature warms approximately 2 to 4 °C over the entire region, that precipitation is relatively stable with local increases, and that mean dew point temperature is stable to increasing by up to approximately 3 °C (Fig. 2).

SheehanFinalBlogFig2

Fig. 2: Climate variables and their differences by century: A) 20th century maximum temperature; B) 21st century maximum temperature; C) change in maximum temperature between 20th and 21st centuries; D) 20th century precipitation; E) 21st century precipitation; F) change in precipitation between 20th and 21st centuries; G) 20th century dew point temperature; H) 21st century dew point temperature; and I) change in dew point temperature between 20th and 21st centuries.

Cluster analysis

The four clusters yielded by the analysis (Table 1) (details and graphs in my blog post on K-means clustering in R) can be described in terms of the factors used to generate them. These are: 1) low carbon consumed, medium fraction burned, low FRI; 2) high carbon consumed, medium fraction burned, medium FRI; 3) low carbon consumed, low fraction burned, medium FRI; 4) no fire. Euclidean distances between cluster centers are listed in Table 2.

Table 1: Non-normalized values for centers of clusters from cluster analysis. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl1

Table 2: Euclidean distances between cluster centers (normalized values). (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl2

Overall, there were more fire-free cells in the 20th century (31%) than in the 21st (18%). This is readily apparent in the Cascade and Rocky mountain areas (Fig. 3). Fire regimes are fairly consistent in some Level III ecoregions and less so in others (Fig. 3). In some cases there is greater consistency in the 20th century (e.g. Coast Range, Cascades, and North Cascades), while others show more consistency in the 21st century (e.g. Northern Basin and Range and Columbia Plateau). Still others are dominated by more than one fire regime to the exclusion of the rest (e.g. Willamette Valley). Overall, coverage by the no-fire regime decreases while coverage by the high carbon consumed, medium fraction burned, medium FRI regime increases, especially in the Eastern Cascades Slopes and Foothills, Northern Rockies, and Blue Mountains ecoregions.

SheehanFinalBlogFig3

Fig. 3: Fire regime coverage determined by cluster analysis with EPA Level III Ecoregion overlay for A) the 20th century; B) the 21st century; C) Euclidean distance between cluster centers for time periods. (C: Carbon; Med: Medium; Frac: Fraction)

Logistic regression and tournament fire regime prediction

Of the fourteen explanatory variables offered to the logistic regressions, every variable was used in at least one regression, and soil depth was the only variable used in all regressions (Table 3).

Table 3: Explanatory variables used in the logistic regression functions. (D: Depth; Max: Maximum; T: Temperature; Min: Minimum; Ppt: Precipitation)

SheehanFinalBlogTbl3

The confusion matrix (Table 4) and marginal proportions table (Table 5) for the predictions show that predictions were correct in 62% of grid cells, and that for each fire regime cluster, the correct value was predicted more than any incorrect value. Maps for 20th and 21st century predicted and actual fire regime clusters (Fig. 4, A-B, E-F) show that predictions align fairly well with actual fire regime clusters. Maps of the distances between the centers of the predicted and actual fire regime clusters (Fig. 4) show that when clusters differ, they generally differ by a small amount. The mean Euclidean distance between predicted and actual clusters is 0.21 for the 20th century and 0.18 for the 21st, far less than the minimum distance between any two cluster centers, 0.33.

Table 4: Confusion matrix for predicted and actual fire regime clusters for combined 20th and 21st centuries. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl4

Table 5: Marginal proportions table for predicted and actual fire regime clusters for combined 20th and 21st centuries. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl5

SheehanFinalBlogFig4

Fig. 4: Predicted and actual fire regime clusters and Euclidean distances between predicted and actual cluster centers: A) predicted fire regime clusters for the 20th century; B) actual fire regime clusters for the 20th century; C) Euclidean distances between predicted and actual cluster centers for the 20th century; D) predicted fire regime clusters for the 21st century; E) actual fire regime clusters for the 21st century; and F) Euclidean distances between predicted and actual cluster centers for the 21st century. (C: Carbon; Med: Medium; Frac: Fraction)

Geographical influences

Within the Blue Mountain Ecoregion, a logistic regression could not be performed for the no-fire regime due to its low occurrence, so no predictions were made for it. Logistic regressions were performed for the other fire regimes, and the ecoregion was analyzed on its own. Results (Tables 6 and 7) show that predictions using the regressions developed for the Blue Mountain Ecoregion were more accurate (66% correct classification) than predictions for the ecoregion using the logistic regressions developed for the entire PNW (Tables 8 and 9) (63% correct classification).

Table 6: Confusion matrix for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the ecoregion. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl6

Table 7: Marginal proportions table for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the ecoregion. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl7_copy

Table 8: Confusion matrix for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the entire PNW. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl8

Table 9: Marginal proportions table for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the entire PNW. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl4

Discussion

Results and hypotheses

Fire-free cells decreased over the region while precipitation and dew point temperature increased. It is highly likely that rising temperatures were ultimately behind this, but analysis to prove this more forcefully is beyond the scope of this study. That said, I would suggest that hypothesis 1 (Fire-free cells will decrease in the 21st century compared to the 20th due to increasing temperatures) is strongly supported.

Given the strong influence of fire on ecosystem characteristics, and a warming climate expanding the area where fire is likely, I expected fire regimes to change, especially along the edges between low-elevation, flatter ecoregions and the hillier, higher-elevation ecoregions surrounding them (for example the Columbia Plateau and Northern Rockies ecoregions). While this expected expansion took place in some places, the opposite effect took place within other ecoregions. The relationships between fire and ecoregions seem to hold or become stronger in as many cases as they become weaker. Thus hypothesis 2 (Fire regimes will group within individual ecoregions over the 20th century, but less so in the 21st century due to projected changes in climate between the two centuries) is not supported.

The use of logistic regression with a tournament algorithm yielded a statistical model with very strong predictive power. The high rate of correct prediction and the relatively small distances between mis-predicted and actual fire regime clusters indicate that statistical relationships are embedded within the process-based MC2 and that logistic regression can be used to project fire regime clusters. Hypothesis 3 is supported.

The predictive power of the logistic regression developed for the Blue Mountain Ecoregion was somewhat better than that of the regression for the entire area. One should not base a conclusion on a sample of one, but this does provide some support for hypothesis 4 (Logistic regression performed on an ecoregional basis will have greater predictive power than one performed over the entire region due to local geographical relationships within ecoregions).

Broader significance

Fire is projected to occur in areas where it has not been present in the past. The relationship between the cohesiveness of fire regimes and ecoregions varies from ecoregion to ecoregion. As the climate warms and fire regimes change, though, changing fire regimes will likely change the characteristics of some areas, and some ecotones will likely shift as a result. One implication is that the boundaries of some ecoregions will likely shift, and land managers should consider the potential effects not only of the introduction of fire into previously fire-free areas, but also of shifts in ecotones.

There is a case to be made for the use of both statistical and process-based vegetation models. Process-based models not only produce predictive results, but they also provide opportunities to find non-linear responses (so-called tipping points) as well as novel combinations of vegetation types, fire, and weather/climate. Statistical models, on the other hand, are much easier to develop and to execute. While statistical models might not capture as detailed a relationship between explanatory and response variables, they can often provide more timely results. My results indicate that it might be possible to substitute a statistical model for a process-based vegetation model under certain circumstances.

Statistical analyses are commonly used to evaluate climate models by comparing hindcasted model results to observed data. In my experience, vegetation modelers struggle to validate hindcasted results, often using simple side-by-side comparisons of observational and model data. The results from this study point to a possible way to deepen such evaluations. Researchers could develop one set of statistical models using actual historical observations for vegetation type, fire, etc. and another set using modeled data for the same time period. Comparing the two sets of models to one another, they could evaluate the similarities and differences of the explanatory variables’ influences to gain insight into how well the model is emulating natural processes.

Future work

This study can best be described as a pilot study for combining multiple fire-related output variables into fire regimes instead of analyzing them individually. How to best define a fire regime becomes both a philosophical and research question. I will be researching this in the coming weeks and months as I plan to use a refined version of the methods I used in this course for one of the chapters in my dissertation. I have not come across the use of cluster analysis to characterize vegetation model fire regimes, so I am hoping it will be considered a novel approach.

The effects of the stochastic ignitions algorithm within the model cannot be ignored when doing a cluster analysis. Ideally, the model would be run long enough for stochastic effects to be evened out over model time. Another approach is to run many (hundreds to thousands of) iterations of the model and average the results, but time and resource constraints make this impractical. Another possible method would be to use a spatial weighting kernel to smooth results over the study area. This method would seem justified given the tendency for fire regimes to cluster geographically. This will be another avenue I investigate.

Finally, the results from the Blue Mountain Ecoregion speak to the potential for geographically weighted regression. I will also do more research on that method.

My Learning

Cluster analysis

About a year ago I was first exposed to cluster analysis. This course gave me a great venue in which to try it on my own. Going through the process taught me the method and some of the pitfalls. I did not normalize the values on my first pass using cluster analysis, and I learned that this can be very important to get meaningful results.

Logistic regression and tournament

The logistic regression portion of this study reintroduced me to a technique I had used before, but did not understand well. As I worked through the logistic regression, I gained a better understanding of the process and how it worked. Putting together the logistic regression blog post made me go through the steps in a deliberate fashion. Doing the 4 by 4 confusion matrix helped me to get a better understanding of what a confusion matrix represents as well as what recall and precision mean in that context. While I did not have a chance to implement geographically weighted logistic regression, I at least gained an understanding of what it means. Implementing a tournament approach was not new to me; I have used a similar technique in genetic algorithms, but it was good to confirm the potential power of such a simple method.

Other packages and methods

Through the term I learned quite a bit about various software packages and tools. I’ve used Arc quite a bit from time to time in the past, but delving into a couple of the spatial statistics tools – hotspot analysis and Moran’s I – reminded me of the power Arc has to offer, as well as some of its quirks. For example, my dataset was too large for Arc to handle with Moran’s I. Since I did most of my work in R, I did not have occasion to reacquaint myself with ModelBuilder. I have done a lot of raster processing in Python, but the clipping I had to do to get the Blue Mountain Ecoregion out of the larger dataset forced me to implement a flexible and efficient clipping script that will serve me well in the future.

Since R was the package that was best for my large dataset and had the algorithms I needed, it was the program I learned the most about. I learned how to read and write NetCDF files, do cluster analysis, and do stepwise logistic regression in R. In addition to these specific methods, I learned the differences among R data frames, vectors, and matrices, each of which R handles in a different manner. I have found the nuances tricky in the past and tricky still, but I have certainly become more facile with the program.

Statistics

What I have learned about statistics in this class starts with the existence of analytical methods I was not previously aware of. I was unaware of hotspot analysis, but going through the exercise provided me with a good foundational understanding. Spatial autocorrelation is something I had not had formal exposure to, but the basic concept made perfect sense; its importance, though, and the complexity of dealing with it had not occurred to me. I had been exposed to spectral analysis with temporal data, and its use with spatial data makes sense. I had heard of wavelets, but got a better understanding of the method. Correlograms were a new concept for me; the presentations on them helped me to understand them. For the most part, regression methods presented in the class reinforced or expanded on concepts I already knew. The application of a kernel for geographically weighted regression was the technique that stretched my understanding the furthest. I had been exposed to the concept of applying a kernel to a dataset, but for some reason, its use in GWR made the concept click for me. Principal components analysis was another method I had heard of, but I did not understand it until it was presented in class, and the concept now makes perfect sense.

Conclusion

Diving into my own project in this class has been a valuable experience, but probably just as valuable has been the interactions with other students and their projects. The broad scope of approaches has exposed me to a wide variety of techniques. Going forward in my research, I will be able to investigate methods I was not aware of in the past.

References

Abatzoglou, J.T., Brown, T.J., 2012. A comparison of statistical downscaling methods suited for wildfire applications. Int. J. Climatol. 32, 772–780.

Bachelet, D., Ferschweiler, K., Sheehan, T.J., Sleeter, B., Zhu, Z., 2015. Projected carbon stocks in the conterminous US with land use and variable fire regimes. Global Change Biol., http://dx.doi.org/10.1111/gcb.13048.

Sheehan, T., Bachelet, D., Ferschweiler, K., 2015. Projected major fire and vegetation changes in the Pacific Northwest of the conterminous United States under selected CMIP5 climate futures. Ecological Modelling, 317, 16-29.

Introduction

I have done a cluster analysis on the fire regimes produced by the MC2 dynamic global vegetation model (DGVM) run across the Pacific Northwest (PNW) over the time period 1895-2100 (see my previous Cluster Analysis in R post for details). Clusters were based on fire return interval (FRI), carbon consumed by fire, and fraction of cell burned. MC2 is a process-based model with modules for biogeochemistry, fire, and biogeography. Process models are generally complicated models with high computational costs. I wanted to explore the question “How well could a statistical model reproduce the results from a process-based model?”

A complete answer to that question would take much work, of course, as MC2 produces data on vegetation types, carbon pools, carbon fluxes, water flows, and, of course, fire. But this investigation provides a starting point and has the potential to show that at least one aspect of the model could have a statistical alternative to process-based computation.

Method description

Overview

Inputs into MC2 include monthly weather in the form of maximum temperature, minimum temperature, precipitation, and mean dew point temperature, as well as elevation and soil depth. I chose to use these inputs with a set of logistic regression models (described below) to predict the fire regime cluster of each cell in the results from my cluster analysis. To reduce the number of regression variables, I summarized the monthly variables by “summer” (April-September), by “winter” (October-January), and annually.

Logistic regression

Conceptually, logistic regression differs from linear regression in that logistic regression is designed to predict the probability of an outcome or occurrence in a binary data set. It is commonly used in models predicting the presence or absence of a phenomenon (e.g. fire occurrence, species occurrence) over a spatial domain. Logistic regression uses one or more continuous or binary explanatory variables to calculate the probability of the binary response variable.

The formula used in logistic regression is:

e^t / (e^t + 1)

which is equivalent to

1 / (1 + e^-t)

where t is the logit, a linear function of the explanatory values. The graph of the function is S-shaped.
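A quick way to see the shape is to plot the function directly in R; this two-line sketch just illustrates the curve described above.

## The logistic (inverse logit) function of the logit t
invLogit = function(t) 1 / (1 + exp(-t))
curve(invLogit, from = -6, to = 6, xlab = 't (logit)', ylab = 'probability')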

An example in R should make things clearer:

Make a dataset with an independent variable and a dependent binary variable:

myDS = data.frame(
iv = 1:50, # independent variable
dv = c(rep(0,25),rep(1,25))
)

plot(myDS$iv,myDS$dv)

Your graph should look like this:

SheehanLogRegPlot1

Scramble the dependent variables in rows 21 through 30 to make the plot messier, like real life:

myDS$dv[21:30] = sample(1:0,10,replace=TRUE)

plot(myDS$iv,myDS$dv)

Your graph should look similar to this one, with the regions of 1 and 0 values overlapping:

SheehanLogRegPlot2

Now run a logistic regression on the data and look at the summary of the result:

tstGlm = glm(dv ~ iv, data = myDS, family = binomial)

summary(tstGlm)

Your summary should look something like this:

Call:
glm(formula = dv ~ iv, family = binomial, data = myDS)

Deviance Residuals:
Min       1Q   Median       3Q       Max
-1.55618 -0.15882 -0.01953   0.15884   2.01358

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.0868     3.0071 -3.022 0.00251 **
iv           0.3429     0.1113   3.080 0.00207 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 69.235 on 49 degrees of freedom
Residual deviance: 19.165 on 48 degrees of freedom

AIC: 23.165

Number of Fisher Scoring iterations: 7

There is a good explanation of these in this video: https://www.youtube.com/watch?v=xl5dZo_BSJk. A good reference for formulas used can be found at http://documentation.statsoft.com/portals/0/formula%20guide/Logistic%20Regression%20Formula%20Guide.pdf

But the short explanation is:

Call: The call that was made to the glm function

Deviance residuals: Not really useful in a binary response regression

Coefficients:

  • Intercept: The intercept, or constant coefficient, used in the logit linear function
  • Estimate: The coefficients used with each variable in the logit (in this case we have one variable in the logit)
  • Std. Error: The standard error associated with each coefficient
  • Z value: The z value for the coefficient. This is the Estimate divided by the Std. Error.
  • Pr(>|z|): The p value associated with each z value

Null deviance: A measure of deviance between the model produced and a model that has only the constant coefficient

Residual deviance: A measure of deviance for the model produced

AIC: Akaike information criterion. This is a measure of the quality of the model based on goodness of fit and model simplicity. For models created using the same number of data points, a model with a lower AIC is considered better than one with a higher AIC.

Make a plot of the regression function along with the original points. You have to use a data frame in which the column has the same name as the column you used to create the regression:

tmpData = data.frame(iv = seq(0,50,0.2))

plot(seq(0,50,0.2), predict(tstGlm, newdata = tmpData, type = 'response'), type='l')

points(myDS$iv,myDS$dv)

SheehanLogRegPlot3

Notice how the logistic curve, which predicts the probability of a positive outcome, moves from a y value of 0 to a y value of 1 where the dependent values from the original dataset are mixed between 1 and 0. From this, it is easy to see how logistic regression predicts probabilities for binary data.

Logistic regression can also be used with multiple explanatory variables. Each explanatory variable has a term in the logit, and each term’s coefficient determines its influence on the final prediction.
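Continuing the example above (an illustrative extension, not part of the original analysis), adding a second, purely random explanatory variable only requires extending the formula:

## Two-variable logit: b0 + b1*iv + b2*iv2
myDS$iv2 = rnorm(50)
tstGlm2 = glm(dv ~ iv + iv2, data = myDS, family = binomial)
summary(tstGlm2)  # iv2 is pure noise, so it should typically show a large p-value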

Stepwise Logistic Regression

Finding the best combination of explanatory values to use can present a challenge. Often, there are more variables than desired, and using too many can result in an overfitted model. One method used to determine which variables to use is stepwise regression. There are two approaches to this method: forward and backward.

In forward stepwise regression, the algorithm starts with the constant coefficient. Then it tests all of the explanatory values to find the one that, by itself, has the greatest explanatory power. It computes the AIC for the model and then tests the remaining explanatory variables to find which adds the greatest explanatory power. It then computes the AIC with the added variable. If the AIC is lower (better) with the new explanatory variable, it repeats the process. If the AIC is greater, it discards the new variable and stops.

Backward stepwise regression takes the opposite approach. It starts with all the variables and removes the one with the least explanatory value and then calculates AIC. It stops at the point when removing a variable would raise the AIC.

One approach to finding a model with stepwise regression is to do both forward and backward regression and see if one returns a better result than the other. Additionally, the results can be modified by removing any explanatory variables that have a high p-value to see if removing them yields a better model. I used both forward and backward regression and, in two cases, found model improvement by removing an explanatory variable with a high p-value.

A problem with stepwise regression is that it does not guarantee the best possible regression function. To find that, one has to test all possible combinations of explanatory variables, but that is generally a very compute-intensive proposition. So, even though it is easy to find criticisms of stepwise regression, it is commonly used.

Without going into too much detail, here is some R code that performs stepwise regression:

# Generate some independent, explanatory variables. Note how some are more "scrambled" than others

indepVs = data.frame(
v1 = c(1:50),
v2 = c(1:50),
v3 = c(1:50),
v4 = c(1:50),
v5 = c(1:50),
v6 = c(1:50)
)

indepVs$v2[23:27] = sample(c(23:27),5)
indepVs$v3[21:30] = sample(c(21:30),10)
indepVs$v4[16:35] = sample(c(16:35),20)
indepVs$v5[11:40] = sample(c(11:40),30)
indepVs$v6[1:50] = sample(c(1:50),50)

# The dependent variable, scrambled in the middle

depV = rep(0,50)
depV[26:50] = 1
depV[21:30] = sample(c(0:1),10,replace=TRUE)

# Generate a full model (using all the explanatory variables)

fullGlm = glm(
depV ~ v1+v2+v3+v4+v5+v6,
data = indepVs,
family = binomial()
)

# Generate an empty model

emptyGlm = glm(depV ~ 1, data = indepVs,family=binomial)

# Do forward stepwise regression

forwardGlm = step(
emptyGlm,
scope = list(lower=formula(emptyGlm),upper=formula(fullGlm)),
direction='forward'
)

# Do a backward stepwise regression

backGlm = step(fullGlm)

Plotting the results and examining the resulting models using summary() will give you insights into how the process is working.

Geographically weighted logistic regression

I did some searching for geographically weighted logistic regression and found a couple of papers that used the method. I did not come across any easily implemented solutions for it (which does not mean one does not exist). I think doing geographically weighted logistic regression could have a lot of value, especially over large and varied geographic areas, like the one I’m studying.

I did a logistic regression on a single EPA Level III ecoregion (The Blue Mountains) within my study area and compared the results of that to results from the model produced using the whole study region and found it had a higher predictive value for the ecoregion. This is an indication that geographically weighted regression could provide advantages.

The Analysis

I used logistic regression as part of a process to predict which fire regime each cell in my dataset would have. My explanatory variables were derived from the input dataset I used for modeling; these included: annual, summer, and winter means for maximum temperature, minimum temperature, precipitation, and mean dew point temperature; elevation; and soil depth. For each of my four fire regimes, I generated a logistic regression predicting whether a cell was in the fire regime or not. Based on recommendations from https://www.medcalc.org/manual/logistic_regression.php, I used this formula to determine the number of data points to use in each regression:

10 * num_exp_vars / min(frac_pos_cases, frac_neg_cases)

where num_exp_vars is the number of explanatory variables, frac_pos_cases is the fraction of positive (presence) cells in the dataset, and frac_neg_cases is the fraction of negative cases in the dataset.
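As a quick illustration of this rule of thumb in R (the fraction of positive cases here is a made-up value, not one from my dataset):

## Rule-of-thumb number of data points for one cluster's regression
num_exp_vars = 14      # explanatory variables offered to the regression
frac_pos_cases = 0.30  # fraction of cells in the cluster (illustrative value)
frac_neg_cases = 1 - frac_pos_cases

10 * num_exp_vars / min(frac_pos_cases, frac_neg_cases)  # ~467 data points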

This gave me four logistic regression functions. For each cell in the study area, I ran each of the logistic functions and recorded the probability returned. The predicted fire regime for a cell was assigned based on the logistic function that produced the highest probability. I repeated this entire procedure for the Blue Mountain Ecoregion only.

Results

Across the full study area, this process correctly predicted the fire regime with 61% accuracy (keep in mind that random assignment would be expected to return 25% accuracy). Within the Blue Mountain Ecoregion, the process predicted the fire regime with 46% accuracy using the logistic functions derived from the data for the entire study area, and with 74% accuracy using the logistic functions derived for the Blue Mountain Ecoregion. This is a strong indication that geographic influences exist and that geographically weighted logistic regression has the potential to improve results.

Method Critique

I was very pleased with the results of this method, and frankly, a little surprised at how successful the method was at predicting fire regime. This investigation leads me to believe that statistical models should be considered as alternatives to process-based models for some vegetation modeling tasks, especially considering the lower computational overhead of statistical models.

The biggest downside to the method is the manual procedures involved in finding the best logistic function. This process could be automated with some programming, however, as could a geographically weighted version of logistic regression.

Introduction

I am working with a dataset from a dynamic global vegetation model (DGVM) run across the Pacific Northwest (PNW) over the time period 1895-2100. This is a process-based model that includes a dynamic fire model. I am particularly interested in drivers of the fire regimes (the frequency and intensity of wildfire) produced by the model. The general question I am studying is “what, if any, is the relationship between the climatic and elevation inputs of the model and the fire regimes produced by the model?” A more specific pair of questions for this method is “Do fire regimes cluster locally?” and “How do fire regimes change over time?”

Method description

Overview

I divided the data into two time periods: 1901-2000 and 2001-2100 to capture the historical and projected climates and fire regimes over those periods. To define a fire regime, I chose the two fire-related output variables – carbon consumed by fire and percent of model pixel burned – along with fire return interval (FRI) over the century time period. I used k-means cluster analysis in R to define four fire regimes.

K-means clustering divides a dataset into a specified number of data point clusters and calculates centroids and cluster membership such that the Euclidean distance between each cluster’s centroid and its members is minimized. The relative scales of the input variables should be taken into consideration in this type of analysis, as they affect the Euclidean distances calculated. The steps in the analysis are outlined below. The actual commands used for the analysis are in Appendix A.

Outline of steps

For each of the two input files:

  1. Open the NetCDF file
  2. Reverse the order of the latitude values (change from highest to lowest to lowest to highest)
  3. Filter out the na (no data) values
  4. Copy data (carbon consumed by fire, fraction pixel burned, FRI) into a combined data set

For the combined dataset:

  1. Normalize values for each variable to 0.0 to 1.0 using the minimum and maximum of each variable in the combined dataset.
  2. Generate the four clusters
  3. Output value distributions of the four clusters
  4. Distribute resulting data into two output datasets, one for each of the two time periods
    1. Divide into two separate datasets
    2. Distribute results into non-na data locations

For each of the output datasets:

  1. Create output NetCDF file
  2. Write data to NetCDF file
  3. Close NetCDF file


Results

Clusters

With my input data, the FRI values range from a minimum of 1 to a maximum of 100 years. The mean annual fraction of area burned has a theoretical range from 0 to 1, and the mean annual carbon consumed ranges from 0 to 266.5 g m-2. Performing the cluster analysis using the original input values resulted in 3 of the 4 clusters being driven primarily by the values of carbon consumed (Fig. 1a-c). Normalizing the values of each input variable using the variable’s minimum and maximum resulted in clusters driven by different variables (Fig. 2). For the remainder of the project I am using normalized variables.

SheehanCluster01

Fig. 1: Z-score distributions of input variables within clusters for clusters computed without normalizing input data. Clusters 1, 2, and 3 (A-C) exhibit strong overlaps in both fraction of area burned and time between fires, and strong shifts in carbon consumed between clusters. Cluster 4 (D) is the only cluster in which all three factors differ substantially from the other clusters.

SheehanCluster02

Fig. 2: Distributions of normalized input variables within clusters. Each of the clusters was created by performing cluster analysis using normalized input data. All clusters have substantial shifts in the distributions of at least two factors compared to the other clusters.

Result maps

The clustering of fire regimes can easily be seen in maps of the fire regimes (Fig. 3). Areas without fire tend to be at higher elevations: in the Coast Range and Cascades of Oregon and Washington and the Rockies of western Montana and northwestern Idaho. High-frequency fires (low FRI) are common in the plains and plateaus of southern Idaho, south and central Oregon, and the Columbia Plateau in central Washington. The other two fire regimes, both of which have a medium FRI, are somewhat intermingled, but are present in mid elevations in the Rockies, in southwest Oregon, and in the Willamette Valley and Puget Trough regions of Oregon and Washington, respectively.

SheehanCluster03

Fig. 3: Fire regime maps for (A) the historical period (1901-2000) and (B) the future period (2001-2100). (C: carbon consumed; Med: medium; Frac: fraction of grid cell; No Fire: no fire occurred in the cell over the time period)

Fire regime changes from the historical (Fig. 3A) to the future (Fig. 3B) period include the appearance of the High Carbon Consumed, Medium Fraction Burned, Medium FRI fire regime around the edges of the plains and plateaus and into the Blue Mountains of northeastern Oregon, as well as the spread of both Medium FRI fire regimes into the Coast, Cascade, and Rocky Mountain ranges.

Method critique

I found this method to be very useful for my study. Fire and vegetation modeling results are commonly expressed using single variables over time or area and fail to take into consideration the relationships between multiple variables. This work indicates that breaking sets of multiple fire-related variables into clusters provides a way to better characterize regional characteristics as well as spatial changes through time.

I did find the data manipulation required for the method a bit tricky. Having to work around the NA values within the data required a bit of labor. Converting matrix data to vector data for the methods was another minor inconvenience. All that said, however, I believe such manipulations will become second nature as my experience with R grows.

Appendix A: Annotated R commands used for the analysis of normalized input data.
## Cluster analysis on normalized data.
## The dataset is drawn from two files, one for the 20th
## century and one for the 21st
## Install the necessary libraries

library("DescTools") # for displaying data
library("ncdf4") # reading and writing netCDF files

## Set the current working directory for data and results

setwd('/Users/timsheehan/Projects/MACA/FireClustering/Analysis')

## Open the NetCDF files for reading

histnc = nc_open('SummaryOver_1901-2000.nc')
futnc = nc_open('SummaryOver_2001-2100.nc')

## Get the lat and lon variables

lon = ncvar_get(histnc, 'lon')
lat = ncvar_get(histnc, 'lat')

## The lat dimension is highest to lowest in the input
## file, so its vector needs to be reversed.
## Reverse lat into ascending order (lower case r rev)

lat = rev(lat)
## Create the data frames for historical, future, and difference data

normHist = expand.grid(lon=lon, lat=lat)
normFut = expand.grid(lon=lon, lat=lat)
normDiff = expand.grid(lon=lon, lat=lat)

## Get the fields from the input files
## Note the lat dimension comes in from high to low, so
## it needs to be reversed so that R can display it (upper case R Rev)

## Consumed is the annual mean for carbon consumed
## PBurn is the annual mean of fraction of area burned
## FRI is the mean number of years between fires

normHist$Consumed = c(Rev(ncvar_get(histnc, 'CONSUMED_mean'), margin = 2))
normHist$PBurn = c(Rev(ncvar_get(histnc, 'PART_BURN_mean'), margin = 2))
normHist$FRI = c(Rev(ncvar_get(histnc, 'PART_BURN_intvl'), margin = 2))

normFut$Consumed = c(Rev(ncvar_get(futnc, 'CONSUMED_mean'), margin = 2))
normFut$PBurn = c(Rev(ncvar_get(futnc, 'PART_BURN_mean'), margin = 2))
normFut$FRI = c(Rev(ncvar_get(futnc, 'PART_BURN_intvl'), margin = 2))

## Normalize the values prior to doing the analysis
## Also get the z scores for distribution plotting later
## Note that NA values from the data must be omitted before
## taking the statistics

## Loop over the three variables used in the clusters
## foreach one filter out cells with no data using na.omit
## Normalize the data based on the min and max value of each
## variable.

for(nm in c('Consumed','PBurn','FRI')) {
  ## Temporary data for efficient processing
  tmpVect = c(na.omit(normHist[[nm]]), na.omit(normFut[[nm]]))
  tmpMin = min(tmpVect)
  tmpMax = max(tmpVect)
  tmpDiff = tmpMax - tmpMin
  tmpMean = mean(tmpVect)
  tmpSD = sd(tmpVect)

  ## Z scores (computed from the original values, before normalization)
  normHist[[paste(nm,'_Z',sep='')]] = (normHist[[nm]] - tmpMean) / tmpSD
  normFut[[paste(nm,'_Z',sep='')]] = (normFut[[nm]] - tmpMean) / tmpSD

  ## Normalize the data based on the min and max of each variable
  normHist[[nm]] = (normHist[[nm]] - tmpMin) / tmpDiff
  normFut[[nm]] = (normFut[[nm]] - tmpMin) / tmpDiff
}

## Create the data frame for clustering
## Again, we omit data with no value using na.omit

normWorkDF = data.frame('Consumed' = c(na.omit(normHist$Consumed), na.omit(normFut$Consumed)))
normWorkDF$PBurn = c(na.omit(normHist$PBurn), na.omit(normFut$PBurn))
normWorkDF$FRI = c(na.omit(normHist$FRI), na.omit(normFut$FRI))

## Perform the clustering analysis
## This is the point of all the work being done here
## The Lloyd algorithm runs efficiently on the large dataset
## (~100k points)

normCluster = kmeans(
  data.frame(normWorkDF$Consumed, normWorkDF$PBurn, normWorkDF$FRI),
  4,
  algorithm = "Lloyd",
  iter.max = 500
)

## Copy the clusters back to the work data frame

normWorkDF$Clust = normCluster$cluster

# Plot the cluster distributions

for(ndx in 1:4) {
  fNm = paste(
    '/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormClust_',
    ndx,
    '.png',
    sep=''
  )
  png(fNm)
  tmpDF = subset(normWorkDF, normWorkDF$Clust == ndx)
  plot(
    density(tmpDF$Consumed),
    xlim=c(0,1),
    ylim=c(0.0,6.0),
    col='darkblue',
    xlab='Normalized Value',
    main=paste('Cluster', ndx, 'Distribution')
  )
  lines(density(tmpDF$PBurn), xlim=c(0,1), col='green')
  lines(density(tmpDF$FRI), xlim=c(0,1), col='red')
  legend(
    'topright',
    legend=c('C Consumed','Fract. Area Burned','Time Betw. Fires'),
    lwd=1,
    col=c('darkblue','green','red')
  )
  dev.off()
}

## Add the cluster numbers into the original data frames
## Note that NA needs to be taken into account using
## [!is.na(hist$Consumed)] to get the indexes of those data
## items that are not NA

## Historical values are in first half of data frame, future in second
## Calculate indexes for clarity

histClustStartNdx = 1
histClustEndNdx = length(normCluster$cluster)/2
futClustStartNdx = length(normCluster$cluster)/2 + 1
futClustEndNdx = length(normCluster$cluster)

normHist$Clust = NA
normHist$Clust[!is.na(normHist$Consumed)] = normWorkDF$Clust[histClustStartNdx:histClustEndNdx]

normFut$Clust = NA
normFut$Clust[!is.na(normFut$Consumed)] = normWorkDF$Clust[futClustStartNdx:futClustEndNdx]

## Create matrices for results and display
## png() tells R what file to store the results to

png('/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormHistClustMap.png')
normHistClustArr = matrix(normHist$Clust, nrow=331, ncol=169)
image(lon, lat, normHistClustArr)
dev.off()

png('/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormFutClustMap.png')
normFutClustArr = matrix(normFut$Clust, nrow=331, ncol=169)
image(lon, lat, normFutClustArr)
dev.off()

## Calculate Euclidean distance between cluster centers

normClustCtrEucDist = data.frame(cbind(rep(0,4), rep(0,4), rep(0,4), rep(0,4)))

for(ndx_1 in 1:4) {
  for(ndx_2 in 1:4) {
    normClustCtrEucDist[ndx_1, ndx_2] =
      sqrt(
        (normCluster$centers[ndx_1,1] - normCluster$centers[ndx_2,1]) ^ 2 +
        (normCluster$centers[ndx_1,2] - normCluster$centers[ndx_2,2]) ^ 2 +
        (normCluster$centers[ndx_1,3] - normCluster$centers[ndx_2,3]) ^ 2
      )
  }
}

## Create Fire Regime Euclidean distance map between historical and future

normDiff$HistClust = normHist$Clust
normDiff$FutClust = normFut$Clust
normDiff$EucDist = NA

normDiff$EucDist = mapply(
function(x,y) ifelse(is.na(y), NA, normClustCtrEucDist[x,y]),
normDiff$HistClust,
normDiff$FutClust
)

normDiffEucDistArr = matrix(normDiff$EucDist,nrow=331,ncol=169)
image(lon,lat,normDiffEucDistArr)

## Create NetCDF file and store results there. This yields a
## NetCDF file that can be used independently of R

x = ncdim_def('lon', 'degrees', lon)
## Reverse the order of latitude values
y = ncdim_def('lat', 'degrees', rev(lat))

histFCNCVar = ncvar_def('HistFireCluster', 'classification', list(x,y), -9999, prec='short')
futFCNCVar = ncvar_def('FutFireCluster', 'classification', list(x,y), -9999, prec='short')
eucDistNCVar = ncvar_def('EucDistance', 'NormalizedDistance', list(x,y), 9.9692099683868690e+36, prec='float')

# Reverse the lat order of the values being output
OutHistClust = c(Rev(matrix(normDiff$HistClust, ncol=length(lat), nrow=length(lon)), margin=2))
OutFutClust = c(Rev(matrix(normDiff$FutClust, ncol=length(lat), nrow=length(lon)), margin=2))
OutEucDist = c(Rev(matrix(normDiff$EucDist, ncol=length(lat), nrow=length(lon)), margin=2))

## Output the variables and close the file

ncOut = nc_create('NormClusters.nc', list(histFCNCVar, futFCNCVar, eucDistNCVar))

ncvar_put(ncOut,histFCNCVar,OutHistClust)
ncvar_put(ncOut,futFCNCVar,OutFutClust)
ncvar_put(ncOut,eucDistNCVar,OutEucDist)

nc_close(ncOut)

For the ArcGIS hot spot analysis I used the tutorial available at https://www.arcgis.com/home/item.html?id=6626d5cc81a745f1b737028f7a519521. Note that the tutorial must be downloaded using the “open” button below the image on the web page. This downloads a folder with a PDF file of instructions and a dataset for the tutorial’s analysis.

I followed the tutorial, but substituted a dataset of fire ignitions logged by the Oregon Department of Forestry between the years 1960 and 2009 (Fig. 1). This dataset does not include some fires that occurred in the state during that time, for instance some on federal land and many on non-forested land.

ODFFires1960To2009HotSpot_ODFPoints

Figure 1: Oregon Department of Forestry logged fire occurrences from 1963 to 2009.

The tutorial was straightforward and easy to follow. The only “problem” I had was with the spatial autocorrelation step. My data did not produce any peaks in the z-score graph, so I arbitrarily chose a distance towards the lower end of the graph (tutorial steps 10 f and g).

My results from the hot spot analysis step (Fig. 2) showed areas with hot spots. These tended to be near roads and cities in most cases, with a few instances at high elevation in northeastern Oregon.

ODFFires1960To2009HotSpot_Points

Figure 2: Results from hot spot analysis.

Interpolating the hot spot results using the IDW tool (tutorial step 12) produced anomalous edge effects (Fig. 3). Note the areas of high values spreading from hot spot areas into areas with no data along the coast. A close-up image of the Medford area (Fig. 4) shows edge effects spreading into areas with no data.

ODFFires1960To2009HotSpot_Surface

Figure 3: Hot spot analysis interpolated surface produced by the application of the Arc IDW tool. Note the edge effects, especially along the coast and northern state border.

ODFFires1960To2009HotSpot_Medford_ODFPtsAndSfc

Figure 4: Interpolated hot spot surface produced by the Arc IDW tool along with the original ODF ignitions dataset points. Note the edge effects in the southwest corner of the image spreading from the area of high fire occurrence into areas with no data.

As part of a study I did for a master’s thesis, I used the ODF data along with another dataset to do a logistic regression analysis of ignition occurrences in the Willamette watershed excluding the valley floor. Within this area, the hot spot analysis placed hot spots along some roads and population centers in several instances (Fig. 5A). These results are consistent with the results from my study, which showed that human-caused fires were more likely to occur along roads and near areas of high population, and that lightning-caused fires were likely to occur at high elevations (Fig. 5B). The lack of a match between the two methods at high elevation is due to data points that were used in the logistic regression but not in the hot spot analysis. However, one disadvantage of the hot spot analysis is that it is univariate, or strictly occurrence based. The logistic method allows for correlation and the application of that correlation to the prediction of future occurrence.

HotShotVsLogistic

Figure 5: Results from A) hot spot analysis and B) logistic regression analysis for fire occurrence in the hills and mountains of the Willamette watershed. Redder areas are “hotter” in the hot spot analysis and have a higher ignition probability in the logistic regression analysis.

In conclusion, hot spot analysis is a good technique for finding areas of higher occurrence, but has no predictive power beyond analyzing past occurrences. The ArcGIS IDW can produce an interpolated surface, but it has an edge effect issue that could lead to questionable results. Regression methods, for example logistic regression, may produce results more suited to analyzing correlation of independent variables with event occurrence.

Description of the research question I am exploring.

The broad question I am exploring is, “How will climate change affect fire regimes in the Pacific Northwest in the 21st century?” or stated as an overarching hypothesis:

Over the 21st century, projected changes in climate will cause changes in fire regimes in the Pacific Northwest by influencing vegetation quantity, composition, and fuel conditions.

I am exploring this question in the context of model vegetation and fire results from the MC2 dynamic global vegetation model (DGVM). MC2 is a gridded, process model with modules for biogeochemistry, fire, and biogeography. Inputs consist of climate and soil data. Outputs from the model include vegetation type, carbon fluxes and pools, hydrologic data, and values related to fire, including carbon consumed by fire and fraction of grid cell burned.

MC2’s current fire module computes fuel conditions within each grid cell. Fire occurrence is modeled when conditions exceed a set fuel condition threshold. An ignition source is always assumed to be present. This threshold-and-assumed-ignition algorithm has the potential to underestimate fire occurrence in areas that rarely or never meet the fuel condition threshold and to overestimate fire occurrence in areas that frequently exceed the fuel condition threshold. I am currently implementing a stochastic ignitions algorithm that allows the user to set an overall daily ignition probability and applies a Chapman-Richards function to a fuel condition measure to determine the probability of an ignition spreading into a fire.
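As a rough illustration, a Chapman-Richards curve mapping a 0-1 fuel condition index to a probability of spread might be sketched as below; the parameter values are placeholders, not the values being implemented in MC2.

## Chapman-Richards curve: A * (1 - exp(-k * x))^p
## A (asymptote), k (rate), and p (shape) are illustrative values only.
chapman_richards = function(x, A = 1.0, k = 5.0, p = 3.0) {
  A * (1 - exp(-k * x)) ^ p
}
curve(chapman_richards, from = 0, to = 1,
      xlab = 'fuel condition index', ylab = 'P(ignition spreads to fire)')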

I will be running the model with historical climate (1895 to 2010) and future climate (2011 to 2100) to produce potential vegetation results (i.e. land use not taken into consideration). Historical data are downscaled from PRISM data, and future data are downscaled from output produced by the CCSM4 climate model using the CMIP5 representative concentration pathway (RCP) 8.5. The model will be run at a 2.5 arc minute resolution (approximately 4 km x 4 km cell size).

I will compare the output from the 20th century to that of the 21st century and characterize differences in fire regime spatially and temporally. This will be the first run of the MC2 with the new stochastic ignitions algorithm.

(I have added several references below related to what is discussed here.)

MC2 DGVM results over Pacific Northwest for mean fraction cell burned over 2001-2100 using inputs from CCSM4 RCP 8.5. MC2 fire algorithm with assumed ignitions.

The dataset I will be analyzing

The dataset I will be analyzing will come from MC2 model runs described above. The extent of the dataset is from 42° to 49° latitude and from -124.75° to -111° longitude (from the southeast corner of Idaho west to the US coast and north to the Canadian border), comprising 169 x 331 spatial grid cells of size 2.5 x 2.5 arc minutes. Outputs are on an annual basis from 1895 through 2100. Water and barren land are mapped out of the dataset.

Outputs include variables for various carbon pools, fluxes, vegetation characteristics, and fire characteristics. Those I will be analyzing include carbon consumed by fire and fraction of cell burned. I will be summarizing the data over the time dimension to compute mean time between fires (essentially fire return interval, but over a shorter time period than might be appropriate for calculating a true fire return interval).
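One simple way to compute mean time between fires from an annual series is period length divided by the number of fire years; the sketch below uses a made-up series, not my model output.

## Mean time between fires for one cell from its annual fraction-burned series.
## fracBurned is a made-up 100-year series containing two fire years.
fracBurned = c(rep(0, 10), 0.4, rep(0, 24), 0.7, rep(0, 64))

nFireYears = sum(fracBurned > 0)
meanTimeBetweenFires = ifelse(nFireYears > 0, length(fracBurned) / nFireYears, NA)
meanTimeBetweenFires  # 50 years for this made-up series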

Hypotheses

  • Vegetation, elevation, and climate will cause fire regimes to cluster spatially through influences on fuel quantity, composition, and condition.
  • Projected increased temperature and change in precipitation patterns will cause fire to be more frequent and/or more severe through influences on fuel quantity, composition, and condition.
  • Shifting climate characteristics will cause regions with similar fire regimes to shift in location due to changing fuel quantity, composition, and conditions.

Kinds of analyses

The first analysis I will do is a cluster analysis using mean time between fires, carbon consumed by fire, and fraction of cell burned. I will first summarize data over six time periods to produce six datasets: four 50-year periods (1901-1950, 1951-2000, 2001-2050, and 2051-2100), and two 100-year periods (1901-2000 and 2001-2100). Then I will run a cluster analysis (type to be determined) on each dataset.

Using two or more of the resulting clustered datasets I will explore the differences among clusters within each dataset and between datasets (likely using Euclidean distance between clusters).
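If the clusters come from kmeans() in R, the distances between cluster centers can be computed directly from the fitted object; a minimal sketch, where fireClust is a hypothetical kmeans result:

## Pairwise Euclidean distances between k-means cluster centers
centerDist = dist(fireClust$centers)  # fireClust is a kmeans() result (placeholder)
as.matrix(centerDist)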

I will map clustering results back onto the landscape in order to explore spatial patterns within each dataset and differences in spatial patterns between datasets. I will also compare the spatial pattern of clustering results to the spatial extents of EPA Level III ecoregions to see how well or poorly they align.

If time permits, I will do further analyses to characterize the relationship between vegetation type distribution, climate factors, and fire regime clusters.

Expected outcomes

I expect that cells with the same statistical cluster will be concentrated geographically, that for historical data, these concentrations will align closely with EPA Level III ecoregions, that cluster characteristics will be different between time periods, and that geographical groupings of clusters will shift generally northward and towards higher elevation somewhat between historical and future time periods.

From previous runs of the MC2 and preliminary observations of results from the runs for this project, I know that dominant vegetation type shifts from conifer to mixed forests west of the crest of the Cascade Mountains. Within this region, I expect a large shift in fire regime, with carbon consumed falling and mean time between fires decreasing over much of this region. In other regions, I expect general decreases in the mean time between fires due to warmer temperatures and locally drier summers. I also expect carbon consumed to generally remain constant or locally increase due to more favorable growing conditions.

Importance to science and resource management

Studies using DGVMs commonly produce map and graphic results showing extent and intensity of change over uni- or bidimensional spatiotemporal domains. This approach will provide more quantifiable differences using a multidimensional analysis. The ability to characterize fire regimes this way will allow for better model parameterization and validation, which in turn may lead to greater confidence in model projections.

Model results will provide projected changes across an ecologically and economically important region. Results will help resource managers understand and plan for potential change.

Level of experience

  • Arc: Medium, a little rusty
  • ModelBuilder and Python: Expert, especially with Python.
  • R: Medium, a little rusty

References

The Beginner’s Guide to Representative Concentration Pathways: http://www.skepticalscience.com/rcp.php

Bachelet, D., Ferschweiler, K., Sheehan, T.J., Sleeter, B., Zhu, Z., 2015. Projected carbon stocks in the conterminous US with land use and variable fire regimes. Global Change Biol., http://dx.doi.org/10.1111/gcb.13048

Daly, C., Halbleib, M., Smith, J.I., Gibson, W.P., Doggett, M.K., Taylor, G.H., et al., 2008. Physiographically sensitive mapping of climatological temperature and precipitation across the conterminous United States. Int. J. Climatol. 28 (15), 2031–2064, http://dx.doi.org/10.1002/joc.1688.