Tim Sheehan
GEO 584: Advanced Spatial Statistics and GIScience
Final Blog Post
Introduction
The MC2 dynamic global vegetation model
Dynamic global vegetation models (DGVMs) simulate vegetation dynamics over regional to global scales. DGVMs are process-based models, meaning they model the physical and chemical processes that drive vegetation growth. MC2 (Bachelet et al., 2015) is one such model. MC2 models processes on a regular latitude/longitude grid using a monthly time step. Each grid cell is treated independently; that is, processes in one cell do not affect processes in any other cell. MC2 inputs include soil characteristics, elevation, and monthly means for minimum temperature, maximum temperature, precipitation, and dew point temperature.
Unlike many DGVMs, MC2 includes a fire module that models the effects of fire on vegetation. The current version of MC2 simulates a fire whenever fuel conditions exceed a defined threshold. In other words, ignitions are assumed. This has the potential to overestimate fire occurrence in areas where the fuel threshold is frequently exceeded, to underestimate fire occurrence in areas where the threshold is never or rarely exceeded, and to model only severe fires. Other aspects of model results, especially vegetation occurrence, age, and density as well as carbon dynamics, are driven in part by fire, so correctly modeling fire is important to correctly modeling other processes.
In an effort to make MC2 fire modeling more realistic, I have implemented a stochastic ignitions algorithm in the MC2 fire model. Rather than having a single fuel threshold that determines fire occurrence, the stochastic model uses a probabilistic method (Monte Carlo) to determine if an ignition source is present in a model grid cell. If an ignition source is present, then a second probabilistic method, based on fuel conditions, is used to determine whether the ignition leads to a fire occurrence. If a fire occurs, it is modeled in the same way fires are modeled with assumed ignitions.
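The two-stage stochastic logic can be sketched as follows. This is a minimal Python illustration with made-up parameter names and a simple linear dryness ramp; the actual MC2 implementation differs in detail:

```python
import random

def fire_occurs(ignition_prob, fuel_moisture, moisture_of_extinction, rng=random):
    """Two-stage stochastic ignition test for one grid cell in one time step.

    Stage 1: a Monte Carlo draw decides whether an ignition source is present.
    Stage 2: a fuel-condition-based probability decides whether the ignition
    leads to a fire. Both probability formulations here are illustrative.
    """
    # Stage 1: is an ignition source present this time step?
    if rng.random() >= ignition_prob:
        return False
    # Stage 2: drier fuels (relative to the moisture of extinction) make a
    # fire more likely; this linear ramp is an assumption for illustration.
    dryness = max(0.0, 1.0 - fuel_moisture / moisture_of_extinction)
    return rng.random() < dryness
```

With `ignition_prob` of zero the cell can never burn, and with very dry fuels an ignition almost always becomes a fire, mirroring how the stochastic scheme spans conditions the single fuel threshold could not.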
A fire regime is the pattern, frequency, and intensity of wildfire in an area and is inextricably linked to an area’s ecology. Despite the multiple factors that define a fire regime, aspects of DGVM fire results, including those from MC2, are commonly presented in terms of individual metrics over time, for example an increase in fire frequency in the 21st century versus the 20th century (e.g. Sheehan et al., 2015).
EPA level III ecoregions
EPA ecoregions (https://archive.epa.gov/wed/ecoregions/web/html/ecoregions-2.html) are geographic regions for North America based on similarities of biotic and abiotic phenomena affecting ecosystem integrity and quality. EPA ecoregions are defined at 4 levels (I through IV), with higher levels covering smaller areas. This study uses level III ecoregions, which range in area from approximately 24,000 to 100,000 km².
Research questions and hypotheses
Research questions for this study are:
- How consistent are fire regimes within ecoregions?
- How do fire regimes change under projected climate?
- Could fire regimes produced by MC2 be predicted by a statistical model?
To gain insight into these questions, I tested the following hypotheses:
- Fire-free cells will decrease in the 21st century compared to the 20th due to increasing temperatures.
- Fire regimes will group within individual ecoregions over the 20th century, but less so in the 21st century due to changes in fire regime as a result of projected changes in climate between the two centuries.
- Logistic regression will be able to predict the fire regimes produced by MC2 due to embedded statistical relationships between fire and input variables in the model.
- Logistic regression performed on an ecoregional basis will have greater predictive power than one performed over the entire region due to local geographical relationships within each ecoregion.
Methods
Study area, resolution, and data
The spatial domain for this study covers the Pacific Northwest (PNW) (Fig. 1), defined as the portion of the conterminous United States north of 42° latitude and west of -111° longitude. The area was mapped to a 2.5 arc minute (approximately 4 km x 4 km) latitude/longitude grid. Modeling was done over the period 1895-2100. I also repeated the analysis using only the Blue Mountain Ecoregion (Fig. 1C), located within the PNW.
Figure 1: A) Index map of PNW, B) digital elevation model (DEM) of PNW, and C) EPA Level III ecoregions.
Inputs for MC2 include static soil and elevation data as well as monthly climate data. Climate data consist of minimum temperature, maximum temperature, precipitation, and dew point temperature. Inputs for 1895-2010 were from the PRISM (http://www.prism.oregonstate.edu/) 4 km dataset, which is created by interpolating observed data. Inputs for 2011-2100 came from the Coupled Model Intercomparison Project 5 (CMIP5, http://cmip-pcmdi.llnl.gov/cmip5/) Community Climate System Model 4 (CCSM4, http://www.cesm.ucar.edu/models/ccsm4.0/) run using the Representative Concentration Pathway (RCP) 8.5 (https://en.wikipedia.org/wiki/Representative_Concentration_Pathways) CO2 concentration data. RCP 8.5 is based on a “business as usual” scenario of CO2 production through the end of the 21st century. The 2011-2100 climate data were downscaled to the 2.5 arc minute resolution using the Multivariate Adaptive Constructed Analogs method (MACA; Abatzoglou and Brown, 2012; http://maca.northwestknowledge.net/). MC2 outputs include annual values for vegetation type, carbon pools and fluxes, hydrological data, and fire occurrence data, including the fraction of each grid cell burned and the carbon consumed by fire. MC2 input and output data were divided into two datasets based on time period: 20th century (1901-2000) and 21st century (2001-2100).
k-means cluster analysis
For this study, fire regime was defined as the combination of mean annual carbon consumed by fire, mean annual fraction of cell burned, and fire return interval (FRI) over the century. (Note that FRI is usually calculated over periods longer than a century, especially in ecosystems rarely exposed to fire.) For cluster analysis, the two datasets were combined and normalized to avoid uneven influences caused by differing data value ranges. I used k-means clustering in R to produce 4 fire regime clusters (details in my blog post on k-means clustering in R: http://blogs.oregonstate.edu/geo599spatialstatistics/2016/05/21/cluster-analysis-r/).
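The clustering itself was done in R (linked above); the normalize-then-cluster step can be sketched in Python with a plain Lloyd's-algorithm k-means and toy stand-in data (all values here are made up):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; a stand-in for R's kmeans() used in the study."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean distance).
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(0) if (labels == j).any() else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

# Toy stand-in for the per-cell fire regime variables: mean annual carbon
# consumed, mean annual fraction of cell burned, and FRI, on very
# different scales.
rng = np.random.default_rng(42)
cells = rng.random((500, 3)) * [400.0, 0.05, 100.0]

# Min-max normalize so no single variable dominates the distance metric.
lo, hi = cells.min(0), cells.max(0)
scaled = (cells - lo) / (hi - lo)

centers, labels = kmeans(scaled, k=4)
```

Without the normalization step, carbon consumed (hundreds) would swamp fraction burned (hundredths) in the distance calculation, which is exactly the pitfall noted in the linked post.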
Logistic regression
I performed stepwise logistic regression modeling for each of the fire regime clusters (details in my blog post on logistic regression in R: http://blogs.oregonstate.edu/geo599spatialstatistics/2016/05/27/logistic-regression-r/). Explanatory variables were taken from inputs to the MC2 model runs (Fig. 2). Climate variables were summarized annually, by “summer” (April-September), and by “winter” (October-January). The mean of each summary was then taken over each century. Additional explanatory variables were soil depth and elevation. For each cluster, presence was defined as membership in the cluster and absence as non-membership. The process resulted in four logistic functions, one for each fire regime cluster.
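The per-cluster presence/absence setup can be sketched in Python with scikit-learn. This omits the stepwise variable selection done in R and uses toy data in place of the fourteen explanatory variables:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))          # toy stand-ins for the explanatory variables
labels = rng.integers(0, 4, size=500)  # fire regime cluster of each grid cell

# One binary logistic model per cluster: presence = membership in that
# cluster, absence = non-membership.
models = {}
for c in range(4):
    y = (labels == c).astype(int)
    models[c] = LogisticRegression(max_iter=1000).fit(X, y)

# Each model yields P(membership in cluster c) for every cell.
probs = np.column_stack([models[c].predict_proba(X)[:, 1] for c in range(4)])
```

These per-cluster membership probabilities are the inputs to the tournament step described next.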
Tournament cluster prediction
I used a tournament method to predict the fire regime cluster. For each grid cell, each cluster’s logistic regression model was used to calculate the probability that the grid cell would be in that cluster. The cluster with the highest membership probability was chosen as the predicted cluster. I constructed confusion matrices and contingency tables of predicted fire regimes versus actual fire regimes. I also produced maps showing the predicted fire regime cluster, the actual fire regime cluster, and the Euclidean distance between cluster centers for the predicted and actual fire regime clusters.
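The tournament is just an argmax over the per-cluster membership probabilities, followed by a confusion matrix tally. A minimal sketch with made-up probabilities:

```python
import numpy as np

# Toy per-cell membership probabilities from four per-cluster logistic
# models (rows: grid cells, columns: clusters).
rng = np.random.default_rng(1)
probs = rng.random((6, 4))
probs /= probs.sum(axis=1, keepdims=True)

# Tournament: the cluster with the highest predicted membership
# probability wins the cell.
predicted = probs.argmax(axis=1)

# Confusion matrix of actual (rows) vs. predicted (columns) clusters.
actual = rng.integers(0, 4, size=6)
confusion = np.zeros((4, 4), dtype=int)
np.add.at(confusion, (actual, predicted), 1)
```

The diagonal of `confusion` holds the correctly classified cells, so overall accuracy is the trace divided by the number of cells.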
Results
Changes in input variables over time
Climate models consistently project a warming climate (Sheehan et al., 2015) in the Northwest during the 21st century, but precipitation changes vary across models. The results of the CCSM4 run with the RCP 8.5 CO2 values indicate that temperature warms approximately 2 to 4 °C over the entire region, that precipitation is relatively stable with local increases, and that mean dew point temperature is stable to increasing by up to approximately 3 °C (Fig. 2).
Fig. 2: Climate variables and their differences by century: A) 20th century maximum temperature; B) 21st century maximum temperature; C) change in maximum temperature between 20th and 21st centuries; D) 20th century precipitation; E) 21st century precipitation; F) change in precipitation between 20th and 21st centuries; G) 20th century dew point temperature; H) 21st century dew point temperature; and I) change in dew point temperature between 20th and 21st centuries.
Cluster analysis
The four clusters yielded by the analysis (Table 1) (details and graphs in my blog post on K-means clustering in R) can be described in terms of the factors used to generate them. These are: 1) low carbon consumed, medium fraction burned, low FRI; 2) high carbon consumed, medium fraction burned, medium FRI; 3) low carbon consumed, low fraction burned, medium FRI; 4) no fire. Euclidean distances between cluster centers are listed in Table 2.
Table 1: Non-normalized values for centers of clusters from cluster analysis. (C: Carbon; Med: Medium; Frac: Fraction)
Table 2: Euclidean distances between cluster centers (normalized values). (C: Carbon; Med: Medium; Frac: Fraction)
Overall, there were more fire-free cells in the 20th century (31%) than in the 21st (18%). This is readily apparent in the Cascade and Rocky Mountain areas (Fig. 3). Fire regimes are fairly consistent in some level III ecoregions and less so in others (Fig. 3). In some cases there is greater consistency in the 20th century (e.g. Coast Range, Cascades, and North Cascades), while other ecoregions show more consistency in the 21st century (e.g. Northern Basin and Range and Columbia Plateau). Still others are dominated by more than one fire regime to the exclusion of the rest (e.g. Willamette Valley). Overall, coverage by the no-fire regime decreases while coverage by the high carbon consumed, medium fraction burned, medium FRI regime increases, especially in the Eastern Cascades Slopes and Foothills, Northern Rockies, and Blue Mountains ecoregions.
Fig. 3: Fire regime coverage determined by cluster analysis with EPA Level III Ecoregion overlay for A) the 20th century; B) the 21st century; C) Euclidean distance between cluster centers for time periods. (C: Carbon; Med: Medium; Frac: Fraction)
Logistic regression and tournament fire regime prediction
Of the fourteen explanatory variables used in the logistic regression functions, every variable was used in at least one regression, and soil depth was the only variable used in all regressions (Table 3).
Table 3: Explanatory variables used in the logistic regression functions. (D: Depth; Max: Maximum; T: Temperature; Min: Minimum; Ppt: Precipitation)
The confusion matrix (Table 4) and marginal proportions table (Table 5) for the predictions show that predictions were correct in 62% of grid cells, and that for each fire regime cluster, the correct value was predicted more than any incorrect value. Maps for 20th and 21st century predicted and actual fire regime clusters (Fig. 4, A-B, E-F) show that predictions align fairly well with actual fire regime clusters. Maps of the distances between the centers of the predicted and actual fire regime clusters (Fig. 4) show that when clusters differ, they generally differ by a small amount. The mean Euclidean distance between predicted and actual clusters is 0.21 for the 20th century and 0.18 for the 21st, far less than the minimum distance between any two cluster centers, 0.33.
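The distance summaries above come directly from the normalized cluster centers: a correct prediction contributes zero, and a mis-prediction contributes the Euclidean distance between the two centers. A small sketch with made-up center coordinates:

```python
import numpy as np

# Toy normalized cluster centers (4 clusters x 3 fire regime variables);
# the real centers come from the k-means analysis.
centers = np.array([[0.1, 0.5, 0.1],
                    [0.9, 0.5, 0.5],
                    [0.1, 0.1, 0.5],
                    [0.0, 0.0, 1.0]])

def center_distance(pred, actual):
    """Euclidean distance between predicted and actual cluster centers;
    zero when the prediction is correct."""
    return float(np.linalg.norm(centers[pred] - centers[actual]))

# Mean distance over a handful of toy cells.
predicted = [0, 1, 1, 2]
actual    = [0, 1, 2, 2]
mean_dist = np.mean([center_distance(p, a) for p, a in zip(predicted, actual)])
```

A mean distance well below the smallest inter-center distance means that even the wrong predictions tend to land on a nearby, similar fire regime.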
Table 4: Confusion matrix for predicted and actual fire regime clusters for combined 20th and 21st centuries. (C: Carbon; Med: Medium; Frac: Fraction)
Table 5: Marginal proportions table for predicted and actual fire regime clusters for combined 20th and 21st centuries. (C: Carbon; Med: Medium; Frac: Fraction)
Fig. 4: Predicted and actual fire regime clusters and Euclidean distances between predicted and actual cluster centers: A) predicted fire regime clusters for the 20th century; B) actual fire regime clusters for the 20th century; C) Euclidean distances between predicted and actual cluster centers for the 20th century; D) predicted fire regime clusters for the 21st century; E) actual fire regime clusters for the 21st century; and F) Euclidean distances between predicted and actual cluster centers for the 21st century. (C: Carbon; Med: Medium; Frac: Fraction)
Geographical influences
Within the Blue Mountain Ecoregion, a logistic regression could not be performed for the no-fire regime due to its low occurrence, so no predictions were made for it. Logistic regressions were developed for the other fire regimes, and the tournament analysis was repeated for the ecoregion. Results (Tables 6 and 7) show that predictions using the regressions developed from the Blue Mountain Ecoregion data were more accurate (66% correct classification) than predictions for the ecoregion using the logistic regressions for the entire PNW (Tables 8 and 9) (63% correct classification).
Table 6: Confusion matrix for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the ecoregion. (C: Carbon; Med: Medium; Frac: Fraction)
Table 7: Marginal proportions table for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the ecoregion. (C: Carbon; Med: Medium; Frac: Fraction)
Table 8: Confusion matrix for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the entire PNW. (C: Carbon; Med: Medium; Frac: Fraction)
Table 9: Marginal proportions table for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the entire PNW. (C: Carbon; Med: Medium; Frac: Fraction)
Discussion
Results and hypotheses
Fire-free cells decreased over the region even as precipitation and dew point temperature increased. It is highly likely that rising temperatures were ultimately responsible, but an analysis to demonstrate this more rigorously is beyond the scope of this study. That said, I would suggest that hypothesis 1 (fire-free cells will decrease in the 21st century compared to the 20th due to increasing temperatures) is strongly supported.
Given the strong influence of fire on ecosystem characteristics, and a warming climate expanding the area where fire is likely, I expected fire regimes to change, especially along the edges between low elevation, flatter ecoregions and the hillier, higher elevation ecoregions surrounding them (for example, the Columbia Plateau and Northern Rockies ecoregions). While this expected expansion took place in some areas, the opposite effect took place within other ecoregions. The relationships between fire and ecoregions hold or become stronger in as many cases as they become weaker. Thus hypothesis 2 (fire regimes will group within individual ecoregions over the 20th century, but less so in the 21st century due to projected changes in climate between the two centuries) is not supported.
The use of logistic regression with a tournament algorithm yielded a statistical model with strong predictive power. The high rate of correct prediction and the relatively small distances between mis-predicted and actual fire regime clusters indicate that statistical relationships are embedded within the process-based MC2 model and that logistic regression can be used to project fire regime clusters. Hypothesis 3 is supported.
The predictive power of the logistic regression developed for the Blue Mountain Ecoregion was somewhat better than that of the regression for the entire area. One should not base a conclusion on a sample of one, but this does provide some support for hypothesis 4 (logistic regression performed on an ecoregional basis will have greater predictive power than one performed over the entire region due to local geographical relationships within ecoregions).
Broader significance
Fire is projected to occur in areas where it has not been present in the past. The relationship between the cohesiveness of fire regime and ecoregion varies from ecoregion to ecoregion. As the climate warms and fire regimes change, though, changing fire regimes will likely alter the characteristics of some areas, and some ecotones will likely shift as a result. One implication is that the boundaries of some ecoregions will likely shift; land managers should consider the potential effects not only of the introduction of fire into previously fire-free areas, but also of shifts in ecotones.
There is a case to be made for using both statistical and process-based vegetation models. Process-based models not only produce predictive results, but also provide opportunities to find non-linear responses (so-called tipping points) as well as novel combinations of vegetation types, fire, and weather/climate. Statistical models, on the other hand, are much easier to develop and execute. While a statistical model might not capture as detailed a relationship between explanatory and response variables, it can often provide more timely results. My results indicate that it might be possible to substitute a statistical model for a process-based vegetation model under certain circumstances.
Statistical analyses are commonly used to analyze climate models by comparing hindcasted model results to observed data. In my experience, vegetation modelers struggle to validate hindcasted results, often using simple side-by-side comparisons of observational and model data. The results from this study point to a possible way to deepen such evaluations. Researchers could develop one set of statistical models using actual historical observations for vegetation type, fire, etc. and another set using modeled data for the same time period. Comparing the two sets of models to one another, they could evaluate the similarities and differences of the explanatory variables’ influences to gain insight into how well the model is emulating natural processes.
Future work
This study can best be described as a pilot study for combining multiple fire-related output variables into fire regimes instead of analyzing them individually. How to best define a fire regime becomes both a philosophical and research question. I will be researching this in the coming weeks and months as I plan to use a refined version of the methods I used in this course for one of the chapters in my dissertation. I have not come across the use of cluster analysis to characterize vegetation model fire regimes, so I am hoping it will be considered a novel approach.
The effects of the stochastic ignitions algorithm within the model cannot be ignored when doing a cluster analysis. Ideally, the model would be run long enough for stochastic effects to even out over model time. Another approach is to run multiple (hundreds to thousands of) iterations of the model and average the results; time and resource constraints make this impractical. Another possible method would be to use a spatial weighting kernel to smooth results over the study area. This method would seem justified given the tendency for fire regimes to cluster geographically. This will be another avenue I investigate.
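The kernel-smoothing idea could be prototyped as a simple Gaussian convolution over a gridded output variable such as fraction burned. This is a naive, zero-padded sketch; the kernel radius and sigma are arbitrary choices:

```python
import numpy as np

def gaussian_kernel(radius, sigma):
    """Small 2-D Gaussian kernel, normalized to sum to 1."""
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def smooth(grid, radius=2, sigma=1.0):
    """Naive kernel smoothing of a 2-D output grid (e.g. fraction burned).
    Edges are handled by zero-padding, which biases border cells low."""
    k = gaussian_kernel(radius, sigma)
    padded = np.pad(grid, radius)
    out = np.zeros_like(grid, dtype=float)
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            out[i, j] = (padded[i:i + 2*radius + 1, j:j + 2*radius + 1] * k).sum()
    return out
```

A real implementation would need a better edge treatment and a kernel width chosen to reflect the spatial scale at which fire regimes cluster.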
Finally, the results from the Blue Mountain Ecoregion speak to the potential for geographically weighted regression. I will also do more research on that method.
My Learning
Cluster analysis
About a year ago I was first exposed to cluster analysis. This course gave me a great venue in which to try it on my own. Going through the process taught me the method and some of the pitfalls. I did not normalize the values on my first pass using cluster analysis, and I learned that this can be very important to get meaningful results.
Logistic regression and tournament
The logistic regression portion of this study reintroduced me to a technique I had used before, but did not understand well. As I worked through the logistic regression, I gained a better understanding of the process and how it worked. Putting together the logistic regression blog post made me go through the steps in a deliberate fashion. Doing the 4 by 4 confusion matrix helped me to get a better understanding of what a confusion matrix represents as well as what recall and precision mean in that context. While I did not have a chance to implement geographically weighted logistic regression, I at least gained an understanding of what it means. Implementing a tournament approach was not new to me; I have used a similar technique in genetic algorithms, but it was good to confirm the potential power of such a simple method.
Other packages and methods
Through the term I learned quite a bit about various software packages and tools. I’ve used Arc from time to time in the past, but delving into a couple of its spatial statistics tools – hotspot analysis and Moran’s I – reminded me of the power Arc has to offer, as well as some of its quirks. For example, my dataset was too large for Arc to handle with Moran’s I. Since I did most of my work in R, I did not have occasion to reacquaint myself with ModelBuilder. I have done a lot of raster processing in Python, but the clipping I had to do to get the Blue Mountain Ecoregion out of the larger dataset forced me to implement a flexible and efficient clipping script that will serve me well in the future.
Since R was the package that was best for my large dataset and had the algorithms I needed, it was the program I learned the most about. I learned how to read and write NetCDF files in R, do cluster analysis, and do stepwise logistic regression. In addition to these specific methods, I learned the differences between R data frames, vectors, and matrices, each of which R handles in a different manner. I have found the nuances tricky in the past and tricky still, but I have certainly become more facile with the program.
Statistics
What I have learned about statistics in this class starts with the existence of analytical methods I was not previously aware of. I was unaware of hotspot analysis, but going through the exercise provided me with a good foundational understanding. Spatial autocorrelation is something I had not had formal exposure to, but the basic concept made perfect sense; its importance, though, and the complexity of dealing with it had not occurred to me. I had been exposed to spectral analysis with temporal data, and its use with spatial data makes sense. I had heard of wavelets, but got a better understanding of the method. Correlograms were a new concept for me; presentations on them helped me to understand them. For the most part, regression methods presented in the class reinforced or expanded on concepts I already knew. The application of a kernel for geographically weighted regression was the technique that stretched my understanding the furthest. I had been exposed to the concept of applying a kernel to a dataset, but for some reason, its use in GWR made the concept click for me. Principal components analysis was another method I had heard of, but I did not understand it until it was presented in class, and the concept now makes perfect sense.
Conclusion
Diving into my own project in this class has been a valuable experience, but probably just as valuable has been the interactions with other students and their projects. The broad scope of approaches has exposed me to a wide variety of techniques. Going forward in my research, I will be able to investigate methods I was not aware of in the past.
References
Abatzoglou, J.T., Brown, T.J., 2012. A comparison of statistical downscaling methods suited for wildfire applications. Int. J. Climatol. 32, 772–780.
Bachelet, D., Ferschweiler, K., Sheehan, T.J., Sleeter, B., Zhu, Z., 2015. Projected carbon stocks in the conterminous US with land use and variable fire regimes. Global Change Biol., http://dx.doi.org/10.1111/gcb.13048.
Sheehan, T., Bachelet, D., Ferschweiler, K., 2015. Projected major fire and vegetation changes in the Pacific Northwest of the conterminous United States under selected CMIP5 climate futures. Ecological Modelling, 317, 16-29.