For this final tutorial post, I’ll describe a workflow in which FRAGSTATS is used to generate a number of continuous metrics for disturbance patches. Then, using these metrics, a Grouping Analysis is performed in ArcMap to identify groups of patches that are similar across multiple metrics.

FRAGSTATS is a spatial pattern analysis program developed by landscape ecologists for quantifying the structure (i.e., composition and configuration) of landscapes at multiple scales (see this link to the documentation). Through the lens of a landscape ecologist, a landscape may be considered as an assemblage of patches whose spatial homo/heterogeneity characterize the landscape they comprise. While the patches of interest to landscape ecologists are often habitat types, they may represent any spatial phenomenon, including forest disturbance.

The installation of FRAGSTATS is very straightforward (link to downloads page), and the GUI is friendly! Below I outline the steps for adding data to FRAGSTATS and generating patch metrics:

1) Add a layer. FRAGSTATS accepts raster images in a variety of formats (again, see documentation here). I worked with GeoTIFF (.tif) files representing disturbance patches.

frags3frag_input

2) Select metrics to calculate. Descriptions of each metric can be found in the documentation.

results

3) Generate Results. Simply click the run button on the main toolbar, and view the results.

results2

If your goal is to attach these tables back to your input data for mapping/analysis, in a GIS for example, then it is crucial to generate a “patch ID file”. To do this, simply check the box for “Generate patch ID file” under the Analysis parameters tab:

Capture

The patch ID file and associated tables will be saved to the directory you choose. Note that here I’ve checked only the box for Patch metrics. The patch ID file will have the suffix “_id8” appended to your input file’s name, with the same extension (“input”_id8.tif). The patch metrics file will have a .patch extension. Open the .patch file in Excel or the spreadsheet editor of your choice, delimit by comma, and save it as a file type that ArcMap will recognize, such as .csv or .txt. I suggest removing the “LID” field, which contains the file path where your initial input raster resides.
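If you would rather script this step than use a spreadsheet, a minimal R sketch is below. It assumes the .patch output is plain comma-delimited text, and the file names (disturbance.patch, disturbance_patch_metrics.csv) are hypothetical placeholders; adjust them to match your run.

# Read the comma-delimited FRAGSTATS patch metrics file
patch_metrics <- read.csv("disturbance.patch", strip.white = TRUE)

# Drop the LID field, which only stores the path to the input raster
patch_metrics$LID <- NULL

# Write a .csv that ArcMap will recognize for the tabular join
write.csv(patch_metrics, "disturbance_patch_metrics.csv", row.names = FALSE)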

4) Join output .patch file to patch ID file. In ArcMap, bring in both the patch ID and copy of the patch file in .csv or .txt format. Then, proceed with a tabular join:

joinjoin

Right-click the _id8 file and click Join… In the prompt window that appears, choose “Value” for the field in the patch ID file and PID for the field in the table of patch attributes. Click OK. The patch attributes are now joined to each patch in the patch ID file. Open the attribute table of the patch ID file to verify. Remember to export this file to make the join permanent.

*Note: if you don’t see “Value” as an option for the join field, you may have to run the “Build Raster Attribute Table” tool on your patch ID file.

5) Proceed with Mapping/Analysis!

*Note that the FRAGSTATS tabular outputs are useful in their own right. Below I’ve charted “class” level metrics (as opposed to patch level) that show trends in clear-cut patches over time. In this case, I filtered the disturbance data for clear-cuts only, and had FRAGSTATS generate outputs by year (year is the “class”).

charts
Moving forward with the disturbance centroids, I ran two other analysis tools in ArcMap that identify not only spatial clustering, but clustering of similar numeric (not necessarily continuous) variables attributed to the centroids. These tools are Cluster and Outlier Analysis (Anselin Local Moran’s I) and Hot Spot Analysis (Getis-Ord Gi*). While these tools answer essentially the same question (where are clusters of similar values?), they calculate this in slightly different ways. See this link for more information. For both tools I used year of disturbance as the input value to be evaluated:

yr_moransihotspot

The map on the left shows the results of the Local Moran’s I tool. High-High and Low-Low Clusters represent statistically significant clusters of high and low values, whereas High-Low and Low-High clusters are outliers representing either high values surrounded by low values, or vice versa. The results of the Hot Spot Analysis show similar patterns of significant clustering by year. These patterns are indicative of geographic shifts in timber harvest concentration, which could reflect management decisions. Using “Select by Location”, the land use designations associated with significant Moran’s I clusters were tabulated. Low-Low Moran’s I clusters are on the left (1985-1989) and High-High clusters (1996-2012) on the right:

moransiLLmoransiHH

Here we see an expected increase in clustering of clear-cuts and partial harvests on Non-Forest Service (private) and Matrix lands, yet it is important to keep in mind that these land-use designations did not exist until 1994.

Following this analysis of clustering by year, I was interested in attaching continuous variables to the actual disturbance patches, rather than their centroids. Toward this end, I brought in another output from the LandTrendr algorithm: disturbance magnitude, which is a function of the slope of change in spectral values for each pixel over a yearly time step. Since magnitude is measured on a per-pixel basis, I used the Zonal Statistics tool in ArcMap to calculate the mean disturbance magnitude within each patch and attach it to the patch attributes. I then ran Hot Spot Analysis again, with mean disturbance magnitude as the input value.
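If you work outside ArcMap, the same per-patch summary can be sketched in R with the terra package. The file names below are hypothetical placeholders for the patch ID raster and the LandTrendr magnitude raster.

library(terra)

# Patch ID raster (zones) and per-pixel disturbance magnitude raster
patches <- rast("disturbance_id8.tif")
magnitude <- rast("disturbance_magnitude.tif")

# Mean disturbance magnitude within each patch
mean_mag <- zonal(magnitude, patches, fun = "mean", na.rm = TRUE)
head(mean_mag)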

magn_hotspotsmagn

The distribution of high and low magnitude disturbances is interesting, especially when overlaid on the land use designations. As shown in the map on the right, a cluster of high magnitude disturbance is associated with a section of adjacent private lands (in grey) and a cluster of low magnitude disturbance is associated with Matrix land (in orange). This may indicate a higher proportion of clear-cutting (high magnitude disturbance) on private lands, and more partial harvesting (lower magnitude disturbance) on Matrix lands.

To begin my analysis of forest disturbance patterns, I wanted to get a broad scale understanding of how clear-cuts and partial harvests are distributed throughout Willamette National Forest, as well as the land use designations and forest districts they spatially coincide with. Using ArcMap I ran the Kernel Density tool on cumulative disturbances from 1985 to 2012. This first required converting my disturbance patch data (rasters) into polygons (vectors). Additionally, Kernel Density requires point or polyline inputs, so I also generated disturbance centroids from the polygons. Below I outline the function, results and my interpretations:

Kernel Density calculates a magnitude per unit area using a kernel function to fit a smoothly tapered surface to each input feature. Here are the results for two runs of this tool:

disturb_density3disturb_density2

Below are the parameters used for each kernel density output above. In the first run on the left, the output cell size of 250 meters results in a surface that is less smooth than that of the second run on the right, which has a much smaller output cell size. Additionally, in the second run, the optional parameter for a search radius was set to 1600 meters, a value that is roughly double that of the average nearest-neighbor distance between disturbance centroids. This creates a density surface that is more appropriate for the scale of the map (on the right). Both maps give an indication that higher densities of clear-cuts and partial harvests between 1984 and 2012 occurred on Matrix Lands and Non-Forest Service Lands, which are mostly composed of private and industrial landowners. These results are not surprising, given that timber harvest is the primary function of Matrix Lands, and that private and industrial landowners are inclined to produce timber.

CaptureCapture3
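As a rough cross-check outside ArcMap, an analogous density surface can be sketched in R with MASS::kde2d on the centroid coordinates. The data frame and column names are assumptions, and kde2d’s normal-kernel bandwidth is only loosely comparable to the 1600 m search radius used above.

library(MASS)

# centroids: data frame of projected X/Y coordinates (meters) for the disturbance centroids
dens <- kde2d(centroids$X, centroids$Y, h = 1600, n = 400)

# Quick look at the smoothed surface
image(dens, asp = 1)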

*It is important to note that after converting the disturbance patches to polygons, I used the Normal QQ Plot and Histogram functions (built into the Geostatistical Analyst extension for ArcMap) to remove outliers using the “Shape_Area” attribute field. Polygons of extremely low area (less than eleven 30-meter Landsat pixels) were removed because they likely represent incorrectly classified pixels. Polygons of extremely high area were also removed, because they would not be accurately represented by a single centroid point.

Mapping cumulative clear-cuts and partial harvests over the 27-year period between 1985 and 2012 paints an interesting picture, but in order to better assess the effects of forest governance on landscape patterns, it is more informative to map disturbances temporally. Using the Definition Query function in ArcMap, I filtered the disturbance data at 5-year intervals and ran the Kernel Density tool again for each interval:

dens_time

Bringing in the temporal dimension reveals some interesting changes in the density of forest disturbance that may correspond with significant events in the history of forest governance. For example, 1990 shows a very high density of clear-cuts and partial harvests, indicating a peak in timber harvest activity prior to the May 29, 1991 Dwyer Injunction, which banned timber sales. Following the 1994 Record of Decision for the Northwest Forest Plan, the maps for 2000, 2005 and 2010 show the expected drop in overall disturbance density, but interesting peaks associated with certain forest districts. For one final run of the Kernel Density tool, I mapped cumulative disturbance during the period of the Dwyer Injunction (1990-1994). The result shows higher disturbance density than expected, but I assume that there is a lag between the timing of timber sales and when sold forest land is actually harvested. Thus, this map likely shows clear-cuts and harvests on forest lands sold prior to the injunction, but it may also indicate which forest districts are more inclined toward timber production.

91_94_dens

The analysis portion of this project did not prove effective and ultimately produced no usable results. Let me use this tutorial to walk you through a cautionary tale about being just a little too determined to make an idea work with the dataset that you have.

 

The Question:

After doing a lot of research on both early archaeological sites and sites that contain Pleistocene megafauna in the Willamette Valley, a few patterns seemed to emerge. Megafauna sites seemed to occur in peat bogs, and all of the earliest archaeological sites occurred on the margins of wetlands. This led me to wonder whether there is a connection between soil type, pH, and other factors across all of the known sites. What variables might be useful for predicting the locations of other, yet to be discovered, archaeological and megafauna sites throughout the valley?

 

The Tools:

In order to conduct this exploration, I decided to use the tools built into ArcGIS. Hotspot analysis and regression were the two main tools I planned to use.

For the data, I found a SSURGO dataset that was in vector format. It contained polygons of all of the mapped soil units in the valley, as well as a variety of factors related to slope, parent material, order, etc. Eventually I switched gears and found another SSURGO dataset that was in raster format and contained a whole lot more data, hoping that this change in dataset would make the analysis much easier.

 

The Steps:

The first step I took when conducting this analysis was mapping different variables and comparing them visually, to see whether any obvious patterns emerged. Three soil types stood out as potentially important when considering the late Pleistocene in the Willamette Valley.

 

The mapping revealed that there were three soil types that seemed to appear at most of the known sites.

 

Below is a map of all of the known Pleistocene megafauna and early Holocene archaeological sites in the valley.

 

11

Three major soil types emerged as associated with local sites: the Labish, Bashaw, and McBee series.

12

The Labish Soil was especially interesting, as it only seemed to occur at the major peat bearing sites in the valley, most of which were drained lakes that are currently used for crops.

 

After reading about the nature of soil pH in wetland deposits, I began to hypothesize that pH would have been an important variable in the soils that I had identified, and wanted to use this knowledge to find more sites through hotspot analysis and/or regression analysis.

 

 

The Results:

 

The results are the toughest part to discuss, as there was not much to show for results.

Many attempts at running a regression analysis were made, using a wide variety of combinations of data, but every run returned an error of perfect multicollinearity and failed. The analysis was attempted using both the vector and raster forms of the data, using built-in pH data, pH data that was acquired elsewhere and added to the dataset, and various combinations of variables.

As I began to explore the dataset further, I realized that the data, while initially appearing to be incredibly varied, was in fact quite uniform. I mapped out the soil orders and Great Groups in the valley and realized that the maps looked strikingly similar, which told me that (as was mentioned in class) all of the data was likely extrapolated from a few key points.

 

Soil Order:

13

Soil Great Group:

14

Aside from a few differences, the two maps are extremely similar, which tells me that this data was more than likely, as mentioned above, extrapolated across a large landscape.

 

This realization made me doubt the pH data too, so I mapped it out as well.

 

Soil pH Map:

15

The valley soils appear to be fairly neutral, varying only between pH 5.7 and 6.6.

This would make it very difficult to use some sort of exploratory statistical analysis on this dataset, as there wasn’t much variability.

In order to look at how pH was distributed throughout the valley, I ran a hotspot analysis as well as a Moran’s I analysis.

pH Hotspot Analysis Map:

16

pH Moran’s I Analysis Results:

17

As you can see, the data is extremely clustered, especially in my particular area of interest, the valley floor.

 

 

Was this useful?

This analysis was useful, but for a different reason than expected. The SSURGO dataset is not the best tool for soil-landscape analysis at the local scale of this project. Throughout the class, I have seen other statewide projects that were a lot more successful due to higher variability in soils between the east and west sides of the state.

I became a tad too determined to run this kind of analysis, and the results were completely inconclusive in that respect. In the end, the most beneficial part of the analysis was figuring out that there are likely connections between my sites of interest. In order to investigate these connections, physical testing is likely the most reliable approach, since the SSURGO data is not reliable for this purpose.

Also, don’t rely on your data too much. It might mess with your head a bit!

Wavelet Analysis: A Tutorial

 

Woodburn High School, in the northern Willamette Valley, Oregon, contains evidence of an extensive peat bog as well as evidence of extinct Pleistocene megafauna. In October 2015, sediment cores were extracted from the site in order to better understand the underlying stratigraphy and to find sediment of the right age and type to possibly contain evidence of human occupation.

1

Aerial photo of the study area, with sample locations marked with orange dots.

 

In order to move the project forward, a better understanding of wavelet analysis must be established by testing the method on a sample of geochemical data extracted from one of the cores.

 

The Question:

 

Is wavelet analysis an appropriate tool for geochemically identifying stratigraphic breaks in sediment cores?

 

The Tools:

 

To conduct this analysis, I used R as well as the R ‘WaveletComp’ package for the wavelet analysis, and ‘ggplot2’ in order to graph the geochemical data.

 

‘WaveletComp’ takes any form of continuous, evenly spaced data (typically a time series; in this case the series is spatial) and uses a small waveform called a wavelet to run variance calculations along the dataset. The resulting output is a power diagram, which shows (in red) the locations along the dataset where there is a large change in variance. A cross-wavelet power diagram can also be generated, which indicates where two different variables rise and/or drop at the same time.

 

2

Example of a wavelet.

 

There are two equations used when generating a wavelet power diagram…

3

The above equation uses the dataset to calculate the appropriate size of the wavelet according to the number of points in the dataset.

4

The above equation uses the wavelet to run variance calculations across the dataset and output the power diagram.
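The two equation images are not reproduced in this post. For orientation, and assuming WaveletComp’s default Morlet mother wavelet, the standard building blocks of a wavelet power diagram can be written as:

\psi(t) = \pi^{-1/4} e^{i \omega t} e^{-t^{2}/2}

W_x(\tau, s) = \sum_{t} x_t \, \frac{1}{\sqrt{s}} \, \psi^{*}\!\left(\frac{t - \tau}{s}\right), \qquad \mathrm{Power}(\tau, s) = \left| W_x(\tau, s) \right|^{2}

Here s is the wavelet scale (period), tau is the position along the core, and the power diagram simply maps Power(tau, s) over all positions and scales.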

 

 

The Steps:

 

After the appropriate package is loaded into R and the dataset is read in, generating the power diagram and cross-power diagram takes just a small block of code…

 

Code snippet for the power diagram:

Fewhs010101 = analyze.wavelet(WHS010101, "Fe",
                              loess.span = 0,   # Detrending
                              dt = 1,           # Time resolution (sampling interval)
                              dj = 1/250,       # Resolution along the period axis
                              make.pval = T,    # Draws white lines that indicate significance
                              n.sim = 10)       # Number of simulations

wt.image(Fewhs010101, color.key = "quantile", n.levels = 250,
         legend.params = list(lab = "WHS 010101 Wavelet Power Fe", mar = 4.7))

 

Code Snippet for a cross-power diagram:

AlFewhs010101 = analyze.coherency(WHS010101, c("Al", "Fe"),
                                  loess.span = 0,
                                  dt = 1, dj = 1/100,
                                  make.pval = T, n.sim = 10)

wc.image(AlFewhs010101, n.levels = 250,
         legend.params = list(lab = "WHS 010101 cross-wavelet power levels – Al/Fe"))

 

 

The output of the power diagram was then compared to line graphs generated using the ggplot2 package.
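A minimal ggplot2 sketch of that comparison is below; the object and column names (WHS010101, Depth, Al, S) are assumptions standing in for whatever your core data frame uses.

library(ggplot2)
library(tidyr)

# Reshape to long format so each element plots as its own line
core_long = pivot_longer(WHS010101, cols = c("Al", "S"),
                         names_to = "element", values_to = "value")

ggplot(core_long, aes(x = Depth, y = value, colour = element)) +
  geom_line() +
  labs(x = "Depth along core", y = "Element value")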

 

 

The Results:

 

The results were quite interesting: you can clearly see three different geochemical breaks, illustrated by the three red plumes rising from the bottom of the diagram. The color red indicates that at that particular “time” there is a significant edge or change in the waveform. This is illustrated by comparing to the line graph below. There are “red plumes” present at all of the significant changes in the waveforms on the graph. This tells us that these locations along the transect should be considered “hotspots” for stratigraphic change.

 

5

Wavelet power diagram of Aluminum.

6

Example of graphed aluminum and sulphur data for the core.

 

Was this useful?

For my project, the analysis seemed to show that it is very possible to spot major changes in geochemistry across a transect. This will be further explored in my forthcoming final analysis post.

As for other projects, this method could be used to spot unseen patterns in the changes of sea surface temperature over time, or changes in frequencies of oxygen isotopes, or any other data that is presented in a time-series or in equidistant measures across a landscape.

 

Boosted regression trees (BRT) are an ensemble, tree-based species distribution model that iteratively grows small/simple trees based on the residuals from all previous trees.  Each tree is fit to a random subset of your data, and ALL predictor variables are considered when producing the best split at each node.  The tree complexity determines how deep each individual tree is grown.  Anticipated interactions can be captured by setting an appropriate tree complexity.  The learning rate determines the overall weight of each individual tree.  This is an advanced form of regression that consists of two main components (see elith08_fig1): regression trees and boosting.

 

  1. Regression trees: Partition the predictor space into rectangles, using a series of rules to identify the regions with the most homogeneous response to the predictors, and fit a constant to each region.
  2. Boosting: An iterative procedure that reduces the deviance by fitting each new tree to the residuals of the previous tree(s)
    • Each individual tree informs subsequent trees and thus the final model
    • The boosting component makes boosted regression distinct from other tree-based methods.

 

Objective of this analysis:

Identify the biophysical parameters associated with giant gourami distribution in SE Asia, starting with a set of 47 global occurrence points.  This was an exploratory exercise to learn about the physical variables that might be important for the distribution of the giant gourami. 

My Data:

I pulled the 47 occurrence points for the giant gourami from fishbase.com and used them as the basis for the dataset.

ArcMap: generate points and convert to KML for import into Google Earth Engine

//Get coordinates for occurrence points for species of interest (Giant Gourami) from fishbase.com

//create random points to use as ‘pseudo absence’ points

//generate random points within study region and only in rivers

Google Earth Engine: ‘gather’ biophysical data for points from satellite imagery–used for ease of access to spatial layers

//load in image layers of interest (NDVI, CHIRPS, Population density, Flow, Surface Temp.)

//export to CSV for analysis in R Studio

 

R code for running the BRT model:

I ran the model using the gbm package in R, based on a tutorial by Jane Elith and John Leathwick (https://cran.r-project.org/web/packages/dismo/vignettes/brt.pdf)

 

source("BRT/brt.functions.R")

install.packages('gbm')

library(gbm)

# define the dataset

gourami <- read.csv("BRT/gourami/gourami_data.csv")

# data consists of 39 presence and 305 pseudo-absence points (344 total)
# 5 predictor variables

gourami.tc3.lr005 <- gbm.step(data=gourami,
    gbm.x = 3:7,             # columns in the dataset where the predictor variables are located
    gbm.y = 2,               # column in the dataset where the presence/absence data (0/1) are located
    family = "bernoulli",
    tree.complexity = 3,     # tree depth; determines the number of layers in each tree
    learning.rate = 0.005,   # weight of each tree in the overall model
    bag.fraction = 0.75)     # fraction of the dataset used to build/train each tree

 

The three main parameters to pay attention to at this point are tree complexity, learning rate, and bag fraction. The fraction of the dataset not used to build each tree is then used for cross-validation and model evaluation. These three parameters can be varied to determine the ‘best’ model.
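One way to script that comparison is sketched below, assuming the gbm.step interface from the Elith and Leathwick tutorial (which stores cross-validated statistics on the fitted object); the grid values are illustrative.

# Fit a small grid of models, varying tree complexity and learning rate
settings <- expand.grid(tc = c(3, 4), lr = c(0.01, 0.005))
models <- list()

for (i in seq_len(nrow(settings))) {
  models[[i]] <- gbm.step(data = gourami, gbm.x = 3:7, gbm.y = 2,
                          family = "bernoulli",
                          tree.complexity = settings$tc[i],
                          learning.rate = settings$lr[i],
                          bag.fraction = 0.75)
}

# Cross-validated deviance for each combination (lower is better)
cbind(settings, cv.deviance = sapply(models, function(m) m$cv.statistics$deviance.mean))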

Results

Initial output for model with tree complexity of 3, learning rate of 0.005, and bag fraction of 0.75.  Several warning messages were displayed with this particular model, which are not addressed in this tutorial:

mean total deviance = 0.707

mean residual deviance = 0.145

estimated cv deviance = 0.259 ; se = 0.058

training data correlation = 0.916

cv correlation =  0.838 ; se = 0.043

training data ROC score = 0.989

cv ROC score = 0.958 ; se = 0.02

elapsed time –  0.13 minutes

 

Warning messages:

1: glm.fit: algorithm did not converge

2: glm.fit: fitted probabilities numerically 0 or 1

 

Relative contributions of each variable in determining where the species is expected.  For this model, precipitation has the strongest pull:

> gourami.tc3.lr005$contributions

          var   rel.inf
  mean_Precip 61.223530
    mean_temp 25.156042
  pop_density 10.299844
    NDVI_mean  1.685350
         Flow  1.635234

 

Interactions:

         NDVI  Precip   flow    Temp    Pop
NDVI        0   29.62   0.11    0.07   0.08
Precip      0       0  17.00  317.66  84.51
flow        0       0      0       0   0.93
Temp        0       0      0       0   3.29
Pop         0       0      0       0      0
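With the tutorial’s helper functions loaded, a table of this form can be produced from the fitted model; a minimal sketch:

# Pairwise interaction strengths between predictors
gourami.int <- gbm.interactions(gourami.tc3.lr005)
gourami.int$interactions    # full matrix, as above
gourami.int$rank.list       # the same interactions ranked by size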

Partial dependence plots visualize the effect of a single variable on the model response, holding all other variables constant.  Model results vary the most with precipitation, as seen in the top left plot.  Mean temperature and population density also appear to play a role in giant gourami distribution based on these plots, but their effects may be more apparent if you zoom in on the upper temperature threshold or the lower population density range.

gourami_tc3_lr005_plots
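The panels above can be drawn with the tutorial’s plotting helper; a minimal call (the layout values are an assumption) might be:

# One partial dependence panel per predictor, drawn on a common scale
gbm.plot(gourami.tc3.lr005, n.plots = 5, plot.layout = c(2, 3), write.title = FALSE)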
Model comparison, varying tree complexity and learning rate to evaluate the best settings. The top row illustrates model fit for a tree complexity of 3 with a learning rate of 0.01 (left) and 0.005 (right).  The bottom row illustrates model fit for a tree complexity of 4 with a learning rate of 0.01 (left) and 0.005 (right):

Gourami_BRT_Rplot_3x4d3-4_lr0.1-0.005_wNOpop

It appears that a model with a tree complexity of 3 and a learning rate of 0.005 performs the best. This model indicates that precipitation has the largest effect on the distribution of giant gourami in SE Asia, based on the initial 34 occurrence points.  

Model critique: BRT is not a spatially explicit model and thus relies only on the relationship between the variables and the sample points.  Additionally, due to the complex nature of the model (often outputting thousands of trees), the results can be difficult to interpret or explain.

 

Prologue

This post got kind of out-of-hand, but it represents my attempt to understand what was going on under the hood of geographically-weighted regression (GWR), especially in terms of my own data. I found it personally helpful to back up, start with the basics of multiple regression, and work my way back to GWR to see what it was and was not telling me about my data. Much of this will be review for members of our class, but I hope that you find something of use here.

Introduction

Regression analysis refers to a class of statistical procedures used to assess the degree to which one variable (y) may be predicted by one or more other, independent variables (x1, x2, … , xk). Simple, linear regression models use a single continuous variable to predict a response variable, and are written using the familiar equation y = mx + b, where y is the response variable, x is the predictor, m is the slope coefficient of x, and b is the y-intercept of the regression line on a Cartesian grid. Multiple Regression models are similar, but use more than one continuous predictor variable. These are typically written as:  y = β0 + β1x1 + β2x2 + … + βkxk + ε, where β refers to each variable coefficient and ε to the model error.

With the exception of logistic regression, variables on both sides of a regression equation must be continuous (rather than categorical) and must meet four key assumptions: (1) the mean of the probability distribution of ε is 0; (2) the variance of the probability distribution of ε is constant; (3) the probability distribution of ε is normal; and (4) the errors associated with any two observations are independent (Mendenhall and Sincich 2003:105-106).

The complexity of multiple regression raises a number of issues, especially when dealing with spatial data. In this tutorial, we will explore statistical approaches to three common problems in multiple regression. First, how do we deal with collinearity between our predictor variables? That is, how do we deal with violations of the assumption of independence? Second, how do we decide which predictor variables to include in our model? And finally, how do we identify geographically patterned deviations from a global regression model in spatial data?

To address the first issue, we will discuss a statistical technique for reducing the dimensionality of a dataset called Principal Components Analysis. To address the second, we will discuss Exploratory Multiple Regression. To address the third, we will discuss Geographically Weighted Regression.

Case Study

To illustrate this discussion, we will be working with three sets of data from my research in Oaxaca. Namely, we will assess the degree to which we can predict concentrations of five major and minor elements in subsurface clay deposits using data derived from remote sensing. For me, these data are of interest because understanding the distribution of compositionally distinct clay resources may help us track ancient pottery production and exchange in the region. For this tutorial, these data serve simply as an example of a complex multivariate spatial problem. Each dataset used in this study is outlined briefly below:

  1. Tlacolula Clays: These data consist of 137 point features from across the eastern arm of the Valley of Oaxaca, the Tlacolula Valley. During the summers of 2007 and 2012, clay samples were collected from each of these locations, exported to OSU, and analyzed using INAA to estimate the elemental composition of each sample across 30 elements. For this exercise, we focus on just five major and minor elements: Aluminum (Al), Calcium (Ca), Sodium (Na), Potassium (K), and Iron (Fe). These will act as our dependent variables.
  2. ASTER Bands: These data consist of spectral reflectance and emissivity measurements for 13 bands of remote sensing imagery extracted from a single tile of ASTER data from a cloud-free day in June of 2001. These include two Very Near Infrared (VNIR) bands taken at a spatial resolution of 15 m, six Short-wave Infrared (SWIR) bands taken at a spatial resolution of 30 m, and five Thermal Infrared (TIR) bands taken at a spatial resolution of 90 m. A common use of ASTER data is the mapping and classification of surface geology. If parent material is a good predictor of clay composition in this area, ASTER imagery may serve as a good proxy measure of sediment lithology. These data will therefore act as our first set of predictor variables.
  3. DEM-derived surfaces: Beyond parent material, we may expect the chemical composition of clay deposits to be affected by processes such as soil development, erosion, and alluvial movement of sediments. All of these are difficult to measure directly, but are heavily affected by factors such as slope, which are easily measured using remote sensing. We therefore added a second set of possible predictor variables to our analysis, these derived from a single ASTER 90 m DEM. These include elevation, slope, and flow accumulation. All three of these variables were log-transformed to better approximate a normal distribution.

Software

All of the statistical methods described in this tutorial are available under multiple software packages. I used a combination of ENVI (for manipulation of ASTER data), ARCGIS (for most other spatial analysis and data manipulation), and JMP (for non-spatial data analysis). I cannot speak to how these software compare to other software with similar capabilities. In fact, I am confident you prefer your favorite software packages over mine 🙂 For this reason, I will emphasize methods and interpretation over software throughout this tutorial.

Principal Components Analysis

A primary concern in multiple regression analyses is the tendency to over-fit a model simply by including too many variables. This is problematic because it both overstates the degree to which a response variable can be predicted with a series of independent variables, and it becomes extremely difficult to interpret. Our dataset has sixteen predictor variables, which is almost certainly too many for the number of cases we are analyzing.

A common method for reducing the dimensionality of a dataset is Principal Components Analysis (PCA). PCA uses either the correlation or covariance matrix of a dataset to create an orthogonal rotation of new variables called PCs that are by definition uncorrelated. Each PC describes a portion of the overall variance in the dataset, as measured by Eigen-values, with the first PC describing the principal axis of variability, the second PC a smaller portion of variability, and so on. The contribution of individual input variables to each PC is described by their loadings as Eigen-vectors.

When choosing PCs for further analyses, it is important to consider both the Eigen-values and Eigen-vectors. How many PCs are required to describe the majority of the variability in your dataset? Do the Eigen-vector loadings on each PC make sense, knowing what you know about the structure of your data?

To reduce the number of predictor variables in our analysis, data from the thirteen ASTER bands were subjected to a Principal Components Analysis on correlations. DEM-derived predictor variables were excluded from this analysis because a preliminary evaluation of the data showed that they were sufficiently independent from the ASTER bands. All ASTER bands were screened for outliers prior to analysis to reduce the influence of extreme cases.
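In R, this kind of correlation-based PCA could be sketched with prcomp; aster_bands here is a hypothetical data frame holding the 13 band values sampled at the clay locations.

# PCA on correlations: scale. = TRUE standardizes each band before rotation
aster_pca <- prcomp(aster_bands, center = TRUE, scale. = TRUE)

summary(aster_pca)                   # variance explained by each PC (cf. Table 1)
round(aster_pca$rotation[, 1:4], 2)  # Eigen-vector loadings on the first four PCs (cf. Table 2)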

Results of this analysis are reported as Eigen-values in Table 1, and as Eigen-vectors in Table 2. Table 1 shows that as much as 90 percent of the variability in the ASTER data is described by the first two PCs, and 97 percent described by the first three. However, beyond providing a reduced set of variables, we want our PCs to describe interpretively meaningful axes of variability that may be useful for predicting our response variables. Just because the first PC describes the most variability in the data does not mean that it is going to be the best predictor for all of our elements of interest.

Indeed, Table 2 shows that while the first PC is nearly equally loaded on all ASTER bands, the next three PCs are more intuitively meaningful. PC2 is positively loaded on the TIR bands and negatively loaded on all others. PC3 is positively loaded on the VNIR bands and negatively loaded on SWIR bands, especially at lower wavelengths. PC4, while accounting for just 1.4 percent of the variability in the data, describes an important contrast between shorter and longer wave bands within the SWIR portion of the spectrum. For this reason, we might select the first four PCs as potential predictor variables for subsequent analyses.

Table 1: Results of Principal Components Analysis of 13 ASTER image bands covering the Tlacolula Valley showing Eigen-values for each resulting PC, the percent of variability described by each, and the cumulative percent variability described by aggregate combinations of PCs.

Pink_GWR_T1

Table 2: Results of Principal Components Analysis of 13 ASTER image bands covering the Tlacolula Valley showing Eigen-vector loadings on the first 4 PCs. Positive loadings above 0.30 are shown in red. Negative loadings below -0.30 are shown in blue. These four PCs account for 99% of the variability in the ASTER data.

Pink_GWR_T2

Exploratory Multiple Regression

Now that we have reduced the dimensionality of our dataset to five dependent variables (Al, Ca, Na, K, and Fe) and seven possible predictor variables (four ASTER PCs and three DEM-derived variables), we are faced with the problem of identifying the best possible combination of variables for the prediction of each response variable. This becomes especially problematic when one suspects that there may be some interaction between predictor variables (such as canopy cover and soil moisture content as predictors of fire susceptibility). In these cases, one might wish to consider adding interaction terms, which further adds to the dimensionality of one’s data. For the purposes of this exercise, interaction terms were not considered.

A common method for identifying an optimal combination of predictors is Stepwise Multiple Regression, which allows one to walk through a series of potential models based on iterative improvements in model fit. This method is widely accepted and robust, but carries the disadvantage of only considering a single measure of model fit, such as R2 or AICc.

A more brute-force method for identifying optimal combinations of predictor variables is Exploratory Multiple Regression, which takes advantage of modern computing power to calculate every possible combination of predictor variables for each response. The results of these analyses may then be quickly sorted by number of predictor variables and multiple measures of model fit.
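I ran this step in JMP, but an equivalent all-subsets search can be sketched in R with the leaps package; the variable names below are placeholders for the Al response and the seven candidate predictors.

library(leaps)

# Best subset of each size from the seven candidate predictors
all_subsets <- regsubsets(Al ~ ASTPC1 + ASTPC2 + ASTPC3 + ASTPC4 +
                            logElev + logSlope + logFlowAcc,
                          data = clays, nvmax = 7)

# Compare the best model of each size on adjusted R2 and BIC
summary(all_subsets)$adjr2
summary(all_subsets)$bic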

Table 3 reports the results of an Exploratory Multiple Regression of all possible combinations of our seven variables as predictors of Al. To conserve space in the blogosphere, only best-fit models for combinations of one through seven variables are reported here. Predictably, R2 tends to increase as we add variables to the model. However, Root Mean Square Error (RMSE), Akaike Information Criterion (AICc), and Bayesian Information Criterion (BIC) scores are lowest for models with four to five variables. In multiple regression, AICc and BIC scores are used to evaluate competing models by punishing complexity while rewarding fit (lower scores are better). The models reported in Table 3 highlight the problem with using a single criterion for model selection; while both AICc and BIC are lowest for the model that uses five variables, only four of these variables are individually significant at α = 0.05. For this reason, I have selected the four-variable best-fit model for Al as the final model that I will use in further analyses.

Table 4 shows the final best-fit models chosen for all five elements using Exploratory Multiple Regression. In general, the relationship between our predictor variables and individual elements is weak, with R2 ranging from 0.116 for K to 0.296 for Al. All final combinations of variables were significant predictors of each element, as were the individual variables included in each model.

Table 3: Best-fit least squares regression models predicting the elemental concentration of Al in Tlacolula Valley clays using one to seven variables. Predictor variables were chosen from all possible combinations of models generated using 7 variables: ASTER PC1, ASTER PC2, ASTER PC3, ASTER PC4, Log10 elevation, Log10 slope, and Log10 flow accumulation.

Pink_GWR_T3

Table 4: Best-fit least squares regression models predicting the elemental concentration of five major and minor elements in Tlacolula Valley clays. Predictor variables were chosen from all possible combinations of models generated using 7 variables: ASTER PC1, ASTER PC2, ASTER PC3, ASTER PC4, Log10 elevation, Log10 slope, and Log10 flow accumulation.

Pink_GWR_T4

Notably, different combinations of variables were identified as the best predictors for each element. ASTPC1 (weakly loaded on all ASTER bands) was included in all models except that for K. ASTPC2 (positively loaded on the TIR bands) was included in models predicting Al and Ca, but not Na, K, or Fe. Surprisingly, ASTPC3 (positively loaded on the VNIR bands) was only included in the final model for Na, while ASTPC4 (positively loaded on low wavelength SWIR bands and negatively loaded on high SWIR bands, but describing just 1.4% of the variability in the data) was included in final models for Ca, Na, and Fe. Slope was included in final models for all elements except K, but only has a positive coefficient for Ca. Flow Accumulation was included as a negative coefficient in final models for Al, K, and Fe, showing that these elements tend to decrease in areas with higher potential runoff. Elevation was not included in any of the five final models.

A troubling issue with the final models that we have selected is their relatively poor predictive power. Generally speaking, it appears that even the best combinations of our predictor variables only account for about 20 to 30 percent of the variability in the elemental composition of clay deposits in the Tlacolula Valley. This raises the question of whether the relationships we have identified between our predictor variables and individual elements hold true across the study area, or whether our models are driven by strong deviations from the global fit at finer spatial scales. To address this issue, we may turn to Geographically Weighted Regression.

Geographically Weighted Regression

Over the course of this term, a number of us have attempted to use Geographically Weighted Regression (GWR) to generate prediction surfaces for particular response variables. This was in fact the original intent of my project. However, a common problem that we ran into was the difficulty of interpreting the results of geographically-weighted regression models. Is a GWR model with a higher R2 than an OLS model that uses the same variables a better model? What does it mean when local coefficients are positive in one area and negative in another? Should areas with poor local fit be excluded from output prediction surfaces?

My primary purpose in this back-to-basics blog post on multiple regression stems from my current understanding of the intended use of GWR. Turns out, Geographically Weighted Regression is primarily intended as a technique for exploring local deviations from global fit resulting from non-stationarity in one’s data (Fotheringham et al. 2002). An implication of this is that GWR should be conducted only as a post-hoc analysis following selection of a final model using Ordinary Least Squares regression. In fact, the “help” function for Geographically Weighted Regression in ARCGIS explicitly advises one to do this. A corollary to this implication might be that GWR is not really intended to generate prediction surfaces.

As discussed by multiple members of the class, GWR operates by generating a series of local fits to each feature in a spatial dataset by passing a Gaussian kernel smoother of either a fixed or adaptive threshold distance across one’s data. For this exercise, I conducted a GWR for each of our five elements of interest using the final model variables selected through Exploratory Multiple Regression. Table 5 summarizes the quasi-global R2 fit of GWR models relative to R2 values for OLS models using the same variables. For all five elements, R2 values are greater for the GWR models than the OLS models, but the degree of difference varies by element. For Al and Ca, GWR models yield minimal improvements in fit over OLS models. By contrast, quasi-global GWR R2 values for Na and Fe are close to twice as high as their OLS counterparts.

Table 5: Quasi-global R2 fit of geographically weighted regression (GWR) models relative to those of best-fit least squares (OLS) models predicting the concentration of five major and minor elements in Tlacolula Valley clays.

Pink_GWR_T5
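For readers working outside ArcGIS, the same kind of fit can be sketched with the spgwr package; clays_sp is a hypothetical SpatialPointsDataFrame of the samples, and the four predictors stand in for the final Al model.

library(spgwr)

# Choose an adaptive bandwidth by cross-validation, then fit the GWR
bw <- gwr.sel(Al ~ ASTPC1 + ASTPC2 + logSlope + logFlowAcc, data = clays_sp, adapt = TRUE)
fit <- gwr(Al ~ ASTPC1 + ASTPC2 + logSlope + logFlowAcc, data = clays_sp,
           adapt = bw, hatmatrix = TRUE)

# Local coefficients and local R2 are stored in the output spatial data frame
summary(fit$SDF$localR2)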

But how do we interpret this information? My best answer is that if GWR is primarily intended to check for stationarity, then local measures of fit and dependent coefficients are much more important to our interpretation than the global fit. If we map the local R2 for each feature included in a given model, we find that the quasi-global R2 is a kind of maximum measure of fit; local R2 values are always lower than the quasi-global R2. Examining the spatial distribution of local coefficients and measures of fit allows one to identify whether non-stationarity is a potential issue affecting the global fit of the OLS model.

Figures 1-5 show local R2 values for GWR models for Al, Ca, Na, K, and Fe respectively. As we may expect, non-stationarity is an issue for all five models. Interestingly, the results for Al (Fig. 1) and Ca (Fig. 2) are nearly identical, despite being based on different dependent variables. Both show higher levels of local fit to the west, and lower levels of fit to the east. The results for Na (Fig. 3) are similar; local fit is highest in the west, and lowest on the east side of the hill from this area. Local fit for K (Fig. 4) is low across the map, but is highest in the east and lowest in the west. Local fit for Fe (Fig. 5) is substantially greater in the west than in the central Tlacolula Valley. But again, how do we interpret this information?

GWR_Al

Figure 1 (above): Local fit of GWR for Aluminum.

GWR_Ca

Figure 2 (above): Local fit of GWR for Calcium.

GWR_Na

Figure 3 (above): Local fit of GWR for Sodium.

GWR_K

Figure 4 (above): Local fit of GWR for Potassium.

GWR_Fe

Figure 5 (above): Local fit of GWR for Iron.

Local measures of R2 simply tell us the strength of the relationship between local variability in the predictors vs. the response. If either our predictor variable or response variable is spatially auto-correlated, we may expect this to result in low local variability, which would tend to result in poor local fit. Higher local variability in either the predictor or response variables will not necessarily yield a stronger local fit, but if there is a relationship between variables at the global scale, this helps.

To illustrate, if we map raw concentrations of Ca and Fe (Fig. 6 and 7) in clay samples collected from across the Tlacolula Valley, we can see that local variability in each is highest in the west and lowest in the central and eastern portions of the study area. In both cases, high local R2 in the west simply reflect greater local variability in these elements in this area. For Ca, this appears to be driven by local contrasts between extreme values associated with limestone bedrock in close proximity to alluvial sediments with more normative values. In areas with less local variability in Ca concentrations, variability in spectral reflectance/emissivity and slope is less predictive.

GWR_CaPPM

Figure 6 (above): Calcium concentrations in Tlacolula clay samples (ppm).

GWR_FePPM

Figure 7 (above): Iron concentrations in Tlacolula clay samples (ppm).

Punchline

Results of exploratory multiple regression have shown that concentrations of individual elements in Tlacolula clay deposits are weakly correlated with spectral reflectance, emissivity, slope and flow accumulation at the global scale of the study. At a finer scale, the results of GWR suggest that clay chemistry is poorly predicted by these variables, except in areas with high local variability in a given response element. This may indicate that the weak global relationships observed through OLS regression may be driven in large part by spatially-clustered extreme values, violating the assumption of stationarity. GWR has thus provided a robust check on the assumptions of our OLS regression models, as well as a clearer view to the factors influencing Tlacolula clay chemistry.

References

Mendenhall and Sincich (2003) Regression Analysis: A Second Course in Statistics. Sixth Edition. Pearson Education LTD., London.

Fotheringham, A. S., Brunsdon, C., and Charlton, M. E. (2002). Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Wiley, Chichester.

Introduction
Hot Spot Analysis is a common method for identifying spatially significant clusters of high and low values among features. While feature values may be easily mapped and their histogram distributions examined to identify areas or points with high or low values, it is difficult to tell whether these values are spatially significant.

In ArcGIS, the Hotspot Analysis tool calculates the Getis-Ord Gi* statistic (Figure 1) for each feature in a dataset. It then outputs a z-score and p-value for each feature, allowing one to map the spatial distribution of significant values to identify local hotspots. To be a statistically significant hotspot, a feature must not only have a high or low value, but must be surrounded by features with similarly high or low values.

Getis_Ords_G-star and Spatial Autocorrelation implementation in ArcView-1

Figure 1: Equation for the Getis-Ord Gi* statistic.
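Since the figure image is not reproduced here, the statistic as it is commonly written (for example in the ArcGIS documentation) is:

G_i^* = \frac{\sum_{j=1}^{n} w_{i,j} x_j - \bar{X} \sum_{j=1}^{n} w_{i,j}}{S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{i,j}^{2} - \left( \sum_{j=1}^{n} w_{i,j} \right)^{2}}{n-1}}}, \qquad \bar{X} = \frac{1}{n} \sum_{j=1}^{n} x_j, \qquad S = \sqrt{\frac{1}{n} \sum_{j=1}^{n} x_j^{2} - \bar{X}^{2}}

where x_j is the value of feature j, w_{i,j} is the spatial weight between features i and j, and n is the total number of features. The result is itself a z-score.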

As we shall see, significance for the Getis-Ord Gi* statistic depends largely on the calculation of a local sum for a feature and its neighbors. This is compared proportionally to the sum of all features to identify regions where the observed local sum is significantly different from the expected local sum, that is, where the difference is too large to be the result of random chance.

But how does the Hotspot Analysis tool determine what is “local”?

When the threshold distance parameter is left blank, a default threshold equal to the Euclidean distance that ensures that every feature has at least one neighbor is computed and applied to the entire dataset. An implication of this is that Hotspot Analysis is extremely sensitive to the spatial distribution of features, making some data inappropriate for this type of analysis.
In this tutorial, we will explore how manual changes in threshold distance affect our view of what is significant using clay chemistry data from the Valley of Oaxaca, Mexico. I will argue that because clay sample locations were researcher-selected rather than the product of natural processes, Hotspot Analysis is not appropriate for this data.

Data, Analysis, Results

The Valley of Oaxaca is a broad Y-shaped valley located in the central highlands of Oaxaca, Mexico and the heart of ancient Zapotec Civilization. Over the past several years, researchers at the OSU Archaeometry Laboratory have collected over 300 clay samples from across the Valley for elemental analysis using Instrumental Neutron Activation Analysis (INAA) to establish a comparative basis for the geochemical fingerprinting of ancient pottery from the region.
OaxacaGeo

Figure 2: Clay Sampling locations relative to regional geology in the Valley of Oaxaca, Mexico.

Our ability to identify pottery produced in different parts of the Valley is due in large part to the geologic complexity of the region. In each area of the Valley, clay composition is strongly conditioned by parent material, sometimes representing the product of mixed sediments from multiple sources. For this exercise, we will focus on two elements, lanthanum (La) and cesium (Cs), found in elevated concentrations in clays derived from gneiss and rhyolite respectively. La concentrations tend to be high along the western side of the valley (Figure 3), while Cs concentrations are high in the east (Figure 4).
OaxacaLa

Figure 3: La concentrations in samples from the Oaxaca Clay Survey

OaxacaCs
Figure 4: Cs concentrations in samples from the Oaxaca Clay Survey

A hotspot analysis of La concentrations from across the Valley using an unspecified (default setting) threshold distance (Figure 5) identifies significantly high hotspots of La along the central portion of the western edge of the valley – as expected – as well as significant cold spots along the eastern edge of the southern arm of the valley and along the northern side of the eastern arm. A similar analysis of Cs values identifies Cs cold spots along the western edge of the valley and Cs hot spots near the southern and eastern sides of the eastern arm (Figure 6).
OaxacaLaHotspot

Figure 5: Hotspot analysis of La concentrations using an unspecified threshold distance.

OaxacaCsHotspot

Figure 6: Hotspot analysis of Cs concentrations using an unspecified threshold distance.

Broadly speaking these patterns are in accord with our previous understanding of regional clay composition, but to what degree are the significance levels identified by hotspot analysis dependent upon our unspecified threshold distance?

To address this question, we conducted a series of three additional hotspot analyses of Cs concentrations using specified threshold distances of 100 m (Figure 7), 1,000 m (Figure 8), and 10,000 m (Figure 9). Results of these analyses show that the number of features identified as significant hot or cold spots increases as a function of threshold distance. Using a threshold of 100 m, only a few points in the far eastern portion of the Valley are identified as significant hotspots. This pattern largely holds true using a 1,000 m threshold with a few additional points in the same limited area defined as significant hotspots. However, using a threshold of 10,000 m, the majority of features on the eastern side of the Valley are classified as significant hotspots, while the majority of those in the west are classified as significant cold spots. When nearly everything is significant, what is significant? Clearly 10,000 m is an improper choice of threshold distance, but this raises the threefold issue of how the default threshold is determined, whether this distance best describes the data, and whether the assumptions of this tool are met by the data.
OaxacaCsHotspot100

Figure 7: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 100 m.
OaxacaCsHotspot1000

Figure 8: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 1000 m.

 

OaxacaCsHotspot10000
Figure 9: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 10,000 m.
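The same sensitivity test can be sketched outside ArcGIS with R’s spdep package, where the threshold distance enters through the neighbour definition; coords and cs below are hypothetical objects holding the projected sample coordinates and the Cs concentrations.

library(spdep)

gi_at_threshold <- function(threshold) {
  nb <- include.self(dnearneigh(coords, d1 = 0, d2 = threshold))  # neighbours within the threshold, plus self (Gi*)
  lw <- nb2listw(nb, style = "B", zero.policy = TRUE)             # binary weights; allow features with no neighbours
  localG(cs, lw, zero.policy = TRUE)                              # Gi* z-scores
}

gi_100 <- gi_at_threshold(100)
gi_1000 <- gi_at_threshold(1000)
gi_10000 <- gi_at_threshold(10000)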

 

Discussion

Results of a series of hotspot analyses of La and Cs concentrations in clay samples from the Valley of Oaxaca largely confirm what we already knew: La concentrations are high in clays along the western side of the valley and Cs concentrations are highest in clays on the eastern side of the valley. Hotspot analysis allows us to identify spatially significant clusters of points with higher or lower than expected values, but as we have seen, what gets flagged as significant is strongly dependent upon one’s choice of threshold distance.

By default, ArcGIS selects a threshold distance equal to the minimum distance required to ensure that all features have at least one neighbor. This means that the default minimum distance is contingent upon the spatial distribution of features, and is likely driven by features at the edge of the dataset and/or spatial outliers.

This raises the larger issue of what factors drive the spatial distribution of features in one’s dataset. Are the location of individual features the product of the physical or social environment? Is their spatial distribution critical to your understanding of the data? Or are the locations researcher-selected according to some regular, stratified, or arbitrary sampling design that is not a product of the environment, but of how one chose to sample that environment?

In the case of the Oaxaca Clay Survey, clay sampling locations were selected opportunistically from subsurface horizons along roadcuts and streambanks, with greater sampling intensity near key archaeological sites and sparser sampling in areas with few important sites. Because the sampling locations were arbitrarily selected, their overall distribution is not of interest. Nor are the inter-point distances used to calculate the Getis-Ord Gi* statistic.

Because the spatial distribution of clay sampling locations tells us more about researcher behavior than clay formation processes, and Hotspot Analysis uses this information in its calculation of z-scores and significance levels, Hotspot Analysis is not appropriate for this data.

Conclusion

Before using Hotspot Analysis, one should always verify (1) that the spatial locations of features of interest are the product of the physical or social environment rather than researcher-selection; and (2) that the default local threshold value is not driven by spatial outliers.

Introduction

I have done a cluster analysis on the fire regimes produced by the MC2 dynamic global vegetation model (DGVM) run across the Pacific Northwest (PNW) over the time period 1895-2100 (see my previous Cluster Analysis in R post for details). Clusters were based on fire return interval (FRI), carbon consumed by fire, and fraction of cell burned. MC2 is a process-based model with modules for biogeochemistry, fire, and biogeography. Process models are generally complicated models with high computational costs. I wanted to explore the question “How well could a statistical model reproduce the results from a process-based model?”

A complete answer to that question would take much work, of course, as MC2 produces data on vegetation types, carbon pools, carbon fluxes, water flows, and, of course, fire. But this investigation provides a starting point and has the potential to show that at least one aspect of the model could have a statistical alternative to process-based computation.

Method description

Overview

Inputs into the MC2 include monthly weather in the form of maximum temperature, minimum temperature, precipitation, and mean dew point temperature, as well as elevation, and soil depth. I chose to use these inputs with a set of logistic regression models (described below) to predict the fire regime cluster of each cell in the results from my cluster analysis. To simplify the number of regression variables, I summarized the monthly variables by “summer” (April-September), “winter” (October-January), and annually.

Logistic regression

Conceptually, logistic regression differs from linear regression in that logistic regression is designed to predict the probability of an outcome or occurrence in a binary data set. It is commonly used in models predicting the presence or absence of a phenomenon (e.g. fire occurrence, species occurrence) over a spatial domain. Logistic regression uses one or more continuous or binary explanatory variables to predict the binary response variable.

The formula used in logistic regression is:

e^t / (e^t + 1)

which is equivalent to

1 / (1 + e^-t)

where t is the logit, a linear function of the explanatory values. The graph of the function is S-shaped.
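
To see the S-shape directly, you can plot the function itself (a quick sketch; the range of t values here is arbitrary):

# Plot the logistic function 1 / (1 + e^-t) over an arbitrary range of logit values
t_vals = seq(-6, 6, 0.1)
plot(t_vals, 1 / (1 + exp(-t_vals)), type = 'l', xlab = 'logit (t)', ylab = 'probability')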

An example in R should make things clearer:

Make a dataset with an independent variable and a dependent binary variable:

myDS = data.frame(
  iv = 1:50, # independent variable
  dv = c(rep(0,25), rep(1,25))
)

plot(myDS$iv,myDS$dv)

Your graph should look like this:

SheehanLogRegPlot1

Scramble the dependent variables between 20 and 30 to make the plot messier, like real life:

myDS$dv[21:30] = sample(1:0,10,replace=TRUE)

plot(myDS$iv,myDS$dv)

Your graph should look similar to this one, with the regions of 1 and 0 values overlapping:

SheehanLogRegPlot2

Now run a logistic regression on the data and look at the summary of the result:

tstGlm = glm(dv ~ iv, data = myDS, family = binomial)

summary(tstGlm)

Your summary should look something like this:

Call:
glm(formula = dv ~ iv, family = binomial, data = myDS)

Deviance Residuals:
Min       1Q   Median       3Q       Max
-1.55618 -0.15882 -0.01953   0.15884   2.01358

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.0868     3.0071 -3.022 0.00251 **
iv           0.3429     0.1113   3.080 0.00207 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 69.235 on 49 degrees of freedom
Residual deviance: 19.165 on 48 degrees of freedom

AIC: 23.165

Number of Fisher Scoring iterations: 7

There is a good explanation of these in this video: https://www.youtube.com/watch?v=xl5dZo_BSJk. A good reference for formulas used can be found at http://documentation.statsoft.com/portals/0/formula%20guide/Logistic%20Regression%20Formula%20Guide.pdf

But the short explanation is:

Call: The call that was made to the glm function

Deviance residuals: Not really useful in a binary response regression

Coefficients:

  • Intercept: The intercept, or constant coefficient, used in the logit linear function
  • Estimate: The coefficient used with each variable in the logit (in this case we have one variable in the logit)
  • Std. Error: The standard error associated with each coefficient.
  • Z value: The z value for the coefficient. This is the Estimate divided by the Std. Error.
  • Pr(>|z|): The p value associated with each z value.

Null deviance: A measure of deviance between the fitted model and a model that has only the constant coefficient.

Residual deviance: A measure of deviance for the fitted model itself.

AIC: Akaike information criterion. This is a measure of the quality of the model based on goodness of fit and model simplicity. For models created using the same number of data points, a model with a lower AIC is considered better than one with a higher AIC.
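
For example, AIC values can be pulled straight from fitted models for comparison (a small sketch reusing the toy data above; glmFull and glmNull are names introduced here, not part of the analysis):

# Compare a model with the explanatory variable against an intercept-only model;
# the lower AIC indicates the preferred model
glmFull = glm(dv ~ iv, data = myDS, family = binomial)
glmNull = glm(dv ~ 1, data = myDS, family = binomial)
AIC(glmFull, glmNull)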

Make a plot of the regression function along with the original points. You have to use a data frame in which the column has the same name as the column you used to create the regression:

tmpData = data.frame(iv = seq(0,50,0.2))

plot(seq(0,50,0.2), predict(tstGlm, newdata = tmpData, type = 'response'), type='l')

points(myDS$iv,myDS$dv)

SheehanLogRegPlot3

Notice how the logistic curve, which predicts the probability of a positive outcome, moves from a y value of 0 to a y value of 1 where the dependent values from the original dataset are mixed between 1 and 0. From this, it is easy to see how logistic regression predicts probabilities for binary data.

Logistic regression can also be used with multiple explanatory variables. Each explanatory variable has a term in the logit, and each term’s coefficient determines its influence on the final prediction.
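
As a quick sketch of the multivariate case (reusing the toy data above and adding a second, purely random explanatory variable that I am calling iv2):

# Add a noise variable; its coefficient should be small and non-significant
# compared to the coefficient for iv
myDS$iv2 = rnorm(50)
tstGlm2 = glm(dv ~ iv + iv2, data = myDS, family = binomial)
summary(tstGlm2)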

Stepwise Logistic Regression

Finding the best combination of explanatory values to use can present a challenge. Often, there are more variables than desired, and using too many can result in an overfitted model. One method used to determine which variables to use is stepwise regression. There are two approaches to this method: forward and backward.

In forward stepwise regression, the algorithm starts with the constant coefficient. Then it tests all of the explanatory values to find the one that, by itself, has the greatest explanatory power. It computes the AIC for the model and then tests the remaining explanatory variables to find which adds the greatest explanatory power. It then computes the AIC with the added variable. If the AIC is lower (better) with the new explanatory variable, it repeats the process. If the AIC is greater, it discards the new variable and stops.

Backward stepwise regression takes the opposite approach. It starts with all the variables and removes the one with the least explanatory value and then calculates AIC. It stops at the point when removing a variable would raise the AIC.

One approach to finding a model with stepwise regression is to do both forward and backward regression and see if one returns a better result than the other. Additionally, the results can be modified by removing any explanatory variables that have a high p-value to see if doing so yields a better model. I used both forward and backward regression, and in two cases found model improvement by removing an explanatory variable with a high p-value.

A problem with stepwise regression is that it does not guarantee the best possible regression function. To find that, one has to test all possible combinations of explanatory variables, but that is generally a very compute-intensive proposition. So, even though it is easy to find criticisms of stepwise regression, it is commonly used.

Without going into too much detail, here is some R code that performs stepwise regression:

# Generate some independent, explanatory variables. Note how some are more “scrambled” than others

indepVs = data.frame(
  v1 = c(1:50),
  v2 = c(1:50),
  v3 = c(1:50),
  v4 = c(1:50),
  v5 = c(1:50),
  v6 = c(1:50)
)

indepVs$v2[23:27] = sample(c(23:27),5)
indepVs$v3[21:30] = sample(c(21:30),10)
indepVs$v4[16:35] = sample(c(16:35),20)
indepVs$v5[11:40] = sample(c(11:40),30)
indepVs$v6[1:50] = sample(c(1:50),50)

# The dependent variable, scrambled in the middle

depV = rep(0,50)
depV[26:50] = 1
depV[21:30] = sample(c(0:1),10,replace=TRUE)

# Generate a full model (using all the explanatory variables)

fullGlm = glm(
  depV ~ v1+v2+v3+v4+v5+v6,
  data = indepVs,
  family = binomial()
)

# Generate an empty model

emptyGlm = glm(depV ~ 1, data = indepVs,family=binomial)

# Do forward stepwise regression

forwardGlm = step(
  emptyGlm,
  scope = list(lower=formula(emptyGlm), upper=formula(fullGlm)),
  direction='forward'
)

# Do a backward stepwise regression

backGlm = step(fullGlm)

Plotting the results and examining the resulting models using summary() will give you insights into how the process is working.
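
For instance, the two stepwise results can be compared directly by AIC, and with only six candidate variables it is even feasible to brute-force every combination (a sketch using the objects created above; with many more variables this quickly becomes impractical):

# Compare the forward and backward stepwise models; lower AIC is preferred
AIC(forwardGlm, backGlm)

# Brute force: fit every combination of the six explanatory variables (63 models)
# and keep the one with the lowest AIC
allData = cbind(indepVs, depV = depV)
vars = names(indepVs)
bestAIC = Inf
for (k in 1:length(vars)) {
  for (combo in combn(vars, k, simplify = FALSE)) {
    fit = glm(reformulate(combo, response = 'depV'), data = allData, family = binomial)
    if (AIC(fit) < bestAIC) { bestAIC = AIC(fit); bestVars = combo }
  }
}
bestAIC
bestVars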

Geographically weighted logistic regression

I did some searching for geographically weighted logistic regression and found a couple of papers that used the method. I did not come across any easily implemented solutions for it (which does not mean one does not exist). I think doing geographically weighted logistic regression could have a lot of value, especially over large and varied geographic areas, like the one I’m studying.

I did a logistic regression on a single EPA Level III ecoregion (the Blue Mountains) within my study area and compared the results to those from the model produced using the whole study region, and found that the ecoregion-specific model had a higher predictive value within the ecoregion. This is an indication that geographically weighted regression could provide advantages.
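
In the meantime, fitting a separate logistic regression to a geographic subset is straightforward. Here is a minimal sketch; cellData, ecoregion, inRegime1, and the explanatory variable names are placeholders, not the actual files from this analysis:

# Hypothetical sketch: one model for the whole study area, one for a single ecoregion
fullModel = glm(inRegime1 ~ tmax_ann + ppt_ann + elev, data = cellData, family = binomial)
blueMtn = subset(cellData, ecoregion == "Blue Mountains")
bmModel = glm(inRegime1 ~ tmax_ann + ppt_ann + elev, data = blueMtn, family = binomial)

# Compare how well each model predicts within the ecoregion
pFull = predict(fullModel, newdata = blueMtn, type = "response")
pBM = predict(bmModel, newdata = blueMtn, type = "response")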

The Analysis

I used logistic regression as part of a process to predict which fire regime each cell in my dataset would have. My explanatory variables were derived from the input dataset I used for modeling; these included annual, summer, and winter means for maximum temperature, minimum temperature, precipitation, and mean dewpoint temperature, plus elevation and soil depth. For each of my four fire regimes, I generated a logistic regression predicting whether a cell was in that fire regime or not. Based on recommendations from https://www.medcalc.org/manual/logistic_regression.php, I used this formula to determine the number of data points to use in each regression:

10 * num_exp_vars / min(frac_pos_cases, frac_neg_cases)

where num_exp_vars is the number of explanatory variables, frac_pos_cases is the fraction of positive (presence) cells in the dataset, and frac_neg_cases is the fraction of negative cases in the dataset.
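
For example, with the 14 explanatory variables listed above and a fire regime covering a quarter of the cells (the 25% figure here is just an illustrative value), the rule works out to:

# Illustrative calculation of the minimum number of data points per regression;
# frac_pos_cases = 0.25 is a made-up example value
num_exp_vars = 14      # 12 seasonal/annual climate means plus elevation and soil depth
frac_pos_cases = 0.25
frac_neg_cases = 1 - frac_pos_cases
10 * num_exp_vars / min(frac_pos_cases, frac_neg_cases)   # 560 data points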

This gave me four logistic regression functions. For each cell in the study area, I ran each of the logistic functions and recorded the probability returned. The predicted fire regime for a cell was assigned based on the logistic function that produced the highest probability. I repeated this entire procedure for the Blue Mountain Ecoregion only.
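
The assignment step amounts to taking, for each cell, the regime whose logistic function returns the largest probability. A minimal sketch, assuming four fitted models glmFR1 through glmFR4 and a data frame cellData of explanatory values (all placeholder names):

# Probability of each regime for every cell, one column per regime
probs = cbind(
  predict(glmFR1, newdata = cellData, type = "response"),
  predict(glmFR2, newdata = cellData, type = "response"),
  predict(glmFR3, newdata = cellData, type = "response"),
  predict(glmFR4, newdata = cellData, type = "response")
)

# Predicted regime = column (regime) with the highest probability in each row (cell)
predictedRegime = apply(probs, 1, which.max)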

Results

Across the full study area, this process correctly predicted the fire regime with 61% accuracy (keep in mind that random assignment would be expected to return 25% accuracy). Within the Blue Mountain Ecoregion, the process predicted the fire regime with 46% accuracy using the logistic functions derived from the data for the entire study area, and with 74% accuracy using the logistic functions derived for the Blue Mountain Ecoregion alone. This is a strong indication that geographic influences exist and that geographically weighted logistic regression has the potential to improve results.
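
Accuracy here is simply the fraction of cells whose predicted regime matches the regime from the cluster analysis (a sketch, with observedRegime standing in for the cluster-analysis result):

# Overall accuracy: fraction of cells where predicted and observed regimes agree
mean(predictedRegime == observedRegime)

# A confusion table shows which regimes are being mistaken for which
table(observed = observedRegime, predicted = predictedRegime)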

Method Critique

I was very pleased with the results of this method, and frankly, a little surprised at how successful the method was at predicting fire regime. This investigation leads me to believe that statistical models should be considered as alternatives to process-based models for some vegetation modeling tasks, especially considering the lower computational overhead of statistical models.

The biggest downside to the method is the manual procedures involved in finding the best logistic function. This process could be automated with some programming, however, as could a geographically weighted version of logistic regression.

The goal of this tutorial is to expose users to interpolation methods using packages in R. I found that doing these in R gives you very informative insight into how the interpolation methods operate, since you have to specify each option yourself instead of just clicking the kriging tool in ArcGIS.

Just a quick refresher:

1) Load the necessary libraries. The main libraries I need for interpolation are gstat and raster.

2) Import all the data that we need. The first dataset is a polygon shapefile, which I will use only for its projection and as a backdrop for displaying the maps. There is also a .csv file with all the points I need, which I will convert to a point vector datatype. Last time, I extrapolated the point-layer pH values over each of their respective polygons; if you want more detail on how exactly this was accomplished, please read my previous blog post: Spatial Data Management and Simple Plotting with R.

## Load the required packages; this list covers the functions used below
## (readOGR, SpatialPointsDataFrame, idw/variogram/krige, left_join, tm_shape)
x <- c("rgdal", "sp", "raster", "gstat", "dplyr", "tmap")
lapply(x, library, character.only=T)

## This is a polygon shapefile, read in with readOGR
CRA <- readOGR(dsn="Data/NRCS_Data", layer="Final_CRA")

## OGR data source with driver: ESRI Shapefile
## Source: "Data/NRCS_Data", layer: "Final_CRA"
## with 60 features
## It has 4 fields

## Reading in a .csv file with all the attributes and removing the rows with missing pH values.
## (The .csv file name here is a placeholder.)
NRCS_data <- read.csv("Data/NRCS_Data/NRCS_points.csv")
NRCS_data <- NRCS_data[!is.na(NRCS_data$pH_water), ]

NRCS_data

## Creates a matrix of the long/lat coordinates
## (assuming the coordinate columns are named Long and Lat)
NRCS_mat <- cbind(NRCS_data$Long, NRCS_data$Lat)
row.names(NRCS_mat) <- 1:nrow(NRCS_mat)

## Takes the projection from the CRA shapefile
CRA.projection <- proj4string(CRA)

## [1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"

## Creates a spatial points dataframe with the same projection as our CRA shapefile
NRCS <- SpatialPointsDataFrame(coords=NRCS_mat, data=NRCS_data, proj4string=CRS(CRA.projection))
NRCS@data

## Aggregating my attributes by CRA, finding the mean pH for each CRA
agg_NRCS_data <- NRCS@data %>% group_by(Final_CRA) %>% summarise(pH_water = mean(pH_water))
agg_NRCS_data

## Merging the two datasets together by CRA
CRA@data <- left_join(CRA@data, agg_NRCS_data)

## Joining by: "Final_CRA"

## Warning in left_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector

CRA@data

## Plotting the extrapolated CRA polygon layer. An in-depth explanation of this plotting code is given with the IDW code below.
tm_shape(CRA)+ tm_fill("pH_water",breaks=c(-Inf, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, Inf), auto.palette.mapping=F, legend.show = F)+tm_borders()+tm_shape(NRCS)+ tm_bubbles("red", size = .1, border.col="black")+tm_layout(title="Extrapolation of pH Across CRAs", title.position = c("center", "top"))

CRA

 

In order to do any form of interpolation, you need to define the resolution at which the interpolation will happen. In ArcGIS this is done automatically; however, in R you need to create a raster layer with the resolution you want. Here I create a blank raster layer that has 100 rows and 100 columns. You can change the resolution to whatever you want by editing the ncol and nrow variables.


## Creating a raster grid using the raster package
grd <- raster(ncol=100, nrow=100)

## "Spreading" the grid to encompass the entire CRA polygon layer
extent(grd) <- extent(CRA)

## Giving the grid a projection
crs(grd) <- CRA.projection

## Changing the variable class from a raster to a SpatialPixels object. Both are gridded
## rasters; however, SpatialPixels is the format that the gstat package prefers.
## (There are several ways to do this conversion; here I build the grid from the raster's cell coordinates.)
NRCS.raster <- as.data.frame(xyFromCell(grd, 1:ncell(grd)))
coordinates(NRCS.raster) <- ~x+y
crs(NRCS.raster) <- CRA.projection
gridded(NRCS.raster) <- TRUE

## This is just a quick plot to see that the raster grid has the correct projection and matches up correctly.
plot(NRCS.raster)
lines(CRA, col="blue")

grid_plot

This is a plot of the polygon layer over the empty gridded raster layer.

Now that we have all of our layers, it’s time to do the inverse distance weighting (IDW) and ordinary kriging. IDW and kriging both use only the geographic distance from known points to unknown points. IDW is simpler, since it is just a weighted average based strictly on distance from the known points. Kriging is a little more sophisticated, since a model of the semivariogram is fit and used to calculate the weights.
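
To make the distinction concrete, here is a hand-rolled IDW estimate for a single unknown location (the coordinates and pH values are made up; the gstat idw call below does all of this, and more, for you):

## Toy IDW estimate at one unknown location from three known points
known_xy = matrix(c(0, 0, 1, 0, 0, 1), ncol = 2, byrow = TRUE)
known_pH = c(5.5, 6.0, 7.0)
unknown = c(0.25, 0.25)

## Euclidean distances from the unknown location to each known point
d = sqrt(rowSums((known_xy - matrix(unknown, nrow = 3, ncol = 2, byrow = TRUE))^2))

## Inverse-distance weights (a power of 2 is gstat's default) and the weighted average
w = 1 / d^2
sum(w * known_pH) / sum(w)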

OK, first we are going to do the IDW. To do this we need the points, the z-value we want to interpolate (pH), and the resolution we want to interpolate over (our blank raster layer).

## In gstat the IDW function is called idw. We want to interpolate the pH_water variable in the NRCS point shapefile over the resolution of the NRCS.raster layer.
idw.out <- gstat::idw(pH_water~1, NRCS, newdata=NRCS.raster)

## [inverse distance weighted interpolation]

## Here we are backing up the IDW raster layer and trimming it to the same extent as our CRA polygon layer.
idw.out.BU <- idw.out
idw.out <- idw.out.BU[CRA,]

## Now to plot it. I know this piece of code looks intimidating; however, that is only because you have to create every little aspect of your plot layer by layer. First comes the pH IDW raster layer. I want the colors of the raster layer to be partitioned by the pH bins that I specify after breaks. I also want to turn off the legend since it is a little cluttered. The next layer is the polygon layer, where I am just displaying the outlines of each polygon. The last layer is the point layer, in which I am displaying each point as blue. tm_layout is just for formatting the title and positioning it in the top center.

tm_shape(idw.out)+tm_raster(breaks=c(-Inf, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, Inf), auto.palette.mapping=F, legend.show = F)+tm_shape(CRA)+tm_borders()+tm_shape(NRCS)+ tm_bubbles("blue", size = .1, border.col="black")+tm_layout(title="Interpolation using IDW", title.position = c("center", "top"))

IDW_plot

Now for kriging. To do kriging in R, you first need to examine the semivariogram, then model the semivariance, and finally use that model for your interpolation.

## Creating the semivariogram and plotting it.
cvnrcs<-variogram(pH_water~1, locations=NRCS, data=NRCS)
plot(cvnrcs)

VAR_1_plot

## Though this variogram does not look great, it will work for the purposes of this exercise. In reality, though, I would be quite hesitant to do any more interpolation with this data.
## Looking at the variogram, I chose to go with a spherical model that has a sill of .06, a nugget of .01, and a range of 100.
nrcs.fit.sph <- fit.variogram(cvnrcs, model = vgm(psill=.06, model="Sph", nugget=.01, range=100))
plot(cvnrcs, nrcs.fit.sph)

Var_2_plot

## The fit isn't completely bad; however, it is not great.
## Now to create the ordinary kriged raster layer, then back it up and trim it to the same extent as the polygon layer.
pH.ok <- krige(pH_water~1,NRCS, NRCS.raster, model=nrcs.fit.sph)

## [using ordinary kriging]

pH.ok.BU <- pH.ok
pH.ok <- pH.ok.BU[CRA,]

## This is essentially the same plotting code as the IDW plot; however, the raster layer I am using is the one created by ordinary kriging.
tm_shape(pH.ok)+tm_raster(breaks=c(-Inf, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, Inf), title="Soil pH", auto.palette.mapping=F, legend.show = F)+tm_shape(CRA)+tm_borders()+tm_shape(NRCS)+ tm_bubbles("blue", size = .1, border.col="black")+tm_layout(title="Interpolation using Ordinary Kriging", title.position = c("center", "top"))

krig_plot

In examining the plots, you can see that the pH values on the western side of Oregon are preserved fairly well, since there is not much heterogeneity there. However, as you move east, you see that each map displays something different. This is because factors other than distance influence pH. Also, the sampling density in eastern Oregon is not high enough to capture the heterogeneity of the landscape, as you can see from how the IDW and ordinary kriging blend much of it together.

However, now you know how to use this functionality in R, and this is only scratching the surface. Though R may not be as user friendly and its graphics may not be as polished, its ability to truly customize each step of the analysis makes it well worth the effort of learning how to conduct these analyses.