Prologue

This post got kind of out-of-hand, but it represents my attempt to understand what was going on under the hood of geographically-weighted regression (GWR), especially in terms of my own data. I found it personally helpful to back up, start with the basics of multiple regression, and work my way back to GWR to see what it was and was not telling me about my data. Much of this will be review for members of our class, but I hope that you find something of use here.

Introduction

Regression analysis refers to a class of statistical procedures used to assess the degree to which one variable (y) may be predicted by one or more other, independent variables (x1, x2, … , xk). Simple linear regression models use a single continuous variable to predict a response variable, and are written using the familiar equation y = mx + b, where y is the response variable, x is the predictor, m is the slope coefficient of x, and b is the y-intercept of the regression line on a Cartesian grid. Multiple regression models are similar, but use more than one continuous predictor variable. These are typically written as: y = β0 + β1x1 + β2x2 + … + βkxk + ε, where β refers to each variable's coefficient and ε to the model error.

With the exception of logistic regression, variables on both sides of a regression equation must be continuous (rather than categorical) and must meet four key assumptions: (1) the mean of the probability distribution of ε must be 0; (2) the variance of the probability distribution of ε is constant; (3) the probability distribution of ε is normal; and (4) the errors associated with any two observations must be independent (Mendenhall and Sincich 2003:105-106).

The complexity of multiple regression raises a number of issues, especially when dealing with spatial data. In this tutorial, we will explore statistical approaches to three common problems in multiple regression. First, how do we deal with collinearity between our predictor variables? That is, how do we deal with predictor variables that are not independent of one another? Second, how do we decide which predictor variables to include in our model? And finally, how do we identify geographically patterned deviations from a global regression model in spatial data?

To address the first issue, we will discuss a statistical technique for reducing the dimensionality of a dataset called Principal Components Analysis. To address the second, we will discuss Exploratory Multiple Regression. To address the third, we will discuss Geographically Weighted Regression.

Case Study

To illustrate this discussion, we will be working with three sets of data from my research in Oaxaca. Namely, we will assess the degree to which we can predict concentrations of five major and minor elements in subsurface clay deposits using data derived from remote sensing. For me, these data are of interest because understanding the distribution of compositionally distinct clay resources may help us track ancient pottery production and exchange in the region. For this tutorial, these data serve simply as an example of a complex multivariate spatial problem. Each dataset used in this study is outlined briefly below:

  1. Tlacolula Clays: These data consist of 137 point features from across the eastern arm of the Valley of Oaxaca, the Tlacolula Valley. During the summers of 2007 and 2012, clay samples were collected from each of these locations, exported to OSU, and analyzed using INAA to estimate the elemental composition of each sample across 30 elements. For this exercise, we focus on just five major and minor elements: Aluminum (Al), Calcium (Ca), Sodium (Na), Potassium (K), and Iron (Fe). These will act as our dependent variables.
  2. ASTER Bands: These data consist of spectral reflectance and emissivity measurements for 13 bands of remote sensing imagery extracted from a single tile of ASTER data from a cloud-free day in June of 2001. These include two Visible and Near Infrared (VNIR) bands taken at a spatial resolution of 15 m, six Short-wave Infrared (SWIR) bands taken at a spatial resolution of 30 m, and five Thermal Infrared (TIR) bands taken at a spatial resolution of 90 m. A common use of ASTER data is the mapping and classification of surface geology. If parent material is a good predictor of clay composition in this area, ASTER imagery may serve as a good proxy measure of sediment lithology. These data will therefore act as our first set of predictor variables.
  3. DEM-derived surfaces: Beyond parent material, we may expect the chemical composition of clay deposits to be affected by processes such as soil development, erosion, and alluvial movement of sediments. All of these are difficult to measure directly, but are heavily affected by factors such as slope, which are easily measured using remote sensing. We therefore added a second set of possible predictor variables to our analysis, derived from a single ASTER 90 m DEM. These include elevation, slope, and flow accumulation. All three of these variables were log-transformed to better approximate a normal distribution.

Software

All of the statistical methods described in this tutorial are available in multiple software packages. I used a combination of ENVI (for manipulation of ASTER data), ArcGIS (for most other spatial analysis and data manipulation), and JMP (for non-spatial data analysis). I cannot speak to how these compare to other software with similar capabilities. In fact, I am confident you prefer your favorite software packages over mine 🙂 For this reason, I will emphasize methods and interpretation over software throughout this tutorial.

Principal Components Analysis

A primary concern in multiple regression analyses is the tendency to over-fit a model simply by including too many variables. This is problematic because it both overstates the degree to which a response variable can be predicted from a set of independent variables and makes the resulting model extremely difficult to interpret. Our dataset has sixteen predictor variables, which is almost certainly too many for the number of cases we are analyzing.

A common method for reducing the dimensionality of a dataset is Principal Components Analysis (PCA). PCA uses either the correlation or covariance matrix of a dataset to create an orthogonal rotation of new variables, called PCs, that are by definition uncorrelated. Each PC describes a portion of the overall variance in the dataset, as measured by its eigenvalue, with the first PC describing the principal axis of variability, the second PC a smaller portion of variability, and so on. The contribution of each input variable to each PC is described by its loading in the corresponding eigenvector.

When choosing PCs for further analyses, it is important to consider both the eigenvalues and the eigenvectors. How many PCs are required to describe the majority of the variability in your dataset? Do the eigenvector loadings on each PC make sense, knowing what you know about the structure of your data?
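
For readers who want to try this step themselves, here is a minimal sketch of a PCA on correlations in R; the data frame name aster_bands is a placeholder (the PCA in this post was run in other software) standing for a table with one column per ASTER band.

## Hypothetical sketch: PCA on correlations in R
## 'aster_bands' is an assumed data frame with one column per ASTER band
aster_pca <- prcomp(aster_bands, center = TRUE, scale. = TRUE)   # scale. = TRUE gives the correlation-matrix PCA
summary(aster_pca)                     # proportion of variance per PC (the eigenvalue table)
round(aster_pca$rotation[, 1:4], 2)    # eigenvector loadings on the first four PCs
pc_scores <- aster_pca$x[, 1:4]        # PC scores to carry forward as candidate predictors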

To reduce the number of predictor variables in our analysis, data from the thirteen ASTER bands were subjected to a Principal Components Analysis on correlations. DEM-derived predictor variables were excluded from this analysis because a preliminary evaluation of the data showed that they were sufficiently independent from the ASTER bands. All ASTER bands were screened for outliers prior to analysis to reduce the influence of extreme cases.

Results of this analysis are reported as eigenvalues in Table 1 and as eigenvectors in Table 2. Table 1 shows that as much as 90 percent of the variability in the ASTER data is described by the first two PCs, and 97 percent by the first three. However, beyond providing a reduced set of variables, we want our PCs to describe interpretively meaningful axes of variability that may be useful for predicting our response variables. Just because the first PC describes the most variability in the data does not mean that it is going to be the best predictor for all of our elements of interest.

Indeed, Table 2 shows that while the first PC is nearly equally loaded on all ASTER bands, the next three PCs are more intuitively meaningful. PC2 is positively loaded on the TIR bands and negatively loaded on all others. PC3 is positively loaded on the VNIR bands and negatively loaded on the SWIR bands, especially at lower wavelengths. PC4, while accounting for just 1.4 percent of the variability in the data, describes an important contrast between shorter and longer wavelength bands within the SWIR portion of the spectrum. For this reason, we might select the first four PCs as potential predictor variables for subsequent analyses.

Table 1: Results of Principal Components Analysis of 13 ASTER image bands covering the Tlacolula Valley showing the eigenvalue for each resulting PC, the percent of variability described by each, and the cumulative percent of variability described by aggregate combinations of PCs.

Pink_GWR_T1

Table 2: Results of Principal Components Analysis of 13 ASTER image bands covering the Tlacolula Valley showing eigenvector loadings on the first 4 PCs. Positive loadings above 0.30 are shown in red. Negative loadings below -0.30 are shown in blue. These four PCs account for 99% of the variability in the ASTER data.

Pink_GWR_T2

Exploratory Multiple Regression

Now that we have reduced the dimensionality of our dataset to five dependent variables (Al, Ca, Na, K, and Fe) and seven possible predictor variables (four ASTER PCs and three DEM-derived variables), we are faced with the problem of identifying the best possible combination of variables for the prediction of each response variable. This becomes especially problematic when one suspects that there may be some interaction between predictor variables (such as canopy cover and soil moisture content as predictors of fire susceptibility). In these cases, one might wish to consider adding interaction terms, which further adds to the dimensionality of one’s data. For the purposes of this exercise, interaction terms were not considered.

A common method for identifying an optimal combination of predictors is Stepwise Multiple Regression, which allows one to walk through a series of potential models based on iterative improvements in model fit. This method is widely accepted and robust, but carries the disadvantage of only considering a single measure of model fit, such as R2 or AICc.

A more brute-force method for identifying optimal combinations of predictor variables is Exploratory Multiple Regression, which takes advantage of modern computing power to fit every possible combination of predictor variables for each response. The results of these analyses may then be quickly sorted by the number of predictor variables and by multiple measures of model fit.
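
To make the idea concrete, here is a minimal sketch of an all-subsets search in R. I ran this step in JMP, so the data frame clay and the predictor names below are placeholders rather than the actual objects used here, and the AICc line uses one common small-sample correction.

## Hypothetical all-subsets regression sketch; 'clay' and the variable names are assumptions
preds <- c("ASTPC1", "ASTPC2", "ASTPC3", "ASTPC4", "logElev", "logSlope", "logFlowAcc")
fits <- do.call(rbind, lapply(seq_along(preds), function(k) {
  combos <- combn(preds, k, simplify = FALSE)          # every combination of k predictors
  do.call(rbind, lapply(combos, function(vars) {
    m <- lm(reformulate(vars, response = "Al"), data = clay)
    n <- nobs(m)
    p <- length(coef(m)) + 1                            # parameters incl. intercept and residual variance
    data.frame(model = paste(vars, collapse = " + "),
               n_vars = k,
               R2   = summary(m)$r.squared,
               RMSE = sqrt(mean(resid(m)^2)),
               AICc = AIC(m) + 2 * p * (p + 1) / (n - p - 1),  # small-sample corrected AIC
               BIC  = BIC(m))
  }))
}))
head(fits[order(fits$AICc), ])   # rank competing models by AICc (or BIC, RMSE, ...)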

Table 3 reports the results of an Exploratory Multiple Regression of all possible combinations of our seven variables as predictors of Al. To conserve space in the blogosphere, only best-fit models for combinations of one through seven variables are reported here. Predictably, R2 tends to increase as we add variables to the model. However, Root Mean Square Error (RMSE), Akaike Information Criterion (AICc), and Bayesian Information Criterion (BIC) scores are lowest for models with four to five variables. In multiple regression, AICc and BIC scores are used to evaluate competing models by punishing complexity while rewarding fit (lower scores are better). The models reported in Table 3 highlight the problem with using a single criterion for model selection; while both AICc and BIC are lowest for the model that uses five variables, only four of these variables are individually significant at α = 0.05. For this reason, I have selected the four-variable best-fit model for Al as the final model that I will use in further analyses.

Table 4 shows the final best-fit models chosen for all five elements using Exploratory Multiple Regression. In general, the relationship between our predictor variables and individual elements is weak, with R2 ranging from 0.116 for K to 0.296 for Al. All final combinations of variables were significant predictors of each element, as were the individual variables included in each model.

Table 3: Best-fit least squares regression models predicting the elemental concentration of Al in Tlacolula Valley clays using one to seven variables. Predictor variables were chosen from all possible combinations of models generated using 7 variables: ASTER PC1, ASTER PC2, ASTER PC3, ASTER PC4, Log10 elevation, Log10 slope, and Log10 flow accumulation.

Pink_GWR_T3

Table 4: Best-fit least squares regression models predicting the elemental concentration of five major and minor elements in Tlacolula Valley clays. Predictor variables were chosen from all possible combinations of models generated using 7 variables: ASTER PC1, ASTER PC2, ASTER PC3, ASTER PC4, Log10 elevation, Log10 slope, and Log10 flow accumulation.

Pink_GWR_T4

Notably, different combinations of variables were identified as the best predictors for each element. ASTPC1 (weakly loaded on all ASTER bands) was included in all models except that for K. ASTPC2 (positively loaded on the TIR bands) was included in models predicting Al and Ca, but not Na, K, or Fe. Surprisingly, ASTPC3 (positively loaded on the VNIR bands) was only included in the final model for Na, while ASTPC4 (positively loaded on low wavelength SWIR bands and negatively loaded on high SWIR bands, but describing just 1.4% of the variability in the data) was included in final models for Ca, Na, and Fe. Slope was included in final models for all elements except K, but only has a positive coefficient for Ca. Flow Accumulation was included as a negative coefficient in final models for Al, K, and Fe, showing that these elements tend to decrease in areas with higher potential runoff. Elevation was not included in any of the five final models.

A troubling issue with the final models that we have selected is their relatively poor predictive power. Generally speaking, it appears that even the best combinations of our predictor variables only account for about 20 to 30 percent of the variability in the elemental composition of clay deposits in the Tlacolula Valley. This raises the question of whether the relationships we have identified between our predictor variables and individual elements hold true across the study area, or whether our models are driven by strong deviations from the global fit at finer spatial scales. To address this issue, we may turn to Geographically Weighted Regression.

Geographically Weighted Regression

Over the course of this term, a number of us have attempted to use Geographically Weighted Regression (GWR) to generate prediction surfaces for particular response variables. This was in fact the original intent of my project. However, a common problem that we ran into was the difficulty of interpreting the results of geographically-weighted regression models. Is a GWR model with a higher R2 than an OLS model that uses the same variables a better model? What does it mean when local coefficients are positive in one area and negative in another? Should areas with poor local fit be excluded from output prediction surfaces?

My primary reason for writing this back-to-basics blog post on multiple regression stems from my current understanding of the intended use of GWR. As it turns out, Geographically Weighted Regression is primarily intended as a technique for exploring local deviations from global fit resulting from non-stationarity in one's data (Fotheringham et al. 2002). An implication of this is that GWR should be conducted only as a post-hoc analysis following selection of a final model using Ordinary Least Squares (OLS) regression. In fact, the help documentation for Geographically Weighted Regression in ArcGIS explicitly advises one to do this. A corollary to this implication might be that GWR is not really intended to generate prediction surfaces.

As discussed by multiple members of the class, GWR operates by generating a series of local fits to each feature in a spatial dataset by passing a Gaussian kernel smoother with either a fixed or an adaptive threshold distance across one's data. For this exercise, I conducted a GWR for each of our five elements of interest using the final model variables selected through Exploratory Multiple Regression. Table 5 summarizes the quasi-global R2 fit of the GWR models relative to the R2 values of OLS models using the same variables. For all five elements, R2 values are greater for the GWR models than the OLS models, but the size of the improvement varies by element. For Al and Ca, GWR models yield minimal improvements in fit over OLS models. By contrast, quasi-global GWR R2 values for Na and Fe are close to twice as high as their OLS counterparts.
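
For anyone working in R rather than ArcGIS, a minimal sketch of this step using the spgwr package might look like the following; the object name clay_sp and the exact variable names are assumptions, not the actual files used here.

## Hypothetical GWR sketch in R (the analysis in this post was run in ArcGIS)
library(spgwr)
## 'clay_sp' is assumed to be a SpatialPointsDataFrame holding Al and the four selected predictors
f  <- Al ~ ASTPC1 + ASTPC2 + logSlope + logFlowAcc
bw <- gwr.sel(f, data = clay_sp, adapt = TRUE)            # choose an adaptive kernel by cross-validation
gwr_al <- gwr(f, data = clay_sp, adapt = bw, hatmatrix = TRUE)
gwr_al                      # printout includes the quasi-global R2
head(gwr_al$SDF@data)       # local coefficients and local R2, ready to map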

Table 5: Quasi-global R2 fit of geographically weighted regression (GWR) models relative to those of best-fit least squares (OLS) models predicting the concentration of five major and minor elements in Tlacolula Valley clays.

Pink_GWR_T5

But how do we interpret this information? My best answer is that if GWR is primarily intended as a check for stationarity, then local measures of fit and local coefficients are much more important to our interpretation than the global fit. If we map the local R2 for each feature included in a given model, we find that the quasi-global R2 is a kind of maximum measure of fit; local R2 values are always lower than the quasi-global R2. Examining the spatial distribution of local coefficients and measures of fit allows one to identify whether non-stationarity is a potential issue affecting the global fit of the OLS model.

Figures 1-5 show local R2 values for GWR models for Al, Ca, Na, K, and Fe respectively. As we may expect, non-stationarity is an issue for all five models. Interestingly, the results for Al (Fig. 1) and Ca (Fig. 2) are nearly identical, despite being based on different dependent variables. Both show higher levels of local fit to the west, and lower levels of fit to the east. The results for Na (Fig. 3) are similar; local fit is highest in the west, and lowest on the east side of the hill from this area. Local fit for K (Fig. 4) is low across the map, but is highest in the east and lowest in the west. Local fit for Fe (Fig. 5) is substantially greater in the west than in the central Tlacolula Valley. But again, how do we interpret this information?

GWR_Al

Figure 1 (above): Local fit of GWR for Aluminum.

GWR_Ca

Figure 2 (above): Local fit of GWR for Calcium.

GWR_Na

Figure 3 (above): Local fit of GWR for Sodium.

GWR_K

Figure 4 (above): Local fit of GWR for Potassium.

GWR_Fe

Figure 5 (above): Local fit of GWR for Iron.

Local measures of R2 simply tell us the strength of the relationship between the predictors and the response within the local kernel. If either our predictor variables or our response variable is spatially auto-correlated, we may expect low local variability, which tends to result in poor local fit. Higher local variability in either the predictors or the response will not necessarily yield a stronger local fit, but where a relationship between the variables exists at the global scale, greater local variability helps that relationship express itself locally.

To illustrate, if we map raw concentrations of Ca and Fe (Figs. 6 and 7) in clay samples collected from across the Tlacolula Valley, we can see that local variability in each is highest in the west and lowest in the central and eastern portions of the study area. In both cases, the high local R2 values in the west simply reflect greater local variability in these elements in this area. For Ca, this appears to be driven by local contrasts between extreme values associated with limestone bedrock in close proximity to alluvial sediments with more typical values. In areas with less local variability in Ca concentrations, variability in spectral reflectance/emissivity and slope is less predictive.

GWR_CaPPM

Figure 6 (above): Calcium concentrations in Tlacolula clay samples (ppm).

GWR_FePPM

Figure 7 (above): Iron concentrations in Tlacolula clay samples (ppm).

Punchline

Results of exploratory multiple regression have shown that concentrations of individual elements in Tlacolula clay deposits are weakly correlated with spectral reflectance, emissivity, slope, and flow accumulation at the global scale of the study. At a finer scale, the results of GWR suggest that clay chemistry is poorly predicted by these variables, except in areas with high local variability in a given response element. This may indicate that the weak global relationships observed through OLS regression are driven in large part by spatially-clustered extreme values, violating the assumption of stationarity. GWR has thus provided a robust check on the assumptions of our OLS regression models, as well as a clearer view of the factors influencing Tlacolula clay chemistry.

References

Fotheringham, A. S., Brunsdon, C., and Charlton, M. E. (2002). Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Wiley, Chichester.

Mendenhall, W., and Sincich, T. (2003). Regression Analysis: A Second Course in Statistics, Sixth Edition. Pearson Education Ltd., London.

Introduction

In my last post, I discussed the importance of quantifying social contacts to predict microbe transmission between individuals in our study herd of African Buffalo. This post deals with the other half of the equation: environmental transmission. My overarching hypothesis is that spatial and social behaviors are important for transmission of host-associated microbes. The goal of this pilot project is to establish a method for quantifying habitat overlap between individuals in the herd so that I can (eventually) compare habitat overlap with microbe community similarity and infection risk.

Methods/Results

Workflow:

I used the following outline to determine pairwise habitat overlap between a “test” pair of buffalo:

  1. Project GPS points in ArcMap
  2. Clip GPS points to the enclosure boundary layer
  3. Use GME to generate kernel density estimates and 95% isopleths
  4. Back in Arc, use the Identity tool to calculate the area of overlap between the two individuals
  5. Compute percent overlap using Excel

Geospatial Modelling Environment (GME):

GME is a free geospatial analysis environment that combines the computational power of R with the functionality of ESRI ArcGIS to drive geospatial analyses. A few of the functions that are especially useful for analyzing animal movement are: kernel density estimation, minimum convex polygons, movement parameters (step length, turn angle, etc.), converting locations to paths and paths to points, and generating simulated random walks. For this analysis, I used GME to generate KDEs and estimate home ranges using 95% isopleths.

Kernel density estimates and isopleth lines

Kernel densities can be conceptualized as 3-dimensional surfaces based on the density of points. Isopleth lines can be drawn in GME from a raster dataset (such as a kernel density) and can be set to contain a specified proportion of the density surface's volume. For this analysis, I was interested in calculating 95% isopleths based on the kernel densities of GPS points; in practical terms, this delineates the area in which an animal spends the great majority of its time.
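
For those who prefer to stay in R rather than GME, the same KDE and isopleth steps can be sketched with the adehabitatHR package; buffalo_pts below is a hypothetical SpatialPointsDataFrame whose single data column is the animal ID, not an object from this analysis.

## Hypothetical sketch of KDE home ranges in R (this post used GME)
library(adehabitatHR)
## 'buffalo_pts': assumed SpatialPointsDataFrame with one data column holding the animal ID
ud   <- kernelUD(buffalo_pts, h = "href")                 # one kernel density estimate per animal
hr95 <- getverticeshr(ud, percent = 95)                   # 95% isopleth home-range polygons
ov   <- kerneloverlaphr(ud, method = "HR", percent = 95)  # pairwise proportion of home-range overlap
ov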

Using the identity tool to compute overlap

After generating home range estimates for two individuals, I loaded the resulting shapefiles into ArcMap and used the Identity tool to overlap the home ranges. To use the Identity tool, you input one identity feature and one or more input features. Arc then computes a geometric intersection between the input features and identity features and merges their attributes in the output.

overlap

The map above shows 95% isopleths for animal 1 (yellow) and animal 13 (purple), and the area of overlap computed using the Identity tool (red line). I exported the output table to Excel, where I calculated percent overlap between animals.
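
The same intersection and percent-overlap calculation can also be sketched in R with the sf package; the shapefile names below are assumptions standing in for the exported 95% isopleths.

## Hypothetical overlap calculation in R (file names are placeholders)
library(sf)
hr1  <- st_read("animal1_95iso.shp")     # 95% isopleth for animal 1 (assumed file name)
hr13 <- st_read("animal13_95iso.shp")    # 95% isopleth for animal 13 (assumed file name)
ovlp <- st_intersection(hr1, hr13)       # geometric intersection of the two home ranges
## Percent overlap is directional: shared area as a fraction of each animal's own range
as.numeric(sum(st_area(ovlp))) / as.numeric(sum(st_area(hr1)))    # share of animal 1's range
as.numeric(sum(st_area(ovlp))) / as.numeric(sum(st_area(hr13)))   # share of animal 13's range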

Conclusion

Overall, this method seems like it will be great for estimating habitat overlap. A few things that I’m concerned about are:

(a) Habitat use may be so similar for all animals that overlap cannot be related to differences in microbe transmission.

(b) Habitat use may correlate very strongly with contacts, in which case it will be difficult to control for the effects of contacts on microbe transmission.

(c) Percent overlap can be different for each individual in a pair. In my example, #13 overlapped #1 by ~80%, while #1 overlapped #13 by 90%.

I just want to be aware of these potential issues and start thinking about how to deal with them as they arise. Any suggestions would be appreciated, as always!

Introduction 

My research goal is to determine the effects of social and spatial behavior on disease transmission and nasal microbiome similarity. The overarching hypothesis of this project is that the composition of microbial communities depends on host spatial and social behavior because microbe dispersal is limited by host movement. My research group studies a herd of African Buffalo in Kruger National Park that is GPS-collared and lives in a 900 hectare predator-free enclosure. In conjunction with the GPS collars, each animal has a contact collar that records the length of contact and the ID of any animal that comes within ~1.5 meters. However, for this class I focused on the GPS collars in an effort to test whether they are sufficient for inferring contact networks.

 

The purpose of this step in my research was to establish a method for creating distance matrices between animals at different time points using only GPS data. I wanted to determine whether this is a viable option for inferring contact networks and could be used in lieu of the contact collars. If this method is effective, my plan was to iterate through multiple time points to determine average distances between animals over time.

Results from this pilot study showed that the method is feasible; however, GPS collars would be less effective for inferring a contact network than the contact collars because temporal resolution is sacrificed.

 

Data Manipulation in R

The raw data from the buffalo GPS collars is in the form of individual text files, one text file per individual per time point. Step one was to generate a csv file that included all individuals at one time point. This was done in R using the following code:

##Load necessary packages

library(reshape2)
library(ggplot2)

##Set working directory and import every GPS csv file in it

setwd("C:\\Users\\couchc\\Documents\\GPS_captures_1215\\GPS_capture_12_15")
SAVE_PATH <- getwd()
BASE_PATH <- getwd()

CSV.list <- list.files(BASE_PATH, pattern = glob2rx("*.csv"))

## Read all of the files and stack them into one data frame
## (the original code read only the first file in the list)
all.csv <- do.call(rbind, lapply(CSV.list, read.csv, header = TRUE))

names(all.csv)

## Note: following the original column assignments, latitude is stored as X and longitude as Y
buffalo <- data.frame(ID = all.csv$title, X = all.csv$LATITUDE, Y = all.csv$LONGITUDE,
                      Day = all.csv$MM.DD.YY, Hour = all.csv$HH.MM.SS,
                      s.Date = all.csv$s.date, s.Hour = all.csv$s.hour)

## Now we melt the data and cast it back so that each animal ID becomes its own column,
## letting us see what value an ID took at a given location.
melted.all.csv <- melt(buffalo, id = c("ID", "X", "Y"))
names(melted.all.csv)

casted.all.csv <- dcast(melted.all.csv, X + Y ~ ID)

length(buffalo[, 1])

## Quick plot of the GPS fixes for the first animal; geom_point is used because the fixes
## do not fall on the regular grid that geom_raster expects
ggplot(buffalo[buffalo$ID == buffalo$ID[1], ], aes(x = X, y = Y)) +
  geom_point()

 

Distance Matrix

Because the GPS collars are not synchronized, the 30-minute time intervals can be offset quite a bit, and there are quite a few missing intervals. To correct for this, I rounded times to the nearest 30 minutes in Excel. I then imported the new Excel file back into R to generate a "test" distance matrix for a single time point using the following code:

-Show code and output once I’m able to run it on a lab computer (My computer doesn’t have the processing power to do it in a reasonable time).
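
In the meantime, here is a minimal sketch of the intended step, using the buffalo data frame built above; the single-time-point filter and the latitude/longitude handling are my assumptions and may need adjusting.

## Hypothetical sketch of a single-time-point distance matrix
library(sp)
one_time <- buffalo[buffalo$s.Hour == buffalo$s.Hour[1], ]     # all animals at one (rounded) time point
## Recall that latitude was stored as X and longitude as Y above, so the column order below is (lon, lat)
dmat <- spDists(as.matrix(one_time[, c("Y", "X")]), longlat = TRUE) * 1000   # pairwise distances in metres
dimnames(dmat) <- list(one_time$ID, one_time$ID)
round(dmat)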

 

Conclusion

This method showed some interesting results: inter-individual distances can be calculated in R using GPS collar data, and if the process were iterated over multiple time points, average pairwise distances could be computed between each pair of individuals. A Mantel test could then be used to determine the correlation between that average distance matrix and a matrix of microbe community similarity. The problem with rounding time points to the nearest 30 minutes is that it doesn't guarantee that the buffalo ever actually come within the distance given by the matrix; since some of the collars are offset by up to 15 minutes, the animals could occupy the recorded locations at different times and never actually come into contact with each other.

The benefit of using GPS collars to infer social behavior is that they give us actual distances between individuals, which adds an element of spatial resolution not provided by the contact collars: contact collars only register when another animal is within a specific distance, and no measure of actual distance is recorded. However, since we are looking for the best way to predict contact transmission of microbes in this pilot project, we are not interested in distances beyond a certain threshold. Although a distance matrix could prove useful for other behavioral studies, it provides less relevant information than the contact collars here. Setting a threshold distance for microbe transmission is easily done with the contact collars, whereas we would need much higher temporal resolution in our GPS data to build distance matrices precise enough to infer contact networks.

 

Overall, this pilot project was useful because it demonstrated some of the potential benefits and drawbacks of using GPS intervals to infer contact networks. Although I do not plan to use the method described in this post for my dissertation research, it may be beneficial in future studies of animal behavior or herd structure. The exercise was very useful for improving my confidence in R — with lots of help from my classmates of course!

Introduction
Hot Spot Analysis is a common method for identifying spatially significant clusters of high and low values among features. While feature values may be easily mapped using histogram distributions to identify areas or points with high or low values, it is difficult to tell whether clusters of these values are statistically significant.

In ArcGIS, the Hotspot Analysis tool calculates the Getis-Ord Gi* statistic (Figure 1) for each feature in a dataset. It then outputs a z-score and p-value for each feature, allowing one to map the spatial distribution of significant values to identify local hotspots. To be a statistically significant hotspot, a feature must not only have a high or low value, but must be surrounded by features with similarly high or low values.

Getis_Ords_G-star and Spatial Autocorrelation implementation in ArcView-1

Figure 1: Equation for the Getis-Ord Gi* statistic.
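
In case the equation image does not load, the Gi* statistic, as it is commonly written (for example in the ArcGIS documentation), is:

$$
G_i^* = \frac{\sum_{j=1}^{n} w_{i,j}\,x_j \;-\; \bar{X}\sum_{j=1}^{n} w_{i,j}}
{S\sqrt{\dfrac{n\sum_{j=1}^{n} w_{i,j}^2 - \left(\sum_{j=1}^{n} w_{i,j}\right)^2}{n-1}}},
\qquad
\bar{X} = \frac{1}{n}\sum_{j=1}^{n} x_j,
\qquad
S = \sqrt{\frac{1}{n}\sum_{j=1}^{n} x_j^2 - \bar{X}^2}
$$

where $x_j$ is the attribute value of feature $j$, $w_{i,j}$ is the spatial weight between features $i$ and $j$, and $n$ is the total number of features.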

As we shall see, significance for the Getis-Ord Gi* statistic depends largely on the calculation of a local sum for a feature and its neighbors. This local sum is compared proportionally to the sum of all features to identify regions where the observed local sum is significantly different from the expected local sum, that is, where the difference is too large to be the result of random chance.

But how does the Hotspot Analysis tool determine what is “local”?

When the threshold distance parameter is left blank, a default threshold equal to the Euclidean distance that ensures that every feature has at least one neighbor is computed and applied to the entire dataset. An implication of this is that Hotspot Analysis is extremely sensitive to the spatial distribution of features, making some data inappropriate for this type of analysis.
In this tutorial, we will explore how manual changes in threshold distance affect our view of what is significant using clay chemistry data from the Valley of Oaxaca, Mexico. I will argue that because clay sample locations were researcher-selected rather than the product of natural processes, Hotspot Analysis is not appropriate for this data.

Data, Analysis, Results

The Valley of Oaxaca is a broad Y-shaped valley located in the central highlands of Oaxaca, Mexico and the heart of ancient Zapotec Civilization. Over the past several years, researchers at the OSU Archaeometry Laboratory have collected over 300 clay samples from across the Valley for elemental analysis using Instrumental Neutron Activation Analysis (INAA) to establish a comparative basis for the geochemical fingerprinting of ancient pottery from the region.
OaxacaGeo

Figure 2: Clay Sampling locations relative to regional geology in the Valley of Oaxaca, Mexico.

Our ability to identify pottery produced in different parts of the Valley is due in large part to the geologic complexity of the region. In each area of the Valley, clay composition is strongly conditioned by parent material, sometimes representing the product of mixed sediments from multiple sources. For this exercise, we will focus on two elements, lanthanum (La) and cesium (Cs), found in elevated concentrations in clays derived from gneiss and rhyolite, respectively. La concentrations tend to be high along the western side of the valley (Figure 3), while Cs concentrations are high in the east (Figure 4).
OaxacaLa

Figure 3: La concentrations in samples from the Oaxaca Clay Survey

OaxacaCs
Figure 4: Cs concentrations in samples from the Oaxaca Clay Survey

A hotspot analysis of La concentrations from across the Valley using an unspecified (default setting) threshold distance (Figure 5) identifies significantly high hotspots of La along the central portion of the western edge of the valley – as expected – as well as significant cold spots along the eastern edge of the southern arm of the valley and along the northern side of the eastern arm. A similar analysis of Cs values identifies Cs cold spots along the western edge of the valley and Cs hot spots near the southern and eastern sides of the eastern arm (Figure 6).
OaxacaLaHotspot

Figure 5: Hotspot analysis of La concentrations using an unspecified threshold distance.

OaxacaCsHotspot

Figure 6: Hotspot analysis of Cs concentrations using an unspecified threshold distance.

Broadly speaking these patterns are in accord with our previous understanding of regional clay composition, but to what degree are the significance levels identified by hotspot analysis dependent upon our unspecified threshold distance?

To address this question, we conducted a series of three additional hotspot analyses of Cs concentrations using specified threshold distances of 100 m (Figure 7), 1,000 m (Figure 8), and 10,000 m (Figure 9). Results of these analyses show that the number of features identified as significant hot or cold spots increases as a function of threshold distance. Using a threshold of 100 m, only a few points in the far eastern portion of the Valley are identified as significant hotspots. This pattern largely holds true using a 1,000 m threshold with a few additional points in the same limited area defined as significant hotspots. However, using a threshold of 10,000 m, the majority of features on the eastern side of the Valley are classified as significant hotspots, while the majority of those in the west are classified as significant cold spots. When nearly everything is significant, what is significant? Clearly 10,000 m is an improper choice of threshold distance, but this raises the threefold issue of how the default threshold is determined, whether this distance best describes the data, and whether the assumptions of this tool are met by the data.
OaxacaCsHotspot100

Figure 7: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 100 m.
OaxacaCsHotspot1000

Figure 8: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 1000 m.

 

OaxacaCsHotspot10000
Figure 9: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 10,000 m.
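
For anyone who wants to reproduce this threshold-distance sensitivity check outside of ArcGIS, a rough sketch with the spdep package in R might look like the following; the object clays and its Cs column are assumptions, and a projected coordinate system in metres is assumed.

## Hypothetical Gi* sensitivity check in R (not the workflow used above)
library(spdep)
## 'clays' is assumed to be a SpatialPointsDataFrame with a Cs column, projected in metres
coords <- coordinates(clays)
for (d in c(100, 1000, 10000)) {
  nb <- include.self(dnearneigh(coords, 0, d))          # neighbours within d metres, plus the focal point (Gi*)
  lw <- nb2listw(nb, style = "B", zero.policy = TRUE)   # binary weights; tolerate points with no neighbours
  gi <- localG(clays$Cs, lw, zero.policy = TRUE)        # Gi* returned as z-scores
  cat("threshold", d, "m:", sum(abs(as.numeric(gi)) > 1.96), "features with |z| > 1.96\n")
}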

 

Discussion

Results of a series of hotspot analyses of La and Cs concentrations in clay samples from the Valley of Oaxaca largely confirm what we already knew – La concentrations are high in clays along the western side of the valley and Cs concentrations are highest in clays on the eastern side of the valley. Hotspot analysis allows us to identify spatially significant clusters of points with higher or lower than expected values, but as we have seen, what gets flagged as significant is strongly dependent upon one's choice of threshold distance.

By default, ArcGIS selects a threshold distance equal to the minimum distance required to ensure that all features have at least one neighbor. This means that the default minimum distance is contingent upon the spatial distribution of features, and is likely driven by features at the edge of the dataset and/or spatial outliers.

This raises the larger issue of what factors drive the spatial distribution of features in one's dataset. Are the locations of individual features the product of the physical or social environment? Is their spatial distribution critical to your understanding of the data? Or are the locations researcher-selected according to some regular, stratified, or arbitrary sampling design that is a product not of the environment, but of how one chose to sample that environment?

In the case of the Oaxaca Clay Survey, clay sampling locations were selected opportunistically from subsurface horizons along roadcuts and streambanks, with greater sampling intensity near key archaeological sites and sparser sampling in areas with few important sites. Because the sampling locations were arbitrarily selected, their overall distribution is not of interest, and neither are the inter-point distances used to calculate the Getis-Ord Gi* statistic.

Because the spatial distribution of clay sampling locations tells us more about researcher behavior than clay formation processes, and Hotspot Analysis uses this information in its calculation of z-scores and significance levels, Hotspot Analysis is not appropriate for this data.

Conclusion

Before using Hotspot Analysis, one should always verify (1) that the spatial locations of features of interest are the product of the physical or social environment rather than researcher-selection; and (2) that the default local threshold value is not driven by spatial outliers.

Relationships that vary in space and time: A challenge for simple linear regression

Regression is a fundamental analytical tool for any researcher interested in relating cause and effect. A basic assumption of regression is stationarity, the principle that the relationship between a predictor variable and its response is constant across the sample space; that is, that the relationship holds in all regions where it is applied. This is a particularly poor assumption in spatially based analyses, where interactions between known and often unknown factors can change how a response variable relates to a given explanatory variable from place to place. While this is a challenge for simple linear regression, it is also what generally makes spatial problems interesting: the fact that relationships are not constant across space and time.

Spatially weighted regression relaxes the assumption of stationarity: where simple linear regression develops a single relationship to describe a phenomenon, spatially weighted regression allows the relationship to vary spatially. Unhinging the relationship between an explanatory variable and its response in this way creates a set of local coefficients for each location where a regression is fit. This is done through the use of a weighting function. Whereas in simple linear regression each data point carries equal weight in the final relationship, a weighting function gives greater weight to values closer to the point where a local regression is being calculated.


Spatial weighting function (1)

Fig 1: A spatial weighting function gives greater weight to data points closer to a regression point. Bandwidths can also vary across a feature space, such that two local regressions may be constructed from different numbers of data points.


SWR_drawing

Fig 2: Spatially weighted regression allows the relationship between a response and an explanatory variable to vary across a study region.

 

NDVI and weed density: A case study in spatially weighted regression

Normalized difference vegetation index (NDVI) has been used as a proxy for phenology in many remote sensing and cropping systems studies. NDVI is calculated from red and near-infrared reflectance as (NIR - Red) / (NIR + Red), and is generally related to the amount of green, photosynthetically active tissue. In principle, weeds and crops should be distinguishable based on how their NDVI response varies in time.


Question: Can NDVI be used to predict weed density? Does the relationship between NDVI and weed density vary spatially? 


NDVI_swr

Fig 3: A general hypothesis for how weeds and crop may differ in their NDVI response. Weeds may mature earlier or later than a crop, but this relationship may also vary spatially.

Here, spatially weighted regression is offered as a method for distinguishing relationships between weed density and NDVI. Allowing the relationship between NDVI and weed density to vary spatially may allow one to classify areas of the field based on the nature of these relationships.


NDVI_gif

Fig 4: NDVI over the course of the growing season at a 17 acre field located NE of Pendleton, OR, Summer 2015. Note that the field does not increase or decrease in NDVI evenly, but rather peak NDVI passes as a wave across the field. [ Note: GIF does not appear animated in preview. Click on the GIF directly if animation is not working ]

An important first step in deciding if a data set is suitable for spatially weighted regression is to look at the residuals of the linear relationship you are choosing to model. Here we examine the following function for predicting weed density:

NDVI_form (1)

This function uses three samples of NDVI over the course of the growing season, centered on the time of peak NDVI. The purpose of using these three dates was to emphasize the time periods in which weeds and crop would differ most in their relative NDVI response.


PlantHS_obs

Fig 5: Weed densities were calculated based on linear transects made prior to harvest. Weed hot spots were calculated using the Getis-Ord Gi* statistic in ArcMap. Weed hot spots from a prior analysis were used as input for predicting weed density in this exercise.


PlantHS_glm_pred

Fig 6: Predicted weed density for a multiple regression model based on 3 measurements of NDVI surrounding peak NDVI.  Multiple R2:  0.044, P-value < 0.001

 


PlantHS_glm_resid

Fig 7: Residuals from a multiple regression model based on 3 measurements of NDVI surrounding peak NDVI.  Both high and low residuals cluster spatially, indicating that the relationship between NDVI and weed density may vary spatially and may be a good candidate for geographically weighted regression.
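
One way to back up the visual impression of clustered residuals is a Moran's I test on the OLS residuals. A minimal sketch with the spdep package follows; the object and column names are placeholders, not the actual data used here.

## Hypothetical check for spatially autocorrelated OLS residuals
library(spdep)
## 'field_pts' is assumed to be a SpatialPointsDataFrame of transect points with weed counts
## and the three NDVI predictors (column names below are placeholders)
ols <- lm(weeds ~ ndvi_t1 + ndvi_t2 + ndvi_t3, data = field_pts@data)
nb  <- knn2nb(knearneigh(coordinates(field_pts), k = 8))   # 8 nearest neighbours per point
lw  <- nb2listw(nb, style = "W")
lm.morantest(ols, lw)   # a significant positive Moran's I supports the case for GWR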


PlantHS_swr_pred

Fig 8: Predicted weed density from a spatially weighted regression. Quasi-global R2: 0.92.

By using a spatially weighted regression, we are able to account for 92 percent of the variance in the distribution of weeds in this field. Unlike a standard regression, the result of this process is a collection of local regression formulas, so it cannot easily be extrapolated to predict weed distributions in future data sets. However, the local coefficients do offer us the opportunity to look for spatial patterns that may yield additional information about the nature of these local spatial relationships.


 


Slope_coef

Fig 10: Map classified by the local slope coefficients.

Introduction

I have done a cluster analysis on fire regimes produced by the MC2 dynamic global vegetation model (DGVM) run across the Pacific Northwest (PNW) over the time period 1895-2100 (see my previous Cluster Analysis in R post for details). Clusters were based on fire return interval (FRI), carbon consumed by fire, and fraction of cell burned. MC2 is a process-based model with modules for biogeochemistry, fire, and biogeography. Process models are generally complicated models with high computational costs. I wanted to explore the question "How well could a statistical model reproduce the results from a process-based model?"

A complete answer to that question would take much work, of course, as MC2 produces data on vegetation types, carbon pools, carbon fluxes, water flows, and, of course, fire. But this investigation provides a starting point and has the potential to show that at least one aspect of the model could have a statistical alternative to process-based computation.

Method description

Overview

Inputs into the MC2 include monthly weather in the form of maximum temperature, minimum temperature, precipitation, and mean dew point temperature, as well as elevation, and soil depth. I chose to use these inputs with a set of logistic regression models (described below) to predict the fire regime cluster of each cell in the results from my cluster analysis. To simplify the number of regression variables, I summarized the monthly variables by “summer” (April-September), “winter” (October-January), and annually.

Logistic regression

Conceptually, logistic regression differs from linear regression in that logistic regression is designed to predict the probability of an outcome or occurrence in a binary data set. It is commonly used in models predicting the presence or absence of a phenomenon (e.g. fire occurrence, species occurrence) over a spatial domain. Logistic regression uses one or more continuous or binary explanatory variables to estimate the probability of the binary response variable.

The formula used in logistic regression is:

e^t / (e^t + 1)

which is equivalent to

1 / (1 + e^-t)

where t is the logit, a linear function of the explanatory values. The graph of the function is S-shaped.

An example in R should make things clearer:

Make a dataset with an independent variable and a dependent binary variable:

myDS = data.frame(
iv = 1:50, # independent variable
dv = c(rep(0,25),rep(1,25))
)

plot(myDS$iv,myDS$dv)

Your graph should look like this:

SheehanLogRegPlot1

Scramble the dependent variables between 20 and 30 to make the plot messier, like real life:

myDS$dv[21:30] = sample(1:0,10,replace=TRUE)

plot(myDS$iv,myDS$dv)

Your graph should look similar to this one, with the regions of 1 and 0 values overlapping:

SheehanLogRegPlot2

Now run a logistic regression on the data and look at the summary of the result:

tstGlm = glm(dv ~ iv, data = myDS, family = binomial)

summary(tstGlm)

Your summary should look something like this:

Call:
glm(formula = dv ~ iv, family = binomial, data = myDS)

Deviance Residuals:
Min       1Q   Median       3Q       Max
-1.55618 -0.15882 -0.01953   0.15884   2.01358

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.0868     3.0071 -3.022 0.00251 **
iv           0.3429     0.1113   3.080 0.00207 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 69.235 on 49 degrees of freedom
Residual deviance: 19.165 on 48 degrees of freedom

AIC: 23.165

Number of Fisher Scoring iterations: 7

There is a good explanation of these in this video: https://www.youtube.com/watch?v=xl5dZo_BSJk. A good reference for formulas used can be found at http://documentation.statsoft.com/portals/0/formula%20guide/Logistic%20Regression%20Formula%20Guide.pdf

But the short explanation is:

Call: The call that was made to the glm function

Deviance residuals: Not really useful in a binary response regression

Coefficients:

  • Intercept: The intercept, or constant coefficient, used in the logit linear function
  • Estimate: These are the coefficients used with each variable in the logit (in this case we have one variable in the logit)
  • Std. Error: The standard error associated with each coefficient.
  • Z Value: The z value for the coefficient. This is the Estimate divided by the Std. Error.
  • Pr(>z): The p value associated with each z value.

Null deviance: A measure of deviance between the fitted model and a model that includes only the constant coefficient

Residual deviance: A measure of deviance for the fitted model.

AIC: Akaike information criterion. This is a measure of the quality of the model based on goodness of fit and model simplicity. For models created using the same number of data points, a model with a lower AIC is considered better than one with a higher AIC.

Make a plot of the regression function along with the original points. You have to use a data frame in which the column has the same name as the column you used to create the regression:

tmpData = data.frame(iv = seq(0,50,0.2))

plot(seq(0,50,0.2), predict(tstGlm, newdata = tmpData, type = 'response'), type='l')

points(myDS$iv,myDS$dv)

SheehanLogRegPlot3

Notice how the logistic curve, which predicts the probability of a positive outcome, moves from a y value of 0 to a y value of 1 where the dependent values from the original dataset are mixed between 1 and 0. From this, it is easy to see how logistic regression predicts probabilities for binary data.

Logistic regression can be used with multiple explanatory variables, too. Each explanatory variable has a term in the logit, and each term's coefficient determines its influence on the final prediction.

Stepwise Logistic Regression

Finding the best combination of explanatory values to use can present a challenge. Often, there are more variables than desired, and using too many can result in an overfitted model. One method used to determine which variables to use is stepwise regression. There are two approaches to this method: forward and backward.

In forward stepwise regression, the algorithm starts with the constant coefficient. Then it tests all of the explanatory values to find the one that, by itself, has the greatest explanatory power. It computes the AIC for the model and then tests the remaining explanatory variables to find which adds the greatest explanatory power. It then computes the AIC with the added variable. If the AIC is lower (better) with the new explanatory variable, it repeats the process. If the AIC is greater, it discards the new variable and stops.

Backward stepwise regression takes the opposite approach. It starts with all the variables and removes the one with the least explanatory value and then calculates AIC. It stops at the point when removing a variable would raise the AIC.

One approach to finding a model with stepwise regression is to do both forward and backward regression and see if one returns a better result than the other. Additionally, the results can be modified by removing any explanatory variables that have a high p-value to see if removing them yields a better model. I used both forward and backward regression, and in two cases found model improvement by removing an explanatory variable with a high p-value.

A problem with stepwise regression is that it does not guarantee the best possible regression function. To find that, one has to test all possible combinations of explanatory variables, but that is generally a very compute-intensive proposition. So, even though it is easy to find criticisms of stepwise regression, it is commonly used.

Without going into too much detail, here is some R code that performs stepwise regression:

# Generate some independent, explanatory variables. Note how some are more “scrambled” than others

indepVs = data.frame(
v1 = c(1:50),
v2 = c(1:50),
v3 = c(1:50),
v4 = c(1:50),
v5 = c(1:50),
v6 = c(1:50)
)

indepVs$v2[23:27] = sample(c(23:27),5)
indepVs$v3[21:30] = sample(c(21:30),10)
indepVs$v4[16:35] = sample(c(16:35),20)
indepVs$v5[11:40] = sample(c(11:40),30)
indepVs$v6[1:50] = sample(c(1:50),50)

# The dependent variable, scrambled in the middle

depV = rep(0,50)
depV[26:50] = 1
depV[21:30] = sample(c(0:1),10,replace=TRUE)

# Generate a full model (using all the explanatory variables)

fullGlm = glm(
depV ~ v1+v2+v3+v4+v5+v6,
data = indepVs,
family = binomial()
)

# Generate an empty model

emptyGlm = glm(depV ~ 1, data = indepVs,family=binomial)

# Do forward stepwise regression

forwardGlm = step(
emptyGlm,
scope = list(lower=formula(emptyGlm),upper=formula(fullGlm)),
direction='forward'
)

# Do a backward stepwise regression

backGlm = step(fullGlm)

Plotting the results and examining the resulting models using summary() will give you insights into how the process is working.

Geographically weighted logistic regression

I did some searching for geographically weighted logistic regression and found a couple of papers that used the method. I did not come across any easily implemented solutions for it (which does not mean one does not exist). I think doing geographically weighted logistic regression could have a lot of value, especially over large and varied geographic areas, like the one I’m studying.

I did a logistic regression on a single EPA Level III ecoregion (The Blue Mountains) within my study area and compared the results of that to results from the model produced using the whole study region and found it had a higher predictive value for the ecoregion. This is an indication that geographically weighted regression could provide advantages.

The Analysis

I used logistic regression as part of a process to predict which fire regime each cell in my dataset would have. My explanatory values were derived from the input dataset I used for modeling; these included annual, summer, and winter means for maximum temperature, minimum temperature, precipitation, and mean dewpoint temperature, as well as elevation and soil depth. For each of my four fire regimes, I generated a logistic regression predicting whether a cell was in that fire regime or not. Based on recommendations from https://www.medcalc.org/manual/logistic_regression.php, I used this formula to determine the number of data points to use in each regression:

10 * num_exp_vars / min(frac_pos_cases, frac_neg_cases)

where num_exp_vars is the number of explanatory variables, frac_pos_cases is the fraction of positive (presence) cells in the dataset, and frac_neg_cases is the fraction of negative cases in the dataset.

This gave me four logistic regression functions. For each cell in the study area, I ran each of the logistic functions and recorded the probability returned. The predicted fire regime for a cell was assigned based on the logistic function that produced the highest probability. I repeated this entire procedure for the Blue Mountain Ecoregion only.
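
A minimal sketch of this prediction step in R is shown below; fits and cells are placeholder names for a named list of the four binomial glm objects and a data frame of explanatory values per cell, and the accuracy line assumes an observed-regime column exists.

## Hypothetical one-vs-rest assignment of fire regimes
## 'fits': assumed named list of four binomial glm objects, one per fire regime
## 'cells': assumed data frame of explanatory variables for every grid cell
probs <- sapply(fits, function(m) predict(m, newdata = cells, type = "response"))
cells$predicted <- colnames(probs)[max.col(probs)]   # the regime whose model gives the highest probability
mean(cells$predicted == cells$observed_regime)       # proportion of cells predicted correctly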

Results

Across the full study area, this process correctly predicted the fire regime with 61% accuracy (keep in mind that random assignment would be expected to return about 25% accuracy). Within the Blue Mountain Ecoregion, the process predicted the fire regime with 46% accuracy using the logistic functions derived from the data for the entire study area, and with 74% accuracy using the logistic functions derived for the Blue Mountain Ecoregion alone. This is a strong indication that geographic influences exist and that geographically weighted logistic regression has the potential to improve results.

Method Critique

I was very pleased with the results of this method, and frankly, a little surprised at how successful the method was at predicting fire regime. This investigation leads me to believe that statistical models should be considered as alternatives to process-based models for some vegetation modeling tasks, especially considering the lower computational overhead of statistical models.

The biggest downside to the method is the manual procedures involved in finding the best logistic function. This process could be automated with some programming, however, as could a geographically weighted version of logistic regression.

The goal of this tutorial is to expose users to interpolation methods using packages in R. I found that doing these in R gives you very informative insight into how the interpolation methods operate, since you have to specify the options yourself instead of clicking the kriging function in ArcGIS.

Just a quick refresher:

1) First, I need to load the necessary libraries. The main libraries I need for interpolation are gstat and raster.

2) Import all the data that we need. The first dataset is a polygon shapefile, which I will use just for its projection and for displaying the maps. There is also a .csv file with all the points I need, which I will convert to a point vector data type.

Last time, I was able to extrapolate the point-layer pH values over each of their respective polygons. If you want more detail on how exactly this was accomplished, please read my previous blog post: Spatial Data Management and Simple Plotting with R.

x <- c("rgdal", "rgeos", "maptools", "dplyr", "tidyr", "tmap", "gstat", "raster")
lapply(x, library, character.only=T)

## This is a polygon shapefile
CRA <- readOGR(dsn="Data/NRCS_Data", layer="Final_CRA")

## OGR data source with driver: ESRI Shapefile
## Source: "Data/NRCS_Data", layer: "Final_CRA"
## with 60 features
## It has 4 fields

## Reading in a .csv file with all the attributes and removing all the missing pH values.
NRCS_data <- read.csv("Data/NRCS_Data/Agg_NRCS_Data.csv", header=T, stringsAsFactors=F)
NRCS_data <- NRCS_data[!is.na(NRCS_data$ph_1_to_1_H20), ]

NRCS_data

## Creates a matrix of lat/long coordinates
NRCS_mat <- with(NRCS_data, cbind(longitude, latitude))
row.names(NRCS_mat) <- 1:nrow(NRCS_mat)

## Takes the projection from the CRA shapefile
CRA.projection <- CRS(print(proj4string(CRA)))

## [1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"

## Creates a spatial points dataframe with the same projection as our CRA shapefile
NRCS <- SpatialPointsDataFrame(coords=NRCS_mat, data=NRCS_data, proj4string=CRA.projection)
NRCS@data

## Aggregating my attributes by pH: finding the mean pH for each CRA
agg_NRCS_data <- aggregate(ph_1_to_1_H20~CRA, FUN=mean, data=NRCS_data)
agg_NRCS_data <- rename(agg_NRCS_data, Final_CRA = CRA)
agg_NRCS_data

## Merging the 2 datasets together by CRA
CRA@data <- left_join(CRA@data, agg_NRCS_data)

## Joining by: "Final_CRA"

## Warning in left_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector

CRA@data <- rename(CRA@data, pH_water = ph_1_to_1_H20)
CRA@data

## Plotting the extrapolated CRA polygon layer. An in-depth explanation of this code is given with the IDW plotting code below.
tm_shape(CRA)+ tm_fill("pH_water",breaks=c(-Inf, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, Inf), auto.palette.mapping=F, legend.show = F)+tm_borders()+tm_shape(NRCS)+ tm_bubbles("red", size = .1, border.col="black")+tm_layout(title="Extrapolation of pH Across CRAs", title.position = c("center", "top"))

CRA

 

In order to do any form of interpolation, you need to define the resolution at which the interpolation will happen. In ArcGIS this is done automatically; however, in R you need to create a raster layer with the resolution you want. Here I create a blank raster layer that has 100 rows and 100 columns. You can change the resolution to whatever you want by editing the ncol and nrow arguments.


## Creating a raster grid using the raster package
grd <- raster(ncol=100, nrow=100)

## "Spreading" the grid to encompass the entire CRA polygon layer
extent(grd) <- extent(CRA)

## Giving the grid a projection
crs(grd) <- CRA.projection

## Changing the variable class from a raster to a SpatialPixels object. They are both gridded rasters; however, this is the format that the package gstat prefers.
NRCS.raster <- SpatialPoints(xyFromCell(grd, 1:ncell(grd)))
crs(NRCS.raster) <- CRA.projection
gridded(NRCS.raster) <- TRUE

## This is just a quick plot to see that the raster grid has the correct projection and matches up correctly.
plot(NRCS.raster)
lines(CRA, col="blue")

grid_plot

This is a plot of the polygon layer over the empty gridded raster layer.

Now that we have all of our layers, it's time to do the inverse distance weighting (IDW) and ordinary kriging. IDW and kriging both use only geographic distance from known points to estimate values at unknown points. IDW is simpler, since each estimate is a distance-weighted average of the known points. Kriging is a little more sophisticated, since a model fitted to the semivariogram is used to calculate the weights.
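To make the distance weighting concrete, here is a tiny standalone IDW calculation (the pH values and distances are made up and are not part of the soil dataset):

## Toy IDW example with a power of 2, independent of the NRCS data
known_pH <- c(5.2, 6.8, 7.1)   # pH at three sampled points (made-up values)
dists    <- c(10, 25, 40)      # distances from the location we want to estimate
w <- 1 / dists^2               # closer points get larger weights
sum(w * known_pH) / sum(w)     # the IDW estimate, about 5.5 here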

OK, first we are going to do the IDW. For this we need the points with the z-value we want to interpolate and the grid we want to interpolate onto (our blank raster layer).

## In gstat the IDW function is called idw. We want to interpolate the pH_water variable in the NRCS point shapefile over the resolution of the NRCS.raster layer.
idw.out <- gstat::idw(pH_water~1, NRCS, newdata=NRCS.raster)

## [inverse distance weighted interpolation]

## Here we are backing up the IDW raster layer and trimming it to the extent of our CRA polygon layer.
idw.out.BU <- idw.out
idw.out <- idw.out.BU[CRA,]

## Now to plot it. I know this piece of code looks intimidating; however, that is only because you have to create every little aspect of your plot layer by layer. First comes the pH IDW raster layer; I want the raster colors partitioned by the pH bins I specify after breaks, and I turn the legend off since it is a little cluttered. The next layer is the polygon layer, where I display just the outline of each polygon. The last layer is the point layer, in which I display each point as blue. tm_layout is just for formatting the title and positioning it at the top center.

tm_shape(idw.out)+tm_raster(breaks=c(-Inf, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, Inf), auto.palette.mapping=F, legend.show = F)+tm_shape(CRA)+tm_borders()+tm_shape(NRCS)+ tm_bubbles("blue", size = .1, border.col="black")+tm_layout(title="Interpolation using IDW", title.position = c("center", "top"))

IDW_plot

Now for kriging. To do kriging in R, you first need to examine the semivariogram, then fit a model to the semivariance, and use that model for your interpolation.

## Creating the semivariogram and plotting it.
cvnrcs <- variogram(pH_water~1, locations=NRCS, data=NRCS)
plot(cvnrcs)

VAR_1_plot

## Though this variogram does not look great, for the purposes of this exercise it will work. However, in reality I would be quite hesitant to do much more interpolation with these data.
## Looking at the variogram, I chose to go with a spherical model that has a partial sill of .06, a nugget of .01, and a range of 100.
nrcs.fit.sph <- fit.variogram(cvnrcs, model = vgm(psill=.06, model="Sph", nugget=.01, range=100))
plot(cvnrcs, nrcs.fit.sph)

Var_2_plot

## The fit isn't completely bad; however, it is not great.
## Now to create the ordinary kriged raster layer, then back it up and trim it to the extent of the polygon layer.
pH.ok <- krige(pH_water~1,NRCS, NRCS.raster, model=nrcs.fit.sph)

## [using ordinary kriging]

pH.ok.BU <- pH.ok
pH.ok <- pH.ok.BU[CRA,]

## This is essentially the same plotting code as the IDW plot; however, the raster layer I am using is the one created by ordinary kriging.
tm_shape(pH.ok)+tm_raster(breaks=c(-Inf, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, Inf), title="Soil pH", auto.palette.mapping=F, legend.show = F)+tm_shape(CRA)+tm_borders()+tm_shape(NRCS)+ tm_bubbles("blue", size = .1, border.col="black")+tm_layout(title="Interpolation using Ordinary Kriging", title.position = c("center", "top"))

krig_plot

Examining the plots, you can see that the pH values on the western side of Oregon are fairly well preserved, since there is not much heterogeneity there. However, as you move east, each map displays something different. This is because parameters other than distance alone dictate pH. Also, the sampling density in eastern Oregon is not high enough to capture the heterogeneity of the landscape, which is why the IDW and ordinary kriging maps blend much of it together.

Still, now you know how to use this functionality in R, and this only scratches the surface. Though R may not be as user friendly and its graphics may not be as polished, its ability to customize each step of the analysis makes it well worth the effort of learning how to conduct these analyses.

Introduction

I am working with a dataset from a dynamic global vegetation model (DGVM) run across the Pacific Northwest (PNW) over the time period 1895-2100. This is a process-based model that includes a dynamic fire model. I am particularly interested in drivers of the fire regimes (the frequency and intensity of wildfire) produced by the model. The general question I am studying is “what, if any, is the relationship between the climatic and elevation inputs of the model and the fire regimes produced by the model?” A more specific pair of questions for this method is “Do fire regimes cluster locally?” and “How do fire regimes change over time?”

Method description

Overview

I divided the data into two time periods: 1901-2000 and 2001-2100 to capture the historical and projected climates and fire regimes over those periods. To define a fire regime, I chose the two fire-related output variables – carbon consumed by fire and percent of model pixel burned – along with fire return interval (FRI) over the century time period. I used k-means cluster analysis in R to define four fire regimes.

K-means clustering divides a dataset into a specified number of clusters, calculating centroids and cluster membership such that the Euclidean distance between each cluster's centroid and its members is minimized. The relative scales of the input variables should be taken into consideration in this type of analysis, since they affect the Euclidean distances calculated. The steps in the analysis are outlined below; the actual commands used for the analysis are given in Appendix A.
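As a minimal illustration of the scaling point (sketch only; df below is a generic stand-in for my combined dataset, and the full commands are in Appendix A):

## Min-max normalize each variable to 0-1 before k-means so that no single
## variable dominates the Euclidean distances. `df` is a placeholder data
## frame with columns Consumed, PBurn, and FRI.
norm01 <- function(v) (v - min(v)) / (max(v) - min(v))
df_norm <- as.data.frame(lapply(df, norm01))

set.seed(1)  # k-means starting centroids are random
clusters <- kmeans(df_norm, centers = 4, algorithm = "Lloyd", iter.max = 500)
table(clusters$cluster)  # cluster sizes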

Outline of steps

For each of the two input files:

  1. Open the NetCDF file
  2. Reverse the order of the latitude values (change from highest to lowest to lowest to highest)
  3. Filter out the na (no data) values
  4. Copy data (carbon consumed by fire, fraction pixel burned, FRI) into a combined data set

For the combined dataset:

  1. Normalize values for each variable to 0.0 to 1.0 using the minimum and maximum of each variable in the combined dataset.
  2. Generate the four clusters
  3. Output value distributions of the four clusters
  4. Distribute resulting data into two output datasets, one for each of the two time periods
    1. Divide into two separate datasets
    2. Distribute results into non-na data locations

For each of the output datasets:

  1. Create output NetCDF file
  2. Write data to NetCDF file
  3. Close NetCDF file

 

Results

Clusters

With my input data, the FRI values range from a minimum of 1 to a maximum of 100 years. The mean annual fraction of area burned has a theoretical range from 0 to 1, and the mean annual carbon consumed ranges from 0 to 266.5 g m-2. Performing the cluster analysis on the original input values resulted in 3 of the 4 clusters being driven primarily by the values of carbon consumed (Fig. 1a-c). Normalizing the values of each input variable using the variable's minimum and maximum resulted in clusters driven by different variables (Fig. 2). For the remainder of the project I am using normalized variables.

SheehanCluster01

Fig. 1: Z-score distributions of input variables within clusters, for clusters computed without normalizing the input data. Clusters 1, 2, and 3 (A-C) exhibit strong overlaps in both fraction area burned and time between fires, with strong shifts in carbon consumed between clusters. Cluster 4 (D) is the only cluster in which all three factors differ substantially from the other clusters.

SheehanCluster02

Fig. 2: Distributions of normalized input variables within clusters. Each of the clusters was created by performing cluster analysis using normalized input data. All clusters have substantial shifts in the distributions of at least two factors compared to the other clusters.

Result maps

The clustering of fire regimes can easily be seen in maps of the fire regimes (Fig. 3). Areas without fire tend to be at higher elevations: in the Coast Range and Cascades of Oregon and Washington and the Rockies of western Montana and northwestern Idaho. High-frequency fires (low FRI) are common in the plains and plateaus of southern Idaho, south and central Oregon, and the Columbia Plateau in central Washington. The other two fire regimes, both of which have a medium FRI, are somewhat intermingled, but are present at mid elevations in the Rockies, in southwest Oregon, and in the Willamette Valley and Puget Trough regions of Oregon and Washington, respectively.

SheehanCluster03

Fig. 3: Fire regime maps for (A) the historical period (1901-2000) and (B) the future period (2001-2100). (C: carbon consumed; Med: medium; Frac: fraction of grid cell; No Fire: no fire occurred in the cell over the time period)

Fire regime changes from the historical (Fig. 3A) to the future (Fig. 3B) period include the appearance of the High Carbon Consumed, Medium Fraction Burned, Medium FRI fire regime around the edges of the plains and plateaus and into the Blue Mountains of northeastern Oregon, as well as the spread of both Medium FRI fire regimes into the Coast, Cascade, and Rocky Mountain ranges.

Method critique

I found this method to be very useful for my study. Fire and vegetation modeling results are commonly expressed using single variables over time or area and fail to take into consideration the relationships between multiple variables. This work indicates that breaking sets of multiple fire-related variables into clusters provides a way to better characterize regional characteristics as well as spatial changes through time.

I did find the data manipulation required for the method a bit tricky. Having to work around the na values within the data required a bit of labor. Converting matrix data to vector data for the methods was another minor inconvenience. All that said, however, I believe such manipulations will become second nature as my experience with R grows.

Appendix A: Annotated R commands used for the analysis of normalized input data.
## Cluster analysis on normalized data.
## The dataset is drawn from two files, one for the 20th
## century and one for the 21st
## Install the necessary libraries

library("DescTools") # for displaying data
library("ncdf4")     # for reading and writing NetCDF files

## Set the current working directory for data and results

setwd('/Users/timsheehan/Projects/MACA/FireClustering/Analysis')

## Open the NetCDF files for reading

histnc = nc_open('SummaryOver_1901-2000.nc')
futnc = nc_open('SummaryOver_2001-2100.nc')

## Get the lat and lon variables

lon = ncvar_get(histnc, 'lon')
lat = ncvar_get(histnc, 'lat')

## The lat dimension is highest to lowest in the input
## file, so its vector needs to be reversed.
## Reverse lat into ascending order (lower case r rev)

lat = rev(lat)
## Create the data frames for historical, future, and difference data

normHist = expand.grid(lon=lon, lat=lat)
normFut = expand.grid(lon=lon, lat=lat)
normDiff = expand.grid(lon=lon, lat=lat)

## Get the fields from the input files
## Note the lat dimension comes in from high to low, so
## it needs to be reversed so that R can display it (upper case R Rev)

## Consumed is the annual mean for carbon consumed
## PBurn is the annual mean of fraction of area burned
## FRI is the mean number of years between fires

normHist$Consumed = c(Rev(ncvar_get(histnc, 'CONSUMED_mean'), margin = 2))
normHist$PBurn = c(Rev(ncvar_get(histnc, 'PART_BURN_mean'), margin = 2))
normHist$FRI = c(Rev(ncvar_get(histnc, 'PART_BURN_intvl'), margin = 2))

normFut$Consumed = c(Rev(ncvar_get(futnc, 'CONSUMED_mean'), margin = 2))
normFut$PBurn = c(Rev(ncvar_get(futnc, 'PART_BURN_mean'), margin = 2))
normFut$FRI = c(Rev(ncvar_get(futnc, 'PART_BURN_intvl'), margin = 2))

## Normalize the values prior to doing the analysis
## Also get the z scores for distribution plotting later
## Note that NA values from the data must be omitted before
## taking the statistics

## Loop over the three variables used in the clusters.
## For each one, filter out cells with no data using na.omit.
## Normalize the data based on the min and max value of each variable.

for(nm in c('Consumed', 'PBurn', 'FRI')) {
  ## Temporary data for efficient processing
  tmpVect = c(na.omit(normHist[[nm]]), na.omit(normFut[[nm]]))
  tmpMin = min(tmpVect)
  tmpMax = max(tmpVect)
  tmpDiff = tmpMax - tmpMin
  tmpMean = mean(tmpVect)
  tmpSD = sd(tmpVect)

  ## Z scores (computed from the raw values, before they are overwritten below)
  normHist[[paste(nm, '_Z', sep='')]] = (normHist[[nm]] - tmpMean) / tmpSD
  normFut[[paste(nm, '_Z', sep='')]] = (normFut[[nm]] - tmpMean) / tmpSD

  ## Normalize the data
  normHist[[nm]] = (normHist[[nm]] - tmpMin) / tmpDiff
  normFut[[nm]] = (normFut[[nm]] - tmpMin) / tmpDiff
}

## Create the data frame for clustering
## Again, we omit data with no value using na.omit

normWorkDF = data.frame('Consumed' = c(na.omit(normHist$Consumed), na.omit(normFut$Consumed)))
normWorkDF$PBurn = c(na.omit(normHist$PBurn), na.omit(normFut$PBurn))
normWorkDF$FRI = c(na.omit(normHist$FRI), na.omit(normFut$FRI))

## Perform the clustering analysis
## This is the point of all the work being done here
## The Lloyd algorithm runs efficiently on the large dataset
## (~100k points)

normCluster = kmeans(
  data.frame(normWorkDF$Consumed, normWorkDF$PBurn, normWorkDF$FRI),
  4,
  algorithm="Lloyd",
  iter.max=500
)

## Copy the clusters back to the work data frame

normWorkDF$Clust = normCluster$cluster

# Plot the cluster distributions

for(ndx in 1:4) {
  fNm = paste(
    '/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormClust_',
    ndx,
    '.png',
    sep=''
  )
  png(fNm)
  tmpDF = subset(normWorkDF, normWorkDF$Clust == ndx)
  plot(
    density(tmpDF$Consumed),
    xlim=c(0,1),
    ylim=c(0.0,6.0),
    col='darkblue',
    xlab='Normalized Value',
    main=paste('Cluster', ndx, 'Distribution')
  )
  lines(density(tmpDF$PBurn), xlim=c(0,1), col='green')
  lines(density(tmpDF$FRI), xlim=c(0,1), col='red')
  legend(
    'topright',
    legend=c('C Consumed', 'Fract. Area Burned', 'Time Betw. Fires'),
    lwd=1,
    col=c('darkblue', 'green', 'red')
  )
  dev.off()
}

## Add the cluster numbers into the original data frames
## Note that NA needs to be taken into account using
## [!is.na(normHist$Consumed)] to get the indexes of those data
## items that are not NA

## Historical values are in first half of data frame, future in second
## Calculate indexes for clarity

histClustStartNdx = 1
histClustEndNdx = length(normCluster$cluster)/2
futClustStartNdx = length(normCluster$cluster)/2 + 1
futClustEndNdx = length(normCluster$cluster)

normHist$Clust = NA
normHist$Clust[!is.na(normHist$Consumed)] = normWorkDF$Clust[histClustStartNdx:histClustEndNdx]

normFut$Clust = NA
normFut$Clust[!is.na(normFut$Consumed)] = normWorkDF$Clust[futClustStartNdx:futClustEndNdx]

## Create matrices for results and display
## png() tells R what file to store the results to

png('/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormHistClustMap.png')
normHistClustArr = matrix(normHist$Clust, nrow=331, ncol=169)
image(lon, lat, normHistClustArr)
dev.off()

png('/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormFutClustMap.png')
normFutClustArr = matrix(normFut$Clust, nrow=331, ncol=169)
image(lon, lat, normFutClustArr)
dev.off()

## Calculate Euclidean distance between cluster centers

normClustCtrEucDist = data.frame(cbind(rep(0,4), rep(0,4), rep(0,4), rep(0,4)))

for(ndx_1 in 1:4) {
  for(ndx_2 in 1:4) {
    normClustCtrEucDist[ndx_1, ndx_2] =
      sqrt(
        (normCluster$centers[ndx_1,1] - normCluster$centers[ndx_2,1]) ^ 2 +
        (normCluster$centers[ndx_1,2] - normCluster$centers[ndx_2,2]) ^ 2 +
        (normCluster$centers[ndx_1,3] - normCluster$centers[ndx_2,3]) ^ 2
      )
  }
}

## Create Fire Regime Euclidean distance map between historical and future

normDiff$HistClust = normHist$Clust
normDiff$FutClust = normFut$Clust
normDiff$EucDist = NA

normDiff$EucDist = mapply(
  function(x, y) ifelse(is.na(y), NA, normClustCtrEucDist[x, y]),
  normDiff$HistClust,
  normDiff$FutClust
)

normDiffEucDistArr = matrix(normDiff$EucDist,nrow=331,ncol=169)
image(lon,lat,normDiffEucDistArr)

## Create NetCDF file and store results there. This yields a
## NetCDF file that can be used independently of R

x = ncdim_def('lon', 'degrees', lon)
## Reverse the order of latitude values
y = ncdim_def('lat', 'degrees', rev(lat))

histFCNCVar = ncvar_def('HistFireCluster', 'classification', list(x,y), -9999, prec='short')
futFCNCVar = ncvar_def('FutFireCluster', 'classification', list(x,y), -9999, prec='short')
eucDistNCVar = ncvar_def('EucDistance', 'NormalizedDistance', list(x,y), 9.9692099683868690e+36, prec='float')

# Reverse the lat order of the values being output
OutHistClust = c(Rev(matrix(normDiff$HistClust,ncol=length(lat),nrow=length(lon)),margin=2))
OutFutClust = c(Rev(matrix(normDiff$FutClust,ncol=length(lat),nrow=length(lon)),margin=2))
OutEucDist = c(Rev(matrix(normDiff$EucDist,ncol=length(lat),nrow=length(lon)),margin=2))

## Output the variables and close the file

ncOut = nc_create('NormClusters.nc', list(histFCNCVar, futFCNCVar, eucDistNCVar))

ncvar_put(ncOut, histFCNCVar, OutHistClust)
ncvar_put(ncOut, futFCNCVar, OutFutClust)
ncvar_put(ncOut, eucDistNCVar, OutEucDist)

nc_close(ncOut)

IMG_20160522_121456546

Background & data used

As a practice, genecological studies such as this one use common gardens to monitor phenotypic measurements of plants from a dispersed set of populations in a common environment. By collecting seed from a variety of populations across the range of the species and growing them in a common garden, the effects of site-to-site variation are better controlled. Ultimately, the phenotypic traits of each plant observed at all of the common gardens are tied back to the seed source population so that we might learn how population-level selective pressures translate into divergent traits (even when these plants are grown in a variety of climates). In theory, planting the same seed source population in multiple climate regimes will illuminate which traits are plastic (i.e., change within one lifetime of a plant) and which traits change over generational time. For example, if a seed source population always has wide leaves no matter which common garden it is planted in, while plants from other seed source populations always have narrow leaves, then it is reasonable to conclude that leaf width changes over generational/evolutionary time. This is important because if plants have been adapting to selective pressures over many generations and are then moved to different conditions, the traits they have developed might be maladapted to the new conditions. My dataset includes 39 population means (4-5 individuals each) for each of sixteen gardens, for several phenotypic traits such as leaf width, crown width, and others.

One of the problems I hope to deal with is how to both quantify and visualize bluebunch wheatgrass phenotypic trait variability across soil gradients. One technique that may be useful is multiple linear regression. One drawback of my dataset, however, is that many soil variables are spatially autocorrelated. This lack of independence violates one of the assumptions of multiple linear regression. To deal with the spatially related variables in my dataset, it seemed appropriate to use a regression method that accounts for this spatial relationship.

Question

Is there a relationship between available water storage (AWS) in the top 25cm of the soil surface and the leaf width in bluebunch wheatgrass populations?

Approach and tools used

In order to make my results easier to interpret, I have restricted my analysis to a single common garden that contains 39 different bluebunch wheatgrass populations.

Tools:

  1. Geographically weighted regression tool in ArcMap
  2. Spatial join tool in ArcMap
  3. Join field tool in ArcMap
  4. Hot spot analysis tool in ArcMap

Steps followed to complete the analysis

  1. Surveyed the gSSURGO database for continuously variable soil metrics that might be relevant to plant growth (e.g., available water storage).
  2. Joined AWS tabular data to gssurgo raster data for the entire study area.
  3. Spatially joined soils tabular data to common garden source population x y coordinates, and plant phenotypic traits data from common garden monitoring.
  4. Selected a common garden that had existing soils data for all seed source populations.
  5. Reduced dataset to only include soils and trait data for the Wahluke common garden in seed zone 1.
  6. Performed geographically weighted regression analysis (an R-based sketch of an equivalent GWR fit appears after this list).
    1. Analysis failed initially due to the wide geographic spread of populations.
    2. Set the minimum number of populations to be used in analysis to 8 (5 was too few, and 10 produced a smaller p-value).
  7. Exported regression results to a .csv table
  8. Joined local regression coefficients to seed source population shape file attribute table.
  9. Symbolized local regression coefficients according to sign (negative = blue, positive = red)
  10. Performed a hotspot analysis on mapped regression coefficients to look for significant clustering
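For anyone wanting to reproduce a comparable local regression outside ArcMap, here is a minimal sketch using the spgwr package in R. This is not the tool I used, and the object names (garden_pts, leaf_width, AWS_25cm) are hypothetical stand-ins for the Wahluke data:

## Hedged sketch of a GWR fit in R with the spgwr package, as an alternative
## to the ArcMap tool used in the steps above. `garden_pts` is a hypothetical
## SpatialPointsDataFrame of seed source populations with columns
## `leaf_width` and `AWS_25cm` (available water storage, top 25 cm).
library(spgwr)

## Choose an adaptive bandwidth (a fraction of neighboring points) by
## cross-validation, roughly analogous to setting a minimum number of
## populations in the ArcMap tool.
bw <- gwr.sel(leaf_width ~ AWS_25cm, data = garden_pts, adapt = TRUE)

## Fit the geographically weighted regression with that bandwidth.
gwr_fit <- gwr(leaf_width ~ AWS_25cm, data = garden_pts, adapt = bw)

## Local AWS_25cm coefficients, one per population, ready to export and
## symbolize by sign (negative = blue, positive = red).
summary(gwr_fit$SDF$AWS_25cm)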

Results

  • Positive relationships between leaf width and available water storage occur in moderately wet areas.
  • Negative relationships between leaf width and available water storage occur in moderately wet to dry areas.
  • Negative coefficients tend to be located near other negative coefficients spatially.
  • Positive coefficients likewise tend to be located near other positive coefficients.

GWR.AWS25.Leafw.WahlukeN1

 

 

neg.GWR.AWS25.Leafw.WahlukeN1.WahlukeN1

 

pos.GWR.AWS25.Leafw.WahlukeN1

Critique

  • Limited to the analysis of one plant trait and one common garden at a time.
  • The resolution and accuracy of the AWS data is questionable.
  • The AWS values may not vary enough to be biologically significant.

Though diving into R can be a bit intimidating, only a basic understanding of R is required to begin analyzing your spatial data. First you need to install the right packages; there are a lot of spatial analysis packages in R, which can be found here: https://cran.r-project.org/web/views/Spatial.html . However, the ones I'll be using are "rgdal", "rgeos", "maptools", "dplyr", "tidyr", and "tmap".
You can install these with the following command (remove the # sign to run it):
#install.packages(c("rgdal", "rgeos", "maptools", "dplyr", "tidyr", "tmap"))

In this example we'll be playing with my data, which you will not have access to, so here is a tutorial that will help you get started if you are really interested: https://cran.r-project.org/doc/contrib/intro-spatial-rl.pdf
First we need to make sure the right libraries (packages) are loaded. If you Google each one of these followed by "in R", you will find the help documentation for the package and what it does. I will mention that the package 'sp' is quite important: it is the basis for all the other spatial analysis packages used here, since it defines the spatial data classes in R. It allows R to recognize and work with spatial vector and raster data.
x <- c("rgdal", "rgeos", "maptools", "dplyr", "tidyr", "tmap")
lapply(x, library, character.only=T)

Now that we have the packages loaded, let's go ahead and read in my data. The first file is a simple .csv of different soil properties; read.csv is one way to import .csv data into R.
##reading in a .csv file with all the attributes
NRCS_data <- read.csv ("Data/NRCS_Data/Agg_NRCS_Data.csv", header=T, stringsAsFactors = F)

Next we are going to use the readOGR function from the rgdal package.
##This is a polygon shapefile
CRA <- readOGR(dsn="Data/NRCS_Data", layer="Final_CRA")
## OGR data source with driver: ESRI Shapefile
## Source: "Data/NRCS_Data", layer: "Final_CRA"
## with 60 features
## It has 4 fields
CRA
## class : SpatialPolygonsDataFrame
## features : 60
## extent : -124.566, -116.463, 41.99208, 46.28576 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0
## variables : 4
## names : OBJECTID, Final_CRA, Shape_Leng, Shape_Area
## min values : 1, 1.1, 2.338222, 0.03185210
## max values : 60, 9.9, 53.866012, 3.83380590

It is important to note that readOGR takes two parameters: dsn= is the folder where the shapefile is saved, and layer= is the name of the shapefile that you want to read in. We are reading in the shapefile "Final_CRA", which is a polygon shapefile, and storing it in R under the name CRA. You can see it is a SpatialPolygonsDataFrame. There are two important slots in this data format: CRA@polygons and CRA@data. @polygons holds the information on the polygons and their coordinates, while @data holds their attribute table.
head(CRA@data, n=3)
## OBJECTID Final_CRA Shape_Leng Shape_Area
## 0 1 1.1 8.832671 0.458048
## 1 2 1.6 17.469240 1.406368
## 2 3 10.1 24.579325 1.115243

You can see there isn't any useful soil information associated with each polygon; however, NRCS_data has interesting information that we want to link to these polygons.
##aggregating my attributes by pH: finding the mean pH for each CRA
agg_NRCS_data <- aggregate(ph_1_to_1_H20~CRA, FUN=mean, data=NRCS_data)
agg_NRCS_data <- rename(agg_NRCS_data, Final_CRA = CRA)
head(agg_NRCS_data, n=3)
## Final_CRA ph_1_to_1_H20
## 1 1.1 5.150000
## 2 1.6 5.153333
## 3 10.1 7.153333

This data has pH values and their corresponding CRA.
Using a package called dplyr, I am able to attach the pH data to the CRA polygon data.
##merging the 2 datasets together by CRA
CRA@data <- left_join(CRA@data, agg_NRCS_data)
## Joining by: "Final_CRA"
## Warning in left_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
CRA@data <- rename(CRA@data, pH_water = ph_1_to_1_H20)
head(CRA@data, n=3)
## OBJECTID Final_CRA Shape_Leng Shape_Area pH_water
## 1 1 1.1 8.832671 0.458048 5.150000
## 2 2 1.6 17.469240 1.406368 5.153333
## 3 3 10.1 24.579325 1.115243 7.153333

Simple: now the polygon dataframe has some interesting attributes associated with each polygon. However, suppose you want to create a point-layer spatial dataframe instead of merging everything into your polygon dataframe; you can do this with the sp package. The function SpatialPointsDataFrame is quite particular in its required parameters. It requires the coordinates to be in matrix format, so I pull the corresponding columns from the NRCS_data dataframe and convert them into a matrix.
##Creates a matrix of lat long coord
NRCS_mat <- with(NRCS_data, cbind(longitude, latitude))
row.names(NRCS_mat)<- 1:nrow(NRCS_mat)
##Takes the projection from the CRA shapefile
CRA.projection <- CRS(print(proj4string(CRA)))
## [1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"
##Creates a spatial points dataframe with the same projection as our CRA shapefile
NRCS <- SpatialPointsDataFrame(coords= NRCS_mat, data = NRCS_data, proj4string = CRA.projection)

These last two lines of code take the projection from my CRA polygon data and use it for my newly formed point dataframe.
Since both spatial dataframes are unprojected (they are in geographic coordinates), I use an EPSG code (found by Googling) that corresponds to Oregon to project them. I create the new, projected spatial dataframes with the function spTransform().
NRCS.projected <- spTransform(NRCS, CRS("+init=epsg:2992"))
CRA.projected <- spTransform(CRA, CRS("+init=epsg:2992"))

Lastly, we want to see what these layers look like, so I use the package tmap for plotting, which works a lot like ggplot2. qtm is just like qplot in ggplot2: I tell it the layer I want to plot and that I want to fill the polygons with the pH value of each.
qtm(shp=CRA, fill="pH_water")

qtm_pH

tm_shape(CRA.projected)+ tm_fill("pH_water")+tm_borders()+tm_shape(NRCS.projected)+ tm_bubbles("red", size = .1, border.col="black")

pH_map_2

The second plotting command actually plots two layers: the first layer is the polygon dataframe and the second is the point dataframe. You can go much further by playing with the customization options in tmap; typing ?tmap will show just how customizable the tmap package is.
Hopefully, this tutorial helped you get started with spatial data in R.