Prologue

This post got kind of out of hand, but it represents my attempt to understand what was going on under the hood of geographically weighted regression (GWR), especially in terms of my own data. I found it personally helpful to back up, start with the basics of multiple regression, and work my way back to GWR to see what it was and was not telling me about my data. Much of this will be review for members of our class, but I hope that you find something of use here.

Introduction

Regression analysis refers to a class of statistical procedures used to assess the degree to which one variable (y) may be predicted by one or more other, independent variables (x1, x2, … , xk). Simple linear regression models use a single continuous variable to predict a response variable, and are written using the familiar equation y = mx + b, where y is the response variable, x is the predictor, m is the slope coefficient of x, and b is the y-intercept of the regression line on a Cartesian grid. Multiple regression models are similar, but use more than one continuous predictor variable. These are typically written as: y = β0 + β1x1 + β2x2 + … + βkxk + ε, where each β is a variable coefficient and ε is the model error.

With the exception of logistic regression, variables on both sides of a regression equation must be continuous (rather than categorical), and the model must meet four key assumptions: (1) the mean of the probability distribution of ε is 0; (2) the variance of the probability distribution of ε is constant; (3) the probability distribution of ε is normal; and (4) the errors associated with any two observations are independent (Mendenhall and Sincich 2003:105-106).
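For readers who want to follow along in R, here is a minimal sketch of what fitting such a model and checking these error assumptions might look like; the data frame and column names (clays, Al, ASTPC1, and so on) are hypothetical placeholders rather than my actual files.

# Fit a multiple regression with several continuous predictors (hypothetical data)
fit <- lm(Al ~ ASTPC1 + ASTPC2 + slope_log + flowacc_log, data = clays)

summary(fit)   # coefficients, R2, and overall F-test

# Quick graphical checks on the four error assumptions
par(mfrow = c(2, 2))
plot(fit)      # residuals vs. fitted (constant variance), Q-Q plot (normality), etc.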

The complexity of multiple regression raises a number of issues, especially when dealing with spatial data. In this tutorial, we will explore statistical approaches to three common problems in multiple regression. First, how do we deal with collinearity between our predictor variables? That is, how do we deal with predictor variables that are not independent of one another? Second, how do we decide which predictor variables to include in our model? And finally, how do we identify geographically patterned deviations from a global regression model in spatial data?

To address the first issue, we will discuss a statistical technique for reducing the dimensionality of a dataset called Principal Components Analysis. To address the second, we will discuss Exploratory Multiple Regression. To address the third, we will discuss Geographically Weighted Regression.

Case Study

To illustrate this discussion, we will be working with three sets of data from my research in Oaxaca. Namely, we will assess the degree to which we can predict concentrations of five major and minor elements in subsurface clay deposits using data derived from remote sensing. For me, these data are of interest because understanding the distribution of compositionally distinct clay resources may help us track ancient pottery production and exchange in the region. For this tutorial, these data serve simply as an example of a complex multivariate spatial problem. Each dataset used in this study is outlined briefly below:

  1. Tlacolula Clays: These data consist of 137 point features from across the eastern arm of the Valley of Oaxaca, the Tlacolula Valley. During the summers of 2007 and 2012, clay samples were collected from each of these locations, exported to OSU, and analyzed using INAA to estimate the elemental composition of each sample across 30 elements. For this exercise, we focus on just five major and minor elements: Aluminum (Al), Calcium (Ca), Sodium (Na), Potassium (K), and Iron (Fe). These will act as our dependent variables.
  2. ASTER Bands: These data consist of spectral reflectance and emissivity measurements for 13 bands of remote sensing imagery extracted from a single tile of ASTER data from a cloud-free day in June of 2001. These include two Visible and Near Infrared (VNIR) bands taken at a spatial resolution of 15 m, six Shortwave Infrared (SWIR) bands taken at a spatial resolution of 30 m, and five Thermal Infrared (TIR) bands taken at a spatial resolution of 90 m. A common use of ASTER data is the mapping and classification of surface geology. If parent material is a good predictor of clay composition in this area, ASTER imagery may serve as a good proxy measure of sediment lithology. These data will therefore act as our first set of predictor variables.
  3. DEM-derived surfaces: Beyond parent material, we may expect the chemical composition of clay deposits to be affected by processes such as soil development, erosion, and alluvial movement of sediments. All of these are difficult to measure directly, but they are heavily affected by factors such as slope that are easily measured using remote sensing. We therefore added a second set of possible predictor variables derived from a single 90 m ASTER DEM: elevation, slope, and flow accumulation. All three of these variables were log-transformed to better approximate a normal distribution.

Software

All of the statistical methods described in this tutorial are available in multiple software packages. I used a combination of ENVI (for manipulation of ASTER data), ArcGIS (for most other spatial analysis and data manipulation), and JMP (for non-spatial data analysis). I cannot speak to how these programs compare to other software with similar capabilities. In fact, I am confident you prefer your favorite software packages over mine 🙂 For this reason, I will emphasize methods and interpretation over software throughout this tutorial.

Principal Components Analysis

A primary concern in multiple regression analyses is the tendency to over-fit a model simply by including too many variables. This is problematic both because it overstates the degree to which a response variable can be predicted by a set of independent variables and because the resulting model becomes extremely difficult to interpret. Our dataset has sixteen predictor variables, which is almost certainly too many for the number of cases we are analyzing.

A common method for reducing the dimensionality of a dataset is Principal Components Analysis (PCA). PCA uses either the correlation or covariance matrix of a dataset to create an orthogonal rotation of new variables, called principal components (PCs), that are by definition uncorrelated. Each PC describes a portion of the overall variance in the dataset, as measured by its eigenvalue, with the first PC describing the principal axis of variability, the second PC a smaller portion of variability, and so on. The contribution of each input variable to a given PC is described by its loading on the corresponding eigenvector.

When choosing PCs for further analyses, it is important to consider both the eigenvalues and the eigenvectors. How many PCs are required to describe the majority of the variability in your dataset? Do the eigenvector loadings on each PC make sense, knowing what you know about the structure of your data?
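As a point of reference, here is a minimal sketch of how a PCA on correlations might be run in R; the aster_bands data frame (one column per ASTER band, one row per sample location) is a hypothetical stand-in, since my own analysis was run in JMP.

# PCA on correlations: scale. = TRUE standardizes each band before rotation
pca <- prcomp(aster_bands, center = TRUE, scale. = TRUE)

# Eigenvalues and the proportion of total variability described by each PC
eigenvalues <- pca$sdev^2
round(eigenvalues / sum(eigenvalues), 3)

# Eigenvector loadings of each band on the first four PCs (compare with Table 2)
round(pca$rotation[, 1:4], 2)

# Scores on the first four PCs, to carry forward as candidate predictor variables
scores <- pca$x[, 1:4]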

To reduce the number of predictor variables in our analysis, data from the thirteen ASTER bands were subjected to a Principal Components Analysis on correlations. DEM-derived predictor variables were excluded from this analysis because a preliminary evaluation of the data showed that they were sufficiently independent from the ASTER bands. All ASTER bands were screened for outliers prior to analysis to reduce the influence of extreme cases.

Results of this analysis are reported as eigenvalues in Table 1 and as eigenvectors in Table 2. Table 1 shows that as much as 90 percent of the variability in the ASTER data is described by the first two PCs, and 97 percent by the first three. However, beyond providing a reduced set of variables, we want our PCs to describe interpretively meaningful axes of variability that may be useful for predicting our response variables. Just because the first PC describes the most variability in the data does not mean that it is going to be the best predictor for all of our elements of interest.

Indeed, Table 2 shows that while the first PC is nearly equally loaded on all ASTER bands, the next three PCs are more intuitively meaningful. PC2 is positively loaded on the TIR bands and negatively loaded on all others. PC3 is positively loaded on the VNIR bands and negatively loaded on the SWIR bands, especially at lower wavelengths. PC4, while accounting for just 1.4 percent of the variability in the data, describes an important contrast between shorter and longer wavelength bands within the SWIR portion of the spectrum. For this reason, we might select the first four PCs as potential predictor variables for subsequent analyses.

Table 1: Results of Principal Components Analysis of 13 ASTER image bands covering the Tlacolula Valley, showing the eigenvalue for each resulting PC, the percent of variability described by each, and the cumulative percent of variability described by aggregate combinations of PCs.

Pink_GWR_T1

Table 2: Results of Principal Components Analysis of 13 ASTER image bands covering the Tlacolula Valley, showing eigenvector loadings on the first four PCs. Positive loadings above 0.30 are shown in red. Negative loadings below -0.30 are shown in blue. These four PCs account for 99% of the variability in the ASTER data.

Pink_GWR_T2

Exploratory Multiple Regression

Now that we have reduced the dimensionality of our dataset to five dependent variables (Al, Ca, Na, K, and Fe) and seven possible predictor variables (four ASTER PCs and three DEM-derived variables), we are faced with the problem of identifying the best possible combination of variables for the prediction of each response variable. This becomes especially problematic when one suspects that there may be some interaction between predictor variables (such as canopy cover and soil moisture content as predictors of fire susceptibility). In these cases, one might wish to consider adding interaction terms, which further adds to the dimensionality of one’s data. For the purposes of this exercise, interaction terms were not considered.

A common method for identifying an optimal combination of predictors is Stepwise Multiple Regression, which allows one to walk through a series of potential models based on iterative improvements in model fit. This method is widely accepted and robust, but carries the disadvantage of only considering a single measure of model fit, such as R2 or AICc.
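For what it is worth, a stepwise search of this kind takes only a few lines in R; this sketch assumes the same hypothetical clays data frame as above and, like base R's step(), relies on AIC as the single selection criterion.

# Stepwise selection (both directions) starting from the full model, guided by AIC
full <- lm(Al ~ ASTPC1 + ASTPC2 + ASTPC3 + ASTPC4 +
             elev_log + slope_log + flowacc_log, data = clays)
step_fit <- step(full, direction = "both", trace = FALSE)
summary(step_fit)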

A more brute-force method for identifying optimal combinations of predictor variables is Exploratory Multiple Regression, which takes advantage of modern computing power to fit every possible combination of predictor variables for each response. The results of these analyses may then be quickly sorted by number of predictor variables and compared across multiple measures of model fit.
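Statistical packages can automate this kind of search (the leaps package in R, for example), but the brute-force logic is simple enough to sketch in base R. As before, the clays data frame and variable names are hypothetical.

predictors <- c("ASTPC1", "ASTPC2", "ASTPC3", "ASTPC4",
                "elev_log", "slope_log", "flowacc_log")

# Fit every possible combination of predictors and collect several measures of fit
results <- do.call(rbind, lapply(seq_along(predictors), function(m) {
  combos <- combn(predictors, m, simplify = FALSE)
  do.call(rbind, lapply(combos, function(vars) {
    fit <- lm(reformulate(vars, response = "Al"), data = clays)
    n   <- nobs(fit)
    k   <- length(coef(fit)) + 1   # estimated parameters, including the error variance
    data.frame(model  = paste(vars, collapse = " + "),
               n_vars = m,
               R2     = summary(fit)$r.squared,
               RMSE   = sqrt(mean(residuals(fit)^2)),
               AICc   = AIC(fit) + (2 * k * (k + 1)) / (n - k - 1),
               BIC    = BIC(fit))
  }))
}))

# Lowest-AICc model for each number of predictors (cf. Table 3)
results <- results[order(results$n_vars, results$AICc), ]
do.call(rbind, by(results, results$n_vars, head, 1))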

Table 3 reports the results of an Exploratory Multiple Regression of all possible combinations of our seven variables as predictors of Al. To conserve space in the blogosphere, only the best-fit models for combinations of one through seven variables are reported here. Predictably, R2 tends to increase as we add variables to the model. However, Root Mean Square Error (RMSE), corrected Akaike Information Criterion (AICc), and Bayesian Information Criterion (BIC) scores are lowest for models with four to five variables. In multiple regression, AICc and BIC scores are used to evaluate competing models by punishing complexity while rewarding fit (lower scores are better). The models reported in Table 3 highlight the problem with using a single criterion for model selection; while both AICc and BIC are lowest for the model that uses five variables, only four of these variables are individually significant at α = 0.05. For this reason, I have selected the four-variable best-fit model for Al as the final model that I will use in further analyses.

Table 4 shows the final best-fit models chosen for all five elements using Exploratory Multiple Regression. In general, the relationship between our predictor variables and individual elements is weak, with R2 values ranging from 0.116 for K to 0.296 for Al. All final combinations of variables were significant predictors of each element, as were the individual variables included in each model.

Table 3: Best-fit least squares regression models predicting the elemental concentration of Al in Tlacolula Valley clays using one to seven variables. Predictor variables were chosen from all possible combinations of models generated using 7 variables: ASTER PC1, ASTER PC2, ASTER PC3, ASTER PC4, Log10 elevation, Log10 slope, and Log10 flow accumulation.

Pink_GWR_T3

Table 4: Best-fit least squares regression models predicting the elemental concentration of five major and minor elements in Tlacolula Valley clays. Predictor variables were chosen from all possible combinations of models generated using 7 variables: ASTER PC1, ASTER PC2, ASTER PC3, ASTER PC4, Log10 elevation, Log10 slope, and Log10 flow accumulation.

Pink_GWR_T4

Notably, different combinations of variables were identified as the best predictors for each element. ASTPC1 (weakly loaded on all ASTER bands) was included in all models except that for K. ASTPC2 (positively loaded on the TIR bands) was included in models predicting Al and Ca, but not Na, K, or Fe. Surprisingly, ASTPC3 (positively loaded on the VNIR bands) was only included in the final model for Na, while ASTPC4 (positively loaded on low wavelength SWIR bands and negatively loaded on high wavelength SWIR bands, but describing just 1.4% of the variability in the data) was included in final models for Ca, Na, and Fe. Slope was included in final models for all elements except K, but has a positive coefficient only for Ca. Flow Accumulation was included as a negative coefficient in final models for Al, K, and Fe, indicating that these elements tend to decrease in areas with higher potential runoff. Elevation was not included in any of the five final models.

A troubling issue with the final models that we have selected is their relatively poor predictive power. Generally speaking, it appears that even the best combinations of our predictor variables only account for about 20 to 30 percent of the variability in the elemental composition of clay deposits in the Tlacolula Valley. This raises the question of whether the relationships we have identified between our predictor variables and individual elements hold true across the study area, or whether our models are driven by strong deviations from the global fit at finer spatial scales. To address this issue, we may turn to Geographically Weighted Regression.

Geographically Weighted Regression

Over the course of this term, a number of us have attempted to use Geographically Weighted Regression (GWR) to generate prediction surfaces for particular response variables. This was in fact the original intent of my project. However, a common problem that we ran into was the difficulty of interpreting the results of geographically weighted regression models. Is a GWR model with a higher R2 than an ordinary least squares (OLS) model that uses the same variables a better model? What does it mean when local coefficients are positive in one area and negative in another? Should areas with poor local fit be excluded from output prediction surfaces?

My reason for writing this back-to-basics blog post on multiple regression stems from my current understanding of the intended use of GWR. As it turns out, Geographically Weighted Regression is primarily intended as a technique for exploring local deviations from global fit resulting from non-stationarity in one's data (Fotheringham et al. 2002). An implication of this is that GWR should be conducted only as a post-hoc analysis following selection of a final model using Ordinary Least Squares (OLS) regression. In fact, the help documentation for Geographically Weighted Regression in ArcGIS explicitly advises one to do this. A corollary to this implication might be that GWR is not really intended to generate prediction surfaces.

As discussed by multiple members of the class, GWR operates by generating a series of local fits to each feature in a spatial dataset, passing a Gaussian kernel of either fixed or adaptive bandwidth across one's data. For this exercise, I conducted a GWR for each of our five elements of interest using the final model variables selected through Exploratory Multiple Regression. Table 5 summarizes the quasi-global R2 fit of the GWR models relative to the R2 values of OLS models using the same variables. For all five elements, R2 values are greater for the GWR models than for the OLS models, but the size of the improvement differs by element. For Al and Ca, GWR models yield minimal improvements in fit over OLS models. By contrast, quasi-global GWR R2 values for Na and Fe are close to twice as high as their OLS counterparts.
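For anyone who would rather stay outside ArcGIS, here is a rough sketch of the same kind of analysis using the spgwr package in R. The clays_sp object (a SpatialPointsDataFrame of the clay samples carrying the response and the selected predictors) is a hypothetical stand-in, and ArcGIS makes slightly different default choices, so the numbers would not match exactly.

library(spgwr)

# Cross-validated selection of an adaptive bandwidth (a proportion of nearest neighbors)
bw <- gwr.sel(Al ~ ASTPC1 + ASTPC2 + slope_log + flowacc_log,
              data = clays_sp, adapt = TRUE)

# Fit the GWR model; hatmatrix = TRUE returns local diagnostics such as local R2
gwr_al <- gwr(Al ~ ASTPC1 + ASTPC2 + slope_log + flowacc_log,
              data = clays_sp, adapt = bw, hatmatrix = TRUE)

gwr_al                   # printed summary includes the quasi-global R2
head(gwr_al$SDF@data)    # local coefficients, local R2, and predictions for each sample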

Table 5: Quasi-global R2 fit of geographically weighted regression (GWR) models relative to those of best-fit ordinary least squares (OLS) models predicting the concentration of five major and minor elements in Tlacolula Valley clays.

Pink_GWR_T5

But how do we interpret this information? My best answer is that if GWR is primarily intended to check for stationarity, then local measures of fit and local coefficients are much more important to our interpretation than the global fit. If we map the local R2 for each feature included in a given model, we find that the quasi-global R2 acts as a kind of ceiling on fit; in these data, local R2 values are always lower than the quasi-global R2. Examining the spatial distribution of local coefficients and measures of fit allows one to identify whether non-stationarity is a potential issue affecting the global fit of the OLS model.
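Continuing the hypothetical spgwr sketch from above, those local diagnostics can be mapped directly from the SDF slot of the fitted object; this is how one could produce point maps along the lines of Figures 1-5.

library(sp)

# Map the local R2 at each sample point
spplot(gwr_al$SDF, "localR2")

# Map a local coefficient surface, e.g. the local coefficient on ASTPC2
spplot(gwr_al$SDF, "ASTPC2")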

Figures 1-5 show local R2 values for the GWR models for Al, Ca, Na, K, and Fe, respectively. As we might expect, non-stationarity is an issue for all five models. Interestingly, the results for Al (Fig. 1) and Ca (Fig. 2) are nearly identical, despite being based on different dependent variables. Both show higher levels of local fit to the west and lower levels of fit to the east. The results for Na (Fig. 3) are similar; local fit is highest in the west, and lowest just to the east of this area, on the other side of the hill. Local fit for K (Fig. 4) is low across the map, but is highest in the east and lowest in the west. Local fit for Fe (Fig. 5) is substantially greater in the west than in the central Tlacolula Valley. But again, how do we interpret this information?

GWR_Al

Figure 1 (above): Local fit of GWR for Aluminum.

GWR_Ca

Figure 2 (above): Local fit of GWR for Calcium.

GWR_Na

Figure 3 (above): Local fit of GWR for Sodium.

GWR_K

Figure 4 (above): Local fit of GWR for Potassium.

GWR_Fe

Figure 5 (above): Local fit of GWR for Iron.

Local measures of R2 simply tell us the strength of the relationship between local variability in the predictors and local variability in the response. If either our predictor variables or our response variable is spatially autocorrelated, we may expect low local variability, which in turn tends to produce poor local fit. Higher local variability in either the predictors or the response will not necessarily yield a stronger local fit, but where a relationship between the variables exists at the global scale, it certainly helps.

To illustrate, if we map raw concentrations of Ca and Fe (Figs. 6 and 7) in clay samples collected from across the Tlacolula Valley, we can see that local variability in each is highest in the west and lowest in the central and eastern portions of the study area. In both cases, the high local R2 in the west simply reflects greater local variability in these elements in this area. For Ca, this appears to be driven by local contrasts between extreme values associated with limestone bedrock and nearby alluvial sediments with more typical values. In areas with less local variability in Ca concentrations, variability in spectral reflectance/emissivity and slope is less predictive.

GWR_CaPPM

Figure 6 (above): Calcium concentrations in Tlacolula clay samples (ppm).

GWR_FePPM

Figure 7 (above): Iron concentrations in Tlacolula clay samples (ppm).

Punchline

Results of exploratory multiple regression have shown that concentrations of individual elements in Tlacolula clay deposits are weakly correlated with spectral reflectance, emissivity, slope, and flow accumulation at the global scale of the study. At a finer scale, the results of GWR suggest that clay chemistry is poorly predicted by these variables, except in areas with high local variability in a given response element. This suggests that the weak global relationships observed through OLS regression may be driven in large part by spatially clustered extreme values, violating the assumption of stationarity. GWR has thus provided a robust check on the assumptions of our OLS regression models, as well as a clearer view of the factors influencing Tlacolula clay chemistry.

References

Fotheringham, A. S., Brunsdon, C., and Charlton, M. E. (2002). Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Wiley, Chichester.

Mendenhall, W., and Sincich, T. (2003). Regression Analysis: A Second Course in Statistics. Sixth Edition. Pearson Education, London.

Introduction

In my last post, I discussed the importance of quantifying social contacts to predict microbe transmission between individuals in our study herd of African Buffalo. This post deals with the other half of the equation: environmental transmission. My overarching hypothesis is that spatial and social behaviors are important for transmission of host-associated microbes. The goal of this pilot project is to establish a method for quantifying habitat overlap between individuals in the herd so that I can (eventually) compare habitat overlap with microbe community similarity and infection risk.

Methods/Results

Workflow:

I used the following outline to determine pairwise habitat overlap between a “test” pair of buffalo:

  1. Project GPS points in ArcMap
  2. Clip GPS points to the enclosure boundary layer
  3. Use GME to generate kernel density estimates and 95% isopleths
  4. Back in ArcMap, use the Identity tool to calculate the area of overlap between the two individuals
  5. Compute percent overlap in Excel

Geospatial Modelling Environment (GME):

GME is a free geospatial modelling package that combines the computational power of R with the functionality of ESRI ArcGIS to drive geospatial analyses. A few of the functions that are especially useful for analyzing animal movement are: kernel density estimation, minimum convex polygons, movement parameters (step length, turn angle, etc.), converting locations to paths and paths to points, and generating simulated random walks. For this analysis, I used GME to generate KDEs and estimate home ranges using 95% isopleths.

Kernel density estimates and isopleth lines

Kernel densities can be conceptualized as 3-dimensional surfaces whose height reflects the local density of points. Isopleth lines can be drawn in GME from a raster dataset (such as a kernel density surface) and can be set to enclose a specified proportion of the total density volume. For this analysis, I was interested in calculating 95% isopleths based on the kernel densities of the GPS points. In practical terms, the 95% isopleth outlines the area in which an animal spends the vast majority of its time.
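GME handles this step well, but for anyone who prefers to stay in R, the adehabitatHR package can produce the same sort of kernel home-range estimates. This is only a rough sketch, and the object name (buffalo_pts, a projected SpatialPointsDataFrame with a single ID column) is hypothetical.

library(adehabitatHR)
library(sp)

# Kernel utilization distribution for each animal, using the reference bandwidth
ud <- kernelUD(buffalo_pts[, "ID"], h = "href")

# 95% isopleth home ranges as polygons (one per animal)
hr95 <- getverticeshr(ud, percent = 95)

# Pairwise overlap of the 95% home ranges; the matrix is asymmetric because
# each animal's overlap is expressed relative to its own home-range area
kerneloverlaphr(ud, method = "HR", percent = 95)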

Using the identity tool to compute overlap

After generating home range estimates for two individuals, I loaded the resulting shapefiles into ArcMap and used the Identity tool to overlay the home ranges. To use the Identity tool, you input one identity feature and one or more input features. Arc then computes a geometric intersection between the input features and the identity feature and merges their attributes in the output.
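The intersection and percent-overlap steps could also be scripted, which would save the round trip through Excel. Here is a minimal sketch using the sf package in R, assuming the two 95% isopleth shapefiles exported from GME (the file names are hypothetical).

library(sf)

hr01 <- st_read("animal01_iso95.shp")   # 95% isopleth for animal 1
hr13 <- st_read("animal13_iso95.shp")   # 95% isopleth for animal 13

# Geometric intersection of the two home ranges
shared <- st_intersection(hr01, hr13)

# Overlap is asymmetric: the shared area relative to each animal's own range
as.numeric(sum(st_area(shared)) / sum(st_area(hr01)))   # proportion of animal 1's range
as.numeric(sum(st_area(shared)) / sum(st_area(hr13)))   # proportion of animal 13's range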

overlap

The map above shows the 95% isopleths for animal 1 (yellow) and animal 13 (purple), and the area of overlap computed using the Identity tool (red outline). I exported the output table to Excel, where I calculated percent overlap between the animals.

Conclusion

Overall, this method seems like it will be great for estimating habitat overlap. A few things that I’m concerned about are:

(a) Habitat use may be so similar for all animals that overlap cannot be related to differences in microbe transmission.

(b) Habitat use may correlate very strongly with contacts, in which case it will be difficult to control for the effects of contacts on microbe transmission.

(c) Percent overlap can be different for each individual in a pair. In my example, #13 overlapped #1 by ~80%, while #1 overlapped #13 by 90%.

I just want to be aware of these potential issues and start thinking about how to deal with them as they arise. Any suggestions would be appreciated, as always!

Introduction 

My research goal is to determine the effects of social and spatial behavior on disease transmission and nasal microbiome similarity. The overarching hypothesis of this project is that the composition of microbial communities depends on host spatial and social behavior because microbe dispersal is limited by host movement. My research group studies a GPS-collared herd of African buffalo in Kruger National Park that lives in a 900-hectare, predator-free enclosure. In conjunction with the GPS collars, each animal wears a contact collar that records the length of contact and the ID of any animal that comes within ~1.5 meters. For this class, however, I focused on the GPS collars in an effort to test whether they alone are sufficient for inferring contact networks.

 

The purpose of this step in my research was to establish a method for creating distance matrices between animals at different time points using only GPS data. I wanted to determine whether this is a viable option for inferring contact networks and could be used in lieu of the contact collars. If this method is effective, my plan was to iterate through multiple time points to determine average distances between animals over time.

Results from this pilot study showed that the method is feasible; however, GPS collars would be less effective for inferring a contact network than the contact collars because temporal resolution is sacrificed.

 

Data Manipulation in R

The raw data from the buffalo GPS collars come as individual text files, one file per individual per time point. Step one was to generate a single csv file that includes all individuals at one time point. This was done in R using the following code:

## Load necessary packages

library(reshape2)
library(ggplot2)

## Set working directory and import every individual's file from the capture folder

setwd("C:\\Users\\couchc\\Documents\\GPS_captures_1215\\GPS_capture_12_15")
SAVE_PATH <- getwd()
BASE_PATH <- getwd()

CSV.list <- list.files(BASE_PATH, pattern = glob2rx("*.csv"))

## Read each individual's file and stack them into a single data frame
all.csv <- do.call(rbind, lapply(CSV.list, read.csv, header = TRUE))

names(all.csv)

## Keep just the columns we need: animal ID, coordinates, and time stamps
buffalo <- data.frame(ID = all.csv$title, X = all.csv$LATITUDE, Y = all.csv$LONGITUDE,
                      Day = all.csv$MM.DD.YY, Hour = all.csv$HH.MM.SS,
                      s.Date = all.csv$s.date, s.Hour = all.csv$s.hour)

## Now we want to melt and cast the data based on animal ID:
## where the previous data frame had ID as a column, each ID now gets its own
## pair of X/Y columns, and each row is a single time point.
melted.all.csv <- melt(buffalo, id.vars = c("ID", "Day", "Hour"),
                       measure.vars = c("X", "Y"))
names(melted.all.csv)

casted.all.csv <- dcast(melted.all.csv, Day + Hour ~ ID + variable)

## Quick checks: number of GPS fixes, and a plot of one animal's locations
nrow(buffalo)
ggplot(buffalo[buffalo$ID == buffalo$ID[1], ], aes(x = X, y = Y)) +
  geom_point()

 

Distance Matrix

Because the GPS collars are not synchronized, the 30-minute time intervals can be offset quite a bit, and there are quite a few missing intervals. To correct for this, I rounded times to the nearest 30 minutes in Excel. I then imported the new Excel file back into R to generate a "test" distance matrix for a single time point using the following code:

-Show code and output once I’m able to run it on a lab computer (My computer doesn’t have the processing power to do it in a reasonable time).
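In the meantime, here is a rough sketch of the kind of code I have in mind, assuming the rounded data have been read back into the buffalo data frame from above and that the coordinates are projected (so that Euclidean distances are in meters); all of the names here are placeholders.

## Pull out a single (rounded) 30-minute time point
one.time <- buffalo[buffalo$Day == buffalo$Day[1] & buffalo$Hour == buffalo$Hour[1], ]

## Pairwise distance matrix between all animals recorded at that time point,
## labelled by animal ID
dist.mat <- as.matrix(dist(one.time[, c("X", "Y")]))
dimnames(dist.mat) <- list(one.time$ID, one.time$ID)
dist.mat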

 

Conclusion

This method showed some interesting results: inter-individual distances can be calculated in R using GPS collar data, and if the process were iterated over multiple time points, average pairwise distances could be computed between each pair of individuals. A Mantel test could then be used to test for correlation between that distance matrix and a matrix of microbe community similarity. The problem with rounding time points to the nearest 30 minutes is that it doesn't guarantee that the buffalo ever actually come within the distance reported by the matrix. Since some of the collars are offset by up to 15 minutes, the animals could occupy the recorded locations at different times and never actually come into contact with each other. The benefit of using GPS collars to infer social behavior is that they give us actual distances between individuals, adding an element of spatial resolution that the contact collars do not provide; contact collars only register that another animal is within a set distance, without recording what that distance is. However, because the goal of this pilot project is to predict contact transmission of microbes, we are not interested in distances beyond a certain threshold, and recording contacts within a threshold is exactly what the contact collars do. The distance matrices could still prove useful for other behavioral questions, but we would need much higher temporal resolution in the GPS data to build matrices precise enough to infer contact networks, and that job is more easily done with the contact collars.
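For reference, once both matrices exist, the Mantel test itself is a one-liner with the vegan package; this sketch assumes an average-distance matrix (avg.dist.mat) and a matrix of microbe community dissimilarities (microbe.dist.mat) with matching row and column order, both of which are hypothetical at this stage.

library(vegan)

## Mantel test for correlation between average spatial distance and microbe community dissimilarity
mantel(as.dist(avg.dist.mat), as.dist(microbe.dist.mat),
       method = "pearson", permutations = 999)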

 

Overall, this pilot project was useful because it demonstrated some of the potential benefits and drawbacks of using GPS intervals to infer contact networks. Although I do not plan to use the method described in this post for my dissertation research, it may be beneficial in future studies of animal behavior or herd structure. The exercise was very useful for improving my confidence in R — with lots of help from my classmates of course!