Prologue

This post got kind of out-of-hand, but it represents my attempt to understand what was going on under the hood of geographically-weighted regression (GWR), especially in terms of my own data. I found it personally helpful to back up, start with the basics of multiple regression, and work my way back to GWR to see what it was and was not telling me about my data. Much of this will be review for members of our class, but I hope that you find something of use here.

Introduction

Regression analysis refers to a class of statistical procedures used to assess the degree to which one variable (y) may be predicted by one or more other, independent variables (x1, x2, … , xk). Simple linear regression models use a single continuous variable to predict a response variable and are written using the familiar equation y = mx + b, where y is the response variable, x is the predictor, m is the slope coefficient of x, and b is the y-intercept of the regression line on a Cartesian grid. Multiple regression models are similar, but use more than one continuous predictor variable. These are typically written as y = β0 + β1x1 + β2x2 + … + βkxk + ε, where each β is a variable coefficient and ε is the model error.
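
To make the notation concrete, here is a minimal sketch of fitting a multiple regression model of this form in Python with statsmodels. The data are simulated and the column names (aster_pc1, log_slope, al_ppm) are hypothetical stand-ins, not the actual Oaxaca data.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Simulate two predictors and a response with known coefficients plus error (ε)
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "aster_pc1": rng.normal(size=100),
        "log_slope": rng.normal(size=100),
    })
    df["al_ppm"] = 2.0 + 0.5 * df["aster_pc1"] - 1.2 * df["log_slope"] + rng.normal(scale=0.8, size=100)

    X = sm.add_constant(df[["aster_pc1", "log_slope"]])  # adds the intercept term (β0)
    model = sm.OLS(df["al_ppm"], X).fit()
    print(model.params)     # estimated β0, β1, β2
    print(model.rsquared)   # global R2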

With the exception of logistic regression, variables on both sides of a regression equation must be continuous (rather than categorical) and must meet four key assumptions: (1) the mean of the probability distribution of ε is 0; (2) the variance of the probability distribution of ε is constant; (3) the probability distribution of ε is normal; and (4) the errors associated with any two observations are independent (Mendenhall and Sincich 2003:105-106).
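
Continuing from the model object fitted in the sketch above, a rough check of assumptions (1) through (3) on the residuals might look like the following; the Shapiro-Wilk test and the residuals-versus-fitted plot are common conventions, not necessarily what was used in the original analysis.

    import matplotlib.pyplot as plt
    from scipy import stats

    resid = model.resid
    print(resid.mean())           # assumption 1: mean of the errors should be ~0
    print(stats.shapiro(resid))   # assumption 3: test of normality of the errors
    # assumption 2: residuals vs. fitted values; a funnel shape suggests non-constant variance
    plt.scatter(model.fittedvalues, resid)
    plt.axhline(0, color="grey")
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()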

The complexity of multiple regression raises a number of issues, especially when dealing with spatial data. In this tutorial, we will explore statistical approaches to three common problems in multiple regression. First, how do we deal with collinearity among our predictor variables? That is, how do we deal with predictors that are not independent of one another? Second, how do we decide which predictor variables to include in our model? And finally, how do we identify geographically patterned deviations from a global regression model in spatial data?

To address the first issue, we will discuss a statistical technique for reducing the dimensionality of a dataset called Principal Components Analysis. To address the second, we will discuss Exploratory Multiple Regression. To address the third, we will discuss Geographically Weighted Regression.

Case Study

To illustrate this discussion, we will be working with three sets of data from my research in Oaxaca. Namely, we will assess the degree to which we can predict concentrations of five major and minor elements in subsurface clay deposits using data derived from remote sensing. For me, these data are of interest because understanding the distribution of compositionally distinct clay resources may help us track ancient pottery production and exchange in the region. For this tutorial, these data serve simply as an example of a complex multivariate spatial problem. Each dataset used in this study is outlined briefly below:

  1. Tlacolula Clays: These data consist of 137 point features from across the eastern arm of the Valley of Oaxaca, the Tlacolula Valley. During the summers of 2007 and 2012, clay samples were collected from each of these locations, exported to OSU, and analyzed using INAA to estimate the elemental composition of each sample across 30 elements. For this exercise, we focus on just five major and minor elements: Aluminum (Al), Calcium (Ca), Sodium (Na), Potassium (K), and Iron (Fe). These will act as our dependent variables.
  2. ASTER Bands: These data consist of spectral reflectance and emissivity measurements for 13 bands of remote sensing imagery extracted from a single tile of ASTER data from a cloud-free day in June of 2001. These include two Visible and Near Infrared (VNIR) bands taken at a spatial resolution of 15 m, six Short-wave Infrared (SWIR) bands taken at a spatial resolution of 30 m, and five Thermal Infrared (TIR) bands taken at a spatial resolution of 90 m. A common use of ASTER data is the mapping and classification of surface geology. If parent material is a good predictor of clay composition in this area, ASTER imagery may serve as a good proxy measure of sediment lithology. These data will therefore act as our first set of predictor variables.
  3. DEM-derived surfaces: Beyond parent material, we may expect the chemical composition of clay deposits to be affected by processes such as soil development, erosion, and alluvial movement of sediments. All of these are difficult to measure directly, but they are heavily affected by factors such as slope, which are easily measured using remote sensing. We therefore added a second set of possible predictor variables to our analysis, derived from a single ASTER 90 m DEM. These include elevation, slope, and flow accumulation. All three of these variables were log-transformed to better approximate a normal distribution (a sketch of this transform follows the list).
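
As a side note, the log10 transform mentioned for the DEM-derived variables might look like the short sketch below. The column names and values are hypothetical, and the +1 offset for flow accumulation (which can be zero) is one common convention, not necessarily the one used in the original analysis.

    import numpy as np
    import pandas as pd

    # Hypothetical DEM-derived values extracted at each clay sample location
    dem = pd.DataFrame({
        "elevation": [1550.0, 1620.0, 1710.0],
        "slope_deg": [2.5, 8.1, 0.4],
        "flow_acc":  [0.0, 35.0, 420.0],
    })
    dem["log_elevation"] = np.log10(dem["elevation"])
    dem["log_slope"]     = np.log10(dem["slope_deg"])
    dem["log_flow_acc"]  = np.log10(dem["flow_acc"] + 1.0)  # offset avoids log10(0)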

Software

All of the statistical methods described in this tutorial are available in multiple software packages. I used a combination of ENVI (for manipulation of ASTER data), ARCGIS (for most other spatial analysis and data manipulation), and JMP (for non-spatial data analysis). I cannot speak to how these programs compare to other software with similar capabilities. In fact, I am confident you prefer your favorite software packages over mine 🙂 For this reason, I will emphasize methods and interpretation over software throughout this tutorial.

Principal Components Analysis

A primary concern in multiple regression analyses is the tendency to over-fit a model simply by including too many variables. This is problematic both because it overstates the degree to which a response variable can be predicted from a series of independent variables and because overly complex models become extremely difficult to interpret. Our dataset has sixteen predictor variables, which is almost certainly too many for the number of cases we are analyzing.

A common method for reducing the dimensionality of a dataset is Principal Components Analysis (PCA). PCA uses either the correlation or covariance matrix of a dataset to create an orthogonal rotation of the data into a set of new variables, called principal components (PCs), that are by definition uncorrelated. Each PC describes a portion of the overall variance in the dataset, as measured by its eigenvalue, with the first PC describing the principal axis of variability, the second PC a smaller portion of variability, and so on. The contribution of each input variable to a given PC is described by its loading in the corresponding eigenvector.

When choosing PCs for further analyses, it is important to consider both the eigenvalues and the eigenvectors. How many PCs are required to describe the majority of the variability in your dataset? Do the eigenvector loadings on each PC make sense, knowing what you know about the structure of your data?
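
For readers who want to see the mechanics, here is a minimal sketch of a PCA on correlations in Python with scikit-learn (I used JMP for the actual analysis; this is purely illustrative, and the band values below are randomly generated placeholders). Standardizing the bands first means the correlation, rather than covariance, matrix drives the rotation; the eigenvalues and eigenvector loadings can then be inspected directly.

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    bands = pd.DataFrame(rng.normal(size=(137, 13)),
                         columns=[f"aster_b{i}" for i in range(1, 14)])

    Z = StandardScaler().fit_transform(bands)      # z-scores -> PCA on the correlation matrix
    pca = PCA().fit(Z)

    print(pca.explained_variance_)                 # eigenvalues for each PC
    print(pca.explained_variance_ratio_.cumsum())  # cumulative proportion of variability
    loadings = pd.DataFrame(pca.components_.T,     # eigenvector loadings of each band on each PC
                            index=bands.columns,
                            columns=[f"PC{i+1}" for i in range(13)])
    print(loadings[["PC1", "PC2", "PC3", "PC4"]])
    scores = pca.transform(Z)[:, :4]               # PC scores to carry forward as predictors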

To reduce the number of predictor variables in our analysis, data from the thirteen ASTER bands were subjected to a Principal Components Analysis on correlations. DEM-derived predictor variables were excluded from this analysis because a preliminary evaluation of the data showed that they were sufficiently independent from the ASTER bands. All ASTER bands were screened for outliers prior to analysis to reduce the influence of extreme cases.

Results of this analysis are reported as eigenvalues in Table 1 and as eigenvectors in Table 2. Table 1 shows that as much as 90 percent of the variability in the ASTER data is described by the first two PCs, and 97 percent by the first three. However, beyond providing a reduced set of variables, we want our PCs to describe interpretively meaningful axes of variability that may be useful for predicting our response variables. Just because the first PC describes the most variability in the data does not mean that it is going to be the best predictor for all of our elements of interest.

Indeed, Table 2 shows that while the first PC is nearly equally loaded on all ASTER bands, the next three PCs are more intuitively meaningful. PC2 is positively loaded on the TIR bands and negatively loaded on all others. PC3 is positively loaded on the VNIR bands and negatively loaded on SWIR bands, especially at lower wavelengths. PC4, while accounting for just 1.4 percent of the variability in the data, describes an important contrast between shorter and longer wave bands within the SWIR portion of the spectrum. For this reason, we might select the first four PCs as potential predictor variables for subsequent analyses.

Table 1: Results of Principal Components Analysis of 13 ASTER image bands covering the Tlacolula Valley showing eigenvalues for each resulting PC, the percent of variability described by each, and the cumulative percent variability described by aggregate combinations of PCs.

Pink_GWR_T1

Table 2: Results of Principal Components Analysis of 13 ASTER image bands covering the Tlacolula Valley showing eigenvector loadings on the first 4 PCs. Positive loadings above 0.30 are shown in red. Negative loadings below -0.30 are shown in blue. These four PCs account for 99% of the variability in the ASTER data.

Pink_GWR_T2

Exploratory Multiple Regression

Now that we have reduced our dataset to five dependent variables (Al, Ca, Na, K, and Fe) and seven possible predictor variables (four ASTER PCs and three DEM-derived variables), we are faced with the problem of identifying the best possible combination of variables for the prediction of each response variable. This becomes especially problematic when one suspects that there may be some interaction between predictor variables (such as canopy cover and soil moisture content as predictors of fire susceptibility). In these cases, one might wish to consider adding interaction terms, which further adds to the dimensionality of one’s data. For the purposes of this exercise, interaction terms were not considered.

A common method for identifying an optimal combination of predictors is Stepwise Multiple Regression, which allows one to walk through a series of potential models based on iterative improvements in model fit. This method is widely accepted and robust, but carries the disadvantage of only considering a single measure of model fit, such as R2 or AICc.

A more brute-force method for identifying optimal combinations of predictor variables is Exploratory Multiple Regression, which takes advantage of modern computing power to fit every possible combination of predictor variables for each response. The results of these analyses may then be quickly sorted by the number of predictor variables and by multiple measures of model fit.
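
A bare-bones version of this brute-force search is straightforward to write. The sketch below fits every combination of predictors with statsmodels and tabulates R2, RMSE, AICc, and BIC for each model; with seven candidate predictors this is only 127 models. The explore() and aicc() helpers are my own illustrative names (the AICc shown is the standard small-sample correction of AIC), and the data frame and column names in the usage comment are hypothetical.

    from itertools import combinations
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    def aicc(fit):
        # small-sample corrected AIC: AIC + 2k(k+1)/(n-k-1),
        # with k counted here as the intercept plus slope terms
        k = fit.df_model + 1
        n = fit.nobs
        return fit.aic + (2 * k * (k + 1)) / (n - k - 1)

    def explore(df, response, predictors):
        rows = []
        for r in range(1, len(predictors) + 1):
            for combo in combinations(predictors, r):
                X = sm.add_constant(df[list(combo)])
                fit = sm.OLS(df[response], X).fit()
                rows.append({"model": " + ".join(combo),
                             "n_vars": r,
                             "R2": fit.rsquared,
                             "RMSE": np.sqrt(fit.mse_resid),
                             "AICc": aicc(fit),
                             "BIC": fit.bic})
        return pd.DataFrame(rows).sort_values("AICc")

    # e.g. explore(clays, "Al", ["ASTPC1", "ASTPC2", "ASTPC3", "ASTPC4",
    #                            "log_elevation", "log_slope", "log_flow_acc"])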

Table 3 reports the results of an Exploratory Multiple Regression of all possible combinations of our seven variables as predictors of Al. To conserve space in the blogosphere, only best-fit models for combinations of one through seven variables are reported here. Predictably, R2 tends to increase as we add variables to the model. However, Root Mean Square Error (RMSE), corrected Akaike Information Criterion (AICc), and Bayesian Information Criterion (BIC) scores are lowest for models with four to five variables. In multiple regression, AICc and BIC scores are used to evaluate competing models by punishing complexity while rewarding fit (lower scores are better). The models reported in Table 3 highlight the problem with using a single criterion for model selection; while both AICc and BIC are lowest for the model that uses five variables, only four of these variables are individually significant at α = 0.05. For this reason, I have selected the four-variable best-fit model for Al as the final model that I will use in further analyses.

Table 4 shows the final best-fit models chosen for all five elements using Exploratory Multiple Regression. In general, the relationship between our predictor variables and individual elements is weak, with R2 values ranging from 0.116 for K to 0.296 for Al. All final combinations of variables were significant predictors of each element, as were the individual variables included in each model.

Table 3: Best-fit least squares regression models predicting the elemental concentration of Al in Tlacolula Valley clays using one to seven variables. Predictor variables were chosen from all possible combinations of models generated using 7 variables: ASTER PC1, ASTER PC2, ASTER PC3, ASTER PC4, Log10 elevation, Log10 slope, and Log10 flow accumulation.

Pink_GWR_T3

Table 4: Best-fit least squares regression models predicting the elemental concentration of five major and minor elements in Tlacolula Valley clays. Predictor variables were chosen from all possible combinations of models generated using 7 variables: ASTER PC1, ASTER PC2, ASTER PC3, ASTER PC4, Log10 elevation, Log10 slope, and Log10 flow accumulation.

Pink_GWR_T4

Notably, different combinations of variables were identified as the best predictors for each element. ASTPC1 (nearly equally loaded on all ASTER bands) was included in all models except that for K. ASTPC2 (positively loaded on the TIR bands) was included in models predicting Al and Ca, but not Na, K, or Fe. Surprisingly, ASTPC3 (positively loaded on the VNIR bands) was only included in the final model for Na, while ASTPC4 (positively loaded on low wavelength SWIR bands and negatively loaded on high SWIR bands, but describing just 1.4% of the variability in the data) was included in final models for Ca, Na, and Fe. Slope was included in final models for all elements except K, but has a positive coefficient only for Ca. Flow Accumulation was included as a negative coefficient in final models for Al, K, and Fe, showing that these elements tend to decrease in areas with higher potential runoff. Elevation was not included in any of the five final models.

A troubling issue with the final models that we have selected is their relatively poor predictive power. Generally speaking, it appears that even the best combinations of our predictor variables only account for about 20 to 30 percent of the variability in the elemental composition of clay deposits in the Tlacolula Valley. This raises the question of whether the relationships we have identified between our predictor variables and individual elements hold true across the study area, or whether our models are driven by strong deviations from the global fit at finer spatial scales. To address this issue, we may turn to Geographically Weighted Regression.

Geographically Weighted Regression

Over the course of this term, a number of us have attempted to use Geographically Weighted Regression (GWR) to generate prediction surfaces for particular response variables. This was in fact the original intent of my project. However, a common problem that we ran into was the difficulty of interpreting the results of geographically weighted regression models. Is a GWR model with a higher R2 than an ordinary least squares (OLS) model that uses the same variables a better model? What does it mean when local coefficients are positive in one area and negative in another? Should areas with poor local fit be excluded from output prediction surfaces?

My primary motivation for this back-to-basics blog post on multiple regression stems from my current understanding of the intended use of GWR. As it turns out, Geographically Weighted Regression is primarily intended as a technique for exploring local deviations from global fit resulting from non-stationarity in one’s data (Fotheringham et al. 2002). An implication of this is that GWR should be conducted only as a post-hoc analysis following selection of a final model using Ordinary Least Squares regression. In fact, the “help” documentation for Geographically Weighted Regression in ARCGIS explicitly advises one to do this. A corollary to this implication might be that GWR is not really intended to generate prediction surfaces.

As discussed by multiple members of the class, GWR operates by generating a series of local fits to each feature in a spatial dataset by passing a Gaussian kernel smoother with either a fixed or adaptive bandwidth (threshold distance) across one’s data. For this exercise, I conducted a GWR for each of our five elements of interest using the final model variables selected through Exploratory Multiple Regression. Table 5 summarizes the quasi-global R2 fit of the GWR models relative to the R2 values of OLS models using the same variables. For all five elements, R2 values are greater for the GWR models than the OLS models, but the degree of improvement varies by element. For Al and Ca, GWR models yield minimal improvements in fit over OLS models. By contrast, quasi-global GWR R2 values for Na and Fe are close to twice as high as their OLS counterparts.

Table 5: Quasi-global R2 fit of geographically weighted regression (GWR) models relative to those of best-fit least squares (OLS) models predicting the concentration of five major and minor elements in Tlacolula Valley clays.

Pink_GWR_T5
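
Before turning to interpretation, here is a stripped-down sketch of the GWR mechanics described above, using a fixed Gaussian bandwidth: each feature gets its own weighted least squares fit, with weights decaying with distance from that feature. Dedicated implementations (the ArcGIS tool used here, or packages such as mgwr) add bandwidth selection and the quasi-global diagnostics; this toy function is only meant to show what "local coefficients" and "local R2" refer to, and its name and the local R2 formula are illustrative choices of mine.

    import numpy as np

    def gwr_local_fits(coords, X, y, bandwidth):
        """coords: (n, 2) point locations; X: (n, k) predictors including an
        intercept column; y: (n,) response. Returns local coefficients and local R2."""
        n = coords.shape[0]
        betas = np.empty((n, X.shape[1]))
        local_r2 = np.empty(n)
        for i in range(n):
            d = np.linalg.norm(coords - coords[i], axis=1)
            w = np.exp(-0.5 * (d / bandwidth) ** 2)            # Gaussian kernel weights
            W = np.diag(w)
            beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted least squares
            betas[i] = beta
            resid = y - X @ beta
            ybar = np.average(y, weights=w)
            local_r2[i] = 1 - (w * resid ** 2).sum() / (w * (y - ybar) ** 2).sum()
        return betas, local_r2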

But how do we interpret the comparison in Table 5? My best answer is that if GWR is primarily intended to check for stationarity, then local measures of fit and local coefficients are much more important to our interpretation than the global fit. If we map the local R2 for each feature included in a given model, we find that the quasi-global R2 acts as a kind of maximum measure of fit; local R2 values are always lower than the quasi-global R2. Examining the spatial distribution of local coefficients and measures of fit allows one to identify whether non-stationarity is a potential issue affecting the global fit of the OLS model.
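
If the GWR output features are exported from ArcGIS (or produced by another tool), mapping the local fit can be as simple as the sketch below; the file path and the local R2 field name are assumptions to be checked against your own output.

    import geopandas as gpd
    import matplotlib.pyplot as plt

    gwr_out = gpd.read_file("gwr_al_output.gpkg")   # hypothetical export of the GWR output features
    ax = gwr_out.plot(column="LocalR2", cmap="viridis", legend=True, markersize=20)  # field name assumed
    ax.set_title("Local R2, GWR model for Al")
    plt.show()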

Figures 1-5 show local R2 values for GWR models for Al, Ca, Na, K, and Fe respectively. As we may expect, non-stationarity is an issue for all five models. Interestingly, the results for Al (Fig. 1) and Ca (Fig. 2) are nearly identical, despite being based on different dependent variables. Both show higher levels of local fit to the west, and lower levels of fit to the east. The results for Na (Fig. 3) are similar; local fit is highest in the west, and lowest on the east side of the hill from this area. Local fit for K (Fig. 4) is low across the map, but is highest in the east and lowest in the west. Local fit for Fe (Fig. 5) is substantially greater in the west than in the central Tlacolula Valley. But again, how do we interpret this information?

GWR_Al

Figure 1 (above): Local fit of GWR for Aluminum.

GWR_Ca

Figure 2 (above): Local fit of GWR for Calcium.

GWR_Na

Figure 3 (above): Local fit of GWR for Sodium.

GWR_K

Figure 4 (above): Local fit of GWR for Potassium.

GWR_Fe

Figure 5 (above): Local fit of GWR for Iron.

Local measures of R2 simply tell us the strength of the relationship between local variability in the predictors and local variability in the response. If either our predictor variables or our response variable is spatially auto-correlated, we may expect low local variability, which in turn tends to produce poor local fit. Higher local variability in either the predictor or response variables will not necessarily yield a stronger local fit, but where a relationship between the variables exists at the global scale, it helps.

To illustrate, if we map raw concentrations of Ca and Fe (Fig. 6 and 7) in clay samples collected from across the Tlacolula Valley, we can see that local variability in each is highest in the west and lowest in the central and eastern portions of the study area. In both cases, high local R2 values in the west simply reflect greater local variability in these elements in this area. For Ca, this appears to be driven by local contrasts between extreme values associated with limestone bedrock in close proximity to alluvial sediments with more typical values. In areas with less local variability in Ca concentrations, variability in spectral reflectance/emissivity and slope is less predictive.

GWR_CaPPM

Figure 6 (above): Calcium concentrations in Tlacolula clay samples (ppm).

GWR_FePPM

Figure 7 (above): Iron concentrations in Tlacolula clay samples (ppm).

Punchline

Results of exploratory multiple regression have shown that concentrations of individual elements in Tlacolula clay deposits are weakly correlated with spectral reflectance, emissivity, slope, and flow accumulation at the global scale of the study. At a finer scale, the results of GWR suggest that clay chemistry is poorly predicted by these variables, except in areas with high local variability in a given response element. This may indicate that the weak global relationships observed through OLS regression are driven in large part by spatially clustered extreme values, violating the assumption of stationarity. GWR has thus provided a robust check on the assumptions of our OLS regression models, as well as a clearer view of the factors influencing Tlacolula clay chemistry.

References

Fotheringham, A. S., Brunsdon, C., and Charlton, M. E. (2002). Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Wiley, Chichester.

Mendenhall, W., and Sincich, T. (2003). Regression Analysis: A Second Course in Statistics. Sixth Edition. Pearson Education, London.
