Note – This is largely an edited transcript of my final presentation, which was pre-recorded and uploaded to the internet as a video-blog. You can see it here.

Background

The Valley of Oaxaca, Mexico was home to one of the first civilizations in the New World, the Zapotec State. Founded around 500 BC with the construction of the hill-top city of Monte Albán, the Zapotec State persisted for more than 1000 years before finally dissolving around AD 850. My research focuses on pottery production and exchange at the height of Zapotec Civilization during the Late Classic (AD 550 – 850), when we suspect that large secondary centers such Dainzú-Macuilxóchitl and Jalieza were beginning to break away from the capital at Monte Albán.

Microsoft PowerPoint - Pink_MacuilTanSack2016.pptx

Figure 1: Late Classic settlement patterns in the Valley of Oaxaca, showing major sites mentioned in the text.

Our study of pottery exchange networks is intended to address how politically integrated these communities were by assessing their degree of economic interaction. To do this, we analyze elemental composition of ancient pottery recovered from excavations at these sites, and compare this to similar data for clay samples collected from across the region during the Oaxaca Clay Survey in 2007 and 2012. Figure 2 shows clay sampling locations relative to regional geology. As you can see, the geology of the area is quite complex, contributing to stark differences in the elemental composition of clay resources from different areas. The majority of our samples however were collected from the alluvial sediments from the valley floor, where sediments derived from multiple parent materials have mixed and altered through soil formation processes. Yet we know little about how interactions between these factors influence the elemental clay composition. This is important because we cannot sample every clay deposit in the region, but wish to identify possible production locales between our sampling locations.

 OaxacaGeo

Figure 2: Oaxaca Clay Survey sampling locations relative to regional geology.

Research Questions

  1. How do concentrations of individual elements in Oaxaca clays pattern spatially across the region?
  2. How are concentrations of individual elements related to environmental factors such as parent material, soil development, and the transport of sediment through alluvial systems?

Because this is an exploratory study, analyses will be limited to five major and minor elements measured for 137 clay samples collected from the eastern, Tlacolula arm of the valley. These elements include aluminum (Al), calcium (Ca), sodium (Na), Potassium (K), and iron (Fe).

Data

This study relies on three datasets:

  1. Measurements of aluminum (Al), calcium (Ca), sodium (Na), potassium (K), and iron (Fe) concentrations in 137 clay samples form the Tlacolula Valley. These data will act as response variables in regression analyses.
  2. Spectral reflectance and emissivity data for 13 bands of ASTER imagery, as measured at each clay sampling location during a cloud-free period in June of 2001 at a band-dependent spatial resolution of 15 to 90m (Figure 3). These data will act as proxy measures of surface lithology and serve as one set of predictor variables for clay chemistry.
  3. Elevation, slope, and flow accumulation estimates for each clay sampling location derived from a 90m DEM covering the Tlacolula Valley. Elevation and slope are good proxy measures for the differential processes affecting alluvial and residual soils, while flow accumulation provides a direct measure of a given location in the valley’s stream network. These data will act as a second set of predictor variables for clay chemistry.

GWR_OCS

Figure 3: Oaxaca Clay Survey sampling locations over ASTER image of the Tlacolula Valley

Hypotheses

H1 – Concentrations of individual elements will be spatially auto-correlated due to their association with different, spatially segregate parent materials.

H2 – The elemental composition of residual and alluvial clay deposits in the Tlacolula Valley, Oaxaca, Mexico, will be strongly conditioned by geologic parent material (as measured through spectral reflectance and emissivity in ASTER imagery) in both alluvial and residual soils.

H3 – The elemental composition of clay deposits will be further affected by processes of erosion and deposition (as measured through elevation, slope, and flow accumulation rasters).

H0 – The elemental composition of Tlacolula Valley Clays may be poorly predicted by both ASTER imagery and DEM-derived variables due to (1) spectral interferences from land-cover such as vegetation and pavement, and (2) elevation, slope, and flow accumulation may be inadequate proxy measures for other factors affecting sediment lithology and chemical alteration.

Approaches

Four statistical methods were used to evaluate the hypotheses listed above:

  1. Moran’s Index, to assess spatial auto-correlation for each element in the Tlacolula clays.
  2. Principal Components Analysis (PCA), to reduce the dimensionality of the ASTER data from 13 bands to 4 Principal Components (ASTPC1 – ASTPC4).
  3. Exploratory Multiple Regression (EMR), to identify an optimal combination of dependent variables for prediction of each element using global, least squares regression.
  4. Geographically Weighted Regression (GWR), to examine local deviations from global models that may result in non-stationarity in the data.

See my earlier blog-post on multiple regression for details regarding my use of Principal Components Analysis, Exploratory Multiple Regression, and Geographically Weighted Regression for analysis of these data.

Results

Moran’s I values for elemental concentrations of Al, Ca, Na, K, and Fe are highly significant for clay sampling locations in the Tlacolula Valley, supporting our first hypothesis that concentrations of each element are spatially auto-correlated, as one might expect if elemental clay chemistry is largely driven by parent material.

To clarify which factors were most related to concentrations of each element, a series of Exploratory Multiple Regressions were run using all four ASTER PCs, Elevation, Slope, and Flow Accumulation as dependent variables. Exploratory Multiple Regression is a brute-force method for identifying optimal combinations of predictor variables in complex multivariate data that works by assessing the relative fit of all possible combinations of variables.

Table 1 shows the final models selected for each of our five elements. Four things are important to note:

  1. All elements are predicted using a combination of ASTER PCs and DEM-derived variables, confirming that both parent material and soil formation processes influence Tlacolula Valley clay chemistry;
  2. Each element was best predicted using a different combination of variables;
  3. Elevation was not included in final models for any of the five elements; and
  4. While each model is statistically significant at α = 0.05, they have relatively weak measures of fit.

Table 1: Final least squares models selected for each element using Exploratory Multiple Regression.

Pink_GWR_T4

A possible reason for the poor fit of the EMR models is that they carry the assumption of spatial stationarity. And yet, we already know from our Moran’s I tests that each of response variables is spatially auto-correlated. Geographically Weighted Regression is a good post-hoc test for evaluating whether local deviations from global fit are an issue for OLS models. Table 2 shows quasi-global measures of fit for GWR models relative those of the final OLS models selected through Exploratory Multiple Regression. Each shows slight to moderate improvement over its least squares counterpart, but what is really of interest are measures of local fit.

Table 2: Comparison of quasi-global measures of fit for GWR and OLS models predicting concentrations of five elements in Tlacolula clays. Predictor variables are those selected for final OLS models using Exploratory Multiple Regression.

Pink_GWR_T5

Space prevents an in-depth discussion of each element, but a couple examples should be sufficient to illustrate my argument. Results of a GWR for Al show much stronger measures of local fit for the western side of the study area than the east (Figure 4), suggesting a strong lack of stationarity in the data. Results for Ca are remarkably similar (Figure 5), showing nearly identical patterns of strong local fit in the west, and poor local fit in the east. As discussed in my bog-post on multiple regression, my interpretation is that these result from strong variability in both predictor and response variables in the west relative to the east. A map of raw Ca concentrations for each clay sampling location (Figure 6) shows a number of extreme values in the west resulting from the association with limestone in these locations. There are also however a number of much lower, more normative values just downslope on the floodplain where sediments are derived from mixed lithologies. That is, in the western part of the study area there is high variability in both Ca and its predictor variables.

We can see something similar in the results of a GWR for Fe. Measures of local fit are highest in the west and abysmal in the central part of the study area on the Rio Salado floodplain (Figure 7). Again, I would argue that this is due to high levels of variability in Fe concentrations in the west related to localized differences in parent material, slope, and position in the stream network (Figure 8). In the Rio Salado floodplain, by contrast, there is little local variability in either slope as a predictor variable or Fe as a response.

High variability in both predictor and response variables will not in all cases result in a strong local or global fit. There must also be a relationship between the predictor variables and the response. However, an absence of local variability in the response variable due to spatial auto-correlation will inevitably result in lower measures of local fit.

GWR_Al

Figure 4 (above): Local fit of GWR model for Aluminum.

GWR_Ca

Figure 5 (above): Local fit of GWR model for Calcium.

GWR_CaPPM

Figure 6 (above): Calcium concentrations in ppm. Size of bubbles is proportional to elemental concentrations.

GWR_Fe

Figure 7 (above): Local fit of GWR model for Iron.

GWR_FePPM

Figure 6 (above): Iron concentrations in ppm. Size of bubbles is proportional to elemental concentrations.

Significance

While this is an exploratory study, and the results are preliminary, our work so far suggests that there are significant relationships between elemental clay chemistry, sediment lithology, and processes of erosion and deposition in the Tlacolula Valley. Furthermore, it shows that these relationships can be assessed using remotely sensed data derived from ASTER imagery. While the strength of these relationships varies both spatially and between individual elements, this analysis provides a view to the complex factors influencing clay chemistry in the Valley of Oaxaca.

For archaeologist working in Oaxaca, this is significant because it informs a greater understanding of the factors influencing the spatial distribution of compositionally distinct clay resources that would have been available to pottery producers. Similar methods could be employed in studies of ceramic exchange for any region with similarly complex geology.

Outside of archaeology, these methods could conceivably be used by anyone interested in how the admixture of sediments derived from multiple parent materials influences the elemental composition of alluvial soils at a regional scale. Concentrations of elements such as K, Na, and P may have an enormous effect on agricultural productivity from field to field, and this is but one example. One can imagine a range of other potential applications in Forestry, Botany, Geology, Soil Science, and Public Health. The sky is the limit.

Lessons Learned – Software

For this project, I used a combination of ENVI (for processing of the ASTER imagery), ARCGIS (for spatial data manipulation, raster calculation, Moran’s I, and Geographically Weighted Regression), and JMP (for calculation of non-spatial statistics, including Principal Components Analysis and Exploratory Multiple Regression). I was already familiar with these software, but I would be interested in any suggestions for other programs or packages that could improve these analyses.

I did get to learn how to make a Powerpoint into a Youtube video for my final presentation, which will be invaluable in my work as an ecampus instructor.

Lessons Learned – Statistics

Before this project I was unfamiliar with geographically weighted regression. After some initially promising results, I was forced to grapple with what these models were really telling me, and whether I could (in good faith) use them to generate predicted surfaces in the way that one might with a strong OLS model. I ultimately concluded that this would be inappropriate with my data and that Geographically Weighted Regression is perhaps better used as a post-hoc test for stationarity in a least squares model. That said, I recognize that this is not always how it is employed, and that other members of the class have tried to take it further. I have written about this at length in a previous blog-post, but am no more an authority on this subject than they are and would be interested in any other thoughts on the subject.

Prologue

This post got kind of out-of-hand, but it represents my attempt to understand what was going on under the hood of geographically-weighted regression (GWR), especially in terms of my own data. I found it personally helpful to back up, start with the basics of multiple regression, and work my way back to GWR to see what it was and was not telling me about my data. Much of this will be review for members of our class, but I hope that you find something of use here.

Introduction

Regression analysis refers to a class of statistical procedures used to assess the degree to which one variable (y) may be predicted by one or more other, independent variables (x1, x2, … , xk). Simple, linear regression models use a single continuous variable to predict a response variable, and are written using the familiar equation y = mx + b, where y is the response variable, x is the predictor, m is the slope coefficient of x, and b is the y-intercept of the regression line on a Cartesian grid. Multiple Regression models are similar, but use more than one continuous predictor variable. These are typically written as:  y = β0 + β1x1 + β2x2 + … + βkxk + ε, where β refers to each variable coefficient and ε to the model error.

With the exception of logistic regression, variables on both sides of a regression equation must be continuous (rather than categorical) and must meet four key assumptions: (1) the probability distribution of ε must be 0; (2) the variance of the probability distribution of ε is constant; (3) the probability distribution of ε is normal; and (4) the errors associated with any two observations must be independent (Mendenhall and Sincich 2003:105-106).

The complexity of multiple regression raises a number of issues, especially when dealing with spatial data. In this tutorial, we will explore statistical approaches to three common problems in multiple regression. First, how do we deal with collinearity between our predictor variables? That is, how do we deal with violations of the assumption of independence? Second, how do we decide which predictor variables to include in our model? And finally, how do we identify geographically patterned deviations from a global regression model in spatial data?

To address the first issue, we will discuss a statistical technique for reducing the dimensionality of a dataset called Principal Components Analysis. To address the second, we will discuss Exploratory Multiple Regression. To address the third, we will discuss Geographically Weighted Regression.

Case Study

To illustrate this discussion, we will be working with three sets of data from my research in Oaxaca. Namely, we will assess the degree to which we can predict concentrations of five major and minor elements in subsurface clay deposits using data derived from remote sensing. For me, these data are of interest because understanding the distribution of compositionally distinct clay resources may help us track ancient pottery production and exchange in the region. For this tutorial, these data serve simply as an example of a complex multivariate spatial problem. Each dataset used in this study is outlined briefly below:

  1. Tlacolula Clays: These data consist of 137 point features from across the eastern arm of the Valley of Oaxaca, the Tlacolula Valley. During the summers of 2007 and 2012, clay samples were collected from each of these locations, exported to OSU, and analyzed using INAA to estimate the elemental composition of each sample across 30 elements. For this exercise, we focus on just five of major and minor elements: Aluminum (Al), Calcium (Ca), Sodium (Na), Potassium (K), and Iron (Fe). These will act as our dependent variables.
  1. ASTER Bands: These data consist of spectral reflectance and emissivity measurements for 13 bands of remote sensing imagery extracted from a single tile of ASTER data from a cloud-free day in June of 2001. These include two Very Near Infrared (VNIR) bands taken at a spatial resolution of 15 m, six Short-wave Infrared (SWIR) bands taken at a spatial resolution of 30 m, and five Thermal Infrared (TIR) bands taken at a spatial resolution of 90 m. A common use of ASTER data is the mapping and classification of surface geology. If parent material is a good predictor of clay composition in this area, ASTER imagery may serve as a good proxy measure of sediment lithology. These data will therefore act as our first set of predictor variables.
  1. DEM-derived surfaces: Beyond parent material, we may expect the chemical composition of clay deposits to be affected by processes such as soil development, erosion, and alluvial movement of sediments. All of these are difficult to measure directly, but are heavily affected by factors such as slope, which are easily measured using remote sensing. We therefore added a second set of possible predictor variables to our analysis, these derived from a single ASTER 90 m DEM. These include elevation, slope, and flow accumulation. All three of these variables were log-transformed to better approximate a normal distribution.

Software

All of the statistical methods described in this tutorial are available under multiple software packages. I used a combination of ENVI (for manipulation of ASTER data), ARCGIS (for most other spatial analysis and data manipulation), and JMP (for non-spatial data analysis). I cannot speak to how these software compare to other software with similar capabilities. In fact, I am confident you prefer your favorite software packages over mine 🙂 For this reason, I will emphasize methods and interpretation over software throughout this tutorial.

Principal Components Analysis

A primary concern in multiple regression analyses is the tendency to over-fit a model simply by including too many variables. This is problematic because it both overstates the degree to which a response variable can be predicted with a series of independent variables, and it becomes extremely difficult to interpret. Our dataset has sixteen predictor variables, which is almost certainly too many for the number of cases we are analyzing.

A common method for reducing the dimensionality of a dataset is Principal Components Analysis (PCA). PCA uses either the correlation or covariance matrix of a dataset to create an orthogonal rotation of new variables called PCs that are by definition independent. Each PC describes a portion of the overall variance in the dataset, as measured by Eigen-values, with the first PC describing the principal axis of variability, the second PC a smaller portion of variability, and so on. The contribution of individual input variables on each PC is described by their loadings as Eigen-vectors.

When choosing PCs for further analyses, it is important to consider both the Eigen-values and Eigen-vectors. How many PCs are required to describe the majority of the variability in your dataset? Do the Eigen-vector loadings on each PC make sense, knowing what you know about the structure of your data?

To reduce the number of predictor variables in our analysis, data from the thirteen ASTER bands were subjected to a Principal Components Analysis on correlations. DEM-derived predictor variables were excluded from this analysis because a preliminary evaluation of the data showed that they were sufficiently independent from the ASTER bands. All ASTER bands were screened for outliers prior to analysis to reduce the influence of extreme cases.

Results of this analysis are reported as Eigen-values in Table 1, and as Eigen-vectors in Table 2. Table 1 shows that as much as 90 percent of the variability in the ASTER data is described by the first two PCs, and 97 percent described by the first three. However, beyond providing a reduced set of variables, we want our PCs to describe interpretively meaningful axes of variability that may be useful for predicting our response variables. Just because the first PC describes the most variability in the data does not mean that it is going to be the best predictor for all of our elements of interest.

Indeed, Table 2 shows that while the first PC is nearly equally loaded on all ASTER bands, the next three PCs are more intuitively meaningful. PC2 is positively loaded on the TIR bands and negatively loaded on all others. PC3 is positively loaded on the VNIR bands and negatively loaded on SWIR bands, especially at lower wavelengths. PC4, while accounting for just 1.4 percent of the variability in the data, describes an important contrast between shorter and longer wave bands within the SWIR portion of the spectrum. For this reason, we might select the first four PCs as potential dependent variables for subsequent analyses.

Table 1: Results of Principal Components Analysis of 13 ASTER image bands covering the Tlacolula Valley showing Eigen-values for each resulting PC, the percent of variability described by each, and the cumulative percent variability described by aggregate combinations of PCs.

Pink_GWR_T1

Table 2: Results of Principal Components Analysis of 13 ASTER image bands covering the Tlacolula Valley showing Eigen-vector loadings on the first 4 PCs. Positive loadings above 0.30 are shown in red. Negative loadings below -0.30 are shown in blue. These four PCs account for 99% of the variability in the ASTER data.

Pink_GWR_T2

Exploratory Multiple Regression

Now that we have reduced the dimensionality of our dataset to five dependent variables (Al, Ca, Na, K, and Fe) and seven possible predictor variables (four ASTER PCs and three DEM-derived variables), we are faced with the problem of identifying the best possible combination of variables for the prediction of each response variable. This becomes especially problematic when one suspects that there may be some interaction between predictor variables (such as canopy cover and soil moisture content as predictors of fire susceptibility). In these cases, one might wish to consider adding interaction terms, which further adds to the dimensionality of one’s data. For the purposes of this exercise, interaction terms were not considered.

A common method for identifying an optimal combination of predictors is Stepwise Multiple Regression, which allows one to walk through a series of potential models based on iterative improvements in model fit. This method is widely accepted and robust, but carries the disadvantage of only considering a single measure of model fit, such as R2 or AICc.

A more brute-force method for identifying optimal combinations of predictor variables is Exploratory Multiple Regresssion, which takes advantage of modern computing power to calculate every possible combination of predictor variables for each response. The results of these analyses may then be quickly sorted by number of dependent variables and multiple measures of model fit.

Table 3 reports the results of an Exploratory Multiple Regression of all possible combinations of our seven variables as predictors of Al. To conserve space in the blogosphere, only best-fit models for combinations of one through seven variables are reported here. Predictably, R2 tends to increase as we add variables to the model. However, Root Mean Square Error (RMSE), Akaike Information Criteria (AICc), and Bayesian Information Criteria (BIC) scores are lowest for models with four to five variables. In multiple regression, AICc and BIC scores are used to evaluate competing models by punishing complexity while rewarding fit (lower scores are better). The models reported in Table 3 highlight the problem with using a single criteria for model selection; while both AICc and BIC are lowest for the model that uses five variables, only four of these variables are individually significant at α = 0.05. For this reason, I have selected the four variable best-fit model for Al as the final model that I will use in further analyses.

Table 4 shows the final best-fit models chosen for all five elements using Exploratory Multiple Regression. In general the relationship between our predictor variables and individual elements is weak, ranging from 0.116 for K to 0.296 for Al. All final combinations of variables were significant predictors of each element, as were individual variables included in each model.

Table 3: Best-fit least squares regression models predicting the elemental concentration of Al in Tlacolula Valley clays using one to seven variables. Predictor variables were chosen from all possible combinations of models generated using 7 variables: ASTER PC1, ASTER PC2, ASTER PC3, ASTER PC4, Log10 elevation, Log10 slope, and Log10 flow accumulation.

Pink_GWR_T3

Table 4: Best-fit least squares regression models predicting the elemental concentration of five major and minor elements in Tlacolula Valley clays. Predictor variables were chosen from all possible combinations of models generated using 7 variables: ASTER PC1, ASTER PC2, ASTER PC3, ASTER PC4, Log10 elevation, Log10 slope, and Log10 flow accumulation.

Pink_GWR_T4

Notably, different combinations of variables were identified as the best predictors for each element. ASTPC1 (weakly loaded on all ASTER bands) was included in all models except that for K. ASTPC2 (positively loaded on the TIR bands) was included in models predicting Al and Ca, but not Na, K, or Fe. Surprisingly, ASTPC3 (positively loaded on the VNIR bands) was only included in the final model for Na, while ASTPC4 (positively loaded on low wavelength SWIR bands and negatively loaded on high SWIR bands, but describing just 1.4% of the variability in the data) was included in final models for Ca, Na, and Fe. Slope was included in final models for all elements except K, but only has a positive coefficient for Ca. Flow Accumulation was included as a negative coefficient in final models for Al, K, and Fe, showing that these elements tend to decrease in areas with higher potential runoff. Elevation was not included in any of the five final models.

A troubling issue with the final models that we have selected is their relatively poor predictive power. Generally speaking, it appears that even the best combinations of our predictor variables only account for about 20 to 30 percent of the variability in the elemental composition of clay deposits in the Tlacolula Valley. This raises the question of whether the relationships we have identified between our predictor variables and individual elements hold true across the study area, or whether our models are driven by strong deviations from the global fit at finer spatial scales. To address this issue, we may turn to Geographically Weighted Regression.

Geographically Weighted Regression

Over the course of this term, a number of us have attempted to use Geographically Weighted Regression (GWR) to generate prediction surfaces for particular response variables. This was in fact the original intent of my project. However, a common problem that we ran into was the difficulty of interpreting the results of geographically-weighted regression models. Is a GWR model with a higher R2 than an OSL model that uses the same variables a better model? What does it mean when local coefficients are positive in one area and negative in another? Should areas with poor local fit be excluded from output prediction surfaces?

My primary purpose in this back-to-basics blog post on multiple regression stems from my current understanding of the intended use of GWR. Turns out, Geographically Weighted Regression is primarily intended as a technique for exploring local deviations from global fit resulting from non-stationarity in one’s data (Fotheringham et al. 2002). An implication of this is that GWR should be conducted only as a post-hoc analysis following selection of a final model using Ordinary Least Squares regression. In fact, the “help” function for Geographically Weighted Regression in ARCGIS explicitly advises one to do this. A corollary to this implication might be that GWR is not really intended to generate prediction surfaces.

As discussed by multiple members of the class, GWR operates by generating a series of local fits to each feature in a spatial dataset by passing a Gaussian kernel smoother of either a fixed or adaptive threshold distance across one’s data. For this exercise, I conducted a GWR for each of our five elements of interest using the final model variables selected through Exploratory Multiple Regression. Table 5 summarizes the quasi-global R2 fit of GWR models relative to R2 values for OLS models for the same variables. For all five elements, R2 values are greater for GWR models than the OLS models, but the degree of difference differs by element. For Al and Ca, GWR models yield minimal improvements in fit over OLS models. By contrast, Quasi-global GWR R2 values for Na and Fe are close to twice as high as their OLS counterparts.

Table 5: Quasi-global R2 fit of geographically weighted regression (GWR) models relative to those of best-fit least squares (OLS) models predicting the concentration of five major and minor elements in Tlacolula Valley clays.

Pink_GWR_T5

But how do we interpret this information? My best answer is that if GWR is primarily intended to check for stationarity, then local measures of fit and dependent coefficients are much more important to our interpretation than the global fit. If we map the local R2 for each feature included in a given model, we find that the quasi-global R2 is a kind of maximum measure of fit; local R2 values are always lower than the quasi-global R2. Examining the spatial distribution of local coefficients and measures of fit allows one to identify whether non-stationarity is a potential issue affecting the global fit of the OLS model.

Figures 1-5 show local R2 values for GWR models for Al, Ca, Na, K, and Fe respectively. As we may expect, non-stationarity is an issue for all five models. Interestingly, the results for Al (Fig. 1) and Ca (Fig. 2) are nearly identical, despite being based on different dependent variables. Both show higher levels of local fit to the west, and lower levels of fit to the east. The results for Na (Fig. 3) are similar; local fit is highest in the west, and lowest on the east side of the hill from this area. Local fit for K (Fig. 4) is low across the map, but is highest in the east and lowest in the west. Local fit for Fe (Fig. 5) is substantially greater in the west than in the central Tlacolula Valley. But again, how do we interpret this information?

GWR_Al

Figure 1 (above): Local fit of GWR for Aluminum.

GWR_Ca

Figure 2 (above): Local fit of GWR for Calcium.

GWR_Na

Figure 3 (above): Local fit of GWR for Sodium.

GWR_K

Figure 4 (above): Local fit of GWR for Potassium.

GWR_Fe

Figure 5 (above): Local fit of GWR for Iron.

Local measures of R2 simply tell us the strength of the relationship between local variability in the predictors vs. the response. If either our predictor variable or response variable is spatially auto-correlated, we may expect this to result in low local variability, which would tend to result in poor local fit. Higher local variability in either the predictor or response variables will not necessarily yield a stronger local fit, but if there is a relationship between variables at the global scale, this helps.

To illustrate, if we map raw concentrations of Ca and Fe (Fig. 6 and 7) in clay samples collected from across the Tlacolula Valley, we can see that local variability in each is highest in the west and lowest in the central and eastern portions of the study area. In both cases, high local R2 in the west simply reflect greater local variability in these elements in this area. For Ca, this appears to be driven by local contrasts between extreme values associated with limestone bedrock in close proximity to alluvial sediments with more normative values. In areas with less local variability in Ca concentrations, variability in spectral reflectance/emissivity and slope is less predictive.

GWR_CaPPM

Figure 6 (above): Calcium concentrations in Tlacolula clay samples (ppm).

GWR_FePPM

Figure 7 (above): Iron concentrations in Tlacolula clay samples (ppm).

Punchline

Results of exploratory multiple regression have shown that concentrations of individual elements in Tlacolula clay deposits are weakly correlated with spectral reflectance, emissivity, slope and flow accumulation at the global scale of the study. At a finer scale, the results of GWR suggest that clay chemistry is poorly predicted by these variables, except in areas with high local variability in a given response element. This may indicate that the weak global relationships observed through OLS regression may be driven in large part by spatially-clustered extreme values, violating the assumption of stationarity. GWR has thus provided a robust check on the assumptions of our OLS regression models, as well as a clearer view to the factors influencing Tlacolula clay chemistry.

References

Mendenhall and Sincich (2003) Regression Analysis: A Second Course in Statistics. Sixth Edition. Pearson Education LTD., London.

Fotheringham, A. S., Brunsdon, C., and Charlton, M. E. (2002). Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Wiley, Chichester.

Introduction
Hot Spot Analysis is a common method for identifying spatially significant clusters of high and low values among features. While feature values may be easily mapped using histogram distributions to identify areas or points with high or low values, it is difficult to tell whether these values are spatially significant.

In ArcGIS, the Hotspot Analysis tool calculates the Getis-Ord Gi* statistic (Figure 1) for each feature in a dataset. It then outputs a z-score and p-value for each feature, allowing one to map the spatial distribution of significant values to identify local hotspots. To be a statistically significant hotspot, a feature must not only have a high or low value, but must be surrounded by features with similarly high or low values.

Getis_Ords_G-star and Spatial Autocorrelation implementation in ArcView-1

Figure 1: Equation for the Getis-Ord Gi* statistic.

As we shall see, significance for the Getis-Ord Gi*statistic depends largely on calculation of a local sum for a feature and its neighbors. This is compared proportionally to the sum of all features to identify regions where when the observed local sum is significantly different from the expected local sum, or where the difference is too large to be the result of random chance.

But how does the Hotspot Analysis tool determine what is “local”?

When the threshold distance parameter is left blank, a default threshold equal to the Euclidean distance that ensures that every feature has at least one neighbor is computed and applied to the entire dataset. An implication of this is that Hotspot Analysis is extremely sensitive to the spatial distribution of features, making some data inappropriate for this type of analysis.
In this tutorial, we will explore how manual changes in threshold distance affect our view of what is significant using clay chemistry data from the Valley of Oaxaca, Mexico. I will argue that because clay sample locations were researcher-selected rather than the product of natural processes, Hotspot Analysis is not appropriate for this data.

Data, Analysis, Results

The Valley of Oaxaca is a broad Y-shaped valley located in the central highlands of Oaxaca, Mexico and the heart of ancient Zapotec Civilization. Over the past several years, researchers at the OSU Archaeometry Laboratory have collected over 300 clay samples from across the Valley for elemental analysis using Instrumental Neutron Activation Analysis (INAA) to establish a comparative basis for the geochemical fingerprinting of ancient pottery from the region.
OaxacaGeo

Figure 2: Clay Sampling locations relative to regional geology in the Valley of Oaxaca, Mexico.

Our ability to identify pottery produced in different part of the Valley is due in large part to the geologic complexity of the region. In each area of the Valley, clay composition is strongly conditioned by parent material, sometimes representing the product of mixed sediments from multiple sources. For this exercise, we will focus on two elements, lanthanum (La) and cesium (Cs), found in elevated concentrations in clays derived from gneiss and rhyolite respectively. La concentrations tend to be high along the western side of the valley (Figure 3), while Cs concentrations are high in the east (Figure 4).
OaxacaLa

Figure 3: La concentrations in samples from the Oaxaca Clay Survey

OaxacaCs
Figure 4: Cs concentrations in samples from the Oaxaca Clay Survey

A hotspot analysis of La concentrations from across the Valley using an unspecified (default setting) threshold distance (Figure 5) identifies significantly high hotspots of La along the central portion of the western edge of the valley – as expected – as well as significant cold spots along the eastern edge of the southern arm of the valley and along the northern side of the eastern arm. A similar analysis of Cs values identifies Cs cold spots along the western edge of the valley and Cs hot spots near the southern and eastern sides of the eastern arm (Figure 6).
OaxacaLaHotspot

Figure 5: Hotspot analysis of La concentrations using an unspecified threshold distance.

OaxacaCsHotspot

Figure 6: Hotspot analysis of Cs concentrations using an unspecified threshold distance.

Broadly speaking these patterns are in accord with our previous understanding of regional clay composition, but to what degree are the significance levels identified by hotspot analysis dependent upon our unspecified threshold distance?

To address this question, we conducted a series of three additional hotspot analyses of Cs concentrations using specified threshold distances of 100 m (Figure 7), 1,000 m (Figure 8), and 10,000 m (Figure 9). Results of these analyses show that the number of features identified as significant hot or cold spots increases as a function of threshold distance. Using a threshold of 100 m, only a few points in the far eastern portion of the Valley are identified as significant hotspots. This pattern largely holds true using a 1,000 m threshold with a few additional points in the same limited area defined as significant hotspots. However, using a threshold of 10,000 m, the majority of features on the eastern side of the Valley are classified as significant hotspots, while the majority of those in the west are classified as significant cold spots. When nearly everything is significant, what is significant? Clearly 10,000 m is an improper choice of threshold distance, but this raises the threefold issue of how the default threshold is determined, whether this distance best describes the data, and whether the assumptions of this tool are met by the data.
OaxacaCsHotspot100

Figure 7: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 100 m.
OaxacaCsHotspot1000

Figure 8: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 1000 m.

 

OaxacaCsHotspot10000
Figure 9: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 10,000 m.

 

Discussion

Results of a series of hotspot analyses of La and Cs concentrations in clay samples from the Valley of Oaxaca largely confirm what we already knew – La concentrations are high in clays along the western side of the valley and Cs concentrations are highest in clays in the eastern side of the valley. Hotspot analysis allows us to identify spatially significant clusters of points with higher or lower than expected values, but as we have seen, what gets flagged as significant is strongly dependent upon ones choice of threshold distance.

By default, ArcGIS selects a threshold distance equal to the minimum distance required to ensure that all features have at least one neighbor. This means that the default minimum distance is contingent upon the spatial distribution of features, and is likely driven by features at the edge of the dataset and/or spatial outliers.

This raises the larger issue of what factors drive the spatial distribution of features in one’s dataset. Are the location of individual features the product of the physical or social environment? Is their spatial distribution critical to your understanding of the data? Or are the locations researcher-selected according to some regular, stratified, or arbitrary sampling design that is not a product of the environment, but of how one chose to sample that environment?

In the case of the Oaxaca Clay Survey, clay sampling locations were selected opportunistically from subsurface horizons along roadcuts and streambanks, with greater sampling intensity near key archaeological sites sparse density in areas with few important sites. Because the sampling locations were arbitrarily selected, their overall distribution is not of interest. Nor are the inter-point distances used to calculate the Getis-Ord Gi* statistic.

Because the spatial distribution of clay sampling locations tells us more about researcher behavior than clay formation processes, and Hotspot Analysis uses this information in its calculation of z-scores and significance levels, Hotspot Analysis is not appropriate for this data.

Conclusion

Before using Hotspot Analysis, one should always verify (1) that the spatial locations of features of interest are the product of the physical or social environment rather than researcher-selection; and (2) that the default local threshold value is not driven by spatial outliers.

Background

So in class I have been talking about digitizing maps of archaeological sites in the Central Valleys of Oaxaca in order to examine changes in the distribution of sites within the region over time. April Fools! I spoke to my adviser and she thought that this project might be too much to take on this term. Instead, she suggested that I continue developing a project that I did for the remote sensing class last term that involved a spectral classification of alluvial sediments in one of Oaxaca’s Central Valleys. Before I describe that project, it may be necessary to provide some background and justification.

Over the past several years, my lab been working with archaeologists from the US and Mexico on a large collaborative research project focused on assessing changes in political and economic integration in the Central Valleys of Oaxaca, Mexico, the core cultural region of the Zapotec civilization, one of Mesoamerica’s first and most enduring complex societies. To do this, our lab takes samples of ceramics provided by each of our collaborators and analyzes them via Instrumental Neutron Activation Analysis (INAA) to determine their geochemical composition for a suite of 30 elements. We then statistically compare the compositional data from the ceramics to similar data obtained for over 300 clay samples that we have collected from across the region to identify areas where ceramic groups may have been produced. Using this data, we can identify locally produced wares for each site in our database, as well as the sources of wares that were imported to each site. This allows us to model the structure of regional economic networks for different periods of interest and examine changes in regional economic integration over time.

One of the fundamental advantages of this approach has been our comparative database of geochemical information for natural clays from the region. But while the number of samples in this database is fairly high, sampling was largely conducted on an opportunistic basis from exposures of clay in road-cuts, streambanks, and agricultural fields, leading to uneven sample coverage across the study area. To estimate the geochemical composition of clays in areas between samplimg locations, we generated a simple interpolated model of regional clay chemistry that covers the entire Central Valley System at a spatial resolution of one kilometer.

While our interpolated model of regional clay chemistry allows us to identify potential ceramic production areas between our clay sampling locations, it has a couple limitations. First, the model’s low spatial resolution glosses finer-scale differences in clay chemistry that can be readily observed in the original data. Secondly, and more importantly, the model does not account for the way that sediment actually moves through the region’s alluvial system.

The trace-element geochemistry of natural clay is largely determined by parent material. The Central Valleys of Oaxaca are flanked by a series of geologically complex mountain ranges that variously contribute to residual and alluvial sediments across the study area, resulting in discrete differences in observed clay chemistry from one sampling location to the next. When we model the clay chemistry for locations between sampling points using simple interpolation methods, we ignore crucial factors such as parent material and the directionality of sediment transport from one area to the next.

To facilitate the development of a refined model of regional clay chemistry, last term I used multispectral ASTER data from NASA’s EOS satellite to develop a spectral classification of alluvial sediments in the Tlacolula Valley, one of the three main branches of Oaxaca’s Central Valley System (see figure below). While this project allowed us to clearly visualize patterns in the valley’s sediment routing system, we have not yet compared the remote sensing data to the geochemical data for each of our sampling locations to assess its utility in developing a refined model of regional clay chemistry.

Pink_Geo544_Poster

Research Objective

This term, I will build upon my previous spectral classification of alluvial sediments in the Tlacolula Valley to assess whether remotely sensed spectral reflectance data may be used to more accurately model clay chemistry within the Tlacolula Valley. ASTER data contains 14 bands of spectral measurements. Some of these are useful for identifying differences in surface geology, while others are better for identifying vegetation cover and urban areas. Whether any of these bands (or combinations of bands) correlate with regional clay chemistry is an open question. The vast majority of our clay samples were collected not from the surface, but from B horizons in exposed soil profiles. Nevertheless, insofar as the surface of most soil profiles in this area is likely to be derived from similar sediment sources as its subsurface components, it may be possible to correlate spectral surface reflectance with our regional clay composition data. If so, we may be able to use the ASTER data to generate a new, higher resolution model of Tlacolula Valley clay chemistry.

Dataset

This study will rely on data collected by the Advanced Spaceborne Thermal Emission Radiometer (ASTER). This satellite collects data over 14 spectral regions using three subsystems: the Visible and Near Infrared (VNIR), the Shortwave Infrared (SWIR), and the Thermal Infrared (TIR). The VNIR system collects stereoscopic data over three spectral regions in the visible and near infrared spectrum at a spatial resolution of 15 m. The Shortwave infrared spectrometer collects data for six spectral regions in the near infrared at a spatial resolution of 30 m using a single nadir pointing detector. And finally, the TIR spectrometer collects data over five spectral regions at a spatial resolution of 90 m using a single nadir pointing detector.

More specifically, this study will rely on a single tile of ASTER Level 1B Precision Terrain Corrected Registered At-Sensor Radiance (AST_L1B) data collected for a region covering the Tlacolula Valley in January of 2001, a period chosen for its low cloud cover and cleared fields. ASTER Level 1B data is available as a multi-file containing calibrated at-sensor radiance that has been geometrically corrected, rotated to a north-up UTM orientation, and terrain corrected using an ASTER DEM. This will be clipped to an area encompassing the valley floor and adjacent piedmont; mountainous areas outside the study area will be excluded from analysis.

Hypotheses

This project will be more of an exploratory exercise in methods development than a hypothesis test.

To determine whether spectral measurements from the ASTER data can be correlated with Tlacolula clay chemistry, I will use the geographic locations of our clay samples from the Tlacolula Valley to extract spectral profiles corresponding to each sample location. These will then be correlated against individual elements from our geochemical data using a series of stepwise multivariate regression analyses in R or another statistical software package. Given fairly strong correlations between spectral measurements and geochemical data, a refined spatial model of Tlacolula clay chemistry will be generated using these regression formulas.

The project that I conducted last term showed that sediments in upland, piedmont areas of the Tlacolula Valley could be easily classified according to their source lithology; misclassification largely occurred only within the Rio Salado floodplain where sediments become more mixed. In our current interpolated model of regional clay chemistry, elemental estimates between sampling locations are always modeled as intermediate, without respect to parent material or topographic position. If successful, this project will yield element estimate maps that more closely reflect patterns of sediment transport seen in the ASTER imagery.

Expected outcome

If our existing model of regional clay chemistry correlates as well against the ASTER data as the geochemical data from our actual sampling locations, development of a revised model may be unnecessary. If however the original clay data correlates substantially better with the ASTER measurements, a new multi-element model of Tlacolula clay chemistry will be generated using the ASTER data.

That said, there is a very strong chance that the ASTER data will only correlate with a few elements, if any. If this is the case, we will explore other options for generating a revised spatial model of Oaxaca clay chemistry.

Significance

If successful, this project will represent a significant advance in methodology for mapping the elemental composition of alluvial sediments regionally. This has some utility in archaeology for identifying potential sources of clay used to make ancient ceramics, but it may also prove useful for soil scientists, geologists, and other researchers concerned with how the admixture of alluvial sediments may contribute to variability in sediment chemistry at a regional scale.

My level of preparation

I have been using ArcGIS for nearly ten years now, so I am thoroughly prepared for this project in that regard. I also have a very strong background in multivariate statistics, though I haven’t used R in some years. I was only introduced to the image processing software ENVI last term during my remote sensing class, but am confident that I have the skills required to complete this project.