Spatial Autocorrelation (Moran’s I): This tool measures spatial autocorrelation using feature locations and feature values simultaneously. The spatial autocorrelation tool evaluates patterns that are both multidimensional and multi-directional. The Moran’s I index is a value between -1 and 1: positive spatial autocorrelation indicates values that are clustered, negative autocorrelation indicates dispersed values, and values near zero indicate a random pattern. The tool also generates a z-score and p-value, which help evaluate the significance of the Moran’s I index.

Morans_Calculations

Figure 1: Calculations used for the Moran’s I tool. (ESRI image)

The output of the Moran’s I tool can be found in the Results window of ArcGIS. Opening the HTML report for the Moran’s I results shows a graph of how the tool evaluated the data and whether the pattern is dispersed, random, or clustered. The report also includes the Moran’s Index value, z-score, and p-value, along with a scale for the significance of the p-value and the critical values for the z-score.

MoransI_SampleOutput

Figure 2: Sample output for the Moran’s I tool. (ESRI image)

 

Data and Analysis

Before conducting this test, I sampled the SST and the CHL-a values at each of the feature locations (sea turtle locations) using the Extract Multi Values to Points tool. This tool “Extracts cell values at locations specified in a point feature class from one or more rasters, and records the values to the attribute table of the point feature class.”
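The same sampling can be scripted outside ArcGIS. Below is a minimal sketch in R using the raster package; the file and field names are placeholders for illustration, not the actual files used here.

library(raster)

turtles <- shapefile("sea_turtle_locations.shp")   # point feature class (placeholder name)
sst     <- raster("sst_jan2005.tif")               # sea surface temperature grid (placeholder)
chla    <- raster("chla_jan2005.tif")              # chlorophyll-a grid (placeholder)

# Sample each raster at the turtle locations and append the values to the
# point attribute table, as Extract Multi Values to Points does
turtles$SST  <- extract(sst, turtles)
turtles$CHLa <- extract(chla, turtles)

shapefile(turtles, "sea_turtle_locations_sampled.shp", overwrite = TRUE)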

SeaTurtlePointData

Figure 3: Sea turtle locations in the Gulf of Mexico, derived from http://seamap.env.duke.edu/

 

Chl_image_GOM

Figure 4: Chlorophyll-a (mg/m³) data for January 2005 within the Gulf of Mexico, derived from NOAA.

SST_image_GOM

Figure 5: Sea Surface Temperature (Celsius) data for January 2005 within the Gulf of Mexico, derived from NOAA.

I tested the spatial autocorrelation of chlorophyll-a and sea surface temperature at each feature location. Inverse distance was used for the conceptualization of spatial relationships, and Euclidean distance was used for the distance method. I selected a 500 km distance band (shorter distances were too small for the study site).
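For readers who prefer to script the test, a roughly analogous global Moran’s I can be computed in R with the spdep package. This is a sketch only: it uses a 500 km distance band with row-standardized weights rather than the inverse-distance weighting used in ArcGIS, and the coordinate field names are placeholders, so results will differ from the report shown below.

library(spdep)

coords <- cbind(turtles$x_m, turtles$y_m)          # projected coordinates in meters (placeholder fields)
nb     <- dnearneigh(coords, d1 = 0, d2 = 500000)  # neighbors within 500 km
lw     <- nb2listw(nb, style = "W", zero.policy = TRUE)

moran.test(turtles$SST,  lw, zero.policy = TRUE)   # Moran's I, z-score, and p-value for SST
moran.test(turtles$CHLa, lw, zero.policy = TRUE)   # and for chlorophyll-a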

Results:

Sea Surface Temperature: The results of the spatial autocorrelation tool suggest that the pattern of sea surface temperature at each feature location is clustered. The Moran’s Index was 0.402514, the z-score was 2.608211, and the p-value was 0.009102. Since the z-score was greater than the critical value of 2.58, there is less than a 1-percent likelihood that the clustered pattern is the result of random chance.

SeasurfaceTemp_Morans_resullts1Of2 SeasurfaceTemp_Morans_resullts2Of2

Figure 6: Sea surface temperature results for Moran’s I tool.

Chlorophyll-a: The results of the spatial autocorrelation tool suggest that the pattern of chlorophyll-a at each feature location is clustered. The Moran’s Index was 0.346961, the z-score was 2.216243, and the p-value was 0.026675. The z-score was less than 2.58 but greater than the critical value of 1.96, suggesting that there is less than a 5-percent likelihood that the clustered pattern is the result of random chance.

Chlorophyll_Morans_resullts1Of2 Chlorophyll_Morans_resullts2Of2

Figure 7: Results of the spatial autocorrelation Moran’s I for chlorophyll-a at the leatherback sea turtle locations.

What does this mean:

As suggested by the hotspot analysis, the data are clustered. The spatial autocorrelation tool indicates that both sea surface temperature and chlorophyll-a values are clustered at the sea turtle locations. Conducting an ordinary least squares analysis may provide more information about which factors contribute most to the clustered pattern.

Tim Sheehan

GEO 584: Advanced Spatial Statistics and GIScience

Final Blog Post

Introduction

The MC2 dynamic global vegetation model

Dynamic global vegetation models (DGVMs) model vegetation dynamics over regional to global scales. DGVMs are process-based models, meaning they model the physical and chemical processes that drive vegetation growth. MC2 (Bachelet et al., 2015) is one such model. MC2 models processes on a regular latitude longitude grid using a monthly time step. Each grid cell is treated independently. That is to say, processes in one cell do not affect processes in any other cell. MC2 inputs include soil characteristics, elevation, and monthly means for minimum temperature, maximum temperature, precipitation, and dew point temperature.

Unlike many DGVMs, MC2 includes a fire module that models the effects of fire on vegetation. The current version of MC2 simulates a fire whenever fuel conditions exceed a defined threshold. In other words, ignitions are assumed. This has the potential to overestimate fire occurrence in areas where the fuel threshold is frequently exceeded, to underestimate fire occurrence in areas where the threshold is never or rarely exceeded, and to model only severe fires. Other aspects of model results, especially vegetation occurrence, age, and density as well as carbon dynamics, are driven in part by fire, so correctly modeling fire is important to correctly modeling other processes.

In an effort to make MC2 fire modeling more realistic, I have implemented a stochastic ignitions algorithm in the MC2 fire model. Rather than having a single fuel threshold that determines fire occurrence, the stochastic model uses a probabilistic method (Monte Carlo) to determine if an ignition source is present in a model grid cell. If an ignition source is present, then a second probabilistic method, based on fuel conditions, is used to determine whether the ignition leads to a fire occurrence. If a fire occurs, it is modeled in the same way fires are modeled with assumed ignitions.
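The following is a hedged illustration, not MC2 source code, of the two-stage test described above, written in R; the ignition probability and the fuel-based probability curve are placeholder assumptions rather than the model’s actual parameters.

# Two-stage stochastic ignition test (illustrative sketch only)
simulate_cell_fire <- function(fuel_index, p_ignition = 0.05) {
  # Stage 1: Monte Carlo draw for the presence of an ignition source
  if (runif(1) >= p_ignition) return(FALSE)
  # Stage 2: given an ignition, the chance of a fire rises with fuel condition
  # (fuel_index assumed normalized to 0-1; the real model's function differs)
  runif(1) < fuel_index
}

set.seed(42)
fires <- replicate(1200, simulate_cell_fire(fuel_index = 0.3))  # 100 years of monthly steps
sum(fires)                                                      # simulated fire occurrences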

A fire regime is the pattern, frequency, and intensity of wildfire in an area and is inextricably linked to an area’s ecology. Despite the multiple factors that define a fire regime, aspects of DGVM fire results, including those from MC2, are commonly presented in terms of individual metrics over time, for example an increase in fire frequency in the 21st century versus the 20th century (e.g. Sheehan et al., 2015).

EPA level III ecoregions

EPA ecoregions (https://archive.epa.gov/wed/ecoregions/web/html/ecoregions-2.html) are geographic regions for North America based on similarities of the biotic and abiotic phenomena affecting ecosystem integrity and quality. EPA ecoregions are defined at four levels (I through IV), with higher levels covering smaller areas. This study uses level III ecoregions, which range in area from approximately 24,000 to 100,000 km².

Research questions and hypotheses

Research questions for this study are:

  • How consistent are fire regimes within ecoregions?
  • How do fire regimes change under projected climate?
  • Could fire regimes produced by MC2 be predicted by a statistical model?

To gain insight into these questions, I tested the following hypotheses:

  1. Fire-free cells will decrease in the 21st century compared to the 20th due to increasing temperatures.
  2. Fire regimes will group within individual ecoregions over the 20th century, but less so in the 21st century due to changes in fire regime as a result of projected changes in climate between the two centuries.
  3. Logistic regression will be able to predict the fire regimes produced by MC2 due to embedded statistical relationships between fire and input variables in the model.
  4. Logistic regression performed on an ecoregional basis will have greater predictive power than one performed over the entire region due to local geographical relationships within each ecoregion.

Methods

Study area, resolution, and data

The spatial domain for this study covers the Pacific Northwest (PNW) (Fig. 1), defined as the portion of the conterminous United States north of 42° latitude and west of -111° longitude. The area was mapped to a 2.5 arc minute (approximately 4 km x 4 km) latitude/longitude grid. Modeling was done over the period 1895-2100. I also repeated the analysis using only the Blue Mountain Ecoregion (Fig. 1C), located within the PNW.

SheehanFinalBlogFig1

Figure 1: A) Index map of PNW, B) digital elevation model (DEM) of PNW, and C) EPA Level III ecoregions.

Inputs for MC2 include static soil and elevation data as well as monthly climate data. Climate data consist of minimum temperature, maximum temperature, precipitation, and dew point temperature. Inputs for 1895-2010 were from the PRISM (http://www.prism.oregonstate.edu/) 4 km dataset. This dataset is created using interpolation of observed data. Inputs for 2011-2100 came from the Coupled Model Intercomparison Project 5 (CMIP 5, http://cmip-pcmdi.llnl.gov/cmip5/) Community Climate System Model 4 (CCSM4, http://www.cesm.ucar.edu/models/ccsm4.0/) run using the Representative Concentration Pathway (RCP) 8.5 (https://en.wikipedia.org/wiki/Representative_Concentration_Pathways) CO2 concentration data. RCP 8.5 is based on a “business as usual” scenario of CO2 production through the end of the 21st century. 2011-2100 climate data was downscaled to the 2.5 arc minute resolution using the Multivariate Adaptive Constructed Analogs method (MACA; Abatzoglou and Brown, 2012; http://maca.northwestknowledge.net/). MC2 outputs include annual values for vegetation type, carbon pools and fluxes, hydrological data, and fire occurrence data including fraction of grid cell burned and carbon consumed by fire. MC2 input and output data were divided into two datasets based on time period: 20th century (1901-2000) and 21st century (2001-2100).

k-means cluster analysis

For this study, fire regime was defined as a combination of mean annual carbon consumed by fire, the mean annual fraction of cell burned, and the fire return interval (FRI) over the century. (Note that FRI is usually calculated over periods longer than a century, especially in ecosystems rarely exposed to fire). For cluster analysis, the two datasets were combined, and normalized to avoid uneven influences caused by differing data value ranges. I used K-means clustering in R to produce 4 fire regime clusters (details in my blog post on K-means clustering in R: http://blogs.oregonstate.edu/geo599spatialstatistics/2016/05/21/cluster-analysis-r/).
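A minimal sketch of that clustering step is shown below, assuming a data frame fire with one row per grid cell and placeholder column names for the three fire-regime metrics; the full workflow is in the linked blog post.

metrics <- fire[, c("mean_C_consumed", "mean_frac_burned", "fri")]

# Normalize so no metric dominates because of its value range
metrics_z <- scale(metrics)

set.seed(1)
km <- kmeans(metrics_z, centers = 4, nstart = 25)

fire$cluster <- km$cluster   # fire-regime cluster assignment per cell
km$centers                   # normalized cluster centers (cf. Tables 1 and 2)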

Logistic regression

I did stepwise logistic regression modeling for each of the fire regime clusters (details in my blog post on logistic regression in R: http://blogs.oregonstate.edu/geo599spatialstatistics/2016/05/27/logistic-regression-r/). Explanatory values were taken from inputs to the MC2 model runs (Fig. 2). Climate variables were summarized annually, by “summer” (April-September), and by “winter” (October-January). The mean of these was then taken over each century. Additional explanatory variables were soil depth and elevation. For each cluster, presence was membership in the cluster and absence was non-membership. The process resulted in four logistic functions, one for each fire regime cluster.
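A sketch of one cluster’s regression is below, assuming a data frame cells that holds the century-mean explanatory variables (placeholder names, abbreviated list) and the k-means cluster assignments; the full steps are in the linked blog post.

cells$in_cluster1 <- as.integer(cells$cluster == 1)   # presence/absence for cluster 1

full <- glm(in_cluster1 ~ soil_depth + elevation +
              tmax_ann + tmax_summer + tmax_winter +
              ppt_ann + ppt_summer + ppt_winter,
            data = cells, family = binomial)

# AIC-based stepwise selection of explanatory variables
step1 <- step(full, direction = "both", trace = FALSE)
summary(step1)

# Repeating this for clusters 2 through 4 gives the four logistic functions.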

Tournament cluster prediction

I used a tournament method to predict the fire regime cluster. For each grid cell, each cluster’s logistic regression model was used to calculate the probability that the grid cell would be in that cluster. The cluster with the highest membership probability was chosen as the predicted cluster. I constructed confusion matrices and contingency tables of predicted fire regimes versus actual fire regimes. I also produced maps showing the predicted fire regime cluster, the actual fire regime cluster, and the Euclidean distance between cluster centers for the predicted and actual fire regime clusters.
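A sketch of the tournament step, assuming the four fitted logistic regressions from the previous step are collected in a list (object names are placeholders):

models <- list(step1, step2, step3, step4)          # the four cluster models

# Membership probability of every cell under every cluster's model
probs <- sapply(models, predict, newdata = cells, type = "response")

# The predicted fire regime is the cluster with the highest probability
cells$predicted <- max.col(probs)

# Confusion matrix and overall fraction correctly classified
conf <- table(predicted = cells$predicted, actual = cells$cluster)
conf
sum(diag(conf)) / sum(conf)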

Results

Changes in input variables over time

Climate models consistently project a warming climate (Sheehan et al., 2015) in the Northwest during the 21st century, but precipitation changes vary across models. The results of the CCSM4 run with the RCP 8.5 CO2 values indicate that temperature warms approximately 2 to 4 °C over the entire region, that precipitation is relatively stable with local increases, and that mean dew point temperature is stable to increasing by up to approximately 3 °C (Fig. 2).

SheehanFinalBlogFig2

Fig. 2: Climate variables and their differences by century: A) 20th century maximum temperature; B) 21st century maximum temperature; C) change in maximum temperature between 20th and 21st centuries; D) 20th century precipitation; E) 21st century precipitation; F) change in precipitation between 20th and 21st centuries; G) 20th century dew point temperature; H) 21st century dew point temperature; and I) change in dew point temperature between 20th and 21st centuries.

Cluster analysis

The four clusters yielded by the analysis (Table 1) (details and graphs in my blog post on K-means clustering in R) can be described in terms of the factors used to generate them. These are: 1) low carbon consumed, medium fraction burned, low FRI; 2) high carbon consumed, medium fraction burned, medium FRI; 3) low carbon consumed, low fraction burned, medium FRI; 4) no fire. Euclidean distances between cluster centers are listed in Table 2.

Table 1: Non-normalized values for centers of clusters from cluster analysis. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl1

Table 2: Euclidean distances between cluster centers (normalized values). (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl2

Overall, there were more fire-free cells in the 20th century (31%) than in the 21st (18%). This is readily apparent in the Cascade and Rocky Mountain areas (Fig. 3). Fire regimes are fairly consistent in some level III ecoregions and less so in others (Fig. 3). In some cases there is greater consistency in the 20th century (e.g. Coast Range, Cascades, and North Cascades), while others show more consistency in the 21st century (e.g. Northern Basin and Range and Columbia Plateau). Still others are dominated by more than one fire regime to the exclusion of the rest (e.g. Willamette Valley). Overall, coverage by the no-fire regime decreases while coverage by the high carbon consumed, medium fraction burned, medium FRI regime increases, especially in the Eastern Cascades Slopes and Foothills, Northern Rockies, and Blue Mountains ecoregions.

SheehanFinalBlogFig3

Fig. 3: Fire regime coverage determined by cluster analysis with EPA Level III Ecoregion overlay for A) the 20th century; B) the 21st century; C) Euclidean distance between cluster centers for time periods. (C: Carbon; Med: Medium; Frac: Fraction)

Logistic regression and tournament fire regime prediction

Of the fourteen explanatory variables used in the logistic regression functions, every variable was used in at least one regression, and soil depth was the only variable used in all regressions (Table 3).

Table 3: Explanatory variables used in the logistic regression functions. (D: Depth; Max: Maximum; T: Temperature; Min: Minimum; Ppt: Precipitation)

SheehanFinalBlogTbl3

The confusion matrix (Table 4) and marginal proportions table (Table 5) for the predictions show that predictions were correct in 62% of grid cells, and that for each fire regime cluster, the correct value was predicted more than any incorrect value. Maps for 20th and 21st century predicted and actual fire regime clusters (Fig. 4, A-B, E-F) show that predictions align fairly well with actual fire regime clusters. Maps of the distances between the centers of the predicted and actual fire regime clusters (Fig. 4) show that when clusters differ, they generally differ by a small amount. The mean Euclidean distance between predicted and actual clusters is 0.21 for the 20th century and 0.18 for the 21st, far less than the minimum distance between any two cluster centers, 0.33.

Table 4: Confusion matrix for predicted and actual fire regime clusters for combined 20th and 21st centuries. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl4

Table 5: Marginal proportions table for predicted and actual fire regime clusters for combined 20th and 21st centuries. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl5

SheehanFinalBlogFig4

Fig. 4: Predicted and actual fire regime clusters and Euclidean distances between predicted and actual cluster centers: A) predicted fire regime clusters for the 20th century; B) actual fire regime clusters for the 20th century; C) Euclidean distances between predicted and actual cluster centers for the 20th century; D) predicted fire regime clusters for the 21st century; E) actual fire regime clusters for the 21st century; and F) Euclidean distances between predicted and actual cluster centers for the 21st century. (C: Carbon; Med: Medium; Frac: Fraction)

Geographical influences

Within the Blue Mountain Ecoregion a logistic regression could not be performed on the no-fire fire regime due to its low occurrence. For this reason no predictions were made for it. Logistic regression was performed for the other fire regimes and an analysis of the ecoregion was performed. Results (Tables 6 and 7) show that predictions using the regressions from the Blue Mountain ecoregion were more accurate (66% correct classification) than predictions for the ecoregion using the logistic regression for the entire PNW (Tables 8 and 9) (63% correct classification).

Table 6: Confusion matrix for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the ecoregion. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl6

Table 7: Marginal proportions table for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the ecoregion. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl7_copy

Table 8: Confusion matrix for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the entire PNW. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl8

Table 9: Marginal proportions table for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the entire PNW. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl4

Discussion

Results and hypotheses

Fire-free cells decreased over the region while precipitation and dew point temperature increased. It is highly likely that rising temperatures were ultimately behind this, but analysis to prove this more forcefully is beyond the scope of this study. That said, I would suggest that hypothesis 1 (Fire-free cells will decrease in the 21st century compared to the 20th due to increasing temperatures) is strongly supported.

Given the strong influence of fire on ecosystem characteristics, and a warming climate expanding the area where fire is likely, I expected fire regimes to change, especially along the edges of low elevation, flatter ecoregions and the hillier, higher elevation ecoregions surrounding them (for example the Columbia Plateau and Northern Rockies ecoregions). While this expected expansion took place in some places, the opposite effect took place within other ecoregions. The relationships between fire and ecoregions seem to hold or become stronger in as many cases as they become weaker. Thus hypothesis 2 (fire regimes will group within individual ecoregions over the 20th century, but less so in the 21st century due to projected changes in climate between the two centuries) is not supported.

The use of logistic regression with a tournament algorithm yielded a statistical model with strong predictive power. The high rate of correct prediction and the relatively small distances between mis-predicted and actual fire regime clusters indicate that statistical relationships are embedded within the process-based MC2 and that logistic regression can be used to project fire regime clusters. Hypothesis 3 is supported.

The predictive power of the logistic regression developed for the Blue Mountain Ecoregion was somewhat better than that of the regression for the entire area. One should not base a conclusion on a sample of one, but this does provide some support for hypothesis 4 (Logistic regression performed on an ecoregional basis will have greater predictive power than that one performed over the entire region due to local geographical relationships within ecoregions).

Broader significance

Fire is projected to occur in areas where it has not been present in the past. The relationship between cohesiveness of fire regime and ecoregion varies from ecoregion to ecoregion. As the climate warms and fire regimes change, though, changing fire regimes will likely change characteristics of some areas, and some ecotones will likely shift as a result. The implications include that for some ecoregions, boundaries will likely shift, and land managers should consider the potential effects of not only the introduction of fire into previously fire-free areas, but also shifts in ecotones.

There is a case to be made for the use of both statistical and process-based vegetation models. Process-based models not only produce predictive results, but also provide opportunities to find non-linear responses (so-called tipping points) as well as novel combinations of vegetation types, fire, and weather/climate. Statistical models, on the other hand, are much easier to develop and to execute. While a statistical model might not provide as detailed a relationship between predictive and resulting variables, it can often provide more timely results. My results indicate that it might be possible to substitute a statistical model for a process-based vegetation model under certain circumstances.

Statistical analyses are commonly used to evaluate climate models by comparing hindcasted model results to observed data. In my experience, vegetation modelers struggle to validate hindcasted results, often using simple side-by-side comparisons of observational and model data. The results from this study point to a possible way to deepen such evaluations. Researchers could develop one set of statistical models using actual historical observations for vegetation type, fire, etc. and another set using modeled data for the same time period. By comparing the two sets of models, they could evaluate the similarities and differences of the explanatory variables’ influences to gain insight into how well the model is emulating natural processes.

Future work

This study can best be described as a pilot study for combining multiple fire-related output variables into fire regimes instead of analyzing them individually. How to best define a fire regime becomes both a philosophical and research question. I will be researching this in the coming weeks and months as I plan to use a refined version of the methods I used in this course for one of the chapters in my dissertation. I have not come across the use of cluster analysis to characterize vegetation model fire regimes, so I am hoping it will be considered a novel approach.

The effects of the stochastic ignitions algorithm within the model cannot be ignored when doing a cluster analysis. Ideally, the model would be run long enough for stochastic effects to even out over model time. Another approach is to run many (hundreds to thousands of) iterations of the model and average the results, but time and resource constraints make this impractical. Another possible method would be to use a spatial weighting kernel to smooth results over the study area. This method would seem justified given the tendency for fire regimes to cluster geographically. This will be another avenue I investigate.

Finally, the results from the Blue Mountain Ecoregion speak to the potential for geographically weighted regression. I will also do more research on that method.

My Learning

Cluster analysis

About a year ago I was first exposed to cluster analysis. This course gave me a great venue in which to try it on my own. Going through the process taught me the method and some of the pitfalls. I did not normalize the values on my first pass using cluster analysis, and I learned that this can be very important to get meaningful results.

Logistic regression and tournament

The logistic regression portion of this study reintroduced me to a technique I had used before, but did not understand well. As I worked through the logistic regression, I gained a better understanding of the process and how it worked. Putting together the logistic regression blog post made me go through the steps in a deliberate fashion. Doing the 4 by 4 confusion matrix helped me to get a better understanding of what a confusion matrix represents as well as what recall and precision mean in that context. While I did not have a chance to implement geographically weighted logistic regression, I at least gained an understanding of what it means. Implementing a tournament approach was not new to me; I have used a similar technique in genetic algorithms, but it was good to confirm the potential power of such a simple method.

Other packages and methods

Through the term I learned quite a bit about various software packages and tools. I’ve used Arc quite a bit from time to time in the past, but delving into a couple of the spatial statistics tools – hotspot analysis and Moran’s I – reminded me of the power Arc has to offer, as well as some of its quirks. For example my dataset was too large for Arc to handle with Moran’s I. Since I did most of my work in R, I did not have occasion to reacquaint myself with ModelBuilder. I have done a lot of raster processing in Python, but the clipping I had to do to get the Blue Mountain Ecoregion out of the larger dataset forced me to implement a flexible and efficient clipping script that will serve me well in the future.

Since R was the package that was best for my large dataset and had the algorithms I needed, it was the program I learned the most about. I learned how to read and write NetCDF files, do cluster analysis, and do stepwise logistic regression in R. In addition to these specific methods, I learned the differences among R data frames, vectors, and matrices, each of which R handles in a different manner. I have found the nuances tricky in the past, and find them tricky still, but I have certainly become more facile with the program.

Statistics

What I have learned about statistics in this class starts with the existence of analytical methods I was not previously aware of. I was unaware of hotspot analysis, but going through the exercise provided me with a good foundational understanding. Spatial autocorrelation is something I had not had formal exposure to, but the basic concept made perfect sense; its importance, though, and the complexity of dealing with it had not occurred to me. I had been exposed to spectral analysis with temporal data, and its use with spatial data makes sense. I had heard of wavelets, but got a better understanding of the method. Correlograms were a new concept for me; the presentations on them helped me to understand them. For the most part, regression methods presented in the class reinforced or expanded on concepts I already knew. The application of a kernel for geographically weighted regression was the technique that stretched my understanding the furthest. I had been exposed to the concept of applying a kernel to a dataset, but for some reason, its use in GWR made the concept click for me. Principal components analysis was another method I had heard of, but I did not understand it until it was presented in class, and the concept now makes perfect sense.

Conclusion

Diving into my own project in this class has been a valuable experience, but probably just as valuable has been the interactions with other students and their projects. The broad scope of approaches has exposed me to a wide variety of techniques. Going forward in my research, I will be able to investigate methods I was not aware of in the past.

References

Abatzoglou, J.T., Brown, T.J., 2012. A comparison of statistical downscaling methods suited for wildfire applications. Int. J. Climatol. 32, 772–780.

Bachelet, D., Ferschweiler, K., Sheehan, T.J., Sleeter, B., Zhu, Z., 2015. Projected carbon stocks in the conterminous US with land use and variable fire regimes. Global Change Biol., http://dx.doi.org/10.1111/gcb.13048.

Sheehan, T., Bachelet, D., Ferschweiler, K., 2015. Projected major fire and vegetation changes in the Pacific Northwest of the conterminous United States under selected CMIP5 climate futures. Ecological Modelling, 317, 16-29.

Introduction: Leatherback sea turtles are endangered worldwide. Their habitat is often impacted by the oil and gas industry in the Gulf of Mexico. In 2010, the Deepwater Horizon oil spill affected much of the biota when 200 million gallons of oil spilled into the Gulf. Leatherback sea turtles are the most endangered sea turtle. My research question focuses on the extent to which these turtles utilize the Gulf of Mexico. Leatherback sea turtles feed almost exclusively on jellyfish, so I will assess their range by looking at the correlation between leatherback sea turtle point data, sea surface temperature, and chlorophyll as proxies for where jellyfish may occur.

Research Question: What is the correlation between the location of leatherback sea turtles, sea surface temperature, and chlorophyll?

Approach: I used the hotspot analysis tool to explore the sea turtle point data. The hotspot analysis tool identifies where statistically significant hotspots or clusters of sea turtles are located within the Gulf of Mexico.

Sea turtle data:

SeaTurtlePointData

 

Results: The results of the hotspot analysis (shown below) suggest that the turtle locations are significantly clustered off the coast of Louisiana and Texas in water depths of approximately 2,000 m to 3,000 m. However, the results of this analysis appear to be quite deceptive: upon taking measurements of turtles within the hotspot cluster, it appears that they may be more dispersed. Further analysis is needed to determine additional patterns in the sea turtle locations.

hotspotAnalysis_5June

The analysis portion of this project did not prove effective and ultimately produced no usable results. Let me use this tutorial to walk you through a cautionary tale about being just a little too determined to make an idea work with the dataset that you have.

 

The Question:

After doing a lot of research on both early archaeological sites and sites that contain Pleistocene megafauna in the Willamette Valley, a few patterns seemed to emerge. Megafauna sites seemed to occur in peat bogs, and all of the earliest archaeological sites occurred on the margins of wetlands. This led me to wonder whether there is a connection between the soil type, pH, and other factors across all of the known sites. What variables might be useful for predicting the locations of other, yet to be discovered, archaeological and megafauna sites throughout the valley?

 

The Tools:

In order to conduct this exploration, I decided to use the tools built into ArcGIS. Hotspot analysis and regression were to be the two main tools used.

For the data, I found a SSURGO dataset that was in vector format. It contained polygons of all of the mapped soil units in the valley, as well as a variety of factors related to slope, parent material, order, etc. Eventually I switched gears and found another SSURGO dataset that was in raster format and contained a whole lot more data, hoping that this change in dataset would make the analysis much easier.

 

The Steps:

The first step that I took when conducting this analysis was mapping different variables out and looking at them comparatively, to see if there were any obvious patterns to emerge. Three different soils popped up as looking like they were important when considering the late Pleistocene in the Willamette Valley.

 

The mapping revealed that there were three soil types that seemed to appear at most of the known sites.

 

Below is a map of all of the known Pleistocene megafauna and early Holocene archaeological sites in the valley.

 

11

There were three major soil types that emerged associated with local sites: the Labish, Bashaw, and McBee soil types.

12

The Labish soil was especially interesting, as it only seemed to occur at the major peat-bearing sites in the valley, most of which were drained lakes that are currently used for crops.

 

After reading about the nature of soil pH in wetland deposits, I began to hypothesize that pH would have been an important variable in the soils that I had identified, and wanted to use this knowledge to find more sites through hotspot analysis and/or regression analysis.

 

 

The Results:

 

The results are the toughest part to discuss, as there was not much to show.

Many attempts were made to run the regression analysis, using a wide variety of combinations of data, but every attempt returned an error of perfect multicollinearity, resulting in failures across the board. The analysis was attempted using both the vector and raster forms of the data, using the built-in pH data, pH data that was acquired elsewhere and added to the dataset, as well as other combinations of variables.

As I explored the dataset further, I realized that the data, while initially appearing to be incredibly varied, were in fact quite uniform. I mapped out the soil orders and great groups in the valley and realized that the two maps looked strikingly similar, which told me that (as was mentioned in class) all of the data were likely extrapolated from a few key points.

 

Soil Order:

13

Soil Great Group:

14

Aside from a few differences, the two maps are extremely similar, which tells me that, as mentioned above, this data was more than likely extrapolated across a large landscape.

 

This realization made me doubt the pH data too, so I mapped it out as well.

 

Soil pH Map:

15

The valley soils appear to be fairly neutral, with pH varying only between 5.7 and 6.6.

This would make it very difficult to use some sort of exploratory statistical analysis on this dataset, as there wasn’t much variability.

In order to look at how pH was distributed throughout the valley, I ran a hotspot analysis as well as a Moran’s I analysis.
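As a scripted alternative (the maps below were produced in ArcGIS), comparable statistics can be computed in R with the spdep package. This sketch assumes a point layer soils with a numeric pH field and projected coordinates; it uses the Getis-Ord Gi statistic, a close relative of the Gi* statistic behind ArcGIS’s Hot Spot Analysis tool.

library(sp)
library(spdep)

nb <- knn2nb(knearneigh(coordinates(soils), k = 8))   # 8 nearest neighbors
lw <- nb2listw(nb, style = "W")

moran.test(soils$pH, lw)       # global Moran's I for pH
gi <- localG(soils$pH, lw)     # Getis-Ord Gi hotspot z-scores
soils$gi_z <- as.numeric(gi)   # map these to see hot and cold spots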

pH Hotspot Analysis Map:

16

pH Moran’s I Analysis Results:

17

As you can see, the data are extremely clustered, especially in my particular area of interest, the valley floor.

 

 

Was this useful?

This analysis was useful, but for a different reason than expected. The SSURGO dataset is not the best tool for soil landscape analysis at smaller scales. Throughout the class, I have seen other statewide projects that were much more successful due to higher variability in soils between the east and west sides of the state.

I became a tad too determined to run this kind of analysis, and the results were completely inconclusive in that respect. In the end, though, the most beneficial part of the analysis was figuring out that there are likely connections between my sites of interest. To investigate these connections, physical testing is likely the most reliable approach, since the SSURGO data are not reliable for this purpose.

Also, don’t rely on your data too much. It might mess with your head a bit!

Wavelet Analysis: A Tutorial

 

Woodburn High School in the northern Willamette Valley, Oregon, contains evidence of an extensive peat bog as well as evidence of extinct Pleistocene megafauna. In October of 2015, sediment cores were extracted from the site in order to better understand the underlying sediment at the site, and find the sediment that is of the right age and type to possibly contain evidence of human occupation.

1

Aerial photo of the study area, with sample locations marked with orange dots.

 

In order to further explore the project, a better understanding of wavelet analysis must be established by testing it on a sample of geochemical data extracted from a core.

 

The Question:

 

Is wavelet analysis an appropriate tool for geochemically identifying stratigraphic breaks in sediment cores?

 

The Tools:

 

To conduct this analysis, I used R as well as the R ‘WaveletComp’ package for the wavelet analysis, and ‘ggplot2’ in order to graph the geochemical data.

 

‘WaveletComp’ takes any form of continuous data (typically time-series data, but spatial data in this case) and uses a small waveform called a wavelet to run variance calculations along the dataset. The resulting output is a power diagram, which shows (in red) the locations along the dataset where there is a great change in variance. A cross-wavelet power diagram can also be generated; this can indicate when two different variables are experiencing rises and/or drops at the same time.

 

2

Example of a wavelet.

 

There are two equations used when generating a wavelet power diagram…

3

The above equation uses the dataset to calculate the appropriate size of the wavelet according to the number of points in the dataset.

4

The above equation uses the wavelet to run variance calculations across the dataset and output the power diagram.

 

 

The Steps:

 

After the appropriate package is loaded into R and the dataset is uploaded, generating the power diagram and cross-power diagram takes just a small block of code…

 

Code snippet for the power diagram:

library(WaveletComp)

Fewhs010101 = analyze.wavelet(WHS010101, "Fe",
    loess.span = 0,    # Detrending (kept defaults)
    dt = 1,            # Time series resolution (kept defaults)
    dj = 1/250,        # Resolution along period axis (kept defaults)
    make.pval = TRUE,  # Draws white lines that indicate significance (TRUE or FALSE)
    n.sim = 10)        # Number of simulations

wt.image(Fewhs010101, color.key = "quantile", n.levels = 250,
    legend.params = list(lab = "WHS 010101 Wavelet Power Fe", mar = 4.7))

 

Code Snippet for a cross-power diagram:

AlFewhs010101 = analyze.coherency(WHS010101, c("Al", "Fe"),
    loess.span = 0,
    dt = 1, dj = 1/100,
    make.pval = TRUE, n.sim = 10)

wc.image(AlFewhs010101, n.levels = 250,
    legend.params = list(lab = "WHS 010101 cross-wavelet power levels - Al/Fe"))

 

 

The output of the power diagram was then compared to line graphs generated using the ggplot2 package.

 

 

The Results:

 

The results were quite interesting, as you can clearly see three different geochemical breaks, illustrated by the three red plumes rising from the bottom of the diagram. The color red indicates that at that particular “time” there is a significant edge or change in the waveform. This can be seen by comparing with the line graph below: there are red plumes present at all of the significant changes in the waveforms on the graph. This tells us that these locations along the transect should be considered “hotspots” for stratigraphic change.

 

5

Wavelet power diagram of Aluminum.

6

Example of graphed aluminum and sulfur data for the core.

 

Was this useful?

For my project, the analysis seemed to show that it is very possible to spot major changes in geochemistry across a transect. This will be further explored in my forthcoming final analysis post.

As for other projects, this method could be used to spot unseen patterns in the changes of sea surface temperature over time, or changes in frequencies of oxygen isotopes, or any other data that is presented in a time-series or in equidistant measures across a landscape.

 

Research Question

Over the duration of this course, my research question has taken on a form different from that presented in my opening blog post, but still equally valuable to my research. Instead of asking how I could bring statistics into a single landslide hazard analysis, I am now asking how statistics may be used to add consistency to the entire landslide mapping process (Figure 1).

Initial_Flowchart

Figure 1: Correspondence of landslide map types and selected inputs.

Mapping landslides can be a complicated task that, along the way, incorporates a sequence of three map types:

  1. Inventory – the mapped extents of past landslides. These extents can simply be one polygon, or they can be several polygons representing the features that compose a landslide; features such as scarps and deposits.
  2. Hazard – mapped predictions of the probability of landslide occurrence or the amount of ground deformation associated with the advent of some natural hazard, such as heavy rainfall or seismic ground motions.
  3. Risk – a mapped combination of landslide hazard and the vulnerability and exposure of infrastructure. Typical risk maps attempt to express the costs incurred by a landslide’s occurrence in a particular location.

In addition to these three types, there is also susceptibility mapping. Susceptibility maps, which show where ground conditions may be conducive to landsliding, are useful for some applications, but they are not necessary in this context.

Inventory, hazard, and risk maps should be viewed as a progression, as each new map is dependent on the content of its predecessor. A lack of geotechnical design parameters (e.g. friction angle, depth to groundwater, soil stratigraphy) at the regional scale requires that landslide inventories be used to back-calculate conditions at failure. These conditions can then be interpolated across the region to improve the inputs to a hazard map. This approach has many imperfections, but it represents the most informed analysis on many occasions.

Additionally, the hazard map is a primary input for a risk map. A good way to think about the relationship between hazard and risk maps is by answering the age-old question, “If a [landslide occurs] in the woods, does anyone [care]?” The answer is typically no, but on the occasion that the landslide wipes out their driveway, railroad track, or doghouse, the answer becomes YES! A risk map considers whether the infrastructure’s location corresponds with that of high landslide hazard, and sometimes, how much the repair of the damaged infrastructure might cost. For these reasons, risk maps are the ultimate goal for land managers, like the Oregon Department of Transportation. Knowing the costs in advance allows for improved allocation of resources and better budgeting estimates.

Datasets

The datasets used for this course were:

  1. Statewide Landslide Information Database for Oregon (SLIDO) – Points representing the location of historic landslides and polylines representing the extents of landslide scarps
  2. Northwest River Forecast Center Rainfall Data – Weather station points with past rainfall amounts and future rainfall predictions.
  3. Oregon Lidar Consortium Digital Elevation Models (DEM) – 3 foot resolution bare-earth elevation rasters.

All datasets were evaluated for various locations in the Oregon Coast Range.

Hypotheses

The hypotheses related to rainfall-induced landslide mapping are as follows:

  1. Topography and soil strength account for most of a slope’s stability, but these two factors are not solely responsible for most landslides.
  2. Rainfall is the factor that most often leads to slope failure. A slope is in equilibrium until the addition of pore water pressures from rainfall induces a failure.

These hypotheses must be broken down into more specific hypotheses to address my research question. The specific hypotheses are listed below:

  1. The adequacy of any topographic data for landslide mapping is determined by its resolution and the scale at which it is evaluated.
  2. Different elevation derivatives (i.e. curvature, slope, roughness) are better for identifying specific landslide features. For example, one derivative might be better at identifying landslide deposits, while another might be better at identifying the failure surface.
  3. The intensity and timing of rainfall determines how soil strength is diminished.

Approaches

Each of the three specific hypotheses was evaluated as coursework this quarter, and their roles in the landslide mapping process are shown in Figure 2. Hypothesis one was addressed using Fourier transforms, hypothesis two was addressed using principal component analysis (PCA), and hypothesis three was evaluated using kriging. A non-hypothesis-related approach also came in the form of a hot spot analysis, performed to identify locations of more costly landslide activity.

Final_Flowchart

Figure 2: Relationship between hypotheses and the landslide mapping process.

Results

Documentation for the implementation and results of the hot spot and kriging analyses was provided in previous blog posts, so only PCA and Fourier transforms are discussed here.

Principal Component Analysis

The purpose of performing a principal component analysis was to determine which topographic derivatives were most closely associated with the crest of a landslide scarp. The values used as inputs to the PCA were the mean slope, standard deviation of elevation, profile curvature, and planform curvature that corresponded with the location of each scarp polyline. Table 1 shows the results of the PCA.

Table 1: Coefficients resulting from principal component analysis.

Principal Component | Slope | Profile Curvature | Standard Deviation of Slope | Planform Curvature
1 |  0.99 |  0.00 | 0.00 | -0.16
2 |  0.00 |  0.80 | 0.59 |  0.04
3 |  0.16 | -0.06 | 0.02 |  0.98
4 | -0.01 | -0.59 | 0.80 | -0.05

 

Table 1 shows that the first principal component is strongly correlated with slope, while the second principal component is strongly correlated with profile curvature and the standard deviation of slope. Table 1 was not considered further because it relies on the assumption that the scarp polylines represent the true locations of landslide scarps, which was later determined to be unlikely. The PCA results still provide useful information, as the strong correlations of both profile curvature and standard deviation of slope with the second principal component spurred an additional investigation.

Having two variables strongly correlated with the same component implies that the two variables are also correlated with each other. To confirm this, profile curvature and standard deviation of slope were compared (Figure 3). The results show a nearly linear relationship between the two variables. Based on these results, standard deviation of slope was not considered in subsequent analyses related to landslide scarps.

PCA

Figure 3: Comparison of profile curvature and standard deviation of slope.
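For reference, a minimal sketch of the PCA described above, assuming a data frame scarps with one row per scarp polyline and placeholder column names for the four topographic derivatives:

derivs <- scarps[, c("slope", "profile_curv", "sd_slope", "planform_curv")]

pca <- prcomp(derivs, center = TRUE, scale. = TRUE)

pca$rotation   # loadings (coefficients), cf. Table 1
summary(pca)   # variance explained by each principal component

# The near-linear relationship noted above can be checked directly:
cor(scarps$profile_curv, scarps$sd_slope)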

Fourier transforms

Fourier transforms use a weighted sum of pairs of sine and cosine functions to represent some finite function. Each paired sine and cosine function has a unique frequency that is plotted against its amplitude to develop what is termed the frequency domain. In common practice, the frequency domain is used to identify dominant frequencies and to remove frequencies associated with noise in the data. In this project, Fourier transforms were used to determine the dominant frequencies of the topography (a digital elevation model, with results in Figure 4), which in turn provide its period. Knowing the period of topography is a useful way of determining the scale at which it may be identified in an entire landscape.

Frequency_domain

Figure 4: Example of the frequency domain of topography.

The primary failure of this approach is that most topography is dominated by very low frequencies (long periods), meaning that topography is inherently flat, which makes clear identification of small landslide features impossible. Future work filtering the frequency domain will be necessary before any conclusions may be drawn from this approach.
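To make the idea concrete, here is a hedged sketch applied to a single elevation profile rather than a full DEM (the 2-D case follows the same logic with a 2-D FFT); profile$z is a placeholder vector of elevations sampled at a fixed spacing.

elev <- profile$z
n    <- length(elev)
half <- floor(n / 2)

spec <- Mod(fft(elev - mean(elev)))[1:half]   # one-sided amplitude spectrum
freq <- (0:(half - 1)) / n                    # frequency in cycles per sample spacing

dominant_freq <- freq[which.max(spec[-1]) + 1]  # skip the zero-frequency bin
1 / dominant_freq                               # dominant period, in sample-spacing units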

Significance

The significance of this work has two major aspects:

  1. Land managers and the public benefit from landslide maps because they show the cost of building in certain locations.
  2. The statistical framework can provide a better way to threshold continuous natural data to improve consistency in the implementation of landslide mapping procedures.

The two aspects come together in that the consistent production of landslide maps will yield products that are easier to interpret. Improved interpretation of the maps will hopefully influence future construction and mitigation for existing infrastructure.

Course Learning

The primary research-oriented lessons learned during this course are:

  1. Combinations of software programs are often necessary to efficiently complete a task. While some software may have almost infinite capabilities, the time needed to implement some approaches may favor the use of other software.
  2. Most programming languages have significant libraries of code that are already written. While an individual may have the ability to write code to perform a task, unnecessary time is spent rewriting what someone else has already done. Often, that same time can be used to explore additional opportunities that may lead to innovative ideas.

From these two lessons it should be evident that I value efficiency. The breadth of my problem is great, and what I was able to accomplish during this course is only a small piece of it (see Figure 2). On top of only scratching the surface of my problem, many of my efforts also ended without meaningful results. Despite these failures, several approaches that I first thought would not apply to my work surprised me. For this reason, my greatest lesson learned is that it is important to try many different approaches to the same problem; some may work and some may not. Improved efficiency simply makes it possible to implement more analyses.

Statistical Learning

Of the activities performed during this course, the hot spot analysis and kriging were univariate analyses. Based on these two analyses, below are several advantages and limitations to univariate statistical approaches.

  1. Relatively easy to apply (in terms of time required and interpretation of results)
  2. May reveal patterns that are not obvious, which is most evident in the results of my hot spot analysis.
  3. Require large sample sizes, which is also most evident in the results of my hot spot analysis.
  4. Kriging was particularly sensitive to geographic outliers.
  5. Sensitive to the spatial distribution of sampling sites. Geographic biases, such as the selection of only landslides along roadways in my hot spot analysis, may produce deceptive results. I would not trust isolated sampling sites.
  6. One variable is not capable of modeling many processes.

Other statistical approaches performed during this course involved transformations that brought data into unfamiliar states. Both the Fourier frequency domain and principal component loadings are abstract notions that can only be interpreted with specific knowledge.

1 Research Question

Counties reacted differently to the Great Recession (officially from December 2007 to June 2009). Economic resilience is defined as the regional capacity “to absorb and resist shocks as well as to recover from them” (Han and Goetz, 2015). There are two dimensions of economic resilience, resistance and recovery. This research focuses on resistance.

The research question is: what factors are associated with resistance to the 2008 recession at the county level in the US? The variable drop, developed by Han and Goetz (2015), is used to measure resistance (actually vulnerability); it is the amount of impulse that a county experiences from a shock (the percentage deviation of actual employment from expected employment during the Great Recession).

droprebound

Figure 1 Regional economic change from a major shock and concepts of drop and rebound

dropequation
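Written out, the prose definition above amounts to something like the expression below; this is my transcription of the stated definition, not necessarily the exact formulation used by Han and Goetz (2015):

\[
\text{drop}_i = \frac{E_i^{\text{expected}} - E_i^{\text{actual}}}{E_i^{\text{expected}}}
\]

where \(E_i^{\text{expected}}\) and \(E_i^{\text{actual}}\) are the expected and actual employment of county \(i\) during the Great Recession.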

Rising income inequality is considered one structural cause of the crisis (Raghuram, 2012). A rising trend in income inequality is observed in Figure 2. Therefore, rising prerecession income inequality is hypothesized to increase drop (the reciprocal of resistance).

income inequality

Figure 2 Income inequality in the U.S., 1967-2014 Source: Data from U.S. Census Bureau 2014a, 2014b.

There are two values for four of the income inequality measures and the income distribution indicator for 2013, because a portion of the 2014 CPS ASEC (about 30,000 of the 98,000 addresses) received redesigned questions for income and health insurance coverage. The value based on this portion is denoted 2013a, while the remaining 68,000 addresses received income questions similar to those used in the 2013 CPS ASEC; this portion is labeled 2013b (U.S. Census Bureau, 2016a, 2016b).

Spatial autocorrelation is hypothesized to play a role in this relationship. The effect of income inequality or drop may not be limited to a single county but may attenuate with distance: drop in a county might be affected by the county’s own characteristics as well as by those of surrounding counties (spatial lag and spatial error models).

 

2 Description of the dataset:

Income inequality: the Gini coefficient

Income distribution: poverty rate and the share of aggregate income held by households earning $200,000 or more

Control variables: population growth rate from 2001-2005, % Black or African American (2000), % Hispanic or Latino (2000)

Capital Stock variables:

Human Capital: % population with Bachelor’s degree or higher, age group (20-29, 30-49), female civilian labor force participation rate (2000)

Natural Capital: natural amenity scale (1999)

Social Capital: social capital index (2005)

Built Capital(2000): median housing value (2000)

Financial Capital (2000): share of dividends, interest and rent(2000)

Economic structure:

Employment share of 20 two-digit NAICS industries (manufacturing, construction, etc.; other services (except public administration) omitted)

 

Table 1 Summary Statistics
Variables Obs Mean Std. Dev. Min Max Description Data source
Drop, rebound and resilience
drop 2,839 0.186 0.116 -0.236 0.895   Wu estimates based on 2003-2014 BLS QCEW
Income Distribution
Gini coefficient 3,057 0.434 0.037 0.333 0.605   Census 2000 P052
Poverty rate 3,055 0.142 0.065 0.000 0.569   Census 2000 P087
% Aggregate income held by HH earning 200K or more 3,141 0.091 0.050 0.000 0.456   Census 2000 P054
Control Variables  
Population growth rate 2001 -2005 3,055 0.020 0.056 -0.203 0.428   BEA 2001-2005 CA5N
%Black or African American 3,055 0.087 0.145 0.000 0.865   Census 2000 SF1 QT-P3
%Hispanic or Latino 3,055 0.063 0.121 0.001 0.975  
Capital stocks  
% persons with Bachelor’s degree or higher 3,055 0.164 0.076 0.049 0.605 Human Capital Census 2000 SF3 P037
%Total: 20 to 29 years 3,055 0.118 0.033 0.030 0.346 Census 2000 P012
%Total: 30 to 49 years 3,055 0.288 0.026 0.158 0.420
%Female Civilian Labor Force Participation 3,055 0.547 0.065 0.266 0.809 Census 2000 SF3 QT-P24
Natural amenity scale 3,055 0.056 2.296 -6.400 11.170 Natural capital 1999 USDA- ERS Natural Amenity Index
Social capital index 2005 3,055 -0.004 1.390 -3.904 14.379 Social capital Rupasingha, Goetz and Freshwater, 2006
Median value  All owner-occupied housing units 3,055 80495.190 41744.080 12500 583500 Built capital Census 2000 H085
Share of dividend, interest and rent 3,110 0.188 0.053 0.059 0.561 Financial capital 2000 BEA CA5
Economic Structure  
Farm employment 3,107 0.090 0.087 0 0.627   BEA 2001 CA25N
Forestry, fishing, and related activities 3,107 0.006 0.016 0 0.232  
Mining, quarrying, and oil and gas extraction 3,107 0.010 0.033 0 0.839  
Utilities 3,107 0.002 0.006 0 0.200  
Construction 3,107 0.058 0.032 0 0.260  
Manufacturing 3,107 0.109 0.089 0 0.558  
Wholesale trade 3,107 0.023 0.019 0 0.239  
Retail trade 3,107 0.109 0.030 0 0.372  
Transportation and warehousing 3,107 0.021 0.023 0 0.262  
Information 3,107 0.011 0.010 0 0.137  
Finance and insurance 3,107 0.027 0.017 0 0.201  
Real estate and rental and leasing 3,107 0.022 0.016 0 0.135  
Professional, scientific, and technical services 3,107 0.025 0.028 0 0.828  
Management of companies and enterprises 3,107 0.003 0.006 0 0.147  
Administrative and support and waste management and remediation services 3,107 0.026 0.024 0 0.192  
Educational services 3,107 0.006 0.010 0 0.112  
Health care and social assistance 3,107 0.052 0.048 0 0.305  
Arts, entertainment, and recreation 3,107 0.012 0.018 0 0.726  
Accommodation and food services 3,107 0.048 0.038 0 0.386  
  3,107 0.054 0.020 0 0.162  
Government and government enterprises 3,107 0.170 0.073 0.026 0.888  
Note: ACS- American Community Survey, BEA- Bureau of Economic Analysis, BLS- Bureau of Labor Statistics, ERS-Economic Research Service, QCEW-Quarterly Census of Employment and Wages.

Analysis units: U.S. counties

3 Hypotheses

  a) Pre-recession income inequality and other demographic, economic, and industrial factors affect drop at the county level; drop (the reciprocal of resistance) is positively related to income inequality.
  b) drop (reciprocal of resistance) is spatially autocorrelated.
  c) drop (reciprocal of resistance) is related to characteristics of a county and to characteristics (here focused on drop) of its neighboring counties at the county level.

4 Approaches

  a) drop (reciprocal of resistance) is positively related to income inequality and to other characteristics of a county

Ordinary least squares regression

  b) drop (reciprocal of resistance) is spatially autocorrelated.

Global Moran’s I and Anselin Local Moran’s I

  c) drop (reciprocal of resistance) depends on drop in neighboring counties.

Spatial lag model in Geoda
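
The same a)-c) workflow can also be sketched outside GeoDa. Below is a minimal, untested R sketch using the sf, spdep, and spatialreg packages; the file name counties.shp and the field names (drop, gini00, povrate, popgrowth) are hypothetical stand-ins for my actual data, and the covariate list is truncated.

# Sketch of the OLS -> Moran's I -> spatial lag workflow in R
# (hypothetical file and field names; not the exact code used for this post)
library(sf)         # read county polygons
library(spdep)      # neighbor lists, spatial weights, Moran's I
library(spatialreg) # lagsarlm() for the spatial lag model

counties <- st_read("counties.shp")

# Queen contiguity neighbors, row-standardized weights
nb <- poly2nb(counties, queen = TRUE)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

# a) OLS baseline
ols <- lm(drop ~ gini00 + povrate + popgrowth, data = counties)

# b) Global Moran's I of drop
moran.test(counties$drop, lw, zero.policy = TRUE)

# c) Spatial lag model: adds the spatially lagged dependent variable W*drop
lag1 <- lagsarlm(drop ~ gini00 + povrate + popgrowth,
                 data = counties, listw = lw, zero.policy = TRUE)
summary(lag1)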

5 Results

  a) OLS regression

From the table below, we can see that the Gini coefficient, the poverty rate, and the share of aggregate income held by households earning $200,000 or more are not significant predictors of drop. Therefore, pre-recession income inequality and income distribution are not correlated with drop (the reciprocal of resistance).

Besides income inequality, several other variables are significant: the population growth rate from 2001-2005, % Black or African American, % Hispanic or Latino, % of the population with a Bachelor’s degree or higher, the natural amenity scale, and the employment shares of industries such as manufacturing, wholesale trade, retail trade, transportation and warehousing, information, finance and insurance, educational services, health care and social assistance, accommodation and food services, and government and government enterprises.

Table 2 Full model of drop  
Drop Full Model
Gini coefficient in 2000 -0.322
  (-1.60)
Poverty rate 0.153
  (1.69)
% Aggregate income held by HH earning  200K or more 0.166
  (1.48)
Population growth rate 2001 -2005 0.205***
  (3.61)
%Black or African American 0.0542*
  (2.33)
%Hispanic or Latino -0.113***
  (-5.16)
% persons with Bachelor’s degree or higher -0.132*
  (-2.09)
%Total: 20 to 29 years -0.0218
  (-0.21)
%Total: 30 to 49 years -0.0731
  (-0.58)
%Female Civilian Labor Force Participation -0.0260
  (-0.37)
Natural amenity scale 0.00728***
  (6.13)
Social capital index 2005 -0.00386
  (-1.32)
Median value  All owner-occupied housing units -4.13e-08
  (-0.47)
Share of dividend, interest and rent 0.0632
  (0.99)
Farm employment -0.0183
  (-0.29)
Forestry, fishing, and related activities 0.163
  (0.99)
Mining, quarrying, and oil and gas extraction -0.0548
  (-0.49)
Utilities 0.0237
  (0.06)
Construction 0.0631
  (0.64)
Manufacturing -0.112*
  (-2.32)
Wholesale trade -0.580***
  (-4.58)
Retail trade -0.620***
  (-5.99)
Transportation and warehousing -0.337***
  (-3.47)
Information -0.590**
  (-2.62)
Finance and insurance -0.710***
  (-5.04)
Real estate and rental and leasing 0.203
  (0.87)
Professional, scientific, and technical services 0.0355
  (0.15)
Management of companies and enterprises -0.207
  (-0.53)
Administrative and support and waste management and remediation services -0.0719
(-0.59)
Educational services -0.894***
  (-4.85)
Health care and social assistance -0.215***
  (-4.26)
Arts, entertainment, and recreation -0.00162
  (-0.01)
Accommodation and food services -0.244**
  (-2.92)
Government and government enterprises -0.231***
  (-4.12)
_cons 0.527***
  (5.19)
N 2771
adj. R2 0.199
Log likelihood 2383.06
AICc -4696.13
Schwarz criterion -4488.7
Jarque Bera ***Non-normality
Breusch-Pagan test ***Non-stationary
White Test *** heteroscedasticity
For weight Matrix Queen Contiguity 1st order  
Moran’s I (error) 10.5178***
Lagrange Multiplier (lag) 506.9370***
Robust LM (lag) 19.9202***
Lagrange Multiplier (error) 1.#INF ***
Robust LM (error) 1.#INF ***
Lagrange Multiplier (SARMA) 1.#INF ***
For weight Matrix Queen Contiguity 1st and 2nd order  
Moran’s I (error) 7.9411***
Lagrange Multiplier (lag) 569.0604***
Robust LM (lag) 18.1515***
Lagrange Multiplier (error) 1.#INF ***
Robust LM (error) 1.#INF ***
Lagrange Multiplier (SARMA) 1.#INF ***

Significant results for the Jarque-Bera, Breusch-Pagan, and White tests show that non-normality, non-stationarity, and heteroscedasticity are present. The value of Moran’s I shows there is spatial autocorrelation in the residuals. Five Lagrange Multiplier test statistics are reported in the diagnostic output. The first distinction is between the Lagrange Multiplier (LM) tests, which test for the presence of spatial dependence, and the Robust LM tests, which indicate whether lag or error dependence remains once the other form is accounted for. The second distinction is between lag and error: lag refers to a spatially lagged dependent variable, whereas error refers to a spatial autoregressive process for the error term. The first two statistics (LM-Lag and Robust LM-Lag) take the spatial lag model as the alternative, and the next two (LM-Error and Robust LM-Error) take the spatial error model as the alternative. In my results, LM-Lag and LM-Error are both significant, while Robust LM-Lag is less significant (Prob 0.00002 with queen contiguity 1st order / 0.00001 with 1st and 2nd order) than Robust LM-Error (Prob 0.00000). The last test, LM-SARMA, relates to the higher-order alternative of a model with both spatial lag and spatial error terms; it is included only for completeness, since it is not that useful in practice.
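
For reference, the five LM diagnostics GeoDa prints can also be obtained in R from an OLS fit. A hedged sketch, continuing the hypothetical ols object and queen-contiguity weights lw from the earlier sketch (newer spdep versions rename lm.LMtests to lm.RStests):

# Lagrange Multiplier diagnostics of the OLS residuals against the weights
library(spdep)
lm.LMtests(ols, listw = lw,
           test = c("LMlag", "RLMlag", "LMerr", "RLMerr", "SARMA"),
           zero.policy = TRUE)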

According to Anselin (2005), I should use the spatial error model, but I chose the spatial lag model instead. As the workbook advises: “In the rare instance that both would be highly significant, go with the model with the largest value for the test statistic. However, in this situation, some caution is needed, since there may be other sources of misspecification. One obvious action to take is to consider the results for different spatial weights and/or to change the basic (i.e., not the spatial part) specification of the model.”

  b) Spatial autocorrelation

Both the global Moran’s I and the Anselin Local Moran’s I indicate that drop and the Gini coefficient are spatially autocorrelated.

Table 3 Spatial Autocorrelation
Variable Global Moran’s Index¹ z-score Pattern
The Gini coefficient 2000 0.4673² 91.45 clustered
drop 0.1253 23.82 clustered
Note: ¹ Only the continental U.S. counties are used to calculate Global Moran’s I and Anselin Local Moran’s I. All options are left at their defaults; 2,840 counties are used for drop and resilience. Conceptualization of spatial relationships: inverse distance; standardization: none.
² This differs from my first calculation, which gave 0.453298 with a z-score of 88.71; another copy of the same shapefile gives 0.44 with a z-score of 83.79.

local moran I drop 3 local moran I gini00

Figures 3 and 4: Local Moran’s I of drop and of the Gini coefficient 2000.
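
The Moran’s I values above were computed in ArcGIS with inverse-distance weights, so a contiguity-based run will not reproduce them exactly, but the global and local statistics can be sketched in R as follows (continuing the hypothetical counties object and weights lw from the earlier sketch):

# Global Moran's I for drop and for the 2000 Gini coefficient
moran.test(counties$drop,   lw, zero.policy = TRUE)
moran.test(counties$gini00, lw, zero.policy = TRUE)

# Anselin Local Moran's I: one Ii value, z-score, and p-value per county
li <- localmoran(counties$drop, lw, zero.policy = TRUE)
head(li)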

  c) Spatial lag model

To deal with the spatial autocorrelation, I create spatial weights using queen contiguity. Contiguity-based weights are based on shared borders (or vertices) rather than distance. The queen criterion defines neighbors as units that have any point in common, including both common boundaries and common corners. Therefore, the number of neighbors for any given unit under the queen criterion will be equal to or greater than that under the rook criterion, which eliminates corner neighbors, i.e., units that do not share a full boundary segment.

I tried 3 orders of queen contiguity.

queen contiguity
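
A short sketch of how higher-order queen weights could be built in R with spdep (nblag returns each order separately, nblag_cumul combines them); these are hypothetical objects continuing the earlier sketch, not the weights files actually built in GeoDa:

# Queen contiguity neighbors up to 3rd order
nb_q   <- poly2nb(counties, queen = TRUE)   # 1st-order queen neighbors
nb_lag <- nblag(nb_q, maxlag = 3)           # list of 1st-, 2nd-, 3rd-order neighbors

# Cumulative neighbor sets and row-standardized weights
lw1 <- nb2listw(nb_lag[[1]],              style = "W", zero.policy = TRUE)  # 1st
lw2 <- nb2listw(nblag_cumul(nb_lag[1:2]), style = "W", zero.policy = TRUE)  # 1st + 2nd
lw3 <- nb2listw(nblag_cumul(nb_lag),      style = "W", zero.policy = TRUE)  # 1st + 2nd + 3rd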

A county’s drop is affected by drop in neighboring counties, because the spatially lagged drop (W_drop) is significant in all three models. The coefficients are all positive, indicating that drop in a given county is positively related to drop in its neighboring counties. Hence a county surrounded by high-drop counties will tend to have a high drop, while a county surrounded by low-drop counties is likely to have a low drop.

Moreover, the coefficient on the spatially lagged drop increases with the contiguity order. My explanation is that when a county’s drop is affected by more counties in its neighborhood, the whole region tends to be vulnerable or robust together, so the coefficients are larger.

Models Log likelihood AICc Schwarz criterion W_drop
OLS 2383.06 -4696.13 -4488.7  
Spatial lag 1 (1st queen contiguity order) 2416.53 -4761.07 -4547.71 0.215***
Spatial lag 2 (2nd queen contiguity order) 2426.12 -4780.24 -4566.88 0.356***
Spatial lag 3 (3rd queen contiguity order) 2425.8 -4779.59 -4566.24 0.436***

 

 

The log likelihood increases from 2383.06 (OLS) to 2416.53 (spatial lag with 1st-order neighbors), 2426.12 (spatial lag with 1st- and 2nd-order neighbors), and 2425.8 (spatial lag with 1st-, 2nd-, and 3rd-order neighbors). Even after compensating the improved fit for the added variable (the spatially lagged dependent variable), the AIC (from -4696.13 to -4761.07/-4780.24/-4779.59) and SC (from -4488.7 to -4547.71/-4566.88/-4566.24) both decrease relative to OLS, again suggesting an improvement in fit for the spatial lag specifications.

Among the three spatial lag models, the spatial lag with 1st- and 2nd-order neighbors has the highest log likelihood and the lowest AICc and Schwarz criterion, so spatially lagging drop over 1st- and 2nd-order neighbors provides the best fit.

Only a limited number of diagnostics is provided with the ML estimation of spatial lag model 2 (queen contiguity, 1st and 2nd order). The first is a Breusch-Pagan test for heteroscedasticity in the error terms. The highly significant value of 587.2494 suggests that heteroscedasticity is still a serious problem.

The second test is an alternative to the asymptotic significance test on the spatial autoregressive coefficient; it is not a test on remaining spatial autocorrelation. The Likelihood Ratio test is one of the three classic specification tests comparing the null model (the classic regression specification) to the alternative spatial lag model. The value of 86.1133 confirms the strong significance of the spatial autoregressive coefficient.

The three classic tests are asymptotically equivalent, but in finite samples should follow the ordering: W > LR > LM. In my example, the Wald test is 9.63145^2 = 92.8 (rounded), the LR test is 86.1133, and the LM-Lag test was 506.9370, which is not compatible with the expected order. This probably suggests that other sources of misspecification may invalidate the asymptotic properties of the ML estimates and test statistics. Given the rather poor fit of the model to begin with, the high degree of non-normality and strong heteroscedasticity, this is not surprising. Further consideration is needed of alternative specifications, either including new explanatory variables or incorporating different spatial weights.

 

6 Significance

Understanding which factors affect county-level resistance could offer insights for local policies that promote regional resilience and buffer against future downturns.

My results show pre-recession income inequality is not a significant factor in explaining drop. Drop, which refers to the percentage of employment loss during the Great Recession, is significantly explained by the pre-recession population growth rate, race and ethnicity, educational attainment, the natural amenity scale, and several industries. The full model of drop shows that counties with a lower population growth rate from 2001 to 2005, a lower percentage of Black or African American residents, a higher percentage of Hispanic or Latino residents, a higher percentage of people with a Bachelor’s degree or higher, a lower natural amenity scale, and higher employment shares in manufacturing, wholesale trade, retail trade, transportation and warehousing, information, finance and insurance, educational services, health care and social assistance, accommodation and food services, and government and government enterprises would be more resistant, experiencing a smaller employment deviation from the expected growth path during the Great Recession.

Further investigation is needed for industries such as manufacturing, retail trade, information, finance and insurance, educational services, and health care and social assistance that appear to dampen drop.

7 Learning

I learned principal components analysis in R, which shrinks the number of explanatory variables, but I found it hard to interpret the principal components in the GWR results. Moreover, my GWR still shows low local R-squared values (fewer than 20 counties have an R-squared above 0.3), so there may be misspecification in my original model.

I also learned the spatial lag and spatial error models in GeoDa. I ran a spatial error model as well, but those results are not shown here. The spatial lag model with 1st- and 2nd-order queen contiguity yields a higher log likelihood than any of the spatial error models, yet my earlier LM tests suggested the error form was more significant and so the spatial error model should fit better. This needs more investigation. I find the tools very useful, and I am curious to find a way to test for a spatial lag in income inequality across neighboring counties.

Moreover, I learned a lot from the tutorials and projects from my classmates. They are wonderful!

 

8 Statistics

Hotspot:

Pros: identify the clustering of values, show patterns, easy to apply, fits both binary and continuous values

Cons: sensitive to geographic outliers and edge effects, requires a sample size >30, univariate analysis, cannot identify outliers the way Local Moran’s I can, sensitive to the spatial distribution of points

Spatial autocorrelation (Local Moran’s I):

Pros: similar to hotspot analysis, but can also identify outliers (high-low and low-high outliers)

Cons: similar to hotspot analysis

Regression (OLS, GWR):

OLS gives global coefficients, i.e., one overall relationship. It is a multivariate analysis and can include many explanatory variables.

GWR builds a local regression equation for each feature in the dataset.

Cons: maps of GWR coefficients can only show one relationship at a time, and GWR cannot include large numbers of explanatory variables.

Multivariate methods (PCA):

Pros: can reduce the number of explanatory variables; the resulting PCs (linear combinations of the candidate variables) capture the most variation in the dataset

Cons: it’s hard to interpret the coefficients/results when more than one PC enters the regression.

Learning: the resulting principal components capture the most variation in the explanatory dataset; however, they may not best capture the variation in the dependent variable.

Reference:

Anselin, Luc. Exploring Spatial Data with GeoDa: A Workbook. Urbana-Champaign, IL: Spatial Analysis Laboratory, University of Illinois, 2005.

Han, Yicheol, and Stephan J. Goetz. “The Economic Resilience of US Counties during the Great Recession.” The Review of Regional Studies 45, no. 2 (2015): 131.

Rajan, Raghuram. Fault Lines. HarperCollins Publishers, 2012.

Rupasingha, Anil, Stephan J. Goetz, and David Freshwater. “The Production of Social Capital in US Counties.” The Journal of Socio-Economics 35, no. 1 (2006): 83-101. Data retrieved March 23, 2016, from http://aese.psu.edu/nercrd/community/social-capital-resources

U.S. Census Bureau, Current Population Survey, Annual Social and Economic Supplements. “H2 Share of aggregate income received by each fifth and top 5 percent of households, all races: 1967 to 2014” and “H4 Gini ratios for households by race and Hispanic origin of householder: 1967 to 2014” (2014a). Retrieved May 17, 2016, from https://www.census.gov/hhes/www/income/data/historical/inequality/

U.S. Census Bureau, Current Population Survey, Annual Social and Economic Supplements. “Table 2. Poverty status of people by family relationship, race, and Hispanic origin: 1959 to 2014” (2014b). Retrieved May 17, 2016, from https://www.census.gov/hhes/www/poverty/data/historical/people.html

U.S. Census Bureau. Historical Income Tables - Footnotes (2016a). Retrieved May 17, 2016, from http://www.census.gov/hhes/www/income/data/historical/ftnotes.html

U.S. Census Bureau. Historical Poverty Tables - Footnotes (2016b). Retrieved May 17, 2016, from https://www.census.gov/hhes/www/poverty/histpov/footnotes.html


Background

Bluebunch wheatgrass is an important native species used extensively in the restoration of Great Basin habitats. Bluebunch wheatgrass has also been found to be locally adapted to climate (St. Clair et al. 2013). Seed transfer zones were developed for bluebunch wheatgrass to reduce the chances of maladaptation. Maladaptation can stem from the movement of locally adapted bluebunch wheatgrass phenotypes to differing climates where they are not likely to thrive.

Although climate clearly plays a large role in the success of this, and other plant species, it is unclear to what extent local soils may also influence genetic differentiation at the seed source population level. As more and more seed transfer zones are constructed for native species, it is important to consider all factors that may place selective pressure on plant populations. It follows that an understanding of soil-plant dynamics within the context of seed zones will help to further improve the seed transfer zone delineation process and lead to greater success in restoration efforts in general.

Because of the inherent complexity of soils, the significant effort involved in mapping them accurately, and their heterogeneous distribution in space, soil traits have been excluded from the already complex method of seed zone delineation. The focus of this study is to explore the body of soils data that exists for the Great Basin region and determine if the tools used to construct climate-based seed zones (such as existing common garden experiments) might also be used to better understand soil-plant relationships.

Research Question

Do seed-source population phenotypic traits of bluebunch wheatgrass vary with one or many soil traits that exist in the Great Basin?

Datasets

  1. The seed zones for bluebunch wheatgrass have been delineated into shapefiles. Each polygon represents a seed zone. The boundaries of these polygons were delineated using ArcMap spatial analysis of climate (precipitation / temperature), elevation, and plant traits (leaf length, leaf width, crown width etc.). This dataset / map was developed in 2013 and is based on two years of data collection at common-garden sites in the Great Basin.
  2. The Phenotypic trait dataset contains measurements of individual plants at sixteen different common gardens spread throughout the study area. Each seed zone is represented by two common gardens, and each common garden contains 4-5 populations from each seed zone. Measurements such as leaf length, width, and reproduction stage scores were gathered from each individual plant in all sixteen gardens. Each population of bluebunch wheatgrass growing in the common gardens is associated with a collection site (UTM / elevation / approximate area in acres). Each common garden also has an associated location (UTM), elevation, and dimensions in meters.

Figure 1. Map of Study Area and Seed Zones for Bluebunch Wheatgrass

SeedZonesMap

3. “The gSSURGO (soils) database is derived from the official Soil Survey Geographic (SSURGO) database. SSURGO generally has the most detailed level of soil geographic data developed by the National Cooperative Soil Survey (NCSS) in accordance with NCSS mapping standards. The tabular data represent the soil attributes and are derived from properties and characteristics stored in the National Soil Information System (NASIS). The gSSURGO data were prepared by merging the traditional vector-based SSURGO digital map data and tabular data into statewide extents, adding a statewide gridded map layer derived from the vector layer, and adding a new value-added look up table (valu) containing “ready to map” attributes. The gridded map layer is in an ArcGIS file geodatabase in raster format. The raster and vector map data have a statewide extent. The raster map data have a 10-meter cell size that approximates the vector polygons in an Albers Equal Area projection. Each cell (and polygon) is linked to a map unit identifier called the map unit key. A unique map unit key is used to link the raster cells and polygons to attribute tables” (Soil survey staff 2015).

Objective

My objective is to identify important soils characteristics at seed source locations that influence phenotypic traits in bluebunch wheatgrass. A secondary objective is to generate hypotheses about how soil traits may influence genetic differentiation in bluebunch wheatgrass. If I find correlations between bluebunch wheatgrass phenotypes and local soils, then soils and not just climate, may require more serious consideration during seed zone delineation. If I am unable to correlate bluebunch wheatgrass phenotypes with soils characteristics, then I will conclude that soils are (most likely) not a dominant factor in the genetic differentiation of this species.

Approaches

  • Hot Spot analysis in ArcMap of soil order and various bluebunch wheatgrass traits
  • Geographically weighted regression in ArcMap of available water storage of seed-source soils and bluebunch wheatgrass phenotypic traits
  • PCA – To visualize differences among bluebunch wheatgrass populations in phenotypic trait space
  • MRPP (Multiple Response Permutation Procedure) – To test whether phenotypic traits differ significantly between soil classifications

Hotspot Analysis Results:

Figure 2. Map of soil orders and hot spots of crown width in a common garden in zone 7 (cool and wet)

HotSpotCoolWet

Figure 3. Map of soil orders and hot spots of crown width in a common garden in zone 1 (hot and dry)

HotSpotHotDry

Discussion:

Based on these results there is not strong evidence of clustering in crown width in populations from either common garden. Although not shown here, I also performed hot spot analysis on leaf length, width, and length-width ratio and produced similar results.

These results are not surprising given the dispersed geographic distribution of the populations selected for each common garden. Furthermore, in many cases the populations of bluebunch wheatgrass selected to represent a seed zone were spatially dispersed as well. Although many of the points are within relatively close proximity, they are usually residents of different seed zones and climate regimes with different growth habits. For this reason it is not surprising that the observed phenotypic traits do not group spatially. In areas where some indication of clustering occurs, it is likely that two populations from the same seed zone happen to also be close in proximity.

Geographically Weighted Regression Results:

Figure 4. Regression Coefficients (negative) of Leaf Width in terms of Available Water Storage (upper 25cm) for Populations Grown at the Wahluke Common Garden

neg.GWR.AWS25.Leafw.WahlukeN1.WahlukeN1

Figure 5. Regression Coefficients (positive) of Leaf Width in terms of Available Water Storage (upper 25cm) for Populations Grown at the Wahluke Common Garden

pos.GWR.AWS25.Leafw.WahlukeN1

Figure 6. Regression Coefficients of Leaf Length in terms of Available Water Storage (upper 25cm) for Populations Grown at the Wahluke Common Garden

CoefficientsWahluke

Darker blue points indicate more negative coefficients. Pale blue points indicate a very slight negative correlation between AWS and leaf length.

Figure 7. Local R-squared Values of Regression Coefficients for Leaf length in terms of AWS for Populations Grown at the Wahluke Common Garden.

LocalRWahluke

Interpretation

In figure 4 above, two of the populations with negative coefficients also tend to occur in moderately wet (AWS 3.7-6.8) soils. In figure 5, positive relationships exist in moderately wet to dry areas (AWS 2.5-6.8). Additionally in both figure 4 and 5, negative coefficients tend to be near other negative coefficients while positive coefficients also tend to be near other positive coefficients.

In figure 6, all of the regression coefficients were negative suggesting that to varying degrees, leaf length is negatively correlated with AWS. When compared to the local R-squared values in figure 7, it appears that the more strongly negative coefficients were also commonly associated with higher R-squared values.

PCA / MRPP Results:

Figure 8. Principal Component Analysis of Soil Order in Trait Space

PCA

Interpretation

The individual triangles represent each individual plant in each of the sixteen common gardens. Triangles that are closer in space are more similar in terms of phenotypic traits. The color-coding and enclosing polygons around the triangles represent soil orders. The radiating blue lines are indicators of the strength and direction of the relationship between each phenotypic trait and either of the principal components.

 

Table 1. Multiple Response Permutation Procedure (MRPP) results for the PCA above.

MRPP

Interpretation

The results of the MRPP indicate that there is a significant difference in bluebunch wheatgrass phenotypic traits when grouped according to Mollisols and Aridisols. In addition there is a moderately significant result for Aridisols versus Inceptisols. The low A statistic however indicates that there is very little within-group agreement (meaning that there is high trait variance within all of the soil orders). This suggests that although the differences we see are not likely attributable to chance, grouping by soil order is at best a weak predictor of phenotypic traits in bluebunch wheatgrass.
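
For readers who want to try this outside PC-Ord, here is a minimal R sketch of the same kind of PCA and MRPP using the vegan package; traits (a data frame of leaf length, leaf width, crown width, etc.) and soil_order (a grouping vector of soil orders) are hypothetical stand-ins for my actual inputs.

# PCA of phenotypic traits and an MRPP grouped by soil order (sketch)
library(vegan)

traits_scaled <- scale(traits)     # standardize the phenotypic measurements

pca <- prcomp(traits_scaled)       # principal components in trait space
summary(pca)                       # variance captured by each component

# MRPP: chance-corrected within-group agreement (A) and significance
mrpp_fit <- mrpp(traits_scaled, grouping = soil_order, distance = "euclidean")
mrpp_fit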

Significance

Although the results of my analysis are modest at best, it is important to recognize that bluebunch wheatgrass phenotypic traits may not be adapted to soils, or that the soil traits I have elected to analyze for this course are not strong drivers of selection. If no strong relationships between soils and local phenotypes emerge through this exploratory study then my work will support the current methodology in seed zone delineation which excludes soils.

Learning: Software

ArcMap

A huge hurdle that I overcame was learning how to utilize the gSSURGO soils database and extract relevant data for analysis in ArcMap as well as other programs. Now that I have spent considerable time working with this immense dataset I am much more confident in my ability to continue to analyze interesting soil-phenotype relationships in the future.

I also learned how to perform a hot spot analysis and a geographically weighted regression. Although being limited to the analysis of one common garden and phenotypic trait at a time does not provide particularly helpful results, I do find it very valuable to understand.

Learning: Statistics

ArcMap

This was my first exposure to geographically weighted regression and the process of mapping the regression coefficients and their associated R-squared values enhanced my ability to interpret the results.

PC-Ord

The process of interpreting figure 8 was very powerful for me. As I learn more and more about PCA and other multivariate approaches to statistical analysis, I am encouraged that they might be powerful tools for meeting my research objectives.

Other

Although I didn’t have time to gain personal experience with Kriging, wavelet analysis, or other tools that my classmates used, I learned a lot from their presentations about how to interpret the results and problem solve with them.

Note – This is largely an edited transcript of my final presentation, which was pre-recorded and uploaded to the internet as a video-blog. You can see it here.

Background

The Valley of Oaxaca, Mexico was home to one of the first civilizations in the New World, the Zapotec State. Founded around 500 BC with the construction of the hill-top city of Monte Albán, the Zapotec State persisted for more than 1000 years before finally dissolving around AD 850. My research focuses on pottery production and exchange at the height of Zapotec Civilization during the Late Classic (AD 550 – 850), when we suspect that large secondary centers such as Dainzú-Macuilxóchitl and Jalieza were beginning to break away from the capital at Monte Albán.

Microsoft PowerPoint - Pink_MacuilTanSack2016.pptx

Figure 1: Late Classic settlement patterns in the Valley of Oaxaca, showing major sites mentioned in the text.

Our study of pottery exchange networks is intended to address how politically integrated these communities were by assessing their degree of economic interaction. To do this, we analyze the elemental composition of ancient pottery recovered from excavations at these sites, and compare this to similar data for clay samples collected from across the region during the Oaxaca Clay Survey in 2007 and 2012. Figure 2 shows clay sampling locations relative to regional geology. As you can see, the geology of the area is quite complex, contributing to stark differences in the elemental composition of clay resources from different areas. The majority of our samples, however, were collected from alluvial sediments on the valley floor, where sediments derived from multiple parent materials have mixed and been altered through soil formation processes. Yet we know little about how interactions between these factors influence elemental clay composition. This is important because we cannot sample every clay deposit in the region, but wish to identify possible production locales between our sampling locations.

 OaxacaGeo

Figure 2: Oaxaca Clay Survey sampling locations relative to regional geology.

Research Questions

  1. How do concentrations of individual elements in Oaxaca clays pattern spatially across the region?
  2. How are concentrations of individual elements related to environmental factors such as parent material, soil development, and the transport of sediment through alluvial systems?

Because this is an exploratory study, analyses will be limited to five major and minor elements measured for 137 clay samples collected from the eastern, Tlacolula arm of the valley. These elements include aluminum (Al), calcium (Ca), sodium (Na), potassium (K), and iron (Fe).

Data

This study relies on three datasets:

  1. Measurements of aluminum (Al), calcium (Ca), sodium (Na), potassium (K), and iron (Fe) concentrations in 137 clay samples from the Tlacolula Valley. These data will act as response variables in regression analyses.
  2. Spectral reflectance and emissivity data for 13 bands of ASTER imagery, as measured at each clay sampling location during a cloud-free period in June of 2001 at a band-dependent spatial resolution of 15 to 90m (Figure 3). These data will act as proxy measures of surface lithology and serve as one set of predictor variables for clay chemistry.
  3. Elevation, slope, and flow accumulation estimates for each clay sampling location derived from a 90m DEM covering the Tlacolula Valley. Elevation and slope are good proxy measures for the differential processes affecting alluvial and residual soils, while flow accumulation provides a direct measure of a given location’s position in the valley’s stream network. These data will act as a second set of predictor variables for clay chemistry.

GWR_OCS

Figure 3: Oaxaca Clay Survey sampling locations over ASTER image of the Tlacolula Valley

Hypotheses

H1 – Concentrations of individual elements will be spatially auto-correlated due to their association with different, spatially segregated parent materials.

H2 – The elemental composition of residual and alluvial clay deposits in the Tlacolula Valley, Oaxaca, Mexico, will be strongly conditioned by geologic parent material (as measured through spectral reflectance and emissivity in ASTER imagery) in both alluvial and residual soils.

H3 – The elemental composition of clay deposits will be further affected by processes of erosion and deposition (as measured through elevation, slope, and flow accumulation rasters).

H0 – The elemental composition of Tlacolula Valley Clays may be poorly predicted by both ASTER imagery and DEM-derived variables due to (1) spectral interferences from land-cover such as vegetation and pavement, and (2) elevation, slope, and flow accumulation may be inadequate proxy measures for other factors affecting sediment lithology and chemical alteration.

Approaches

Four statistical methods were used to evaluate the hypotheses listed above:

  1. Moran’s Index, to assess spatial auto-correlation for each element in the Tlacolula clays.
  2. Principal Components Analysis (PCA), to reduce the dimensionality of the ASTER data from 13 bands to 4 Principal Components (ASTPC1 – ASTPC4).
  3. Exploratory Multiple Regression (EMR), to identify an optimal combination of dependent variables for prediction of each element using global, least squares regression.
  4. Geographically Weighted Regression (GWR), to examine local deviations from global models that may result in non-stationarity in the data.

See my earlier blog-post on multiple regression for details regarding my use of Principal Components Analysis, Exploratory Multiple Regression, and Geographically Weighted Regression for analysis of these data.

Results

Moran’s I values for elemental concentrations of Al, Ca, Na, K, and Fe are highly significant for clay sampling locations in the Tlacolula Valley, supporting our first hypothesis that concentrations of each element are spatially auto-correlated, as one might expect if elemental clay chemistry is largely driven by parent material.
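
As an illustration only, here is how those per-element Moran’s I tests might be set up in R with distance-based neighbors for the 137 sampling points; the file name, field names, and the 10 km threshold are all hypothetical, and the actual analysis was run in ArcGIS.

# Global Moran's I for each element at the clay sampling points (sketch)
library(sf)
library(spdep)

clays  <- st_read("tlacolula_clays.shp")        # hypothetical point layer
coords <- st_coordinates(clays)

nb <- dnearneigh(coords, d1 = 0, d2 = 10000)    # neighbors within 10 km (projected CRS assumed)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

for (el in c("Al", "Ca", "Na", "K", "Fe")) {
  cat("\n---", el, "---\n")
  print(moran.test(clays[[el]], lw, zero.policy = TRUE))
}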

To clarify which factors were most related to concentrations of each element, a series of Exploratory Multiple Regressions were run using all four ASTER PCs, Elevation, Slope, and Flow Accumulation as candidate predictor variables. Exploratory Multiple Regression is a brute-force method for identifying optimal combinations of predictor variables in complex multivariate data; it works by assessing the relative fit of all possible combinations of variables.
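
The fit-comparison part of this approach can be approximated with an all-subsets search in R; below is a hedged sketch using leaps::regsubsets for one element (the ArcGIS tool also screens VIF, residual normality, and residual spatial autocorrelation, which this sketch omits; variable names continue the hypothetical clays object above).

# All-subsets least squares for Ca as a stand-in for Exploratory Multiple Regression
library(leaps)

subsets <- regsubsets(Ca ~ ASTPC1 + ASTPC2 + ASTPC3 + ASTPC4 + elev + slope + flowacc,
                      data = sf::st_drop_geometry(clays),
                      nvmax = 7, method = "exhaustive")
summary(subsets)$adjr2   # adjusted R-squared of the best model at each size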

Table 1 shows the final models selected for each of our five elements. Four things are important to note:

  1. All elements are predicted using a combination of ASTER PCs and DEM-derived variables, confirming that both parent material and soil formation processes influence Tlacolula Valley clay chemistry;
  2. Each element was best predicted using a different combination of variables;
  3. Elevation was not included in final models for any of the five elements; and
  4. While each model is statistically significant at α = 0.05, they have relatively weak measures of fit.

Table 1: Final least squares models selected for each element using Exploratory Multiple Regression.

Pink_GWR_T4

A possible reason for the poor fit of the EMR models is that they carry the assumption of spatial stationarity. And yet, we already know from our Moran’s I tests that each of the response variables is spatially auto-correlated. Geographically Weighted Regression is a good post-hoc test for evaluating whether local deviations from global fit are an issue for OLS models. Table 2 shows quasi-global measures of fit for GWR models relative to those of the final OLS models selected through Exploratory Multiple Regression. Each shows slight to moderate improvement over its least squares counterpart, but what is really of interest are the measures of local fit.

Table 2: Comparison of quasi-global measures of fit for GWR and OLS models predicting concentrations of five elements in Tlacolula clays. Predictor variables are those selected for final OLS models using Exploratory Multiple Regression.

Pink_GWR_T5
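
A comparable GWR can be sketched in R with the spgwr package; the predictor set shown for Ca is illustrative rather than the final EMR-selected model, and the object names continue the hypothetical clays layer above.

# Geographically weighted regression for Ca (sketch)
library(spgwr)

xy  <- st_coordinates(clays)
dat <- sf::st_drop_geometry(clays)

bw <- gwr.sel(Ca ~ ASTPC1 + slope + flowacc, data = dat, coords = xy)   # CV bandwidth
gwr_ca <- gwr(Ca ~ ASTPC1 + slope + flowacc, data = dat, coords = xy,
              bandwidth = bw, hatmatrix = TRUE)

head(gwr_ca$SDF$localR2)   # local R-squared values to map, as in Figures 4, 5, and 7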

Space prevents an in-depth discussion of each element, but a couple of examples should be sufficient to illustrate my argument. Results of a GWR for Al show much stronger measures of local fit for the western side of the study area than the east (Figure 4), suggesting a strong lack of stationarity in the data. Results for Ca are remarkably similar (Figure 5), showing nearly identical patterns of strong local fit in the west and poor local fit in the east. As discussed in my blog-post on multiple regression, my interpretation is that these patterns result from strong variability in both predictor and response variables in the west relative to the east. A map of raw Ca concentrations for each clay sampling location (Figure 6) shows a number of extreme values in the west resulting from the association with limestone in these locations. There are also, however, a number of much lower, more normative values just downslope on the floodplain where sediments are derived from mixed lithologies. That is, in the western part of the study area there is high variability in both Ca and its predictor variables.

We can see something similar in the results of a GWR for Fe. Measures of local fit are highest in the west and abysmal in the central part of the study area on the Rio Salado floodplain (Figure 7). Again, I would argue that this is due to high levels of variability in Fe concentrations in the west related to localized differences in parent material, slope, and position in the stream network (Figure 8). In the Rio Salado floodplain, by contrast, there is little local variability in either slope as a predictor variable or Fe as a response.

High variability in both predictor and response variables will not in all cases result in a strong local or global fit. There must also be a relationship between the predictor variables and the response. However, an absence of local variability in the response variable due to spatial auto-correlation will inevitably result in lower measures of local fit.

GWR_Al

Figure 4 (above): Local fit of GWR model for Aluminum.

GWR_Ca

Figure 5 (above): Local fit of GWR model for Calcium.

GWR_CaPPM

Figure 6 (above): Calcium concentrations in ppm. Size of bubbles is proportional to elemental concentrations.

GWR_Fe

Figure 7 (above): Local fit of GWR model for Iron.

GWR_FePPM

Figure 8 (above): Iron concentrations in ppm. Size of bubbles is proportional to elemental concentrations.

Significance

While this is an exploratory study, and the results are preliminary, our work so far suggests that there are significant relationships between elemental clay chemistry, sediment lithology, and processes of erosion and deposition in the Tlacolula Valley. Furthermore, it shows that these relationships can be assessed using remotely sensed data derived from ASTER imagery. While the strength of these relationships varies both spatially and between individual elements, this analysis provides a view to the complex factors influencing clay chemistry in the Valley of Oaxaca.

For archaeologists working in Oaxaca, this is significant because it informs a greater understanding of the factors influencing the spatial distribution of compositionally distinct clay resources that would have been available to pottery producers. Similar methods could be employed in studies of ceramic exchange for any region with similarly complex geology.

Outside of archaeology, these methods could conceivably be used by anyone interested in how the admixture of sediments derived from multiple parent materials influences the elemental composition of alluvial soils at a regional scale. Concentrations of elements such as K, Na, and P may have an enormous effect on agricultural productivity from field to field, and this is but one example. One can imagine a range of other potential applications in Forestry, Botany, Geology, Soil Science, and Public Health. The sky is the limit.

Lessons Learned – Software

For this project, I used a combination of ENVI (for processing of the ASTER imagery), ArcGIS (for spatial data manipulation, raster calculation, Moran’s I, and Geographically Weighted Regression), and JMP (for calculation of non-spatial statistics, including Principal Components Analysis and Exploratory Multiple Regression). I was already familiar with these software packages, but I would be interested in any suggestions for other programs or packages that could improve these analyses.

I did get to learn how to make a Powerpoint into a Youtube video for my final presentation, which will be invaluable in my work as an ecampus instructor.

Lessons Learned – Statistics

Before this project I was unfamiliar with geographically weighted regression. After some initially promising results, I was forced to grapple with what these models were really telling me, and whether I could (in good faith) use them to generate predicted surfaces in the way that one might with a strong OLS model. I ultimately concluded that this would be inappropriate with my data and that Geographically Weighted Regression is perhaps better used as a post-hoc test for stationarity in a least squares model. That said, I recognize that this is not always how it is employed, and that other members of the class have tried to take it further. I have written about this at length in a previous blog-post, but am no more an authority on this subject than they are and would be interested in any other thoughts on the subject.

Boosted regression trees (BRT) are an ensemble, tree-based species distribution model that iteratively grows small, simple trees based on the residuals from all previous trees. Each tree is fit to a random subset of the data, and ALL predictor variables are considered when choosing the best split at each node. The tree complexity determines how deep each individual tree is grown, so anticipated interactions can be captured by setting an appropriate tree complexity. The learning rate determines the overall weight of each individual tree. This is an advanced form of regression built from two main components, regression trees and boosting:

elith08_fig1

 

  1. Regression trees: partition the predictor space into rectangles, using a series of rules to identify the regions with the most homogeneous response to the predictors, and fit a constant to each region.
  2. Boosting: an iterative procedure that reduces the deviance by fitting additional trees to the residuals of the previous tree(s).
    • Each individual tree informs subsequent trees and thus the final model.
    • The boosting component is what makes boosted regression distinct from other tree-based methods.

 

Objective of this analysis:

Identify the biophysical parameters associated with giant gourami distribution in SE Asia, starting with a set of 47 global occurrence points.  This was an exploratory exercise to learn about the physical variables that might be important for the distribution of the giant gourami. 

My Data:

I pulled the 47 occurrence points for the giant gourami from fishbase.com and used them as the basis for the dataset.

ArcMap: generate points and convert to KML for import into Google Earth Engine

//Get coordinates for occurrence points for species of interest (Giant Gourami) from fishbase.com

//create random points to use as ‘pseudo absence’ points

//generate random points within study region and only in rivers

Google Earth Engine: ‘gather’ biophysical data for points from satellite imagery–used for ease of access to spatial layers

//load in image layers of interest (NDVI, CHIRPS, Population density, Flow, Surface Temp.)

//export to CSV for analysis in R Studio

 

R code for running the BRT model:

I ran the model using the gbm package in R, based on a tutorial by Jane Elith and John Leathwick (https://cran.r-project.org/web/packages/dismo/vignettes/brt.pdf)

 

source("BRT/brt.functions.R")   # helper functions from the Elith & Leathwick tutorial
install.packages("gbm")
library(gbm)

# define the dataset
gourami <- read.csv("BRT/gourami/gourami_data.csv")

# data consist of 39 presence and 305 pseudo-absence points (344 total)
# 5 predictor variables

gourami.tc3.lr005 <- gbm.step(data = gourami,
    gbm.x = 3:7,              # columns in the dataset holding the predictor variables
    gbm.y = 2,                # column holding the presence/absence data (0/1)
    family = "bernoulli",
    tree.complexity = 3,      # tree depth; determines the number of splits in each tree
    learning.rate = 0.005,    # weight of each tree in the overall model
    bag.fraction = 0.75)      # fraction of the dataset used to build/train each tree

 

The three main parameters to pay attention to at this point are tree complexity, learning rate, and bag fraction. The data held out of each tree's random subsample, together with cross-validation, are used to evaluate the model. These three parameters can be varied to determine the ‘best’ model.
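
One way to compare settings, sketched below (not the exact code used for the comparison figure further down), is to loop over candidate tree complexities and learning rates and keep the cross-validated deviance that gbm.step reports:

# Compare BRT settings by cross-validated deviance (sketch; assumes the
# gourami data frame loaded above)
library(dismo)

settings <- expand.grid(tc = c(3, 4), lr = c(0.01, 0.005))
settings$cv.dev <- NA

for (i in seq_len(nrow(settings))) {
  fit <- gbm.step(data = gourami, gbm.x = 3:7, gbm.y = 2,
                  family = "bernoulli",
                  tree.complexity = settings$tc[i],
                  learning.rate   = settings$lr[i],
                  bag.fraction    = 0.75)
  settings$cv.dev[i] <- fit$cv.statistics$deviance.mean
}

settings   # lower cv deviance (with a sufficient number of trees) suggests a better setting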

Results

Initial output for model with tree complexity of 3, learning rate of 0.005, and bag fraction of 0.75.  Several warning messages were displayed with this particular model, which are not addressed in this tutorial:

mean total deviance = 0.707

mean residual deviance = 0.145

estimated cv deviance = 0.259 ; se = 0.058

training data correlation = 0.916

cv correlation =  0.838 ; se = 0.043

training data ROC score = 0.989

cv ROC score = 0.958 ; se = 0.02

elapsed time –  0.13 minutes

 

Warning messages:

1: glm.fit: algorithm did not converge

2: glm.fit: fitted probabilities numerically 0 or 1

 

Relative contributions of each variable in determining where the species is expected to occur. For this model, precipitation has the strongest pull:

> gourami.tc3.lr005$contributions

                   var   rel.inf

mean_Precip: 61.223530

mean_temp: 25.156042

pop_density: 10.299844

NDVI_mean:  1.685350

Flow:   1.635234

 

Interactions:

NDVI Precip flow Temp Pop
NDVI 0 29.62 0.11 0.07 0.08
Precip 0 0 17.00 317.66 84.51
flow 0 0 0 0 0.93
Temp 0 0 0 0 3.29
Pop 0 0 0 0 0

Partial dependence plots visualize the effect of a single variable on the model response, holding all other variables constant. Model results vary the most with precipitation, as seen in the top-left plot. Mean temperature and population density also appear to play a role in giant gourami distribution based on these plots, though their effects may be more apparent if you zoom in on the upper temperature threshold or the lower population density range.

gourami_tc3_lr005_plots
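
The partial dependence panels and the interaction estimates above can be reproduced with the helper functions in dismo (or the tutorial’s brt.functions.R); a short sketch:

# Partial dependence plots and pairwise interaction strengths (sketch)
library(dismo)

gbm.plot(gourami.tc3.lr005, n.plots = 5, write.title = FALSE)   # one panel per predictor

int <- gbm.interactions(gourami.tc3.lr005)   # larger values = stronger pairwise interaction
int$interactions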
Model comparison, varying tree complexity and learning rate to evaluate the best settings. The top row illustrates model fit for a tree complexity of 3 with a learning rate of 0.01 (left) and 0.005 (right). The bottom row illustrates model fit for a tree complexity of 4 with a learning rate of 0.01 (left) and 0.005 (right):

Gourami_BRT_Rplot_3x4d3-4_lr0.1-0.005_wNOpop

It appears that a model with a tree complexity of 3 and a learning rate of 0.005 performs the best. This model indicates that precipitation has the largest effect on the distribution of giant gourami in SE Asia, based on the initial 34 occurrence points.  

Model critique: BRTs are not a spatially explicit model and thus rely only on the relationship between the variables and the sample points. Additionally, due to the complex nature of the model (often outputting thousands of trees), the results can be difficult to interpret or explain.