Having endured much frustration trying to use Geographically Weighted Regression to determine relationships between the patch metrics I calculated using FRAGSTATS, I’ve decided to change things up a bit for my final blog post. I’ll continue working with Geographically Weighted Regression, but rather than applying it to FRAGSTATS-derived metrics (shape index, edge:area ratio etc.), I will experiment with a different LandTrendr output: forest biomass.

Dataset:

Explaining more about the dataset here will help me articulate the research question below. The biomass layer is a raster calculated using a Tasseled Cap Transformation, a method of enhancing the spectral information content of Landsat imagery. It is essentially a measure of per-pixel brightness (soil), greenness (vegetation), and wetness (soil and canopy moisture). I will be using yearly time-series biomass layers and “time-stamped” clear-cut disturbance patches.

Research Question:

I’m still not sure how well I can articulate the question, but here goes: Is there a statistically significant relationship between the mean biomass value within a clear-cut patch at the time of the clear-cut and the mean biomass within that same patch before and after the clear-cut?

 

Hypothesis:

Clear-cutting obviously results in a loss of biomass, and I expect that quantities of biomass within a clear-cut patch before, during and after a clear-cut will exhibit a significant relationship.

Approach:

While I had easy access to the biomass data, creating a dataset of disturbance patches with attributes for biomass before, during, and after the timing of each clear-cut was a carpal-tunnel-inducing task. I ought to have approached it programmatically, and I did try, but my Python skills were lacking. I ended up using a definition query to filter disturbance patches by year, and then ran three Zonal Statistics operations on the biomass layers (one for the year before a given set of clear-cuts, one during, and one after). I then joined each biomass calculation back to the clear-cut patches. Below is an attribute table for one set of disturbance patches (note the three mean biomass values) and an example of a disturbance patch overlaid on a “before, during, and after” set of images. I did this for three sets of yearly clear-cuts, and then merged them into one dataset of roughly 700 features.

[Figure: attribute table with before/during/after mean biomass fields, and a disturbance patch overlaid on before, during, and after imagery]
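For reference, a programmatic version of this workflow could look roughly like the following R sketch using the raster package; the file names, year, and field names here are hypothetical, not the ones I actually used.

library(raster)
library(rgdal)

# Hypothetical inputs: one biomass raster per year and the clear-cut patches mapped in 2000
biomass <- stack("biomass_1999.tif", "biomass_2000.tif", "biomass_2001.tif")
patches <- readOGR(dsn = ".", layer = "clearcuts_2000")

# Mean biomass within each patch for the year before, during, and after the clear-cut
means <- extract(biomass, patches, fun = mean, na.rm = TRUE, df = TRUE)
names(means) <- c("ID", "bio_before", "bio_during", "bio_after")

# Join the per-patch means back onto the patch attribute table
patches@data <- cbind(patches@data, means[, -1])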

I then ran Geographically Weighted Regression, with “mean biomass after clear-cut” as the dependent variable, and “mean biomass before” and “mean biomass at timing of clear-cut” as explanatory variables.

Results:

I experimented with multiple combinations of the three mean biomass variables and also tried adjusting the kernel type. The most significant run was the one on the left, which used the parameters described above.

[Figures: local R² maps and GWR output tables for the different runs]

Significance:

While it was satisfying to finally produce some significant statistics, I recognize that the analysis is not groundbreaking. While change in biomass is certainly of interest to forest managers and ecologists, the way in which it was calculated here (as a mean of an annual snapshot within a patch) may not have significant implications.

What I learned:

If this course is nicknamed “Arc-aholics Anonymous” then you could say I had somewhat of a relapse, as most of my analyses throughout the quarter made use of tools I had used in ArcMap before. That said, I gained a much more thorough understanding of their functionality, and feel I have a better command over interpreting their sometimes perplexing results. I now have a much better idea of the types of datasets and variables that lend themselves to certain methods, and the experience of working with a large dataset of Landsat time-series-derived forest disturbance will be invaluable to my research moving forward. I learned a great deal from others in the course and am glad to have made some new contacts (especially you coding gurus). Some of the work produced in this course was truly outstanding and I feel inspired to hone my own skills further, particularly with open-source software.

 

Background

The Willamette Valley of the Terminal Pleistocene (10,000+ years ago) was a wildly different place than the valley we are used to seeing today. There were massive animals roaming the valley floor, which was a lot wetter, with more marshes, bogs, and lakes. Oak savannah and forested swaths of land made their way towards the Willamette River… This wonderful landscape was likely also inhabited by humans, though that is a very difficult question to explore.

In order to find out where in the Pleistocene Willamette Valley people might have lived, we must first understand the sediments that lie under our feet here in the valley, and the best way to do that is by extracting them from the ground and analyzing them.

This project is part of a bigger-picture study that uses sediment cores extracted from a buried peat bog at Woodburn High School in Woodburn, Oregon to identify the buried sediments of the appropriate age and depositional environment in which to find both potential Pleistocene-aged archaeological sites and further evidence of Pleistocene megafauna.

 

Research Question

For the purposes of this project, I am seeking a different method for identifying stratigraphic breaks in a sediment profile. The most popular methods for identifying stratigraphy are visual examination, multivariate statistical analysis, and texturing the sediment.

Using x-ray fluorescence geochemical data and wavelet analysis, a method typically used for studying time-series data, is it possible to determine the site stratigraphy using one or two different variables?

 

The Data

The dataset consists of XRF data taken at 2 mm intervals from 65 1.5-meter core samples. These cores come from 14 different boreholes covering the majority of the defined study area, which is approximately 200 × 50 meters. The cores were extracted using a Geoprobe direct-push coring rig.

 

The Site:


The Geoprobe in action at the site:


The core samples were halved and run through an iTrax core scanning unit. The iTrax scans the cores using an optical camera, a radiograph (similar to a medical x-ray), and an x-ray fluorescence scanner, which collects geochemical data consisting of 35 different element counts at 2 mm intervals. The data are organized into 14 CSV files containing the XRF results.

 

The iTrax:


Hypothesis

Using wavelet analysis, significant increases and decreases of geochemical properties can indicate where stratigraphic breaks in the sediment occur. This pattern should be repeatable across all of the gathered cores.

 

Approach

The method I chose to analyze my core data and attempt to break apart the stratigraphy was wavelet analysis, using an R package called “WaveletComp”.

“WaveletComp” takes any form of continuous, regularly sampled data (typically a time series; in this case, a depth series along the core) and uses a small waveform called a wavelet to run variance calculations along the dataset. The resulting output is a power diagram, which shows (in red) the locations along the dataset where there is a great change in variance. A cross-wavelet power diagram can also be generated, which can indicate when two different variables are experiencing rises and/or drops at the same time.

 

Example of a wavelet.


There are two equations involved in generating a wavelet power diagram. The first uses the number of points in the dataset to determine the appropriate size (scale) of the wavelet. The second applies the wavelet across the dataset, running the variance calculations that produce the power diagram.
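For reference, WaveletComp is built around the Morlet mother wavelet; in their standard textbook form (not necessarily the exact notation of the original figures), the wavelet and the wavelet transform of a series \(x_t\), with \(\tau\) the position along the core and \(s\) the wavelet scale, are:

$$\psi(t) = \pi^{-1/4}\, e^{i\omega t}\, e^{-t^{2}/2}, \qquad \omega = 6$$

$$W_x(\tau, s) = \sum_{t} x_t \, \frac{1}{\sqrt{s}} \, \psi^{*}\!\left(\frac{t - \tau}{s}\right)$$

The quantity plotted in the power diagrams is the wavelet power, \(\left|W_x(\tau, s)\right|^{2}\).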

 

Using the “WaveletComp” package in R, I processed 5 different core scans. To conduct the analysis properly, elements had to be selected for both univariate and bivariate analysis. There were a variety of ways that I could have selected the data, but ultimately, in the test sample, I chose to look at the elements that had the most obvious changes: aluminum and iron.

The details on how to actually run the “WaveletComp” package can be found in my wavelet analysis tutorial.
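As a minimal sketch of what such a run looks like (the data frame and column names are hypothetical, and most arguments are left at their defaults):

library(WaveletComp)

# core_df: one row per 2 mm scan step, with element counts in columns "Al" and "Fe"
w.al <- analyze.wavelet(core_df, my.series = "Al",
                        loess.span = 0,        # no detrending
                        dt = 1,                # one unit = one 2 mm step
                        make.pval = TRUE, n.sim = 100)
wt.image(w.al, main = "Al wavelet power")

# Cross-wavelet of Al and Fe to see where the two elements change together
wc <- analyze.coherency(core_df, my.pair = c("Al", "Fe"),
                        dt = 1, make.pval = TRUE, n.sim = 100)
wc.image(wc, main = "Al x Fe cross-wavelet power")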

 

Results

 

After all of the tests were run, the resulting wavelet power diagrams were placed alongside line graphs of the element that was run, as well as a cross-wavelet power diagram, which indicates when the two selected elements change at the same time.

The wavelet power diagrams show significant changes in the waveform, as indicated by the red (high variance); the blue in the power diagram shows low variance. In the cross-wavelet power diagram (the one on the far right), the arrows indicate whether the waveforms of the two elements are in phase (arrows pointing right) or out of phase (arrows pointing left) with each other.

To identify stratigraphic breaks, I looked at the “taller” red plumes in the wavelet power diagrams, which indicate a significant change in the amount of an element present, either a great increase or a great decrease. This result was compared to the line graph, as well as to the image of the core (for visual identification of changes in color or texture). The spots that have high plumes and correspond with a significant color or texture change in the image are presumably the stratigraphic breaks.

Each of the five cores showed promise that we can identify stratigraphy using wavelet analysis. The two most significant cores are discussed below:

[Figure: wavelet power diagrams, element graphs, and core image for the first core segment]

The results for the first core segment show various distinct changes, as indicated by the tall red plumes in the power diagrams, but there is one major problem with the results. The graphed data show areas of zero data, as indicated by the drops to the very bottom of the graph. These spots also have the most distinct plumes (for the most part), but about two-thirds of the way down there is a distinct plume that, while it contains zero values, also corresponds to a spot with a distinct texture change in the sediment. At this point the sediment changes texture and feel entirely.

[Figure: wavelet power diagrams, element graphs, and core image for the second core segment]

The results of the second core sample were more interesting, as there were no zero values. The Al + Fe cross-wavelet diagram shows significant plumes at almost all of the locations where significant-looking color changes appear in the profile.


The second core was the most interesting of the analysis, due to the lack of zero/null data. With a little data management, the zero data in the other cores can be reduced by interpolating some of the values and re-running the analysis with the interpolated data. That should help reduce the effects of the zero data demonstrated in the results above.

 

Significance

Wavelet analysis is an excellent tool for observing patterns in sequential spatio-temporal data. As for the significance of this project:

Once all of the observed kinks are fixed in the data, this could serve as a new method for identifying changes in stratigraphy. This could be a good method to identify stratigraphy in sediment cores that are extremely similar in color or texture.

 

What I learned about the software

R is an excellent tool for conducting many different kinds of spatio-temporal analysis, from wavelet analysis to regression to really any other tool that ArcGIS has to offer.

ArcGIS is an excellent tool for data visualization, but it is a very finicky program.

 

What I learned about statistics.

Statistical applications in geospatial analysis are very important to understanding even the smallest of changes, such as an increase or decrease in iron in a sediment core.

 

Background

The relationship between microbiome composition and host health has recently generated a great deal of attention and research. The importance of host-associated microbiomes is still poorly understood, although significant relationships between gut microbiome composition and host health have been described. Although the gut microbiome has received the most attention, each body site is home to its own distinct microbial community. The nasal microbiome has received relatively little attention, although a few studies suggest that there is a relationship between nasal microbiome composition and incidence of infections. Using a unique system of closely studied semi-wild African Buffalo, I propose to study the drivers of nasal microbiome composition in a social species.

Data Collection

Our study herd of semi-wild African Buffalo (Syncerus caffer) is kept in a 900-hectare predator-free enclosure located in Kruger National Park, South Africa. Every 2-3 months, all 60-70 individuals are captured, and biological samples are collected for diagnostics as part of a larger study that the Jolles lab is conducting on Foot-and-Mouth Disease. Age, sex, and body condition are recorded, in addition to a number of other physiological parameters. Degree of relatedness is known for each pair of individuals in the herd. Each animal is fitted with a GPS collar that is programmed to record location every 30 minutes, and with a contact collar that records the identity and duration of contacts with other buffalo. The GPS data used in this exploratory analysis were collected between October and December 2015.

Overarching Hypotheses

My research is guided by the following two hypotheses:

  1. Conspecific contacts drive nasal microbiome similarity and disease transmission.
  2. Habitat overlap drives nasal microbiome similarity and disease transmission.

I also propose to examine the effects of other parameters (age, sex, body condition, relatedness, etc.) on nasal microbiome composition, but for this class I focused on the spatial parameters.

Approaches

The first step toward addressing hypothesis 1 is to quantify conspecific contacts. I tested several approaches during the course of this class. The first approach, presented during my first presentation, was to generate buffers in ArcGIS around each animal at a given time point and generate a measure of “crowdedness.” The second method was to generate a contact matrix in R (see Figure 1) showing distances between each pair of individuals at a given time point. Detailed methods and R code are given in my first blog post.


Figure 1: Sample distance matrix, generated for a subset of 5 animals at a single time point. Distances range between 3 and 18 meters.
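A sketch of how such a pairwise matrix can be computed in R with the geosphere package (the data frame and column names here are hypothetical):

library(geosphere)

# gps: data.frame with columns id, lon, lat for all animals at one time point
coords <- as.matrix(gps[, c("lon", "lat")])
d <- distm(coords, fun = distHaversine)   # pairwise great-circle distances in meters
dimnames(d) <- list(gps$id, gps$id)
round(d, 1)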

As an intermediate step towards addressing my second hypothesis, it will be necessary to measure the percent habitat overlap between every pair in the herd. I used a suite of tools to carry out a test analysis using GME, ArcMap, and Excel. Detailed methods are outlined in my second blog post. In summary, I used GME to generate kernel densities and 95% isopleths for two individuals, then used the Identity tool in Arc to calculate the area of overlap. Figure 2 shows a sample of the output from the Identity tool.


Figure 2: Habitat overlap between two individuals in the study herd. Individuals’ home ranges are colored yellow and purple, respectively. Area of overlap between the two animals is outlined in red. Overlap between the two animals is 80% and 90%, respectively.
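For comparison, the kernel density and overlap steps could also be scripted in R with the adehabitatHR package. This is only a sketch under assumed object and column names, not the GME/ArcMap workflow actually used here:

library(sp)
library(adehabitatHR)

# locs: data.frame with an animal id and projected x/y coordinates
xy <- SpatialPointsDataFrame(coords = locs[, c("x", "y")],
                             data   = data.frame(id = locs$id))
ud   <- kernelUD(xy, h = "href")                        # kernel density per animal
hr95 <- getverticeshr(ud, percent = 95)                 # 95% isopleth polygons
ov   <- kerneloverlap(xy, method = "HR", percent = 95)  # pairwise proportion of home-range overlap
ov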

Eventually, once I have determined contact network and habitat overlap matrices, I will look at correlations between microbiome similarity, habitat overlap, and average distance to conspecifics.

Significance

  1. Distance matrix: this analysis showed potential usefulness for researching herd behavior and structure, and I will likely return to it in future analysis. However, since my question of interest relates to microbe transmission, which is likely to be associated only with close contacts, I plan to focus my current efforts on utilizing the contact collars I described in the data section. Although I will lose some spatial information, it will simplify my analysis and increase temporal resolution.
  2. Habitat overlap: The method described here for measuring habitat overlap shows great promise for use in my research, especially if the process can be automated. I will explore iteration functions in GME and ArcGIS ModelBuilder to find the best way to expedite this analysis across multiple pairs of individuals at multiple time points.

Potential Issues

A few problems that I will likely need to deal with as my analysis progresses:

  1. The possibility of high correlation between conspecific contacts and habitat overlap. I will try to control for this by looking for animals that have high spatial overlap but low contact rate, and vice versa.
  2. Non-matching pairwise percent overlap. For example, individual 13 showed 90% overlap with individual 1, whereas 1 showed only 80% overlap with 13. I can deal with this by looking at pairwise averages, unless the percent overlap is too different, in which case I will need to explore other options.

Lessons Learned

Software Packages:

Thanks to help from my classmates, I became much more comfortable and familiar with geospatial functions in R. I also used GME for the first time and discovered that it has great potential usefulness for my future analyses. I became familiar with ModelBuilder in ArcGIS while attempting to iterate the buffering analysis. I still have a lot to learn with all these tools, but I feel much more confident than I did prior to this class.

Statistical methods:

Although I did not utilize any of the statistical methods outlined in the syllabus due to the unique nature of my dataset, I learned a great deal from watching my classmates present. I expect that hotspot analysis, multivariate statistics, and different types of regression models will be part of my future. In particular, I plan to use regression models and PCA to help analyze my data in the future.

 

 

 

 

Mapping Weeds in Dryland Wheat: A spectral trajectory approach

Introductions: Into the weeds

When I started this course, I had a central goal in mind: I wanted to use remotely sensed data to identify weeds. What I didn’t have was a clue as to how I was going to get there. I had a few data sets to work with, and although I had made some limited progress with them, I had not yet been able to ask the kinds of spatial questions I knew were possible with them. As a part of this course, I formed this central goal into two research questions:

  • How does spectral trajectory relate to weed density? Can variation in spectral trajectory be used to distinguish weed species from crop species?
  • Can this information be used to distinguish the spatial extent of weeds in dryland cropping systems? How well do different methods of assessing weed density perform?

Originally, I had planned on using some low-altitude aerial imagery I had taken in 2015 to address this question. After many, many late nights trying to get these data into a form suitable for analysis, I decided that, for the sake of brevity and actually getting something done in this class, I would use an alternative data set.

My alternative data set consisted of three kinds of reference data on the location and species of weeds in a 17-acre field NE of Pendleton, OR, as well as Landsat 8 images taken over the course of summer 2015. The reference data consisted of transect data where species-level density counts were made, spectral measurements of green foreign material made during harvest, and observations made from the front of a combine at the time of harvest. In the Landsat data set I used in this class, I only included cloud-free images from the 2015 growing season.


 

Motivations and Relevance:

My goal in this class was to use variation in phenology between weed and crop species to distinguish weed density using time-series Landsat data. One of the major issues with weeds in cropping systems, however, is that they typically represent only a very small fraction of the total cover in a pixel. The revised hypotheses I ended up testing in this class were:

  1. Weeds and crop can be distinguished based on their relative rates of change in greenness (NDVI).
  2. Visual estimates of weed density will be more accurate at estimating weed density than automated inline assessments.

The main reason I find these questions interesting is that species-level discrimination in satellite-based remote sensing is still such a challenge. Most remote sensing efforts to identify weeds using satellite data perform poorly, in large part because of the typically low signal weeds have in a cropping system. Advancement in the methodologies for identifying species in satellite imagery would be a significant contribution. In addition, there have been few attempts at using a time-series approach to distinguish weeds to the species level; most work has concentrated on increasing the spatial or spectral resolution of imaging efforts for making species-level classifications. As well, the data I was working with represented as complete a sampling of the field as is reasonably possible: all plant material from the field had been cut and measured by the spectrophotometer. In this way, the spatial resolution of my reference data was very fine.

Methods:

In this class, I ended up attempting to test these hypotheses using two general approaches that were made available to us in the class. I used hot spot analysis to identify areas of the field with statistically significant positive values as “weedy.” With previous attempts at classification, I typically ended up with poor results when comparing the spectrally generated data set to my reference data set. Hot spot analysis allowed me to overcome this, in that the classification was not as sensitive to extremely high or low values; the improved performance is likely because hot spot analysis takes into consideration the neighborhood of values a measurement resides in. I then used the output of the hot spot analysis as the dependent variable in a geographically weighted regression, with NDVI as the explanatory variable. With geographically weighted regression, we remove the assumption of stationarity and fit regression models based on a local neighborhood rather than to the entire dataset. Finally, I classified the output of the regression analysis based on the coefficients for each of the terms of the local linear models. My goal here was to attempt to get at why the relationships between weed density and NDVI appear to vary as they do.
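The hot spot step was run in ArcGIS, but an equivalent Getis-Ord Gi* calculation in R with the spdep package would look roughly like this (the neighborhood distance, object, and column names are assumptions):

library(spdep)

# pts: SpatialPointsDataFrame of field cells with a weed density measurement
nb <- dnearneigh(coordinates(pts), d1 = 0, d2 = 60)     # 60 m neighborhood (assumed)
nb <- include.self(nb)                                  # include each cell itself -> Gi*
lw <- nb2listw(nb, style = "B", zero.policy = TRUE)
gi <- localG(pts$weed_density, lw, zero.policy = TRUE)  # z-scores
pts$hot <- as.numeric(gi) > 1.96                        # significant positive values = "weedy"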

Results:

The specific details of each of these analyses are available in their individual posts, but overall the results were very promising. I was able to improve the classification accuracy of a spectrophotometric assessment of weed density from ~50% to 85%, as good as a human observer. This result indicates that in-situ assessment of weed density may be a real possibility in the near future, which has implications for farmers as well as scientists and land managers. Generation of quality reference data is very difficult to do over broad spatial areas, and any way to improve that process is a good thing.

Regarding the output of the geographically weighted regression, I find I’m still grappling with the results. While I did end up with distinct areas of the field having differing responses of weed density to NDVI, I’m finding difficulty in interpreting the output. If additional explanatory variables could account for why weeds respond the way they do, wouldn’t one simply include those variables in the original model for predicting weed density? So while geographically weighted regression may be suitable for a data exploration exercise, I’m not sufficiently convinced of its utility in predicting weed density. More work is necessary to come to a conclusive answer as to why local relationships between NDVI and weed density appear to vary as they do.

 

Learning:

I learned an incredible amount this quarter, mostly from the high-quality presentations and conversations I had with my colleagues in this class. A very high bar was set by the work done by this cohort. Although I didn’t get as far as I might have liked with these data, this was my first foray into spatial statistics. I had the advantage of being very comfortable in R, but this class forced me to branch out into ArcMap and, by the end of the quarter, into Python and Earth Engine. One of the major issues I had was that the data set I was working with was not as high a resolution as I had originally intended. I wanted to be using machine learning methods on a very high-resolution multispectral data set I’ve been working on, but it wasn’t ready for analysis by the time I needed to produce results. I ended up doing a lot of work that never saw the light of day, looking at issues in spatial variance and resolution, that with more time I would have liked to present on.

 

The research question and Background

Soil microbes are critically important for soil to function, because they provide the functional backbone for many soil ecosystem services, such as the carbon and nitrogen cycles. The structure and composition of microbial communities can have a large influence on the rates at which these ecosystem services are performed; therefore, determining the spatial distribution of these microbial communities is necessary to establish accurate functionality of soils across a landscape. However, this can be very difficult due to the fact that microbial communities are extremely diverse: in one gram of soil there can be over a billion microbes with anywhere from 4,000 to 50,000 different species. There are two different potential forces governing the spatial distribution of microbial communities. The first theory, popularized by Baas Becking (1934), states that “everything is everywhere, but the environment selects.” This idea is in line with classical niche theory in ecology, in which species distributions are dictated by their niches. The opposite of this paradigm is the idea that microbes are more limited by dispersal potential than by environmental selection factors. This reflects neutral theory, which states that species distributions are completely stochastic, and at any given location the species composition is determined by the species composition of its neighboring areas.
Until recently, these theories had not really been tested in microbial communities. However, with the advent of high-throughput sequencing in the 1990s, an in situ view of microbial communities became possible. In a recent study, O’Brien et al. (2016) determined that at a fine scale (1 cm²) community composition was extremely heterogeneous and correlated weakly with environmental factors; however, as the scale and scope increased, patterns in microbial community distribution began to emerge and were closely linked to environmental parameters. This study illustrates that when examining microbial communities, the spatial scale of the study will influence the results.
In my study, we want to know which principles govern microbial distributions across a regional landscape. Are microbial communities across a regional landscape influenced by environmental parameters, or is their distribution dictated more by dispersal limitation? There has been some work at a similar scale that has determined that bacterial diversity is more closely linked to environmental parameters than to dispersal limitation. Specifically, studies from all over the world have found the most influential environmental factor to be soil pH (Fierer and Jackson 2006, Griffiths et al. 2011).
The Dataset
During the spring of 2015, 113 soil samples were collected across the state of Oregon to capture the spatial pattern of microbial community structure across a regional landscape. Sampling locations were stratified by Common Resource Areas (CRA), which are geographically associated land resource regions based on geology, climate, water, soils, biological resources, and land use. This stratification was used to help capture all of the spatial heterogeneity of the soil across a regional landscape. A subset of these locations (90 out of 113) was previously sampled by the NRCS in a soil series survey, where an extensive lab workup was conducted to catalog each of these soils’ physical and chemical properties.
Hypothesis
The physical and chemical properties in soil will have a greater influence on microbial communities than spatial variability. In essence, on a regional landscape scale, the factors dictating microbial community composition will be linked more closely to environmental factors in the soil rather than how geographically distant two samples are.
If this is true, then we need to be able to quantify the heterogeneity of these environmental parameters across a landscape. In this blog post I will use the 90 sample points from the NRCS to see how well interpolation methods perform when examining the spatial heterogeneity of edaphic properties, using soil pH as a proxy. To examine the accuracy of these interpolation methods, we use the SSURGO database of soil pH as our truth. This will allow me to compare the interpolation techniques and determine their strengths and weaknesses.
Approaches
The first method used was spatially extrapolating pH point data over their relative CRA polygon. In theory since these CRA polygons were delineated by similar environmental factors, the edaphic properties such as soil pH should be conserved within each polygon.
Inverse Distance Weighted (IDW) Interpolation method was also attempted. This method essentially determines the value of any unknown point by taking weighted averages of all known points, where the weights are determined by their distances to the unknown point.
The last interpolation method used was ordinary kriging. Just as IDW only considers distances between points for interpolation, so does kriging; however, unlike IDW, kriging incorporates the fact that beyond a certain distance there is no more useful information to explain the variation at a given point, as determined through the variogram. Therefore, kriging uses a variogram model to determine the maximum distance from any given point that information should come from. The figure below shows the relationship between semivariance and distance from a given point. The fitted line is the model used in the ordinary kriging interpolation, and it determines the weights for the average.

Variogram
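Both interpolations can be reproduced in R with the gstat package. The sketch below uses hypothetical object and column names, and the initial variogram parameters are only starting guesses for the fit:

library(sp)
library(gstat)

# soils: data.frame with lon, lat, and pH for the 90 NRCS points; grid: a prediction grid (SpatialPixels)
coordinates(soils) <- ~ lon + lat

# Inverse distance weighted interpolation (power = 2)
ph_idw <- idw(pH ~ 1, soils, newdata = grid, idp = 2)

# Ordinary kriging: fit a spherical variogram model, then krige
v  <- variogram(pH ~ 1, soils)
vm <- fit.variogram(v, vgm(psill = 1, model = "Sph", range = 100, nugget = 0.1))
ph_ok <- krige(pH ~ 1, soils, newdata = grid, model = vm)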

After interpolation, these maps were then compared to the SSURGO map to test their performance. This comparison was done by finding the absolute difference at each pixel and summing the variance observed. Since the SSURGO map was not completely finished for all of Oregon, each of the interpolated maps was clipped to the same spatial extent as the SSURGO map.
Results

The map shown below is the soil pH values extrapolated over their respective CRAs. You can see that the Willamette Valley is quite homogeneous in its soil pH values, while Oregon’s eastern side is quite a bit more heterogeneous. The problem with this map is that it draws sharp lines between CRA polygons, where in theory the real pH values would form a continuum as you moved from one CRA to another.

CRA

The IDW map seemed to show the same pattern in the Willamette Valley; however, the heterogeneity of the eastern side of Oregon seems to be blended together. The IDW interpolation also seemed to develop pockets of like pH values, which look more like artifacts of the IDW algorithm than real characteristics of the landscape.

IDW_plot
Like IDW, ordinary kriging also blended the heterogeneity of the eastern side of Oregon together; however, it did so without creating many of these artificial pockets that the IDW method seemed to create.

kriging
The main quantitative difference between the SSURGO map and the interpolated maps can be seen in the western part of Oregon. The SSURGO map shows a much higher level of heterogeneity compared to the interpolated maps as seen in the Willamette Valley. It also shows a large amount of heterogeneity in the southeastern side of Oregon that was only captured in the CRA map.

SSURGO_map

As expected, the IDW map performed the least favorably (a variance score of 2676), while both the kriging and CRA maps had very similar results (variance scores of 2440 and 2414, respectively). This is quite surprising, since the kriging map seemed to homogenize the soil pH of eastern Oregon, heterogeneity that was present in both the CRA map and the SSURGO map. Below is a map of the difference in pH between the CRA map and the SSURGO values of pH. You can see that the key areas in which the CRA map failed were areas of high heterogeneity in the SSURGO map.

Difference map

Significance
Through this analysis we have determined that across this landscape there is a large amount of heterogeneity in edaphic properties. There may be significantly correlated parameters that have not been captured in the SSURGO database. If so, it may be inappropriate to interpolate these values using just our sampling points.
Finally, if it is determined that geographic distance is the best proxy for determining microbial community distribution at this scale, an interpolated map of the community needs to be presented in a way in which the audience understands where the map has strong areas of success and areas in which we are quite sure the map is wrong.
What I Learned: Software
The majority of the project was conducted in R’s statistical environment, for several reasons. First, I was already familiar with the programming language. Moreover, R is quite powerful in its data manipulation capability and its ability to execute complex statistics in its base functions. Because R is also completely free and available to anyone, it allows me to conduct my analyses without relying on paid programs. The main drawback to this particular statistical software is its rather steep learning curve, which made ArcGIS preferable in some respects. I found ArcGIS rather useful in its ability to quickly display results and data with minimal effort. It is able to display a large amount of information very quickly in a user-friendly way. However, when attempting statistical interpolation methods in ArcGIS, I felt rather limited in its customizability and was largely ignorant of the underlying processes it was performing during interpolation.
R, on the other hand, requires more work up front to create and display raster and vector files; however, once the data are in, manipulating them and doing analysis on them is quite straightforward. When conducting spatial analysis, R is transparent: each function comes with its own documentation about exactly what happens during its execution. It also requires users to have a working understanding of the basic methodology when doing spatial analysis, which is always a good thing.
All in all, both programs have their niches and best uses; however, one should not feel limited to just one. Exploiting the strengths of each program and bypassing their relative weaknesses by shifting between platforms should be everyone’s goal when conducting any sort of analysis, including spatial analysis.

What I Learned: Spatial Statistics
Before joining this class I had a very limited understanding of spatial statistics. Through the class exercises and a lot of reading, I developed a comfortable understanding of the basic ideas behind several different spatial statistics methods. Some of these include IDW, kriging, geographically weighted regression, PCA ordination, and boosted regression trees. However, I feel the most important lesson I learned was how to critically interpret the results of these analyses, such as geographically weighted regression, in ways that make intuitive sense.
Work Cited
Fierer, Noah, and Robert B. Jackson. “The Diversity and Biogeography of Soil Bacterial Communities.” Proceedings of the National Academy of Sciences of the United States of America 103, no. 3 (January 17, 2006): 626–31. doi:10.1073/pnas.0507535103.

Griffiths, Robert I., Bruce C. Thomson, Phillip James, Thomas Bell, Mark Bailey, and Andrew S. Whiteley. “The Bacterial Biogeography of British Soils.” Environmental Microbiology 13, no. 6 (June 1, 2011): 1642–54. doi:10.1111/j.1462-2920.2011.02480.x.

O’Brien, Sarah L., Sean M. Gibbons, Sarah M. Owens, Jarrad Hampton-Marcell, Eric R. Johnston, Julie D. Jastrow, Jack A. Gilbert, Folker Meyer, and Dionysios A. Antonopoulos. “Spatial Scale Drives Patterns in Soil Bacterial Diversity.” Environmental Microbiology, March 1, 2016, n/a-n/a. doi:10.1111/1462-2920.13231.

Exploring Giant Gourami Distribution Models

The research question:

In the context of predicting the impacts of the rapidly expanding aquaculture industry and understanding the impacts of natural and human changes on the landscape, one of the overarching research questions that I am interested in is: What is the natural distribution of the air-breathing giant gourami (Osphronemus goramy) in South East Asia, and how is it predicted to change over time with respect to 1) the biophysical properties of the landscape, 2) human impacts, and 3) climate projections into the future?

Specific to this class, the research question that I explored was: what is the distribution of the giant gourami in SE Asia in the year 2000, based on 4 environmental variables (NDVI, precipitation, surface temperature, and river flow accumulation) and human population density?

Background: The giant gourami inhabits regions characterized by fresh to brackish water, in slow-moving areas like swamps, lakes, and large rivers.  Given its unique ability to breathe air, this fish can survive in poorly oxygenated to anoxic waters.  Already a popular fish in the aquarium trade and eaten in some regions, this fish is farmed in SE Asia.  I expect that with climate change, increased urbanization, and the changing hydrologic profile of the system due to potential dams, this fish may become more suitable than other species, given its ability to live in ‘poorer’ environmental conditions.

The Dataset:

[Map: giant gourami presence/pseudo-absence points across SE Asia]

My dataset consists of points of fish presence/pseudo-absence across SE Asia (image above), characterized by associated environmental variables of interest.  The ‘presence’ portion of my dataset was pulled from fishbase.com and consists of 47 occurrence points for giant gourami, collected primarily during the 1980s-1990s through an unknown sampling protocol.  Clipping to my study region (SE Asia) left me with 36 presence points.  Background data, or pseudo-absence points, were generated randomly along rivers and streams of SE Asia in ArcMap.  Environmental data for all points were taken from the freely available satellite datasets listed below on Google Earth Engine, using a 1 km buffer around each point and filtered to the date range of Feb-Jun 2000 when possible (for consistency in the dataset).

  • NDVI: 32-day Landsat 7
  • Precip: CHIRPS (Climate Hazards Group InfraRed Precipitation with Station data) at 0.05deg resolution.
  • Flow: HydroSHEDS flow accumulation 15-arc seconds (Feb 11-22, 2000)
  • Temp: NCEP/NCAR reanalysis 2.5deg resolution
  • Population: WorldPop project – estimated mean population density 2010 & 2016 at a 100x100m resolution

[Figures: predictor variable rasters and histograms]

Hypotheses:

Based on my point data and the variables used, my hypotheses are below (the third is untested):

  1. Giant gourami are distributed across a wide range of habitat types, including relatively warmer surface temperatures, a range of flows, and a range of precipitation.
  2. The distribution of the giant gourami is not affected by human population density.
  3. Giant gourami distributions will not change much given predicted climate warming (untested)

Approaches:

This was an exploratory analysis that employed a species distribution model, boosted regression trees (BRT).  This is an ensemble, tree-based species distribution model that iteratively grows small/simple regression trees on the residuals of all previous trees and works well with non-parametric data.  BRTs consist of two components, regression trees and boosting, and ultimately help to identify the variables in the dataset that best predict where the species is expected, based on the presence/absence data.  (You can view my boosted regression tutorial for more on this analysis approach; a minimal sketch of a run follows below.)
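A BRT run of the kind reported below can be set up with gbm.step() from the dismo package (which wraps ‘gbm’); this is a sketch rather than the exact call I used, and the presence/absence column name and bag fraction are assumptions:

library(dismo)   # gbm.step() wraps the gbm package

# pa_data: one row per presence/pseudo-absence point with the five predictors
gourami.tc3.lr005 <- gbm.step(data = pa_data,
                              gbm.x = c("mean_Precip", "mean_temp", "pop_density",
                                        "NDVI_mean", "flow"),
                              gbm.y = "presence",
                              family = "bernoulli",
                              tree.complexity = 3,
                              learning.rate = 0.005,
                              bag.fraction = 0.5)

gourami.tc3.lr005$contributions   # relative influence of each predictor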

Results:

After building my dataset in Arc and through Google Earth Engine, I was able to produce BRT results in RStudio for several combinations of learning rates and tree complexities, with and without the population data as a variable.  Preliminary analysis indicates that precipitation and temperature contribute the most in determining where giant gourami are expected, based on the data.  Digging into my two hypotheses, I explored the contributions of precipitation and population density to giant gourami distribution.

Precipitation was identified by the BRT models as the strongest contributor in determining the likelihood of finding a giant gourami in my study area.

R output (relative influence of each variable):

  > gourami.tc3.lr005$contributions
            var   rel.inf
    mean_Precip 60.511405
      mean_temp 25.182547
    pop_density 10.984079
      NDVI_mean  1.673674
           flow  1.64829

Exploring the spatial relationship between precipitation trends and my presence/pseudo-absence point data, it is clear that the presence points are clustered in regions with lower rainfall.  As a means to explore this relationship further, it might be helpful to shrink the extent of my study area to a smaller area, more representative of the conditions associated with the presence points.
[Map: precipitation across SE Asia with presence/pseudo-absence points]

As expected, population density did not appear to have an effect on model performance.  The model output below displays results for models with population density included as a variable (left column) and without it (right column).  Model deviance in cross-validation does not change when population density is removed as an explanatory variable.

[Model output for tree complexities 3-4 and learning rate 0.005, with and without population density]

Exploring this relationship visually on a map, this result makes sense, as the point data are distributed across a wide range of population densities.

[Map: population density across SE Asia with point data]

Significance:

The results of this exploratory analysis revealed some interesting patterns in giant gourami distribution, but they are limited in a big way by the point data used.  Points were only present in Malaysia due to availability, so this distribution pattern is possibly more a function of sampling effort than of actual distribution.  If the analysis were limited to Malaysia only, it might provide a better representation of the data.

Understanding the spatio-temporal patterns that govern the range of species like the giant gourami allows resource managers to help meet increasing demands and at the same time mitigate environmental harm.  Protein consumption is expected to increase to 45 kg per capita by 2020, a 25% increase from 1997, and the fish consumption rate is no outlier.  The growing aquaculture industry provides roughly half of the global fish supply (FAO, 2014).  The giant gourami is one of roughly 400 air-breathing fish species, which are ideal candidates for aquaculture.  This prospect presents an opportunity for increased protein production in a changing climate, but it also increases the threat of invasion and outcompetition.

 

Lessons Learned:

In this analysis I learned about integrating Arc, Google Earth Engine, and R for building my dataset, as well as running a boosted regression tree model.  Through the process, I tracked my progress, workflow, and challenges.  This allowed me to identify areas where I could have been more efficient.  For example, I generated my background/pseudo-absence points in Arc and then brought them into Earth Engine, when I could have done the whole process in Earth Engine.

A note on Earth Engine: the reason that I chose to use this platform was its ability to access large global datasets across wide-ranging time frames quickly, without the need to download data in order to manipulate or extract it.  Earth Engine functions through JavaScript, which is a hurdle to overcome for someone new to programming.

My data analysis was done in RStudio, through which I learned to run a BRT model with the ‘gbm’ package, learned some simple histogram plotting functions, and learned how to import and view raster images with the ‘raster’ and ‘rgdal’ packages.

In terms of the statistics that I explored for this analysis, I have listed some of my take home points below:

  • Boosted Regression Tree strengths and limitations
    • Deals with non-parametric data well
    • Is able to deal with small sample sizes
    • Pseudo-absence points present issues in interpreting results since they are not true absence points
    • Sample bias is potentially a problem
    • Is not a spatially explicit analysis so issues like spatial autocorrelation are not dealt with
  • Hot Spot Analysis (conducted on a different dataset)
    • A quick and easy way to identify patterns at different scales for data points that are independent of the researcher and span a large temporal and spatial range.  
    • could be an approach to hypothesis generation.  

 

Research Question:  What is the correlation between the location of leatherback sea turtles, sea surface temperature, and chlorophyll?

Data:

Leatherback sea turtle locations: The leatherback turtle dataset was obtained from http://seamap.env.duke.edu/dataset.  The dataset was in point feature format, and its extent is within the Gulf of Mexico.  Observations of each sea turtle were collected from January through December 2005.


Figure 1: Sea turtle locations in the Gulf of Mexico, derived from http://seamap.env.duke.edu/

Sea Surface Temperature: The Sea Surface Temperature (SST) dataset was obtained from NOAA.  The dataset was downloaded from the NOAA website in .csv format, with latitude and longitude coordinates and a temperature value associated with each coordinate.  The extent of the dataset was the Gulf of Mexico.  The SST data were the average daily maximums for January and December 2005.  The data ranged from 2.44 to 26.89 Celsius; the mean was 23.57 Celsius.


Figure 2: Sea Surface Temperature (Celsius) data for January 2005 within the Gulf of Mexico, derived from NOAA.

Chlorophyll-a: The chlorophyll data were obtained from NOAA.  The dataset was downloaded from the NOAA website in .csv format, with latitude and longitude coordinates and a chlorophyll value associated with each coordinate.  The extent of the dataset was the Gulf of Mexico.  The chlorophyll data were a 3-day composite from December 2005.  The data ranged from 0 to 1.63 mg/m3; the mean was 0.29 mg/m3.


Figure 3: Chlorophyll-a(mg/m^3) data for January 2005 within the Gulf of Mexico, derived from NOAA.

Hypothesis: I expected leatherback sea turtles to be clustered in areas based on high jellyfish concentrations.  Jellyfish tend to be located in areas that have higher chlorophyll-a concentrations and where sea surface temperature is low.  Thus sea turtle locations should correlate well with higher values of chlorophyll-a and lower values of SST.

Analysis Approaches:  To test my hypothesis, I utilized the hotspot analysis, spatial autocorrelation (Moran’s I), geographically weighted regression, and kernel density tools in ArcGIS.  In addition, I utilized the R statistical package to create a graph correlating chlorophyll with SST.

a.  Hotspot Analysis: The spatial distribution of the sea turtle locations appeared to be clustered toward the middle of the Gulf of Mexico.  The hotspot analysis tool helps to identify where statistically significant hotspots or clusters of sea turtles are located within the Gulf of Mexico.

b.  Spatial Autocorrelation (Moran’s I): This tool measures spatial autocorrelation using feature locations and feature values simultaneously. The Moran’s I index will be a value between -1 and 1: positive spatial autocorrelation indicates values that are clustered, negative autocorrelation indicates dispersion, and a value close to zero indicates randomness. The tool generates a z-score and p-value, which help evaluate the significance of the Moran’s index. I tested the spatial autocorrelation of chlorophyll and sea surface temperature at each feature location.  The conceptualization of spatial relationships used was inverse distance, and Euclidean distance was used as the distance method.  I selected a 500 km distance band (smaller distances were too small for the study site).
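An equivalent check can be scripted in R with the spdep package; this sketch uses a row-standardized 500 km distance band rather than ArcGIS’s inverse-distance weighting, and assumes projected coordinates in meters and hypothetical column names:

library(spdep)

# turtles: SpatialPointsDataFrame of turtle locations with SST sampled at each point
nb <- dnearneigh(coordinates(turtles), d1 = 0, d2 = 500000)  # 500 km band
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)
moran.test(turtles$sst, lw, zero.policy = TRUE)              # Moran's I, z-score, and p-value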

c.  Ordinary Least Squares: This tool performs a global linear regression to “generate predictions or model a dependent variable in terms of its relationships to a set of explanatory variables.”  Before conducting this test, I sampled the SST and Chl-a values at each of the feature locations (sea turtle locations) using the Extract Multi Values to Points tool.  This tool “extracts cell values at locations specified in a point feature class from one or more rasters, and records the values to the attribute table of the point feature class.”   The model was run three separate times, adding more explanatory variables each time.  Each OLS run used chlorophyll as the dependent variable.  The first run used SST as the explanatory variable; the second used SST and depth (m); and the third used SST, depth, and turtle count.

d.  Geographically Weighted Regression: Based on the observations and results of the OLS analysis (the data being nonstationary), I decided to conduct a Geographically Weighted Regression analysis.  This tool performs a local form of linear regression used to model spatially varying relationships.  The dependent variable used for this tool was chlorophyll and the explanatory variable was SST.
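The GWR itself was run in ArcGIS; a comparable local regression in R with the spgwr package would look roughly like this (object and column names are hypothetical):

library(spgwr)

# turtles: SpatialPointsDataFrame with chl (dependent) and sst (explanatory) columns
bw <- gwr.sel(chl ~ sst, data = turtles)                          # cross-validated bandwidth
gwr_fit <- gwr(chl ~ sst, data = turtles, bandwidth = bw, hatmatrix = TRUE)
summary(gwr_fit$SDF$localR2)   # local R-squared values to map
summary(gwr_fit$SDF$sst)       # local SST coefficients to map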

Results:

Hotspot Analysis:  The results of the hotspot analysis (shown below) suggest that the turtle locations are significantly clustered off the coast of Louisiana and Texas, in water depths of approximately 2,000 to 3,000 m.  However, the results of this analysis appear to be somewhat deceptive: upon taking measurements of turtles in the hotspot cluster, it appears that they may be more dispersed.  Further analysis is needed to determine further patterns in sea turtle locations.


Figure 4: Results of the hotspot analysis for leatherback sea turtle locations

Spatial Autocorrelation (Moran’s I) – sea surface temperature:  The results of the Spatial Autocorrelation tool suggest that the pattern of sea surface temperature at the feature locations is clustered.  The Moran’s index was 0.402514, the z-score was 2.608211, and the p-value was 0.009102.   Since the critical value (z-score) was greater than 2.58, there is a less than 1-percent likelihood that the clustered pattern is a result of random chance.


Figure 5: Sea surface temperature results for Moran’s I tool.

Spatial Autocorrelation (Moran’s I) – Chlorophyll-a:  The results of the Spatial Autocorrelation tool suggest that the pattern of chlorophyll at the feature locations is clustered.  The Moran’s index was 0.346961, the z-score was 2.216243, and the p-value was 0.026675.  The critical value (z-score) was less than 2.58 but greater than 1.96, suggesting that there is a less than 5-percent likelihood that the clustered pattern is a result of random chance.


Figure 6:  Results of the spatial autocorrelation Moran’s I for chlorophyll-a at the leatherback sea turtle locations.

Ordinary Least Squares:  As described above, this model was run three separate times with chlorophyll as the dependent variable, adding explanatory variables each time: SST alone, then SST and depth (m), then SST, depth, and turtle count.

Model 1:

Model Structure:  Chl-a = f(SST)

Model Results:

a) Overall r2: 0.449473

b) Coefficient on SST in model: -0.052654

The results suggest that Chl is negatively related to SST and given the p-value of 0.000, we can deduce that SST is a significant predictor of Chlorophyll-a.

Model 2:

Model Structure:  Chl-a = f(SST, Depth)

Model Results:

  a) Overall r2: 0.452436
  b) Coefficient on SST: -0.051769
  c) Coefficient on depth: -0.000015

The results suggest that SST and Depth are negatively correlated with Chlorophyll.  Given the p-value of 0.056397 for depth, this is not a statistically significant relationship.  The p-value for SST is 0.0000 suggesting that it is a statistically significant relationship and is a better predictor of chlorophyll-a than depth.

Model 3:

Model Structure:  Chl-a = f(SST, Depth, Count)

Model Results:

  a) Overall r2: 0.460299
  b) Coefficient on SST: -0.050454
  c) Coefficient on Depth: -0.000014
  d) Coefficient on Count: 0.121644

The results suggest that SST and depth are negatively correlated with chlorophyll, and the count is positively correlated.  Given the p-value of 0.067482 for depth, this is not a statistically significant relationship.  The p-value for Count is 0.001815, suggesting that the relationship is statistically significant.  The p-value for SST is 0.0000, suggesting that it too is a statistically significant relationship.  In this model we see that the number of turtles found at each location and the SST values have statistically significant relationships to chlorophyll.  SST has the lowest p-value, which suggests that it is the best indicator for chlorophyll, though we should not discount the count variable.

Overall results: Running the model using three explanatory variables provided the best overall R-squared value, 0.46.  Because the Koenker (BP) statistic was statistically significant, I used the Joint Wald statistic to assess overall model significance.  The Joint Wald statistic was significant, as shown below:


Figure 7: OLS model 3 – Joint Wald Statistic

The Koenker statistic is used to assess model stationarity.  This statistic revealed that the model was not stationary in geographic space and/or data space, since it was significant with a p-value < 0.05, as shown below:


Figure 8:  OLS model 3 – Koenker BP Statistic

Since the Koenker statistic was significant, it was appropriate to look at the robust probabilities of each variable to assess their effectiveness.  The robust probability scores reveal that count and SST are statistically significant (as shown below), so they are important to the regression model.  The model also appears to be a good candidate for Geographically Weighted Regression due to its nonstationarity.


Figure 9: OLS model 3 – Variable Statistics

 

Geographically Weighted Regression:

Model Structure:  Chl-a = f(SST, Depth, Count)

Model Result

a) spatial pattern of r2 values (map)


Figure 10: Geographically Weighted Regression Analysis: Map of the Local R-Squared values

After conducting the GWR analysis using chlorophyll as the dependent variable and sea surface temperature as the explanatory variable, I mapped the local R-squared values at each feature location to show where the model predicted well and where it predicted poorly.  The map shows that predictions are best where turtle locations fall in areas of cooler sea surface temperature off the Texas coastline.

b) Spatial pattern of coefficients for SST


Figure 11:   Geographically Weighted Regression Analysis: Map of the coefficients

I mapped the coefficients in order to understand regional variation in the model.  When using GWR to model chlorophyll (the dependent variable), I was interested in understanding the factors that contribute to the turtle locations (or to chlorophyll at each of the turtle locations).  I was also interested in examining the stationarity of the data, given that the OLS model revealed it was not stationary.  To do this, I mapped the coefficient distribution as a surface to show where and how much variation was present.  As shown in the map above, sea surface temperature has a negative relationship with chlorophyll.  The range of the coefficients is -0.066176 to -0.64989.  There is very little variation in the coefficients.  The results of this test help to inform policies at a regional scale.

Significance: What did you learn from your results?  How are these results important to science? to resource managers?

The results of the tests suggest that chlorophyll-a has a negative relationship with sea surface temperature in the Gulf of Mexico, based on leatherback sea turtle locations.  Sea turtles are located in areas where sea surface temperature is low and chlorophyll values are higher.  After mapping the coefficients of the variables, we see that there is little variation in the data, suggesting that policies regarding the protection of leatherback sea turtles should extend throughout the entirety of the Gulf of Mexico rather than being limited to a few selected areas.

Learning Outcomes:

Programming in R:

With the help of Mark Roberts, I was able to use R to load the ggplot2 package and plot the correlation between chlorophyll-a and sea surface temperature. Lots of room for improvement here!  The following is the code used to make the plot, along with the resulting plot:

GGplot_code

Figure 12: ggplot code used in R to plot the correlation between chlorophyll-a and sea surface temperature.
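Since the code above is embedded as an image, here is a minimal reconstruction of what a plot like Figure 13 typically looks like in ggplot2; the data frame and column names (turtle_pts, sst, chl) are assumptions rather than the exact code shown in Figure 12.

library(ggplot2)

# Hypothetical data frame with one row per turtle location
ggplot(turtle_pts, aes(x = sst, y = chl)) +
  geom_point() +                      # chlorophyll-a plotted against SST
  geom_smooth(method = "lm") +        # linear trend with a confidence band
  labs(x = "Sea surface temperature", y = "Chlorophyll-a")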

 

Chlorophyll_vs_SST_plot

Figure 13: Correlation between chlorophyll-a and sea surface temperature.

 

What did you learn about spatial statistics:

I think scale is an important factor.  You need to understand your data and not take the results at face value; investigate everything to make sure that the data and results make sense.  If the results appear incorrect, they probably are incorrect.  I also learned that it is dangerous to conduct spatial statistical analyses on data unless you can interpret what makes sense.  The spatial autocorrelation analysis conducted on SST and chlorophyll-a indicated that they were clustered, but I was still confused about this outcome.  While conducting research it is important to think about the scale of the data to ensure that all of the data line up.  I encountered an issue with the SST data due to the way it was sampled: many sea turtle locations did not have sea surface temperature data associated with them.  A classmate pointed out that I did not consider the effects of currents on jellyfish.  This could be a significant factor in understanding why sea turtles are present in the Gulf of Mexico, since jellyfish are partially distributed by currents.  Another lesson learned is that it is important to remember the basic principles of geographic information systems while conducting these analyses.  For instance, it is important to turn on extensions and to be sure that all the data are properly projected prior to any analysis, as this can and will throw off the results. Overall, it is important to know your data.

Tim Sheehan

GEO 584: Advanced Spatial Statistics and GIScience

Final Blog Post

Introduction

The MC2 dynamic global vegetation model

Dynamic global vegetation models (DGVMs) model vegetation dynamics over regional to global scales. DGVMs are process-based models, meaning they model the physical and chemical processes that drive vegetation growth. MC2 (Bachelet et al., 2015) is one such model. MC2 models processes on a regular latitude longitude grid using a monthly time step. Each grid cell is treated independently. That is to say, processes in one cell do not affect processes in any other cell. MC2 inputs include soil characteristics, elevation, and monthly means for minimum temperature, maximum temperature, precipitation, and dew point temperature.

Unlike many DGVMs, MC2 includes a fire module that models the effects of fire on vegetation. The current version of MC2 simulates a fire whenever fuel conditions exceed a defined threshold. In other words, ignitions are assumed. This has the potential to overestimate fire occurrence in areas where the fuel threshold is frequently exceeded, to underestimate fire occurrence in areas where the threshold is never or rarely exceeded, and to model only severe fires. Other aspects of model results, especially vegetation occurrence, age, and density as well as carbon dynamics, are driven in part by fire, so correctly modeling fire is important to correctly modeling other processes.

In an effort to make MC2 fire modeling more realistic, I have implemented a stochastic ignitions algorithm in the MC2 fire model. Rather than having a single fuel threshold that determines fire occurrence, the stochastic model uses a probabilistic method (Monte Carlo) to determine if an ignition source is present in a model grid cell. If an ignition source is present, then a second probabilistic method, based on fuel conditions, is used to determine whether the ignition leads to a fire occurrence. If a fire occurs, it is modeled in the same way fires are modeled with assumed ignitions.
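The two-stage logic can be illustrated with a toy R function. This is only a sketch of the idea described above, not MC2 source code; the probability values and the fuel scaling are placeholders.

# Toy illustration of two-stage stochastic ignition (not the actual MC2 implementation)
simulate_cell_fire <- function(fuel_condition, p_ignition = 0.05) {
  # Stage 1: Monte Carlo draw for the presence of an ignition source in the cell
  if (runif(1) >= p_ignition) return(FALSE)
  # Stage 2: fuel-dependent probability that the ignition becomes a fire
  # (placeholder: fuel_condition is assumed to be scaled between 0 and 1)
  p_fire <- min(max(fuel_condition, 0), 1)
  runif(1) < p_fire          # TRUE means a fire is simulated for this cell and month
}

simulate_cell_fire(fuel_condition = 0.6)   # example: a cell with moderately dry fuels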

A fire regime is the pattern, frequency, and intensity of wildfire in an area and is inextricably linked to an area’s ecology. Despite the multiple factors that define a fire regime, aspects of DGVM fire results, including those from MC2, are commonly presented in terms of individual metrics over time, for example an increase in fire frequency in the 21st century versus the 20th century (e.g. Sheehan et al., 2015).

EPA level III ecoregions

EPA ecoregions (https://archive.epa.gov/wed/ecoregions/web/html/ecoregions-2.html) are geographic regions for North America based on similarities of biotic and abiotic phenomena affecting ecosystem integrity and quality. EPA ecoregions are defined at 4 levels (I through IV), with higher levels covering smaller areas. This study uses level III ecoregions, which range in area from approximately 24,000 to 100,000 km2.

Research questions and hypotheses

Research questions for this study are:

  • How consistent are fire regimes within ecoregions?
  • How do fire regimes change under projected climate?
  • Could fire regimes produced by MC2 be predicted by a statistical model?

To gain insight into these questions, I tested the following hypotheses:

  1. Fire-free cells will decrease in the 21st century compared to the 20th due to increasing temperatures.
  2. Fire regimes will group within individual ecoregions over the 20th century, but less so in the 21st century due to changes in fire regime as a result of projected changes in climate between the two centuries.
  3. Logistic regression will be able to predict the fire regimes produced by MC2 due to embedded statistical relationships between fire and input variables in the model.
  4. Logistic regression performed on an ecoregional basis will have greater predictive power than one performed over the entire region, due to local geographical relationships within each ecoregion.

Methods

Study area, resolution, and data

The spatial domain for this study covers the Pacific Northwest (PNW) (Fig. 1), defined as the portion of the conterminous United States north of 42° latitude and west of -111° longitude. The area was mapped to a 2.5 arc minute (approximately 4 km x 4 km) latitude/longitude grid. Modeling was done over the period 1895-2100. I also repeated the analysis using only the Blue Mountain Ecoregion (Fig. 1C), located within the PNW.

SheehanFinalBlogFig1

Figure 1: A) Index map of PNW, B) digital elevation model (DEM) of PNW, and C) EPA Level III ecoregions.

Inputs for MC2 include static soil and elevation data as well as monthly climate data. Climate data consist of minimum temperature, maximum temperature, precipitation, and dew point temperature. Inputs for 1895-2010 were from the PRISM (http://www.prism.oregonstate.edu/) 4 km dataset. This dataset is created using interpolation of observed data. Inputs for 2011-2100 came from the Coupled Model Intercomparison Project 5 (CMIP 5, http://cmip-pcmdi.llnl.gov/cmip5/) Community Climate System Model 4 (CCSM4, http://www.cesm.ucar.edu/models/ccsm4.0/) run using the Representative Concentration Pathway (RCP) 8.5 (https://en.wikipedia.org/wiki/Representative_Concentration_Pathways) CO2 concentration data. RCP 8.5 is based on a “business as usual” scenario of CO2 production through the end of the 21st century. 2011-2100 climate data was downscaled to the 2.5 arc minute resolution using the Multivariate Adaptive Constructed Analogs method (MACA; Abatzoglou and Brown, 2012; http://maca.northwestknowledge.net/). MC2 outputs include annual values for vegetation type, carbon pools and fluxes, hydrological data, and fire occurrence data including fraction of grid cell burned and carbon consumed by fire. MC2 input and output data were divided into two datasets based on time period: 20th century (1901-2000) and 21st century (2001-2100).

k-means cluster analysis

For this study, fire regime was defined as a combination of mean annual carbon consumed by fire, the mean annual fraction of cell burned, and the fire return interval (FRI) over the century. (Note that FRI is usually calculated over periods longer than a century, especially in ecosystems rarely exposed to fire). For cluster analysis, the two datasets were combined, and normalized to avoid uneven influences caused by differing data value ranges. I used K-means clustering in R to produce 4 fire regime clusters (details in my blog post on K-means clustering in R: http://blogs.oregonstate.edu/geo599spatialstatistics/2016/05/21/cluster-analysis-r/).
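A condensed sketch of that clustering step is shown below (the full workflow is in the linked blog post). The data frame name (fire_df) and its column names are stand-ins for the actual MC2 output.

# 'fire_df' is a hypothetical data frame with one row per grid cell per century and
# columns for mean annual carbon consumed by fire, mean annual fraction burned, and FRI
fire_scaled <- scale(fire_df[, c("carbon_consumed", "frac_burned", "fri")])  # normalize

set.seed(1)                                   # k-means depends on random starting centers
km <- kmeans(fire_scaled, centers = 4, nstart = 25)
fire_df$regime <- km$cluster                  # fire regime cluster label for each cell
km$centers                                    # normalized cluster centers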

Logistic regression

I did stepwise logistic regression modeling for each of the fire regime clusters (details in my blog post on logistic regression in R: http://blogs.oregonstate.edu/geo599spatialstatistics/2016/05/27/logistic-regression-r/). Explanatory values were taken from inputs to the MC2 model runs (Fig. 2). Climate variables were summarized annually, by “summer” (April-September), and by “winter” (October-January). The mean of these was then taken over each century. Additional explanatory variables were soil depth and elevation. For each cluster, presence was membership in the cluster and absence was non-membership. The process resulted in four logistic functions, one for each fire regime cluster.
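The per-cluster regressions follow a standard glm-plus-step pattern in R; a minimal sketch is below, with illustrative variable names (the actual explanatory variables are those listed in Table 3).

# 'cells' is a hypothetical data frame of explanatory variables plus a fire regime
# label for each grid cell; presence/absence is defined relative to one cluster
cells$in_cluster <- as.integer(cells$regime == 1)

full_model <- glm(in_cluster ~ soil_depth + elevation + tmax_ann + ppt_summer + tdew_win,
                  data = cells, family = binomial)
step_model <- step(full_model, direction = "both")   # stepwise variable selection
summary(step_model)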

Tournament cluster prediction

I used a tournament method to predict the fire regime cluster. For each grid cell, each cluster’s logistic regression model was used to calculate the probability that the grid cell would be in that cluster. The cluster with the highest membership probability was chosen as the predicted cluster. I constructed confusion matrices and contingency tables of predicted fire regimes versus actual fire regimes. I also produced maps showing the predicted fire regime cluster, the actual fire regime cluster, and the Euclidean distance between cluster centers for the predicted and actual fire regime clusters.
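A sketch of the tournament step in R, assuming the four fitted logistic regressions are stored in a list; the object names are hypothetical.

# 'cluster_models' is a hypothetical list of the four fitted logistic regressions and
# 'cells' the data frame of explanatory variables with the actual regime label attached
probs <- sapply(cluster_models, predict, newdata = cells, type = "response")
cells$predicted_regime <- max.col(probs)      # tournament: cluster with the highest probability

# Confusion matrix of predicted versus actual fire regime clusters
table(predicted = cells$predicted_regime, actual = cells$regime)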

Results

Changes in input variables over time

Climate models consistently project a warming climate (Sheehan et al., 2015) in the Northwest during the 21st century, but precipitation changes vary across models. The results of the CCSM4 run with the RCP 8.5 CO2 values indicate that temperature warms approximately 2 to 4 °C over the entire region, that precipitation is relatively stable with local increases, and that mean dew point temperature is stable to increasing by up to approximately 3 °C (Fig. 2).

SheehanFinalBlogFig2

Fig. 2: Climate variables and their differences by century: A) 20th century maximum temperature; B) 21st century maximum temperature; C) change in maximum temperature between 20th and 21st centuries; D) 20th century precipitation; E) 21st century precipitation; F) change in precipitation between 20th and 21st centuries; G) 20th century dew point temperature; H) 21st century dew point temperature; and I) change in dew point temperature between 20th and 21st centuries.

Cluster analysis

The four clusters yielded by the analysis (Table 1) (details and graphs in my blog post on K-means clustering in R) can be described in terms of the factors used to generate them. These are: 1) low carbon consumed, medium fraction burned, low FRI; 2) high carbon consumed, medium fraction burned, medium FRI; 3) low carbon consumed, low fraction burned, medium FRI; 4) no fire. Euclidean distances between cluster centers are listed in Table 2.

Table 1: Non-normalized values for centers of clusters from cluster analysis. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl1

Table 2: Euclidean distances between cluster centers (normalized values). (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl2

Overall, there were more fire-free cells in the 20th century (31%) than in the 21st (18%). This is readily apparent in the Cascade and Rocky Mountain areas (Fig. 3). Fire regimes are fairly consistent in some level III ecoregions and less so in others (Fig. 3). In some cases there is greater consistency in the 20th century (e.g. Coast Range, Cascades, and North Cascades), while others show more consistency in the 21st century (e.g. Northern Basin and Range and Columbia Plateau). Still others are dominated by more than one fire regime to the exclusion of the rest (e.g. Willamette Valley). Overall, coverage by the no-fire regime decreases while coverage by the high carbon consumed, medium fraction burned, medium FRI regime increases, especially in the Eastern Cascades Slopes and Foothills, Northern Rockies, and Blue Mountains ecoregions.

SheehanFinalBlogFig3

Fig. 3: Fire regime coverage determined by cluster analysis with EPA Level III Ecoregion overlay for A) the 20th century; B) the 21st century; C) Euclidean distance between cluster centers for time periods. (C: Carbon; Med: Medium; Frac: Fraction)

Logistic regression and tournament fire regime prediction

Of the fourteen explanatory variables used in the logistic regression functions, every variable was used in at least one regression, and soil depth was the only variable used in all regressions (Table 3).

Table 3: Explanatory variables used in the logistic regression functions. (D: Depth; Max: Maximum; T: Temperature; Min: Minimum; Ppt: Precipitation)

SheehanFinalBlogTbl3

The confusion matrix (Table 4) and marginal proportions table (Table 5) for the predictions show that predictions were correct in 62% of grid cells, and that for each fire regime cluster, the correct value was predicted more than any incorrect value. Maps for 20th and 21st century predicted and actual fire regime clusters (Fig. 4, A-B, E-F) show that predictions align fairly well with actual fire regime clusters. Maps of the distances between the centers of the predicted and actual fire regime clusters (Fig. 4) show that when clusters differ, they generally differ by a small amount. The mean Euclidean distance between predicted and actual clusters is 0.21 for the 20th century and 0.18 for the 21st, far less than the minimum distance between any two cluster centers, 0.33.

Table 4: Confusion matrix for predicted and actual fire regime clusters for combined 20th and 21st centuries. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl4

Table 5: Marginal proportions table for predicted and actual fire regime clusters for combined 20th and 21st centuries. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl5

SheehanFinalBlogFig4

Fig. 4: Predicted and actual fire regime clusters and Euclidean distances between predicted and actual cluster centers: A) predicted fire regime clusters for the 20th century; B) actual fire regime clusters for the 20th century; C) Euclidean distances between predicted and actual cluster centers for the 20th century; D) predicted fire regime clusters for the 21st century; E) actual fire regime clusters for the 21st century; and F) Euclidean distances between predicted and actual cluster centers for the 21st century. (C: Carbon; Med: Medium; Frac: Fraction)

Geographical influences

Within the Blue Mountain Ecoregion, a logistic regression could not be performed for the no-fire fire regime due to its low occurrence, so no predictions were made for it. Logistic regressions were fit for the other fire regimes, and the analysis was repeated for the ecoregion. Results (Tables 6 and 7) show that predictions using the regressions developed for the Blue Mountain Ecoregion were more accurate (66% correct classification) than predictions for the ecoregion using the logistic regressions for the entire PNW (Tables 8 and 9) (63% correct classification).

Table 6: Confusion matrix for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the ecoregion. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl6

Table 7: Marginal proportions table for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the ecoregion. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl7_copy

Table 8: Confusion matrix for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the entire PNW. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl8

Table 9: Marginal proportions table for predicted and actual fire regime clusters within the Blue Mountain Ecoregion using the logistic regressions developed for the entire PNW. (C: Carbon; Med: Medium; Frac: Fraction)

SheehanFinalBlogTbl4

Discussion

Results and hypotheses

Fire-free cells decreased over the region even while precipitation and dew point temperature increased. It is highly likely that rising temperatures were ultimately behind this, but analysis to prove this more forcefully is beyond the scope of this study. That said, I would suggest that hypothesis 1 (Fire-free cells will decrease in the 21st century compared to the 20th due to increasing temperatures) is strongly supported.

Given the strong influence of fire on ecosystem characteristics, and a warming climate expanding the area where fire is likely, I expected fire regimes to change, especially along the edges between the lower-elevation, flatter ecoregions and the hillier, higher-elevation ecoregions surrounding them (for example, the Columbia Plateau and Northern Rockies ecoregions). While this expected expansion took place in some places, the opposite effect took place within other ecoregions. The relationships between fire and ecoregions seem to hold or become stronger in as many cases as they become weaker. Thus hypothesis 2 (Fire regimes will group within individual ecoregions over the 20th century, but less so in the 21st century due to projected changes in climate between the two centuries) is not supported.

The use of logistic regression with a tournament algorithm yielded a statistical model with very strong predictive power. The high rate of correct prediction and the relatively small distances between mis-predicted and actual fire regime clusters indicate that statistical relationships are embedded within the process-based MC2 and that logistic regression can be used to project fire regime clusters. Hypothesis 3 is supported.

The predictive power of the logistic regression developed for the Blue Mountain Ecoregion was somewhat better than that of the regression for the entire area. One should not base a conclusion on a sample of one, but this does provide some support for hypothesis 4 (Logistic regression performed on an ecoregional basis will have greater predictive power than one performed over the entire region due to local geographical relationships within ecoregions).

Broader significance

Fire is projected to occur in areas where it has not been present in the past. The relationship between the cohesiveness of a fire regime and its ecoregion varies from ecoregion to ecoregion. As the climate warms and fire regimes change, the characteristics of some areas will likely change as well, and some ecotones will likely shift as a result. The implication is that the boundaries of some ecoregions will likely shift, and land managers should consider the potential effects of not only the introduction of fire into previously fire-free areas, but also shifts in ecotones.

There is a case to be made for the use of both statistical and process-based vegetation models. Process-based models not only produce predictive results, but also provide opportunities to find non-linear responses (so-called tipping points) as well as novel combinations of vegetation types, fire, and weather/climate. Statistical models, on the other hand, are much easier to develop and to execute. While a statistical model might not capture as detailed a relationship between predictive and resulting variables, it can often provide more timely results. My results indicate that it might be possible to substitute a statistical model for a process-based vegetation model under certain circumstances.

Statistical analyses are commonly used to evaluate climate models by comparing hindcasted model results to observed data. In my experience, vegetation modelers struggle to validate hindcasted results, often using simple side-by-side comparisons of observational and model data. The results from this study point to a possible way to deepen such evaluations. Researchers could develop one set of statistical models using actual historical observations for vegetation type, fire, etc., and another set using modeled data for the same time period. By comparing the two sets of models, they could evaluate the similarities and differences of the explanatory variables’ influences to gain insight into how well the model is emulating natural processes.

Future work

This study can best be described as a pilot study for combining multiple fire-related output variables into fire regimes instead of analyzing them individually. How to best define a fire regime becomes both a philosophical and research question. I will be researching this in the coming weeks and months as I plan to use a refined version of the methods I used in this course for one of the chapters in my dissertation. I have not come across the use of cluster analysis to characterize vegetation model fire regimes, so I am hoping it will be considered a novel approach.

The effects of the stochastic ignitions algorithm within the model cannot be ignored when doing a cluster analysis. Ideally, the model would be run long enough for stochastic effects to even out over model time. Another approach is to run many (hundreds to thousands of) iterations of the model and average the results, but time and resource constraints make this impractical. Another possible method would be to use a spatial weighting kernel to smooth results over the study area. This method would seem justified given the tendency of fire regimes to cluster geographically. This will be another avenue I investigate.

Finally, the results from the Blue Mountain Ecoregion speak to the potential for geographically weighted regression. I will also do more research on that method.

My Learning

Cluster analysis

About a year ago I was first exposed to cluster analysis. This course gave me a great venue in which to try it on my own. Going through the process taught me the method and some of the pitfalls. I did not normalize the values on my first pass using cluster analysis, and I learned that this can be very important to get meaningful results.

Logistic regression and tournament

The logistic regression portion of this study reintroduced me to a technique I had used before, but did not understand well. As I worked through the logistic regression, I gained a better understanding of the process and how it worked. Putting together the logistic regression blog post made me go through the steps in a deliberate fashion. Doing the 4 by 4 confusion matrix helped me to get a better understanding of what a confusion matrix represents as well as what recall and precision mean in that context. While I did not have a chance to implement geographically weighted logistic regression, I at least gained an understanding of what it means. Implementing a tournament approach was not new to me; I have used a similar technique in genetic algorithms, but it was good to confirm the potential power of such a simple method.

Other packages and methods

Through the term I learned quite a bit about various software packages and tools. I’ve used Arc quite a bit from time to time in the past, but delving into a couple of the spatial statistics tools – hotspot analysis and Moran’s I – reminded me of the power Arc has to offer, as well as some of its quirks. For example my dataset was too large for Arc to handle with Moran’s I. Since I did most of my work in R, I did not have occasion to reacquaint myself with ModelBuilder. I have done a lot of raster processing in Python, but the clipping I had to do to get the Blue Mountain Ecoregion out of the larger dataset forced me to implement a flexible and efficient clipping script that will serve me well in the future.

Since R was the package best suited to my large dataset and had the algorithms I needed, it was the program I learned the most about. I learned how to read and write NetCDF files, do cluster analysis, and do stepwise logistic regression in R. In addition to these specific methods, I learned the differences between R data frames, vectors, and matrices, each of which R handles in a different manner. I have found the nuances tricky in the past and they are tricky still, but I have certainly become more facile with the program.

Statistics

What I have learned about statistics in this class starts with the existence of analytical methods I was not previously aware of. I was unaware of hotspot analysis, but going through the exercise provided me with a good foundational understanding. Spatial autocorrelation is something I had not had formal exposure to, but the basic concept made perfect sense; its importance, though, and the complexity of dealing with it had not occurred to me. I had been exposed to spectral analysis with temporal data, and its use with spatial data makes sense. I had heard of wavelets, but now have a better understanding of the method. Correlograms were a new concept for me, and the presentations on them helped me to understand them. For the most part, the regression methods presented in the class reinforced or expanded on concepts I already knew. The application of a kernel for geographically weighted regression was the technique that stretched my understanding the furthest. I had been exposed to the concept of applying a kernel to a dataset, but for some reason its use in GWR made the concept click for me. Principal components analysis was another method I had heard of but did not understand until it was presented in class, and the concept now makes perfect sense.

Conclusion

Diving into my own project in this class has been a valuable experience, but probably just as valuable has been the interactions with other students and their projects. The broad scope of approaches has exposed me to a wide variety of techniques. Going forward in my research, I will be able to investigate methods I was not aware of in the past.

References

Abatzoglou, J.T., Brown, T.J., 2012. A comparison of statistical downscaling methods suited for wildfire applications. Int. J. Climatol. 32, 772–780.

Bachelet, D., Ferschweiler, K., Sheehan, T.J., Sleeter, B., Zhu, Z., 2015. Projected carbon stocks in the conterminous US with land use and variable fire regimes. Global Change Biol., http://dx.doi.org/10.1111/gcb.13048.

Sheehan, T., Bachelet, D., Ferschweiler, K., 2015. Projected major fire and vegetation changes in the Pacific Northwest of the conterminous United States under selected CMIP5 climate futures. Ecological Modelling, 317, 16-29.

Research Question

Over the duration of this course, my research question has taken on a form different from that presented in my opening blog post, but still equally valuable to my research. Instead of asking how I could bring statistics into a single landslide hazard analysis, I am now asking how statistics may be used to add consistency to the entire landslide mapping process (Figure 1).

Initial_Flowchart

Figure 1: Correspondence of landslide map types and selected inputs.

Mapping landslides can be a complicated task that, along the way, incorporates a sequence of three map types:

  1. Inventory – the mapped extents of past landslides. These extents can simply be one polygon, or they can be several polygons representing the features that compose a landslide; features such as scarps and deposits.
  2. Hazard – mapped predictions of the probability of landslide occurrence or the amount of ground deformation associated with the advent of some natural hazard, such as heavy rainfall or seismic ground motions.
  3. Risk – a mapped combination of landslide hazard and the vulnerability and exposure of infrastructure. Typical risk maps attempt to express the costs incurred by a landslide’s occurrence in a particular location.

In addition to these three types, there is also susceptibility mapping. Susceptibility maps, which show where ground conditions may be conducive to landsliding, are useful for some applications, but they are not necessary in this context.

Inventory, hazard, and risk maps should be viewed as a progression, as each new map is dependent on the content of its predecessor. A lack of geotechnical design parameters (i.e. friction angle, depth to groundwater, soil stratigraphy) at the regional scale requires that landslide inventories be used to back-calculate conditions at failure. These conditions can then be interpolated across the region to improve the inputs to a hazard map. This approach has many imperfections, but on many occasions it represents the most informed analysis available.

Additionally, the hazard map is a primary input for a risk map. A good way to think about the relationship between hazard and risk maps is by answering the age-old question, “If a [landslide occurs] in the woods, does anyone [care]?” The answer is typically no, but on the occasion that the landslide wipes out their driveway, railroad track, or doghouse, the answer becomes YES! A risk map considers whether the infrastructure’s location corresponds with that of high landslide hazard, and sometimes, how much the repair of the damaged infrastructure might cost. For these reasons, risk maps are the ultimate goal for land managers, like the Oregon Department of Transportation. Knowing the costs in advance allows for improved allocation of resources and better budgeting estimates.

Datasets

The datasets used for this course were:

  1. Statewide Landslide Information Database for Oregon (SLIDO) – Points representing the location of historic landslides and polylines representing the extents of landslide scarps
  2. Northwest River Forecast Center Rainfall Data – Weather station points with past rainfall amounts and future rainfall predictions.
  3. Oregon Lidar Consortium Digital Elevation Models (DEM) – 3 foot resolution bare-earth elevation rasters.

All datasets were evaluated for various locations in the Oregon Coast Range.

Hypotheses

The hypotheses related to rainfall-induced landslide mapping are as follows:

  1. Topography and soil strength account for most of a slope’s resistance to failure, but these two factors alone are not responsible for most landslides.
  2. Rainfall is the factor that most often leads to slope failure. A slope is in equilibrium until the addition of pore water pressures from rainfall induces a failure.

These hypotheses must be broken down into more specific hypotheses to address my research question. The specific hypotheses are listed below:

  1. The adequacy of any topographic data for landslide mapping is determined by its resolution and the scale at which it is evaluated.
  2. Different elevation derivatives (i.e. curvature, slope, roughness) are better for identifying specific landslide features. For example, one derivative might be better at identifying landslide deposits, while another might be better at identifying the failure surface.
  3. The intensity and timing of rainfall determines how soil strength is diminished.

Approaches

Each of the three specific hypotheses was evaluated as coursework this quarter, and their roles in the landslide mapping process are shown in Figure 2. Hypothesis one was addressed using Fourier transforms, hypothesis two using principal component analysis (PCA), and hypothesis three using kriging. A hot spot analysis, not tied to any hypothesis, was also performed to identify locations of more costly landslide activity.

Final_Flowchart

Figure 2: Relationship between hypotheses and the landslide mapping process.

Results

Documentation for the implementation and results associated with the hot spot and kriging analyses has been provided in previous blog posts, but the PCA and Fourier transform work will be discussed here.

Principal Component Analysis

The purpose of performing a principal component analysis was to determine which topographic derivatives were most closely associated with the crest of a landslide scarp. The values used as inputs to the PCA were the mean slope, standard deviation of slope, profile curvature, and planform curvature that corresponded with the location of each scarp polyline. Table 1 shows the results of the PCA.

Table 1: Coefficients resulting from principal component analysis.

Principal Component Slope Profile Curvature Standard Deviation of Slope Planform Curvature
1 0.99 0.00 0.00 -0.16
2 0.00 0.80 0.59 0.04
3 0.16 -0.06 0.02 0.98
4 -0.01 -0.59 0.80 -0.05

 

Table 1 shows that the first principal component is strongly correlated with slope, while the second principal component is strongly correlated with profile curvature and the standard deviation of slope. Table 1 was not considered further because it relies on the assumption that the scarp polylines represent the true location of landslide scarps, which was later determined to be unlikely. The PCA results still provide useful information, as the strong correlations of both profile curvature and standard deviation of slope with the second principal component spurred an additional investigation.

Having two variables strongly correlated with the same data implies that the two variables are also correlated with each other. To confirm this understanding, profile curvature and standard deviation of slope were compared (Figure 3). The results show a nearly linear relationship between the two variables. Based on these results, standard deviation of slope was no longer considered in future analyses related to landslide scarps.
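For reference, the PCA and the follow-up correlation check can be sketched in R as follows; the data frame name (scarp_stats) and its columns are placeholders for the summarized derivative values along each scarp polyline.

# 'scarp_stats' is a hypothetical data frame with one row per scarp polyline and the
# four topographic derivatives summarized along it
pca <- prcomp(scarp_stats[, c("slope", "profile_curv", "sd_slope", "planform_curv")],
              center = TRUE, scale. = TRUE)
pca$rotation                          # loadings, analogous to the coefficients in Table 1
summary(pca)                          # variance explained by each component

# Checking the suspected redundancy between the two derivatives
cor(scarp_stats$profile_curv, scarp_stats$sd_slope)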

PCA

Figure 3: Comparison of profile curvature and standard deviation of slope.

Fourier transforms

Fourier transforms use a weighted sum of pairs of sine and cosine functions to represent some finite function. Each paired sine and cosine function has a unique frequency that is plotted against its amplitude to develop what is termed the frequency domain. In common practice, the frequency domain is used to identify dominant frequencies and to remove frequencies associated with noise in the data. In this project, Fourier transforms were used to determine the frequency of topography (a digital elevation model, with results in Figure 4), which in turn provides its period. Knowing the period of topography is a useful way of determining the scale at which it may be identified across an entire landscape.

Frequency_domain

Figure 4: Example of the frequency domain of topography.

The primary shortcoming of this approach is that most topography is dominated by very low frequencies (long periods), meaning that topography is inherently smooth at the broad scale, which makes clear identification of small landslide features impossible. Future work filtering the frequency domain will be necessary before any conclusions may be drawn from this approach.
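A bare-bones version of this frequency-domain check can be run in R on any elevation matrix. The sketch below uses a synthetic stand-in for a lidar DEM tile; the filtering of low-frequency content discussed above is not shown.

# Synthetic stand-in for a clipped lidar DEM tile (a matrix of elevations)
dem <- matrix(rnorm(128 * 128), nrow = 128)

dem <- dem - mean(dem)            # remove the mean so the zero-frequency term does not dominate
spec <- fft(dem)                  # 2-D fast Fourier transform
amp <- Mod(spec)                  # amplitude at each frequency pair
amp[1, 1] <- 0                    # drop any remaining zero-frequency component
which(amp == max(amp), arr.ind = TRUE)   # matrix indices of the dominant frequency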

Significance

The significance of this work has two major aspects:

  1. Land managers and the public benefit from landslide maps because they show the cost of building in certain locations.
  2. The statistical framework can provide a better way to threshold continuous natural data to improve consistency in the implementation of landslide mapping procedures.

The two aspects come together in that the consistent production of landslide maps will yield products that are easier to interpret. Improved interpretation of the maps will hopefully influence future construction and mitigation for existing infrastructure.

Course Learning

The primary research-oriented lessons learned during this course are:

  1. Combinations of software programs are often necessary to efficiently complete a task. While some software may have almost infinite capabilities, the time needed to implement some approaches may favor the use of other software.
  2. Most programming languages have significant libraries of code that are already written. While an individual may have the ability to write code to perform a task, unnecessary time is spent rewriting what someone else has already done. Often, that same time can be used to explore additional opportunities that may lead to innovative ideas.

From these two lessons it should be evident that I value efficiency. The breadth of my problem is great, and what I was able to accomplish during this course is only a small piece of it (see Figure 2). On top of only scratching the surface of the problem, many of my efforts also ended without meaningful results. Despite these failures, several approaches that I first thought would not apply to my work surprised me. For this reason my greatest lesson learned is that it is important to try many different approaches to the same problem; some may work and some may not. Improved efficiency simply makes it possible to implement more analyses.

Statistical Learning

Of the activities performed during this course, the hot spot analysis and kriging were univariate analyses. Based on these two analyses, below are several advantages and limitations to univariate statistical approaches.

  1. Relatively easy to apply (in terms of time required and interpretation of results)
  2. May reveal patterns that are not obvious, which is most evident in the results of my hot spot analysis.
  3. Require large sample sizes, which is also most evident in the results of my hot spot analysis.
  4. Kriging was particularly sensitive to geographic outliers.
  5. Sensitive to the spatial distribution of sampling sites. Geographic biases, such as the selection of only landslides along roadways in my hot spot analysis, may produce deceptive results. I would not trust isolated sampling sites.
  6. One variable is not capable of modeling many processes.

Other statistical approaches performed during this course involved transformations that brought data into unfamiliar states. Both the Fourier frequency domain and principal component loadings are abstract notions that can only be interpreted with specific knowledge.

1 Research Question

Counties reacted differently to the Great Recession (officially from December 2007 to June 2009). Economic resilience is defined as the regional capacity “to absorb and resist shocks as well as to recover from them” (Han and Goetz, 2015). There are two dimensions of economic resilience: resistance and recovery. This research focuses on resistance.

The research question is: what factors are associated with resistance to the 2008 recession at the county level in the US? The variable drop, developed by Han and Goetz (2015), is used to measure resistance (actually vulnerability); it is the amount of impulse that a county experiences from a shock, measured as the percentage deviation of actual employment from expected employment during the Great Recession.

droprebound

Figure 1 Regional economic change from a major shock and concepts of drop and rebound

dropequation

Rising income inequality is considered one structural cause of the crisis (Rajan, 2012). A rising trend in income inequality is observed in Figure 2. Therefore, pre-recession rising income inequality is hypothesized to increase drop (the reciprocal of resistance).

income inequality

Figure 2 Income inequality in the U.S., 1967-2014 Source: Data from U.S. Census Bureau 2014a, 2014b.

There are two values at year 2013 for four of the income inequality measures and the income distribution indicator, because about 30,000 of the 98,000 addresses in the 2014 CPS ASEC received redesigned questions for income and health insurance coverage. The value based on this portion is denoted 2013a, while the remaining 68,000 addresses received income questions similar to those used in the 2013 CPS ASEC; this portion is labeled 2013b (U.S. Census Bureau, 2016a, 2016b).

Spatial autocorrelation is hypothesized to play a role in the relationship. The effect of income inequality or drop may not be limited to a single region but may attenuate with distance. Drop in a county might be affected by its own characteristics as well as those of the surrounding counties (spatial lag and spatial error models).

 

2 Description of the dataset:

Income inequality: the Gini coefficient

Income distribution: poverty rate and the share of aggregate income held by households earning $200,000 or more

Control variables: Population growth rate from 2001-2005, % Black or African American (2000), % Hispanic or Latino (2000)

Capital Stock variables:

Human Capital: % population with Bachelor’s degree or higher, age group (20-29, 30-49), female civilian labor force participation rate (2000)

Natural Capital: natural amenity scale (1999)

Social Capital: social capital index (2005)

Built Capital(2000): median housing value (2000)

Financial Capital (2000): share of dividends, interest and rent(2000)

Economic structure:

Employment share of 20 two-digit NAICS industries (manufacturing, construction, etc.; other services (except public administration) omitted)

 

Table 1 Summary Statistics
Variables Obs Mean Std. Dev. Min Max Description Data source
Drop, rebound and resilience
drop 2,839 0.186 0.116 -0.236 0.895   Wu estimates based on 2003-2014 BLS QCEW
Income Distribution
Gini coefficient 3,057 0.434 0.037 0.333 0.605   Census 2000 P052
Poverty rate 3,055 0.142 0.065 0.000 0.569   Census 2000 P087
% Aggregate income held by HH earning 200K or more 3,141 0.091 0.050 0.000 0.456   Census 2000 P054
Control Variables  
Population growth rate 2001 -2005 3,055 0.020 0.056 -0.203 0.428   BEA 2001-2005 CA5N
%Black or African American 3,055 0.087 0.145 0.000 0.865   Census 2000 SF1 QT-P3
%Hispanic or Latino 3,055 0.063 0.121 0.001 0.975  
Capital stocks  
% persons with Bachelor’s degree or higher 3,055 0.164 0.076 0.049 0.605 Human Capital Census 2000 SF3 P037
%Total: 20 to 29 years 3,055 0.118 0.033 0.030 0.346 Census 2000 P012
%Total: 30 to 49 years 3,055 0.288 0.026 0.158 0.420
%Female Civilian Labor Force Participation 3,055 0.547 0.065 0.266 0.809 Census 2000 SF3 QT-P24
Natural amenity scale 3,055 0.056 2.296 -6.400 11.170 Natural capital 1999 USDA- ERS Natural Amenity Index
Social capital index 2005 3,055 -0.004 1.390 -3.904 14.379 Social capital Rupasingha, Goetz and Freshwater, 2006
Median value  All owner-occupied housing units 3,055 80495.190 41744.080 12500 583500 Built capital Census 2000 H085
Share of dividend, interest and rent 3,110 0.188 0.053 0.059 0.561 Financial capital 2000 BEA CA5
Economic Structure  
Farm employment 3,107 0.090 0.087 0 0.627   BEA 2001 CA25N
Forestry, fishing, and related activities 3,107 0.006 0.016 0 0.232  
Mining, quarrying, and oil and gas extraction 3,107 0.010 0.033 0 0.839  
Utilities 3,107 0.002 0.006 0 0.200  
Construction 3,107 0.058 0.032 0 0.260  
Manufacturing 3,107 0.109 0.089 0 0.558  
Wholesale trade 3,107 0.023 0.019 0 0.239  
Retail trade 3,107 0.109 0.030 0 0.372  
Transportation and warehousing 3,107 0.021 0.023 0 0.262  
Information 3,107 0.011 0.010 0 0.137  
Finance and insurance 3,107 0.027 0.017 0 0.201  
Real estate and rental and leasing 3,107 0.022 0.016 0 0.135  
Professional, scientific, and technical services 3,107 0.025 0.028 0 0.828  
Management of companies and enterprises 3,107 0.003 0.006 0 0.147  
Administrative and support and waste management and remediation services 3,107 0.026 0.024 0 0.192  
Educational services 3,107 0.006 0.010 0 0.112  
Health care and social assistance 3,107 0.052 0.048 0 0.305  
Arts, entertainment, and recreation 3,107 0.012 0.018 0 0.726  
Accommodation and food services 3,107 0.048 0.038 0 0.386  
  3,107 0.054 0.020 0 0.162  
Government and government enterprises 3,107 0.170 0.073 0.026 0.888  
Note: ACS- American Community Survey, BEA- Bureau of Economic Analysis, BLS- Bureau of Labor Statistics, ERS-Economic Research Service, QCEW-Quarterly Census of Employment and Wages.

Analysis units: U.S. counties

3 Hypotheses

  a) Pre-recession income inequality and other demographic, economic, and industrial factors affect drop at the county level; drop (reciprocal of resistance) is positively related to income inequality.
  b) Drop (reciprocal of resistance) is spatially autocorrelated.
  c) Drop (reciprocal of resistance) is related to the characteristics of a county and to the characteristics (here, drop) of neighboring counties.

4 Approaches

  a) Drop (reciprocal of resistance) is positively related to income inequality and to other characteristics of a county.

Ordinary least squares regression

  b) Drop (reciprocal of resistance) is spatially autocorrelated.

Global Moran’s I and Anselin Local Moran’s I

  c) Drop (reciprocal of resistance) is related to drop in neighboring counties.

Spatial lag model in GeoDa

5 Results

a) OLS regression

From the table below, we can see that the Gini coefficient, poverty rate, and the share of aggregate income held by households earning $200,000 or more are not significant predictors of drop. Therefore, pre-recession income inequality and income distribution are not correlated with drop (reciprocal of resistance).

Besides income inequality, there are several other significant variables, for example, population growth rate from 2001-2005, % Black or African American, % Hispanic or Latino, % pop with Bachelor’s degree or higher, natural amenity scale, employment share of industries like manufacturing, wholesale trade, retail trade, transportation and warehousing, information, finance and insurance, educational services, health care and social assistance, accommodation and food services, government and government enterprises.

Table 2 Full model of drop  
Drop Full Model
Gini coefficient in 2000 -0.322
  (-1.60)
Poverty rate 0.153
  (1.69)
% Aggregate income held by HH earning  200K or more 0.166
  (1.48)
Population growth rate 2001 -2005 0.205***
  (3.61)
%Black or African American 0.0542*
  (2.33)
%Hispanic or Latino -0.113***
  (-5.16)
% persons with Bachelor’s degree or higher -0.132*
  (-2.09)
%Total: 20 to 29 years -0.0218
  (-0.21)
%Total: 30 to 49 years -0.0731
  (-0.58)
%Female Civilian Labor Force Participation -0.0260
  (-0.37)
Natural amenity scale 0.00728***
  (6.13)
Social capital index 2005 -0.00386
  (-1.32)
Median value  All owner-occupied housing units -4.13e-08
  (-0.47)
Share of dividend, interest and rent 0.0632
  (0.99)
Farm employment -0.0183
  (-0.29)
Forestry, fishing, and related activities 0.163
  (0.99)
Mining, quarrying, and oil and gas extraction -0.0548
  (-0.49)
Utilities 0.0237
  (0.06)
Construction 0.0631
  (0.64)
Manufacturing -0.112*
  (-2.32)
Wholesale trade -0.580***
  (-4.58)
Retail trade -0.620***
  (-5.99)
Transportation and warehousing -0.337***
  (-3.47)
Information -0.590**
  (-2.62)
Finance and insurance -0.710***
  (-5.04)
Real estate and rental and leasing 0.203
  (0.87)
Professional, scientific, and technical services 0.0355
  (0.15)
Management of companies and enterprises -0.207
  (-0.53)
Administrative and support and waste management and remediation services -0.0719
(-0.59)
Educational services -0.894***
  (-4.85)
Health care and social assistance -0.215***
  (-4.26)
Arts, entertainment, and recreation -0.00162
  (-0.01)
Accommodation and food services -0.244**
  (-2.92)
Government and government enterprises -0.231***
  (-4.12)
_cons 0.527***
  (5.19)
N 2771
adj. R2 0.199
Log likelihood 2383.06
AICc -4696.13
Schwarz criterion -4488.7
Jarque Bera ***Non-normality
Breusch-Pagan test ***Non-stationary
White Test *** heteroscedasticity
For weight Matrix Queen Contiguity 1st order  
Moran’s I (error) 10.5178***
Lagrange Multiplier (lag) 506.9370***
Robust LM (lag) 19.9202***
Lagrange Multiplier (error) 1.#INF ***
Robust LM (error) 1.#INF ***
Lagrange Multiplier (SARMA) 1.#INF ***
For weight Matrix Queen Contiguity 1st and 2nd order  
Moran’s I (error) 7.9411***
Lagrange Multiplier (lag) 569.0604***
Robust LM (lag) 18.1515***
Lagrange Multiplier (error) 1.#INF ***
Robust LM (error) 1.#INF ***
Lagrange Multiplier (SARMA) 1.#INF ***

Significant results for the Jarque-Bera, Breusch-Pagan, and White tests show that non-normality, non-stationarity, and heteroscedasticity exist. The value of Moran’s I shows that there is spatial autocorrelation. Five Lagrange Multiplier test statistics are reported in the diagnostic output. The first distinction is between the Lagrange Multiplier (LM) tests, which test for the presence of spatial dependence, and the Robust LM tests, which test whether lag or error spatial dependence could be at work. The second distinction is between lag and error: lag refers to a spatially lagged dependent variable, whereas error refers to a spatial autoregressive process for the error term. The first two (LM-Lag and Robust LM-Lag) pertain to the spatial lag model as the alternative. The next two (LM-Error and Robust LM-Error) refer to the spatial error model as the alternative. In my results, LM-Lag and LM-Error are both significant, while Robust LM-Lag is less significant (Prob 0.00002 with queen contiguity 1st order / 0.00001 with 1st and 2nd order) than Robust LM-Error (Prob 0.00000). The last test, LM-SARMA, relates to the higher-order alternative of a model with both spatial lag and spatial error terms. It is included only for the sake of completeness, since it is not that useful in practice.

According to Anselin (2005), I should use the spatial error model, but I chose the spatial lag model. Anselin advises: “In the rare instance that both would be highly significant, go with the model with the largest value for the test statistic. However, in this situation, some caution is needed, since there may be other sources of misspecification. One obvious action to take is to consider the results for different spatial weights and/or to change the basic (i.e., not the spatial part) specification of the model.”

b) Spatial autocorrelation

From both the global Moran’s I and the Anselin Local Moran’s I, drop and the Gini coefficient are spatially autocorrelated.

Table 3 Spatial Autocorrelation
Variable                     Global Moran’s Index (1)   z-score   Pattern
The Gini coefficient 2000    0.4673 (2)                 91.45     clustered
drop                         0.1253                     23.82     clustered
Note: (1) Only the continental U.S. counties are used to calculate Global Moran’s I and Anselin Local Moran’s I. All options are left at their defaults. 2,840 counties are used for drop and resilience. Conceptualization of Spatial Relationships: Inverse distance; Standardization: None.
(2) This is different from the first time I calculated it, which gave 0.453298 with a z-score of 88.71. Another copy of the same shapefile gives 0.44 with a z-score of 83.79.

local moran I drop 3 local moran I gini00

Figure 3 and 4 Local Moran’s I of drop and the Gini coefficient 2000.
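The same statistics can be computed in R with the spdep package. Below is a contiguity-based sketch (the ArcGIS runs above used inverse-distance weighting, so the numbers would differ), with 'counties' standing in for the county shapefile.

library(spdep)

# 'counties' is a hypothetical SpatialPolygonsDataFrame of continental U.S. counties
nb <- poly2nb(counties, queen = TRUE)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

moran.test(counties$drop, lw, zero.policy = TRUE)            # global Moran's I for drop
lisa <- localmoran(counties$drop, lw, zero.policy = TRUE)    # Anselin Local Moran's I
head(lisa)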

c) Spatial lag model

To deal with the spatial autocorrelation, I created spatial weights using queen contiguity. Contiguity-based weights are based on shared borders (or vertices) rather than distance. The queen criterion defines neighboring units as those that have any point in common, including both common boundaries and common corners. Therefore, the number of neighbors for any given unit under the queen criterion will be equal to or greater than that under the rook criterion, which eliminates corner neighbors, i.e., units that do not share a full boundary segment.

I tried 3 orders of queen contiguity.
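In R, higher-order queen contiguity weights can be built with spdep as a rough counterpart to the GeoDa weight files; 'counties' is again the hypothetical county polygons object.

library(spdep)

nb1 <- poly2nb(counties, queen = TRUE)        # 1st-order queen contiguity
nb_lags <- nblag(nb1, maxlag = 2)             # 1st- and 2nd-order neighbor lists
nb12 <- nblag_cumul(nb_lags)                  # union: neighbors up to 2nd order
lw12 <- nb2listw(nb12, style = "W", zero.policy = TRUE)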

queen contiguity

A county’s drop is affected by drop in neighboring counties, because the spatially lagged drop is significant in all three models. The coefficients are all positive, indicating that drop in a county is positively related to drop in its neighbors. Hence a county surrounded by high-drop counties will tend to have high drop, while a county surrounded by low-drop counties is likely to have low drop.

Moreover, the coefficient of the weighted drop increases with the contiguity order. My explanation is that when the drop of a county is affected by more counties in its neighborhood, the region as a whole tends to be either vulnerable or robust, and therefore the coefficients are larger.

Models Log likelihood AICc Schwarz criterion W_drop
OLS 2383.06 -4696.13 -4488.7  
Spatial lag 1 (1st queen contiguity order) 2416.53 -4761.07 -4547.71 0.215***
Spatial lag 2 (2nd queen contiguity order) 2426.12 -4780.24 -4566.88 0.356***
Spatial lag3 (3rd queen contiguity order) 2425.8 -4779.59 -4566.24 0.436***

 

 

The log likelihood increases from 2383.06 (OLS) to 2416.53 (spatial lag with 1st-order neighbors), 2426.12 (spatial lag with 1st- and 2nd-order neighbors), and 2425.8 (spatial lag with 1st-, 2nd-, and 3rd-order neighbors). Even after compensating the improved fit for the added variable (the spatially lagged dependent variable), the AICc (from -4696.13 to -4761.07/-4780.24/-4779.59) and the Schwarz criterion (from -4488.7 to -4547.71/-4566.88/-4566.24) both decrease relative to OLS, again suggesting an improvement in fit for the spatial lag specifications.

Among the three spatial lag models, the spatial lag with 1st- and 2nd-order neighbors has the highest log likelihood and the lowest AICc and Schwarz criterion. Spatially lagged drop based on 1st- and 2nd-order neighbors fits the data best.
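The preferred specification can also be estimated in R, using the spatialreg package as a stand-in for GeoDa's ML lag estimation; the formula below is abbreviated and the variable names are assumptions.

library(spatialreg)

# Maximum-likelihood spatial lag model using the 1st- and 2nd-order queen weights (lw12)
lag_fit <- lagsarlm(drop ~ gini_2000 + poverty_rate + pop_growth + pct_bachelor,
                    data = counties@data, listw = lw12, zero.policy = TRUE)
summary(lag_fit)   # rho is the coefficient on the spatially lagged dependent variable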

A limited number of diagnostics are provided with the ML estimation of spatial lag model 2 (using queen contiguity of 1st and 2nd order). The first is a Breusch-Pagan test for heteroscedasticity in the error terms; its highly significant value of 587.2494 suggests that heteroscedasticity is still a serious problem.

The second test is an alternative to the asymptotic significance test on the spatial autoregressive coefficient; it is not a test for remaining spatial autocorrelation. The Likelihood Ratio test is one of the three classic specification tests comparing the null model (the classic regression specification) to the alternative spatial lag model. Its value of 86.1133 confirms the strong significance of the spatial autoregressive coefficient.

The three classic tests are asymptotically equivalent, but in finite samples they should follow the ordering W > LR > LM. In my example, the Wald test is 9.63145^2 = 92.8 (rounded), the LR test is 86.1133, and the LM-Lag test was 506.9370, which is not compatible with the expected order. This probably suggests that other sources of misspecification may invalidate the asymptotic properties of the ML estimates and test statistics. Given the rather poor fit of the model to begin with, and the high degree of non-normality and strong heteroscedasticity, this is not surprising. Further consideration of alternative specifications is needed, either including new explanatory variables or incorporating different spatial weights.

 

6 Significance

Understanding what factors affect county-level resistance could offer some insights for local policies that aim to promote regional resistance and prevent future downturns.

My results show that pre-recession income inequality is not a significant factor in explaining drop. Drop, which refers to the percentage of employment loss during the Great Recession, is significantly explained by pre-recession population growth rate, race and ethnicity, educational attainment, the natural amenity scale, and several industries. The full model of drop shows that counties with a lower population growth rate from 2001 to 2005, a lower percentage of Black or African American population, a higher percentage of Hispanic or Latino population, a higher percentage of people with a Bachelor’s degree or higher, a lower natural amenity scale, and higher employment shares in manufacturing, wholesale trade, retail trade, transportation and warehousing, information, finance and insurance, educational services, health care and social assistance, accommodation and food services, and government and government enterprises would be more resistant and would experience a smaller deviation of employment from its expected growth path during the Great Recession.

Further investigation is needed into why industries such as manufacturing, retail trade, information, finance and insurance, educational services, and health care and social assistance appear to reduce drop.

7 Learning

I learned principal components analysis in R, which shrinks the set of explanatory variables. But I found the components hard to interpret in the GWR results. Moreover, my GWR still shows low local R-squared values (fewer than 20 counties have an R-squared higher than 0.3), so there might be misspecification in my original model.

I also learned the spatial lag and spatial error models in GeoDa. I ran a spatial error model as well, but its results are not shown here. The spatial lag model with 1st- and 2nd-order queen contiguity yields a higher log likelihood than any of the spatial error models, but according to my earlier LM tests for lag and error, the error dependence is more significant and the spatial error model should perform better. This needs more investigation. I find the tools very useful, and I am curious to find a way to test for a spatial lag in the income inequality of neighboring counties.

Moreover, I learned a lot from the tutorials and projects from my classmates. They are wonderful!

 

8 Statistics

Hotspot:

Pros: identify the clustering of values, show patterns, easy to apply, fits both binary and continuous values

Cons: sensitive to geographic outliers and to edge effects; requires a sample size greater than 30; univariate analysis; cannot identify outliers the way Local Moran’s I can; sensitive to the spatial distribution of points

Spatial autocorrelation (Local Moran’s I):

Pros: similar to hotspot analysis, could identify outliers (high-low outliers, low-high outliers)

Cons: similar to hotspot analysis

Regression (OLS, GWR):

OLS gives a single set of global coefficients, i.e., a global relationship. It is a multivariate analysis and can include many explanatory variables.

GWR builds a local regression equation for each feature in the dataset.

Cons: Maps of coefficients could only show one relationship at a time, cannot include large numbers of explanatory variables.

Multivariate methods (PCA):

Pros: can reduce the number of explanatory variables; the resulting PCs (linear combinations of the candidate variables) capture most of the variation in the dataset

Cons: it is hard to interpret the coefficients/results when more than one PC enters the regression.

Learning: the resulting principal components capture the most variation in the explanatory dataset; however, they may not best capture the variation in the dependent variable.

Reference:

Anselin, Luc. “Exploring spatial data with GeoDaTM: a workbook.” Urbana 51 (2004): 61801.

Han, Yicheol, and Stephan J. Goetz. “The Economic Resilience of US Counties during the Great Recession.” The Review of Regional Studies 45, no. 2 (2015): 131.

Rajan, Raghuram. Fault lines. HarperCollins Publishers, 2012.

Rupasingha, Anil, Stephan J. Goetz, and David Freshwater. “The production of social capital in US counties.” The journal of socio-economics 35, no. 1 (2006): 83-101. Data Retrieved on March 23rd, 2016  http://aese.psu.edu/nercrd/community/social-capital-resources

U.S. Census Bureau, Current Population Survey, Annual Social and Economic Supplements “H2 Share of aggregate income received by each fifth and top 5 percent of households all races: 1967 to 2014”,” H4 Gini ratios for households by race and Hispanic origin of householder: 1967 to 2014”, (2014a) Retrieved on May 17th , 2016 https://www.census.gov/hhes/www/income/data/historical/inequality/

U.S. Census Bureau, Current Population Survey, Annual Social and Economic Supplements. “Table 2. Poverty status of people by family relationship, race, and Hispanic origin: 1959 to 2014” (2014b). Retrieved on May 17th, 2016 from https://www.census.gov/hhes/www/poverty/data/historical/people.html

U.S. Census Bureau, Historical Income Tables Footnote. (2016a). Retrieved on May 17th, 2016 from http://www.census.gov/hhes/www/income/data/historical/ftnotes.html

U.S. Census Bureau Historical Poverty Tables-Footnote. (2016b). Retrieved on May 17th, 2016 from https://www.census.gov/hhes/www/poverty/histpov/footnotes.html