Introduction
Hot Spot Analysis is a common method for identifying statistically significant spatial clusters of high and low values among features. While feature values can easily be mapped using classified histogram distributions to identify areas or points with high or low values, it is difficult to tell from the map alone whether those values form spatially significant clusters.

In ArcGIS, the Hotspot Analysis tool calculates the Getis-Ord Gi* statistic (Figure 1) for each feature in a dataset. It then outputs a z-score and p-value for each feature, allowing one to map the spatial distribution of significant values to identify local hotspots. To be a statistically significant hotspot, a feature must not only have a high or low value, but must be surrounded by features with similarly high or low values.

Getis_Ords_G-star and Spatial Autocorrelation implementation in ArcView-1

Figure 1: Equation for the Getis-Ord Gi* statistic.
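In standard notation (the form given in the ESRI documentation), the statistic for feature $i$ is

$$
G_i^* = \frac{\sum_{j=1}^{n} w_{i,j}\,x_j - \bar{X}\sum_{j=1}^{n} w_{i,j}}{S\sqrt{\dfrac{n\sum_{j=1}^{n} w_{i,j}^2 - \left(\sum_{j=1}^{n} w_{i,j}\right)^2}{n-1}}},
\qquad
\bar{X} = \frac{1}{n}\sum_{j=1}^{n} x_j,
\qquad
S = \sqrt{\frac{1}{n}\sum_{j=1}^{n} x_j^2 - \bar{X}^2}
$$

where $x_j$ is the attribute value of feature $j$, $w_{i,j}$ is the spatial weight between features $i$ and $j$, and $n$ is the total number of features. The $G_i^*$ value returned is itself a z-score, which is why the tool can report significance directly.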

As we shall see, significance for the Getis-Ord Gi* statistic depends largely on the calculation of a local sum for a feature and its neighbors. This local sum is compared proportionally to the sum of all features to identify regions where the observed local sum differs from the expected local sum by more than random chance alone can explain.

But how does the Hotspot Analysis tool determine what is “local”?

When the threshold distance parameter is left blank, a default threshold equal to the Euclidean distance that ensures that every feature has at least one neighbor is computed and applied to the entire dataset. An implication of this is that Hotspot Analysis is extremely sensitive to the spatial distribution of features, making some data inappropriate for this type of analysis.
In this tutorial, we will explore how manual changes in threshold distance affect our view of what is significant using clay chemistry data from the Valley of Oaxaca, Mexico. I will argue that because clay sample locations were researcher-selected rather than the product of natural processes, Hotspot Analysis is not appropriate for this data.

Data, Analysis, Results

The Valley of Oaxaca is a broad Y-shaped valley located in the central highlands of Oaxaca, Mexico and the heart of ancient Zapotec Civilization. Over the past several years, researchers at the OSU Archaeometry Laboratory have collected over 300 clay samples from across the Valley for elemental analysis using Instrumental Neutron Activation Analysis (INAA) to establish a comparative basis for the geochemical fingerprinting of ancient pottery from the region.
OaxacaGeo

Figure 2: Clay Sampling locations relative to regional geology in the Valley of Oaxaca, Mexico.

Our ability to identify pottery produced in different parts of the Valley is due in large part to the geologic complexity of the region. In each area of the Valley, clay composition is strongly conditioned by parent material, sometimes representing the product of mixed sediments from multiple sources. For this exercise, we will focus on two elements, lanthanum (La) and cesium (Cs), found in elevated concentrations in clays derived from gneiss and rhyolite, respectively. La concentrations tend to be high along the western side of the valley (Figure 3), while Cs concentrations are high in the east (Figure 4).
OaxacaLa

Figure 3: La concentrations in samples from the Oaxaca Clay Survey

OaxacaCs
Figure 4: Cs concentrations in samples from the Oaxaca Clay Survey

A hotspot analysis of La concentrations from across the Valley using an unspecified (default setting) threshold distance (Figure 5) identifies significantly high hotspots of La along the central portion of the western edge of the valley – as expected – as well as significant cold spots along the eastern edge of the southern arm of the valley and along the northern side of the eastern arm. A similar analysis of Cs values identifies Cs cold spots along the western edge of the valley and Cs hot spots near the southern and eastern sides of the eastern arm (Figure 6).
OaxacaLaHotspot

Figure 5: Hotspot analysis of La concentrations using an unspecified threshold distance.

OaxacaCsHotspot

Figure 6: Hotspot analysis of Cs concentrations using an unspecified threshold distance.

Broadly speaking, these patterns are in accord with our previous understanding of regional clay composition, but to what degree are the significance levels identified by hotspot analysis dependent upon our unspecified threshold distance?

To address this question, we conducted a series of three additional hotspot analyses of Cs concentrations using specified threshold distances of 100 m (Figure 7), 1,000 m (Figure 8), and 10,000 m (Figure 9). Results of these analyses show that the number of features identified as significant hot or cold spots increases as a function of threshold distance. Using a threshold of 100 m, only a few points in the far eastern portion of the Valley are identified as significant hotspots. This pattern largely holds true using a 1,000 m threshold with a few additional points in the same limited area defined as significant hotspots. However, using a threshold of 10,000 m, the majority of features on the eastern side of the Valley are classified as significant hotspots, while the majority of those in the west are classified as significant cold spots. When nearly everything is significant, what is significant? Clearly 10,000 m is an improper choice of threshold distance, but this raises the threefold issue of how the default threshold is determined, whether this distance best describes the data, and whether the assumptions of this tool are met by the data.
OaxacaCsHotspot100

Figure 7: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 100 m.
OaxacaCsHotspot1000

Figure 8: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 1,000 m.

 

OaxacaCsHotspot10000
Figure 9: Hotspot analysis of Cs concentrations for the Oaxaca Clay Survey conducted using a specified threshold distance of 10,000 m.

 

Discussion

Results of a series of hotspot analyses of La and Cs concentrations in clay samples from the Valley of Oaxaca largely confirm what we already knew: La concentrations are high in clays along the western side of the valley, and Cs concentrations are highest in clays on the eastern side of the valley. Hotspot analysis allows us to identify spatially significant clusters of points with higher or lower than expected values, but as we have seen, what gets flagged as significant is strongly dependent upon one's choice of threshold distance.

By default, ArcGIS selects a threshold distance equal to the minimum distance required to ensure that all features have at least one neighbor. This means that the default minimum distance is contingent upon the spatial distribution of features, and is likely driven by features at the edge of the dataset and/or spatial outliers.
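A minimal R sketch of that default, assuming the spdep package and a two-column matrix pts of projected point coordinates: the distance that guarantees every feature at least one neighbor is simply the largest nearest-neighbor distance in the dataset.

# Sketch of ArcGIS's default threshold: the maximum nearest-neighbor distance.
# 'pts' is assumed to be a two-column matrix of projected x/y coordinates.
library(spdep)

nn <- knn2nb(knearneigh(pts, k = 1))    # each point's single nearest neighbor
nn_dist <- nbdists(nn, pts)             # distance from each point to that neighbor
default_threshold <- max(unlist(nn_dist))
default_threshold

A single spatial outlier with a distant nearest neighbor inflates this maximum for the entire dataset, which is exactly the sensitivity described above.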

This raises the larger issue of what factors drive the spatial distribution of features in one’s dataset. Are the locations of individual features the product of the physical or social environment? Is their spatial distribution critical to your understanding of the data? Or are the locations researcher-selected according to some regular, stratified, or arbitrary sampling design that is a product not of the environment, but of how one chose to sample that environment?

In the case of the Oaxaca Clay Survey, clay sampling locations were selected opportunistically from subsurface horizons along roadcuts and streambanks, with greater sampling intensity near key archaeological sites and sparser coverage in areas with few important sites. Because the sampling locations were arbitrarily selected, their overall distribution is not of interest; nor are the inter-point distances used to calculate the Getis-Ord Gi* statistic.

Because the spatial distribution of clay sampling locations tells us more about researcher behavior than clay formation processes, and Hotspot Analysis uses this information in its calculation of z-scores and significance levels, Hotspot Analysis is not appropriate for this data.

Conclusion

Before using Hotspot Analysis, one should always verify (1) that the spatial locations of features of interest are the product of the physical or social environment rather than researcher-selection; and (2) that the default local threshold value is not driven by spatial outliers.

Relationships that vary in space and time: A challenge for simple linear regression

Regression is a fundamental analytical tool for any researcher interested in relating cause and effect. A basic assumption of regression is stationarity, the principle that the relationship between a predictor variable and its response is constant across the sample space, i.e., that the relationship holds in all regions where it is applied. This is a particularly poor assumption in spatial analyses, where known and often unknown factors interact to change how a response variable relates to a given explanatory variable from place to place. While this is a challenge for simple linear regression, it is also what generally makes spatial problems interesting: relationships are not constant across space and time.

Spatially weighted regression relaxes the assumption of stationarity: where simple linear regression develops a single relationship to describe a phenomenon, spatially weighted regression allows the relationship to vary spatially. Unhinging the relationship between an explanatory variable and its response spatially creates a set of local coefficients, one for each location where a local regression is fit. This is done through the use of a weighting function. Whereas in simple linear regression each data point carries equal weight in the final relationship, a weighting function gives greater weight to values closer to the point where the regression is being calculated.


Spatial weighting function (1)

Fig 1: A spatial weighting function gives greater weight to data points closer to a regression point. Bandwidths can also vary across a feature space, such that two local regressions may be constructed from different numbers of data points.
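The exact kernel pictured is not specified here, but a common choice is the Gaussian kernel, in which the weight given to data point $j$ when fitting the local regression at point $i$ decays with the distance $d_{ij}$ between them:

$$
w_{ij} = \exp\!\left(-\frac{d_{ij}^{2}}{2b^{2}}\right)
$$

where $b$ is the bandwidth. With a fixed bandwidth, the number of points contributing meaningfully to each local fit varies with local data density; with an adaptive bandwidth, $b$ itself varies so that each local fit draws on a comparable share of the data.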


SWR_drawing

Fig 2: Spatially weighted regression allows the relationship between a response and an explanatory variable to vary across a study region.

 

NDVI and weed density: A case study in spatially weighted regression

The normalized difference vegetation index (NDVI) has been used as a proxy for phenology in many remote sensing and cropping systems studies. NDVI is calculated as the normalized difference between near-infrared and red reflectance, (NIR − Red) / (NIR + Red), and is generally related to the amount of green, photosynthetically active tissue. In principle, weeds and crops should be distinguishable based on how their NDVI response varies in time.


Question: Can NDVI be used to predict weed density? Does the relationship between NDVI and weed density vary spatially? 


NDVI_swr

Fig 3: A general hypothesis for how weeds and crop may differ in their NDVI response. Weeds may mature earlier or later than the crop, but this relationship may also vary spatially.

Here, spatially weighted regression is offered as a method for distinguishing relationships between weed density and NDVI. Allowing the relationship between NDVI and weed density to vary spatially may allow one to classify areas of the field based on the nature of these relationships.


NDVI_gif

Fig 4: NDVI over the course of the growing season at a 17-acre field located NE of Pendleton, OR, summer 2015. Note that NDVI does not increase or decrease evenly across the field; rather, peak NDVI passes as a wave across the field.

An important first step in deciding if a data set is suitable for spatially weighted regression is to look at the residuals of the linear relationship you are choosing to model. Here we examine the following function for predicting weed density:

Weed density = β0 + β1·NDVI(t1) + β2·NDVI(t2) + β3·NDVI(t3) + ε     (1)

This function uses three samples of NDVI over the course of the growing season, centered on the time of peak NDVI. These three dates were chosen to emphasize the periods when weeds and crop would differ most in their relative NDVI response.
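As a rough sketch of how such a model can be fit in R, the spgwr package provides bandwidth selection and the local fits; the column names (weed_density, ndvi_t1, ndvi_t2, ndvi_t3) are hypothetical, and this is not necessarily the software used for the results shown below.

# Sketch of a geographically weighted regression of weed density on three NDVI
# dates, assuming a SpatialPointsDataFrame 'weeds' with hypothetical columns
# weed_density, ndvi_t1, ndvi_t2, ndvi_t3.
library(spgwr)

# Choose an adaptive bandwidth (fraction of points used in each local fit) by cross-validation
adapt_q <- gwr.sel(weed_density ~ ndvi_t1 + ndvi_t2 + ndvi_t3,
                   data = weeds, adapt = TRUE)

# Fit the local regressions; the result stores coefficients, a local R-squared,
# and a prediction for every observation
gwr_fit <- gwr(weed_density ~ ndvi_t1 + ndvi_t2 + ndvi_t3,
               data = weeds, adapt = adapt_q, hatmatrix = TRUE)

gwr_fit                  # prints the quasi-global R2 among other summaries
head(gwr_fit$SDF@data)   # per-point coefficients and local R2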


PlantHS_obs

Fig 5: Weed densities were calculated based on linear transects made prior to harvest. Weed hot spots were calculated using the Getis-Ord Gi* statistic in ArcMap. Weed hot spots from a prior analysis were used as input for predicting weed density in this exercise.


PlantHS_glm_pred

Fig 6: Predicted weed density for a multiple regression model based on 3 measurements of NDVI surrounding peak NDVI.  Multiple R2:  0.044, P-value < 0.001

 


PlantHS_glm_resid

Fig 7: Residuals from a multiple regression model based on 3 measurements of NDVI surrounding peak NDVI. Both high and low residuals cluster spatially, indicating that the relationship between NDVI and weed density may vary spatially and that the data may be a good candidate for geographically weighted regression.


PlantHS_swr_pred

Fig 8: Predicted weed density from a spatially weighted regression. Quasi-global R2: 0.92.

By using a spatially weighted regression, we are able to account for 92 percent of the variance in the distribution of weeds in this field. Unlike a standard regression, the result of this process is a collection of local regression equations, so the model cannot easily be extrapolated to predict weed distributions in future data sets. However, the local coefficients do offer an opportunity to look for spatial patterns that may yield additional information about the nature of these local relationships.


 


Slope_coef

Fig 10:  Map classified by coefficient slope.

Introduction

I have done a cluster analysis on the fire regimes produced by the MC2 dynamic global vegetation model (DGVM) run across the Pacific Northwest (PNW) over the time period 1895-2100 (see my previous Cluster Analysis in R post for details). Clusters were based on fire return interval (FRI), carbon consumed by fire, and fraction of cell burned. MC2 is a process-based model with modules for biogeochemistry, fire, and biogeography. Process models are generally complicated models with high computational costs. I wanted to explore the question "How well could a statistical model reproduce the results from a process-based model?"

A complete answer to that question would take much work, of course, as MC2 produces data on vegetation types, carbon pools, carbon fluxes, water flows, and, of course, fire. But this investigation provides a starting point and has the potential to show that at least one aspect of the model could have a statistical alternative to process-based computation.

Method description

Overview

Inputs into the MC2 include monthly weather in the form of maximum temperature, minimum temperature, precipitation, and mean dew point temperature, as well as elevation, and soil depth. I chose to use these inputs with a set of logistic regression models (described below) to predict the fire regime cluster of each cell in the results from my cluster analysis. To simplify the number of regression variables, I summarized the monthly variables by “summer” (April-September), “winter” (October-January), and annually.

Logistic regression

Conceptually, logistic regression differs from linear regression in that logistic regression is designed to predict the probability of an outcome or occurrence in a binary data set. It is commonly used in models predicting the presence or absence of a phenomenon (e.g. fire occurrence, species occurrence) over a spatial domain. Logistic regression uses one or more continuous or binary explanatory variables to predict the probability of the binary response variable.

The formula used in logistic regression is:

e^t / (e^t + 1)

which is equivalent to

1 / (1 + e^-t)

where t is the logit, a linear function of the explanatory values. The graph of the function is S-shaped.
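For a model with explanatory variables $x_1, \dots, x_k$, the logit is just a linear combination of them:

$$
t = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k,
\qquad
P(y = 1) = \frac{1}{1 + e^{-t}}
$$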

An example in R should make things clearer:

Make a dataset with an independent variable and a dependent binary variable:

myDS = data.frame(
iv = 1:50, # independent variable
dv = c(rep(0,25),rep(1,25))
)

plot(myDS$iv,myDS$dv)

Your graph should look like this:

SheehanLogRegPlot1

Scramble the dependent variable in rows 21 through 30 to make the plot messier, like real life:

myDS$dv[21:30] = sample(1:0,10,replace=TRUE)

plot(myDS$iv,myDS$dv)

Your graph should look similar to this one, with the regions of 1 and 0 values overlapping:

SheehanLogRegPlot2

Now run a logistic regression on the data and look at the summary of the result:

tstGlm = glm(dv ~ iv, data = myDS, family = binomial)

summary(tstGlm)

Your summary should look something like this:

Call:
glm(formula = dv ~ iv, family = binomial, data = myDS)

Deviance Residuals:
Min       1Q   Median       3Q       Max
-1.55618 -0.15882 -0.01953   0.15884   2.01358

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.0868     3.0071 -3.022 0.00251 **
iv           0.3429     0.1113   3.080 0.00207 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 69.235 on 49 degrees of freedom
Residual deviance: 19.165 on 48 degrees of freedom

AIC: 23.165

Number of Fisher Scoring iterations: 7

There is a good explanation of these in this video: https://www.youtube.com/watch?v=xl5dZo_BSJk. A good reference for formulas used can be found at http://documentation.statsoft.com/portals/0/formula%20guide/Logistic%20Regression%20Formula%20Guide.pdf

But the short explanation is:

Call: The call that was made to the glm function

Deviance residuals: Not really useful in a binary response regression

Coefficients:

  • Intercept: The intercept, or constant coefficient, used in the logit linear function
  • Estimate: These are the coefficients used with each variable in the logit (in this case we have one variable in the logit)
  • Std. Error: The standard error associated with each coefficient.
  • Z Value: The z value for the coefficient. This is the Estimate divided by the Std. Error.
  • Pr(>z): The p value associated with each z value.
Null deviance: A measure of the deviance between the observed data and a model with only the constant coefficient

Residual deviance: A measure of the deviance between the observed data and the fitted model.

AIC: Akaike information criterion. This is a measure of the quality of the model based on goodness of fit and model simplicity. For models created using the same number of data points, a model with a lower AIC is considered better than one with a higher AIC.

Make a plot of the regression function along with the original points. You have to use a data frame in which the column has the same name as the column you used to create the regression:

tmpData = data.frame(iv = seq(0,50,0.2))

plot(seq(0,50,0.2), predict(tstGlm, newdata = tmpData, type = 'response'), type='l')

points(myDS$iv,myDS$dv)

SheehanLogRegPlot3

Notice how the logistic curve, which predicts the probability of a positive outcome, moves from a y value of 0 to a y value of 1 where the dependent values from the original dataset are mixed between 1 and 0. From this, it is easy to see how logistic regression predicts probabilities for binary data.

Logistic regression can also be used with multiple explanatory variables. Each explanatory variable has a term in the logit, and each term’s coefficient determines its influence on the final prediction.
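For example, adding a second (hypothetical) explanatory variable to the toy dataset above only requires extending the formula:

# Hypothetical second predictor added to the toy dataset from above
myDS$iv2 = sample(1:50, 50)

multiGlm = glm(dv ~ iv + iv2, data = myDS, family = binomial)
summary(multiGlm)   # one coefficient (and p-value) per term in the logit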

Stepwise Logistic Regression

Finding the best combination of explanatory values to use can present a challenge. Often, there are more variables than desired, and using too many can result in an overfitted model. One method used to determine which variables to use is stepwise regression. There are two approaches to this method: forward and backward.

In forward stepwise regression, the algorithm starts with the constant coefficient. Then it tests all of the explanatory values to find the one that, by itself, has the greatest explanatory power. It computes the AIC for the model and then tests the remaining explanatory variables to find which adds the greatest explanatory power. It then computes the AIC with the added variable. If the AIC is lower (better) with the new explanatory variable, it repeats the process. If the AIC is greater, it discards the new variable and stops.

Backward stepwise regression takes the opposite approach. It starts with all the variables and removes the one with the least explanatory value and then calculates AIC. It stops at the point when removing a variable would raise the AIC.

One approach to finding a model with stepwise regression is to do both forward and backward regression and see if one returns a better result than the other. Additionally, the results can be modified by removing any explanatory variables that have a high p-value to see if removing them yields a better model. I used both forward and backward regression and, in two cases, found model improvement by removing an explanatory variable with a high p-value.

A problem with stepwise regression is that it does not guarantee the best possible regression function. To find that, one has to test all possible combinations of explanatory variables, but that is generally a very compute-intensive proposition. So, even though it is easy to find criticisms of stepwise regression, it is commonly used.

Without going into too much detail, here is some R code that performs stepwise regression:

# Generate some independent, explanatory variables. Note how some are more “scrambled” than others

indepVs = data.frame(
v1 = c(1:50),
v2 = c(1:50),
v3 = c(1:50),
v4 = c(1:50),
v5 = c(1:50),
v6 = c(1:50)
)

indepVs$v2[23:27] = sample(c(23:27),5)
indepVs$v3[21:30] = sample(c(21:30),10)
indepVs$v4[16:35] = sample(c(16:35),20)
indepVs$v5[11:40] = sample(c(11:40),30)
indepVs$v6[1:50] = sample(c(1:50),50)

# The dependent variable, scrambled in the middle

depV = rep(0,50)
depV[26:50] = 1
depV[21:30] = sample(c(0:1),10,replace=TRUE)

# Generate a full model (using all the explanatory variables)

fullGlm = glm(
depV ~ v1+v2+v3+v4+v5+v6,
data = indepVs,
family = binomial()
)

# Generate an empty model

emptyGlm = glm(depV ~ 1, data = indepVs,family=binomial)

# Do forward stepwise regression

forwardGlm = step(
emptyGlm,
scope = list(lower=formula(emptyGlm),upper=formula(fullGlm)),
direction='forward'
)

# Do a backward stepwise regression

backGlm = step(fullGlm)

Plotting the results and examining the resulting models using summary() will give you insights into how the process is working.

Geographically weighted logistic regression

I did some searching for geographically weighted logistic regression and found a couple of papers that used the method. I did not come across any easily implemented solutions for it (which does not mean one does not exist). I think doing geographically weighted logistic regression could have a lot of value, especially over large and varied geographic areas, like the one I’m studying.

I did a logistic regression on a single EPA Level III ecoregion (The Blue Mountains) within my study area and compared the results of that to results from the model produced using the whole study region and found it had a higher predictive value for the ecoregion. This is an indication that geographically weighted regression could provide advantages.

The Analysis

I used logistic regression as part of a process to predict which fire regime each cell in my dataset would have. My explanatory values were derived from the input dataset I used for modeling; these included annual, summer, and winter means for maximum temperature, minimum temperature, precipitation, and mean dewpoint temperature; elevation; and soil depth. For each of my four fire regimes, I generated a logistic regression predicting whether a cell was in the fire regime or not. Based on recommendations from https://www.medcalc.org/manual/logistic_regression.php, I used this formula to determine the number of data points to use in each regression:

10 * num_exp_vars / min(frac_pos_cases, frac_neg_cases)

where num_exp_vars is the number of explanatory variables, frac_pos_cases is the fraction of positive (presence) cells in the dataset, and frac_neg_cases is the fraction of negative cases in the dataset.
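As a hypothetical worked example: with the 14 explanatory variables listed above (12 climate summaries plus elevation and soil depth) and a fire regime present in 25% of cells, the formula gives

num_exp_vars = 14
frac_pos_cases = 0.25
frac_neg_cases = 1 - frac_pos_cases
10 * num_exp_vars / min(frac_pos_cases, frac_neg_cases)  # = 560 data points

so roughly 560 data points would be used for that regression (the actual fractions vary by fire regime).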

This gave me four logistic regression functions. For each cell in the study area, I ran each of the logistic functions and recorded the probability returned. The predicted fire regime for a cell was assigned based on the logistic function that produced the highest probability. I repeated this entire procedure for the Blue Mountain Ecoregion only.
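A sketch of that final assignment step (the model and data frame names are hypothetical):

# Four fitted logistic models, one per fire regime (hypothetical names), and a
# data frame 'cells' holding the explanatory variables for every cell
probs = data.frame(
  r1 = predict(glmRegime1, newdata = cells, type = 'response'),
  r2 = predict(glmRegime2, newdata = cells, type = 'response'),
  r3 = predict(glmRegime3, newdata = cells, type = 'response'),
  r4 = predict(glmRegime4, newdata = cells, type = 'response')
)

# The predicted fire regime is the one whose model returns the highest probability
cells$predRegime = apply(probs, 1, which.max)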

Results

Across the full study area, this process correctly predicted the fire regime with 61% accuracy (keep in mind that random assignment would be expected to return 25% accuracy). Within the Blue Mountain Ecoregion, the process predicted the fire regime with 46% accuracy using the logistic functions derived from the data for the entire study area, and with 74% accuracy using the logistic functions derived for the Blue Mountain Ecoregion alone. This is a strong indication that geographic influences exist and that geographically weighted logistic regression has the potential to improve results.

Method Critique

I was very pleased with the results of this method, and frankly, a little surprised at how successful the method was at predicting fire regime. This investigation leads me to believe that statistical models should be considered as alternatives to process-based models for some vegetation modeling tasks, especially considering the lower computational overhead of statistical models.

The biggest downside to the method is the manual procedures involved in finding the best logistic function. This process could be automated with some programming, however, as could a geographically weighted version of logistic regression.

The goal for this tutorial is to expose users to interpolation methods using packages in R. I found that doing these in R gives very informative insight into how these interpolation methods operate, since you have to specify each option explicitly instead of simply clicking the kriging tool in ArcGIS.

Just a quick refresher: 1) First, I load the necessary libraries; the main libraries needed for interpolation are gstat and raster. 2) Next, I import all the data we need. The first dataset is a polygon shapefile, which I use for the projection and as an aesthetic backdrop for displaying the maps. There is also a .csv file containing all the points I need, which I convert to a point vector data type. Last time, I extrapolated the point-layer pH values over each of their respective polygons; if you want more detail on how exactly this was accomplished, please read my previous blog post, Spatial Data Management and Simple Plotting with R.

x <- c("rgdal", "sp", "raster", "gstat", "dplyr", "tmap")  # package list inferred from the functions used below
lapply(x, library, character.only=T)

## This is a polygon shapefile
CRA <- readOGR(dsn = "Data/NRCS_Data", layer = "Final_CRA")  # call reconstructed from the OGR output below

## OGR data source with driver: ESRI Shapefile
## Source: "Data/NRCS_Data", layer: "Final_CRA"
## with 60 features
## It has 4 fields

## Reading in a .csv file with all the attributes and removing the missing pH values
NRCS_data <- read.csv("...")   # original file path lost in formatting; point this at the NRCS points .csv
NRCS_data <- NRCS_data[!is.na(NRCS_data$pH_water), ]

NRCS_data

## Creates a matrix of lat/long coordinates (coordinate column names assumed)
NRCS_mat <- cbind(NRCS_data$Long, NRCS_data$Lat)
row.names(NRCS_mat) <- 1:nrow(NRCS_mat)

## Takes the projection from the CRA shapefile
CRA.projection <- proj4string(CRA)
CRA.projection

## [1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"

## Creates a spatial points data frame with the same projection as our CRA shapefile
NRCS <- SpatialPointsDataFrame(coords = NRCS_mat, data = NRCS_data,
                               proj4string = CRS(CRA.projection))
NRCS@data

## Aggregating my attributes by pH, finding the mean pH for each CRA
## (aggregation call reconstructed; grouping column inferred from the join message below)
agg_NRCS_data <- aggregate(pH_water ~ Final_CRA, data = NRCS@data, FUN = mean)
agg_NRCS_data

## Merging the 2 datasets together by CRA (left_join is from dplyr)
CRA@data <- left_join(CRA@data, agg_NRCS_data)

## Joining by: "Final_CRA"

## Warning in left_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector

CRA@data

## Plotting the extrapolated CRA polygon layer. An in-depth explanation of this code is given with the IDW code below.
tm_shape(CRA)+ tm_fill("pH_water",breaks=c(-Inf, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, Inf), auto.palette.mapping=F, legend.show = F)+tm_borders()+tm_shape(NRCS)+ tm_bubbles("red", size = .1, border.col="black")+tm_layout(title="Extrapolation of pH Across CRAs", title.position = c("center", "top"))

CRA

 

In order to do any form of interpolation, you need to define the resolution at which you want the interpolation to happen. In ArcGIS this is done automatically; however, in R you need to create a raster layer with your specified resolution. Here I create a blank raster layer with 100 rows and 100 columns; you can change the resolution to whatever you want by editing the ncol and nrow variables.


## Creating a raster grid using the raster package (100 rows x 100 columns)
grd <- raster(ncol = 100, nrow = 100)

## "Spreading" the grid to encompass the entire CRA polygon layer
extent(grd) <- extent(CRA)

## Giving the grid a projection
crs(grd) <- crs(CRA)

## Changing the variable class from a raster to a SpatialPixels object. Both are gridded;
## however, this is the format that the gstat package prefers.
## (The exact conversion in the original was lost in formatting; this achieves the same result.)
values(grd) <- 1                                  # dummy values so every cell is kept
NRCS.raster <- as(grd, "SpatialPixelsDataFrame")
gridded(NRCS.raster)                              # TRUE; confirms the object is gridded

## This is just a quick plot to see that the raster grid has the correct projection and matches up correctly.
plot(NRCS.raster)
lines(CRA, col="blue")

grid_plot

This is a plot of the polygon layer over the empty gridded raster layer.

Now that we have all of our layers, it's time to do the inverse distance weighting (IDW) and ordinary kriging. IDW and kriging both use only geographic distance from known points to unknown points. IDW is simpler, since it is a weighted average of the known values based strictly on distance; kriging is a little more sophisticated, since a model of the semivariogram is fit and used to calculate the weights.
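For reference, the IDW estimate at an unsampled location $s_0$ is a distance-weighted average of the observed values, with weights that decay as a power $p$ of distance (the idp argument in gstat's idw, 2 by default):

$$
\hat{z}(s_0) = \frac{\sum_{i=1}^{n} d_i^{-p}\, z(s_i)}{\sum_{i=1}^{n} d_i^{-p}}
$$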

OK, first we are going to do the IDW. To do this we need the points with the z-value we want to interpolate, and the resolution we want to interpolate over (our blank raster layer).

## In gstat the IDW function is called idw. We want to interpolate the pH_water variable in the NRCS point shapefile over the resolution of the NRCS.raster layer.
idw.out <- gstat::idw(pH_water~1, NRCS, newdata=NRCS.raster)

## [inverse distance weighted interpolation]

## Here we are backing up the IDW raster layer and clipping it to the extent of our CRA polygon layer.
idw.out.BU <- idw.out
idw.out <- idw.out.BU[CRA,]

## Now to plot it. I know this piece of code looks intimidating; however, that is only because you have to create every aspect of your plot layer by layer. First comes the pH IDW raster layer: I want the raster colors partitioned by the pH bins I specify after breaks, and I turn off the legend since it is a little cluttered. The next layer is the polygon layer, of which I display just the outlines. The last layer is the point layer, in which I display each point as blue. tm_layout is just for formatting the title and positioning it at the top center.

tm_shape(idw.out)+tm_raster(breaks=c(-Inf, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, Inf), auto.palette.mapping=F, legend.show = F)+tm_shape(CRA)+tm_borders()+tm_shape(NRCS)+ tm_bubbles("blue", size = .1, border.col="black")+tm_layout(title="Interpolation using IDW", title.position = c("center", "top"))

IDW_plot

Now for kriging. To do kriging in R, first you need to examine the empirical semivariogram, then model the semivariance and use that model for your interpolation.
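For reference, the spherical semivariogram model fit below has the standard form, with nugget $c_0$, partial sill $c$, and range $a$:

$$
\gamma(h) =
\begin{cases}
c_0 + c\left(\dfrac{3h}{2a} - \dfrac{h^3}{2a^3}\right), & 0 < h \le a \\
c_0 + c, & h > a
\end{cases}
$$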

## Creating the empirical semivariogram and plotting it.
cvnrcs <- variogram(pH_water~1, locations=NRCS, data=NRCS)
plot(cvnrcs)

VAR_1_plot

## Though this variogram does not look great, for the purposes of this exercise it will work. In reality, however, I would be quite hesitant to do any further interpolation with this data.
## Looking at the variogram, I chose to go with a spherical model that has a partial sill of 0.06, a nugget of 0.01, and a range of 100.
nrcs.fit.sph <- fit.variogram(cvnrcs, model = vgm(psill=.06, model="Sph", nugget=.01, range=100))
plot(cvnrcs, nrcs.fit.sph)

Var_2_plot

## The fit isn't completely bad; however, it is not great.
## Now to create the ordinary kriged raster layer, then back it up and clip it to the extent of the polygon layer.
pH.ok <- krige(pH_water~1,NRCS, NRCS.raster, model=nrcs.fit.sph)

## [using ordinary kriging]

pH.ok.BU <- pH.ok
pH.ok <- pH.ok.BU[CRA,]

## This is essentially the same plotting code as the IDW plot; however, the raster layer I am using is the one created by ordinary kriging.
tm_shape(pH.ok)+tm_raster(breaks=c(-Inf, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, Inf), title="Soil pH", auto.palette.mapping=F, legend.show = F)+tm_shape(CRA)+tm_borders()+tm_shape(NRCS)+ tm_bubbles("blue", size = .1, border.col="black")+tm_layout(title="Interpolation using Ordinary Kriging", title.position = c("center", "top"))

krig_plot

In examining the plots, you can see that the pH values on the western side of Oregon are preserved fairly well, since there is not much heterogeneity there. However, as you move east, each map displays something different. This is because factors other than distance alone dictate pH, and the sampling density in eastern Oregon is not high enough to capture the heterogeneity of the landscape; you can see how IDW and ordinary kriging blend much of it together.

However, now you know how to use this functionality in R, and this is only scratching the surface. Though R may not be as user friendly and its graphics may not be as polished, its ability to truly customize each step of the analysis makes it well worth the effort of learning how to conduct these analyses.

Introduction

I am working with a dataset from a dynamic global vegetation model (DGVM) run across the Pacific Northwest (PNW) over the time period 1895-2100. This is a process-based model that includes a dynamic fire model. I am particularly interested in drivers of the fire regimes (the frequency and intensity of wildfire) produced by the model. The general question I am studying is “what, if any, is the relationship between the climatic and elevation inputs of the model and the fire regimes produced by the model?” A more specific pair of questions for this method is “Do fire regimes cluster locally?” and “How do fire regimes change over time?”

Method description

Overview

I divided the data into two time periods: 1901-2000 and 2001-2100 to capture the historical and projected climates and fire regimes over those periods. To define a fire regime, I chose the two fire-related output variables – carbon consumed by fire and percent of model pixel burned – along with fire return interval (FRI) over the century time period. I used k-means cluster analysis in R to define four fire regimes.

K-means clustering divides a dataset into a specified number of clusters, calculating centroids and cluster membership such that the Euclidean distance between each cluster’s centroid and its members is minimized. The relative scales of the input variables should be taken into consideration in this type of analysis, as they affect the Euclidean distances calculated. The steps in the analysis are outlined below, and the actual commands used are given in Appendix A.
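Formally, given $k$ clusters $C_1, \dots, C_k$ with centroids $\mu_1, \dots, \mu_k$, k-means seeks the assignment that minimizes the within-cluster sum of squared Euclidean distances:

$$
\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2
$$

which is why variables measured on larger numeric scales dominate the clustering unless they are rescaled.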

Outline of steps

For each of the two input files:

  1. Open the NetCDF file
  2. Reverse the order of the latitude values (change from highest to lowest to lowest to highest)
  3. Filter out the na (no data) values
  4. Copy data (carbon consumed by fire, fraction pixel burned, FRI) into a combined data set

For the combined dataset:

  1. Normalize values for each variable to 0.0 to 1.0 using the minimum and maximum of each variable in the combined dataset.
  2. Generate the four clusters
  3. Output value distributions of the four clusters
  4. Distribute resulting data into two output datasets, one for each of the two time periods
    1. Divide into two separate datasets
    2. Distribute results into non-na data locations

For each of the output datasets:

  1. Create output NetCDF file
  2. Write data to NetCDF file
  3. Close NetCDF file

 

Results

Clusters

With my input data, the FRI values range from a minimum of 1 to a maximum of 100 years. The mean annual fraction of area burned has a theoretical range of 0 to 1, and the mean annual carbon consumed ranges from 0 to 266.5 g m-2. Performing the cluster analysis on the original input values resulted in 3 of the 4 clusters being driven primarily by carbon consumed (Fig. 1a-c). Normalizing each input variable using its minimum and maximum resulted in clusters driven by different variables (Fig. 2). For the remainder of the project I am using normalized variables.

SheehanCluster01

Fig. 1: Z-score distributions of input variables within clusters for clusters computed without normalizing input data. Clusters 1, 2, and 3 (A-C) exhibit strong overlaps in both fraction of area burned and time between fires, and strong shifts in carbon consumed between clusters. Cluster 4 (D) is the only cluster in which all three factors differ substantially from the other clusters.

SheehanCluster02

Fig. 2: Distributions of normalized input variables within clusters. Each of the clusters was created by performing cluster analysis using normalized input data. All clusters have substantial shifts in the distributions of at least two factors compared to the other clusters.

Result maps

The clustering of fire regimes can easily be seen in maps of the fire regimes (Fig. 3). Areas without fire tend to be at higher elevations: in the Coast Range and Cascades of Oregon and Washington and the Rockies of western Montana and northwestern Idaho. High-frequency fire (low FRI) is common in the plains and plateaus of southern Idaho, south and central Oregon, and the Columbia Plateau in central Washington. The other two fire regimes, both of which have a medium FRI, are somewhat intermingled, but are present at mid elevations in the Rockies, in southwest Oregon, and in the Willamette Valley and Puget Trough regions of Oregon and Washington, respectively.

SheehanCluster03

Fig. 3: Fire regime maps for (A) the historical period (1901-2000) and (B) the future period (2001-2100). (C: carbon consumed; Med: medium; Frac: fraction of grid cell; No Fire: no fire occurred in the cell over the time period)

Fire regime changes from the historical (Fig. 3A) to the future (Fig. 3B) period include the appearance of the High Carbon Consumed, Medium Fraction Burned, Medium FRI fire regime around the edges of the plains and plateaus and into the Blue Mountains of northeastern Oregon, as well as the spread of both Medium FRI fire regimes into the Coast, Cascade, and Rocky Mountain ranges.

Method critique

I found this method to be very useful for my study. Fire and vegetation modeling results are commonly expressed using single variables over time or area and fail to take into consideration the relationships between multiple variables. This work indicates that breaking sets of multiple fire-related variables into clusters provides a way to better characterize regional characteristics as well as spatial changes through time.

I did find the data manipulation required for the method a bit tricky. Having to work around the na values within the data required a bit of labor. Converting matrix data to vector data for the methods was another minor inconvenience. All that said, however, I believe such manipulations will become second nature as my experience with R grows.

Appendix A: Annotated R commands used for the analysis of normalized input data.
## Cluster analysis on normalized data.
## The dataset is drawn from two files, one for the 20th
## century and one for the 21st
## Install the necessary libraries

library("DescTools") # for displaying data
library("ncdf4") # reading and writing netCDF files

## Set the current working directory for data and results

setwd('/Users/timsheehan/Projects/MACA/FireClustering/Analysis')

## Open the NetCDF files for reading

histnc = nc_open('SummaryOver_1901-2000.nc')
futnc = nc_open('SummaryOver_2001-2100.nc')

## Get the lat and lon variables

lon = ncvar_get(histnc,'lon')
lat = ncvar_get(histnc,'lat')

## The lat dimension is highest to lowest in the input
## file, so its vector needs to be reversed.
## Reverse lat into ascending order (lower case r rev)

lat = rev(lat)
## Create the data frames for historical, future, and difference data

normHist = expand.grid(lon=lon, lat=lat)
normFut = expand.grid(lon=lon, lat=lat)
normDiff = expand.grid(lon=lon, lat=lat)

## Get the fields from the input files
## Note the lat dimension comes in from high to low, so
## it needs to be reversed so that R can display it (upper case R Rev)

## Consumed is the annual mean for carbon consumed
## PBurn is the annual mean of fraction of area burned
## FRI is the mean number of years between fires

normHist$Consumed = c(Rev(ncvar_get(histnc,'CONSUMED_mean'),margin = 2))
normHist$PBurn = c(Rev(ncvar_get(histnc,'PART_BURN_mean'),margin = 2))
normHist$FRI = c(Rev(ncvar_get(histnc,'PART_BURN_intvl'),margin = 2))

normFut$Consumed = c(Rev(ncvar_get(futnc,'CONSUMED_mean'),margin = 2))
normFut$PBurn = c(Rev(ncvar_get(futnc,'PART_BURN_mean'),margin = 2))
normFut$FRI = c(Rev(ncvar_get(futnc,'PART_BURN_intvl'),margin = 2))

## Normalize the values prior to doing the analysis
## Also get the z scores for distribution plotting later
## Note that NA values from the data must be omitted before
## taking the statistics

## Loop over the three variables used in the clusters
## For each one, filter out cells with no data using na.omit
## Normalize the data based on the min and max value of each
## variable.

for(nm in c('Consumed','PBurn','FRI')) {
## Temporary data for efficient processing
tmpVect = c(na.omit(normHist[[nm]]),na.omit(normFut[[nm]]))
tmpMin = min(tmpVect)
tmpMax = max(tmpVect)
tmpDiff = tmpMax - tmpMin
tmpMean = mean(tmpVect)
tmpSD = sd(tmpVect)

## Normalize the data
normHist[[nm]] = (normHist[[nm]] - tmpMin) / tmpDiff
normFut[[nm]] = (normFut[[nm]] - tmpMin) / tmpDiff

## Z scores
normHist[[paste(nm,'_Z',sep='')]] = (normHist[[nm]] - tmpMean) / tmpSD
normFut[[paste(nm,'_Z',sep='')]] = (normFut[[nm]] - tmpMean) / tmpSD
}

## Create the data frame for clustering
## Again, we omit data with no value using na.omit

normWorkDF = data.frame('Consumed' = c(na.omit(normHist$Consumed),na.omit(normFut$Consumed)))
normWorkDF$PBurn = c(na.omit(normHist$PBurn),na.omit(normFut$PBurn))
normWorkDF$FRI = c(na.omit(normHist$FRI),na.omit(normFut$FRI))

## Perform the clustering analysis
## This is the point of all the work being done here
## The Lloyd algorithm runs efficiently on the large dataset
## (~100k points)

normCluster = kmeans(
data.frame(normWorkDF$Consumed,normWorkDF$PBurn,normWorkDF$FRI),
4,
algorithm="Lloyd",
iter.max=500
)

## Copy the clusters back to the work data frame

normWorkDF$Clust = normCluster$cluster

# Plot the cluster distributions

for(ndx in 1:4) {
fNm = paste(
'/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormClust_',
ndx,
'.png',
sep=''
)
png(fNm)
tmpDF = subset(normWorkDF,normWorkDF$Clust == ndx)
plot(
density(tmpDF$Consumed),
xlim=c(0,1),
ylim=c(0.0,6.0),
col='darkblue',
xlab='Normalized Value',
main=paste('Cluster',ndx,'Distribution')
)
lines(density(tmpDF$PBurn),xlim=c(0,1),col='green')
lines(density(tmpDF$FRI),xlim=c(0,1),col='red')
legend(
'topright',
legend=c('C Consumed','Fract. Area Burned','Time Betw. Fires'),
lwd=1,
col=c('darkblue','green','red')
)
dev.off()
}

## Add the cluster numbers into the original data frames
## Note that NA needs to be taken into account using
## [!is.na(hist$Consumed)] to get the indexes of those data
## items that are not NA

## Historical values are in first half of data frame, future in second
## Calculate indexes for clarity

histClustStartNdx = 1
histClustEndNdx = length(normCluster$cluster)/2
futClustStartNdx = length(normCluster$cluster)/2 + 1
futClustEndNdx = length(normCluster$cluster)

normHist$Clust = NA
normHist$Clust[!is.na(normHist$Consumed)] = normWorkDF$Clust[histClustStartNdx:histClustEndNdx]

normFut$Clust = NA
normFut$Clust[!is.na(normFut$Consumed)] = normWorkDF$Clust[futClustStartNdx:futClustEndNdx]

## Create matrices for results and display
## png() tells R what file to store the results to

png('/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormHistClustMap.png')
normHistClustArr = matrix(normHist$Clust,nrow=331,ncol=169)
image(lon,lat,normHistClustArr)
dev.off()

png('/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormFutClustMap.png')
normFutClustArr = matrix(normFut$Clust,nrow=331,ncol=169)
image(lon,lat,normFutClustArr)
dev.off()

## Calculate Euclidean distance between cluster centers

normClustCtrEucDist = data.frame(cbind(rep(0,4),rep(0,4),rep(0,4),rep(0,4)))

for(ndx_1 in 1:4) {
for(ndx_2 in 1:4) {
normClustCtrEucDist[ndx_1,ndx_2] =
sqrt(
(normCluster$centers[ndx_1,1] - normCluster$centers[ndx_2,1]) ^ 2 +
(normCluster$centers[ndx_1,2] - normCluster$centers[ndx_2,2]) ^ 2 +
(normCluster$centers[ndx_1,3] - normCluster$centers[ndx_2,3]) ^ 2
)
}
}

## Create Fire Regime Euclidean distance map between historical and future

normDiff$HistClust = normHist$Clust
normDiff$FutClust = normFut$Clust
normDiff$EucDist = NA

normDiff$EucDist = mapply(
function(x,y) ifelse(is.na(y), NA, normClustCtrEucDist[x,y]),
normDiff$HistClust,
normDiff$FutClust
)

normDiffEucDistArr = matrix(normDiff$EucDist,nrow=331,ncol=169)
image(lon,lat,normDiffEucDistArr)

## Create NetCDF file and store results there. This yields a
## NetCDF file that can be used independently of R

x = ncdim_def('lon','degrees',lon)
## Reverse the order of latitude values
y = ncdim_def('lat','degrees',rev(lat))

histFCNCVar = ncvar_def('HistFireCluster','classification',list(x,y),-9999,prec='short')
futFCNCVar = ncvar_def('FutFireCluster','classification',list(x,y),-9999,prec='short')
eucDistNCVar = ncvar_def('EucDistance','NormalizedDistance',list(x,y),9.9692099683868690e+36,prec='float')

# Reverse the lat order of the values being output
OutHistClust = c(Rev(matrix(normDiff$HistClust,ncol=length(lat),nrow=length(lon)),margin=2))
OutFutClust = c(Rev(matrix(normDiff$FutClust,ncol=length(lat),nrow=length(lon)),margin=2))
OutEucDist = c(Rev(matrix(normDiff$EucDist,ncol=length(lat),nrow=length(lon)),margin=2))

## Output the variables and close the file

ncOut = nc_create('NormClusters.nc',list(histFCNCVar,futFCNCVar,eucDistNCVar))

ncvar_put(ncOut,histFCNCVar,OutHistClust)
ncvar_put(ncOut,futFCNCVar,OutFutClust)
ncvar_put(ncOut,eucDistNCVar,OutEucDist)

nc_close(ncOut)

IMG_20160522_121456546

Background & data used

As a practice, genecological studies such as this one utilize common gardens to monitor phenotypic measurements of plants from a dispersed set of populations in a common environment. By collecting seed from a variety of populations across the range of the species and growing them in a common garden, the effects of site-to-site variation are better controlled. Ultimately, the phenotypic traits of each plant observed at all of the common gardens are tied back to the seed source population so that we might learn how population-level selective pressures translate into divergent traits (even when these plants are grown in a variety of climates). In theory, planting the same seed source population in multiple climate regimes will illuminate which traits are plastic (i.e., change within one lifetime of a plant) and which traits change over generational time. For example, if a seed source population always has wide leaves no matter what common garden it is planted in, while plants from other seed source populations always have narrow leaves, then it is possible to conclude that leaf width changes over generational/evolutionary time. This is important because if plants have been adapting to selective pressures over many generations and are then moved to differing conditions, the traits they have developed might be maladapted to the new conditions. My dataset includes 39 population means (4-5 individuals each) for each of sixteen gardens for several phenotypic traits such as leaf width and crown width.

One of the problems I hope to deal with is how to both quantify and visualize bluebunch wheatgrass phenotypic trait variability across soil gradients. One technique that may be useful is multiple linear regression. One drawback of my dataset, however, is that many soil variables are spatially autocorrelated. This lack of independence violates one of the assumptions of multiple linear regression. In order to deal with the spatially related variables in my dataset, it seemed appropriate to use a regression method that accounts for this spatial relationship.

Question

Is there a relationship between available water storage (AWS) in the top 25cm of the soil surface and the leaf width in bluebunch wheatgrass populations?

Approach and tools used

In order to make my results easier to interpret, I have restricted my analysis to a single common garden that contains 39 different bluebunch wheatgrass populations.

Tools:

  1. Geographically weighted regression tool in ArcMap
  2. Spatial join tool in ArcMap
  3. Join field tool in ArcMap
  4. Hot Spot Analysis tool in ArcMap

Steps followed to complete the analysis

  1. Surveyed the gSSURGO database for continuously variable soil metrics that might be relevant to plant growth (e.g. available water storage).
  2. Joined AWS tabular data to gSSURGO raster data for the entire study area.
  3. Spatially joined soils tabular data to common garden source population x/y coordinates, and plant phenotypic traits data from common garden monitoring.
  4. Selected a common garden that had existing soils data for all seed source populations.
  5. Reduced dataset to only include soils and trait data for the Wahluke common garden in seed zone 1.
  6. Performed geographically weighted regression analysis.
    1. Analysis failed initially due to the wide geographic spread of populations.
    2. Set the minimum number of populations to be used in analysis to 8 (5 was too few, and 10 produced a smaller p-value).
  7. Exported regression results to a .csv table
  8. Joined local regression coefficients to seed source population shape file attribute table.
  9. Symbolized local regression coefficients according to sign (negative = blue, positive = red)
  10. Performed a hotspot analysis on mapped regression coefficients to look for significant clustering

Results

  • Positive relationships between leaf width and available water storage occur in moderately wet areas
  • Negative relationships between leaf width and available water storage occur in moderately wet to dry areas.
  • Negative coefficients tend to be near other negative coefficients spatially
  • Positive coefficients tend to be near other positive coefficients as well

GWR.AWS25.Leafw.WahlukeN1

 

 

neg.GWR.AWS25.Leafw.WahlukeN1.WahlukeN1

 

pos.GWR.AWS25.Leafw.WahlukeN1

Critique

  • Limited to the analysis of one plant trait and one common garden at a time.
  • The resolution and accuracy of the AWS data is questionable.
  • The AWS values may not vary enough to be biologically significant.