Introduction

I am working with a dataset from a dynamic global vegetation model (DGVM) run across the Pacific Northwest (PNW) over the time period 1895-2100. This is a process-based model that includes a dynamic fire model. I am particularly interested in drivers of the fire regimes (the frequency and intensity of wildfire) produced by the model. The general question I am studying is “what, if any, is the relationship between the climatic and elevation inputs of the model and the fire regimes produced by the model?” A more specific pair of questions for this method is “Do fire regimes cluster locally?” and “How do fire regimes change over time?”

Method description

Overview

I divided the data into two time periods: 1901-2000 and 2001-2100 to capture the historical and projected climates and fire regimes over those periods. To define a fire regime, I chose the two fire-related output variables – carbon consumed by fire and percent of model pixel burned – along with fire return interval (FRI) over the century time period. I used k-means cluster analysis in R to define four fire regimes.

K-means clustering divides a dataset into a specified number of clusters and calculates centroids and cluster membership such that the Euclidean distance between each cluster's centroid and its members is minimized. The relative scales of the input variables should be taken into consideration in this type of analysis, since they affect the Euclidean distances calculated. The steps in the analysis are outlined below, and the actual commands used for the analysis are given in Appendix A.
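As a toy illustration of why scale matters (this is not the project data and is separate from the Appendix A code), compare the Euclidean distances k-means would see with raw values and with min-max normalized values:

## Two artificial points spanning roughly the ranges reported in the Results:
## raw distances are dominated by the variable with the largest range.
x <- data.frame(consumed = c(0, 250), pburn = c(0, 1), fri = c(1, 100))
dist(x)      # dominated by 'consumed'

## Min-max normalization puts every variable on a 0-1 scale first.
xNorm <- as.data.frame(lapply(x, function(v) (v - min(v)) / (max(v) - min(v))))
dist(xNorm)  # each variable now contributes comparably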

Outline of steps

For each of the two input files:

  1. Open the NetCDF file
  2. Reverse the order of the latitude values (from highest-to-lowest to lowest-to-highest)
  3. Filter out the NA (no data) values
  4. Copy the data (carbon consumed by fire, fraction of pixel burned, FRI) into a combined dataset

For the combined dataset:

  1. Normalize values for each variable to 0.0 to 1.0 using the minimum and maximum of each variable in the combined dataset.
  2. Generate the four clusters
  3. Output value distributions of the four clusters
  4. Distribute resulting data into two output datasets, one for each of the two time periods
    1. Divide into two separate datasets
    2. Distribute results into non-NA data locations

For each of the output datasets:

  1. Create output NetCDF file
  2. Write data to NetCDF file
  3. Close NetCDF file

 

Results

Clusters

With my input data, the FRI values range from a minimum of 1 to a maximum of 100 years. The mean annual fraction of area burned has a theoretical range from 0 to 1, and the mean annual carbon consumed ranges from 0 to 266.5 g m-2. Performing the cluster analysis using the original input values resulted in 3 of the 4 clusters being driven primarily by the values of carbon consumed (Fig. 1a-c). Normalizing the values of each input variable using the variable's minimum and maximum resulted in clusters driven by different variables (Fig. 2). For the remainder of the project I am using normalized variables.

SheehanCluster01

Fig. 1: Z-score distributions of input variables within clusters, for clusters computed without normalizing input data. Clusters 1, 2, and 3 (A-C) exhibit strong overlaps in both fraction of area burned and time between fires, and strong shifts in carbon consumed between clusters. Cluster 4 (D) is the only cluster in which all three factors differ substantially from the other clusters.

SheehanCluster02

Fig. 2: Distributions of normalized input variables within clusters. Each of the clusters was created by performing cluster analysis using normalized input data. All clusters have substantial shifts in the distributions of at least two factors compared to the other clusters.

Result maps

The clustering of fire regimes can easily be seen in maps of the fire regimes (Fig. 3). Areas without fire tend to be at higher elevations: in the Coast Range and Cascades of Oregon and Washington and the Rockies of western Montana and northwestern Idaho. High-frequency fires (low FRI) are common in the plains and plateaus of southern Idaho, south and central Oregon, and the Columbia Plateau in central Washington. The other two fire regimes, both of which have a medium FRI, are somewhat intermingled, but are present at mid elevations in the Rockies, in southwest Oregon, and in the Willamette Valley and Puget Trough regions of Oregon and Washington, respectively.

SheehanCluster03

Fig. 3: Fire regime maps for (A) the historical period (1901-2000) and (B) the future period (2001-2100). (C: carbon consumed; Med: medium; Frac: fraction of grid cell; No Fire: no fire occurred in the cell over the time period)

Fire regime changes from the historical (Fig. 3A) to the future (Fig. 3B) period include the appearance of the high carbon consumed, medium fraction burned, medium FRI fire regime around the edges of the plains and plateaus and into the Blue Mountains of northeastern Oregon, as well as the spread of both medium FRI fire regimes into the Coast, Cascade, and Rocky Mountain ranges.

Method critique

I found this method to be very useful for my study. Fire and vegetation modeling results are commonly expressed using single variables over time or area and fail to take into consideration the relationships between multiple variables. This work indicates that breaking sets of multiple fire-related variables into clusters provides a way to better characterize regional fire regimes as well as their spatial changes through time.

I did find the data manipulation required for the method a bit tricky. Having to work around the na values within the data required a bit of labor. Converting matrix data to vector data for the methods was another minor inconvenience. All that said, however, I believe such manipulations will become second nature as my experience with R grows.

Appendix A: Annotated R commands used for the analysis of normalized input data.
## Cluster analysis on normalized data.
## The dataset is drawn from two files, one for the 20th
## century and one for the 21st
## Install the necessary libraries

library("DescTools") # for displaying data
library("ncdf4") # reading and writing netCDF files

## Set the current working directory for data and results

setwd('/Users/timsheehan/Projects/MACA/FireClustering/Analysis')

## Open the NetCDF files for reading

histnc = nc_open('SummaryOver_1901-2000.nc')
futnc = nc_open('SummaryOver_2001-2100.nc')

## Get the lat and lon variables

lon = ncvar_get(histnc,'lon')
lat = ncvar_get(histnc,'lat')

## The lat dimension is highest to lowest in the input
## file, so its vector needs to be reversed.
## Reverse lat into ascending order (lower case r rev)

lat = rev(lat)
## Create the data frames for historical, future, and difference data

normHist = expand.grid(lon=lon, lat=lat)
normFut = expand.grid(lon=lon, lat=lat)
normDiff = expand.grid(lon=lon, lat=lat)

## Get the fields from the input files
## Note the lat dimension comes in from high to low, so
## it needs to be reversed so that R can display it (upper case R Rev)

## Consumed is the annual mean for carbon consumed
## PBurn is the annual mean of fraction of area burned
## FRI is the mean number of years between fires

normHist$Consumed = c(Rev(ncvar_get(histnc,'CONSUMED_mean'),margin = 2))
normHist$PBurn = c(Rev(ncvar_get(histnc,'PART_BURN_mean'),margin = 2))
normHist$FRI = c(Rev(ncvar_get(histnc,'PART_BURN_intvl'),margin = 2))

normFut$Consumed = c(Rev(ncvar_get(futnc,'CONSUMED_mean'),margin = 2))
normFut$PBurn = c(Rev(ncvar_get(futnc,'PART_BURN_mean'),margin = 2))
normFut$FRI = c(Rev(ncvar_get(futnc,'PART_BURN_intvl'),margin = 2))

## Normalize the values prior to doing the analysis
## Also get the z scores for distribution plotting later
## Note that NA values from the data must be omitted before
## taking the statistics

## Loop over the three variables used in the clusters
## For each one, filter out cells with no data using na.omit
## Normalize the data based on the min and max value of each
## variable.

for(nm in c('Consumed','PBurn','FRI')) {
  ## Temporary data for efficient processing
  tmpVect = c(na.omit(normHist[[nm]]),na.omit(normFut[[nm]]))
  tmpMin = min(tmpVect)
  tmpMax = max(tmpVect)
  tmpDiff = tmpMax - tmpMin
  tmpMean = mean(tmpVect)
  tmpSD = sd(tmpVect)

  ## Normalize the data
  normHist[[nm]] = (normHist[[nm]] - tmpMin) / tmpDiff
  normFut[[nm]] = (normFut[[nm]] - tmpMin) / tmpDiff

  ## Z scores
  normHist[[paste(nm,'_Z',sep='')]] = (normHist[[nm]] - tmpMean) / tmpSD
  normFut[[paste(nm,'_Z',sep='')]] = (normFut[[nm]] - tmpMean) / tmpSD
}

## Create the data frame for clustering
## Again, we omit data with no value using na.omit

normWorkDF = data.frame('Consumed' = c(na.omit(normHist$Consumed),na.omit(normFut$Consumed)))
normWorkDF$PBurn = c(na.omit(normHist$PBurn),na.omit(normFut$PBurn))
normWorkDF$FRI = c(na.omit(normHist$FRI),na.omit(normFut$FRI))

## Perform the clustering analysis
## This is the point of all the work being done here
## The Lloyd algorithm runs efficiently on the large dataset
## (~100k points)

normCluster = kmeans(
  data.frame(normWorkDF$Consumed,normWorkDF$PBurn,normWorkDF$FRI),
  4,
  algorithm="Lloyd",
  iter.max=500
)

## Copy the clusters back to the work data frame

normWorkDF$Clust = normCluster$cluster

# Plot the cluster distributions

for(ndx in 1:4) {
  fNm = paste(
    '/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormClust_',
    ndx,
    '.png',
    sep=''
  )
  png(fNm)
  tmpDF = subset(normWorkDF,normWorkDF$Clust == ndx)
  plot(
    density(tmpDF$Consumed),
    xlim=c(0,1),
    ylim=c(0.0,6.0),
    col='darkblue',
    xlab='Normalized Value',
    main=paste('Cluster',ndx,'Distribution')
  )
  lines(density(tmpDF$PBurn),xlim=c(0,1),col='green')
  lines(density(tmpDF$FRI),xlim=c(0,1),col='red')
  legend(
    'topright',
    legend=c('C Consumed','Fract. Area Burned','Time Betw. Fires'),
    lwd=1,
    col=c('darkblue','green','red')
  )
  dev.off()
}

## Add the cluster numbers into the original data frames
## Note that NA needs to be taken into account using
## [!is.na(normHist$Consumed)] to get the indexes of those data
## items that are not NA

## Historical values are in first half of data frame, future in second
## Calculate indexes for clarity

histClustStartNdx = 1
histClustEndNdx = length(normCluster$cluster)/2
futClustStartNdx = length(normCluster$cluster)/2 + 1
futClustEndNdx = length(normCluster$cluster)

normHist$Clust = NA
normHist$Clust[!is.na(normHist$Consumed)] = normWorkDF$Clust[histClustStartNdx:histClustEndNdx]

normFut$Clust = NA
normFut$Clust[!is.na(normFut$Consumed)] = normWorkDF$Clust[futClustStartNdx:futClustEndNdx]

## Create matrices for results and display
## png() tells R what file to store the results to

png('/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormHistClustMap.png')
normHistClustArr = matrix(normHist$Clust,nrow=331,ncol=169)
image(lon,lat,normHistClustArr)
dev.off()

png('/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormFutClustMap.png')
normFutClustArr = matrix(normFut$Clust,nrow=331,ncol=169)
image(lon,lat,normFutClustArr)
dev.off()

## Calculate Euclidean distance between cluster centers

normClustCtrEucDist = data.frame(cbind(rep(0,4),rep(0,4),rep(0,4),rep(0,4)))

for(ndx_1 in 1:4) {
  for(ndx_2 in 1:4) {
    normClustCtrEucDist[ndx_1,ndx_2] =
      sqrt(
        (normCluster$centers[ndx_1,1] - normCluster$centers[ndx_2,1]) ^ 2 +
        (normCluster$centers[ndx_1,2] - normCluster$centers[ndx_2,2]) ^ 2 +
        (normCluster$centers[ndx_1,3] - normCluster$centers[ndx_2,3]) ^ 2
      )
  }
}

## Create Fire Regime Euclidean distance map between historical and future

normDiff$HistClust = normHist$Clust
normDiff$FutClust = normFut$Clust
normDiff$EucDist = NA

normDiff$EucDist = mapply(
function(x,y) ifelse(is.na(y), NA, normClustCtrEucDist[x,y]),
normDiff$HistClust,
normDiff$FutClust
)

normDiffEucDistArr = matrix(normDiff$EucDist,nrow=331,ncol=169)
image(lon,lat,normDiffEucDistArr)

## Create NetCDF file and store results there. This yields a
## NetCDF file that can be used independently of R

x = ncdim_def('lon','degrees',lon)
## Reverse the order of latitude values
y = ncdim_def('lat','degrees',rev(lat))

histFCNCVar = ncvar_def('HistFireCluster','classification',list(x,y),-9999,prec='short')
futFCNCVar = ncvar_def('FutFireCluster','classification',list(x,y),-9999,prec='short')
eucDistNCVar = ncvar_def('EucDistance','NormalizedDistance',list(x,y),9.9692099683868690e+36,prec='float')

# Reverse the lat order of the values being output
OutHistClust = c(Rev(matrix(normDiff$HistClust,ncol=length(lat),nrow=length(lon)),margin=2))
OutFutClust = c(Rev(matrix(normDiff$FutClust,ncol=length(lat),nrow=length(lon)),margin=2))
OutEucDist = c(Rev(matrix(normDiff$EucDist,ncol=length(lat),nrow=length(lon)),margin=2))

## Output the variables and close the file

ncOut = nc_create('NormClusters.nc',list(histFCNCVar,futFCNCVar,eucDistNCVar))

ncvar_put(ncOut,histFCNCVar,OutHistClust)
ncvar_put(ncOut,futFCNCVar,OutFutClust)
ncvar_put(ncOut,eucDistNCVar,OutEucDist)

nc_close(ncOut)


Background & data used

As a practice, genecological studies such as this one utilize common gardens to monitor phenotypic measurements of plants from a dispersed set of populations in a common environment. By collecting seed from a variety of populations across the range of the species and growing them in a common garden, the effects of site-to-site variation are better controlled. Ultimately, the phenotypic traits of each plant observed at all of the common gardens are meant to be tied back to the seed source population, so that we might learn how population-level selective pressures translate to divergent traits (even when these plants are grown in a variety of climates). In theory, planting the same seed source population in multiple climate regimes will illuminate which traits are plastic (i.e. change within one lifetime of a plant) and which traits change over generational time. For example, if a seed source population always produces wide leaves no matter which common garden it is planted in, while plants from other seed source populations always have narrow leaves, then it is possible to conclude that leaf width changes over generational/evolutionary time. This is important because if plants have been adapting to selective pressures over many generations and are then moved to differing conditions, the traits they have developed might be maladapted to the new conditions. My dataset includes 39 population means (4-5 individuals each) for each of sixteen gardens for several phenotypic traits such as leaf width, crown width, and others.

One of the problems I hope to deal with is how to both quantify and visualize bluebunch wheatgrass phenotypic trait variability across soil gradients. One technique that may be useful is multiple linear regression. One drawback of my dataset, however, is that many soil variables are spatially autocorrelated. This lack of independence violates one of the assumptions of multiple linear regression. In order to deal with the spatially related variables in my dataset, it seemed appropriate to use a regression method that accounts for this spatial relationship.

Question

Is there a relationship between available water storage (AWS) in the top 25 cm of the soil and leaf width in bluebunch wheatgrass populations?

Approach and tools used

In order to make my results easier to interpret, I have restricted my analysis to a single common garden that contains 39 different bluebunch wheatgrass populations.

Tools:

  1. Geographically weighted regression (GWR) analysis tool in ArcMap
  2. Spatial join tool in ArcMap
  3. Join field tool in ArcMap
  4. Hot spot analysis tool in ArcMap

Steps followed to complete the analysis

  1. Surveyed the gSSURGO database for continuously variable soil metrics that might be relevant to plant growth (e.g. available water storage).
  2. Joined AWS tabular data to gSSURGO raster data for the entire study area.
  3. Spatially joined soils tabular data to common garden source population X Y coordinates and plant phenotypic trait data from common garden monitoring.
  4. Selected a common garden that had existing soils data for all seed source populations.
  5. Reduced dataset to only include soils and trait data for the Wahluke common garden in seed zone 1.
  6. Performed geographically weighted regression analysis (a rough R-based sketch of this step is given after this list).
    1. The analysis initially failed due to the wide geographic spread of populations.
    2. Set the minimum number of populations used in the analysis to 8 (5 was too few, and 10 produced a smaller p-value).
  7. Exported regression results to a .csv table
  8. Joined local regression coefficients to seed source population shape file attribute table.
  9. Symbolized local regression coefficients according to sign (negative = blue, positive = red)
  10. Performed a hotspot analysis on mapped regression coefficients to look for significant clustering
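The GWR itself was run in ArcMap. As a rough, illustrative R-based sketch of the same idea (not the workflow actually used), the spgwr package could be applied to a hypothetical data frame 'garden' with columns leaf_width, aws_25cm, lon, and lat for the 39 Wahluke populations:

## Rough R analogue of the ArcMap GWR steps 6-9, for illustration only.
library(sp)
library(spgwr)

coordinates(garden) <- ~ lon + lat   # promote to a SpatialPointsDataFrame

## Pick an adaptive bandwidth (the fraction of neighbors used in each local fit),
## similar in spirit to setting a minimum number of populations per regression.
bw <- gwr.sel(leaf_width ~ aws_25cm, data = garden, adapt = TRUE)

fit <- gwr(leaf_width ~ aws_25cm, data = garden, adapt = bw)

## Local coefficients are returned in fit$SDF; mapping their sign mirrors step 9.
summary(fit$SDF$aws_25cm)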

Results

  • Positive relationships between leaf width and available water storage occur in moderately wet areas
  • Negative relationships between leaf width and available water storage occur in moderately wet to dry areas.
  • Negative coefficients tend to be near other negative coefficients spatially
  • Positive coefficients tend to be near other positive coefficients as well

GWR.AWS25.Leafw.WahlukeN1

 

 

neg.GWR.AWS25.Leafw.WahlukeN1.WahlukeN1

 

pos.GWR.AWS25.Leafw.WahlukeN1

Critique

  • Limited to the analysis of one plant trait and one common garden at a time.
  • The resolution and accuracy of the AWS data is questionable.
  • The AWS values may not vary enough to be biologically significant.

Though diving into R can be a bit intimidating, only a basic understanding of R is required to begin analyzing your spatial data. First you need to install the right packages; there are a lot of spatial analysis packages in R, which can be found here: https://cran.r-project.org/web/views/Spatial.html . However, the ones I'll be using are "rgdal", "rgeos", "maptools", "dplyr", "tidyr", and "tmap".
You can install these with the following command (remove the # sign to run it):
#install.packages(c("rgdal", "rgeos", "maptools", "dplyr", "tidyr", "tmap"))

In this example we'll be playing with my data, which you will not have access to, so here is a tutorial that will definitely help you get started if you are really interested: https://cran.r-project.org/doc/contrib/intro-spatial-rl.pdf
First we need to make sure the right libraries (packages) are loaded. If you Google each one of these followed by "in R", the help documentation for the package and what it does will come up. However, I will mention that the package 'sp' is quite important; it is the basis behind all the other spatial analysis packages here, since it defines the spatial data classes in R. It allows R to recognize and deal with spatial vector data and raster data.
x <- c("rgdal", "rgeos", "maptools", "dplyr", "tidyr", "tmap")
lapply(x, library, character.only=T)

Now that we have the packages, let's go ahead and read in my data. The first file I have is just a simple .csv file of different soil properties. read.csv is one way to import .csv data into R.
##reading in a .csv file with all the attributes
NRCS_data <- read.csv ("Data/NRCS_Data/Agg_NRCS_Data.csv", header=T, stringsAsFactors = F)

Next we are going to use the readOGR function from the rgdal package.
##This is a polygon shapefile
CRA <- readOGR(dsn="Data/NRCS_Data", layer="Final_CRA")
## OGR data source with driver: ESRI Shapefile
## Source: "Data/NRCS_Data", layer: "Final_CRA"
## with 60 features
## It has 4 fields
CRA
## class : SpatialPolygonsDataFrame
## features : 60
## extent : -124.566, -116.463, 41.99208, 46.28576 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0
## variables : 4
## names : OBJECTID, Final_CRA, Shape_Leng, Shape_Area
## min values : 1, 1.1, 2.338222, 0.03185210
## max values : 60, 9.9, 53.866012, 3.83380590

It is important to note that readOGR takes two parameters: dsn= is the folder where the shapefile is saved, and layer= is the name of the shapefile you want to read in. We are reading in the shapefile "Final_CRA", which is a polygon shapefile, and storing it in R under the name CRA. You can see it is a SpatialPolygonsDataFrame. There are two really important slots in this data format: CRA@polygons and CRA@data. @polygons holds the information on the polygons and their coordinates, while @data holds their attribute table.
head(CRA@data, n=3)
## OBJECTID Final_CRA Shape_Leng Shape_Area
## 0 1 1.1 8.832671 0.458048
## 1 2 1.6 17.469240 1.406368
## 2 3 10.1 24.579325 1.115243

You can see there isn't any useful information about soil parameters associated with each polygon; however, the NRCS_data has interesting information that we want to link to these polygons.
##aggregating my attributes by pH, finding the mean pH for each CRA
agg_NRCS_data <- aggregate(ph_1_to_1_H20~CRA, FUN=mean, data=NRCS_data)
agg_NRCS_data <- rename(agg_NRCS_data, Final_CRA = CRA)
head(agg_NRCS_data, n=3)
## Final_CRA ph_1_to_1_H20
## 1 1.1 5.150000
## 2 1.6 5.153333
## 3 10.1 7.153333

This data has pH values and their corresponding CRA.
Using a package called dplyr, I am able to attach the pH data to the CRA polygon data.
##merging the 2 datasets together by CRA
CRA@data <- left_join(CRA@data, agg_NRCS_data)
## Joining by: "Final_CRA"
## Warning in left_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
CRA@data <- rename(CRA@data, pH_water = ph_1_to_1_H20)
head(CRA@data, n=3)
## OBJECTID Final_CRA Shape_Leng Shape_Area pH_water
## 1 1 1.1 8.832671 0.458048 5.150000
## 2 2 1.6 17.469240 1.406368 5.153333
## 3 3 10.1 24.579325 1.115243 7.153333

Simple: now the polygon dataframe has some interesting metadata associated with each polygon. However, suppose you want to create a point-layer spatial dataframe instead of merging the data into your polygon dataframe, which you can do with the sp package. The function SpatialPointsDataFrame is quite particular in its required parameters; it requires the coordinates to be in matrix format, so I am pulling the corresponding columns from the NRCS_data dataframe and converting them into a matrix.
##Creates a matrix of lat long coord
NRCS_mat <- with(NRCS_data, cbind(longitude, latitude))
row.names(NRCS_mat)<- 1:nrow(NRCS_mat)
##Takes the projection from the CRA shapefile
CRA.projection <- CRS(print(proj4string(CRA)))
## [1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"
##Creates a spatial points dataframe with the same projection as our CRA shapefile
NRCS <- SpatialPointsDataFrame(coords= NRCS_mat, data = NRCS_data, proj4string = CRA.projection)

These last two lines of code take the projection of my CRA polygon data and use it for my newly formed point dataframe.
Since both spatial dataframes are unprojected, I use an EPSG code (found by googling) that corresponds to Oregon to project them, creating new, projected spatial dataframes with the function spTransform().
NRCS.projected <- spTransform(NRCS, CRS("+init=epsg:2992"))
CRA.projected <- spTransform(CRA, CRS("+init=epsg:2992"))

Lastly, we want to see what these layers look like, so I use the tmap package for plotting, which works a lot like ggplot2. qtm is just like qplot in ggplot2: I tell it the layer I want to plot and that I want to fill the polygons with their different pH values.
qtm(shp=CRA, fill="pH_water")

qtm_pH

tm_shape(CRA.projected)+ tm_fill("pH_water")+tm_borders()+tm_shape(NRCS.projected)+ tm_bubbles("red", size = .1, border.col="black")

pH_map_2

The second line of code actually plots two layers: the first is the polygon dataframe and the second is the point dataframe. You do this by playing with the customization of tmap; typing ?tmap will show just how customizable the tmap package is.
Hopefully, this tutorial helped you get started with spatial data in R.

Question:

Where are the statistically significant regions of high and low Alaska Skate catch rate from commercial longliners in the Bering Sea?

AK_skate_lgl_rate

Name of the tool or approach used.

ArcGIS: Hot Spot Analysis

Brief description of steps you followed to complete the analysis.

The dataset that I used to complete this analysis includes over 200 species of fish and four gear types, so I chose to limit the scope of this exercise to one species and one gear type. I narrowed the dataset by selecting Alaska skates caught by longliners and ran a hot spot analysis on this selection. Within the Hot Spot Analysis tool in Arc, I selected catch "RATE" as the input field for comparison, retained the default selections for the spatial relationship ("FIXED_DISTANCE_BAND") and distance method ("EUCLIDEAN_DISTANCE"), and ran the analysis.
Narrowing the temporal scope of the analysis, I ran it again on two subsets of the data: one on points from 1998-2007 and the second on points from 2008-2014.
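For readers working outside ArcGIS, a comparable Getis-Ord Gi* calculation can be sketched in R with the spdep package. This is illustrative only, not the tool actually used here: the data frame 'catch' with lon, lat, and RATE columns is hypothetical, and the 50 km band is a placeholder rather than the distance Arc chose.

## Hot spot (Getis-Ord Gi*) sketch in R
library(spdep)

xy <- as.matrix(catch[, c("lon", "lat")])

## Fixed distance band, including each point in its own neighborhood (Gi*)
nb <- include.self(dnearneigh(xy, d1 = 0, d2 = 50, longlat = TRUE))
lw <- nb2listw(nb, style = "B", zero.policy = TRUE)

## Large positive z-scores mark hot spots; large negative ones mark cold spots
gi_star <- localG(catch$RATE, lw)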

Brief description of results obtained.

AK_skate-lgl_hotspot
Alaska Skate catch rate hot spot analysis 1998-2014

The original analysis includes AK skate catch rates on Bering Sea longliners from 1998-2014 and reveals some interesting patterns about fishing behavior as well as where skate catch rates are higher. There is a clear line of cold spots along the southern edge of the oceanic shelf that runs north of the Aleutian Islands. A few distinct hot spots are also revealed in the western Aleutians near Adak and Atka islands.

AK_skate-lgl_hotspot98-07
Hot spot analysis on Skate catch rate from 1998-2007.
AK_skate-lgl_hotspo08-14t
Hot spot analysis of Skate catch rate 2008-2014

Changing the temporal scope of the analysis revealed some very different patterns from those of the full time period. A number of hypotheses can be developed for further exploration as to why the patterns change.

Critique of the method – what was useful, what was not?

For this dataset, I think that this tool is a quick and easy way to identify patterns at different scales and could be an approach to hypothesis generation. This dataset is one in which the sample points are independent of the researcher and span a large temporal and spatial range; these characteristics seem to make the tool more applicable.

My dependent variable will be "drop", developed in Han and Goetz (2015). It is the reciprocal of resistance and is calculated as the amount of impulse that a county experiences from a shock (the percentage deviation of actual employment from expected employment during the Great Recession).

droprebound

dropequation
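In words, drop is the percentage deviation of actual from expected employment. One plausible form, assumed here from that verbal definition rather than copied from Han and Goetz (2015), is

drop_i = (E_i^expected - E_i^actual) / E_i^expected

where E_i is county i's employment during the Great Recession.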

I want to find what factors affect drop. I therefore run an OLS regression using drop as the dependent variable and the following independent variables (a rough R sketch of this specification follows the variable list):

Income inequality: the Gini coefficient

Income distribution: poverty rate and the share of aggregate income held by households earning $200,000 or more

Control variables: Population growth rate from 2001-2005, % Black or African American (2000), % Hispanic or Latino (2000)

Capital Stock variables:

Human Capital: % population with Bachelor’s degree or higher, age group (20-29, 30-49), female civilian labor force participation rate (2000)

Natural Capital: natural amenity scale (1999)

Social Capital: social capital index (2005)

Built Capital(2000): median housing value (2000)

Financial Capital (2000): share of dividends, interest and rent(2000)

Economic structure:

Employment share of 20 two-digit NAICS industries (manufacturing, construction, etc.; other services (except public administration) omitted)
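As a rough sketch (not the actual estimation code), the OLS stage could be written in R as follows. The data frame 'counties' and all column names are hypothetical stand-ins for the variables listed above, with only a few of the 20 industry shares shown:

## Illustrative OLS specification for drop
ols <- lm(
  drop ~ gini + poverty_rate + share_income_200k_plus +
    pop_growth_01_05 + pct_black_2000 + pct_hispanic_2000 +
    pct_bachelors_plus + pct_age_20_29 + pct_age_30_49 + female_lfp_2000 +
    natural_amenity_1999 + social_capital_2005 + median_house_value_2000 +
    share_dividends_interest_rent_2000 +
    manufacturing_share + construction_share,   # ...remaining NAICS shares
  data = counties
)
summary(ols)   # coefficients and significance, summarized in the table below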

 

Significant variables and diagnostic tests results are shown in the following table.

Ind. Var                                    Coefficient
Population growth rate 2001-2005             0.205***
% Black or African American                  0.0542*
% Hispanic or Latino                        -0.113***
% pop with Bachelor's degree or higher      -0.132*
Natural amenity scale                        0.00728***
Manufacturing                               -0.112*
Wholesale trade                             -0.580***
Retail trade                                -0.620***
Transportation and warehousing              -0.337***
Information                                 -0.590**
Finance and insurance                       -0.710***
Educational services                        -0.894***
Health care and social assistance           -0.215***
Accommodation and food services             -0.244**
Government and government enterprises       -0.231***

Adjusted R-squared: 0.199   AICc: -4693.152798   #obs: 2770
Koenker (BP) Statistic***   Joint Wald Statistic***   Jarque-Bera Statistic***

Among the significant variables, I want to explore the population growth rate from 2001 to 2005 because it has the largest coefficient among the control variables.

Moreover, since the Koenker (BP) statistic is significant, the relationships are nonstationary. I therefore run a GWR for population growth rate 2001-2005.

GWR

The following maps show the distribution of population growth rate 2001-2005 and its GWR coefficients.

diistpopgrgwrpopgr2

For the distribution of population growth rate 2001-2005, blue regions show negative population growth, while the remaining counties are all growing in population. Orange and red counties have high population growth rates.

For the GWR coefficients, a negative relationship between population growth rate and drop exists in the blue, grey, and yellow regions: higher population growth, lower drop (employment loss).

A positive relationship exists in the orange and red regions: higher population growth, higher drop (employment loss).

My explanations are:

In an expanding economy (regions with high population growth), there are more people marginally attached to the labor market, and they were easily laid off during the Great Recession.

In regions with negative population growth, there are more deaths than births and population aging increases.

  • Older people have higher accumulated savings per head than younger people, but spend less on consumer goods.
  • Less available active labor
  • An increase in population growth will decrease employment loss.

However, my local R-squared values range from 0 to 0.620 with a mean of 0.0247. Only in Orange County and San Diego County, CA, is the local R-squared higher than 0.5 (0.603 and 0.620, respectively). This situation (low local R-squared) occurs for all my other significant control and capital stock variables. There might be misspecification problems in my model, so I will try adding new explanatory variables.

Population growth rate

  • Local R-squared, 0-0.620, mean 0.0247
  • Only in 06059 Orange County and 06073 San Diego County, CA, is local R-squared higher than 0.5 (0.603 and 0.620, respectively)

Natural amenity scale

  • Local R-squared, 0-0.519, mean 0.0216
  • 12087 Monroe County, Florida 0.519

% Black or African American

  • Local R-squared, 0-0.347, mean 0.0326
  • 26095 Luce County, Michigan, 0.347

% Hispanic or Latino

  • Local R-squared, 0-0.379, mean 0.0276
  • 35029, Luna County, New Mexico, 0.379

% pop with Bachelor’s degree or higher

  • Local R-squared, 0-0.391, mean 0.055
  • 06073 San Diego County, CA, 0.391

lomipopgr

Question

How is the occurrence of landslides in western Oregon related to the rate and timing of rainfall? The Northwest River Forecast Center (NWRFC) archives 6-hour rainfall accumulations from more than 2000 recording stations throughout the Pacific Northwest (Figure 1). The challenge is that landslides do not tend to occur in the immediate proximity of these stations, so spatial interpolation must be performed to estimate rainfall amounts at landslide locations.

StationLocations

Figure 1: Location of rainfall recording stations provided by the NWRFC.

The Tool and its Implementation

Kriging was selected over inverse distance weighting to better account for variations in rainfall from east to west. The kriging was performed in Matlab to allow better control over the selection of inputs and to simplify the task, which involved kriging every 6-hour measurement for December (124 surfaces).

Kriging works by weighting measured values using a semivariogram model. Several semivariogram models were examined to identify which would best fit the dataset (Figure 2).

Spherical Gaussian Exponential

Figure 2: Examples of semivariogram fits to NWRFC rainfall data.

Based on Figure 2, the exponential semivariogram appeared to be the best choice. Another option is the search radius (i.e. how far the model looks for data points); this value was also varied to illustrate the effect of a poor semivariogram fit (Figure 3).

Exponential Exponential020deg

Figure 3: Examples of varying the search radius (lag distance). Search radii from left to right: 1 degree, 5 degrees, 0.2 degrees.

Once each of the 124 surfaces was produced, values were extracted at the locations of major landslides from this past winter. The extracted values were later used to produce rainfall-versus-time plots, which are described in the next section.
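The kriging here was done in Matlab. For reference, a comparable single-surface workflow in R using the gstat package might look like the sketch below; the data frame 'gauges', its column names, and the prediction grid 'grid' are hypothetical.

## Illustrative ordinary kriging of one 6-hour accumulation (not the Matlab code used)
library(sp)
library(gstat)

coordinates(gauges) <- ~ lon + lat    # gauge locations with a 'rain' column

## Empirical semivariogram and an exponential model fit (cf. Figure 2)
vg   <- variogram(rain ~ 1, gauges)
vfit <- fit.variogram(vg, vgm("Exp"))

## Krige onto a prediction grid, then extract values at landslide locations
pred <- krige(rain ~ 1, gauges, newdata = grid, model = vfit)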

Results

To simplify results for this project, only one landslide is shown. The Highway 42 landslide occurred on December 23, 2015, closing the highway for nearly one month and costing the Oregon Department of Transportation an estimated $5 million to repair. Rainfall versus time profiles were produced for three popular semivariograms (spherical, exponential, and Gaussian) to gauge the influence of choosing one method over another (Figure 4).

Accumulation42 Intensity42

Figure 4: Comparison of results obtained from the different semivariograms and PRISM.

Figure 4 shows little effect from changing the semivariogram model, which is likely a result of limited variability in the rainfall measurements and of the distribution of recording stations near the location of the Highway 42 landslide.

To verify the results of this exercise, PRISM daily rainfall estimates were downloaded for the corresponding time period and compared (Figure 4). This comparison shows that, while the PRISM data do not capture spikes in rainfall amount, the overall accumulation of rainfall appears to be similar, implying that kriging was effective for this application.

 

Question

The Statewide Landslide Information Database for Oregon (SLIDO, Figure 1) is a GIS compilation of point data representing past landslides. Each point is associated with a number of attributes, including repair cost, dimensions, and date of occurrence. For this exercise, I asked whether SLIDO could be subjected to a hot spot analysis and, if so, whether the results could be insightful.

Small_SLIDOFig_Re

Figure 1: SLIDO (version 3.2).

The Tool

Hot spot analysis is a tool in ArcGIS that spatially identifies areas of high and low concentrations of an input weighting parameter. The required input is either a vector layer with a numerical attribute that can serve as the weighting parameter, or a vector layer whose features each indicate an equal weight of occurrence. Outputs are a z-score, p-value, and confidence-level bin for each feature of the input layer.

Description of Exercise

Performing the hot spot analysis involved more than simply running the tool with SLIDO as an input and a weighting field selected. Selecting an input field was easier said than done, as the SLIDO attribute table is only partially complete. Based on a survey of fields in the SLIDO attribute table, it was clear that repair cost was the best choice. All points having a repair cost were then exported to a new layer, which was then input into the hot spot analysis. An important note is that this step greatly reduced the number of points and their scatter, so the output map looks slightly different from Figure 1.

Outputs

The output of this exercise is a comparison of SLIDO points colored by their repair cost with SLIDO points colored by confidence level bin (Figure 2).

Small_HotSpotsMap

Figure 2: Comparison of coloring by hot spots to simply coloring by cost.

Discussion

The second map in Figure 2 shows the presence of a major hot spot and a major cold spot in landslide repair costs in Oregon. The figure shows that, on average, landslides in the northwest corner of the state are more expensive. This observation can only be made because there appears to be a similar density of points, located at similar distances from their neighbors, across the entire network of points. The figure also shows that single high-cost landslides do not play a major role in the positioning of hot spots, which is a nice advantage of the hot spot analysis.

In general, I think that the hot spot analysis did a good job illustrating a trend that may not have been obvious in the original data.

Bonus Investigation

In the hot spot documentation, it is stated that the analysis is not well suited to small datasets. An example of performing a hot spot analysis on a small dataset is provided in Figure 3. While there may be a trend in points colored by normalized infiltration rate, the hot spot map shows no significant points.

Small_ColdSpots

Figure 3: Hot spot analysis on a small dataset.

Goal: Where do statistically significant spatial clusters of high values (hot spots) and low values (cold spots) of drop lie?

Variable: drop (employment loss)

  • Monthly employment, Jan. 2003 - Dec. 2014
  • Data available: 2840 counties
  • drop

Tool: Hot spot analysis

Steps:

  • Add Mean center of population.shp (3141 counties), U.S. county.shp (3141 counties), and drop values (2840 counties)
  • Join the drop values to the mean centers and export as a new layer file
  • Project the data
  • Run Hot Spot Analysis (defaults for Conceptualization of Spatial Relationships and Distance Method)
  • Make a map
  1. Distribution of drop, using natural breaks

drop 3141

2. Hot spot analysis using 3141 counties

  • Low employment loss clusters in the northern part of the country
  • High employment loss lies in the western and southeastern parts

Hotspotdrop 3141

 

Questions:

  • The previous map is made from 3141 counties; however, only 2840 counties have a drop value. Missing values are treated as 0; will this affect the result?
  • Hawaii and Alaska are included. Will excluding HI and AK affect the result?

Both: Yes

  • Redraw the map
  • Keep only the matched records (2838 counties)
  • Exclude Hawaii and Alaska

 

3. Hot spot analysis using 2838 counties

Hotspot drop 2838

4. Hot spot analysis using contiguous U.S. counties, with Hawaii and Alaska removed.

Hotspot drop contiguous

Conclusion: Hot spot analysis is sensitive to spatial outliers, and it only identifies clusters of low values or clusters of high values. What if I want to find an outlier, for example a low-value unit within a high-value cluster?

I should use the spatial autocorrelation tool Anselin Local Moran's I (cluster and outlier analysis).

Background:

In August 2015, the BLM released the National Seed Strategy, which calls for the use of genetically appropriate seed for the restoration of damaged ecosystems. Bluebunch wheatgrass is a long-lived native perennial species that grows in most western states of the United States as well as British Columbia. This species has been shown to be genetically adapted to different climate patterns across its range. This adaptation is evident because bluebunch wheatgrass plants from distinct climates exhibit different phenotypic (physical) traits. Bluebunch wheatgrass seeds should therefore only be dispersed in areas where they are adapted to the local climate conditions.

In order to determine how far seed from this species can be moved across its range, species-specific seed zones have been developed for the Great Basin ecoregion (St. Clair et al. 2013). These seed zones were delineated using spatial analysis of climate patterns and observed plant traits from many populations of bluebunch wheatgrass in different climates in the Great Basin.

In order to test the efficacy of the seed zones for bluebunch wheatgrass, common gardens have been established. Common gardens are a way of minimizing the effect of site history (i.e. grazing, fire, and management practices) on the growth habits of bluebunch wheatgrass and illuminating the climate-caused genetic adaptations. Wild seeds are collected from dispersed locations in the Great Basin, reared in a greenhouse to the young adult stage, and co-located in a common garden (see Figure 1). Once the plants are placed within the common garden they can be monitored for phenotypic trait variability, and these variations can be "mapped" back to their home climate.

Although the relationship between bluebunch wheatgrass phenotypes and climate is fairly well understood, there is a knowledge gap surrounding bluebunch phenotypes and soils. I aim to use data gathered from existing common garden studies and soil maps to determine if relationships between soils and bluebunch phenotypes exist.

Study Design:

Two common gardens are situated within each of 8 seed zones, resulting in 16 gardens total. At each common garden, 4-5 populations from each zone are represented. Twice per growing season, phenotypic trait data such as leaf length and width, crown width, and number of reproductive stalks are gathered for each individual plant at all common gardens. The resulting dataset contains population-level means for each trait at each garden.

Generalized Common Garden Schematic

CGdiagram (figure 1)

Research Question:

Do phenotypic traits of bluebunch wheatgrass vary with soil order?

Tools and Approaches:

Soils Dataset:

  • Soils data were gathered from the Geospatial Data Gateway and were downloaded separately for each state in the study region to reduce the file size.
  • Soils raster layers for each state were loaded into ArcMap.
  • Tabular data for soil order was joined to raster data using the “Join Field” tool in ArcMap.
  • Soil order layers were symbolized using graduated colors by category.

Bluebunch Trait Dataset:

  • Latitude and longitude decimal degrees (X Y coordinates) were added to each population in the common garden plant trait dataset.
  • Blank cells were replaced with NA values.
  • X Y locations for each row in the dataset were "jittered" to remove duplicate coordinates (see figures 2 & 3; an R-based sketch of this step follows the list).
    • Using the rand() function in Excel, two new columns with random positive decimal values were added.
    • The difference between the two random-value columns was subtracted from the X and Y coordinates for each row in the dataset.
    • The resulting "jittered" X and Y coordinates were stored in new columns.
  • The bluebunch trait dataset was loaded into ArcMap and visualized using the jittered X and Y coordinates.
  • Hot spot analysis was performed separately for each of four plant traits (leaf width, leaf length, crown width, and number of reproductive stalks) and visualized with soil order.
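The jittering was done in Excel with rand(); an equivalent step could be done in R as sketched below, where the data frame 'traits' and its lon and lat columns are hypothetical.

## R equivalent of the Excel-based jittering described above (illustrative only)
set.seed(42)                                         # reproducible random offsets
traits$lon_jit <- jitter(traits$lon, amount = 0.01)  # ~0.01-degree offsets
traits$lat_jit <- jitter(traits$lat, amount = 0.01)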

(figure 2) Un-jittered population X Y locations

Picture2

 

(figure 3) Jittered population X Y locations

Picture3

Results:

LeafWidth

In the hot spot analysis of leaf width, one hot spot and three cold spots were revealed. This indicates that a significant number of plants sourced from those areas had either wide or narrow leaves.

 

LeafLength

In the hot spot analysis of leaf length, one hot spot and two cold spots were revealed. This indicates that a significant number of plants sourced from those areas had either long or short leaves. In this analysis, the hot spot was in a different location than before, indicating that wider leaves were not necessarily longer or vice versa. In contrast, the two cold spots in this analysis were in a similar location to those for leaf width. This indicates that short, narrow leaves tend to co-occur.

 

 

CrownWidth

In the analysis of crown width, the hot spot occurred in the same location as in the leaf width analysis. This leads me to believe that plants from this vicinity tend to be larger in general than plants from other areas.

 

 

RepStk

In the analysis of reproductive stalks, only one large hot spot was found. This indicates that the larger plants in this region also tend to produce significantly more flowering stalks per plant than plants in other areas.

Critique:

■  The jittering effect could be reduced to make it easier to distinguish the populations. It may also be that this type of analysis is not appropriate for this dataset, because the phenotypic trait data are for populations grown at multiple common gardens, which creates a situation where multiple data values share the same X Y coordinate. Even though the jittering was done at random, there is still a chance that the results we are seeing are a product of the data transformation itself.

■  A 90-95% confidence level might be too stringent for reasonable genetic inference; in many ecological contexts, lower confidence levels are routinely accepted. In this case, leaf width and length, crown width, and the number of reproductive stalks of each plant probably co-vary, and it does not appear that this type of analysis is meant to deal with covariance.

■  Soil traits are de-emphasized. Although soil order was used as the background, it was difficult to see a pattern between soils and plant-trait hot spots. This may be partially due to the large scale of the analysis or may also be a product of the jittering.

■  Only one phenotypic trait can be visualized at a time. Since only one trait can be mapped in the hot spot analysis at a time, there is no way to know if the observed similarity in hot spot locations is truly significant.

■  Hot spot analysis does not work with raster data. In the future, I would like to find a way to look at the clustering patterns within soils and it appears that this analysis does not recognize raster data.

■  A less regional, and more zoomed in analysis might improve the interpretation. Since the soils are highly variable in this study area it may be more productive to narrow the scope of inference to include smaller areas. Doing this may make the soil – plant trait interpretation easier.

References:

St. Clair, John Bradley, Francis F. Kilkenny, Richard C. Johnson, Nancy L. Shaw, and George Weaver. 2013. "Genetic Variation in Adaptive Traits and Seed Transfer Zones for Pseudoroegneria spicata (Bluebunch Wheatgrass) in the Northwestern United States." Evolutionary Applications 6 (6): 933-48. doi:10.1111/eva.12077.

For the ArcGIS hot spot analysis I used the tutorial available at https://www.arcgis.com/home/item.html?id=6626d5cc81a745f1b737028f7a519521. Note that the tutorial must be downloaded using the "open" button below the image on the web page. This will download a folder with a pdf file of instructions and a dataset for the tutorial's analysis.

I followed the tutorial but substituted a dataset of fire ignitions logged by the Oregon Department of Forestry between 1960 and 2009 (Fig. 1). This dataset does not include some fires that occurred in the state during that time, for instance some on federal land and many on non-forested land.

ODFFires1960To2009HotSpot_ODFPoints

Figure 1: Oregon Department of Forestry logged fire occurrences from 1963 to 2009.

The tutorial was straightforward and easy to follow. The only "problem" I had was with the spatial autocorrelation step: my data did not produce any peaks in the z-score graph, so I arbitrarily chose a distance towards the lower end of the graph (tutorial steps 10f and 10g).

My results from the hot spot analysis step (Fig. 2) showed areas with hot spots. These tended to be near roads and cities in most cases, with a few instances at high elevation in northeastern Oregon.

ODFFires1960To2009HotSpot_Points

Figure 2: Results from hot spot analysis.

Interpolating the hot spot results using the IDW tool (tutorial step 12) produced anomalous edge effects (Fig. 3). Note the areas of high values spreading from hot spot areas into areas with no data along the coast. A close-up image of the Medford area (Fig. 4) shows edge effects spreading into areas with no data.

ODFFires1960To2009HotSpot_Surface

Figure 3: Hot spot analysis interpolated surface produced by the application of the Arc IDW tool. Note the edge effects, especially along the coast and the northern state border.

ODFFires1960To2009HotSpot_Medford_ODFPtsAndSfc

Figure 4: Interpolated hot spot surface produced by the Arc IDW tool along with the original ODF ignitions dataset points. Note the edge effects in the southwest corner of the image spreading from the area of high fire occurrence into areas with no data.

As part of a study I did for a master's thesis, I used the ODF data along with another dataset to do a logistic regression analysis of ignition occurrences in the Willamette watershed, excluding the valley floor. Within this area, the hot spot analysis placed hot spots along some roads and population centers in several instances (Fig. 5A). These results are consistent with the results from my study, which showed that human-caused fires were more likely to occur along roads and near areas of high population, and that lightning-caused fires were more likely to occur at high elevations (Fig. 5B). The lack of a match between the two methods at high elevation is due to data points that were used in the logistic regression but not in the hot spot analysis. However, one disadvantage of the hot spot analysis is that it is univariate, or strictly occurrence based. The logistic method allows for correlation and the application of that correlation to the prediction of future occurrence.

HotShotVsLogistic

Figure 5: Results from A) hot spot analysis and B) logistic regression analysis for fire occurrence in the hills and mountains of the Willamette watershed. Redder areas are “hotter” in the hot spot analysis and have a higher ignition probability in the logistic regression analysis.
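For comparison with the hot spot approach, a logistic regression of the kind described above can be sketched in R; the data frame 'cells' and its predictor names are hypothetical stand-ins, not the thesis dataset.

## Illustrative logistic regression of ignition occurrence (cf. Fig. 5B)
fit <- glm(ignition ~ dist_to_road + dist_to_town + elevation,
           family = binomial(link = "logit"), data = cells)

## Predicted ignition probabilities, which can then be mapped
p_hat <- predict(fit, type = "response")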

In conclusion, hot spot analysis is a good technique for finding areas of higher occurrence, but it has no predictive power beyond analyzing past occurrences. The ArcGIS IDW tool can produce an interpolated surface, but it has an edge-effect issue that could lead to questionable results. Regression methods, for example logistic regression, may produce results better suited to analyzing the correlation of independent variables with event occurrence.