I set out this term to delve into a vegetation dataset.  It consists of nearly 2,000 vegetation surveys in seven salt marshes across the Pacific Northwest.  My goal is to be able to predict how vegetation will change with shifts in climate and increasing sea level.  Given the diversity of wildlife that utilize estuaries at various stages of their life cycles, understanding how habitat will respond is critical to developing conservation plans.  To achieve this, I broke the problem into three stages:

1: Identify vegetation communities from field data

2: Create a habitat suitability model for each community under current conditions

3: Use the habitat suitability models to project community response to changes in climate and sea level

 

Due to the large number of species identified (42), I first needed to reduce the dataset to only the most common species. I chose to use only the species found in more than 5% of all the survey plots, which left me with 16 species. I then explored how best to combine these species into communities.  By modeling communities rather than species, I am assuming each species within a community will respond the same way. Given that salt marsh vegetation is generally stratified by elevation, this is a reasonable assumption to begin with, but one that I will need to revisit in the future.  To determine the communities, I used canonical correspondence analysis (CCA), which can be thought of as a Principal Component Analysis for categorical data.  I defined the niche of the communities using 5 environmental variables: elevation (standardized for tidal range), mean daily flooding frequency, distance to channel, distance to bay, and channel density.  The resulting CCA graph:

cca_results
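For anyone curious, the ordination step itself is short in R. Below is a minimal sketch with the vegan package; the object and column names (veg, env, elev_std, and so on) are hypothetical stand-ins, not my actual script.

library(vegan)                                       # ordination tools, including cca()
cca_fit <- cca(veg ~ elev_std + flood_freq + dist_channel + dist_bay + chan_density, data = env)
plot(cca_fit)                                        # the CCA ordination graph
site_scores <- scores(cca_fit, display = "sites")    # plot-level coordinates, used for clustering below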

I then used a script in R to determine the optimum number of clusters given the CCA results by minimizing the within-cluster sum of squares.  Using the following graph, and my own interpretation of the CCA results, I settled on using 5 communities.

k_means_cluster_graph
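The cluster-number search is also compact. Here is a sketch of the within-cluster sum of squares curve, assuming the site_scores matrix from the CCA sketch above:

wss <- sapply(1:10, function(k) kmeans(site_scores, centers = k, nstart = 25)$tot.withinss)  # total WSS for k = 1..10
plot(1:10, wss, type = "b", xlab = "Number of clusters (k)", ylab = "Within-cluster sum of squares")  # look for the elbow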

This figure shows the survey plot locations, coded by community. Notice the differences in complexity across the sites (Bandon has many while Grays Harbor and Nisqually have fewer).

MaxEnt Predictors_Community_Locations

 

To create a continuous prediction of communities and develop a model to project climate responses, I chose to use the MaxEnt habitat suitability modeling tool.  Essentially, MaxEnt compares where a species (or community) occurs against the environment (background).  It creates response curves by extracting patterns while maximizing entropy (randomness).  MaxEnt can take continuous and categorical data as input, and the number of model parameters (fewer parameters = smoother response curves) can be controlled through the regularization value (1 is the default).   You can also control which 'features' are used to create the response curves (linear, quadratic, product, hinge, threshold).  In an attempt to create a parsimonious model, I only used linear and hinge features, but left regularization set to 1.  Results from MaxEnt are logistically scaled (0 to 1).  Because I am modeling multiple communities in the same area, I needed a method for determining which community is predicted.  The simplest is to choose the community with the highest predicted value.  This hasn't been done in the literature, due to issues with how presence data are usually collected. But because this dataset comes from standardized field surveys, and I'm using the same predictor layers for all communities, I'm presuming the maximum value is a legitimate choice.  In addition to the 5 physical predictor layers from the CCA, I added 3 climatic layers to the model: annual precipitation, maximum temperature in August, and minimum temperature in July; each is a 30-year average from the PRISM dataset.  Here are the predicted communities from MaxEnt:

MaxEnt_maximum_classification_2
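The maximum-value classification reduces to a single raster calculation. A sketch with the raster package, assuming one MaxEnt output grid per community (file names hypothetical):

library(raster)
s <- stack(c("comm1.asc", "comm2.asc", "comm3.asc", "comm4.asc", "comm5.asc"))  # one suitability layer per community
community <- calc(s, fun = which.max)   # per cell: index of the community with the highest predicted value
plot(community)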

 

I used two methods to determine the potential error in using the maximum predicted value for the community classification. First, I found the number of communities in each location with a predicted value greater than 50%.  In the figure below, yellow indicates areas where no community has a >50% predicted value, while green represents areas with one community over 50%.  The areas with higher community richness (2 or 3) are relatively small, so I have more confidence in this method.

MaxEnt_community_richness
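This first check is the same kind of cell-wise calculation, sketched here on the suitability stack s from above:

richness <- calc(s, fun = function(x) sum(x > 0.5))   # per cell: number of communities predicted above 50%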

Second, I determined the number of communities within 25% of the maximum predicted value [max value – (max value * 0.25)].  This gives an indication of the separation in predicted values across communities. Here, yellow indicates areas where a single community is separated from the other predicted communities; green marks areas with 2 communities with close predictions. Given the large proportion of yellow and green, this again gives me confidence in using the maximum predicted value for community classification.

MaxEnt_community_richness_prevalence
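And the second check as a sketch on the same stack s: values within 25% of the maximum are those at or above 0.75 times the cell's maximum.

prevalence <- calc(s, fun = function(x) sum(x >= max(x) * 0.75))   # communities within 25% of the cell maximum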

Here are the ROC AUC curves. AUC is a measure of model fit, with 1 being perfect and 0.5 random.  All models except Gp2 show relatively good fit (over 0.75 is usually deemed a worthwhile model).  The species within Gp2 are the most common generalists, and I would not have expected MaxEnt to be able to model this community very well.  As I pursue this further, I will likely split Gp2 up in an effort to produce better community classifications.

AUC curves

I have several 'next steps' to continue developing this model.  First, I would like to include vegetation data from 7 California salt marshes in order to better capture the environmental variation along the coast.  Developing elevation response models for each site is necessary in order to project this model under climate change and sea-level rise scenarios.  I would also like to explore additional environmental layers, such as soil type and distance to the ocean mouth (a salinity proxy), to further refine the defined niche.

Humpback whales feed in the temperate high latitudes along the North Pacific Rim from California to Japan during the spring, summer, and fall and migrate in the winter to the near-tropical waters of Mexico, Hawaii, Japan, and the Philippines to give birth and mate (Calambokidis et al. 2001; Calambokidis et al. 2008).  Although whales show strong site fidelity to feeding and breeding grounds, genetic analysis of maternally inherited DNA (mitochondrial DNA or mtDNA) reveals greater mixing of individuals on the feeding grounds (Baker et al. 2008).  This mixing makes it difficult to determine regional population structure and may complicate management decisions.  For example, should the feeding grounds be managed as one population unit or is there evidence to suggest that more than one management unit is present?  If more than one, are they affected differently by coastal anthropogenic activities, and therefore, require population specific management strategies?

With this in mind, I decided to explore the spatial pattern of humpback whales from the Western and Northern Gulf of Alaska, a subset of data collected during the SPLASH Project (Structure of Populations, Levels of Abundance, and Status of Humpbacks; http://www.cascadiaresearch.org/splash/splash.htm).  Specifically, I am interested in the following questions:

  1. Do whales form clusters? Do whales that are more closely related (have the same mtDNA haplotype) cluster together?
  2. Are there spatial patterns in whale distribution based on depth?  Do more closely related whales cluster together based on depth?
  3. Are there spatial patterns in whale distribution based on slope?  Do more closely related whales cluster together based on slope?

The bathymetry layer, GEBCO_08 Grid, version 20091120 (http://www.gebco.net), was used for depth and slope analyses in questions 2 and 3.  Depth data were extracted using the ArcGIS 10.1 Extract Values to Points tool within the Spatial Analyst Toolbox.  Slope values were derived from the bathymetry data using the ArcGIS 10.1 Slope tool and were then extracted using the same Extract Values to Points tool.
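The same extraction can be sketched in R as an alternative to the ArcGIS workflow; here I assume a local GeoTIFF copy of the GEBCO grid and a data frame whales with lon/lat columns (both hypothetical names):

library(raster)
bathy <- raster("gebco_08.tif")                            # hypothetical local copy of the GEBCO_08 grid
slope <- terrain(bathy, opt = "slope", unit = "degrees")   # slope derived from the bathymetry
whales$depth <- extract(bathy, whales[, c("lon", "lat")])  # depth at each whale encounter
whales$slope <- extract(slope, whales[, c("lon", "lat")])  # slope at each whale encounter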

**Results presented here are strictly for the purposes of exploring the functionality of the ArcGIS tools found in the Spatial Statistics Toolbox.  They should be considered preliminary and should not be reproduced elsewhere.**

 

Part 1: Average Nearest-Neighbor Analysis

This tool is based on the null hypothesis of complete spatial randomness and calculates a nearest neighbor index based on the average distance from each feature to its nearest neighboring feature.  The nearest-neighbor ratio is calculated as the Observed Mean Distance divided by the Expected Mean Distance and has a value of 1 under complete spatial randomness.  Values greater than 1 indicate a dispersed pattern, while values less than 1 indicate a clustered pattern.
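For reference, the index is simple to reproduce outside ArcGIS. A sketch in R with the spatstat package, assuming a data frame of projected X/Y coordinates in meters and using the minimum enclosing rectangle as the study area (mirroring the ArcGIS default discussed below):

library(spatstat)
ann <- function(df) {
  n     <- nrow(df)
  A     <- diff(range(df$X)) * diff(range(df$Y))       # area of the enclosing rectangle
  pts   <- ppp(df$X, df$Y, window = owin(range(df$X), range(df$Y)))
  d_obs <- mean(nndist(pts))                           # observed mean nearest-neighbor distance
  d_exp <- 0.5 / sqrt(n / A)                           # expected mean distance under CSR
  se    <- 0.26136 / sqrt(n^2 / A)                     # Clark & Evans standard error
  c(ratio = d_obs / d_exp, z = (d_obs - d_exp) / se)   # ratio < 1 clustered, > 1 dispersed
}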

 

Clustering in whales

| Haplotype | n   | Obs Mean Dist (m) | Exp Mean Dist (m) | N-N Ratio | Z-score | p-value | Pattern   |
|-----------|-----|-------------------|-------------------|-----------|---------|---------|-----------|
| All       | 788 | 1913.54           | 19537.25          | 0.0979    | -48.443 | 0.00000 | Clustered |
| A-        | 202 | 5962.24           | 33738.56          | 0.1767    | -22.385 | 0.00000 | Clustered |
| A+        | 220 | 8753.78           | 34046.85          | 0.2571    | -21.080 | 0.00000 | Clustered |
| A3        | 91  | 7404.13           | 34929.24          | 0.2120    | -14.381 | 0.00000 | Clustered |
| E1        | 73  | 16789.95          | 50701.04          | 0.3312    | -10.932 | 0.00000 | Clustered |
| E3        | 46  | 12843.30          | 34918.80          | 0.3679    | -8.203  | 0.00000 | Clustered |
| F2        | 83  | 14402.10          | 51259.82          | 0.2810    | -12.532 | 0.00000 | Clustered |

The output of this tool indicates that whales, regardless of mtDNA haplotype, are significantly clustered in the Western and Northern Gulf of Alaska.  This result is not entirely surprising, given that humpback whales tend to form small groups on the feeding grounds.  However, the results of this tool are very sensitive to changes in the study area, and it is therefore best to run it with a fixed study area.  That was not done for the current analysis; instead, the area of the minimum enclosing rectangle around the input features was used, and this area varied for each haplotype variable.

Based on the results, it seems the average nearest neighbor tool may not be the most appropriate tool for discovering spatial patterns in humpback whales.  However, it would be worth running the tool again using a fixed study area before discarding its utility for this data set completely.

Alternatively, it would be worth conducting a refined nearest-neighbor analysis in which the variable of interest (mtDNA) is the complete distribution function of all observed nearest-neighbor distances (not just the mean nearest-neighbor distance), using a specified distance at which to test for complete spatial randomness.  This method is not currently available within the ArcGIS Spatial Statistics Toolbox and would need to be conducted in another software package such as R.

 

Part 2: Hot Spot Analysis

This tool uses the Getis-Ord Gi* statistic to identify statistically significant hot spots (clusters of high values) and cold spots (clusters of low values) given a set of weighted features.   For each feature in the data set, a Gi* statistic is returned as a z-score.  The larger the positive z-score, the more intense the clustering of high values (hot spot); the more negative the z-score, the more intense the clustering of low values (cold spot).
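The same statistic is available in R for anyone wanting to cross-check the ArcGIS output. A sketch with the spdep package, assuming a SpatialPointsDataFrame pts (projected, in meters) with a depth field; the 25 km distance band is purely illustrative:

library(spdep)
nb <- dnearneigh(coordinates(pts), d1 = 0, d2 = 25000)  # neighbors within a 25 km band
lw <- nb2listw(include.self(nb), style = "B")           # Gi* includes the focal feature itself
gi <- localG(pts$depth, lw)                             # z-scores: large positive = hot spot, large negative = cold spot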

HotspotScale

Figure 1. The output scale for the hot spot analysis tool.  When interpreting the results, it is useful to remember that a feature may be mapped as bright red not because its value is particularly large but because it is part of a spatial cluster of high values.  Conversely, a feature may be mapped as bright blue not because its value is particularly small but because it is part of a spatial cluster of low values.  Thus, the more positive the z-score, the hotter the hot spot (darker red), and the more negative the z-score, the colder the cold spot (darker blue).

 

Spatial patterns in whale distribution based on depth

 

Hotspot_Depth_HapsOnly

Figure 2. Results of hot spot analysis for all whales (n=799) based on depth (m), no mtDNA considered.

 

The output from this tool shows the presence of several hot and cold spots regardless of mtDNA haplotype (Figure 2). The hot spots (red) indicate that whales in these areas occur at shallower depths and the results are statistically significant.   There are also several statistically significant cold spots (blue) where whales are found at deeper depths, often beyond the continental shelf.

 

Hotspot_Depth_ByHap

Figure 3. Results of hot spot analysis by haplotype based on depth (m).

 

The output by haplotype also shows the presence of several hot and cold spots, although the location of each varies by haplotype (Figure 3).  The A+ and A- haplotypes show statistically significant hot spots in the Northern Gulf of Alaska, while the E1 and F2 haplotypes show a less intense, though still significant, cluster in the same region.  The E1 haplotype also shows a significant hot spot in the Western Gulf of Alaska.  These hot spots reflect whales clustering by haplotype at shallower depths.  The A3 and E3 haplotypes show relatively little clustering – no hot spots and a very small cold spot in the western region.  In general, for all haplotypes, cold spots are located in the western region or beyond the continental shelf, where whales cluster at deeper depths.

 

Spatial patterns in whale distribution based on slope

 

Hotspot_Slope_HapsOnly

Figure 4. Results of hot spot analysis for all whales (n=799) based on slope (degrees), no mtDNA considered.

 

The output from this tool shows the presence of several hot and cold spots regardless of mtDNA haplotype (Figure 4). The hot spots (red) indicate that whales in these areas occur at steeper slopes and the result is statistically significant.   There are also several statistically significant cold spots (blue) where whales are found at flatter slopes.

 

Hotspot_Haplotype_Slope

Figure 5. Results of hot spot analysis by haplotype based on slope (degrees).

 

The output by haplotype also shows the presence of several hot and cold spots, although the location of each varies by haplotype (Figure 5).  The A+, A-, A3 and F2 haplotypes show statistically significant hot spots in the Northern Gulf of Alaska while the A3, F2, and E3 (to a lesser extent) haplotypes also show hot spots in the western region.  These hot spots reflect whales clustering by haplotype at steeper slopes.  The A+, A-, A3, and F2 haplotypes have statistically significant cold spots in the northern region, while a cold spot for the E1 haplotype occurs in the western region.   These cold spots reflect whales clustering by haplotype at flatter slopes.

Reflecting on my results, I initially thought the hot/cold spot patterns found might be influenced by the uneven sampling effort and differences in sample size.  However, on 23 May 2013 Lauren Scott from Esri commented on this very subject in response to a posting by Jen Bauer (http://blogs.oregonstate.edu/geo599spatialstatistics/2013/04/24/discerning-variables-spatial-patterns-within-a-clustered-dataset/#comment-1393).  Lauren stated that even if sampling is uneven (e.g., many samples are taken from some areas, while fewer samples are taken at others), the impact on the results of a hot spot analysis will be minimal.  She provided the following for further clarification.   In areas with many samples, the tool will have more information to compute its result.  The tool will "compare the local mean based on lots of samples to the global mean based on ALL samples for the entire study area and decide if the difference is statistically significant or not".  In areas with fewer samples, "the local mean will be computed from only a few observations/samples… the tool will compare the local mean (based on only a few pieces of information) to the global mean (based on ALL samples) and determine if the difference is significant".  Thus, my concern seems to be unwarranted.

 

In general, the hot spot tool seems to be more useful than the average nearest neighbor tool for the humpback whale data set used here.  Statistically significant clustering of whales occurs with and without consideration of mtDNA for both depth and slope.  Although preliminary, the results from this tool highlight areas for further investigation using additional spatial analysis techniques.

 

Challenges discovered with the ArcGIS Spatial Statistics Toolbox

My biggest challenges using the ArcGIS Spatial Statistics Toolbox are twofold.  First, many of the tools require a numeric variable (either continuous or discrete) and do not support categorical variables, such as mtDNA haplotype, "out of the box".  Thus, in order to look for spatial patterns in haplotypes, I had to split the data up by haplotype, create separate feature classes for each haplotype, and then run the tool several times to get my results.  Given that I was working with a small data set, the repetition was relatively painless, but it would be useful to have this process automated (perhaps using ModelBuilder or Python scripting).  Not only would this speed up processing, it would also eliminate human-induced error.  Second, the hot spot analysis only allows the input of one variable at a time.  What if one suspected that the spatial pattern of humpback whales (with or without mtDNA consideration) is related to depth and another environmental variable (e.g., sea surface temperature, productivity, or currents)?  I believe this type of analysis would need to be conducted in another software package such as R.
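For what it's worth, in R that repetition collapses to a single call. A sketch, assuming a whales data frame with a haplotype column and reusing the ann() function sketched in Part 1:

results <- sapply(split(whales, whales$haplotype), ann)   # run the analysis once per haplotype subset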

~~~~~~~~~~~~~~~~~~~~~

Baker, C. S., D. Steel, J. Calambokidis, J. Barlow, A. M. Burdin, P. J. Clapham, E. Falcone, J. K. B. Ford, C. M. Gabriele, U. González-Peral, R. LeDuc, D. Matilla, T. J. Quinn, L. Rojas-Bracho, J. M. Straley, B. L. Taylor, J. Urbán Ramírez, M. Vant, P. R. Wade, D. Weller, B. H. Witteveen, K. Wynne, and M. Yamaguchi. 2008. geneSPLASH: an initial, ocean-wide survey of mitochondrial (mt) DNA diversity and population structure among humpback whales on the North Pacific. Final Report for contract 2006-0093-008, submitted to National Fish and Wildlife Foundation.

Calambokidis, J., E.A. Falcone, T. J. Quinn, A. M. Burdin, P. J. Clapham, J. K. B. Ford, C. M. Gabriele, R. LeDuc, D. Mattila, L. Rojas-Bracho, J. M. Straley, B. L. Taylor, J. Urbán, D. Weller, B. H. Witteveen, M. Yamaguchi, A. Bendlin, D. Camacho, K. Flynn, A. Havron, J. Huggins, N. Maloney, J. Barlow, and P. R. Wade. 2008. SPLASH: Structure of Populations, Levels of Abundance and Status of Humpback Whales in the North Pacific. Final report for Contract AB133F-03-RP-00078 from U.S. Dept of Commerce.

Calambokidis, J., G. H. Steiger, J. M. Straley, L. M. Herman, S. Cerchio, D. R. Salden, J. Urbán R., J. K. Jacobsen, O. von Ziegesar, K. C. Balcomb, C. M. Gabriele, M. E. Dahlheim, S. Uchida, G. Ellis, Y. Miyamura, P. Ladrón de Guevara, M. Yamaguchi, F. Sato, S. A. Mizroch, L. Schlender, K. Rasmussen, J. Barlow, and T. J. Quinn II. 2001. Movements and population structure of humpback whales in the North Pacific. Marine Mammal Science 17:769–794.

 

South Sister Watershed. Log Drive layer created by Rebecca Miller (see Miller, 2010).

I spent the majority of this class in the preliminary stages of my research.  While I was able to work through nearest neighbor analysis and hot spot analysis following the "Spatial Pattern Analysis of Dengue Fever" exercise, which was helpful, I did not analyze my own data.  Nonetheless, I do intend to use spatial statistical tools to analyze the data I collect for my summer research, described below.

This summer, I'll be working with the BLM to assess stream temperature in South Sister Creek.  In 2009, the BLM placed several stream enhancement structures (~90 according to some reports) in South Sister, a third- to fourth-order tributary of the Smith River in the Coast Range of Oregon, nested in the Umpqua Basin. The creek serves as spawning ground for Oregon Coastal coho salmon (ESA listed), steelhead trout, and Pacific lamprey.  Over time, the creek has been degraded by human use.  Recent work verified that the creek had been used for log drives during the 19th and 20th centuries, which may have contributed to the simplification of the stream (Miller, 2010).  Stream cleaning, a restoration practice that removed log jams from the stream, has also reportedly occurred along the stream in the 1980s (Bureau of Land Management, 2009).

In 2006, 2011, and 2012 the BLM collected data from 9 temperature gages along the stream and its tributaries during the summer months (from mid or late June through September).  This summer, those same sites will be monitored (see Figure 1).  The BLM is interested in whether or not their efforts made a detectable difference in stream temperature.  However, the data collection record is not long enough to detect a change from a temporal perspective.

Nonetheless, there are interesting questions to be asked regarding the relative influence of the enhancement structures, riparian and topographic shade, and substrate on stream temperature at fine spatial scales.   Previous research has indicated the importance of several variables on stream temperature, with surface and groundwater inputs, riparian shade/solar inputs, discharge, and hyporheic exchange exerting moderate to high influence (Poole & Berman 2001).  Large wood in streams has been linked to increased channel and stream-bed complexity, which may be indicative of hyporheic exchange (Arrigoni et al. 2008).

Specifically, the following questions are being asked:

  1. What is the spatial and temporal variability of temperature at a small scale?
  2. How well are the current stationary data loggers (i.e., HOBO Tidbit loggers) representing the ambient fine-scale patterns of stream temperature?
  3. Which factors dominate or explain the most spatial variation in stream temperature at a fine scale?
    1. Log jam density
    2. Presence or absence of alluvial substrate (a proxy for hyporheic exchange)
    3. Topographic shading
    4. Riparian shading

Figure 2 WRS protocol

Over the past few weeks I've been working on a research design to help answer these questions.  To map the heterogeneity and relative influence of log jam density, shade, and substrate on the spatial distribution of stream temperature, the study employs a stratified sampling approach designed to capture as many combinations as possible of the variables outlined in Figure 2 at varying elevations of the stream.  LiDAR data and a shade modeler program will be used to identify areas that are most likely to have 10-2 shade (both topographic and vegetation).  An analysis of log jam density (data from previously gathered GPS points taken at the right bank of each in-stream log jam) will be undertaken in ArcMap to divide the study area into reaches that are densely, moderately, and sparsely populated by jams.  Data for the final stratification layer, streambed substrate, will be gathered during the site layout and field mapping phase of the protocol due to the lack of a priori, spatially explicit knowledge of stream-bed condition (alluvial or bedrock).  The field map will be made during the first survey and will be used as a reference for locating temperature sampling sites and processing data.

Once my data are collected, I would like to explore the use of Geographically Weighted Regression – however, much of my data will be categorical, which may mean GWR will have to be avoided.  I welcome any comments on what types of analysis I should consider that might also inform my study design.


 

Arrigoni, A. S., Poole, G. C., Mertes, L. A. K., O'Daniel, S. J., Woessner, W. W., & Thomas, S. A. (2008). Buffered, lagged, or cooled? Disentangling hyporheic influences on temperature cycles in stream channels. Water Resources Research, 44(9). doi:10.1029/2007WR006480

Bureau of Land Management. (2009). South Sisters and Jeff Creek Stream Enhancement Project; Phase IV. Coos Bay: Bureau of Land Management.

Miller, R. (2010). Is the Past Present? Historical Splash-dam Mapping and Stream Disturbance Detection in the Oregon Coastal Province. Oregon State University. Retrieved 2013.

Poole, G. C., & Berman, C. H. (2001). An Ecological Perspective on In-Stream Temperature: Natural Heat Dynamics and Mechanisms of Human-Caused Thermal Degradation. Environmental Management, 27(6), 787–802.

 

 

  • R: code to assess spatial independence of residuals of a model

This code is useful to check if the assumption of (spatial) independence between residuals of a model is met.

It is useful because it can be applied to the residuals of any model we run.

It generates a plot of bubbles located according to the X-Y coordinates of the corresponding residuals. The size of the bubbles represents the magnitude of the residuals, while the color indicates whether they are positive or negative. If the residuals are spatially independent, no pattern should be observed in the arrangement of the bubbles. NOTE: requires the "sp" package.

library(sp)                                     # load the sp package
Residuals <- resid(model)                       # vector of residuals from the model
mydata <- data.frame(Residuals, RawData$XCoords, RawData$YCoords)   # residuals with the coordinates from the dataset the model was run on
colnames(mydata) <- c("Residuals", "X", "Y")    # assign names to the columns
coordinates(mydata) <- c("X", "Y")              # identify which columns hold the coordinates

bubble(mydata, "Residuals", col = c("black", "grey"), xlab = "X-coordinates", ylab = "Y-coordinates")   # plot the bubbles

exBubbles

 

  • ArcGIS: Iterator option in Model Builder

This option allows you to run the same function over all the files contained inside a folder, freeing the user from repetitive tasks.

ex_iterator

 

Radiotelemetry is a common tool for studying animal movement: a radio tag is attached to an individual, which is then tracked either remotely with a GPS system or at short distances using hand-held antennas.
Until recently, hummingbird movement patterns could not be studied due to the lack of transmitters small enough to be carried by such a light animal. The development of miniaturized radio-telemetry devices changed this. Using them, Hadley & Betts were able to conduct a translocation experiment in 2008, while I, in 2012, gathered information on natural (within the home range) movement patterns. This information consisted of time-stamped location points, which seemed perfect for the application of spatio-temporal statistics. In particular, I wanted to see if certain characteristics of the observed movement paths (speed, turning angle) could be used to assess behavioral changes associated with characteristics of the terrain (presence of forest).
As my previous posts show, I was unsuccessful in doing this, as none of the tools or analyses I tried showed evidence of pattern, in either space or time. It is possible that due to the very high flight speed of these birds (average 30 mi [48 km] per hour), we weren't able to keep up with them while tracking, leading to unrealistic speed estimates (1.2 km/hour).
Even though my data don't have enough precision to answer point-level questions, the information on general movement rates of the individuals can still give insight into how behavior can be affected by the context in which the animals are moving. The advantage of the lack of temporal and spatial correlation between the data points is that I will be able to use traditional statistics to run these analyses.
I also explored a new tool for analyzing telemetry data points: the dynamic Brownian Bridge Movement Model (dBBMM). This model predicts the areas probably used by the individuals based on their overall movement paths, taking into account not only the location of the points but also the sequence in which they were recorded, incorporating temporal autocorrelation into the calculations. The dBBMM also estimates a parameter (the "Brownian motion variance") that can be used to evaluate the existence of heterogeneous behavior along the tracks. High values of Brownian motion variance would indicate more complex paths (and consequently more active behavior), while low values would indicate less variation in the way the individual is moving. I tried to apply this model to my data but wasn't able to find a way of estimating the Brownian motion variance. What I was able to do, though, was generate rasters showing the probability distribution of individuals in space, that is to say, the areas where the birds were likely to be present.

exdBBMM

Figure 1. Map showing the probability of observing a particular individual in space, based on data points obtained through radio-telemetry. This particular bird seems to have two centers of activity, joined by a transit area.

The code to run this model (in R) can be found at http://www.computational-ecology.com/main-move.html.
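For the record, here is a rough sketch of that workflow with the move package; the track object, tag name, and all argument values are illustrative, not the settings I used:

library(move)
m  <- move(x = track$lon, y = track$lat, time = track$time, proj = CRS("+proj=longlat +datum=WGS84"), animal = "bird1")
mp <- spTransform(m, center = TRUE)               # dBBMM needs projected coordinates
db <- brownian.bridge.dyn(mp, raster = 10, location.error = 20, window.size = 31, margin = 11)
plot(db)                                          # the utilization-distribution raster (as in Figure 1)
getMotionVariance(db)                             # the Brownian motion variance along the track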

I came into this quarter having not yet experienced Dr. Jones' spatial statistics course (I plan to enroll next fall!) nor having any data yet collected for my thesis. Still, it sounded like a great class concept and I thought I'd use the opportunity to explore an interesting set of data, a legacy database of soil pedon measurements from the NCSS:

The data wasn’t in a form I could run in ArcMap; it was stuck in an Access database rather than being a raster or shapefile. But, no worries, right? I figured by about week 3 I could extract something from the dataset worth playing around with (I was also working on this data set for GEO599, Big Data and Spatial Modeling). Well, you know how things take three times as long as you think they will?  It took me until week 9 to finally create just a set of mapped pedons of total organic carbon (OC) to 1 meter depth. Just producing a useable national data-set became the quarter’s big project. I learned a great deal about SQL and Python coding but not much about spatial stats.

Okay, that's my "GEO599 sob story". I don't want to get off too easy here since my project didn't come through. So for the rest of my final blog post, I thought I'd present you all with something more exciting to read about: last May's geostatistics conference in Newport.

The following are some notes/observations of mine from the conference. Oh, one more thing. Just for kicks I decided to play catch-up and run Hot Spot Analysis on my (finally) created organic carbon data. We already covered most everything about that tool, but I still wanted to do SOMETHING geospatial this term.

One final thing I don't want to get buried in my report (I'll repeat it again later): the Central Coast GIS Users Group (CCGISUG) is free to join, so if you're looking to build your professional network, stop reading right now and head to the following link and sign up!

http://www.orurisa.org/Default.aspx?pageId=459070

Central Oregon Coast Geospatial Statistics Symposium Report

On May 30th the Central Coast GIS Users Group (CCGISUG) held a conference on geospatial statistics. As a geographer currently pursuing a minor in statistics, I can say without embellishment this was the most interesting conference I'd ever attended. Kriging, R code, maps everywhere! I was rapt and attentive for every talk that day, and if we're all being honest, when has that ever happened for you over a full conference day?

I arrived a few minutes late and missed the opening overview speech, but walked in just in time to grab a coffee and settle down for EPA wetland researcher Melanie Frazier’s talk on using R for spatial interpolation. Setting a pro-R tone that would persist through the day, she praised R’s surge of increased geospatial capability and wowed us by kriging a map of wetland sand distribution in three lines of code. Some recommended R packages for you all to look into:

  1. sp – base package for vector data work in R
  2. rgdal – facilitates reading and writing of spatial objects (like shapefiles) in R
  3. raster – as the name says, tools for raster work
  4. gstat – interpolation modeling package
  5. plotKML – export R layers to kml for display in Google Earth
  6. colorspace – access to a world of color options
  7. mgcv – regression modeling package

Did you know it's ridiculously easy to create variograms in R? Ms. Frazier assured us three lines of code could do it. Here's an online example I found, and she's pretty spot on about the "three line" rule. Most any single task can be done in three lines of R code. R also supports cross-validation and other goodies. With the variogram, one can run regression kriging code in (gstat? mgcv? sorry, I don't know how she did it) and boom, raster interpolation maps done easy.
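To illustrate (my own toy example with the built-in meuse data, not her code), the rule holds up:

library(gstat); library(sp)                 # sp ships the meuse example data
data(meuse); coordinates(meuse) <- ~x + y   # promote the data frame to spatial points
plot(variogram(log(zinc) ~ 1, meuse))       # empirical variogram in one line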

One other note from that talk: Ms. Frazier cannot yet release her code to the public, but if you want to hunt for R code examples, try GitHub.

Next, a USDA modeller, Lee McCoy, showed us a spatial modelling project looking at eelgrass distribution in wetlands. Not much to report from this one, but I did observe that the USDA was not as concerned about model drivers or mechanistic explanation of their modeled maps. The goal was accurate maps. Period. And if that's your research goal, great, but I felt it called for a slightly different approach than a master's student, a person trying to understand why certain spatial patterns form, might take. I can't yet explain how the approaches would differ, but, for example, in research modelling one often throws out one of a pair of highly correlated variables (e.g., temperature and precipitation). Mr. McCoy had highly correlated variables in his model and it didn't bother him in the least.

Next up was Robert Suryan from our very own Oregon State University, doing his home department of Fisheries and Wildlife proud. The main research question Mr. Suryan dealt with was modelling and predicting seabird habitat along the coast of the Pacific Northwest. The crux of his study was attempting to develop better predictor layers in support of modelling. For seabirds, chlorophyll is an important indicator variable, as it's a measurement of algal population. Those algae feed creatures on the lower trophic levels of the food web that ultimately support seabird populations. Remote sensing can pick up chlorophyll presence, and typically researchers use the base mean chlorophyll raster as a predictor layer. But … can we do better?

Persistence is an emerging hot topic in spatial modelling. Average chlorophyll is one thing, but food webs only truly get a chance to build on algal food bases if the algae persist long enough for complex food webs to develop. Using some fancy math techniques that I barely understood at the time, and certainly can't do justice to in explanation here, Mr. Suryan was able to construct a data layer of algal persistence from a time series of mean chlorophyll. The resultant dataset exhibited a different spatial pattern than mean chlorophyll, so it was spatially distinct (though related) from its parent data. And, lo and behold, models with the persistence layer outperformed the other models! Takeaway message: get creative about your base layers and consider whether they can be manipulated into new data representative of the processes you are trying to capture in your model.

Did you know the University of Oregon has an infographics lab? Megen Brittell, a University of Oregon graduate student, works in the lab and demonstrated how R can be used to make custom infographics. There's a moderate degree of coding knowledge and time investment needed for this work, but the payoff can be immense in terms of the ability to convey information to your audience. This would also be a good time to mention OSU's own Cartography and Geovisualization Group and the upcoming fall course: Algorithms for Geovisualization.

Whew. It was time for lunch at the conference. Big thanks to CCGISUG for giving away extra sandwiches to poor, starving grad attendees like myself. Free food in hand, it was time to go network. No, I didn't find my new dream job, but it was fascinating to hear more about the life of working geographers. Main complaint these folks had about their profession: not much time spent in the field. On the other hand, there are always interesting new problems to tackle. If you'd like to meet fellow GIS users, why not sign yourself up for the free-to-join Central Coast GIS Users Group (CCGISUG)?

Pat Clinton of the EPA followed up lunch as an emergency fill-in for a no-show presenter. He gave a talk about modelling eelgrass distribution in wetlands that mostly served to reiterate themes presented in Ms. Frazier's and Mr. McCoy's talks. Once again a model's predictive capability was valued most while placing lesser emphasis on mechanistic processes. I should take a moment to mention that these researchers don't disregard the "why?" of which parameters work best. It's more that once you get a map you feel confident with, especially after consultation with the experts, you're finished. Don't worry so much about a perfect AIC or which variables did exactly what. After all, all those predictor layers are in there for a reason – they have something to do with the phenomenon you're modelling, right? Not much else to report here except that I learned that a thalweg is the term for the centerline of a stream. It can be digitized, and one variable Mr. Clinton derived for his model was distance from the thalweg – another great example of getting creative in generating your own predictor layers through processing of existing data sources (in this case, the thalweg line itself).

Rebecca Flitcroft was next up with a talk about stream models. This had potential to be a specialized talk of little general interest, but instead Ms. Flitcroft wowed us all with a broad-based discussion of the very basics of what it means to perform spatial data analysis and how streams, which are networks, are in many ways quite different systems from the open, Cartesian grid area that we're all most familiar with. This is a tough one to discuss without visuals, but here's one example of how streams need to be treated uniquely: in a habitat patch, an animal can enter or exit from any direction, while in a stream a fish can only go upriver or downriver, and those directions are quite different in that one is against the flow and one is with it. Traditional geostats like nearest neighbor analysis are stuck in Euclidean distance and won't work on streams, where distance is all about how far down the network (which can wind any which way) one must travel. There was much discussion of the development of geostats for streams, and we learned that there is currently a dramatic lack of quality software to handle stream analysis (the FLOWS add-on to ArcMap is still the standard, but it only works in ArcMap 9 and wasn't ported to ArcMap 10).

Ms. Flitcroft also directly mentioned something that was an underlying theme of the whole conference: "The ecology should drive analysis, not the available statistics." I found this quote incredibly relevant to our work in GEO599 exploring the potential of the ArcMap Spatial Statistics tools.

Betsy Breyer joined Ms. Brittell as the only other student presenter of the day. Hailing from Portland State University, Ms. Breyer gave a talk that could have fit right in with GEO599 – she discussed geographically weighted regression (GWR) from the ArcMap Spatial Statistics toolbox! She took the tool to task, with the main message that it is good for data exploration but bad for any test of certainty. GWR uses a moving window to create regressions that weigh closer points more heavily than faraway points. The theory is that this will better tease out local variability in data. Some weaknesses of this method are that it can't distinguish stationarity and is susceptible to local multicollinearity problems (which aren't a problem in global models). It is recommended to play with the "distance band", essentially the size of that moving window, for best results. Takeaway – large (high density?) datasets are required for GWR to really play to its strength, which is picking up local differences in your data set. It's especially worth exploring this tool if your data is highly non-stationary.
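For those who'd rather experiment outside ArcMap, GWR is also available in R. A sketch with the spgwr package, where the data frame df, its response y, predictors x1/x2, and coordinate columns are all hypothetical:

library(spgwr)
bw  <- gwr.sel(y ~ x1 + x2, data = df, coords = cbind(df$X, df$Y))               # cross-validated bandwidth (the "distance band")
fit <- gwr(y ~ x1 + x2, data = df, coords = cbind(df$X, df$Y), bandwidth = bw)   # the local regressions
fit$SDF                                                                          # per-location coefficients, ready to map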

OSU Geography closed the conference out with a presentation by faculty member Jim Graham. If you haven't yet, it's worth knocking on Jim's very open office door to hear his well-informed thoughts on habitat modeling, coming from a coding / data manipulation background. Do it quickly though, because Jim is leaving to accept a tenure position at Humboldt State University this summer. For the geospatial conference, Dr. Graham discussed the ever-present problem of uncertainty in spatial analysis. He gave his always popular "why does GBIF species data think there's a polar bear in the Indian Ocean?" example as an opener for why spatial modeling must be ever watchful for error. Moreover, Dr. Graham discussed emergent work looking at how to quantify levels of uncertainty in maps. One option is running Monte Carlo simulations of one's model and seeing how much individual pixels deviate across the different model results. Another way to go is "jiggling" your points around spatially. It's possible there's spatial uncertainty in the points to begin with, and if Tobler's Law holds (things near each other are more alike than things farther apart), then a robust model shouldn't deviate much, right? Well, in a first attempt at this method, Dr. Graham found that his model of red snapper habitat in the Gulf of Mexico displayed a large amount of uncertainty along the habitat borders. This prompts some interesting questions about what habitat boundaries really mean biologically. Food for thought.

With that, the conference came to a close. We stacked chairs, cleaned up, and headed to Rogue Brewery for delicious craft beer and many jokes about optimizing table spacing for the seating of 20+ thirsty geographers. Some final thoughts:

1) Go learn R.

2) No, seriously, learn R. It's emerging as a powerful spatial statistics package at a time when ArcMap is focusing more on cloud computing and other non-data-analysis features (the Spatial Statistics Toolbox notwithstanding).

3) Spatial models should ultimately tell a story. Use the most pertinent data layers you find, and get creative in processing existing layers to make new, even better indicators of your variable of interest. Then get a model that looks right, even if it isn’t scoring the absolute best on AIC or other model measures.

4) There's a rather broad and large community of spatial statisticians out there. It's a field you might not hear much about in daily life, but there are over a hundred geographers just in Central Oregon. If you're looking for some work support once you leave GEO599, a.k.a. Researchers Anonymous, well, there are plenty of folks out there also cursing at rasters and vectors.

– Max Taylor, OSU master’s student of pedometrics

 

BONUS MAP: Many weeks late, here’s my map of hot spots for my project data – organic carbon, underlain by a model of organic carbon using SSURGO data

carbon_map

Over the last couple of weeks I have been working to better define my research focus:

A geographical approach to understanding how the local spatial structure of urban green space shapes the way in which communities evolve.  I hope to inform the Environmental Justice, Resilience Theory, and Adaptation literatures as well. (I anticipate adding to this and/or changing it entirely.)

Below is a diagram of the Land Use and Society Model, which represents the dynamic feedback process whereby a particular land use activity in the human/cultural circle may be modified by a new set of resource management signals issued from the legal/political circle in response to new awareness of the impacts of existing practices on the physical world.  I will use a version of the Land Use and Society Model to help sort out my thoughts and ideas about my research.  For example, the process of urbanization (the removal of native vegetation and implementation of impervious surfaces) has created environmental impacts on the microclimate within urban areas (i.e., the heat island effect); let's say that to mitigate this impact the state and local sectors enforce the implementation or modification of recreation areas/parks.  It is the enforcement of certain resource management regulations, and how they affect the social and economic components of this model, that interests me most.

Landuse_society_model

Below is an adapted model that I created which will focus on the cultural, social, and economic interactions as they relate to urban green space.

weems_landuse_society_model

I want to detect spatial changes in the social/economic composition and environmental benefits of communities over time. I will then quantify the change in urban green space spatial distribution and relate this back to access, in order to understand who has access and how that access has changed spatially and temporally.

I anticipate a number of scenarios/hypotheses to arise:

1. If ∆ in urban green space access > 0, then ∆ in social/economic composition and environmental benefit > 0

Hypothesis_1

If there is a positive change in urban green space access, then there will be a positive change in the social/economic composition and environmental benefit of the community as well.

2. If ∆ in urban green space access < 0, then ∆ in social/economic composition and environmental benefit < 0

Hypothesis_2

If there is a negative change in urban green space access, then there will be a negative change in the social/economic composition and environmental benefit of the community.

3. Alternative Hypothesis – If ∆ in urban green space access > 0, then ∆ in social/economic composition and environmental benefit < 0

Alternative_Hypothesis

If there is a positive change in urban green space access, then there will be a negative change in the social/economic composition and environmental benefit of the community.

Limitations:

– How will green space be formally defined?

I anticipate using a number of classifications for green space (park type, canopy coverage, greenness – NDVI), so I wonder how this will be further quantified.  Can I use an index?

– Measurement of Access

Proximity ≠ access

– Determining Migration

The data does not tell me where people go when they leave…

Can I detect the concept of “horizontal gentrification?”

In the North Pacific, humpback whales feed in various locations along the Pacific Rim including in the US, Canada, Russia and eastern Asia during summer. In winter, they migrate south to mate and calve along Pacific coasts as well as the offshore islands of Mexico, Hawaii, and Japan (including Ogasawara and Ryukyu Islands). Fidelity to feeding areas is high, and is thought to be maternally directed; mothers take their calves to their specific feeding ground, and these offspring subsequently return to this region each year after independence.

This maternally directed fidelity is reflected in studies of maternally inherited mitochondrial DNA (mtDNA). In an ocean-wide survey of genetic diversity and subsequent analysis of population structure in North Pacific humpback whales (Structure of Populations, Levels of Abundance, and Status of Humpbacks; SPLASH), sequencing of the mtDNA control region resolved 28 unique mtDNA haplotypes showing marked frequency differences among breeding grounds (overall FST=0.106, p<0.001, n=825) and among feeding regions (overall FST=0.179, p<0.001, n=1031; Baker et al. 2008).

Despite genetic evidence of regional population structure in the North Pacific (i.e. separation of humpback whales into various stocks), there have been few studies to investigate the possibility of finer-scale structure within a single North Pacific feeding ground. For example, it is unclear whether maternally directed site fidelity at smaller scales within southeastern Alaska results in discernible differences in haplotype and sex frequencies.

For my final investigation in this course, I decided to look at fine-scale population structure of humpback whales in southeastern Alaska by exploring spatial patterns in haplotype and sex distribution. Specifically, I wanted to answer the following questions:

  • Are haplotypes (A+, A-, E2) differentially distributed by latitude?
  • Are males and females differentially distributed by latitude?
  • Are certain maternal lineages more spatially clustered than others?
  • Are males or females more spatially clustered?
Methods and Results
First, I isolated haplotype and sex layers using the "split layer by attribute" tool in XTools Pro. I then went into Excel and produced latitude bins throughout southeastern Alaska (54.1-54.5, 54.6-55, 55.1-55.5, 55.6-56, 56.1-56.5, 56.6-57, 57.1-57.5, 57.6-58, 58.1-58.5, 58.6-59, 59.1-59.5). Next, I totaled the number of encounters of each class variable in each bin and calculated the percent of each class variable per bin.
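The binning step could equally be scripted in R. A sketch, assuming a data frame enc with lat and haplotype columns (hypothetical names):

enc$bin <- cut(enc$lat, breaks = seq(54, 59.5, by = 0.5))   # half-degree latitude bins
tab <- table(enc$bin, enc$haplotype)                        # encounters per bin per haplotype
pct <- prop.table(tab, margin = 2) * 100                    # percent of each haplotype by bin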
Haplotype Distribution:
Haplotypes_Split3Untitled
Sex Distribution:
Sex_SplitUntitledgraph2

It appears as though there is a peak in the percentage of sex and haplotype observations between 56.6-58.5 degrees. After looking closer at this, I realized that this peak is a function of my bin selection. After visualizing my population distribution within each bin, it is clear that most of my encounters occurred between 56.6-58.5 degrees. However, there are some patterns in the class variable percentages. For example, more A+ haplotypes are found near 58 degrees than A- and E2 haplotypes. Also, the E2 haplotype seems to be more represented at lower latitudes than A+ and A-. Males and females seem to be fairly similar in their latitudinal distribution.

Nearest Neighbor Analysis:

Screen Shot 2013-06-09 at 1.05.25 PM
All haplotype classes are significantly clustered. The E2 haplotype has the highest z-score and is therefore the least clustered. The A- haplotype appears to be most clustered with the lowest z-score. Based on the z-score, males appear to be more spatially clustered than females, although both are significantly clustered. A nearest neighbor ratio of 1 indicates that the observed mean distance is equal to the expected mean distance based on a random distribution. Smaller nearest neighbor ratios indicate a larger deviation from 1 and, therefore, a more clustered class variable. It should be noted that the study area varies for each class variable. In my analysis, I was not able to standardize the study area to make these comparisons more meaningful. I am curious to know how these values vary across a standardized study area and with equal sample sizes.

 

Lauren suggested that I use the above-mentioned tools. Here is what I learned about them from the ArcGIS 10 Help.

Generate Spatial Weights Matrix: Constructs a spatial weights matrix (.swm) file to represent the spatial relationships among features in a dataset.

Generate Network Spatial Weights: Constructs a spatial weights matrix file (.swm) using a Network dataset, defining feature spatial relationships in terms of the underlying network structure.

Note, you have to turn on Network Analyst Extensions to use this tool.

It seems like I have to manually assign the relationship of each network, which sounds like very cumbersome work, as there are more than 100,000 streams to deal with. I may be able to utilize fdr (the output of FlowDirection) to expedite the process.

Stay tuned!

Hello,

This post comes out of a series of discussions around the general topic of “What I’d really like to know about each Spatial Statistic tool?”. The current help section contains a wealth of information, but there are still some lingering questions:

1) In layman terms, what does each tool do?
– There's a great post in the help section on sample applications. But they're grouped by spatial question, and then a tool is listed. It'd be wonderful if a similar set of simple-to-understand examples were listed by tool. So, for example, I could look at the Incremental Spatial Autocorrelation explanation and read that it answers questions like "At which distances is spatial clustering most pronounced?"
2) What type of data sets are appropriate for use with a given tool and why?
   – Each of the tools is based on a mathematical statistic of some kind. You have thoughtfully included a Best Practices section along with most of the tools, but there's no mention of the reason for each suggestion. I realize this is a big ask, but if there were some explanation of the mathematical theory behind what will go wrong when best practices aren't followed, it would really help users gain a deeper understanding of tool results. For example, for Hot Spot analysis there's a suggestion to use 30 input features. But why? Is this because the tool is built off the principle of the Central Limit Theorem?
3) A tool-picking flowchart. There are so many great tools out there. What the above questions really come down to is "How do I pick a tool?". I'd love to be able to load up a flowchart that walks through my spatial question. Am I concerned with spatial patterns in and of themselves, or do I want to learn about the spatial distribution of values associated with features? Once I find several tools of interest, I'd like to read about their potential weaknesses. Will the tool vary greatly if I change the sample extent? Will strongly spatially clustered data skew results? Is zero inflation a problem? A lot of this is the responsibility of the user to figure out, but these are the types of questions we asked a lot in our class, which often worked with non-ideal data sets.

Thanks,
– Max Taylor