I came into this quarter not yet having taken Dr. Jones' spatial statistics course (I plan to enroll next fall!) and without any data yet collected for my thesis. Still, it sounded like a great class concept, and I thought I'd use the opportunity to explore an interesting set of data: a legacy database of soil pedon measurements from the NCSS (National Cooperative Soil Survey).

The data wasn't in a form I could run in ArcMap; it was stuck in an Access database rather than being a raster or shapefile. But, no worries, right? I figured by about week 3 I could extract something from the dataset worth playing around with (I was also working on this dataset for GEO599, Big Data and Spatial Modeling). Well, you know how things take three times as long as you think they will? It took me until week 9 to finally create a set of mapped pedons of total organic carbon (OC) to 1 meter depth. Just producing a usable national dataset became the quarter's big project. I learned a great deal about SQL and Python coding, but not much about spatial stats.

Okay, that's my "GEO599 sob story". I don't want to get off too easy here since my project didn't come through, so for the rest of my final blog post I thought I'd present you all with something more exciting to read about: last May's geostatistics conference in Newport.

The following are some notes/observations of mine from the conference. Oh, one more thing: just for kicks, I decided to play catch-up and run Hot Spot Analysis on my (finally) created OC data. We already covered most everything about that tool, but I still wanted to do SOMETHING geospatial this term.

One final thing that I don't want to get buried in my report (I'll repeat it again later): the Central Coast GIS Users Group (CCGISUG) is free to join, so if you're looking to build your professional network, stop reading right now, head to the following link, and sign up!

http://www.orurisa.org/Default.aspx?pageId=459070

Central Oregon Coast Geospatial Statistics Symposium Report

On May 30th the Central Coast GIS Users Group (CCGISUG) held a conference on geospatial statistics. As a geographer currently pursuing a minor in statistics, I can say without embellishment that this was the most interesting conference I'd ever attended. Kriging, R code, maps everywhere! I was rapt and attentive for every talk that day, and if we're all being honest, when has that ever happened for you over a full conference day?

I arrived a few minutes late and missed the opening overview speech, but walked in just in time to grab a coffee and settle down for EPA wetland researcher Melanie Frazier's talk on using R for spatial interpolation. Setting a pro-R tone that would persist through the day, she praised R's rapidly growing geospatial capabilities and wowed us by kriging a map of wetland sand distribution in three lines of code. Some recommended R packages for you all to look into (with a small usage sketch after the list):

  1. sp – base package for vector data work in R
  2. rgdal – facilitates reading and writing of spatial objects (like shapefiles) in R
  3. raster – as the name says, tools for raster work
  4. gstat – interpolation modeling package
  5. plotKML – export R layers to kml for display in Google Earth
  6. colorspace – access to a world of color options
  7. mgcv – regression modeling package

Did you know it's ridiculously easy to create variograms in R? Ms. Frazier assured us three lines of code could do it. Here's an online example I found, and she's pretty spot on about the "three line" rule: most any single task can be done in three lines of R code. R also supports cross-validation and other goodies. With the variogram in hand, one can run regression kriging (in gstat? mgcv? sorry, I don't know exactly how she did it) and boom, raster interpolation maps made easy.
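If you want to see what that looks like in practice, here's a minimal sketch of my own using gstat's built-in meuse demo data (this is not Ms. Frazier's code, just an illustration of the general workflow):

  library(sp)
  library(gstat)

  data(meuse); coordinates(meuse) <- ~x+y          # demo point data
  data(meuse.grid); gridded(meuse.grid) <- TRUE    # prediction grid

  v    <- variogram(log(zinc) ~ 1, meuse)                  # empirical variogram
  vfit <- fit.variogram(v, vgm(1, "Sph", 900, 1))          # spherical model, rough initial values
  plot(v, vfit)

  # ordinary kriging onto the grid, then a quick map of the predictions
  k <- krige(log(zinc) ~ 1, meuse, meuse.grid, model = vfit)
  spplot(k["var1.pred"])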

One other note from that talk: Ms. Frazier cannot yet release her code to the public, but if you want to hunt for R code examples, try GitHub.

Next, a USDA modeller, Lee McCoy, showed us all a spatial modelling project looking at eelgrass distribution in wetlands. Not much to report from this one, but I did observe that the USDA was not as concerned about model drivers or mechanistic explanation of their modeled maps. The goal was accurate maps. Period. And if that's your research goal, great, but I felt it called for a slightly different approach than a master's student, a person trying to understand why certain spatial patterns formed, might take. I can't yet explain exactly how the approaches would differ, but for example, in research modelling one often throws out one of a pair of highly correlated variables (e.g., temperature and precipitation). Mr. McCoy had highly correlated variables in his model and it didn't bother him in the least.

Next up was Robert Suryan from our very own Oregon State University, doing his home department of Fisheries and Wildlife proud. The main research question Mr. Suryan dealt with was modelling and predicting seabird habitat along the coast of the Pacific Northwest. The crux of his study was attempting to develop better predictor layers in support of modelling. For seabirds, chlorophyll is an important indicator variable, as it's a measurement of algal population. Those algae feed creatures on the lower trophic levels of the food web that ultimately support seabird populations. Remote sensing can pick up chlorophyll presence, and typically researchers use the base mean chlorophyll raster as a predictor layer. But … can we do better?

Persistence is an emerging hot topic in spatial modelling. Average chlorophyll is one thing, but food webs only truly get a chance to build on algal food bases if the algae persist long enough for complex food webs to develop. Using some fancy math techniques that I barely understood at the time, and certainly can't do justice to in explanation here, Mr. Suryan was able to construct a data layer of algal persistence from a time series of mean chlorophyll. The resultant dataset exhibited a different spatial pattern than mean chlorophyll, so it was spatially distinct (though related) from its parent data. And, lo and behold, models with the persistence layer outperformed the other models! Takeaway message: get creative about your base layers and consider whether they can be manipulated into new data layers representative of the processes you are trying to capture in your model.
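I can't reproduce his actual method, but to illustrate the general idea, here's a rough sketch of one way to turn a chlorophyll time series into a persistence layer using the raster package (the folder, file names, and bloom threshold are all made up for illustration):

  library(raster)

  # hypothetical folder of monthly mean-chlorophyll rasters
  chl <- stack(list.files("chl_monthly", pattern = "\\.tif$", full.names = TRUE))

  # persistence here = fraction of time steps above some bloom threshold
  threshold <- 1.0   # arbitrary value, for illustration only
  persistence <- calc(chl, fun = function(x) mean(x > threshold, na.rm = TRUE))

  plot(persistence)
  writeRaster(persistence, "chl_persistence.tif", overwrite = TRUE)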

Did you know the University of Oregon has an infographics lab? Megen Brittell, a University of Oregon graduate student, works in the lab and demonstrated how R can be used to make custom infographics. There's a moderate degree of coding knowledge and time investment one needs for this work, but the payoff can be immense in terms of the ability to convey information to your audience. This would also be a good time to mention OSU's own Cartography and Geovisualization Group and the upcoming fall course: Algorithms for Geovisualization.

Whew. It was time for lunch at the conference. Big thanks to CCGISUG for giving away extra sandwiches to poor, starving grad attendees like myself. Free food in hand, it was time to go network. No, I didn't find my new dream job, but it was fascinating to hear more about the life of working geographers. The main complaint these folks had about their profession: not much time spent in the field. On the other hand, there are always interesting new problems to tackle. If you'd like to meet fellow GIS users, why not sign yourself up for the free-to-join Central Coast GIS Users Group (CCGISUG)?

Pat Clinton of the EPA followed up lunch as an emergency fill-in for a no-show presenter. He gave a talk about modelling eelgrass distribution in wetlands that mostly served to reiterate themes presented in Ms. Frazier's and Mr. McCoy's talks. Once again, a model's predictive capability was valued most, with less emphasis placed on mechanistic processes. I should take a moment to mention that these researchers don't disregard the "why?" of which parameters work best. It's more that once you get a map you feel confident with, especially after consultation with the experts, you're finished. Don't worry so much about a perfect AIC or which variables did exactly what. After all, all those predictor layers are in there for a reason – they have something to do with the phenomenon you're modelling, right? Not much else to report here except that I learned that a thalweg is the term for the centerline of a stream. It can be digitized, and one variable Mr. Clinton derived for his model was distance from the thalweg – another great example of getting creative in generating your own predictor layers through processing of existing data sources (in this case, the thalweg line itself).

Rebecca Flitcroft was next up with a talk about stream models. This had potential to be a specialized talk of little general interest, but instead Ms. Flitcroft wowed us all with a broad-based discussion of the very basics of what it means to perform spatial data analysis and how streams, which are networks, are in many ways quite different systems from the open, Cartesian grid areas that we're all most familiar with. This is a tough one to discuss without visuals, but here's one example of why streams need to be treated uniquely: with a habitat patch, an animal can enter or exit from any direction, while in a stream a fish can only go upriver or downriver, and those directions are quite different in that one is against the flow and one is with the flow. Traditional geostats like nearest neighbor analysis are stuck in Euclidean distance and won't work on streams, where distance is all about how far along the network (which can wind any which way) one must travel. There was much discussion of the development of geostats for streams, and we learned that there is currently a dramatic lack of quality software to handle stream analysis (the FLOWS add-on to ArcMap is still the standard, but it only works in ArcMap 9 and wasn't ported to ArcMap 10).

Ms. Flitcroft also directly mentioned something that was an underlying theme of the whole conference: "The ecology should drive analysis, not the available statistics." I found this quote incredibly relevant to our work in GEO599 exploring the potential of the ArcMap Spatial Statistics tools.

Betsy Breyer joined Ms. Brittell as the only other student presenter of the day. Hailing from Portland State University, Ms. Breyer gave a talk that could have fit right in with GEO599 – she discussed geographically weighted regression (GWR) from the ArcMap Spatial Statistics toolbox! She took the tool to task, with the main message that the tool is good for data exploration but bad for any type of test of certainty. GWR uses a moving window to create regressions that weigh closer points more heavily than far-away points. The theory is that this will better tease out local variability in data. Some weaknesses of this method are that it can't distinguish stationarity and is susceptible to local multicollinearity problems (which don't arise in global models). It is recommended to play with the "distance band", essentially the size of that moving window, for best results. Takeaway – large (high-density?) datasets are required for GWR to really play to its strength, which is picking up local differences in your data set. It's especially worth exploring this tool if your data is highly non-stationary.
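If you'd rather experiment with GWR outside of ArcMap, the R package spgwr offers a comparable workflow. Here's a minimal sketch, assuming a hypothetical SpatialPointsDataFrame called soil with made-up columns oc, elev, and precip:

  library(sp)
  library(spgwr)

  # choose an adaptive bandwidth (the fraction of points in each local window)
  bw <- gwr.sel(oc ~ elev + precip, data = soil, adapt = TRUE)

  # fit the geographically weighted regression with that bandwidth
  fit <- gwr(oc ~ elev + precip, data = soil, adapt = bw)
  fit   # prints ranges of the local coefficient estimates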

OSU Geography closed the conference out with a presentation by faculty member Jim Graham. If you haven't yet, it's worth knocking on Jim's very open office door to hear his well-informed thoughts on habitat modeling, coming from a coding / data manipulation background. Do it quickly, though, because Jim is leaving to accept a tenure position at Humboldt State University this summer. For the geospatial conference, Dr. Graham discussed the ever-present problem of uncertainty in spatial analysis. He gave his always-popular "Why does GBIF species data think there's a polar bear in the Indian Ocean?" example as an opener for why spatial modeling must be ever watchful for error. Moreover, Dr. Graham discussed emergent work looking at how to quantify levels of uncertainty in maps. One option is running Monte Carlo simulations of one's model and seeing how much individual pixels deviate across the different model results. Another way to go is "jiggling" your points around spatially. It's possible there's spatial uncertainty in the points to begin with, and if Tobler's Law holds (things near each other are more alike than things farther away), then your robust model shouldn't deviate much, right? Well, in a first attempt at this method, Dr. Graham found that the edges of his modeled red snapper habitat in the Gulf of Mexico displayed a large amount of uncertainty. This prompts some interesting questions about what habitat boundaries really mean biologically. Food for thought.
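I don't have Dr. Graham's code, but the flavor of the "jiggling" approach is: perturb the point coordinates a little, refit the model, and see how much each pixel's prediction wanders. A toy sketch in R, where occurrences, fit_model(), and the jiggle distance are all hypothetical stand-ins:

  library(raster)

  n_runs    <- 100
  jiggle_sd <- 250   # assumed positional uncertainty, in map units

  preds <- list()
  for (i in 1:n_runs) {
    pts   <- occurrences                      # hypothetical data frame with x and y columns
    pts$x <- pts$x + rnorm(nrow(pts), sd = jiggle_sd)
    pts$y <- pts$y + rnorm(nrow(pts), sd = jiggle_sd)
    preds[[i]] <- fit_model(pts)              # hypothetical function returning a prediction raster
  }

  # per-pixel standard deviation across runs = a map of model uncertainty
  uncertainty <- calc(stack(preds), sd)
  plot(uncertainty)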

With that, the conference came to a close. We stacked chairs, cleaned up, and headed to Rogue Brewery for delicious craft beer and many jokes about optimizing table spacing for the seating of 20+ thirsty geographers. Some final thoughts:

1) Go learn R.

2) No, seriously, learn R. It's emerging as a powerful spatial statistics package at a time when ArcMap is focusing more on cloud computing and other non-data-analysis features (the Spatial Statistics Toolbox notwithstanding).

3) Spatial models should ultimately tell a story. Use the most pertinent data layers you find, and get creative in processing existing layers to make new, even better indicators of your variable of interest. Then get a model that looks right, even if it isn’t scoring the absolute best on AIC or other model measures.

4) There's a rather broad and large community of spatial statisticians out there. It's a field you might not hear as much about in daily life, but there are over a hundred geographers just in Central Oregon. If you're looking for some work support once you leave GEO599, a.k.a. Researchers Anonymous, well, there are plenty of folks out there who are also cursing at rasters and vectors.

– Max Taylor, OSU master’s student of pedometrics


BONUS MAP: Many weeks late, here's my map of hot spots for my project data – organic carbon, underlain by a model of organic carbon using SSURGO data.

[Image: carbon_map]

Hello,

This post comes out of a series of discussions around the general topic of "What would I really like to know about each Spatial Statistics tool?" The current help section contains a wealth of information, but there are still some lingering questions:

1) In layman's terms, what does each tool do?
– There's a great post in the help section on sample applications, but the examples are grouped by spatial question, with a tool then listed. It'd be wonderful if a similar set of simple-to-understand examples was listed by tool. So, for example, I could look at the Incremental Spatial Autocorrelation explanation and read that it answers questions like "At which distances is spatial clustering most pronounced?"
2) What type of data sets are appropriate for use with a given tool and why?
   – Each of the tools is based on a mathematical statistic of some kind. You have thoughtfully included a Best Practices section along with most of the tools, but there's no mention of the reason for each suggestion. I realize this is a big ask, but if there were some explanation of the mathematical theory behind what will go wrong when best practices aren't followed, it would really help users gain a deeper understanding of tool results. For example, for Hot Spot Analysis there's a suggestion to use 30 input features. But why? Is this because the tool is built off the principle of the Central Limit Theorem?
3) A tool-picking flowchart. There are so many great tools out there. What the above questions really come down to is "How do I pick a tool?" I'd love to be able to load up a flowchart that helps me assess my spatial question. Am I concerned with spatial patterns in and of themselves, or do I want to learn about the spatial distribution of values associated with features? Once I find several tools of interest, I'd like to read about their potential weaknesses. Will the tool vary greatly if I change the sample extent? Will strongly spatially clustered data skew results? Is zero inflation a problem? A lot of this is the responsibility of the user to figure out, but it's these types of questions we're asking a lot in our class, which often works with non-ideal data sets.

Thanks,
– Max Taylor

EDITOR'S NOTE: The data used for this analysis contained spatially duplicated values, meaning there are many instances of two separate measurements taken from the same site. These values might be temporally different (the same site measured more than once on different dates), distinct but adjacent (different samples from the same field, close enough to have the same lat/long within rounding error), or true duplicates. The data were not cleaned for these errors prior to analysis.

Introduction

This week I fiddled around with the two Mapping Clusters tools in the Spatial Statistics toolbox. Here’s a rundown of each tool:

Cluster and Outlier Analysis

Given a set of weighted features, identifies statistically significant hot spots, cold spots, and spatial outliers using the Anselin Local Moran’s I statistic.

Hot Spot Analysis

Given a set of weighted features, identifies statistically significant hot spots and cold spots using the Getis-Ord Gi* statistic.

Methodology

I wanted to challenge each of these tools to see if they could cope with a common natural resources data phenomenon: zero-value inflation. The problem of a large proportion of zero values is common with data obtained from ecological studies involving counts of abundance, presence–absence, or occupancy rates. The zero values are real and cannot simply be tossed out, but because 0.0 is a singular value, it results in a major spike in the data distribution. This can adversely affect many common modeling techniques. The graphs below are from my data set of organic matter in Oregon soil pedons and have this zero-value complication:


At left, the full data set with an enormous spike in the first bar (caused by all the zero values). At right, the same data set with the almost 400 zero-value data points removed. It's still right-skewed, but looks like a more reasonable spread.

I ran each data set through both cluster analysis tools and visually compared results. There were noticeable differences between all four maps. In this "spot the difference" exercise, some major changes included: a hot spot in the northeast that appears only in the full data; the Willamette Valley shifting everywhere from cold to hot; and a stronger east–west trend in the no-zero data sets.
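For the record, I ran the tools in ArcMap, but the same comparison can be scripted in R: the spdep package implements the Getis-Ord Gi* statistic as localG(). Here's a rough sketch, assuming a hypothetical SpatialPointsDataFrame called pedons with an oc column:

  library(sp)
  library(spdep)

  run_gi <- function(pts, k = 8) {
    nb <- knn2nb(knearneigh(coordinates(pts), k = k))   # k-nearest-neighbour graph
    lw <- nb2listw(include.self(nb), style = "B")       # Gi* includes each point itself
    localG(pts$oc, lw)                                  # returns Gi* z-scores
  }

  gi_full   <- run_gi(pedons)                   # full dataset, zeros included
  gi_nozero <- run_gi(pedons[pedons$oc > 0, ])  # zero values removed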

Discussion

Clearly, having a large number of zero values in one's dataset will skew cluster analysis results. This was to be expected. In my data, the zero values were enough to overwhelm the generally "warm" signature in the Willamette Valley. Also, the values in the northeast were "significantly large" when balanced against a large zero data set, but "not significant" in the more balanced no-zero data set.

Regarding Hot Spot Analysis versus Cluster and Outlier Analysis, for this data set the Cluster and Outlier tool resulted in more stringent (smaller) hot and cold spots. Also, outlier values of cold within hot spots and hot within cold spots can be seen in the Cluster and Outlier results, but they vanish and in fact take the same label as the surrounding data in the Hot Spot Analysis (see below).


At left, a close-up of southwest Oregon in the Hot Spot analysis of the no-zero dataset. At right, the Cluster and Outlier analysis of the same area. Note the two dots in the middle that are classified as "hot" in the left image but "low, surrounded by high" in the right image.

Conclusion

I would be very wary of using these tools with zero-inflated and skewed data. This study was crude in that I completely removed zero values rather than re-casting them as low (but not zero) values. Nonetheless, there were drastic differences between all four maps in a large-sample (more than 1,000 points) dataset.

Between Hot Spot Analysis and Cluster and Outlier Analysis, for this data I would prefer the Cluster and Outlier tool. The same general hot spot pattern is maintained between the two, and it seems there is a trade-off between smoothness and information regarding outliers within hot and cold spots. For scientific mapping purposes it would seem very important to be aware of those outliers.


My research seeks to quantify and explain patterns of variability as they relate to specific soil properties (such as nutrients, physical structure, etc.). There are patterns in the data itself (distribution shapes such as normal or bimodal, skewness, and variance) and in the spatial distribution of those data values.

I wish to learn more about tools that can characterize these data distributions and spatial patterns (with emphasis on spatial patterns for this class). It is especially a challenge because soil variables often don't follow well-known distributions such as the Gaussian or exponential. This leaves me wary about the use of certain mathematical tools that require assumptions such as a normal distribution. The central limit theorem does not apply when we move beyond questions about the mean.

At this point I do not have a specific question in mind, and I should also mention I haven't collected any data for this project yet (I'm just starting lab analysis this spring). I do not have a specific spatial question; rather, I'd like to learn about various classification and interpolation methods.

Some background info on soil for the curious

A blurb from the Soil Science Society of America on the importance of soil:

Soil provides ecosystem services critical for life; soil acts as a water filter and a growing medium; provides habitat for billions of organisms, contributing to biodiversity; and supplies most of the antibiotics used to fight diseases. Humans use soil as a holding facility for solid waste, filter for wastewater, and foundation for our cities and towns. Finally, soil is the basis of our nation’s agroecosystems which provide us with feed, fiber, food and fuel.

On the source of soil variability:

Soil is HIGHLY heterogeneous. It is a mix of weathered rock minerals, plant organic matter, liquid, and gas, and it has been forming for thousands of years. A multitude of environmental variables affect that formation at spatial scales ranging from nanometer-scale bacterial interactions to climate varying across landscapes. The real challenge is that variability increases as a function of the spatial area under consideration. The variability of a 0.5 m × 0.5 m plot is different from that of a 5 m × 5 m plot, which is different from that of a 50 m × 50 m plot, and so on.

A graphical look at shifting soil scale and methods of characterizing variability

http://ars.els-cdn.com/content/image/1-s2.0-S0065211304850016-gr8.jpg


For my "first take" on the Spatial Statistics Resources blog, I learned more about the mathematical statistics contained within the tools of the Spatial Statistics toolbox. I quickly realized that the tools can be grouped by common mathematical principle. For example, all hot spot identification is done using something called the Getis-Ord Gi* statistic. Looking at the Desktop 10 Help website's list of sample applications, most tools are paired with an associated mathematical statistic (usually given in parentheses). For example:

Question: Is the data spatially correlated?

Tool: Spatial Autocorrelation (Global Moran’s I)

Some of the mathematical concepts I am fairly well acquainted with, like ordinary least squares. Others I had never heard of. The Getis-Ord statistic is one I’d never encountered before. I used one of my primary research tools, the internet, and found the statistic was developed in the mid-nineties by the method’s namesake statisticians.

Link to the 1995 paper on the Getis-Ord statistic
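These statistics aren't locked inside ArcMap, either: the R package spdep implements most of them, which makes it easier to poke at the underlying math. Here's a minimal sketch of the Global Moran's I test mentioned above, assuming a hypothetical SpatialPointsDataFrame called pts with a numeric column value:

  library(sp)
  library(spdep)

  nb <- knn2nb(knearneigh(coordinates(pts), k = 8))   # neighbours = 8 nearest points
  lw <- nb2listw(nb, style = "W")                     # row-standardised spatial weights

  moran.test(pts$value, lw)    # Global Moran's I: is the data spatially correlated?
  # localG(pts$value, lw)      # the local Getis-Ord Gi statistic is available too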

But one need not always consult the internet at large. ESRI provides some explanation of each tool in various articles scattered around the Spatial Statistics folder of the Desktop 10.0 Help. I've begun assembling a list below with the link for each mathematical principle/tool/statistic. I would like to learn about these statistics, what their strengths and weaknesses are, and especially when it is not appropriate to use them (what are the assumptions?).

List of Mathematical Principles/Statistics Underlying the Suite of Available Spatial Statistics

Analyzing Patterns:

How Multi-Distance Spatial Cluster Analysis (Ripley’s K-function) works

How Spatial Autocorrelation (Global Moran’s I) works

How High/Low Clustering (Getis-Ord General G) works

Mapping Clusters:

How Hot Spot Analysis (Getis-Ord Gi*) works

How Cluster and Outlier Analysis (Anselin Local Moran’s I) works

Measuring Geographic Distributions:

How Directional Distribution (Standard Deviational Ellipse) works

Modeling Spatial Relationships:

Geographically Weighted Regression (GWR) (Spatial Statistics)

Ordinary Least Squares (OLS) (Spatial Statistics)