Here is a simple ArcGIS tutorial on using the spatial statistics tools. Not much new here, but I found it useful and interesting because it shows how the tools can help inform analysis with other tools in the same toolset.

Here are the web links and full reference of the Isaak et al. (2014) paper I showed while going through the spatial stream network model tutorial in class on Monday. I (and many others) have been waiting a long time for the availability of methods such as these and now we have them!

SSN/STARS – Spatial Statistics on Stream Networks

OLD ESRI Blog post (tools are now updated for ArcGIS 10.2)

NorWeST Regional Stream Temperature

Daniel J. Isaak, Erin E. Peterson, Jay M. Ver Hoef, Seth J. Wenger, Jeffrey A. Falke, Christian E. Torgersen, Colin Sowder, E. Ashley Steel, Marie-Josee Fortin, Chris E. Jordan, Aaron S. Ruesch, Nicholas Som and Pascal Monestiez. 2014. Applications of spatial statistical network models to stream data. Wiley Interdisciplinary Reviews: Water. DOI: 10.1002/wat2.1023

[Image: STARS homepage header]

I would like to show you how to use a script tool within the ArcGIS toolbox to call R remotely, have it perform some function on your data, and return the results into a new shapefile projected onto your workspace in ArcMap.  I won’t be doing that.

The scripting tool works.  I have used it myself.  But it does not work in the computer lab at school.  The problem has to do with how R handles packages.  When installing a package at school, R complains that students do not have permission to modify its main library, and automatically places new packages into a user-specific library.  R has no trouble finding this user library when running on its own, but when called from inside ArcMap it nonsensically forgets where the user library is located on the S: drive.  Considering that nearly all spatial statistics in R require user-installed packages like maptools, there is no point in calling R and asking it to do anything if it can’t find its own library.  So instead I am going to give a plain vanilla tutorial on how to open a shapefile in R, lay some statistics smackdown on it, and save the results in a new shapefile and some tables you can open in ArcMap.  You might want to grab a cup of coffee; this is going to be boring.

If, for some reason, you actually want to know how to call R remotely using a python script tool within ArcGIS, I will leave a trail of breadcrumbs for you at the end of the tutorial.  It is not that hard.

First, locate the folder “demo_geo599” within Geo599/Data/ and copy the folder to your student folder.

Open the program R, press Ctrl + O and navigate to your copy of the demo_geo599 folder.  Open the file “demo_nb_GLM.r”.

Install the packages required for this demo using the code at the top of the file.  Once the packages are installed, you still need to load them.  Note that you can load packages using either the command “library()” or “require()”; why there are two, I don’t know.
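For reference, the install-and-load step looks something like the sketch below.  The exact package list is in the header of the demo script; the names here are just the ones mentioned in this tutorial, and the use of MASS for the negative binomial fit is my assumption.

install.packages(c("maptools", "GGally", "MASS", "texreg", "foreign"))  # one-time install
library(maptools)   # read and write shapefiles
library(GGally)     # ggpairs() scatterplot matrix
library(MASS)       # glm.nb() for the negative binomial model (assumed)
library(texreg)     # plotreg() coefficient plot
library(foreign)    # write.dbf() output tables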

The next step to getting this demo working requires you to change a line of code.  Where the file says “Set Working Directory”, below it you will find the command “setwd()”.  Inside those parentheses you need to replace the file path with the path to the demo folder in your student folder on the S: drive.  R interprets the file path as a string, so be sure to surround it with single or double quotation marks, which is how you declare a string in R.  By running this command, you direct all output to this file path, and you can refer to file names within this path without specifying the full path each time.
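For example, the line might end up looking something like this (the path below is hypothetical; substitute your own student folder):

setwd("S:/Geo599/your_student_folder/demo_geo599")   # hypothetical path; forward slashes work in R on Windows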

The next line of code uses the package maptools to import the shapefile using the command “readShapePoints()”.  This line of code will run so long as your file path specifies the demo folder.  The shapefile “Local_Farms.shp” contains covariate data on some of the farms serving Corvallis Farmer’s Market, the ones that had responded to my survey at the time of making this demo.  Examine the file contents using the “head()” command, and note that the shapefile data resembles an R data.frame.  Although it includes a few additional properties, in many ways you can manipulate the shapefile data the same way you would a data.frame, which is how we will add the results of our regression to the data and output a new shapefile.
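In other words, the import step boils down to something like this (the object name farms is my choice; the script may use a different one):

farms <- readShapePoints("Local_Farms.shp")   # maptools: returns a SpatialPointsDataFrame
head(farms@data)                              # first rows of the attribute table, data.frame style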

I adapted this file from the script that python calls, as I mentioned above, so this next bit of code is a legacy of that adaptation.  There are easier ways to specify a formula.  In this case, I specify the names of the variables as strings, and then paste them together into a formula.  Formulas in R follow the syntax:  y ~ x1 + x2 + x3.  Before we fit the model, we are going to perform some initial diagnostics.  The package GGally includes a function called “ggpairs()”.  This function makes a scatterplot of every variable in a data.frame against every other variable and presents the results in the lower triangle of a matrix.  In the upper triangle it prints the correlation between each pair of variables.  If you run the command alone, these results appear in the graphics window.  Above that line is the command that prints the graph to a portable network graphic (.png) file, and the command “dev.off()” tells R that you are done specifying the contents of the .png file.  While you likely have no interest in keeping a pretty picture of the output, this format is essential when you are calling R remotely from ArcGIS, because the R script runs in full and exits within the python script and you never see any of the graphs.  So after I run the file remotely in ArcGIS, I open the working directory and examine the graphs.  From the perspective of diagnostics, it is better to work in R directly.  If you need to change the scale of the response variable or, you know, respond to the diagnostics in any way, you can’t do that when running the script remotely in ArcGIS.
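The formula-building step looks roughly like this.  The variable names are illustrative (Acreage and miles appear later in this tutorial; the others are placeholders), so check the script for the actual column names:

dependentVar <- "Acreage"                                  # hypothetical column names
independentVars <- c("miles", "years_owned", "pct_local")
form <- as.formula(paste(dependentVar,
                         paste(independentVars, collapse = " + "),
                         sep = " ~ "))                     # e.g. Acreage ~ miles + years_owned + pct_local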

[Figure: nb_cov_test, the ggpairs() scatterplot matrix and correlations for the covariates]

To make sure that we only compare variables of interest, I preface the ggpairs() call with a few lines of code that build a data.frame containing only the variables we want to compare.  The line “#var_tab$Acreage <- as.integer(var_tab$Acreage)” begins with a #, which is how you specify comments in R.  That means this line is “commented out”.  We will be running a negative binomial regression, and R assumes you will provide count data as the dependent variable.  The regression still operates with float data, but R will issue warnings, which you can safely ignore.  If you do not want to see warnings, simply round off all the decimals by uncommenting this line and running it.
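Put together, the diagnostic step looks something like the following sketch (again, the column names are illustrative; the .png file name follows the figure above):

var_tab <- as.data.frame(farms)[, c(dependentVar, independentVars)]  # keep only the model variables
# var_tab$Acreage <- as.integer(var_tab$Acreage)   # uncomment to round to counts and silence warnings

png("nb_cov_test.png", width = 800, height = 800)  # send the next plot to a .png in the working directory
print(ggpairs(var_tab))                            # scatterplots in the lower triangle, correlations in the upper
dev.off()                                          # close the .png device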

Note that I have commented out another line in the script: #ct <- qt(.95, df=(nrow(var_tab) - length(independentVars))).  I didn’t end up using this because the package I used to report coefficient results did not allow me to specify different confidence intervals using critical values, which is too bad.  I do not have enough farms to justify standard normal critical values and should be using a t distribution to determine the confidence intervals.

You may be wondering: why negative binomial regression?  Well, “number of acres” is essentially count data.  I have no reason for assuming it to be normally distributed around any particular mean acreage.  In fact, I expect smaller farms to be more common than larger ones, so it is not unreasonable to hypothesize that the distribution of farms would match a Poisson or negative binomial distribution.  It turns out the Poisson regression was over-dispersed, so the negative binomial is the appropriate distribution in this case.  It seems to me that the Poisson is always over-dispersed, and I am getting into the habit of skipping straight to the negative binomial.  For an interesting analysis of the suitability of different distributions for analyzing ecological data, I recommend a paper by Walter Stroup called “Rethinking the Analysis of Non-Normal Data in Plant and Soil Science”.

Go ahead and run the regression and examine the results using the command “summary()”.  Note that the spatial component of my analysis, “miles”, which represents the distance of the farm from the Corvallis Farmer’s Market in miles, is not significant.  This is a shame because I hypothesized that farms would become smaller the closer they were to the market.  In particular, I believe that there is a new business model for farms emerging, based on small acreages producing high quality food intended for direct-to-consumer local markets.  These results show that older farms tend to have larger acreage, and that the higher the proportion of their production that goes to local markets, the smaller the acreage tends to be.
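The fit itself is one line.  A sketch, assuming (as above) that the script uses glm.nb() from MASS for the negative binomial model:

nb_fit <- glm.nb(form, data = var_tab)  # negative binomial GLM
summary(nb_fit)                         # coefficients, standard errors, and significance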

[Figure: nb_coefs, the plotreg() coefficient plot]

The package texreg is for formatting model results into the LaTeX language, which many academics use for publishing because it specializes in typesetting model formulas and symbols.  The function “plotreg()” makes an attractive graphic of the coefficient results.  The next few lines of code simply print a Q-Q plot and a plot of the residuals vs. fitted values, so we can check the model errors for normality and independence.  The errors are reasonably normally distributed.  You can see in the residuals vs. fitted plot that most of the farms are very small in size, but the errors seem independent of farm size.  Hard to tell with so few data points!
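A sketch of those plotting lines (the .png names follow the figure names in this post; the names in the script may differ):

png("nb_coefs.png"); plotreg(nb_fit); dev.off()         # texreg coefficient plot

png("nb_norm_test.png")
qqnorm(residuals(nb_fit)); qqline(residuals(nb_fit))    # Q-Q plot: normality of the errors
dev.off()

png("nb_indep_test.png")
plot(fitted(nb_fit), residuals(nb_fit),
     xlab = "Fitted values", ylab = "Residuals")        # residuals vs. fitted: independence
abline(h = 0, lty = 2)
dev.off()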

[Figures: nb_norm_test (Q-Q plot) and nb_indep_test (residuals vs. fitted values)]

Next we append to the shapefile data the residuals and the number of standard errors each residual deviates from the mean.  This is the same syntax used to add a column to a data.frame.  The function writeSpatialShape() resides in the maptools package and does exactly what it sounds like, depositing the new shapefile in the working directory.  Then, since we cannot see the output from R when calling the script remotely in ArcGIS, the code prints the results to a .dbf file.  The function write.dbf() is in the package foreign.
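A sketch of that output step (column and file names here are illustrative):

farms$resid <- residuals(nb_fit)                      # model residuals as a new attribute column
farms$resid_sd <- farms$resid / sd(farms$resid)       # residuals scaled by their standard deviation
writeSpatialShape(farms, "Local_Farms_nb")            # maptools: writes Local_Farms_nb.shp to the working directory
coef_tab <- data.frame(term = rownames(coef(summary(nb_fit))), coef(summary(nb_fit)))
write.dbf(coef_tab, "nb_results.dbf")                 # foreign: coefficient table you can open in ArcMap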

Information criteria are all the rage now, so the final part of the code makes a .dbf table that lists every possible combination of the independent variables fit against the dependent variable, with the resulting AIC and BIC.  Unless you have an abiding fascination with for and if loops, you can safely ignore the algorithm I use to permute through every unique combination of independent variables.  It stores all the combinations as vectors of numbers in a list.  The code after it connects the variable names to the numbers and then pastes them together into formulas.  This part of the code is only interesting if you work with list-type data and want to know how to extract and manipulate data contained in lists.  Hint: use the command “unlist()”.
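The demo file does this with nested for and if loops; a more compact sketch of the same idea using combn() (not the code in the file, and again assuming glm.nb()) would look like this:

diag_list <- list()
for (k in seq_along(independentVars)) {
  for (vars in combn(independentVars, k, simplify = FALSE)) {   # every unique combination of size k
    f <- as.formula(paste(dependentVar, paste(vars, collapse = " + "), sep = " ~ "))
    fit <- glm.nb(f, data = var_tab)
    diag_list[[length(diag_list) + 1]] <-
      data.frame(model = paste(deparse(f), collapse = " "), AIC = AIC(fit), BIC = BIC(fit))
  }
}
diag_tab <- do.call(rbind, diag_list)   # one row per candidate model
write.dbf(diag_tab, "diag_tab.dbf")     # table you can load in ArcGIS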

Use write.dbf() again to produce a table you can load in ArcGIS, and feel free to run the line containing just the name of the table, “diag_tab”, to view the diagnostics table in R.

[Figure: diag_tab, the model-comparison table of AIC and BIC values]

This concludes a poor tutorial in R.

If you are interested in how ArcGIS and R can talk to each other, here are a few tips to get you started.

I found this video by Mark Janikas describing a python script that calls R to perform a logit regression and return the results to a shapefile in ArcGIS.  The video is not super helpful in explaining how this process works; its intended audience is a conference full of pro programmers who are hip to hacking lingo, which is not me.  At the end, however, he refers interested viewers to this github site, which hosts the python and R scripts.

One important note: certain ArcGIS directories have to be added to the Python search path.  You can add these directories manually by modifying the path variable in the environment variables, following the instructions in the folder demo_geo599/R_Tools/Doc.  However, I did this and it did not work, so I used a workaround: I add the paths using the code in the python file “pathlib.py”, which I include in the demo folder.  This file and the file “arcpyWithR.py” need to be somewhere in the python search path for my example file to work.  And yes, it will work on your home machine, just not in the lab.  I have added a python file and an R file titled “GLMwithR” to the folder R_Tools/Scripts.

If you open the python file, you will note that the first line reads “import pathlib”, which adds the necessary directories to the search path.  Scrolling down, you will see the function “ARCPY.GetParameterAsText()” called six times.  Each of these calls receives user input through the ArcGIS GUI.  You can add the toolbox inside the R_Tools folder to your ArcGIS toolbox by right-clicking on the directory in the toolbox, selecting “Add Toolbox…” and navigating to the “R Tools” toolbox in the R_Tools folder.  I added my script to this toolbox by right-clicking on the box and selecting “Add -> Script…”.  I would refer you to the help file on creating script tools.  The hardest part is actually configuring the parameters.  If you right-click on the “Negative Binomial GLM” tool and select “Properties”, you can view the parameters by selecting the “Parameter” tab.

[Figure: the Parameter tab of the script tool properties]

Note that there are six parameters.  These parameters need to be specified in the same order as they appear in the python script, and each parameter needs to be configured properly.  In the Property field, you have to select the proper values for Type, MultiValue, Obtained from, etc., or the script will fail to work.  It’s pretty annoying.

I don’t understand all the python code myself, but the middle portion passes all the inputs to R in the list called “args”.  Feel free to open the R script “GLMwithR.r” in the folder R_Tools/Scripts and note the differences between it and the tutorial file we just worked through.

Python uses the functions “ARCPY.gp.GetParameterInfo()” and “ARCPY.SetParameterAsText()” to read back the new shapefile and the two .dbf tables, respectively, that come from the R script.  The whole process of sending info between Python and R is still mysterious to me, so I can’t really shed any light on how or why the python script works.

That concludes my trail of breadcrumbs.  Good luck!


The goal of this project was to use R to run statistical operations on data from ESRI shapefiles, and see about getting the results of these operations back into a format that can be used in ArcGIS.  This was my first experience using R, and I was surprised at how easy it was (or at least parts of it), though having experience with Python and Java helps.  This may be pretty basic for those who already have used R, but hopefully it will be a useful introduction for neophytes like myself.

An easy way to import ESRI shapefile data into R is through the use of the maptools package.  You can install the package in the R gui by selecting Packages>Install Package(s), selecting your mirror site (we have one at USA (OR)), and finding maptools in the list.  You’ll also need the rgl, spatstat, sm, and misc3d packages to run all the commands here.  Note that Digital Earth uses R 3.0.1, for which spatstat is not available, so some of these commands are not possible in the lab.
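If you prefer the console to the Packages menu, the same step is just (a sketch, with the mirror chosen when prompted):

install.packages(c("maptools", "rgl", "spatstat", "sm", "misc3d"))
library(maptools); library(rgl); library(spatstat); library(sm); library(misc3d)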

For this exercise, we’ll be using the Mount Hood earthquake data (hood3d.shp):

[Figure: HoodData2, the Mount Hood earthquake points (hood3d.shp)]

To read a shapefile, you can use one of the following commands in R, depending on the type of shapefile:

readShapeLines(file.choose())
readShapePoints(file.choose())
readShapePoly(file.choose())

The file.choose command is an alternative to hard-coding a file path and name; it will bring up a file selection dialog that you can use to interactively select your file.

Run the following command to create a SpatialPointsDataFrame object called shape, which will contain the geometry and attributes of the shapefile:

shape <- readShapePoints(file.choose())

You can view the data with the summary or attributes commands:
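For example (the output is shown in the figure below):

summary(shape)      # bounding box, projection info, and a summary of each attribute
attributes(shape)   # the underlying slots, including the attribute data and coordinates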

[Figure: shapeData, summary output showing the coords.x1, coords.x2, and coords.x3 columns]

Note that the X, Y, and Z coordinates are named here as coords.x1, coords.x2, and coords.x3.  We can assign each to a more intuitively-named vector (or list for those who are more used to Python):
x <- shape$coords.x1
y <- shape$coords.x2
z <- shape$coords.x3

Now we can do a number of things with these points.  One of the more fun options is to compute the 3d density of these points and create density isosurfaces.  This uses the kde3d (kernel density in 3d) and contour3d functions in the misc3d package.

Our data is in a UTM projection, which means that the units are meters, which means that the distances between points are large numbers.  I haven’t yet been able to get good results using the raw data in the kde3d function, but do get good results when dividing the input values by 1000:

mx <- x/1000
my <- y/1000
mz <- z/1000

Through trial and error, I got decent results using the following parameters:

with(shape, {
  # 3D kernel density estimate on the rescaled (km) coordinates
  d <- kde3d(mx, my, mz, h = 0.2, n = 150,
             lims = c(range(mx), range(my), range(mz)))
  # draw the isosurface at a density level of 0.02
  contour3d(d$d, 0.02, d$x, d$y, d$z,
            color = "red", color2 = "gray", scale = TRUE)
})

The result displays in an interactive window that can be zoomed and rotated:

[Figure: iso1, density isosurfaces in the interactive rgl window]

There is another package that depends on the misc3d package, called sm.  It has a density function that calls contour3d and configures some more advanced parameters behind the scenes.  It’s also somewhat simpler to use:

m <- cbind(x,y,z)
sm.density(m, panel=TRUE, panel.plot = TRUE)

Note that  the raw x, y, and z coordinates worked here, though it is necessary to create a matrix with them first.  The result:

[Figure: iso2, the sm.density() isosurface]

Notice the slider for the size of kernel used in the density calculation, as well as the bounding frame and 3d axes.  There are a number of arguments that can be used in both the density and contour3d functions to customize the appearance of the data and the layout.  By either using them or by writing your own R code, you could conceivably make 3d density plots to whatever level of sophistication you wish.

In addition to visualizing your data, you can perform whatever statistical analysis you would use with any other dataset in R.  For example, we can easily make histograms of depth or magnitude:

layout(matrix(c(1,2),1,2,byrow=TRUE))
hist(shape$Depth, breaks=20)
hist(shape$Magnitude, breaks=40)

The layout function allows for the creation of two side-by-side plots when given a matrix of one row and two columns; otherwise the second plot would overwrite the first:

[Figure: histGrams, side-by-side histograms of depth and magnitude]

Histograms can just as easily be made in ArcGIS, but there are much more complex data operations available in R.  For example, it is possible to approximate the above density surfaces back in ArcScene using nearest neighbor distances computed in R.

First, we make a 3d points object using the x, y, and z values as well as their ranges.  In this case, this object is called p3.  Then, use spatstat’s nndist.pp3 function with p3 as an argument to get a list of distances.  You can then send the result back to shape; simply referring to a column that doesn’t exist yet will create it:

p3 <- pp3(x, y, z, as.box3(range(x), range(y), range(z)))  # 3D point pattern in its bounding box
dists <- nndist.pp3(p3)                                    # nearest neighbor distance for each point
shape$nndist <- dists                                      # creates a new attribute column

The final bit of R code is very similar to the first:

writePointsShape(shape, file.choose())

This brings up the same dialog as before, which allows you to name the output file.  R will warn you that the file does not exist and ask you to make it, though you’ll need to make sure that you give the file a .shp extension.

If you look at the results of the summary function from earlier, you can see that the projection information did not get imported with the attribute data.  However, the coordinates are all the same as the original shapefile, so you can simply define the coordinate system to be the same as the original.  If you have the original loaded, its coordinate system will appear in the Layers list in the Spatial Reference selection dialog:

[Figure: define, the Spatial Reference selection dialog with the original layer's coordinate system]

The attribute table for this shapefile includes new columns, including nndist, the one we just added:

[Figure: nnAtt, the attribute table with the new nndist column]

You could use this new column to adjust the color or size of the points.  Although it is not possible, as far as I know, to create a 3D isosurface in ArcScene, you can approximate one by creating 3D buffers, which can use a field value as the radius.  Since nndist is small in clustered data, and we would prefer larger spheres for our clusters, it's best to create a new field calculated from the first.  In the table options menu, add a field and make it a float.  Right-click on the header after it is created, select Field Calculator, and enter your conversion.  After some trial and error, I found that the conversion I liked best was 1/nndist*20000.
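You could also compute this field in R before the writePointsShape() step above, instead of in the Field Calculator; a one-line sketch (the field name is my own):

shape$buffsize <- 1 / shape$nndist * 20000   # larger spheres where nearest neighbor distances are small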

Once you create the column, you can use the Buffer 3D tool found in 3D Analyst Tools> 3D Features to create the buffers.  You’ll want to use the size field rather than a constant value for distance:

[Figure: buffer, the Buffer 3D tool using the size field for distance]

Now the clusters can be viewed in ArcScene:

[Figure: HoodQuakesTrimmed, the buffered earthquake clusters viewed in ArcScene]

So, that’s about it for this project.  These operations were accomplished with very little code; most of the work is done by the packages and functions already in R, and it’s possible to make your own as well.  The most difficult part of this process is learning how to make the various functions work.  A great resource is the help(“f”) function, where f is a function you are trying to use.  Running this command will bring up the documentation for that function, usually with examples.

Dr. Narayanaraj kindly provided some example R code that served as the starting point for my research in this project.  I’ll upload the files to the Data folder in the S drive.

The seismic data used in this exercise is from the Advanced National Seismic System catalog, maintained by the University of California, Berkeley.


If you read my posts from last week you will have noticed that I had a mini-breakthrough in coming up with a method to seasonally adjust my input air pollution concentrations for my LUR model. This week I proceeded with using these seasonally adjusted annual concentrations in my LUR model. I predicted naphthalene concentrations at two sites in Eugene and compared the seasonally adjusted model with the unadjusted model. The results are below (and exciting!). The table shows how, for each monitoring location, my predictive model's accuracy improves markedly with the seasonal adjustments (e.g. a 14.6% vs. 22.6% difference from the observed value at one site, and 1.1% vs. 10.3% at the other). This is encouraging and provides preliminary evidence that modeling temporal variation will improve annual estimates of air pollution exposures.

Monitoring Site | LRAPA Observed | OLS + Seasonal Adjustment: Estimated Concentration (% Difference) | OLS Without Seasonal Adjustment: Estimated Concentration (% Difference)
Bethel Site | 0.0789 | 0.0674 (-14.6%) | 0.06112 (-22.6%)
Amazon Site | 0.0533 | 0.0527 (-1.1%) | 0.0478 (-10.3%)

So, I’ve been able to get a few more datasets from ENVISION, including flooding and erosion data for 1980 and every decade after that. The flooding and erosion risk was calculated in ENVISION using estimates of sea level rise from a National Research Council report specific to the Pacific Northwest. I am currently struggling to get the non-georeferenced tables of point data for the years 2000-2040 into mappable form so I can perform hotspot analysis, but I am sure I can figure out how to display them and/or join them to spatially referenced data eventually.

Future analyses include:

1. I was hoping to compare hotspot results from decade to decade, using some sort of tool that compares point layers and changes over time.

2. I’d like to integrate and collect the data from the 1980-2040 span (so 60 years) and see which areas will have the greatest flooding over that time. Could help with long-term planning.

3. Since flooding and erosion are expected to occur along the coast in all 60 years of data, how can I parse out the more interesting information a bit further inland and make the map outputs more useful (and not crowded by this dominant factor)? This question may resolve itself if I perform task 1, since no change will be shown in places that encounter flooding/erosion frequently.

4. Potentially take these analyses and perform them on non-spatial attributes of the data, e.g. property values, critical structures, etc.

5. Combine analyses with demographic data. Are the hotspots in areas of low income or other particular demographic communities?

Many things I can do with the dataset if I can figure out how to handle the data. Cheers!


Having finally collected a full list of farms serving the Corvallis Farmer’s Market, I have a fork in the road ahead as to what kind of data I shall collect and what type of spatial problem I shall analyze.  On Saturday I sent out a survey to the email accounts I had for the farms on my list, about 50 of the 114 farms.  As of today, 18 have responded.  That is a high rate of response for one email.  One trend I have noticed is that over the past decade a new business model has been emerging: small acreage, high quality, sustainably grown produce intended for local markets.  I could continue down this path by surveying the remaining farms over the phone and with additional rounds of emails.  My question would be whether farms nearer to Corvallis tend to be newer in ownership and smaller in size at a statistically significant level, indicating that Corvallis as an urban center is supporting and driving this new business model.  I would continue to collect information on local farms through the First Alternative Co-op and through local wholesale distributors.

An alternative strategy is to consider the Farmer’s Markets of other cities in the Willamette Valley.  Salem and Portland both have very detailed information on their vendors, and Eugene has somewhat less, with about 50 farms listed in total.  Then I can ask whether the size of a city, determined by the area of its city-limits polygon, is predictive of the number of farms serving its Farmer’s Market, or whether some cities have more Farmer’s Market vendors than others after accounting for the relative size of the urban centers.

The Extent of the Problem

To those of you who struggled to help me this last week with a “bug” in my points layer, I am happy to announce a fix for the problem.  The “extent” setting in my Geoprocessing Environment was set – why I don’t know – to an area much smaller than the file on which I was working.  I changed the extent to cover the full area of all layers in the file, and suddenly the excluded points in my point layer started responding to field calculations and tool actions.  What a pain!  Glad to have it over with.

Python R U Talking?

Some of you also know I have been wrestling with a Python script intended to take GIS info, call R to perform some function on it, and then import the results back into ArcGIS.  Well, I finally got it to work, but it wasn’t easy.  The R Toolbox that Mark Janikas made had some bugs in both the Python script and the R script, at least with my version of R, and debugging was trial and error.  But I am sure you will hear about it in my tutorial later.  Peace!


I proceeded with performing the seasonal adjustments I proposed at the end of my last blog post (see the “Temporal Data” post). Briefly, I used the ratios of the mean monthly naphthalene concentrations over the annual naphthalene concentration for all air monitoring data (i.e. both sites combined). These ratios were then used as adjustment factors to simulate monthly observations for the NATA data that I have. The simulated monthly data were then averaged over a year-long period to obtain the “seasonally adjusted” annual estimates.
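As a sketch of the arithmetic in R (the object names are hypothetical: lrapa_monthly is a vector of the twelve mean monthly LRAPA concentrations, lrapa_annual is the LRAPA annual concentration, and nata_annual is one NATA annual estimate):

adj_factor <- lrapa_monthly / lrapa_annual   # ratio of each monthly mean to the annual concentration
nata_monthly <- nata_annual * adj_factor     # simulated monthly observations for the NATA estimate
nata_adjusted <- mean(nata_monthly)          # averaged back to a "seasonally adjusted" annual estimate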

The graph below displays the seasonally adjusted versus the unadjusted annual naphthalene concentrations. The adjustment factors caused the annual estimates to increase slightly. These seasonally adjusted estimates will now be used as my simulated input data for estimating annual concentrations with my LUR model.

[Figure: seasonally adjusted versus unadjusted annual naphthalene concentrations]


There is a local R user group in Portland, Oregon.  R is a free software environment for statistical computing and graphics (http://www.r-project.org/).  The group’s goal is to support and share R experience and knowledge among users in the Portland community.  They would like to hear how you use and enjoy the R language and statistical programming environment.

For more information, visit http://www.meetup.com/portland-r-user-group/

I have been busy trying to figure out how exactly to incorporate seasonal variation of naphthalene ambient air concentrations into my Land Use Regression (LUR) model. To start off, I obtained time series data of naphthalene concentrations from the Lane Regional Air Protection Agency (LRAPA), just to see if there is indeed a significant amount of temporal variation. The graph below portrays the variation of mean monthly naphthalene concentrations for two air monitoring sites in Eugene (Amazon Park and Bethel) from April 2010 through April 2011.

[Figure: Presentation1, mean monthly naphthalene concentrations at the Amazon Park and Bethel monitoring sites, April 2010 through April 2011. Data: LRAPA]


While the graph above is somewhat informative, it does not quite describe the relationship in full. The graph below presents the data disaggregated by month, without averaging over the months, and indicates a great deal of variation both within and between months. This suggests a complex temporal pattern that would likely require a non-linear model. I therefore decided to use all of the data, rather than the aggregated means, to better characterize the temporal relationship (i.e. still retaining the within-month variation).

[Figure: each.obs, individual monthly naphthalene observations at both monitoring sites]

Next I modeled the change in monthly log-transformed naphthalene concentrations (naphthalene was clearly not normally distributed, and the transformation led to an approximately log-normal distribution) using month as a predictor and air monitoring location as an additional predictor. The seasonal predictor was fit with a smoothing function in a generalized additive model (GAM) in order to allow for the obvious non-linear trend between month and naphthalene concentration.
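A sketch of that model in R, assuming the mgcv package and hypothetical column names (month as a number from 1 to 12, site as a factor, and naph as the measured concentration):

library(mgcv)
lrapa$log_naph <- log(lrapa$naph)                         # log transform the skewed concentrations
gam_fit <- gam(log_naph ~ s(month) + site, data = lrapa)  # smooth seasonal term plus monitoring site
summary(gam_fit)                                          # significance of the smooth and the site term
plot(gam_fit)                                             # the fitted seasonal smooth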


The following graph displays the relationship between month and log naphthalene concentration, using the GAM function in R to allow for a non-linear trend. The smooth function proved to be the better fit: the visual trend is obviously non-linear, and the seasonal term was only significantly predictive once it was treated as non-linear (the linear model was not significant for season, again not surprisingly!). This work confirmed the notion of a complex relationship between pollutant concentration and month, even after adjusting for the air monitoring location.

As far as next steps are concerned, I was thinking of using this temporal GAM modeling approach for my LUR model. I would use the historical dataset from LRAPA to simulate temporal (i.e. monthly) observations for my NATA data set. If I am able to simulate this dataset, it would enable me to include time in my LUR model (using a GAM smooth for month), and therefore to see whether adding temporal relationships improves my predictive LUR model. I view this as a novel way of incorporating temporal variation as an explanatory variable. One approach could be taking the ratios of monthly to annual observed LRAPA naphthalene concentrations and multiplying these monthly ratios by the NATA annual estimates. The simulated data would then be the inputs for the LUR GAM model.

[Figure: GAM smooth of log naphthalene concentration by month]