I would like to show you how to use a script tool within the ArcGIS toolbox to call R remotely, have it perform some function on your data, and return the results into a new shapefile projected onto your workspace in ArcMap. I won’t be doing that.
The scripting tool works. I have used it myself. But it does not work in the computer lab at school. The problem has to do with how R handles packages. When installing a package at school, R complains that students do not have permission to modify its main library, and automatically places new packages into a user-specific library. R has no trouble finding this user library when running on its own, but when called from inside ArcMap it nonsensically forgets where the user library lives on the S: drive. Considering that all spatial statistics in R require user-installed packages like maptools, there is no point in calling R and asking it to do anything if it can’t find its own library. So instead I am going to give a plain vanilla tutorial on how to open a shapefile in R, lay some statistics smackdown on it, and save the results in a new shapefile and some tables you can open in ArcMap. You might want to grab a cup of coffee; this is going to be boring.
If, for some reason, you actually want to know how to call R remotely using a python script tool within ArcGIS, I will leave a trail of breadcrumbs for you at the end of the tutorial. It is not that hard.
First, locate the folder “demo_geo599” within Geo599/Data/ and copy the folder to your student folder.
Open the program R, press Ctrl + O and navigate to your copy of the demo_geo599 folder. Open the file “demo_nb_GLM.r”.
Install the packages required for this demo using the code at the top of the file. Once the packages are installed, you still need to load them. Note that you can load packages using either the command “library” or “require”. Why there are two I couldn’t tell you, though apparently library() stops with an error if a package is missing, while require() just returns FALSE with a warning.
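In case you want to see the shape of that preamble, it looks something like the following. The package list comes from the ones used later in the demo; MASS is my assumption about where the negative binomial fit comes from.

    # install once; at school R will offer to use a personal library instead
    install.packages(c("maptools", "GGally", "texreg", "foreign", "MASS"))

    # load for this session
    library(maptools)   # readShapePoints(), writeSpatialShape()
    library(GGally)     # ggpairs() scatterplot matrix
    library(texreg)     # plotreg() coefficient plot
    library(foreign)    # write.dbf()
    library(MASS)       # glm.nb() negative binomial regression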
The next step in getting this demo working requires you to change a line of code. Where the file says “Set Working Directory”, below it you will find the command “setwd()”. Inside those parentheses you need to replace the file path with the path to the demo folder in your student folder on the S: drive. R interprets the file path as a string, so be sure to surround it with single or double quotation marks, which is how you declare a string in R. By running this command, you direct all output to this file path, and you can refer to file names within this path without spelling out the full path each time.
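For example (the path here is a placeholder; use your own, and note that R is happy with forward slashes on Windows):

    # point all input and output at your copy of the demo folder
    setwd("S:/your_student_folder/demo_geo599")
    getwd()  # confirm the change took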
The next line of code uses the package maptools to import the shapefile using the command “readShapePoints()”. This line of code will run so long as your file path specifies the demo folder. The shapefile “Local_Farms.shp” contains covariate data on some of the farms serving Corvallis Farmer’s Market, the ones that had responded to my survey at the time of making this demo. Examine the file contents using the “head()” command, and note that the shapefile data resembles an R data.frame. Although it includes a few additional properties, in many ways you can manipulate the shapefile data the same way you would a data.frame, which is how we will add the results of our regression to the data and output a new shapefile.
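The import boils down to a couple of lines (the object name is my own):

    # read the point shapefile from the working directory (maptools)
    farms <- readShapePoints("Local_Farms")

    head(farms@data)   # the attribute table looks and acts much like a data.frame
    class(farms)       # a SpatialPointsDataFrame, with a few extra slots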
I adapted this file from the script that python calls, as I mentioned above, so this next bit of code is a legacy of that adaptation. There are easier ways to specify a formula. In this case, I specify the names of the variables as strings, and then paste them together into a formula. Formulas in R follow the syntax y ~ x1 + x2 + x3.

Before we fit the model, we are going to perform some initial diagnostics. The package GGally includes a function called “ggpairs()”. This function makes a scatterplot of every variable in a data.frame against every other variable and presents the results in the lower triangle of a matrix; in the upper triangle it prints the correlation between the two variables. If you run the command alone, the results appear in the graphics window. Just above that line in the script is the command that prints the graph to a portable network graphic (.png), and the command “dev.off()” tells R that you are done specifying the contents of the .png file. While you likely have no interest in keeping a pretty picture of the output, this format is essential when you are calling R remotely from ArcGIS, because the R script runs in full and exits inside the python script, and you never see any of the graphs. So after I run the file remotely in ArcGIS, I open the working directory and examine the graphs. From the perspective of diagnostics, it is better to work in R directly. If you need to change the scale of the response variable or, you know, respond to the diagnostics in any way, you cannot do that when running the script remotely in ArcGIS.
To make sure that we only compare variables of interest, I preface the ggpairs() call with a few lines of code that build a data.frame containing only the variables we want to compare. The line “#var_tab$Acreage <- as.integer(var_tab$Acreage)” begins with a #, which is how you write comments in R; that means the line is “commented out”. We will be running a negative binomial regression, and R expects count data as the dependent variable. The regression still works with float data, but R will issue warnings, which you can safely ignore. If you do not want to see the warnings, simply drop the decimals by uncommenting this line and running it (note that as.integer() truncates rather than rounds).
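Stitched together, that stretch of the script looks roughly like this. The variable names are stand-ins for the ones in the demo:

    # build the model formula from strings
    dependentVar    <- "Acreage"
    independentVars <- c("miles", "farm_age", "pct_local")   # hypothetical names
    form <- as.formula(paste(dependentVar, "~",
                             paste(independentVars, collapse = " + ")))

    # keep only the variables of interest for the scatterplot matrix
    var_tab <- farms@data[, c(dependentVar, independentVars)]
    #var_tab$Acreage <- as.integer(var_tab$Acreage)   # uncomment to silence warnings

    # print the pairs plot to a .png instead of the graphics window
    png("pairs_plot.png", width = 800, height = 800)
    print(ggpairs(var_tab))
    dev.off()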
Note that I have commented out another line in the script: “#ct <- qt(.95, df=(nrow(var_tab) - length(independentVars)))”. I didn’t end up using it because the package I used to report coefficient results did not allow me to specify different confidence intervals using critical values, which is too bad. I do not have enough farms to justify the large-sample normal approximation, and should be using a t distribution to determine the confidence intervals.
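For what it’s worth, the idea behind that line would play out something like this (not in the demo; the arithmetic at the end is the generic recipe for any coefficient):

    # critical value from the t distribution; as written, the .95 quantile
    # gives a two-sided 90% interval
    ct <- qt(.95, df = nrow(var_tab) - length(independentVars))
    # an interval for a coefficient would then be: estimate +/- ct * std.error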
You may be wondering: why negative binomial regression? Well, “number of acres” is essentially count data. I have no reason to assume it is normally distributed around any particular mean acreage. In fact, I expect smaller farms to be more common than larger ones, so it is not unreasonable to hypothesize that the distribution of farms would match a Poisson or negative binomial distribution. It turns out the Poisson regression was over-dispersed, so the negative binomial is the appropriate distribution in this case. It seems to me that the Poisson is always over-dispersed, and I am getting into the habit of skipping straight to the negative binomial. For an interesting analysis of the suitability of different distributions for analyzing ecological data, I recommend a paper by Walter Stroup called “Rethinking the Analysis of Non-Normal Data in Plant and Soil Science”.
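If you want to see the over-dispersion for yourself, a quick check (not in the demo script) is to fit the Poisson and compare the residual deviance to its degrees of freedom; a ratio well above 1 signals over-dispersion:

    # Poisson fit on the same formula
    pois_fit <- glm(form, data = farms@data, family = poisson)
    deviance(pois_fit) / df.residual(pois_fit)   # much greater than 1 => over-dispersed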
Go ahead and run the regression and examine the results using the command “summary()”. Note that the spatial component of my analysis, “miles”, which represents the distance of the farm from the Corvallis Farmer’s Market in miles, is not significant. This is a shame, because I hypothesized that farms would become smaller the closer they were to the market. In particular, I believe there is a new business model for farms emerging, based on small acreages producing high quality food intended for direct-to-consumer local markets. What these results do show is that older farms tend to have larger acreage, and that the higher the proportion of production going to local markets, the smaller the acreage tends to be.
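Assuming the demo leans on MASS for the fit, those two steps look like:

    # negative binomial GLM; glm.nb() also estimates the dispersion parameter theta
    model <- glm.nb(form, data = farms@data)
    summary(model)   # check the coefficient on miles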
The package texreg formats model results for the LaTeX language, which many academics use for publishing because it specializes in typesetting formulas and symbols. The function “plotreg()” makes an attractive graphic of the coefficient results. The next few lines of code simply print a Q-Q plot and a plot of the residuals vs. fitted values, so we can check the residuals for normality and independence. The errors are reasonably normally distributed. You can see in the residuals vs. fitted plot that most of the farms are very small in size, but the errors seem independent of the farm size. Hard to tell with so few data points!
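Roughly, with file names of my own choosing:

    # coefficient plot straight to a file (texreg)
    plotreg(model, file = "coef_plot.png")

    # Q-Q plot of the residuals, with a reference line
    png("qq_plot.png")
    qqnorm(residuals(model)); qqline(residuals(model))
    dev.off()

    # residuals vs. fitted values, to eyeball independence
    png("resid_fitted.png")
    plot(fitted(model), residuals(model),
         xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)
    dev.off()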
Next we append two columns to the shapefile data: the residuals, and how many standard errors each residual deviates from the mean. This is the same syntax used to add columns to a data.frame. The function writeSpatialShape() resides in the maptools package and does exactly what it sounds like, depositing the new shapefile in the working directory. Then, since we cannot see the output from R when calling the script remotely in ArcGIS, the code prints the coefficient results to a .dbf file. The function write.dbf() is in the package foreign.
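In outline, with column and file names of my own (continuing from the sketches above):

    # append residuals and standardized residuals to the attribute table
    farms$resid   <- residuals(model)
    farms$std_res <- residuals(model) / sd(residuals(model))

    # write the augmented shapefile into the working directory (maptools)
    writeSpatialShape(farms, "Local_Farms_results")

    # and a .dbf of the coefficient table you can open in ArcMap (foreign)
    coef_tab <- as.data.frame(summary(model)$coefficients)
    names(coef_tab) <- c("estimate", "std_error", "z_value", "p_value")
    coef_tab$variable <- rownames(coef_tab)
    write.dbf(coef_tab, "nb_GLM_results.dbf")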
Information criteria are all the rage now, so the final part of the code makes a .dbf table that fits every possible combination of the independent variables from the original model against the dependent variable, and records the resulting AIC and BIC. Unless you have an abiding fascination with for and if loops, you can safely ignore the algorithm I use to work through every unique combination of independent variables. It stores all the combinations as vectors of numbers in a list. The code after it connects the variable names to the numbers and then pastes them together into formulas. This part of the code is only interesting if you work with list-type data and want to know how to extract and manipulate data contained in lists. Hint: use the command “unlist()”.
Use write.dbf again to produce a table you can load in ArcGIS, and feel free to run the line containing just the name of the table “diag_tab” to view the diagnostics table in R.
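If you would rather skip the loops entirely, combn() gets you the same table in a few lines. This is a compact stand-in for the demo’s algorithm, not the demo’s code, and it reuses the names from the sketches above:

    # every non-empty subset of the independent variables, pasted into formulas
    all_forms <- unlist(lapply(seq_along(independentVars), function(k) {
      apply(combn(independentVars, k), 2, function(v)
        paste(dependentVar, "~", paste(v, collapse = " + ")))
    }))

    # fit each candidate model once, then collect the information criteria
    fits <- lapply(all_forms, function(f) glm.nb(as.formula(f), data = farms@data))
    diag_tab <- data.frame(formula = all_forms,
                           AIC = sapply(fits, AIC),
                           BIC = sapply(fits, BIC))

    write.dbf(diag_tab, "diag_tab.dbf")
    diag_tab   # view the diagnostics table in R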
This concludes a poor tutorial in R.
If you are interested in how ArcGIS and R can talk to each other, here are a few tips to get you started.
I found this video by Mark Janikas describing a python script that calls R to perform a logit regression and return the results to a shapefile in ArcGIS. The video is not super helpful in explaining how the process works; its intended audience is a conference full of pro programmers who are hip to hacking lingo, which is not me. At the end, however, he refers the interested to this github site, which hosts the python and R scripts.
One important note: certain ArcGIS directories have to be added to the search path for Python. You can add these directories manually by modifying the path variable in the environment variables, following the instructions in the folder demo_geo599/R_Tools/Doc. However, I did this and it did not work, so I used a workaround: I add the paths using the code in the python file “pathlib.py”, which I include in the demo folder. This file and the file “arcpyWithR.py” need to be somewhere in the python search path for my example file to work. And yes, it will work on your home machine, just not in the lab. I have added a python file and R file titled “GLMwithR” to the folder R_Tools/Scripts.
If you open the python file, you will note that the first line reads “import pathlib”, which adds the necessary directories to the search path. Scrolling down, you will see the function “ARCPY.GetParameterAsText()” called six times. Each of these calls receives one piece of user input through the ArcGIS GUI. You can add the toolbox inside the R_Tools folder to your ArcGIS toolbox by right-clicking on the directory in the toolbox window, selecting “Add Toolbox…” and navigating to the “R Tools” toolbox in the R_Tools folder. I added my script to this toolbox by right-clicking on the box and selecting “Add -> Script…”. I would refer you to the help file on creating script tools. The hardest part is actually configuring the parameters. If you right-click on the “Negative Binomial GLM” tool and select “Properties”, you can view the parameters by selecting the “Parameter” tab.
Note that there are six parameters. These parameters need to be specified in the same order as they appear in the python script, and each parameter needs to be configured properly. In the Property field, you have to select the proper values for Type, MultiValue, Obtained from, etc., or the script will fail to work. It’s pretty annoying.
I don’t understand all the python code myself, but the middle portion passes all the inputs to R in a list called “args”. Feel free to open the R script “GLMwithR.r” in the folder R_Tools/Scripts and note the differences between it and the tutorial file we just worked through.
Python uses the functions “ARCPY.gp.GetParameterInfo()” and “ARCPY.SetParameterAsText()” to read back the new shapefile and the two .dbf tables, respectively, that come from the R script. The whole process of passing information between Python and R is still mysterious to me, so I can’t really shed any light on how or why the python script works.
That concludes my trail of breadcrumbs. Good luck!