{"id":1111,"date":"2014-05-18T17:32:05","date_gmt":"2014-05-19T00:32:05","guid":{"rendered":"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/?p=1111"},"modified":"2014-05-24T18:32:59","modified_gmt":"2014-05-25T01:32:59","slug":"poor-tutorial-r","status":"publish","type":"post","link":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/2014\/05\/18\/poor-tutorial-r\/","title":{"rendered":"A Poor Tutorial in R"},"content":{"rendered":"<p>I would like to show you how to use a script tool within the ArcGIS toolbox to call R remotely, have it perform some function on your data, and return the results into a new shapefile projected onto your workspace in ArcMap.\u00a0 I won&#8217;t be doing that.<\/p>\n<p>The scripting tool works.\u00a0 I have used it myself.\u00a0 But it does not work in the computer lab at school.\u00a0 The problem has to do with how R handles packages.\u00a0 When installing a package at school, R complains that students do not have permission to modify its main library, and automatically places new packages into a user-specific library.\u00a0 R does not have any trouble finding this user library when running on its own, but when calling R inside ArcMap it nonsensically forgets where the user-library is located on the S: drive.\u00a0 Considering all spatial statistics in R requires user-installed packages like maptools, there is no point to calling R and asking it to do anything if it can&#8217;t find its own library.\u00a0 So instead I am going to give a plain vanilla tutorial on how to open a shapefile in R, lay some statistics smackdown on it, and save the results in a new shapefile and some tables you can open in ArcMap.\u00a0 You might want to grab a cup of coffee, this is going to be boring.<\/p>\n<p>If, for some reason, you actually want to know how to call R remotely using a python script tool within ArcGIS, I will leave a trail of breadcrumbs for you at the end of the tutorial.\u00a0 It is not that hard.<\/p>\n<p>First, 
locate the folder &#8220;demo_geo599&#8221; within Geo599\/Data\/ and copy the folder to your student folder.<\/p>\n<p>Open the program R, press Ctrl + O and navigate to your copy of the demo_geo599 folder.\u00a0 Open the file &#8220;demo_nb_GLM.r&#8221;.<\/p>\n<p>Install the packages required for this demo using the code at the top of the file.\u00a0 Once the packages are installed, you still need to load them.\u00a0 Note that you can load packages using either the command &#8220;library&#8221; or &#8220;require&#8221;; why there are two, I don&#8217;t know.<\/p>\n<p>The next step to getting this demo working requires you to change a line of code.\u00a0 Where the file says &#8220;Set Working Directory&#8221;, below it you will find the command &#8220;setwd()&#8221;.\u00a0 Inside those parentheses you need to replace the file path with the path to the demo folder in your student folder on the S: drive.\u00a0 R interprets the file path as a string, so be sure to surround the file path with single or double quotation marks, which is how you declare a string in R.\u00a0 By running this command, you direct all output to this file path, and you can manipulate file names within this path without specifying the full path.<\/p>\n<p>The next line of code uses the package maptools to import the shapefile using the command &#8220;readShapePoints()&#8221;.\u00a0 This line of code will run so long as your file path specifies the demo folder.\u00a0 The shapefile &#8220;Local_Farms.shp&#8221; contains covariate data on some of the farms serving Corvallis Farmer&#8217;s Market, the ones that had responded to my survey at the time of making this demo.\u00a0 Examine the file contents using the &#8220;head()&#8221; command, and note that the shapefile data resembles an R data.frame.\u00a0 Although it includes a few additional properties, in many ways you can manipulate the shapefile data the same way you would a data.frame, which is how we will add the results of our regression to the data and output a new 
shapefile.<\/p>\n<p>I adapted this file from the script that python calls, as I mentioned above, so this next bit of code is a legacy from adapting that file.\u00a0 There are easier ways to specify a formula.\u00a0 In this case, I specify the names of the variables as strings, and then paste them together into a formula.\u00a0 Formulas in R follow a syntax:\u00a0 y ~ x1 + x2 + x3.\u00a0 Before we fit the model, we are going to perform some initial diagnostics.\u00a0 The package GGally includes a function called &#8220;ggpairs()&#8221;.\u00a0 This function makes a scatterplot of every variable in a data.frame against every other variable and presents the results in the lower triangle of a matrix.\u00a0 In the upper triangle it prints the correlation between the two variables.\u00a0 If you run the command alone, these results will appear in the graphic window.\u00a0 Above that line is the command you use to print the graph as a portable network graphic (.png), and the command &#8220;dev.off()&#8221; tells R that you are done specifying the contents of the .png file.\u00a0 While you likely have no interest in keeping a pretty picture of the output, this format is essential when you are calling R remotely from ArcGIS.\u00a0 This is because the R script runs in full and exits within the python script and you can&#8217;t see any of the graphs.\u00a0 So after I run the file remotely in ArcGIS, I open the working directory and examine the graphs.\u00a0 From the perspective of diagnostics, it is better to work in R directly.\u00a0 If you need to change the scale of the response variable or, you know, respond to the diagnostics in any way, you can&#8217;t do that when running the script remotely in ArcGIS.<\/p>\n<p><a href=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2014\/05\/nb_cov_test.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1113\" alt=\"nb_cov_test\" 
src=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2014\/05\/nb_cov_test.png\" width=\"480\" height=\"480\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/nb_cov_test.png 480w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/nb_cov_test-150x150.png 150w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/nb_cov_test-300x300.png 300w\" sizes=\"auto, (max-width: 480px) 100vw, 480px\" \/><\/a><\/p>\n<p>To make sure that we only compare variables of interest, I preface the ggpairs() call with a few lines of code that build a data.frame containing only the variables we want to compare.\u00a0 The line &#8220;#var_tab$Acreage &lt;- as.integer(var_tab$Acreage)&#8221; begins with a #, which is how you specify comments in R.\u00a0 That means this line is &#8220;commented out&#8221;.\u00a0 We will be running a negative binomial regression, and R assumes you will provide count data as the dependent variable.\u00a0 The regression still operates with float data, but R will issue warnings, which you can safely ignore.\u00a0 If you do not want to see warnings, simply round off all the decimals by uncommenting this line and running it.<\/p>\n<p>Note that I have commented out another line in the script: #ct &lt;- qt(.95, df=(nrow(var_tab) - length(independentVars))).\u00a0 I didn&#8217;t end up using this because the package I used to report coefficient results did not allow me to specify different confidence intervals using critical values, which is too bad.\u00a0 I do not have enough farms to justify a .95 probability and should be using a t distribution to determine confidence intervals.<\/p>\n<p>You may be wondering: why negative binomial regression?\u00a0 Well, &#8220;number of acres&#8221; is essentially count data.\u00a0 I have no reason for assuming it to be normally distributed around any particular mean acreage.\u00a0 In fact, I 
expect smaller farms to be more common than larger ones, so it is not unreasonable to hypothesize that the distribution of farms would match a Poisson or negative binomial distribution.\u00a0 Turns out, the Poisson regression was over-dispersed, so the negative binomial is the appropriate distribution in this case.\u00a0 It seems to me that the Poisson is always over-dispersed, and I am getting into the habit of skipping straight to negative binomial.\u00a0 For an interesting analysis of the suitability of different distributions for analyzing ecological data, I recommend a paper by Walter Stroup called &#8220;Rethinking the Analysis of Non-Normal Data in Plant and Soil Science&#8221;.<\/p>\n<p>Go ahead and run the regression and examine the results using the command &#8220;summary()&#8221;.\u00a0 Note that the spatial component of my analysis &#8220;miles&#8221;, which represents the distance of the farm from the Corvallis Farmer&#8217;s Market in miles, is not significant.\u00a0 This is a shame because I hypothesized that farms would become smaller the closer they were to the market.\u00a0 In particular, I believe that there is a new business model for farms emerging based on small acreages producing high quality food intended for direct-to-consumer local markets.\u00a0 These results show that older farms tend to have larger acreage, and the higher the proportion of their production that goes to local markets, the smaller the acreage tends to be.<\/p>\n<p><a href=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2014\/05\/nb_coefs.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1112\" title=\"\" alt=\"nb_coefs\" src=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2014\/05\/nb_coefs.png\" width=\"480\" height=\"480\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/nb_coefs.png 480w, 
https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/nb_coefs-150x150.png 150w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/nb_coefs-300x300.png 300w\" sizes=\"auto, (max-width: 480px) 100vw, 480px\" \/><\/a><\/p>\n<p>The package texreg is for formatting model results into the LaTeX language, which many academics use for publishing because it specializes in formatting model formulas and symbols.\u00a0 The function &#8220;plotreg()&#8221; makes an attractive graphic of the coefficient results.\u00a0 The next few lines of code simply print a Q-Q plot and a plot of the residuals vs. fitted values, so we can check the residuals for normality and independence.\u00a0 The errors are reasonably normally distributed.\u00a0 You can see in the residuals vs. fitted plot that most of the farms are very small in size, but the errors seem independent of the farm size.\u00a0 Hard to tell with so few data points!<\/p>\n<p><a href=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2014\/05\/nb_norm_test.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1115\" title=\"\" alt=\"nb_norm_test\" src=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2014\/05\/nb_norm_test.png\" width=\"480\" height=\"480\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/nb_norm_test.png 480w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/nb_norm_test-150x150.png 150w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/nb_norm_test-300x300.png 300w\" sizes=\"auto, (max-width: 480px) 100vw, 480px\" \/><\/a> <a href=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2014\/05\/nb_indep_test.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1114\" title=\"\" alt=\"nb_indep_test\" 
src=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2014\/05\/nb_indep_test.png\" width=\"480\" height=\"480\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/nb_indep_test.png 480w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/nb_indep_test-150x150.png 150w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/nb_indep_test-300x300.png 300w\" sizes=\"auto, (max-width: 480px) 100vw, 480px\" \/><\/a><\/p>\n<p>Next we append the residuals and how many standard errors each residual deviates from the mean to the shapefile data.\u00a0 This is the same syntax used to add data to a data.frame.\u00a0 The function writeSpatialShape() resides in the maptools package and does exactly what it sounds like, depositing the new shapefile in the working directory.\u00a0 Then, since we cannot see the output from R when calling the script remotely in ArcGIS, the code prints the results in a .dbf file.\u00a0 The function write.dbf() is in the package foreign.<\/p>\n<p>Information criteria are all the rage now, so the final part of the code makes a .dbf table that includes every possible combination of independent variables included in the original model fit against the dependent variable, with the resulting AIC and BIC.\u00a0 Unless you have an abiding fascination with for loops and if statements, you can safely ignore the algorithm I use to permute through every unique combination of independent variables.\u00a0 It stores all the combinations as vectors of numbers in a list.\u00a0 The code after it connects the variable names to the numbers and then pastes them together into formulas.\u00a0 This part of the code is only interesting if you work with list type data and want to know how to extract and manipulate data contained in lists.\u00a0 Hint: use the command &#8220;unlist()&#8221;.<\/p>\n<p>Use write.dbf again to produce a table you can load in ArcGIS, and 
feel free to run the line containing just the name of the table &#8220;diag_tab&#8221; to view the diagnostics table in R.<a href=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2014\/05\/diag_tab.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1116\" title=\"\" alt=\"diag_tab\" src=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2014\/05\/diag_tab.png\" width=\"946\" height=\"274\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/diag_tab.png 946w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/diag_tab-300x86.png 300w\" sizes=\"auto, (max-width: 946px) 100vw, 946px\" \/><\/a><\/p>\n<p>This concludes a poor tutorial in R.<\/p>\n<p>If you are interested in how ArcGIS and R can talk to each other, here are a few tips to get you started.<\/p>\n<p>I found <a href=\"http:\/\/video.esri.com\/watch\/1925\/integrating-open_dash_source-statistical-packages-with-arcgis\">this video by Mark Janikas<\/a> describing a python script that calls R to perform a logit regression and return the results to a shapefile in ArcGIS.\u00a0 The video is not super helpful in explaining how this process works; its intended audience is a conference full of pro programmers who are hip to hacking lingo, which is not me.\u00a0 At the end, however, he refers the interested to <a href=\"https:\/\/github.com\/Esri\/R-toolbox-py\">this github site<\/a> that hosts the python and R scripts.<\/p>\n<p>One important note: certain ArcGIS directories have to be added to the search path for Python.\u00a0 You can manually add these directories to the search path by modifying the path variable in the environment variables according to the instructions in the folder demo_geo599\/R_Tools\/Doc.\u00a0 However, I did this and it did not work, so I used a workaround.\u00a0 I manually add the paths using the code in the python file &#8220;pathlib.py&#8221; which 
I include in the demo folder.\u00a0 This file and the file &#8220;arcpyWithR.py&#8221; need to be in the python search path somewhere for my example file to work.\u00a0 And yes, it will work on your home machine, just not in the lab.\u00a0 I have added a python file and R file titled &#8220;GLMwithR&#8221; to the folder R_Tools\/Scripts.<\/p>\n<p>If you open the python file, you will note the first line reads &#8220;import pathlib&#8221;, which adds the necessary directories to the search path.\u00a0 Scrolling down, you will see the function &#8220;ARCPY.GetParameterAsText()&#8221; called six times.\u00a0 Each of these commands receives user input through the ArcGIS GUI.\u00a0 You can add the toolbox inside the R_Tools folder to your ArcGIS toolbox by right-clicking on the directory in the toolbox, selecting &#8220;Add Toolbox&#8230;&#8221; and navigating to the &#8220;R Tools&#8221; toolbox in the R_Tools folder.\u00a0 I added my script to this toolbox by right-clicking on the box and selecting &#8220;Add -&gt; Script&#8230;&#8221;.\u00a0 I would refer you to the help file on <a href=\"http:\/\/help.arcgis.com\/en\/arcgisdesktop\/10.0\/help\/index.html#\/\/001500000006000000.htm\">creating script tools<\/a>.\u00a0 The hardest part is actually configuring the parameters.\u00a0 If you right-click on the &#8220;Negative Binomial GLM&#8221; tool and select &#8220;Properties&#8221;, you can view the parameters by selecting the &#8220;Parameter&#8221; tab.<\/p>\n<p><a href=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2014\/05\/parameters.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1118\" title=\"\" alt=\"parameters\" src=\"http:\/\/blogs.oregonstate.edu\/geo599spatialstatistics\/files\/2014\/05\/parameters.png\" width=\"442\" height=\"575\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/parameters.png 442w, 
https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/1572\/files\/2014\/05\/parameters-230x300.png 230w\" sizes=\"auto, (max-width: 442px) 100vw, 442px\" \/><\/a><\/p>\n<p>Note that there are six parameters.\u00a0 These parameters need to be specified in the same order as they appear in the python script, and each parameter needs to be configured properly.\u00a0 In the Property field, you have to select the proper values for Type, MultiValue, Obtained from, etc., or the script will fail to work.\u00a0 It&#8217;s pretty annoying.<\/p>\n<p>I don&#8217;t understand all the python code myself, but the middle portion passes all the inputs to R in the list called &#8220;args&#8221;.\u00a0 Feel free to open the R script &#8220;GLMwithR.r&#8221; in the folder R_Tools\/Scripts and note the differences between it and the tutorial file we just worked through.<\/p>\n<p>Python uses the functions &#8220;ARCPY.gp.GetParameterInfo()&#8221; and &#8220;ARCPY.SetParameterAsText()&#8221; to read the new shapefile and the two .dbf tables respectively that come from the R script.\u00a0 The whole process of sending info between Python and R is still mysterious to me, so I can&#8217;t really shed any light on how or why the python script works.<\/p>\n<p>That concludes my trail of breadcrumbs.\u00a0 Good luck!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I would like to show you how to use a script tool within the ArcGIS toolbox to call R remotely, have it perform some function on your data, and return the results into a new shapefile projected onto your workspace in ArcMap.\u00a0 I won&#8217;t be doing that. 
The scripting tool works.\u00a0 I have used it&hellip; <a href=\"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/2014\/05\/18\/poor-tutorial-r\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":5803,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[257],"tags":[],"class_list":["post-1111","post","type-post","status-publish","format-standard","hentry","category-tutorials"],"_links":{"self":[{"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/posts\/1111","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/users\/5803"}],"replies":[{"embeddable":true,"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/comments?post=1111"}],"version-history":[{"count":8,"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/posts\/1111\/revisions"}],"predecessor-version":[{"id":1136,"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/posts\/1111\/revisions\/1136"}],"wp:attachment":[{"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/media?parent=1111"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/categories?post=1111"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dev.blogs.oregonstate.edu\/geo599spatialstatistics\/wp-json\/wp\/v2\/tags?post=1111"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}