After considerable experimentation with a variety of ArcGIS’s Spatial Statistics tools, including Hot Spot Analysis, Cluster Analysis, Spatial Autocorrelation, Geographically Weighted Regression, and Ordinary Least Squares, I think I may have found a viable method for analyzing my SSURGO Soils dataset. For my final class presentation for this course, I employed the Grouping Analysis tool to explore the spatial patterns and clusters of high clay content within the sub-AVAs of the northern Willamette Valley. The visual correspondence between the resulting groups and the soil orders (i.e. taxonomy) was surprisingly accurate.
Reading through the literature on ESRI’s webpage about Grouping Analysis, I learned that one should start the Grouping Analysis using one variable, incrementally adding more with each subsequent run of the analysis. Following suit, I have experimented with both the addition of more variables and the total number of groups for a given data set. While the soils present in each of the sub-AVAs are incredibly heterogeneous, they do share some similarities, particularly with regard to clay content and soil taxonomy.
The results published here reflect an analysis using percent clay content, Ksat, and Available Water Storage at 25 cm, 50 cm, and 150 cm as variables, with the data parsed into 5 groups. I also took advantage of the “Evaluate Optimal Number of Groups” parameter within the tool, which generates additional statistics meant to identify the number of groups that will most effectively separate one’s data set into distinct groups.
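To give a rough feel for what such a statistic measures, here is a minimal sketch that standardizes a few made-up records, clusters them with a plain k-means, and computes a Calinski-Harabasz pseudo F-statistic for several candidate numbers of groups. As I understand it, the tool reports a pseudo F-statistic of this kind, but this is not ESRI's implementation: the sample values, the function names, and the farthest-first seeding are all assumptions made for illustration.

```python
import math

def standardize(rows):
    """Z-score each column so variables on different scales weigh equally."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(c) / len(c) for c in zip(*points)]

def kmeans(rows, k, max_iter=100):
    """Plain Lloyd's k-means; returns one group label per row."""
    # Assumption: deterministic farthest-first seeding, chosen for repeatability.
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centers[j])) for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old center if a group empties out
                centers[j] = centroid(members)
    return labels

def pseudo_f(rows, labels):
    """Calinski-Harabasz index: between-group over within-group variance."""
    n, groups = len(rows), sorted(set(labels))
    k = len(groups)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 groups and fewer groups than rows")
    grand = centroid(rows)
    between = within = 0.0
    for g in groups:
        members = [r for r, lab in zip(rows, labels) if lab == g]
        c = centroid(members)
        between += len(members) * sq_dist(c, grand)
        within += sum(sq_dist(r, c) for r in members)
    if within == 0:
        return math.inf
    return (between / (k - 1)) / (within / (n - k))

# Made-up records: (percent clay, Ksat, Available Water Storage at 25 cm).
samples = [
    (15, 28.0, 4.0), (17, 26.0, 4.2), (16, 27.0, 4.1),
    (35, 9.0, 3.0), (37, 8.0, 3.1), (36, 10.0, 2.9),
    (55, 1.0, 2.0), (57, 2.0, 2.1), (56, 1.5, 1.9),
]
z = standardize(samples)
scores = {k: pseudo_f(z, kmeans(z, k)) for k in (2, 3, 4)}
for k, f in scores.items():
    print(f"{k} groups: pseudo F = {f:.1f}")
```

A larger pseudo F means the groups are tighter internally and farther apart from each other, which is the sense in which one number of groups can be called better than another.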
In addition, I generated Output Report Files with each run so that I could explore the statistical results in more depth. I’ve attached these for those of you who are interested in seeing what the results look like. I find it interesting that for all of my AVA data sets save one, the resulting reports suggest that 15 is the optimal number of groups. I’m not sure if this is because 15 is the maximum number of groups that the tool can generate, or if this is a result of the particular variables I am using as inputs.
Additional variables that I plan on adding include percent sand, percent silt, bulk density, percent organic matter, and parent material. I am also considering incorporating raster data sets of slope, aspect, landform, vegetation zone, precipitation, minimum temperature, and maximum temperature. Performing multiple iterations of the Grouping Analysis will help me to identify a suitable combination of these variables, as well as the optimal number of groups. Once those have been identified, I plan on performing the same analysis on each AVA, and then on buffered polygons of the AVAs at distances of 500m, 1000m, 1500m, 2000m, 2500m, and 3000m. In so doing, I hope to identify the degree to which different sub-AVAs in the northern Willamette Valley differ from directly adjacent landscapes. This will allow me to identify those sub-AVAs that best correspond to the underlying soil classes in those areas.
Doug,
Great job on this Grouping Analysis tool. I was wondering if you were going to be taking into account variables such as air quality or water quality (particulates etc.)? Could this have an effect on the quality of the AVA?
Thanks Candice!
I know that air and water quality could have an impact on the quality of the grapes themselves. Many vineyards post signs reminding people to drive slowly in order to minimize the amount of dust generated, particularly in the late growing season near harvest. The dust (and any particulate matter) tends to settle on the grapes, and the concern is that the dust then ends up in the wine when the grapes are harvested.
As far as the impact that air quality and/or water quality could have on a regional scale (i.e. at the AVA or sub-AVA level), I am not sure how one would effectively quantify the impact of those variables. Nonetheless, it goes without saying that most people wouldn’t want their wine made from grapes grown adjacent to a coal power plant, or irrigated using contaminated water. Your comment does remind me, however, of the issues related to increased soil salinity due to overdrawing the water table in places like California’s Central Valley.
-Doug
Doug,
I was just wondering whether you have tried to find more powerful computers to run your operations. When I used to work for a consulting firm, the person who was using SSURGO had multiple computers, so there was one very powerful computer dedicated to modifying/processing the SSURGO dataset.
Jim Graham knows people who have powerful computers. I thought Lawrence Sim’s computer or the computer sitting in his office was one of those.
Thanks Peggy!
One of the limitations in this analysis is my laptop, to be certain. For most of the heavy-lifting of my research, I have used “beefier” machines in Digital Earth, or downstairs in the Grad Lab. Unfortunately for me, none of those machines is running 10.1, and ArcGIS 10 does not have Grouping Analysis bundled with it. Perhaps I should ask Lawrence if I can borrow his machine for an evening run – great suggestion!
-Doug
Thanks for this post, Doug. I might try it on some of my data. I have a question: did you use a single shapefile with percent clay content, Ksat, and Available Water Storage attributes or did the tool require multiple raster/feature inputs?
Hi Kate,
The Grouping Analysis tool requires that all of the variables be within one shapefile. This was an issue that I did a poor job of conveying last week during my presentation. As you probably now know, the NRCS Soil Data Viewer extension for ArcGIS only outputs one variable at a time. In order to run Grouping Analysis, I had to run multiple joins between these respective output shapefiles (AWS25, AWS50, AWS150, Percent Clay, & Ksat) so that all of the required variables were in a single shapefile. I would advise caution when doing this, however, as there is great potential for mismatched records when performing that many joins. As an alternative, I think that I might take a cue from Max, and perform my own custom joins within the corresponding Access Databases, and then execute a single join to my shapefile.
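For anyone attempting the same multi-join workflow, here is a minimal sketch of the kind of sanity check I have in mind: join the single-variable tables on a shared key and report any keys that are duplicated or missing before trusting the result. The key values, the table names, and the `join_on_key` helper are hypothetical; this is plain Python working on attribute tables that have already been exported, not an ArcGIS or Soil Data Viewer function.

```python
def join_on_key(tables):
    """Join {variable: [(key, value), ...]} tables on their shared key.

    Returns (joined, problems): joined holds only keys present in every
    table; problems lists duplicate and missing keys per table.
    """
    lookups, problems = {}, {}
    for name, rows in tables.items():
        lookup, dupes = {}, set()
        for key, value in rows:
            if key in lookup:
                dupes.add(key)
            lookup[key] = value  # last value wins; the duplicate is reported
        lookups[name] = lookup
        if dupes:
            problems.setdefault(name, {})["duplicate_keys"] = sorted(dupes)

    all_keys = set().union(*lookups.values())
    common = set.intersection(*(set(lookup) for lookup in lookups.values()))
    for name, lookup in lookups.items():
        missing = sorted(all_keys - set(lookup))
        if missing:
            problems.setdefault(name, {})["missing_keys"] = missing

    joined = [{"key": k, **{n: lookups[n][k] for n in lookups}}
              for k in sorted(common)]
    return joined, problems

# Made-up records keyed by a hypothetical map unit identifier.
tables = {
    "PctClay": [("1001", 32.0), ("1002", 41.5), ("1003", 18.0)],
    "Ksat": [("1001", 9.0), ("1002", 2.8)],
    "AWS25": [("1001", 4.1), ("1002", 3.6), ("1003", 4.4), ("1003", 4.5)],
}
joined, problems = join_on_key(tables)
print(len(joined), "complete records")
print(problems)
```

Anything that shows up in `problems` is a record that would have been silently dropped or mismatched in a chain of joins, so it is worth resolving before running Grouping Analysis.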
Thanks for the great question!
-Doug
Hi Doug,
Those maps look great! They do look almost identical.
If I am correct you are trying to find a model that will be able to predict the sub-AVAs based on terrain characteristics, right? How will you determine when “close enough is good enough”? By this I mean, what criterion will you use to decide that you’ve reached the best model possible? A certain percentage of overlap between your predicted areas and the sub-AVAs?
Hi Noelia,
This is a tough, but very crucial question. To be honest, I am not sure what that threshold is or will be. I think that moving forward, it will be important for me to work closely with Julia and Jay Noller of Crop and Soil Science (both of whom are on my Committee) to scrutinize the statistics I’m generating in order to home in on the most suitable combination. Some of the variables I am currently using I know of from my work in the wine industry, others have been suggested by my committee members, and others I have gleaned from reviewing viticulture literature.
Thanks for the great comment and question!
-Doug
Nice work Doug,
A couple of things come to mind. First, I’m pretty sure Arc’s grouping analysis maxes out at 15, which is rather unfortunate. Ultimately, you are trying to determine how similar sub-AVAs are to each other, in order to refine regional classifications, correct? If so, you may want to explore canonical correspondence analysis (CCA), which I used for my project to group veg communities and is essentially a principal components analysis that retains categorical values. Rather than veg, you could group sub-AVAs based on soil and environmental characteristics. Then, using a cluster analysis (i.e. k-means), you could determine which AVAs should remain classified together. This would take you outside of Arc and into something like R. I can share my code with you if you’re interested in pursuing it.
Hi Kevin,
I agree it’s unfortunate that ArcGIS maxes out at 15 groups. Then again, this brings to mind the push-and-pull of lumping and splitting, especially with regard to categorical data. I wonder if anyone has developed a comparable Grouping Analysis code in R that allows for greater numbers of groups than 15?
I’m not familiar with CCA, but it sounds really compelling. I’ve been envisioning the use of Grouping Analysis in the same vein as a PCA, but in many instances retaining categorical values would be preferable. If you’re willing to share your CCA code (presumably in R?), I would be obliged!
Thanks for the great comments and suggestions!
-Doug
Nice work Doug!
One of my objectives is to find out how similar the encounter locations for melon-headed whales are and I will definitely look into this tool. I may be bugging you about your work in the near future!
Keep up the good work, especially if it means more wine!
Hi John,
Feel free to hit me up with any questions about the Grouping Analysis tool – I would be glad to share what little I know!
Thanks for the comment and, yes – more wine, please!
-Doug
Hey Doug, in your discussion above you state: “The visual correspondence between the resulting groups and the soil orders (i.e. taxonomy) was surprisingly accurate.”
Do you plan to extend the statistical analysis beyond the “visual correspondence” and quantify this relationship between soil orders and the groups? Good luck!
Hi Doug,
Good progress. If the Soil Taxonomy is based on multivariate classification of soil characteristics in the database you analyzed, your results should perfectly replicate the soil taxonomy map. But there are some differences. Can you create some error overlays showing the locations where (1) your multivariate classification combines polygons which are classified as separate soil orders in the Taxonomy, and (2) your multivariate classification splits polygons which are classified as a single order in the Taxonomy?
Julia
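One way to start on the last two comments, quantifying the match and locating where groups lump or split soil orders, is a simple cross-tabulation of group against soil order for every polygon. The sketch below is only an illustration under stated assumptions: the (group, soil order) pairs are made up, the function names are hypothetical, and the "purity" score (the share of polygons that fall in their group's majority soil order) is just one of several possible agreement measures.

```python
from collections import defaultdict

def crosstab(records):
    """Count polygons for every (group, soil order) pair."""
    table = defaultdict(lambda: defaultdict(int))
    for group, order in records:
        table[group][order] += 1
    return {g: dict(orders) for g, orders in table.items()}

def lumped_and_split(records):
    """Find groups that combine soil orders and orders split across groups."""
    table = crosstab(records)
    lumped = {g: sorted(orders) for g, orders in table.items() if len(orders) > 1}
    groups_by_order = defaultdict(set)
    for group, order in records:
        groups_by_order[order].add(group)
    split = {o: sorted(gs) for o, gs in groups_by_order.items() if len(gs) > 1}
    return lumped, split

def purity(records):
    """Share of polygons that fall in their group's majority soil order."""
    table = crosstab(records)
    return sum(max(orders.values()) for orders in table.values()) / len(records)

# Made-up (group number, soil order) pairs, one per polygon.
records = [
    (1, "Mollisols"), (1, "Mollisols"), (1, "Alfisols"),
    (2, "Ultisols"), (2, "Ultisols"),
    (3, "Alfisols"), (3, "Alfisols"),
    (4, "Inceptisols"),
]
lumped, split = lumped_and_split(records)
print("Groups combining several orders:", lumped)
print("Orders split across groups:", split)
print("Purity:", purity(records))
```

Polygons belonging to any entry in `lumped` or `split` are the candidates for the two error overlays, and the purity score puts a single number on the visual correspondence.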