First of all, I would like to say I learned about Ordinary Least Squares (OLS) and Geographically Weighted Regression (GWR) tools from ArcGIS help 9.3, 10.0, and 10.1. This final report is based on the knowledge I gained through those help menus.
My learning goal in this class was to learn the ArcGIS Spatial Statistics tools that could help answer my project questions. For my project, I am examining whether the sources of dissolved organic matter (DOM) are controlled by factors such as land use type, stream order, other water quality parameters, and seasonal flow patterns. At the beginning of the quarter, I planned to use SSN & STARS and FLoWS, which are add-in spatial statistical tools for stream networks; however, they did not work because they were built for ArcGIS 9.3.
I moved on to GWR, which is one of the Spatial Statistics tools. These are the steps and tools I used to run GWR.
- OLS to select independent variables
- Spatial Autocorrelation (Moran’s I) on residuals to make sure that I did not miss any significant independent variables
- GWR
- Spatial Autocorrelation (Moran’s I)
Because I have not yet reached the point of identifying different DOM sources, I used dissolved organic carbon (DOC) concentration as the dependent variable. For independent variables, I examined land use (forest, urban, and agriculture), stream order, absorbance at 254 nm, and water quality parameters such as dissolved organic nitrogen (DON), nitrate (NO3–), ammonium (NH4+), and ortho-phosphate (PO43–).
The data used were collected at the end of the dry season in 2012. Thus, the temporal variation was not considered here. I did not have time to create catchment and watershed layers for each sampling location, so I eyeballed the percentage of each land use for the purpose of completing this class project. I used data from 21 sampling sites.
1. OLS and Moran’s I
Making scatter plots in Excel (I could have used the ArcGIS scatter plot tool) before running OLS helped me understand my data. Once I had the scatter plots, it was evident that absorbance at 254 nm was strongly related to DOC concentration. Absorbance at 254 nm is used to describe the aromaticity of DOM. It makes sense that these two variables were correlated, because a high concentration of aromatic compounds means high DOC (note that high DOC does not necessarily mean high aromaticity). My samples must have been aromatic at the end of the dry season in 2012. Thus, I decided to use absorbance at 254 nm as the first independent variable. OLS generates two outputs: a report and a residual layer.
As I expected, the result was good. According to the Adjusted R-Squared value, about 88% of the variance in the dependent variable was explained by absorbance at 254 nm. An asterisk next to the probability (p-value) indicates that the independent variable is statistically significant (p < 0.05). Another value to look at is the Corrected Akaike Information Criterion (AICc). AICc is a relative measure: when comparing models of the same dependent variable, the model with the lower AICc fits the observed data better.
According to Wikipedia (http://en.wikipedia.org/wiki/Akaike_information_criterion),
AIC = 2k – 2ln(L)
where k is the number of parameters and L is the maximized value of the likelihood function.
AICc = AIC + 2k(k+1)/(n-k-1)
where n is the sample size.
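The two formulas above can be written as a short Python sketch. The function names are my own, and the log-likelihood of 0.0 in the example is a placeholder, not a value from my report; the point is only to show how the correction term shrinks as the sample size grows.

```python
def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L), with the log-likelihood ln(L) passed in directly."""
    return 2 * k - 2 * log_likelihood


def aicc(k, n, log_likelihood):
    """AICc = AIC + 2k(k+1)/(n-k-1), the small-sample correction of AIC."""
    if n - k - 1 <= 0:
        raise ValueError("n must be larger than k + 1")
    return aic(k, log_likelihood) + 2 * k * (k + 1) / (n - k - 1)


# Placeholder log-likelihood of 0.0, so only the penalty terms differ.
print(aicc(k=2, n=21, log_likelihood=0.0))   # 21 sites, as in this project
print(aicc(k=2, n=210, log_likelihood=0.0))  # ten times as many sites
```

With only 21 sites the correction is noticeable, which is one reason to look at AICc rather than plain AIC for a dataset this small.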
Another output of OLS is a residual layer. Red color indicates the observed dependent variable value is higher than the corresponding predicted value. Blue color indicates the observed dependent variable value is lower than the predicted value.
Before moving on to GWR, I had to run Moran’s I on the residuals to see if they were clustered. Clustered residuals would mean the OLS model is still missing an important independent variable. The Moran’s I result indicated that the residuals were not clustered; however, I decided to add one more independent variable for practice.
Assuming the residuals were clustered, the next step is to check the scatter plots for other independent variables that are correlated with the dependent variable. I found that DON and DOC were correlated, so I added DON as the second independent variable. I expected to see a correlation between DOC and the percentage of each land use type in each catchment; however, there was no strong relationship between them.
As I was running OLS, I realized that the input fields had to be of the numeric type DOUBLE.
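To make the idea of clustered residuals concrete, here is a minimal sketch of global Moran’s I in plain Python. It assumes inverse-distance weights and no two sites at the same location; the coordinates and residual values are made up for illustration. The ArcGIS tool also reports a z-score and p-value, which this sketch leaves out.

```python
import math


def morans_i(points, values):
    """Global Moran's I with inverse-distance weights (w_ij = 1/d_ij, w_ii = 0)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0   # sum of w_ij * dev_i * dev_j over all pairs i != j
    w_sum = 0.0   # sum of all weights
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = 1.0 / math.dist(points[i], points[j])
            cross += w * dev[i] * dev[j]
            w_sum += w
    total_sq = sum(x * x for x in dev)
    return (n / w_sum) * (cross / total_sq)


# Hypothetical sites along a line, with made-up residuals.
sites = [(float(x), 0.0) for x in range(6)]
clustered = [1, 1, 1, 5, 5, 5]      # similar values sit next to each other
alternating = [1, 5, 1, 5, 1, 5]    # neighbors are dissimilar
print(morans_i(sites, clustered), morans_i(sites, alternating))
```

A value well above the expected value of -1/(n-1) points toward clustering, and a value well below it points toward a dispersed pattern.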
OLS was run again, and the report showed an improvement in the model: the adjusted R2 value increased to 0.947 and AICc decreased to –7.67. The term 2k(k+1)/(n-k-1) in the AICc equation is positive (here, 0.67). When I plugged in the other values, L turned out to be about 477. The L for the model with absorbance at 254 nm as the only independent variable was about 0.097.
I decided to use both absorbance at 254 nm and DON because I was satisfied with the improved adjusted R2 and AICc values. The signs of both coefficients are positive, as expected from the scatter plots.
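The back-calculation of L can be checked with a few lines of Python. This sketch assumes k = 2 and n = 21, which is what the correction term of 0.67 implies; a different convention for counting parameters (for example, counting the intercept) would change the numbers. The function name is my own.

```python
import math


def likelihood_from_aicc(aicc_value, k, n):
    """Invert AICc = 2k - 2 ln(L) + 2k(k+1)/(n-k-1) to recover L."""
    correction = 2 * k * (k + 1) / (n - k - 1)
    aic_value = aicc_value - correction
    log_l = (2 * k - aic_value) / 2
    return correction, math.exp(log_l)


correction, L = likelihood_from_aicc(-7.67, k=2, n=21)
print(round(correction, 2), round(L))  # 0.67 477
```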
I ran Moran’s I on the new residuals and found no clustering. Now it was time for me to check other statistical results from OLS.
There are two values that tell whether each independent variable is statistically significant: Probability and Robust Probability. If the Koenker statistic is significant (look for the asterisk), the relationships being modeled vary across the study area, and I can trust only the Robust Probability. My report showed a significant Koenker statistic. This makes sense because I have regional variations among sampling sites: the sites clustered around the H.J. Andrews Experimental Forest represent undisturbed forest, while the sites clustered in Corvallis represent urban and agricultural land use.
Additionally, I have to check whether the independent variables are redundant in describing the dependent variable by examining the Variance Inflation Factor (VIF). This value should be less than 7.5; if it is greater, I have to remove the corresponding independent variable. Absorbance at 254 nm and DON were not redundant, as their VIFs were about 3.1.
Lastly, the residuals must be normally distributed, because residuals that are not normally distributed indicate that the model is biased. You can create a histogram of the residuals or check whether the Jarque-Bera statistic is significant in the report output; a significant value means the residuals are not normal. My model was not biased.
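For a model with exactly two independent variables, the VIF has a simple form: VIF = 1/(1 - r^2), where r is the correlation between the two variables. The sketch below uses that relationship to translate a VIF into an approximate correlation. It holds only for the two-variable case; with more variables, r^2 is replaced by the R-squared from regressing each variable on all the others. The function names are my own.

```python
import math


def vif_from_r(r):
    """VIF for a two-predictor model, given the correlation r between the predictors."""
    return 1.0 / (1.0 - r * r)


def abs_r_from_vif(vif):
    """Absolute correlation between two predictors implied by their VIF."""
    return math.sqrt(1.0 - 1.0 / vif)


print(round(abs_r_from_vif(3.1), 2))  # correlation implied by a VIF of about 3.1
print(round(abs_r_from_vif(7.5), 2))  # correlation at the 7.5 rule-of-thumb threshold
```

Under this assumption, a VIF of about 3.1 corresponds to a correlation of roughly 0.82 between absorbance at 254 nm and DON, while the 7.5 threshold corresponds to roughly 0.93.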
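As a rough illustration of what the Jarque-Bera statistic measures, here is a minimal sketch that computes it from skewness and kurtosis. The residual values are made up, and the p-value uses the usual chi-square approximation with 2 degrees of freedom, which is only approximate for a sample as small as 21 sites. ArcGIS may compute its reported value differently.

```python
import math


def jarque_bera(residuals):
    """Return (JB statistic, approximate p-value) for a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Hypothetical residuals: one symmetric set and one strongly skewed set.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
skewed = [0.0] * 20 + [10.0]
print(jarque_bera(symmetric))
print(jarque_bera(skewed))
```

A small p-value (for example, below 0.05) corresponds to the asterisk in the report and suggests the residuals are not normally distributed.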
2. GWR
Although the Koenker statistic from OLS suggested regional variation, the regional GWR model did not improve on the global OLS model.
GWR generates a layer for each independent variable. The Standard Error of Coefficient layer shows how reliable the coefficient estimate for that variable is in each part of the study area. As you can see, there is no variation in the Standard Error of Coefficient for either absorbance at 254 nm or DON. That is probably why the AICc and Adjusted R2 of the OLS and GWR models turned out to be the same.
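One possible explanation, not confirmed for this dataset, is the bandwidth. GWR fits each local regression with distance-based weights, and when the bandwidth is much larger than the study area, every site gets a weight close to 1 and the local models collapse toward the global OLS model. The sketch below assumes a Gaussian kernel, which is one common choice and may not match the kernel the ArcGIS tool used; the distances and bandwidths are hypothetical.

```python
import math


def gaussian_weights(distances, bandwidth):
    """Gaussian kernel weights: w = exp(-0.5 * (d / bandwidth)^2)."""
    return [math.exp(-0.5 * (d / bandwidth) ** 2) for d in distances]


# Hypothetical distances (in km) from one regression point to four sites.
distances = [0.0, 10.0, 50.0, 100.0]
small = gaussian_weights(distances, bandwidth=20.0)
large = gaussian_weights(distances, bandwidth=1000.0)
print([round(w, 3) for w in small])  # distant sites are nearly ignored
print([round(w, 3) for w in large])  # all sites weighted almost equally
```

With the large bandwidth, every weight stays above 0.99, so each local fit uses essentially the same data as the global fit.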
3. Next steps
a) Investigate why the GWR result did not improve on the OLS result even though the Koenker statistic suggested it should.
b) Create a layer with stream networks (using Generate Network Spatial Weights?), so I can use it as a spatial weighting for GWR to improve the GWR model.
c) Consider interpolating my point features first. My inputs were all point features, and I wonder if it would be beneficial to interpolate them first.
d) Find out if it is a problem to run OLS and GWR when my sampling sites are clustered.
Hi Peggy,
Your work is very interesting and you did a great job explaining the steps. I was wondering if you knew how much improvement you were expecting from GWR? Also, are you planning to go back to ArcGIS 9.3 to run any analysis to see if you can compare the results to the GWR?
This is super cool. I didn’t even know you could do OLS in ArcGIS!
Have you been able to figure out how to read those Standard Errors? (given that they all have so similar values!!)
Noelia,
No, I haven’t figured that out yet. It is still strange to me. Maybe the Adjusted R^2 and AICc were too good, so they could not improve? It still does not make sense, though, because those two values were exactly the same for OLS and GWR even though the Koenker statistic was significant.
Hi Peggy,
Thanks for the thorough report. My first thought regarding question 1 is wondering how bandwidth size was chosen. GWR essentially runs a window of subsets of the landscape. Getting similar results from your global OLS model could mean the window for GWR was too large.
It’s also possible that while there are clearly different overall values for the agricultural and forested areas, there may not be differences in how each region is correlated with absorbance or DON.
Something else to consider – you mentioned that absorbance accounted for 88% of the variance – that’s a really strong overall driver. You were already looking at residuals for secondary drivers. It might be interesting to model just the residuals themselves, maybe picking up another variable or two, and see if you get more localized variation in these secondary drivers after accounting for absorbance.
One final thought: at the Geostatistics presentation I went to, one of the presenters talked about GWR, and one of the recommendations was to use the tool only with high-density data sets. So some of the problems you’re encountering may be related to only having 21 sample points. Note that this isn’t a problem of whether those points are clustered spatially; it’s just that there are too few points overall to pick up local variation. When you start subdividing 21 points, you’re only performing the local regression on maybe 5-10 points, and the sample size is getting quite small at that point.
So, just some thoughts I had. I’m still learning more about how these tools work myself, great to read about your experiences exploring these tools.
– Max
P.S. There’s an AGU Chapman Conference this fall that your research would fit quite well in. Conference title is Soil-mediated Drivers of Coupled Biogeochemical and Hydrological Processes Across Scales
Website:
http://chapman.agu.org/soilmediated/
Peggy,
Very nice work! It would be useful to develop a set of questions about what relationships you expect, and why. These could be of two types: (1) what chemical properties might be expected to be related to one another, and why? and (2) what spatial trends do you expect to see in the variables, or the relationships among variables, and why?
Julia