First of all, I should say that I learned about the Ordinary Least Squares (OLS) and Geographically Weighted Regression (GWR) tools from the ArcGIS Help for versions 9.3, 10.0, and 10.1. This final report is based on what I learned from those help pages.
My learning goal in this class was to learn the ArcGIS Spatial Statistics tools that could help answer my project questions. For my project, I am examining whether various types of dissolved organic matter (DOM) sources are controlled by factors such as land use type, stream order, other water quality parameters, and seasonal flow patterns. At the beginning of the quarter, I planned to use SSN & STARS and FLoWS, which are add-in spatial statistical tools for stream networks; however, they did not work because they are built for ArcGIS 9.3.
I moved on to GWR, which is one of the Spatial Statistics tools. These are the steps and tools I used to run GWR.
- OLS to select independent variables
- Spatial Autocorrelation (Moran’s I) on residuals to make sure that I did not miss any significant independent variables
- GWR
- Spatial Autocorrelation (Moran’s I)
Because I have not yet reached the point of identifying different DOM sources, I used dissolved organic carbon (DOC) concentration as the dependent variable. For independent variables, I examined land use (forest, urban, and agriculture), stream order, absorbance at 254 nm, and water quality parameters such as dissolved organic nitrogen (DON), nitrate (NO3–), ammonium (NH4+), and ortho-phosphate (PO43–).
The data used were collected at the end of the dry season in 2012. Thus, the temporal variation was not considered here. I did not have time to create catchment and watershed layers for each sampling location, so I eyeballed the percentage of each land use for the purpose of completing this class project. I used data from 21 sampling sites.
1. OLS and Moran’s I
Making scatter plots in Excel (I could have used the ArcGIS scatter plot tool) before running OLS helped me understand my data. The scatter plots made it evident that absorbance at 254 nm was strongly related to DOC concentration. Absorbance at 254 nm is used to describe the aromaticity of DOM. It makes sense that these two variables were correlated, because a high concentration of aromatic compounds means high DOC (note that high DOC does not necessarily mean high aromaticity). My samples were presumably rich in aromatic compounds at the end of the dry season in 2012. Thus, I decided to use absorbance at 254 nm as the first independent variable. OLS generates two forms of output: a report and a residual layer.
As I expected, the result was good. According to the Adjusted R-Squared, about 88% of the dependent variable’s variance was explained by absorbance at 254 nm. An asterisk next to the probability (p-value) indicates that the independent variable is statistically significant (p < 0.05). Another value to look at is the Corrected Akaike Information Criterion (AICc). A lower AICc indicates that the model fits the observed data better; the value is only meaningful when comparing models of the same dependent variable.
According to Wikipedia (http://en.wikipedia.org/wiki/Akaike_information_criterion),
AIC = 2k - 2 ln(L)
where k is the number of parameters and L is the maximized value of the likelihood function, and
AICc = AIC + 2k(k+1)/(n-k-1)
where n is the sample size.
Another output of OLS is a residual layer. Red color indicates the observed dependent variable value is higher than the corresponding predicted value. Blue color indicates the observed dependent variable value is lower than the predicted value.
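To make the two OLS outputs concrete, here is a minimal sketch in Python with NumPy (not the ArcGIS tool itself) of how the Adjusted R-Squared and the residuals of a one-variable model can be computed. The function name `ols_fit` and the numbers are hypothetical, synthetic values, not my field data. A positive residual (observed greater than predicted) corresponds to red in the residual layer, and a negative residual to blue.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares fit of y = b0 + X @ b.

    Returns (coefficients, residuals, adjusted R-squared).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:                      # allow a single explanatory variable
        X = X[:, None]
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])  # add the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta                 # observed minus predicted
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return beta, resid, adj_r2

# Synthetic illustration only (hypothetical absorbance and DOC values).
a254 = [0.05, 0.10, 0.15, 0.20, 0.25]
doc = [1.2, 2.1, 2.9, 4.2, 4.9]
beta, resid, adj_r2 = ols_fit(a254, doc)
colors = ["red" if r > 0 else "blue" for r in resid]  # sign convention of the residual layer
```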
Before moving on to GWR, I had to run Moran’s I on the residuals to see whether they were clustered. Clustered residuals would mean that the OLS model is still missing an important independent variable. The Moran’s I result indicated that the residuals were not clustered; however, I decided to add one more independent variable for practice.
Had the residuals been clustered, the next step would have been to check the scatter plots for other independent variables that are correlated with the dependent variable. I found that DON and DOC were correlated, so I added DON as the second independent variable. I had expected to see a correlation between DOC and the percentage of each land use type in each catchment; however, there was no strong relationship between them.
While running OLS, I realized that the input fields had to be of the numeric type DOUBLE.
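For reference, the index behind the Spatial Autocorrelation tool can be sketched in a few lines. This is a simplified illustration under stated assumptions: it uses plain inverse-distance weights without row standardization, gives coincident points a weight of zero, and returns only the index (the ArcGIS tool also reports a z-score and p-value). The function name `morans_i` is hypothetical. Values well above the expected value of -1/(n-1) suggest clustering; values well below it suggest dispersion.

```python
import numpy as np

def morans_i(values, coords):
    """Global Moran's I with inverse-distance weights (a simplified sketch)."""
    x = np.asarray(values, dtype=float)
    xy = np.asarray(coords, dtype=float)
    n = x.size
    # Pairwise Euclidean distances between all sites.
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    w = np.zeros_like(d)
    mask = d > 0                     # skip self-pairs and coincident points
    w[mask] = 1.0 / d[mask]
    z = x - x.mean()                 # deviations from the mean
    cross = (w * np.outer(z, z)).sum()
    return (n / w.sum()) * cross / float(z @ z)

# Synthetic illustration: ten sites on a line.
sites = [(float(i), 0.0) for i in range(10)]
clustered = morans_i([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], sites)    # like values are neighbors
alternating = morans_i([0, 1, 0, 1, 0, 1, 0, 1, 0, 1], sites)  # unlike values are neighbors
```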
I ran OLS again, and the report showed that the model had improved: the adjusted R2 increased to 0.947 and the AICc decreased to -7.67. The term 2k(k+1)/(n-k-1) in the AICc equation is positive (0.67 here). Solving the equation for L with these values gave about 477, compared with about 0.097 for the model with absorbance at 254 nm as the only independent variable.
I decided to use both absorbance at 254 nm and DON because I was satisfied with the improved adjusted R2 and AICc. The signs of both coefficients are also positive, as expected from the scatter plots.
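The arithmetic behind those numbers can be checked with a short script. It assumes k = 2, which is inferred from the 0.67 correction term with n = 21 sites (2·2·3/18 ≈ 0.67); ArcGIS may count parameters differently internally, so treat this as a sketch of the Wikipedia formula rather than of the tool. The function names are hypothetical.

```python
import math

def aicc_correction(k, n):
    """Small-sample term 2k(k+1)/(n-k-1) that is added to AIC to give AICc."""
    return 2 * k * (k + 1) / (n - k - 1)

def likelihood_from_aicc(aicc, k, n):
    """Solve AICc = 2k - 2 ln(L) + 2k(k+1)/(n-k-1) for L."""
    aic = aicc - aicc_correction(k, n)   # remove the correction term
    log_l = (2 * k - aic) / 2            # rearrange AIC = 2k - 2 ln(L)
    return math.exp(log_l)

n_sites = 21    # number of sampling sites in this report
k_params = 2    # assumption: inferred from the 0.67 correction term
correction = aicc_correction(k_params, n_sites)
L = likelihood_from_aicc(-7.67, k_params, n_sites)
print(round(correction, 2), round(L))
```

Because L enters the formula through a logarithm, a modest drop in AICc corresponds to a very large ratio between the two likelihoods.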
I ran Moran’s I on the new residuals and found no clustering. Now it was time for me to check other statistical results from OLS.
Two values in the report tell whether each independent variable is statistically significant: Probability and Robust Probability. If the Koenker Statistic is significant (look for the asterisk), the modeled relationships are not consistent across the study area (there is regional variation), and I can trust only the Robust Probability. My report showed a significant Koenker Statistic. This makes sense because my sampling sites do vary regionally. The sampling sites clustered around the H.J. Andrews Experimental Forest represent undisturbed forest, while the sampling sites clustered in Corvallis represent urban and agricultural land use.
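As a rough illustration of what this test measures, the sketch below computes Koenker's studentized Breusch-Pagan statistic in its textbook form: regress the squared OLS residuals on the explanatory variables and multiply the R-squared of that auxiliary regression by n. Under the null hypothesis it follows a chi-square distribution with as many degrees of freedom as there are explanatory variables (5% critical values: 3.841 for one variable, 5.991 for two). Whether ArcGIS computes it in exactly this way is an assumption; the function name `koenker_bp` and the data are hypothetical.

```python
import numpy as np

def koenker_bp(X, resid):
    """Koenker (studentized Breusch-Pagan) statistic: n * R^2 of e^2 regressed on X.

    Returns (statistic, degrees_of_freedom).
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    e2 = np.asarray(resid, dtype=float) ** 2   # squared residuals
    n = e2.size
    A = np.column_stack([np.ones(n), X])
    gamma, *_ = np.linalg.lstsq(A, e2, rcond=None)
    fitted = A @ gamma
    ss_res = float(((e2 - fitted) ** 2).sum())
    ss_tot = float(((e2 - e2.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return n * r2, X.shape[1]

# Synthetic illustration: residual spread grows with x, so the statistic is large.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 500)
fake_resid = x * rng.normal(0.0, 1.0, 500)
stat, dof = koenker_bp(x, fake_resid)
```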
Additionally, I have to check if those independent variables are redundant in describing the dependent variable by examining Variance Inflation Factor (VIF). This value needs to be less than 7.5. If it is greater than 7.5, I have to remove the corresponding independent variable. Absorbance at 254 nm and DON were not redundant as their VIFs were about 3.1.
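The VIF of a variable is 1/(1 - R²), where R² comes from regressing that variable on all the other independent variables. A minimal NumPy sketch follows; the function name `vif` and the data are hypothetical. With only two independent variables the two VIFs are always equal, and a VIF of about 3.1 corresponds to a correlation of about 0.82 between them (1 - 1/3.1 ≈ 0.68 = r²).

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + remaining variables
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - float(resid @ resid) / float(((target - target.mean()) ** 2).sum())
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic illustration: two correlated explanatory variables.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = 0.8 * a + 0.6 * rng.normal(size=50)
factors = vif(np.column_stack([a, b]))
redundant = factors > 7.5   # threshold used in this report
```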
Lastly, the residuals should be normally distributed, because residuals that are not normally distributed indicate that the model is biased. You can create a histogram of the residuals or check whether the Jarque-Bera Statistic in the report output is significant; a significant value means the residuals depart from a normal distribution. My model was not biased.
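The Jarque-Bera statistic tests the null hypothesis that the residuals are normally distributed, using their skewness S and kurtosis K: JB = n/6 · (S² + (K - 3)²/4). A significant value is the signal of a biased model. The sketch below uses the textbook formula with the asymptotic chi-square (2 degrees of freedom) p-value, which has the closed form exp(-JB/2); that approximation is rough for a sample as small as 21 sites. The function name `jarque_bera` is hypothetical.

```python
import math
import numpy as np

def jarque_bera(resid):
    """Jarque-Bera statistic and its asymptotic chi-square(2) p-value."""
    e = np.asarray(resid, dtype=float)
    n = e.size
    d = e - e.mean()
    m2 = float((d ** 2).mean())              # biased (population) central moments
    skew = float((d ** 3).mean()) / m2 ** 1.5
    kurt = float((d ** 4).mean()) / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)            # survival function of chi-square, df = 2
    return jb, p_value

# Synthetic illustration: a small symmetric set of residuals.
jb, p_value = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
significant = p_value < 0.05
```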
2. GWR
Although the Koenker Statistic from OLS suggested regional variation, the regional GWR model did not improve on the global OLS model.
GWR generates a layer for each independent variable. The Standard Error of Coefficient measures how reliable each local coefficient estimate is across the study area. The output layers showed no variation in the Standard Error of Coefficient for either absorbance at 254 nm or DON. That is probably why the AICc and Adjusted R2 of the OLS model and the GWR model turned out to be the same.
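One thing worth checking is the bandwidth. GWR fits a separate weighted regression at every site, with nearby sites weighted more heavily; when the bandwidth is large relative to the study area, all weights approach 1 and every local fit collapses to the global OLS fit. The sketch below illustrates this with a fixed Gaussian kernel. It is a simplified stand-in, not the ArcGIS implementation (which offers fixed or adaptive kernels and can choose the bandwidth by AICc or cross validation); the function name `gwr_coefficients` and the data are hypothetical.

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Local coefficients from a fixed-bandwidth Gaussian-kernel GWR (sketch).

    Returns an array with one row of [intercept, slopes...] per site.
    """
    xy = np.asarray(coords, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.size
    A = np.column_stack([np.ones(n), np.asarray(X, dtype=float)])
    betas = np.empty((n, A.shape[1]))
    for i in range(n):
        d = np.linalg.norm(xy - xy[i], axis=1)        # distance from site i
        w = np.exp(-0.5 * (d / bandwidth) ** 2)       # Gaussian kernel weights
        sw = np.sqrt(w)
        # Weighted least squares via row scaling by sqrt(w).
        betas[i] = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)[0]
    return betas

# Synthetic illustration: 21 random sites, one explanatory variable.
rng = np.random.default_rng(0)
sites = rng.uniform(0.0, 10.0, (21, 2))
x = rng.normal(size=21)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=21)

global_beta = np.linalg.lstsq(np.column_stack([np.ones(21), x]), y, rcond=None)[0]
local_wide = gwr_coefficients(sites, x, y, bandwidth=1e6)   # effectively global
local_narrow = gwr_coefficients(sites, x, y, bandwidth=3.0)  # allows local variation
```

With the wide bandwidth every row of `local_wide` matches `global_beta`, which is the pattern of identical OLS and GWR diagnostics described above.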
3. Next steps
a) Investigate why the GWR result did not improve on the OLS result even though the Koenker Statistic suggested it should.
b) Create a layer with stream networks (using Generate Network Spatial Weights?), so I can use it as the spatial weighting for GWR to improve the GWR model.
c) Consider interpolating my point features first. My inputs were all point features. I wonder if it is beneficial to interpolate first.
d) Find out if it is a problem to run OLS and GWR when my sampling sites are clustered.