First of all, I learned about the Ordinary Least Squares (OLS) and Geographically Weighted Regression (GWR) tools from the ArcGIS Help for versions 9.3, 10.0, and 10.1. This final report is based on what I learned from those help pages.

My goal in this class was to learn the ArcGIS Spatial Statistics Tools that could help answer my project questions. For my project, I am examining whether various types of dissolved organic matter (DOM) sources are controlled by factors such as land use types, stream orders, other water quality parameters, and seasonal flow patterns. At the beginning of the quarter, I was going to use SSN & STARS and FLoWS, which are add-in spatial statistical tools for stream networks; however, they did not work because they were built for ArcGIS 9.3.

I moved on to GWR, which is one of the Spatial Statistics Tools. These are the steps and tools I used to run GWR.

  1. OLS to select independent variables
  2. Spatial Autocorrelation (Moran’s I) on residuals to make sure that I did not miss any significant independent variables
  3. GWR
  4. Spatial Autocorrelation (Moran’s I)

Because I have not yet reached the point of identifying different DOM sources, I used dissolved organic carbon (DOC) concentrations as the dependent variable. For independent variables, I examined land use (forest, urban, and agriculture), stream order, absorbance at 254 nm, and water quality parameters such as dissolved organic nitrogen (DON), nitrate (NO3-), ammonium (NH4+), and ortho-phosphate (PO4 3-).

The data used were collected at the end of the dry season in 2012. Thus, the temporal variation was not considered here. I did not have time to create catchment and watershed layers for each sampling location, so I eyeballed the percentage of each land use for the purpose of completing this class project. I used data from 21 sampling sites.

My study sites

  1. OLS and Moran’s I

Making scatter plots in Excel (I could have used the ArcGIS scatter plot tool) before running OLS helped me understand my data. Once I made the scatter plots, it was evident that absorbance at 254 nm was strongly related to DOC concentrations. Absorbance at 254 nm is used to describe the aromaticity of DOM. It makes sense that these two variables were correlated, because a high concentration of aromatic compounds means high DOC (note that high DOC does not necessarily mean high aromaticity). My samples must have been aromatic at the end of the dry season in 2012. Thus, I decided to use absorbance at 254 nm as the first independent variable. OLS generates two forms of output: a report and a residual layer.

As I expected, the result was good. According to the Adjusted R-Squared, about 88% of the dependent variable's variance was explained by absorbance at 254 nm. An asterisk on the probability (p-value) indicates that the particular independent variable is statistically significant (p < 0.05). Another value to look at is the Corrected Akaike Information Criterion (AICc). When comparing models, a lower AICc indicates a better fit to the observed data.
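The Adjusted R-Squared reported by OLS can be illustrated with a small one-variable regression. This is a minimal sketch in plain Python, not the ArcGIS implementation: the function name `fit_simple_ols` is hypothetical and the sample numbers are made up for illustration, not taken from the study data.

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x, with R2 and adjusted R2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus predicted value.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    p = 1  # one explanatory variable in this sketch
    # Adjusted R2 penalizes R2 for the number of explanatory variables.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return {"slope": slope, "intercept": intercept,
            "r2": r2, "adj_r2": adj_r2, "residuals": residuals}


# Made-up values for illustration only (not the study data).
abs254 = [0.02, 0.05, 0.08, 0.11, 0.15, 0.20]
doc = [0.9, 1.4, 2.1, 2.4, 3.3, 4.1]
fit = fit_simple_ols(abs254, doc)
print("adjusted R2:", round(fit["adj_r2"], 3))
```

A positive residual here corresponds to a site where the observed value is higher than the predicted value.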

OLS result 1

According to Wikipedia (http://en.wikipedia.org/wiki/Akaike_information_criterion),

AIC = 2k – 2ln(L)

where k is the number of parameters and L is the maximized value of the likelihood function.

AICc = AIC + 2k(k+1)/(n-k-1)

where n is the sample size.
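The two formulas above translate directly into code. The sketch below is only an illustration of those formulas; the function names `aic` and `aicc` are hypothetical, and the log-likelihood passed in is a placeholder rather than a value from the OLS report.

```python
def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L), where log_likelihood is ln(L)."""
    return 2 * k - 2 * log_likelihood


def aicc(k, log_likelihood, n):
    """AICc = AIC + 2k(k+1)/(n-k-1), the small-sample correction."""
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    return aic(k, log_likelihood) + 2 * k * (k + 1) / (n - k - 1)


# The correction shrinks as the sample grows (ln(L) = 0 is a placeholder).
for n in (21, 100, 1000):
    print(n, round(aicc(2, 0.0, n) - aic(2, 0.0), 4))
```

Because the correction term depends on n, it matters most for small samples such as the 21 sites used here.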

Another output of OLS is a residual layer. Red color indicates the observed dependent variable value is higher than the corresponding predicted value. Blue color indicates the observed dependent variable value is lower than the predicted value.

OLS residual layer 1

Before moving on to GWR, we must run Moran's I on the residuals to see if they are clustered. If they are, the model is probably missing an important independent variable that should be added to the OLS model. The result of Moran's I indicated that the residuals were not clustered; however, I decided to add one more independent variable for practice.
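To make the idea of clustered residuals concrete, here is a minimal sketch of global Moran's I in plain Python. It assumes inverse-distance weights without row standardization and no coincident points; the function name `morans_i` and the coordinates are hypothetical, and the ArcGIS tool offers other weighting options and also reports a z-score and p-value that this sketch does not compute.

```python
import math


def morans_i(coords, values):
    """Global Moran's I with inverse-distance weights (no row standardization)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]  # deviations from the mean
    weighted_cross = 0.0
    weight_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.hypot(coords[i][0] - coords[j][0],
                           coords[i][1] - coords[j][1])
            w = 1.0 / d  # assumes no two points share a location
            weight_sum += w
            weighted_cross += w * z[i] * z[j]
    return (n / weight_sum) * weighted_cross / sum(zi ** 2 for zi in z)


# Two made-up clusters: similar residuals sit close together.
coords = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
residuals = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
print("Moran's I:", round(morans_i(coords, residuals), 3))
```

Values well above the expected value of -1/(n-1) suggest clustering of similar residuals, and values well below it suggest dispersion.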

Assuming the residuals were clustered, the next step is to check the scatter plots for other independent variables that are correlated with the dependent variable. I found that DON and DOC were correlated, so I decided to add DON as the second independent variable. I expected to see a correlation between DOC and the percentage of each land use type in each catchment; however, there was no strong relationship between them.

As I was running OLS, I realized that the numeric fields had to be of type DOUBLE.

DOC vs. DON

DOC vs. Land Use Types

OLS was run again, and the report showed that the model improved: the adjusted R2 increased to 0.947 and the AICc decreased to -7.67. The correction term in the AICc equation, 2k(k+1)/(n-k-1), is positive (about 0.67 here). When I plugged in the other values, L turned out to be about 477. With absorbance at 254 nm as the only independent variable, L was about 0.097.
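The arithmetic for the two-variable model can be retraced by inverting the AICc formula. The sketch below assumes n = 21 (the number of sampling sites) and k = 2; k = 2 is an assumption, chosen because it is the value that gives a correction of about 0.67 with n = 21. The function name `likelihood_from_aicc` is hypothetical.

```python
import math


def likelihood_from_aicc(aicc, k, n):
    """Invert AICc = 2k - 2 ln(L) + 2k(k+1)/(n-k-1) to recover L."""
    correction = 2 * k * (k + 1) / (n - k - 1)
    aic = aicc - correction
    log_likelihood = (2 * k - aic) / 2
    return correction, math.exp(log_likelihood)


# AICc from the second OLS report; k = 2 and n = 21 are assumed as noted above.
correction, likelihood = likelihood_from_aicc(aicc=-7.67, k=2, n=21)
print("correction:", round(correction, 2))
print("L:", round(likelihood))
```

Under these assumptions the result lands close to the correction and likelihood quoted in the text.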

I decided to use both absorbance at 254 nm and DON because I was satisfied with the improved adjusted R2 and AICc values. The signs of both coefficients are positive, as expected from the earlier scatter plots.

I ran Moran’s I on the new residuals and found no clustering. Now it was time for me to check other statistical results from OLS.

OLS result 2

Two values in the report tell whether each independent variable is statistically significant: Probability and Robust Probability. If the Koenker statistic is significant (look for the asterisk), the relationships in the model vary across the study area (regional variation), and I can trust only the Robust Probability. My report showed a significant Koenker statistic. This makes sense because I have regional variation among my sampling sites. The clustered sampling sites around the H.J. Andrews Experimental Forest represent undisturbed forest. In Corvallis, the clustered sampling sites represent urban and agricultural land use.

Additionally, I have to check if those independent variables are redundant in describing the dependent variable by examining Variance Inflation Factor (VIF). This value needs to be less than 7.5. If it is greater than 7.5, I have to remove the corresponding independent variable. Absorbance at 254 nm and DON were not redundant as their VIFs were about 3.1.
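For a model with exactly two independent variables, the VIF of each one reduces to 1 / (1 - r^2), where r is the Pearson correlation between the two. The sketch below illustrates that special case only; the function names and the sample values are hypothetical and not taken from the study data.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has only two of them."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Made-up predictor values for illustration only.
abs254 = [0.02, 0.05, 0.08, 0.11, 0.15, 0.20]
don = [0.10, 0.12, 0.20, 0.18, 0.30, 0.33]
print("VIF:", round(vif_two_predictors(abs254, don), 2))
```

By the same formula, a VIF of about 3.1 corresponds to a correlation of roughly 0.82 in magnitude between the two predictors, which is below the 7.5 threshold.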

Lastly, the residuals should be normally distributed, because residuals that are not normally distributed indicate that the model is biased. You can create a histogram or check whether the Jarque-Bera statistic is significant in the report output; a significant statistic means the residuals deviate from a normal distribution. My model was not biased.
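The Jarque-Bera statistic is built from the skewness and kurtosis of the residuals, and a significant value signals that the residuals are not normally distributed. Here is a minimal sketch of the textbook formula in plain Python. The function name `jarque_bera` and the sample residuals are hypothetical, the p-value uses the large-sample chi-square approximation with 2 degrees of freedom (rough for only 21 sites), and ArcGIS may compute the statistic somewhat differently.

```python
import math


def jarque_bera(residuals):
    """Return (JB statistic, approximate p-value) for a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments (divided by n, not n - 1).
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Made-up residuals for illustration only.
jb, p = jarque_bera([-0.4, -0.2, -0.1, 0.0, 0.1, 0.1, 0.2, 0.3])
print("JB:", round(jb, 3), "p:", round(p, 3))
```

A p-value below 0.05 would correspond to the asterisk next to the Jarque-Bera statistic in the OLS report.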

The histogram of residuals

2. GWR

Although the Koenker statistic from OLS suggested that there was regional variation, the regional GWR model did not improve on the global OLS model.

GWR result

GWR generates a layer for each independent variable. The Standard Error of Coefficient measures how reliable each coefficient estimate is at each location, which helps show where in the study area a particular independent variable is a dependable predictor. As you can see, there is no variation in the Standard Error of Coefficient for either absorbance at 254 nm or DON. That is probably why the AICc and Adjusted R2 of the OLS model and the GWR model turned out to be the same.

Abs 254 nm, Std. Error of Coefficient

DON, Std. Error of Coefficient

 

3. Next steps

a) Investigate why the GWR result did not improve on the OLS result even though the Koenker statistic suggested it should.

b) Create a layer with stream networks (using Generate Network Spatial Weights?), so I can use it as the spatial weighting for GWR to improve the GWR model.

c) Consider interpolating my point features first. My inputs were all point features. I wonder if it would be beneficial to interpolate first.

d) Find out if it is a problem to run OLS and GWR when my sampling sites are clustered.

 

Lauren suggested that I use the tools mentioned above. Here is what I learned about those tools through the ArcGIS 10 Help.

Generate Spatial Weights Matrix: Constructs a spatial weights matrix (.swm) file to represent the spatial relationships among features in a dataset.

Generate Network Spatial Weights: Constructs a spatial weights matrix file (.swm) using a Network dataset, defining feature spatial relationships in terms of the underlying network structure.

Note that you have to enable the Network Analyst extension to use this tool.

It seems that I have to manually assign the relationships within each network, which sounds very cumbersome because there are more than 100,000 streams to deal with. I may be able to use fdr (the output of FlowDirection) to expedite the process.

Stay tuned!

The Average Nearest Neighbor tool is not relevant to my study, which examines spatial variations of DOM (dissolved organic matter) and nutrients in streams, but I wanted to learn how it works, so I tried it. The result tells me how clustered my sampling sites are relative to the study site extent. I could see visually that my sampling sites were clustered within the study site, and the result supported that observation. Note that I did not use all of my sampling sites.

Here are some notes:

I projected the point shapefile FIRST. This determines the units of my results. (I was not sure about this, but I found it was true during the discussion on Monday.)

I used EUCLIDEAN_DISTANCE.

I had to check “Generate Report” to create a graphical result, which can be accessed from the “Results” window –> HTML Report File: NearestNeighbor_Results. I could save this report as a PDF.
All the results are in the log/results window. It personally helps me to disable background processing and have an actual log window to obtain a text summary of the results.

Average Nearest Neighbor Summary
Observed Mean Distance: 4318.596812
Expected Mean Distance: 9383.554326
Nearest Neighbor Ratio: 0.460230
z-score: -4.732046
p-value: 0.000002
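The numbers in this summary follow from the standard Average Nearest Neighbor formulas. The sketch below is an illustration in plain Python with Euclidean distances, not the ArcGIS code; the function name `average_nearest_neighbor`, the sample points, and the study area value are all hypothetical.

```python
import math


def average_nearest_neighbor(points, area):
    """Observed/expected mean nearest-neighbor distance, ratio, z-score, p-value."""
    n = len(points)
    nearest = []
    for i, (xi, yi) in enumerate(points):
        nearest.append(min(math.hypot(xi - xj, yi - yj)
                           for j, (xj, yj) in enumerate(points) if j != i))
    observed = sum(nearest) / n
    # Expected mean distance for a random pattern of n points in this area.
    expected = 0.5 / math.sqrt(n / area)
    std_err = 0.26136 / math.sqrt(n * n / area)
    z = (observed - expected) / std_err
    # Two-tailed p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return {"observed": observed, "expected": expected,
            "ratio": observed / expected, "z": z, "p": p_value}


# Made-up projected coordinates (in meters) and a hypothetical study area.
sites = [(0, 0), (50, 20), (30, 60), (900, 900), (40, 10)]
result = average_nearest_neighbor(sites, area=1000.0 * 1000.0)
print({key: round(value, 4) for key, value in result.items()})
```

A ratio below 1 with a negative z-score, as in the summary above (4318.596812 / 9383.554326 = 0.460230), indicates clustering. Because the expected distance depends on the area, the units of the projected coordinates carry through to the result.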

 

Unfortunately, I could not easily find a corresponding tool in SSN and STARS. I have to look into their documentation.

The data I will be using for this class are water quality data that I have been collecting from 21 locations in the upper Willamette River Basin. The water quality parameters that I will use for this class are dissolved organic carbon and nitrate.

I would like to learn 1) what the ArcGIS add-in toolboxes SSN & STARS and FLoWS can do, 2) the concepts behind each function offered by those toolboxes, and 3) how to run them on my data to answer one of my questions: do my water quality data vary spatially?

As I was exploring the Spatial Statistics Resources web page, I quickly realized that most of the spatial statistical tools offered by ESRI are not applicable to my project. My project explores spatial and temporal variations of water quality (dissolved organic carbon sources, to be precise) in rivers of the Willamette River Basin. Those ESRI tools are not applicable because 1) the points in my project do not represent observations of organisms or diseases but rather water quality sampling locations that I selected, and 2) not only Euclidean distance but also in-stream distances, flow directions, and stream networks affect statistical significance.

I found the add-in toolboxes SSN & STARS and FLoWS, which address the two issues mentioned above. These toolboxes were developed by the U.S. Forest Service (USFS). Unfortunately, the currently available toolboxes are for ArcGIS 9.3, but the USFS states that it plans to publish new toolboxes for ArcGIS 10 later this year.

 

http://webcache.googleusercontent.com/search?q=cache:5SIzWb38eREJ:blogs.esri.com/esri/arcgis/2013/01/29/ssn-stars-tools-for-spatial-statistical-modeling-on-stream-networks/+spatial+statistics+arcgis+water&cd=1&hl=en&ct=clnk&gl=us

 

By the next class period, I would like to 1) download those two toolboxes and 2) see if they work with ArcGIS 10. Note that I am not planning to publish data modified using the toolboxes developed for ArcGIS 9.3; however, these goals will help me explore what kinds of tools are available through these toolboxes and learn the concepts behind the tools that I am interested in using.