GEO599/GEO584-Advanced Spatial Statistics and GIS, 2013-2016

My research is focused on developing a web-based forage species selection tool and estimating potential yield of key forage species grown in Oregon and Sichuan Province.

Our goal is to match appropriate species with each eco-region. Related to this class, the problem is how we can use the GIS spatial analyst tools to define and display a workable number of forage production eco-regions based on topography, climate, soil characteristics, land-use and land-cover, and agricultural systems.

Although there have been several important studies directed at defining ecoregions (Bailey, 1996; Thorson, 2003; Omernick, 2004), these have been based primarily on the Köppen Climate Classification System (1936) and broad groupings of species suitable for each zone. They are not helpful in quantifying potential annual forage yield or seasonal production profiles required for rational management of forage-livestock systems.

To provide useful guidance to Oregon and Sichuan Province farmers and ranchers, our agroecozone classification systems will use a hierarchical approach beginning with climate, with modifications due to physiography and land-use/land-cover, and soil characteristics.

Level I: Climate (Thermal Units and Precipitation)

Climate was chosen as the foundational level of the classification system due to the essential nature of temperature and precipitation in plant growth and development. Base spatial layers for climate factors will include extreme monthly cold and hot temperature, mean monthly maximum and minimum temperature, mean annual temperature, and mean annual, seasonal, and monthly precipitation. Climate-based indices will be developed to predict forage crop growth and development. These will include solar radiation and photosynthetically active radiation, accumulated thermal units (with various base temperatures), growing season length, and vernalization chilling days. For agricultural systems that include irrigation a soil water balance model will be applied.

Level II: Physiography, Land-use/Land-cover [Topography (DEM), MODIS Images]

The second level of the classification systems will involve physiography and land-use and land-cover. A DEM will be used to underlay the climate layers and identify terrain slope, with the following rankings: >60°, not useful; 60°— 50°, 30% can be useful for livestock; 50°— 40°50% can be useful; 40°— 30°, can be used as pasture; and >30°, useful as grassland (Zhang, Grassland management, 2010). Current land-use and land-cover will be characterized from current and historical MODIS satellite images, with particular focus on cropland, pastureland, and rangeland areas.

Level III: Soil Characteristics (pH, Drainage, Salinity)

Soil characteristics will be the final level of the hierarchy, since, to a large degree, these can be modified in agricultural systems. Spatial data layers will be obtained for soil type, pH, drainage classification, and salinity. Site specific data will be obtained for more detailed fertility information.

As I started to describe in class, my project will be dealing with output results from the model software ENVISION. ENVISION is a GIS-based tool for scenario based community and regional planning, and environmental assessments. It combines a spatially explicit polygon-based representation of a landscape (IDUs or Individual Decision Units in my case), a set of application-defined policies, landscape change models, and models of ecological, social, and economic services to simulate land use change and provide decision-makers, planners, and the public with information about resulting effects.

The ENVISION project I am involved with is centered on Tillamook County and its coastal communities. Through a series of stakeholder meetings (which have included a range of people such as private landowners and state land use commissioners) our group identified several land use policies to implement in the ENVISION model. The policies were then grouped into three types of management responses: the baseline (or status quo), ReAlign, and two types of Hold the Line (high vs. low management) scenarios. These policy scenarios have been combined with current physical parameters of the coastline such as dune height and beach width, and will be also linked with climate change conditions at low, medium, and high levels for the next 30, 50, and 100 years.

Since ENVISION is GIS-based already, I am having a tough time coming up with a problem that complements the project in ArcGIS. ENVISION does a great job of visualizing the changes expected for each location along the coast via videos, graphs (see below), and can even include economic estimations.

Therefore, it may be best to explore the capabilities of software like R to analyze the output data. One idea would be to calculate the probability of occurrence for these different events and total number of occurrence. I need to take a deeper look into how these events are calculated to begin with, and determine the inherent estimates of probability and uncertainty. This type of analysis would help determine whether this type of exercise is beneficial for stakeholders and would help answer their own questions of trust in the results.

Another idea would be to focus on specific results from ENVISION and try to determine exactly how one policy is affecting the coastline and creating such disparate results. For instance, the graph below shows Numbers of Flooded and Eroded Structures in Pacific City under three types of scenarios. What is causing the large number of eroded/ flooded structures between 2035 and 2040? Why is there such a small difference between ReAlign and Hold the Line strategies if they are employing such different options? Some of these questions may be answered with a greater understanding of ENVISION, however, these are the types of questions that may be asked by stakeholders and it would be prudent to provide more quantitative answers that ArcGIS or R could glean.

My initial goal was to explore local food production over time near Corvallis, but I am getting ready to change topics because I cannot find enough information on farms in the area to discriminate crop types, either by visual assessment or ownership. The federal data I could find on crop types did not list information more granular than the county level. Land cover data categorizes farmland as “herbaceous” or “barren” and is not much help. So I attempted visual assessment of orthographic imagery. Here is the Willamette Valley around Corvallis:

If I zoom in on a parcel, this is the level of detail:

Clearly agricultural, but I couldn’t tell you what. That was 2011, here is the same land in 2005:

Is that supposed to be grass? What degree of certainty do I have? Not enough for analysis.

Here is the adjacent parcel:

Clearly two different crop types, but is one hay and the other grass seed? Don’t ask a city slicker like me.

The second strategy I tried was to determine ownership. Certain farms produce specific types of crops, and other farms have a reputation for selling their food locally. But I could not find the equivalent of the White and Yellow pages for GIS, or even a shapefile with polygons divided by tax lots. Instead, I tried looking at the water rights. Water rights data identifies the owner in a set of point data, and also displays a polygon associated with each right, showing the area of farmland using that right. I selected only water rights related to agriculture, so municipal and industrial water rights would not show up in the layer. Here is a closeup of water rights data layered on top of the orthographic data:

The water right for the parcel in the center on the right belongs to Donald G Hector for the purpose of irrigation. An article in the Gazette-Times records the passing of Donald’s wife in 2004 from Alzheimer’s after being married to Donald for 53 years. Businessprofiles.com lists the farm as currently inactive. Other than that, I could not find much about Mr. Hector or his farm.

There is a more significant problem with using water rights data to determine farm ownership, which you might intuit from the picture above. There are many parcels of land that are not associated with water rights. In fact, only around 15% of Oregon’s crops are irrigated crops. Once I zoom out, this becomes obvious:

The large blue area at the bottom left is the Greenberry Irrigation District, meaning a utility maintains the local irrigation infrastructure, and taxes farmers individually.

When I was interning at the WSDA, they had enough data to construct a map of the kind of information I want, but they could not publicize it because of privacy concerns, and I think that is the problem I am running into here. I need some NSA style access.

Or a new spatial problem!

For my spatial problem I will examine the role of spatial autocorrelation and seasonality in developing a land use regression (LUR) model. In particular I am interested in optimizing the incorporation of spatial autocorrelation and seasonality for prediction of air pollution in the City of Eugene.

For those unfamiliar with a LUR, it essentially combines GIS variables that are predictive of air pollution concentrations along with actual air pollution measurements in order to predict air pollution at unmonitored locations using ordinary least squares (OLS) regression. The problem with a typical LUR model is that they don’t account for spatial autocorrelation. The value of accounting for spatial autocorrelation is due to the fact that spatially based data, such as air pollution, is typically spatially correlated.

This past quarter in my GEO580 course I developed a LUR that did account for spatial autocorrelation by modeling the covariance of air pollutant concentrations of adjacent zip code boundaries, using a spatial CAR model. For this class I wish to develop this idea even further by using multiple techniques, namely geographically weighted regression (GWR), a spatial CAR model, and OLS to compare the model results to actual air pollution measurements. This work will require me to use both ArcGIS spatial analyst toolbox and the R statistical software.

As mentioned above, I am interested in including seasonal trends in air pollutant variation in order to see if inclusion of seasonal variation is capable of improving model estimates. To do this I propose to incorporate seasonal ratios to annual ratios of air pollutant concentrations.

To keep this work focused I will use data on just one air pollutant, as opposed to last quarter wherein I developed a LUR for seven different pollutants. By focusing on just one pollutant I hope to keep the work efficient and effective toward achieving my goals in this class. Ideally, this work will help to inform my dissertation proposal work.

The goal of this effort was to assess if we could use spatial statistical tools provided within ArcGIS 10.1 to evaluate geospatial relationships to differentiate between natural subsurface leakage pathways and anthropogenic leakage pathways. Evidence for gas migration within the subsurface is well documented through petroleum system and CO2 sequestration related studies. In addition, there are a number of examples of paleo- and modern natural gas seepage from deep subsurface systems to the near-surface in both onshore and offshore environments. Gas/fluid migration are key to the documentation of fluid flux relative to both resource/storage potential estimates and improving constraints on potential environmental impacts associated with gas migration from subsurface to groundwater systems and the atmosphere. Migration of gas from subsurface systems is particularly visible in the marine environment where gas plumes into the water column have been observed and studied by a number of researchers worldwide.
Migration of gas from subsurface systems to near surface systems has been the target of relatively few investigations, although a better understanding of these processes is of increasing interest due to the focus on potential storage of CO2 in geologic media and on methane recovery using hydrofracking techniques. To date, many of the studies evaluating the flux and migration of gas from subsurface to near surface systems have focused on anthropogenic concerns such as potential leakage related to CO2 storage, landfills, and natural gas storage facilities. Given the variability of the natural geologic system and the limited amount of existing field data and studies examining these systems, questions about the timing and nature of gas migration in many regions persist. Evaluating available data to assess whether interactions between subsurface and near-surface systems are related to naturally occurring mechanisms or are the result of anthropogenic activities offers additional challenges. However, improving understanding of mid to large-scale, field- and regional-scale controls on gas flux will help further constrain the relationships between geologic media, and the nature and timing of gas flux in a variety of subsurface systems. Through the evaluation of field datasets utilizing laboratory, geospatial and geostatistical analytical methods patterns related to gas occurrences and concentrations in subsurface systems can be identified.

To support these evaluations in a test scenario for CO2 geologic storage potential in the Appalachian Basin’s Oriskany formation, we utilized geostatistical tools to assess a variety of parameters associated with the reservoir characterization of the Oriskany formation including thickness, porosity, subsurface depth to top of formation, pressure, temperature, and a variety of pore-filling media relationships such as brine density, salinity, etc. The data leveraged for this analysis was largely based upon borehole geophysical log interpretations from West Virginia, Ohio, and Pennsylvania. The distribution of these data were irregular, and not every data point had values for all the reservoir parameters being evaluated. The variability of the wellbore data points is highlighted in the figure below.

To fill in the gaps for areas of low data density or even no data density to develop a complete dataset with values for all parameters to assist with CO2 storage estimations, each parameters needs to have a value (at least a minimum, maximum, and average value) for all parameters. As a result, the following tools and approaches were evaluated within ArcGIS for this dataset:
1. Use the min, max, and average values for the entire dataset and assign those values to all empty grids
2. Inverse Distance Weighting (IDW) evaluated to fill in the empty grids
3. Kriging also evaluated as a tool to fill in the empty grids
Example of these results for porosity data:

The graph above shows the relative distribution and compares average % porosity values for the raw data points against interpolated values using ArcGIS’ kriging versus inverse distance weighting (IDW) approaches. While offering generally similar results, the IDW approach for this dataset appears to interpolate and fill in data gaps in a manner that more closely follows the trends of the actual average porosity data versus that of the kriging interpolation algorithm.
Summary and preliminary conclusions:
1. Recommend using IDW interpolations to fill gaps
a. Best fit to original well data for parameters
b. Represent natural fluctuations in data range
c. Correlates parameters across spatial distances
2. Intent to represent “field-” to basin-scale features and variability not reservoir/borehole scale features.
3. Help inform technology development, risk evaluation, knowledge gaps
4. For extrapolation to more complex systems and problems dealing with 3D, multi-system components:
a. Basic geostatistical tools are sufficient for preliminary analyses and filling in gaps
b. But more complex analyses will require other software/approaches

After tinkering with several different tools and datasets, my end goal for this class became to digitally reproduce my manual methodology for mapping wetlands.

The reason my methodology is manual (rather than digital) is because temporal and spatial context are very important when mapping wetlands in the Willamette Valley using Landsat satellite imagery.

For example, riparian wetlands along the Willamette River have the same spectral signal as poplar farms and orchards in the same vicinity. If using a single date to classify wetlands, a digital classification may confuse land use types. However, looking backward and forward in time, I can tell which areas are stable wetlands and which are rotated crops. Additionally, I have other layers such as streams, soils, elevation, and floodplain to see whether it is logical for a wetland to exist in a certain area.

When I classify wetlands I have several different data layers available to aid in decision making:

Spectral

(40) annual Landsat tasseled cap images from 1972-2012

“Ancillary”

Lidar inundation raster
Streams
Soils
Others

My options for classifying wetlands using my satellite imagery include:

Spectral-only supervised classification -Use time series as additional channels
Object oriented classification
Incorporate ancillary with spectral
Mix

Arc seemed like the perfect application to attempt to incorporate my spectral Landsat data with my other layers.

At first I thought using one of the regression tools on a mapped sample of my study area could allow me to classify the rest of my study area using OLS predictions. However, when I looked at my data, the relationships between datasets did not appear robust enough.

Instead, I decided do to a binary raster overlay to find possible locations of wetlands. I selected a small study area focused around South Corvallis because it had a good mix of both riparian forest stands as well as hardwood orchards.

I included spectral data from a 2011 tasseled cap image using the indices brightness, wetness, and greenness. I calculated the spectral stats for BGW using areas of known riparian wetlands. A new binary raster was created for B,G, and W; pixels with data around the mean +/- one standard deviation were given a “1” and all other pixels were given a “0”.

I also included a 1 meter Lidar two year inundation floodplain map. Areas in floodplain were reclassified as “1” with all other space as “0” as well as a 100m distance raster of the streams in the area

All layers were given equal weight in the raster overlay. The result was a raster with little to no error of omission but high error of commission.

(Areas of yellow indicate manually mapped and validated riparian wetlands; green is result of raster overlay).

Just for comparison, I decided to run a spectral classification in ENVI using my manually mapped wetlands as regions of interest (i.e. training sites).

The result presented increased error of omission but decreased error of commission.

You can run spectral classification in Arc but the process is not streamlined and can become convoluted. Additionally, ENVI automatically extrapolated classification of an image based on training data; Arc is clunky when it comes to prediction and extrapolation.

My final thoughts on the Arc spatial statistics toolbox are that:

Arc spatial stats are (mostly) useless when it comes to categorical/non-continuous data
- Especially useless when it comes to mixing continuous with non-continuous
Raster tools are not multi-channel friendly (I had to process each band separately)
Classification and prediction are convoluted, multistep processes in arc
- Operations that ENVI, R, etc. do flawlessly
I should probably expand to other software for serious analysis
- eCognition, habitat suitability mappers, etc.

–

After considerable experimentation with a variety of ArcGIS’s Spatial Statistics tools, including Hot Spot Analysis, Cluster Analysis, Spatial Autocorrelation, Geographically Weighted Regression, and Ordinary Least Squares, I think I may have found a viable method for analyzing my SSURGO Soils dataset. For my final class presentation for this course, I employed the Grouping Analysis tool to explore the spatial patterns and clusters of high clay content within the sub-AVAs of the northern Willamette Valley. The visual correspondence between the resulting groups and the soil orders (i.e. taxonomy) was surprisingly accurate.

Reading through the literature on ESRI’s webpage about Grouping Analysis, I learned that one should start the Grouping Analysis using one variable, incrementally adding more with each subsequent run of the analysis. Following suit, I have experimented with both the addition of more variables as well as the total number of ideal groups for a given data set. While the soils present in each of the sub-AVAs are incredibly heterogenous and diverse, they do share some similarities, particularly with regard to clay content and soil taxonomy.

The results published here reflect an analysis using the variables of percent clay content, Ksat, Available Water Storage at 25cm, 50cm, and 150cm, respectively; choosing to parse the data into 5 groups. I also took advantage of the “Evaluate Optimal Number of Groups parameter” option within the toolbox, which generates additional statistics meant to identify the number of groups that will most readily distinguish one’s data set into distinct groups.

In addition, I generated Output Report Files with each run so that I could explore the statistical results in more depth. I’ve attached these for those of you who are interested in seeing what the results look like. I find it interesting that for almost all of my AVA data sets save for one, the resulting reports are suggesting that 15 is the optimal number of groups. I’m not sure if this is because 15 is the maximum number of groups that the tool can generate, or if this is a result of the particular variables I am using as inputs.

chehalem_grouping5

dundee_grouping5

eola_amity_grouping5

ribbon_ridge_grouping5

yamhill_carlton_grouping5

Additional variables that I plan on adding include percent sand, percent silt, bulk density, percent organic matter, and parent material. I am also considering incorporating raster data sets of slope, aspect, landform, vegetation zone, precipitation, minimum temperature, and maximum temperature. Performing multiple iterations of the Grouping Analysis will help me to identify a suitable combination of these variables, as well as the optmimal number of groups. Once those have been identified, I plan on performing the same analysis on each AVA, and then on buffered polygons of the AVAs at distances of 500m, 1000m, 1500m, 2000m, 2500m, and 3000m. In so doing, I hope to identify the degree to which different sub-AVAs in the northern Willamette Valley differ from directly adjacent landscapes. This will allow me to articulate those sub-AVAs which best correspond to the underlying soil classes in those areas.

My goal for this class was to learn what tools available in ArcGIS’s Spatial Statistics toolbox might be useful in helping to answer my research questions regarding spatial-temporal relationships between mountain pine beetle outbreak\spread and coinciding environmental characteristics such as topography, climate, and forest attributes. I quickly learned that the spatial statistics tools in ArcMap are not suitable for my data. All tools expect vector data as inputs and I am working with raster image data. I can do conversions from raster to vector, but it really doesn’t make sense computationally and theoretically. Also, I found that my image data needed a lot more pre-processing to get it to a point where it could be analyzed than I originally thought. I spent the majority of the class time preparing my data, which included: georegistration, disturbance detection, and disturbance causal agent identification. Once I had completed these steps for a pilot image scene I was able to do a quick look at outbreak elevation histograms through time, and also annual climate prior to and during the outbreak.

The remainder of this post will walk through the steps I took to process the imagery and present some initial findings on mountain pine beetle outbreak and environment relationships.

The first step in image preprocessing was to properly georegister mis-registered images. Please click on the images below to see an example of a before and after image registration correction.

Figure 1. Satellite image before and after georegistration correction (click images to see animation)

The registration process was coded and automated using the R “Raster” package. The first step was to sample the image with points and extract small image chips or windows from around the point in the reference image and the misregistered image (see figure below).

Figure 2. Sample reference and misregistered image and extract image chips from around the sample points.

Within the registration program, for each sample point, the misregistered image chip is “slid” over the reference image chip at all defined x and y offset distances and cross correlation coeffiecients are calculated for each shift based on the intersecting pixel values. Please click on the figure below to see an example of the moving window cross correlation.

Click image to see animation

Figure 3. Animation example of moving window cross correlation analysis between reference and misregistered image chips for iterative xy offset shifts.

A cross correlation surface is produced for each sample point. The apex of the conical peaks represent the offset that is needed to correct the misregisted image. A 2nd order polynomial equation is created from all legitimate peaks and used to warp the misregistered image into its correct geographic position.

Figure 4. Example cross correlation surfaces showing the best offset matches between the reference image and misregistered image chips

With all images in their correct geographic position, the “LandTrendr” change detection program was applied to the long time series of Landsat imagery to identify pixels that have been disturbed. A discriminant analysis of empirical variables related to spectral, shape, and topographic characteristics of identified disturbances was conducted to predict disturbance agent from a training set of labeled disturbances. The figure below depicts a mountain pine beetle outbreak identified in central Colorado. Click on the image to see the outbreak start and progression (colors represent magnitude change to forest: low-high\blue-red)

click image to see animation

Figure 5: Mountain pine beetle outbreak spread as captured by annual Landsat satellite imagery

From the outbreak progression video above, you can see that the outbreak appears to move up slope. To find out if this is truly the case I extracted elevation values for all insect affected pixels per year and plotted the histogram of both elevation value density and frequency. Please click on the images below to see the shift in elevation and area affected as the outbreak progresses.

Click images to see animation

Figure 6. Animated histograms depicting the progression of mountain pine beetle progression up slope.

I was also curious about what the annual climate was doing before and during the outbreak. I extracted PRISM climate data from 1985 to 2008 in the region of the outbreak and plotted it with the count of insect-disturbed pixels. The figure shows that maximum and minimum annual temperature begin to increase 1 to 2 standard deviations from mean about 5 years before the outbreak really takes off. This corresponds with a 3-4 year drop in annual precipitation. These conditions could have drought stressed the forests and provided a highly productive climate for the beetle to reproduce multiple times in a season and avoid freeze-kill.

Figure 7. Graph showing annual deviation from mean for PRISM climate variables and insect-disturbed pixels for 23 years.

In closing, I found the ArcMap spatial statistics unable to work with my raster format data, but was able to make a lot of progress in data preparation and analysis and exploration of satellite image detected insect outbreaks and corresponding environmental factors.

First of all, I would like to say I learned about Ordinary Least Squares (OLS) and Geographically Weighted Regression (GWR) tools from ArcGIS help 9.3, 10.0, and 10.1. This final report is based on the knowledge I gained through those help menus.

My learning goal in this class was to learn ArcGIS Spatial Statistics Tools that help answer my project questions. For my project, I am examining whether or not various types of dissolved organic matter (DOM) sources are controlled by factors such as land use types, stream orders, other water quality parameters, and seasonal flow patterns. At the beginning of the quarter, I was going to use SSN & STARS and FLoWS, which are add-in spatial statistical tools for stream networks; however, they did not work as they are for ArcGIS 9.3.

I moved onto GWR, which is one of the Spatial Statistics Tools. These are the steps and tools I had to follow and use to run GWR.

OLS to select independent variables
Spatial Autocorrelation (Moran’s I) on residuals to make sure that I did not miss any significant independent variables
GWR
Spatial Autocorrelation (Moran’s I)

Because I have not gotten to the point to identify different DOM sources, I used dissolved organic carbon (DOC) concentrations as a dependent variable. For independent variables, I examined land use (forest, urban, and agriculture), stream order, absorbance at 254 nm, and water quality parameters such as dissolved organic nitrogen (DON), nitrate (NO₃^–), ammonium (NH₄⁺), and ortho-phosphate (PO₄³⁺).

The data used were collected at the end of the dry season in 2012. Thus, the temporal variation was not considered here. I did not have time to create catchment and watershed layers for each sampling location, so I eyeballed the percentage of each land use for the purpose of completing this class project. I used data from 21 sampling sites.

OLS and Moran’s I

Plotting scatter plots with Excel (I could have used the ArcGIS scatter plot tool) before running OLS helped me understand my data. Once I plotted scatter plots, it was evident that absorbance at 254 nm was highly related to DOC concentrations. Absorbance at 254 nm is used to describe aromaticity of DOM. It makes sense why these two variables were correlated because high aromatic compounds mean high DOC (note, high DOC does not mean high in aromaticity). My samples must have been aromatic at the end of the dry season in 2012. Thus, I decided to use absorbance at 254 nm as the first independent variable. OLS generates two forms of outputs: a report and a residual layer.

As I expected, the result was good. About 88% of dependent variable’s variance was explained by absorbance at 254 nm by looking at Adjusted R-Sqaured. Asterisk on the probability (p-value) indicates the particular independent variable is statistically significant (p < 0.05). Another information to look at is Corrected Akaike Information Criterion (AICc). The low AICc indicates good model fits with the observed data.

According to Wikipedia,(http://en.wikipedia.org/wiki/Akaike_information_criterion)

AIC = 2k – 2ln(L)

where k is the number of parameters and L is the maximized value of the likelihood function.

AICc = AIC + 2k(k+1)/(n-k-1)

where n is the number of sample.

Another output of OLS is a residual layer. Red color indicates the observed dependent variable value is higher than the corresponding predicted value. Blue color indicates the observed dependent variable value is lower than the predicted value.

Before moving onto GWR, we must run Moran’s I on residuals to see if they are clustered. If they were, I still have an important independent variable that I can add to the OLS model. The result of the Moran’s I indicated the residuals were not clustered; however, I decided to add one more independent variable for a practice purpose.

Assuming residuals were clustered, the next thing we have to do is to check scatter plots to see if there are other independent variables that are correlated to the dependent variable. I found DON and DOC were correlated, so I decided to add DON as the second independent variable. I expected to see the correlation between DOC and the percentage of each land use type in each catchment; however, there was no strong relationship between them.

As I was running OLS, I realized I had to make the field numerical type to be DOUBLE.

OLS was run again, and the report showed me the improvement in the OLS model as the adjusted R² value was increased to 0.947 and AICc was decreased to – 7.67. The term, 2k(k+1)/n-k-1, in the AICc equation is a positive value (i.e. 0.67). When I plugged in other values, L turned out to be about 477. The L with one independent variable of absorbance at 254 nm was about 0.097.

I decided to use both absorbance at 254 nm and DON because I am satisfied with improved values of adjusted R² and AICc. And the sign of coefficients are both positive as expected from previous scatter plots.

I ran Moran’s I on the new residuals and found no clustering. Now it was time for me to check other statistical results from OLS.

There are two statistical terms to tell if each independent variable is statistically significant: Probability and Robust Probability. If Koenker Statistics is significant (look for the aretisk), there are regional variations and I can trust only robust probability. My report showed a significant Koenker Statistics. This makes sense because I have regional variations among sampling sites. Clustered sampling sites around the H.J. Andrews Experimental Forest represent undisturbed forest. In Corvallis, clustered sampling sites represent urban and agricultural land use.

Additionally, I have to check if those independent variables are redundant in describing the dependent variable by examining Variance Inflation Factor (VIF). This value needs to be less than 7.5. If it is greater than 7.5, I have to remove the corresponding independent variable. Absorbance at 254 nm and DON were not redundant as their VIFs were about 3.1.

Lastly, the residuals cannot be distributed normally because normally distributed residuals indicate that the model is biased. You can create a histogram or check if Jarque – Bera Statistics is significant in the report output. My model was not biased.

2. GWR

Although Koenker Statistics from OLS suggested there was a regional variance, the result of a regional model of GWR did not improve from a global model of OLS.

GWR generates a layer for each independent variable. Standard Error of Coefficient describes which part of the region in the study area is significant for the particular independent variable. As you can see, there is no variation in Standard Error of Coefficient for both absorbance at 254 nm and DON. That is probably why the ALCc and Adjusted R² of OLS model and GWR model turned out to be the same.

3. Next steps

a) Investigate why GWR result was not improved from OLS result even though Koenker Statistics suggested it should improve.

b) Create a layer with stream networks (using Generate Network Spatial Weights?), so I can use it as a spatial weighting for GWR to improve GWR model.

c) Consider interpolative my pointe features first. My inputs were all point features. I wonder if it is beneficial to interpolate first.

d) Find out if it is a problem to run OLS and GWR when my sampling sites are clustered.

Where did the time go ?! Seems like we just began this term and it is already over. While I feel pretty unproductive right now, especially after seeing the outstanding work my fellow grad students have accomplished, I am extremely appreciative of the opportunity to explore the statistical tools available in ArcGIS and elsewhere that this course has provided. I found the freedom to explore our own data sets and the advice and encouragement provided by Julia and everyone else in the class incredibly rewarding. Thanks Julia for a great class!

I began the term with a pretty undefined thesis project (its still fairly fuzzy but at least I can start to make out a faint outline now). My data set is small and limited to a single survey season. As depicted in Figure 1, I have GPS track logs and the encounter locations for 20 cetacean surveys conducted in the Marquesas Islands of French Polynesia.

Map of the study area in the Marquesas Islands of French Polynesia showing the extent of small boat surveys and cetacean encounter locations. — Figure 1. Map of the study area in the Marquesas Islands of French Polynesia showing the extent of surveys and the location of cetaceans encountered during small boat surveys in 2012.

My primary focus is on a poorly understood species, melon-headed whales (Peponocephala electra).

Melon-headed whales courtesy of ARKive, http://www.arkive.org/melon-headed-whale/peponocephala-electra/image-G85523.html.

These members of the dolphin family form very large “herds” (50 to as many as 1000 individuals) and have been observed congregating near the shore of Nuku Hiva in very specific locations on a regular, daily basis (Figure 2).

Figure 2. Encounter locations for melon-headed whales in the Marquesas Islands. Many of these locations are identical to those documented by Gannier in the mid-90s.

I spent most of this term finding, accessing, downloading, importing, reclassifying, converting, re-downloading, cussing at, etc., etc., environmental and bathymetric data. Using this data and other environmental data such as information on currents and sea surface height, I hope to investigate the differences and similarities between melon-headed whale encounter locations in order to 1) characterize these resting/socializing areas and 2) develop a model to predict possible resting/socializing locations in areas that have not been surveyed.

I was able to explore some of the tools in the Spatial Statistics Toolbox but for this data many of the tools are not applicable. For example, Ordinary Least Squares and Geographically Weighted Regression assume that there is linearity in the data. My data does not show linearity, even after transformation, and my response variable is not continuous. Running the Average Nearest Neighbor Tool produced the results that one would predict after looking at the maps provided above – the encounter locations are more clustered than predicted (Figure 3).

Figure . Results of a Nearest Neighbor analysis for melon-headed whales. As predicted, encounter locations appear to be more clustered than expected by chance.

All of these results brought me to a point where I needed to take a step back and reexamine my data and my objectives. I felt like I was attempting to ask questions that just aren’t going to be answered by my data. My main question involves the characteristics of the encounter locations that define melon-headed whale resting locations. To get at this question, I plan on defining encounter locations spatially, i.e. delineate polygons of a certain size around encounter locations, and statistically examine the similarities and differences between the polygons using the environmental and oceanographic data mentioned above. I will continue to explore the tools available in ArcMap as well as the plethora of non-ArcGIS tools to answer this question.

Just another blogs.oregonstate.edu site

Eco-region Classification Based on GIS

Determining My Spatial Problem From Envision Outputs

Problems with Spatial Problems

Incorporating Spatial Autocorrelation and Seasonality into the Development of a Land Use Regression Model

Geospatial Analysis of Natural and Engineered Data GEO 599 Spring 2013

Mapping Wetlands with Spectral and Ancillary Data

Using Grouping Analysis to identify the correspondence between northern Willamette Valley American Viticultural Areas and Soil Classes

Final Post

What I learned about OLS and GWR

Characterizing melon-headed whale resting/socializing habitat in the Marquesas Islands

Contact Info