I went ahead and performed the seasonal adjustments I proposed at the end of my last blog post (see the “Temporal Data” post). Briefly, I took the ratios of mean monthly naphthalene concentrations to the annual mean concentration, using all of the air monitoring data (i.e. both sites combined). These ratios were then used as adjustment factors to simulate monthly observations from the annual NATA estimates I have. The simulated monthly values were then averaged over the year to obtain the “seasonally adjusted” annual estimates.
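For reference, here is a minimal sketch of that adjustment in R. The data frame `lrapa` (columns `month` and `conc`) and the vector `nata_annual` are hypothetical placeholders for the LRAPA observations and the NATA annual estimates, not the actual files:

```r
# Ratio-based seasonal adjustment (sketch with hypothetical object names).
# lrapa: data frame with month (1-12) and conc (naphthalene concentration)
# nata_annual: vector of annual NATA estimates, one per location

# Adjustment factor for each month: mean monthly concentration / annual mean
monthly_mean <- tapply(lrapa$conc, lrapa$month, mean, na.rm = TRUE)
annual_mean  <- mean(lrapa$conc, na.rm = TRUE)
adj_factor   <- monthly_mean / annual_mean          # 12 ratios

# Simulate monthly observations for each annual NATA estimate,
# then average the simulated months to get the adjusted annual value
simulated       <- outer(nata_annual, adj_factor)   # rows = locations, cols = months
adjusted_annual <- rowMeans(simulated)
```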

The graph below displays the seasonally adjusted versus the unadjusted annual naphthalene concentrations. The adjustment factors caused the annual estimates to increase slightly. These seasonally adjusted estimates will now serve as the simulated input data for estimating annual concentrations in my LUR model.

Seasonally Adjusted

There is a local R user group in Portland, Oregon. R is a free software environment for statistical computing and graphics (http://www.r-project.org/). The group’s goal is to support and share R experience and knowledge among users in the Portland community. They would like to hear how you use and enjoy the R language and statistical programming environment.

For more information, visit http://www.meetup.com/portland-r-user-group/

I have been busy trying to figure out how exactly to incorporate seasonal variation of naphthalene ambient air concentrations into my Land Use Regression (LUR) model. To start, I obtained time series data of naphthalene concentrations from the Lane Regional Air Protection Agency (LRAPA) to see whether there is indeed a substantial amount of temporal variation. The graph below portrays the variation in mean monthly naphthalene concentrations for two air monitoring sites in Eugene (Amazon Park and Bethel) from April 2010 through April 2011.

Presentation1
Data: LRAPA

While the graph above is somewhat informative, it does not fully describe the relationship. The graph below presents the data disaggregated by month, without averaging, and shows a great deal of variation both within and between months. This suggests a complex temporal pattern that would likely require a non-linear model. I therefore decided to use all of the data, rather than aggregated means, to better characterize the temporal relationship (i.e. retaining the within-month variation).

each.obs

Next I modeled log-transformed naphthalene concentrations (the raw concentrations were clearly not normally distributed but appeared log-normal, so the log transformation brought them close to normality) using month as a predictor and air monitoring location as an additional predictor. The month term was fit with a smoothing function in a generalized additive model (GAM) to allow for the obvious non-linear relationship between month and naphthalene concentration.

The following graph displays the relationship between month and log naphthalene concentration, fit with the GAM function in R to allow for a non-linear trend. The smooth term proved to be the better fit: it captures the obvious visual trend, and month was a significant predictor only when treated as non-linear (in the linear model, the season term was not significant, which again is not surprising). This work confirmed a complex relationship between pollutant concentration and month, even after adjusting for air monitoring location.
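For anyone curious, here is a minimal sketch of this kind of fit with the mgcv package in R; the data frame `naph` and its columns (`log_conc`, `month` as a number from 1 to 12, `site`) are hypothetical stand-ins for the LRAPA data:

```r
# Seasonal GAM sketch with hypothetical object names.
# naph: data frame with log_conc, month (numeric 1-12), site (factor)
library(mgcv)

# Cyclic cubic spline on month so December joins smoothly back to January;
# monitoring site enters as a parametric (fixed) effect
fit_gam <- gam(log_conc ~ s(month, bs = "cc", k = 12) + site, data = naph)

# Linear-in-month model for comparison
fit_lin <- lm(log_conc ~ month + site, data = naph)

summary(fit_gam)          # significance of the smooth seasonal term
plot(fit_gam, pages = 1)  # partial effect of month on log concentration
AIC(fit_lin, fit_gam)     # rough comparison of the two fits
```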

As far as next steps are concerned, I am thinking of using this temporal GAM modeling approach for my LUR model. I would use the historical dataset from LRAPA to simulate temporal (i.e. monthly) observations for my NATA data set. If I am able to simulate this dataset, it would enable me to include time in my LUR model (using a GAM smooth for month) and therefore to see whether adding temporal relationships improves my predictive LUR model. I view this as a novel way of incorporating temporal variation as an explanatory variable. One approach could be taking the ratios of monthly to annual observed LRAPA naphthalene concentrations and multiplying these monthly ratios by the NATA annual estimates. The simulated data would then be the inputs for fitting the LUR GAM model.

GAM

Wetlands provide numerous ecosystem services; one of the most valuable to people is their natural ability to filter water. I want to understand how wetlands are connected to water quality.

5/14/2014 UPDATE: After much deliberation, I’m taking a different approach to this project and class, and hopefully thesis!

New Research Questions
− Using Finley as a case study, can a wetland refuge improve water quality? To what extent?
a. How can it improve water quality? Nutrients: there is evidence that wetlands can act as a nitrogen sink, which matters in an agriculture-heavy area like Corvallis.
− How similar or different are invertebrate and plant communities (diversity and abundance as an indication of how “healthy” the ecosystems are) within Finley compared to surrounding streams and mitigated wetlands?
a. R: statistically significant correlations?
b. ArcGIS: spatial relationships (maybe cluster analysis)? (Would need to account for spatial autocorrelation of habitat type, etc.)
− Historically, how has land use affected the Finley refuge wetlands? How has Finley’s history impacted this study? How might land use affect the wetlands today?

Hypothesis
− The ecosystem service of interest is improvement in water quality: I hope to show that streams leaving wetlands carry less nitrogen than streams entering wetlands (a rough sketch of how I might test this in R is below).
a. May need a lit review to see what degree of improvement is typical, as a baseline.
– I can also compare against other surrounding stream quality data; maybe a proximity analysis correlated with change in quality? Again, I expect better quality in streams leaving wetlands than in streams that don’t interact with the wetlands at all.

– I would expect similar-size nearby mitigated wetlands to show similar results; if they differ, maybe plant and invertebrate communities can indicate differences in ecosystem health between the refuge and mitigated wetlands.

– I would expect past land use, as well as current surrounding land use, to impact the health of the wetlands and streams; perhaps run a correlation of land use against plant communities and the nitrogen content of streams?
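For the first hypothesis, here is a minimal sketch of the comparison in R; the data frame `wq` and its columns are hypothetical placeholders for water quality samples matched to wetland inflow and outflow points:

```r
# Paired comparison sketch with hypothetical object names.
# wq: one row per wetland, with nitrogen measured where a stream enters
#     (n_in) and leaves (n_out) the wetland
# wq <- data.frame(wetland = ..., n_in = ..., n_out = ...)

# Is nitrogen lower leaving the wetland than entering it?
t.test(wq$n_out, wq$n_in, paired = TRUE, alternative = "less")

# Non-parametric alternative if the paired differences are far from normal
wilcox.test(wq$n_out, wq$n_in, paired = TRUE, alternative = "less")
```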

Also, does anyone know of a good land use shapefile that covers the Finley area, or any other shapefiles for Finley and the surrounding areas? I’d hate to work off of aerial photography if I don’t have to (I’m not familiar with it, so if I do need to go that route, any links to tutorials would be great!). Any input is appreciated; this is the seed of an idea that can go a lot of ways.

Will update with a map next week!

————————————–

A National Wetlands Inventory shapefile of Oregon’s wetlands was clipped to the nine counties that make up the Willamette Valley (county data provided by BLM): Multnomah, Washington, Yamhill, Clackamas, Marion, Polk, Linn, Benton, and Lane. The wetlands were mapped between 1994 and 1996 by The Nature Conservancy of Oregon (funded by the Willamette Basin Geographic Initiative Program and the Environmental Protection Agency (EPA)), producing a dataset that inventoried, classified, and mapped native wetland and riparian plant communities and their threatened biota in the Willamette Valley. I have also clipped 2004–2006 stream and lake water quality data from the DEQ to these counties. In addition, I have mitigation bank data on mitigated wetland locations in the Willamette Valley, compiled by ODSL and ODOT and developed by The Nature Conservancy.
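For anyone doing the same preparation in R rather than ArcGIS, here is a minimal sketch with the sf package; the file paths and the NAME field are hypothetical placeholders for the NWI and county layers:

```r
# Clipping sketch with hypothetical file names and attribute fields.
library(sf)

wetlands <- st_read("nwi_oregon_wetlands.shp")
counties <- st_read("blm_counties.shp")

# Keep only the nine Willamette Valley counties (NAME field assumed)
wv_names <- c("Multnomah", "Washington", "Yamhill", "Clackamas",
              "Marion", "Polk", "Linn", "Benton", "Lane")
wv_counties <- counties[counties$NAME %in% wv_names, ]

# Match projections, then clip the wetlands to the valley counties
wetlands    <- st_transform(wetlands, st_crs(wv_counties))
wv_wetlands <- st_intersection(wetlands, st_union(wv_counties))
```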

Wetlands Map

I am interested in looking at the connectivity of streams to wetlands and the relationship of water quality to wetland location. For connectivity, I may compare streams connected to wetlands versus those that are not. Additionally, I am interested in seeing whether water quality differs around mitigated wetlands versus natural wetlands. I also have stream and lake water quality data from other years, so I may measure statistically significant change over time as well.

I am interested in receiving comments regarding potential statistical analyses to examine connectivity, to compare water quality around mitigated versus natural wetlands, and to compare water quality data over time.

So far I have identified over 70 local farms providing food to the Corvallis Farmer’s Market.  While many of the farms are far-flung, there is a definite clustering effect around the city of Corvallis.  This map shows the whole study area; keep in mind I am still collecting tiles because there are farm locations outside this area that I cannot place yet.  The purple ellipse comes from the Directional Distribution tool; it shows the area containing 68% of the local farms, i.e. one standard deviation.  I traced the city limits for Corvallis, Albany, and surrounding cities in light blue using a city limits shapefile.  Farms that sell at the local Farmer’s Market are represented by gold stars.  Note that the ellipse skews to the right of Corvallis and is elongated from north to south.  Essentially, the ellipse follows the contour of the Willamette Valley, which we would expect.

wv7

The purple cross is the mean center of the distribution of local farms, which is also the center of the ellipse.  The orange triangle is the median center of the farm distribution.  The median center moves quite a bit toward Corvallis, implying that remote geographic outliers influence the mean, and that farms may cluster more strongly around the city of Corvallis than the distribution ellipse and mean center suggest.
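The mean-versus-median comparison is easy to reproduce outside of ArcGIS; here is a minimal sketch in R, assuming a hypothetical data frame `farms` with projected x/y coordinates:

```r
# Mean vs. median center sketch with hypothetical object names.
# farms: data frame with projected coordinates, e.g. farms <- data.frame(x = ..., y = ...)

# Mean center: average of the coordinates (sensitive to remote outliers)
mean_center <- c(mean(farms$x), mean(farms$y))

# Median center approximated by the component-wise median
# (ArcGIS's Median Center tool minimizes total Euclidean distance,
# so this is only a rough stand-in)
median_center <- c(median(farms$x), median(farms$y))

mean_center
median_center
```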

wv8

There remain more than two dozen farms on the Corvallis Farmer’s Market list that I have yet to add to the dataset.  After that, I would like to know the approximate acreage of each farm.  This would allow me to do a hotspot analysis around a specific question.  I have a theory that farms near Corvallis are likelier to be smaller, and that being near an urban center makes it more feasible to grow high-quality produce on small acreage as a business model.  To put it another way, Corvallis acts as a market driver that spurs and sustains local sustainable development nearby.  I could test this with a hotspot analysis if I had an acreage estimate for each farm.
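A minimal sketch of such a hotspot analysis (local Getis-Ord G) in R with the spdep package; the farm coordinates and the `acreage` column are hypothetical, since I do not yet have acreage data:

```r
# Hotspot sketch with hypothetical object names.
# farms: data frame with projected x/y coordinates and an acreage column
library(spdep)

coords <- cbind(farms$x, farms$y)

# Neighbors: each farm's five nearest neighbors, converted to binary weights
nb <- knn2nb(knearneigh(coords, k = 5))
lw <- nb2listw(nb, style = "B")

# Local G statistic: strongly positive z-scores mark clusters of high acreage,
# strongly negative z-scores mark clusters of low acreage (near Corvallis?)
farms$localG <- as.numeric(localG(farms$acreage, lw))
```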

Identifying the location for each farm remains tedious, but each farm has a phone number associated with it, and many have an email address.  An initial email survey followed by phone calls could yield information about the size of each farm, how long it has been selling locally, and how much of its produce goes to local markets.

There is definitely error in the accuracy of the farm locations.  Google Maps does a poor job of identifying the location of farm addresses, so I am sure some stars are nearby, but not on the right farm.  It is also hard to determine boundaries of ownership by visual assessment, so a polygon shapefile estimating farm areas would be even less accurate.  While I can use tax lot information for Benton County to determine farm area, Linn County is much more difficult to access, and the farms are spread across many counties, so the process is time-consuming.  A combination of ground-truthing and surveying would be necessary to improve accuracy to a publishable level.  I have also not addressed farms selling to local groceries like the First Alternative, to local restaurants through wholesale distributors, or through CSAs, all of which are significant contributors to the local food system.

LiDAR point information is usually available as a set of ASCII or LAS data files:

LASdirectory

ArcGIS only supports LAS data files; to use ASCII LiDAR data with Arc, you’ll need to use an external tool to convert to LAS.

LAS files cannot be added to a map directly; they must be combined into an LAS dataset that sets a consistent symbology and spatial reference for the entire collection.  To create an LAS dataset, go to ArcCatalog, right-click the folder you want to store the dataset in, and select New > LAS Dataset:

LASDcreation

Note that you will need 3D Analyst or Spatial Analyst activated to do this.  I recommend checking all the extensions to be sure your tools run the first time.

Right-click the dataset in ArcCatalog and choose Properties.

Ensure your dataset has an appropriate coordinate system in the XY Coordinate System tab, which you’ll need to get from the metadata for the LiDAR.  Next, activate the LAS Files tab and click Add Files or Add Folders.   Once you are done adding files, activate the Statistics tab and press the Calculate button.

calculated

At this point, you can import your data into either ArcMap or ArcScene.  There are pros and cons to both.  As far as I’ve been able to determine, it is impossible to plot 3D point clouds in ArcMap with a DEM or map base.  This is possible in ArcScene, and it is also possible to color points according to intensity in 3D view, but unlike in ArcMap there is no ability to adjust point size, and very limited ability to adjust colors of points, at least as far as I’ve been able to determine over the last few days.

Some LAS datasets will include RGB information in addition to intensity, which allows 3D true-color visualizations.

midLookout

This image shows the Middle Lookout Creek site in the HJ Andrews Experimental Forest as a point cloud colored by intensity in ArcScene.  The creek and some log jams are visible on the right, and a road is visible on the left.

To convert LiDAR points to a DEM, you’ll need to convert the dataset to a multipoint feature class first.  Open ArcToolbox and go to 3D Analyst Tools > Conversion > From File > LAS to Multipoint.  Select the specific LAS files you want.  It’s likely that you’ll want to use the ground points only.  The usual class code for ground points is 2, but you’ll want to check this by coloring the points by class.

Once you’ve created the multipoint feature, you need to interpolate the values between the points.  There are several ways to do this, each with advantages and disadvantages.  Popular methods are Spline and Kriging.  Spline generates a smooth surface but can create nonexistent features, and doesn’t handle large variation over shorter-than-average distances very well.  Kriging is hard to get right without experience in what the parameters do, and can take a long time to achieve the best results, but it attempts to take spatial autocorrelation into account.  In general, Kriging is better for relatively flat areas, and Spline is better for sloped areas.  Inverse Distance Weighting is popular, but produces results similar to a very poorly configured (for terrain, anyway) Kriging interpolation.  I find that a safe but time-consuming bet is Empirical Bayesian Kriging, which can be found in the toolbox under Geostatistical Analyst Tools > Interpolation, along with a few other advanced interpolation methods that I am not as experienced with.  If anyone else is familiar with these, I’d welcome a post explaining how best to use them.
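For comparison, here is a minimal sketch of IDW and ordinary kriging in R with the gstat package, interpolating hypothetical ground-point elevations onto a regular grid; `ground_pts` (an sf point layer with an `elev` column) and the 5 m cell size are assumptions, and this illustrates the trade-offs above rather than the ArcGIS workflow itself:

```r
# IDW vs. ordinary kriging sketch with hypothetical object names.
# ground_pts: sf POINT layer of LiDAR ground returns with an elev column,
#             in a projected coordinate system
library(sf)
library(stars)
library(gstat)

# Regular grid covering the points (5 m cells)
grid <- st_as_stars(st_bbox(ground_pts), dx = 5, dy = 5)

# Inverse Distance Weighting: quick, but prone to "bullseye" artifacts
idw_dem <- idw(elev ~ 1, ground_pts, newdata = grid, idp = 2)

# Ordinary kriging: fit a variogram model first, then predict
v      <- variogram(elev ~ 1, ground_pts)
v_fit  <- fit.variogram(v, vgm("Sph"))
ok_dem <- krige(elev ~ 1, ground_pts, newdata = grid, model = v_fit)
```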


I am working on a habitat model to predict changes in available stream habitat for acid- and thermally-sensitive aquatic species. The main goal is to combine stream temperature model results with existing acidity (ANC) results for the southern Appalachian Mountain region to evaluate the spatial extent of a habitat “squeeze” on these species. The relationship between air and water temperature will also be explored to generate future scenarios of habitat availability under changes in air temperature. I have mostly been working with non-spatial regression modeling techniques, and I would like to explore the use of spatial statistical models to account for spatial autocorrelation among observations.
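One possible starting point for the spatial piece, sketched in R; the data frame `sites` (with projected coordinates plus water and air temperature columns) is a hypothetical stand-in for my observation data:

```r
# Spatial autocorrelation check and spatial error model, hypothetical names.
# sites: data frame with projected x/y coordinates, water_temp, air_temp
library(spdep)
library(spatialreg)

coords <- cbind(sites$x, sites$y)
lw <- nb2listw(knn2nb(knearneigh(coords, k = 8)), style = "W")

# Non-spatial regression of water temperature on air temperature
fit_lm <- lm(water_temp ~ air_temp, data = sites)

# Moran's I test on the residuals: is there leftover spatial structure?
lm.morantest(fit_lm, lw)

# If so, a spatial error model is one way to account for it
fit_sem <- errorsarlm(water_temp ~ air_temp, data = sites, listw = lw)
summary(fit_sem)
```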

Here is my full study region:

StudyRegion

Non-spatial regression model results for ANC and water temperature:

Pisgah_RD_nolocator


Here is a potential study area with relatively high ANC data density:

StudyArea1

For the purposes of this class I am going to attempt to construct habitat suitability models characterizing the pelagic habitat of an invertebrate species, the California market squid (Doryteuthis opalescens) (Fig. 1; thanks, Wikipedia), an important prey species for multiple predators (e.g. spiny dogfish sharks and seabirds) that is also commonly captured in the survey region in high abundance.

800px-Opalescent_inshore_squid

The dataset I am working with consists of pelagic fish and invertebrate abundance data collected by NOAA over a 14-year period (1998-2011) in the Northern California Current off the Oregon and Washington coasts. Pelagic fish and invertebrates were collected at up to ~50 stations along eight transect lines off the Washington and Oregon coasts in both June and September of each year (Fig. 2). Species were collected using a 30 m (wide) x 20 m (high) x 100 m (long) Nordic 264 pelagic rope trawl (NET Systems Inc.) with a cod-end liner of 0.8 cm stretch mesh. For each sample, the trawl was towed over the upper 20 m of the water column at a speed of ~6 km h-1 for 30 min (Brodeur, Barceló et al., in press, MEPS).

bpasampling

In addition to species abundance data, survey personnel also collect in situ environmental data at each fish sampling station during each survey, including water column depth, salinity, temperature, and chlorophyll a, as well as oxygen and turbidity data when instruments were available.  One of my goals for this class is to supplement this in situ environmental dataset with remotely sensed temperature, primary productivity, and turbidity data from the MODIS-Aqua and SeaWiFS platforms in order to obtain a broader environmental context.
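A minimal sketch of how matching remotely sensed rasters to station locations might look in R with the terra package; the file names and column names are hypothetical placeholders, not the actual MODIS-Aqua or SeaWiFS products:

```r
# Raster-to-station extraction sketch with hypothetical file and column names.
library(terra)

stations <- read.csv("stations.csv")                   # lon/lat of trawl stations
pts <- vect(stations, geom = c("lon", "lat"), crs = "EPSG:4326")

sst <- rast("modis_sst_monthly.tif")    # sea surface temperature
chl <- rast("seawifs_chl_monthly.tif")  # chlorophyll a

# Extract the raster value at each station (first column returned is an ID)
stations$sst_rs <- extract(sst, pts)[, 2]
stations$chl_rs <- extract(chl, pts)[, 2]
```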

For my habitat suitability modeling approach, I will use R to fit Generalized Additive Mixed Effects Models (GAMMs) relating the environmental covariates to both presence/absence and abundance (catch per unit effort) data. Additionally, I will experiment with Maxent and other available habitat suitability modeling techniques and compare their output to my GAMM results.
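A minimal sketch of the presence/absence GAMM with the mgcv package; the data frame `squid` and its columns are hypothetical stand-ins for the survey data:

```r
# Presence/absence and CPUE GAMM sketch with hypothetical object names.
# squid: data frame with presence (0/1), cpue, sst, chl, depth, salinity,
#        plus year and station coded as factors for random effects
library(mgcv)

fit_pa <- gam(presence ~ s(sst) + s(chl) + s(depth) + s(salinity) +
                s(year, bs = "re") + s(station, bs = "re"),
              family = binomial, data = squid, method = "REML")

# Abundance (CPUE) version, e.g. with a Tweedie family for zero-heavy catches
fit_cpue <- gam(cpue ~ s(sst) + s(chl) + s(depth) + s(salinity) +
                  s(year, bs = "re"),
                family = tw(), data = squid, method = "REML")

summary(fit_pa)
```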

Some of the spatial and temporal hurdles I face with this dataset include:

Unequal spacing between sampling locations: This may pose a challenge when attempting to spatially interpolate.

Scope of inference: The habitat modeling that I’ll attempt for this species is likely applicable only in the Northern California current or in a slightly extended region.

Scale of environmental data: The fact that I will be using environmental data from two different sources (in situ data (point data – localized measurement) vs. remotely sensed data (raster satellite data – 500m-1km grain)) will affect the resolution of my interpretations of habitat for this species.

Spatial autocorrelation among stations: Abundances and/or presence/absence of market squid may be spatially correlated among nearby stations due to autocorrelation in environmental covariates that define their habitat.

Temporal autocorrelation for each station: As the data I am using come from a twice-yearly survey, it is possible that the abundance and spatial structure of market squid within our sampling area are correlated between the two seasons of sampling. It is also possible that the temporal autocorrelation of an individual station with itself through time is not too big of a problem, given the fluid medium in which sampling occurs and the highly variable inter-seasonal winds and currents in this region.

I have 218 benthic sediment grabs from the continental shelf ranging from 20 to 130 meters deep. These samples were taken from eight sites spread from Northern California to Southern Washington. Within each site, samples were randomized along depth gradients.

Each sample consists of species counts plus depth, latitude, and sediment characteristics such as grain size (i.e., sand versus silt), organic carbon, and nitrogen concentrations. Using Bayesian Belief Networks, species–habitat associations were calculated, and the established relationships were used to make regional predictive maps. The final map products depict the spatial distribution of suitable habitat, where a high probability indicates a high likelihood of finding a species given a location and its combination of environmental factors. While sampling points were not taken on a consistent grid, the suitability maps were scaled to 250-meter resolution.

As in any habitat modeling process, the “best” model was chosen by looking at model performance and the amount of error, or misclassification, between what was observed and what was predicted. Errors of commission occurred when probability scores were high for a location where the species was actually observed to be absent. Errors of omission occurred when probability scores were low for a location where the species was actually observed to be present.

I am interested in two questions. The first is whether there is a spatial pattern to the observed error, and the second is at what scale this error becomes significant. Error may be caused by variation in the environment that occurs at a finer scale than what my modeling structure captures.

To explore these two questions, I intend to conduct a spatial autocorrelation analysis on the error for each local site to determine whether there is any potential spatial pattern and, if so, whether there is an associated environmental pattern to the error (i.e., do most of the errors occur in shallower or deeper water?). I am also interested in creating high-resolution local maps of sediment characteristics (grain size, organic carbon, and nitrogen) through spatial interpolation of the sediment grab data. For these local sites, I will then recreate the predictive maps and compare them to the 250-meter predictive maps.
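A minimal sketch of the spatial autocorrelation check on model error in R with the spdep package; the data frame `grabs` (with projected coordinates and an `error` column of observed-minus-predicted values) and the 1 km distance threshold are hypothetical:

```r
# Moran's I on model error at one local site, with hypothetical object names.
# grabs: data frame with projected x/y coordinates and an error column
library(spdep)

coords <- cbind(grabs$x, grabs$y)

# Neighbors within 1 km of each sediment grab (threshold is illustrative)
nb <- dnearneigh(coords, d1 = 0, d2 = 1000)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

# Global Moran's I: is the error spatially clustered at this site?
moran.test(grabs$error, lw, zero.policy = TRUE)

# Local Moran's I can point to where any clustering occurs
local_mi <- localmoran(grabs$error, lw, zero.policy = TRUE)
```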
