Research Question
Are magnetic properties such as susceptibility related to heavy metal concentrations at the Formosa Mine Superfund Site? What sort of environmental factors contribute to the distribution of heavy metals in the affected area?
Data Set
The data consist of heavy metal concentrations, obtained with a portable X-ray fluorescence (pXRF) analyzer, and magnetic susceptibility measured at both low and high frequencies (0.4 and 4.7 kHz). Samples were taken semi-randomly from the Formosa Mine Superfund Site and nearby streams, Middle Creek and Cow Creek: sampling was limited to accessible areas but otherwise randomized along trails and near site features such as the adit diversion system and identifiable seeps and drainage routes along roadsides. Samples were divided into four fractions: bulk, >63 µm, 20–63 µm, and <20 µm. These fractions were dried in an oven at 40 °C and then prepared for pXRF and magnetic measurements.
Hypothesis
Magnetic properties will correlate with heavy metal concentrations because heavy metals tend to associate with magnetic minerals. Environmental factors related to the generation of acid rock drainage and to hydrology will account for the distribution of certain metals in the affected area.
Past research has shown strong correlations between certain magnetic parameters and heavy metal concentrations. Magnetic techniques have been used to identify and delineate polluted areas in a number of applications, from mapping atmospheric deposition of fly ash to determining sources of contamination in urban environments (Lu and Bai, 2006). Furthermore, a Pollution Loading Index (PLI) can be calculated by combining the contributions of all measured metals into a single index, and PLIs are often well correlated with magnetic parameters. At this particular site, the main concern is acid rock drainage and the subsequent transport of heavy metals to nearby streams.
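As an illustration of how such an index can be computed, here is a minimal R sketch using one common formulation (the geometric mean of per-metal contamination factors); the concentrations, background values, and metal list below are hypothetical placeholders, not site data.

```r
# Minimal PLI sketch: contamination factor = concentration / background,
# PLI = nth root of the product of the contamination factors.
# All values below are hypothetical.
metals <- data.frame(
  Fe = c(45000, 62000, 38000),  # mg/kg, hypothetical pXRF readings
  Cu = c(120, 340, 95),
  Zn = c(210, 580, 150),
  As = c(15, 48, 9)
)
background <- c(Fe = 35000, Cu = 30, Zn = 90, As = 7)  # hypothetical baseline

cf  <- sweep(as.matrix(metals), 2, background[colnames(metals)], "/")
pli <- apply(cf, 1, function(x) prod(x)^(1 / length(x)))
round(pli, 2)  # one PLI value per sample
```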

Approach
As this is the exploratory stage of work on this data set, various methods were employed to categorize and organize the data. Much time was spent combining data from a number of sources, including the susceptibility meter, the pXRF, and the GPS unit used to store waypoints. Once in a usable format, waypoints and data were transferred into ArcGIS for hot spot analysis and general mapping.
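The hot spot analysis itself was run in ArcGIS (presumably the Getis-Ord Gi* Hot Spot Analysis tool). As a rough equivalent, here is a minimal R sketch using the spdep package; the coordinates and Zn concentrations are hypothetical placeholders.

```r
# Minimal Getis-Ord Gi*-style hot spot sketch with spdep (hypothetical data).
library(spdep)

set.seed(1)
coords <- cbind(x = runif(30, 480000, 485000),    # hypothetical UTM eastings
                y = runif(30, 4750000, 4755000))  # hypothetical UTM northings
zn <- rlnorm(30, meanlog = 5, sdlog = 1)          # hypothetical Zn (mg/kg)

# Neighbours: 5 nearest samples, self included so localG returns Gi*
nb <- include.self(knn2nb(knearneigh(coords, k = 5)))
lw <- nb2listw(nb, style = "B")

gi_star <- localG(zn, lw)   # values beyond roughly +/-1.96 flag hot/cold spots
round(as.numeric(gi_star), 2)
```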
Results
Hotspot Analysis
The hot spot analysis revealed that heavy metals are indeed found in greater concentrations at the Formosa Superfund Site. At this point, there is only one point that is not ‘hot’, and as such it represents a benchmark for comparison with other samples. Further sampling is required to properly delineate the zone of influence and to assess the degree and extent of contamination of the nearby surface waters, Upper Middle Creek and South Fork Middle Creek.
Exploratory Regression
Once the magnetics data were properly normalized for mass and volume, they revealed some interesting correlations with heavy metals. In particular, Mn and V showed strong positive correlations with both low- and high-frequency susceptibility.
It is expected that additional data will smooth out some of these relationships; with so few points, it is difficult to assess variability and error in the data.
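For context, the normalization step is simple arithmetic; below is a minimal R sketch assuming a meter that reports volume susceptibility for a holder of known volume (all masses, volumes, and readings are hypothetical).

```r
# Mass-specific susceptibility chi = kappa * V / m (m^3/kg), plus the
# frequency-dependent susceptibility often used to flag fine magnetic grains.
holder_vol_m3  <- 10e-6                        # 10 cm^3 holder in m^3
sample_mass_kg <- c(0.0123, 0.0098, 0.0151)    # hypothetical dried masses
kappa_lf <- c(125e-6, 310e-6, 95e-6)           # hypothetical low-frequency readings (SI)
kappa_hf <- c(121e-6, 298e-6, 93e-6)           # hypothetical high-frequency readings (SI)

chi_lf <- kappa_lf * holder_vol_m3 / sample_mass_kg
chi_hf <- kappa_hf * holder_vol_m3 / sample_mass_kg

chi_fd_pct <- 100 * (chi_lf - chi_hf) / chi_lf  # frequency dependence (%)
round(chi_fd_pct, 1)
```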
Pearson Correlation Coefficients
Correlation coefficients were calculated for all metals and for magnetic susceptibility at both low and high frequency. The following pairs of variables were highly correlated (i.e., had p-values less than 0.05): Ca and Zr, V and Ti, Fe and Cu, Fe and Zn, Fe and As, Cu and Zn, Cu and As, Zn and As, and Zr and Ta. This suggests that various metals are associated with one another and, in the case of Mn, with high-frequency susceptibility. Of particular interest are the metals correlated with Fe: most magnetic minerals contain Fe, and hence these relationships are the most likely to be further elucidated by magnetic measurements. Whether additional data will yield stronger or weaker correlations between variables is yet to be determined.
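A minimal R sketch of this pairwise screen, assuming an illustrative data frame `geochem` with one column per metal plus the two susceptibility measurements (column names and values are placeholders):

```r
# Pairwise Pearson correlations with p-values (illustrative random data).
set.seed(2)
geochem <- data.frame(Fe = rlnorm(12), Cu = rlnorm(12), Zn = rlnorm(12),
                      As = rlnorm(12), chi_lf = rlnorm(12), chi_hf = rlnorm(12))

pairs <- combn(names(geochem), 2, simplify = FALSE)
results <- do.call(rbind, lapply(pairs, function(p) {
  ct <- cor.test(geochem[[p[1]]], geochem[[p[2]]], method = "pearson")
  data.frame(var1 = p[1], var2 = p[2],
             r = round(unname(ct$estimate), 2),
             p_value = round(ct$p.value, 3))
}))

subset(results, p_value < 0.05)  # keep only pairs significant at the 0.05 level
```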

Principal Component Analysis
PCA extracts components (factors) that describe the maximum variance in a data set; each component accounts for a share of that variability, and each variable carries a weighting (loading) on each component. In this initial analysis, the majority of the variance is accounted for by the first four components (see Table 1). The weightings of the variables for each component are listed in Table 2.
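A minimal R sketch of this step, reusing the illustrative `geochem` data frame from the correlation sketch above (the software actually used for the analysis may differ):

```r
# PCA on standardized variables; compare the outputs with Tables 1 and 2.
pca <- prcomp(geochem, center = TRUE, scale. = TRUE)

# Percent variance explained by each component (cf. Table 1)
round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

# Weightings (loadings) of each variable on the first four components (cf. Table 2)
round(pca$rotation[, 1:4], 2)
```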
Table 1: Eigenvalues and percent variance explained by each principal component in the analysis of sediments from the Formosa Mine Superfund Site

pca_components

The first component is difficult to interpret. The weightings are low and tend toward the negative, suggesting that inverse relationships explain much of the variability in the data. The second component shows evidence that some of the variance is explained by magnetic parameters, and the third component shows even stronger evidence of variance explained by high-frequency susceptibility. This has implications for understanding the mineralogy that drives the magnetic signal; further correlations with heavy metals would signify important relationships between magnetic minerals and metal contaminants.

Table 2: Weightings of the variables for the first four components of the PCA of sediments from the Formosa Mine Superfund Site
pca_eigen
Table 3 associates the various components with their spatial locations, identified by the labels in column 2. Further analysis would map these components to show the spatial aggregation of variables, giving further clues about the environmental factors that drive them.
Table 3: Association of sample points with component drivers

pca_site

Significance and Further Direction
The significance of this research lies in its potential to quickly assess the degree and extent of anthropogenic pollution. To date, magnetic techniques have been employed across a diverse range of applications and landscapes, and further expansion of these applications is desirable from many perspectives. Having quick, easy ways of identifying hot spots of contamination focuses reconnaissance on the areas that are most affected and/or vulnerable, and allows limited resources to be allocated toward cleanup and mitigation.
At this point, it is impossible to draw definitive conclusions from this analysis. More data are needed, and some methodological considerations remain regarding how the samples were prepared and analyzed. Aggregation of particles and the association of magnetic materials with organic matter are just two issues that need to be addressed before further processing and analyses are done. Cross-referencing with EPA and BLM data would be a useful endeavor as well. There are many opportunities for expansion and collaboration on this project that should be pursued.

Learning Outcomes
-ArcGIS likes easy-to-read files: be careful with file names, file paths, and column headings
-I learned Statgraphics. It is an easy tool to use with no language barriers, and it works well as a stand-in for quick analyses
-I made zero progress on Python, ModelBuilder, or R (which I have not used in so long that I feel I need to start again from the basics)

Reference
Lu SG, Bai SQ. 2006. Study on the correlation of magnetic properties and heavy metals content in urban soils of Hangzhou City, China. Journal of Applied Geophysics 60 : 1–12. DOI: 10.1016/j.jappgeo.2005.11.002

For this project I was researching the relationship between ambient air temperature and MODIS land surface temperature (LST) measurements across Oregon. Public health research around heat stress and negative health outcomes currently uses point-source weather measurements to estimate exposures. The problem is that it is difficult to interpolate measurements between monitoring stations, and these estimates are not accurate enough to realistically depend on. Using MODIS imagery to estimate exposure would allow researchers to look retrospectively and prospectively at exposure across nearly the entire planet and over a long temporal record.

For this project I used 1km resolution MODIS LST imagery for every day in 2014. I used only one swath of MODIS data, which covered approximately 85% of Oregon. This was compared with more accurate temperature measurements from NOAA monitoring stations. I used 46 stations across Oregon for the comparison.

To explore this relationship I used a Python script to extract the raster values across the entire time frame at the locations of the 46 NOAA stations. I then used two R scripts to combine the data sets, reshape them into long format, and do the analysis. I used a linear mixed effects model to account for the random effect of location and to introduce a temporal relationship into the data. I started with NLCD value, month of measurement, time of MODIS measurement, and elevation as covariates, then removed covariates that were not significant to refine the model and minimize the standardized residuals.
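A minimal sketch of the mixed-effects step using the nlme package; the data frame, column names, and simplified covariate set below are illustrative stand-ins for the real script, which also included NLCD class and MODIS overpass time.

```r
# Random intercept for each NOAA station; month enters as a fixed effect.
library(nlme)
set.seed(42)

# Hypothetical long-format data: daily MODIS LST and station air temperature
temps <- expand.grid(station = paste0("stn", 1:10), day = 1:120)
temps$month     <- ((temps$day - 1) %/% 30) + 1
temps$elevation <- rep(runif(10, 50, 1500), times = 120)
temps$modis_lst <- 10 + 8 * sin(temps$day / 20) + rnorm(nrow(temps), 0, 3)
temps$air_temp  <- 5 + 0.3 * temps$modis_lst - 0.004 * temps$elevation +
                   rnorm(nrow(temps), 0, 2)

model <- lme(air_temp ~ modis_lst + elevation + factor(month),
             random = ~ 1 | station, data = temps)

summary(model)     # fixed-effect estimates and p-values
intervals(model)   # 95% confidence intervals, as reported in the table below
```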

NLCD and time of MODIS measurement were not found to be significant, and were therefore removed from the model. The results from the final model are shown below:

Parameter Estimate 95% CI P-value
(Intercept) 6.240 (3.007 – 9.472) 0.0002
MODIS Value 0.313 (0.302 – 0.324) 0.0000
Elevation -0.004 (-0.005 – -0.003) 0.0000
NLCD (evgrForst) 2.272 (-1.288 – 5.832) 0.2038
NLCD (lowIntDev) -2.446 (-7.133 – 2.240) 0.2968
NLCD (scrubBrsh) 0.961 (-2.661 – 4.583) 0.5939
Month (Feb) -1.341 (-1.944 – -0.738) 0.0000
Month (Mar) -0.631 (-1.207 – -0.053) 0.0322
Month (Apr) 0.301 (-0.255 – 0.858) 0.2889
Month (May) 2.354 (1.795 – 2.912) 0.0000
Month (Jun) 3.261 (2.680 – 3.840) 0.0000
Month (Jul) 7.894 (7.298 – 8.488) 0.0000
Month (Aug) 6.790 (6.202 – 7.378) 0.0000
Month (Sept) 6.077 (5.501 – 6.654) 0.0000
Month (Oct) 4.085 (3.523 – 4.645) 0.0000
Month (Nov) -2.161 (-2.724 – -1.597) 0.0000
Month (Dec) -1.970 (-2.554 – -1.385) 0.0000

 

Finally, to test how effective this model is, I plotted the average residual value for each location on a map to look for any spatial pattern. No obvious pattern emerged; however, there are still some missing factors in the model.
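Summarizing the residuals by station for mapping takes only a couple of lines; a sketch continuing from the illustrative `model` and `temps` objects above:

```r
# Mean normalized residual per station, ready to join to station coordinates
# and plot (e.g. symbol size or color by residual) to check for spatial pattern.
temps$res <- resid(model, type = "normalized")
station_resid <- aggregate(res ~ station, data = temps, FUN = mean)
station_resid
```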

These results were interesting, and I think this type of model shows promise for public health research. With some refinement, I believe a good approximation of ambient air temperature could be made using LST imagery.

This project was a great opportunity to combine the skills I have gained in R, Python, and using spatial data. Scripting the entire process made standardization and reproducibility easier, and will allow me to continue working on this model in the future (after a good break, that is).

Hello, Geo 584 readers!

In my previous blog post, I mentioned that my main goal was to use statistical analysis, including environmental variable mapping and spatial interpolation methods, to quantify correlations between harbor porpoise movement patterns and distributions and biophysical variables.

However, to do all of this, I had to see exactly what kind of data I was working with. Over the previous two years I have collected spreadsheet after spreadsheet of survey effort, ship transect lines, marine mammal sighting data, oceanography data, acoustic data, and stranding data. This was my first chance to dive in and look at everything I was working with.

Step 1: Sort out all marine mammal sightings – this proved to be much more tedious than I thought.

Step 2: Plot all latitude and longitude GPS points of sightings on a base map

Note: This is where the ship was when the marine mammal was spotted, not necessarily the exact location of the animal.

Step 3: Map a typical day of transect surveys from each of the three survey sites

Step 4: Add a coarse bathymetry layer to the map – and add a 200 m depth contour line

Note: Much as the name implies, harbor porpoises tend to be a near-shore species, not typically inhabiting waters deeper than 200 m.

Step 5: Distribution analysis using ArcMap tools – the goal was to see whether harbor porpoises had an overall different distribution than all other marine mammals seen. My hypothesis was that harbor porpoises would have a more inshore distribution.

Results:

points

This is a map of my sightings based on the boat GPS; again, this is not exactly where each animal was. Estimating the animal’s actual position requires further triangulation and calculations, which I will be taking a course on in August, so I wanted to save that analysis for then.

Picture3

This second image is a map of a typical day of transect surveys in each of the three sites. I decided to include it because if you look at the sightings alone you tend to see a funky pattern, but this is mainly due to the layout of the transects.

hotspot

This third image has an added 200 m depth contour line; again, this is because harbor porpoise ecology suggests they tend to be a near-shore species. The two ovals in this figure represent the distribution of harbor porpoises (yellow) versus all other species (green). The odd shape is due to the NH survey line extending about 15 miles farther offshore than the other two. Even so, it is easy to see that harbor porpoises generally have a more inshore distribution, as I predicted.

Walking through these steps was exciting for me: this was my first chance to see visual representations of my data, as well as to learn GIS skills such as using correct projections and distribution tools.

What’s next – Plans for the next few weeks!

  1. Begin to focus only on harbor porpoises. I chose harbor porpoise for my indicator species for my thesis because they are abundant, sound sensitive, and most likely to overlap with marine renewable energy.
  2. Find fine-scale layers: bathymetry, bottom type, etc. Using a basemap is great for looking at the data visually, but it is hard to do any interpolation or statistical analysis without environmental covariates.
  3. Relate sightings to effort: take into account unequal transects and the length of each transect line versus the odds of seeing a porpoise.
  4. Organize the in-situ flow-through oceanographic data collected concurrently with the transect lines, and then use spatial interpolation to create continuous layers of sea surface temperature, salinity, and chlorophyll a.
  5. Ask whether environmental covariates (SST, distance to shore, depth, season) are determining distributions. What is driving these porpoise occurrences?

As you can see, I have plenty to work on! Thanks for reading!

Research Question

My research objective was to investigate the spatial scale and locations of surface feeding sites of humpback whales (Megaptera novaeangliae) in the Stellwagen Bank National Marine Sanctuary (SBNMS) in the southern Gulf of Maine, USA, and to investigate movement parameters of the whales.

 

Dataset

The existing data for this analysis stem from a long-term study investigating humpback whale behavior and ecology in the southern Gulf of Maine, USA (for more details, see Friedlaender et al. 2009). Almost every summer since 2004, whales were equipped with non-invasive tags that recorded detailed information on the underwater movement of the whales or collected video footage of the behavior of the tagged animal and associated whales. During daylight deployments that usually lasted up to 8 hours, focal follows were conducted from a small boat following the tagged whale, during which detailed information on the behavior of the tagged whale at the water surface was collected. Because the tags did not contain a GPS, range and bearing information on the whale was also collected at least once when the whale was observed at the surface between consecutive dives, usually resulting in one location point every 3-5 min (temporal extent: 117-260 min), with a spatial resolution of tens or hundreds of meters (less when the whale is foraging, more when it is traveling; extent: 10^3-10^4 m). Continuous GPS locations of the boat were collected automatically. Based on the time stamps, the boat GPS locations collected at or closest to the time of each range/bearing measurement were identified, and together these data were used to calculate the locations of the whale.
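As an illustration of that last step, a whale position can be projected from a boat GPS fix plus a range and bearing with the geosphere package; the values below are hypothetical (the actual whale locations were calculated by Jennifer Allen and Michael Thompson, as noted in the acknowledgements).

```r
# Project a point a given distance and bearing from the boat's GPS position.
library(geosphere)

boat    <- c(lon = -70.30, lat = 42.40)   # hypothetical boat GPS fix
bearing <- 135                            # degrees true, boat to whale
range_m <- 250                            # estimated range to the whale (m)

whale <- destPoint(p = boat, b = bearing, d = range_m)
whale  # longitude/latitude of the whale at the surface
```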

 

Hypotheses

  • Surface feeding areas (FPT radius) are larger than mean length of prey schools (139 m, Hazen et al. 2009)
  • Size of surface feeding areas (FPT radius) positively correlated with group size (larger groups search larger area)
  • Positive correlation between FPT (at FPT radius) and depth and slope (prey habitat preference)
  • No autocorrelation in FPT
  • No autocorrelation of absolute turning angles on spatial scale of surface feeding sites (unpredictable foraging movement)

 

Approaches

  • Scan focal follow data for deployments during which surface feeding was major behavior
  • For these deployments, use the R package adehabitatLT (Calenge 2011) to implement the first-passage time (FPT) metric
    • Based on the location points of an animal and the time spent at/between these points, the FPT radius identifies the spatial scale on which foraging effort along an animal’s path is concentrated (Fauchald & Tveraa 2003)
    • The time spent within this radius of each location point is called FPT. Location points with high FPT identify important foraging areas along the animal’s path (Bailey & Thompson 2006)
  • Compare FPT radius for each deployment to the group size during the deployment
  • Map color-coded FPT (at FPT radius) for each deployment onto bathymetry and slope raster layer in ArcMap
  • Extract depth and slope values for each location point
  • Connect the location points for each deployment with lines to show path of the animal
  • Buffer each location point by the value of the FPT radius for the respective deployment
  • Dissolve buffer for better visibility of the range of foraging effort around the location points and relative to underlying depth and slope rasters
  • Perform linear regression of FPT against depth/slope for each deployment and calculate r2 value
  • Calculate autocorrelation in turning angles and first-passage time using Moran’s I coefficient of autocorrelation in the pgirmess R package (a minimal R sketch of the FPT and autocorrelation steps follows this list)
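A minimal R sketch of the FPT and autocorrelation steps, using a simulated track; the deployment name, fix interval, and candidate radii are illustrative, not the values from the actual deployments.

```r
library(adehabitatLT)
library(pgirmess)

# Simulated surface locations for one deployment (projected coordinates in m)
set.seed(7)
n  <- 60
t0 <- as.POSIXct("2014-07-07 10:00", tz = "UTC")
whale_locs <- data.frame(x = cumsum(rnorm(n, 0, 80)),
                         y = cumsum(rnorm(n, 0, 80)),
                         time = t0 + seq(0, by = 240, length.out = n))

# Trajectory object and first-passage time over a set of candidate radii
traj  <- as.ltraj(xy = whale_locs[, c("x", "y")], date = whale_locs$time, id = "188a")
radii <- seq(50, 1000, by = 50)
f     <- fpt(traj, radii = radii, units = "seconds")

# The radius maximizing the variance of log(FPT) identifies the scale of
# area-restricted search (Fauchald & Tveraa 2003)
vlf     <- varlogfpt(f, graph = FALSE)
fpt_rad <- radii[which.max(as.numeric(vlf[1, ]))]

# Moran's I correlogram of FPT (at that radius) against location
fpt_at_r <- f[[1]][, which(radii == fpt_rad)]
ok <- !is.na(fpt_at_r)
correlog(coords = whale_locs[ok, c("x", "y")], z = fpt_at_r[ok], method = "Moran")
```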

 

Results

Three deployments (188a, 188b_f, 195b) were identified for which suitable location data existed and that contained a large amount of surface feeding. Because deployment 188b_f originally also contained a large amount of traveling, here, only the part of the path with surface feeding activity was included in the analysis. Table 1 summarizes the results of the analyses.

 

Table 1: FPT radius, group size and correlation coefficient r2 for the regression of FPT against depth and slope are shown for each of the three deployments.

final_results

  • There does not seem to be a fixed spatial scale of surface feeding as the values of FPT radius range between 90 and 347 m.
  • It is difficult to investigate a potential relationship between FPT radius and group size because the foraging groups were very dynamic, with changes in group size throughout the deployments. However, the smallest FPT radius was calculated for the only group that consisted of a single animal for part of its deployment, and the largest FPT radius was observed for the group that had the largest group size during part of its deployment.
  • Significant relationships between FPT and depth or slope were found for two of the three deployments. For deployment 188a, depth explained 17.2 % of the variability in FPT (p=0.001). For deployment 188b_f, depth explained 14.2 % (p<0.001) and slope 29.2 % (p<0.001) of the variability in FPT (Figures 1-3).
  • Autocorrelation in FPT was found for all three deployments at scales greater than the FPT radius for each deployment (Figure 4).
  • Autocorrelation in absolute turning angles was found for two of three deployments at spatial scales much greater than the FPT radius.

 

Figures 1-3: Whale paths, FPT and FPT radius mapped on slope chart of SBNMS (USGS/SBNMS). Whale locations for each deployment (circles) are color-coded to represent high/low FPT (red/green) at FPT radius (purple buffer) and connected with lines to show temporal sequence of locations. From top to bottom: 188a, 188b_f, 195b.

final_188a_track

final_188b_track

final_195b_track

 

Figure 4: Moran’s I autocorrelation coefficient calculating FPT autocorrelation plotted against lag distance for all deployments. Vertical lines represent FPT radius for each deployment. Red circles indicate significant autocorrelation in FPT.

final_AC fpt

 

 

Discussion and Significance

The three deployments investigated here showed a wide range of spatial scales of surface feeding. However, there is some indication that this variability could at least in part be explained by the differences in group size between the three groups, with larger groups having a greater spatial scale of surface feeding. The spatial scale of surface feeding is somewhat consistent with the mean size of prey schools. Bathymetry and bathymetric relief seem to have some influence on the locations of concentrated search effort (FPT); this is apparent from both the correlation coefficients of FPT with depth/slope and from the maps. However, the results of this analysis are obscured by the autocorrelation in FPT. Since the deployments are opportunistic observations, no absence data were analyzed, and the sample size is small, it is unclear whether non-feeding whales also concentrate their activity on similar locations within SBNMS and whether surface feeding also occurs over other parts of SBNMS with different ranges of depth/slope. The spatial scale of autocorrelation in FPT (451-596 m) is more similar between the three whales than the spatial scale of foraging effort. This could suggest the existence of a prey-searching mechanism that is similar for all three whales and works on a scale larger than the foraging mechanism captured by the FPT analysis.

A better understanding of the foraging mechanisms of the whales can help to improve management decisions aimed at mitigating ship strike and entanglement risks in the SBNMS.

 

Learning outcomes

  • Working with spatial data in R
  • Implementing FPT metric
  • Calculating spatial autocorrelation
  • Extracting raster values in ArcMap

 

References

Bailey, H. & Thompson P. (2006). “Quantitative Analysis of Bottlenose Dolphin Movement Patterns and Their Relationship with Foraging: Movement Patterns and Foraging.” Journal of Animal Ecology 75 (2): 456–65.

Calenge, C. (2011). “Analysis of Animal Movements in R: The adehabitatLT Package.” Saint Benoist, Auffargis, France: Office Nationale de La Chasse. http://cran.gis-lab.info/web/packages/adehabitatLT/vignettes/adehabitatLT.pdf.

Fauchald, P. & Tveraa T. (2003). “Using First-Passage Time in the Analysis of Area-Restricted Search and Habitat Selection.” Ecology 84 (2): 282–88.

Friedlaender, A. S., Hazen, E. L., Nowacek, D. P., Halpin, P. N., Ware, C., Weinrich, M. T., Hurst, T., Wiley, D. (2009). Diel changes in humpback whale Megaptera novaeangliae feeding behavior in response to sand lance Ammodytes spp. behavior and distribution. Marine Ecology Progress Series 395: 91–100.

Hazen E.L., Friedlaender A.S., Thompson M.A., Ware C.R., Weinrich M.T., Halpin P.N., Wiley D.N. (2009). “Fine-Scale prey aggregations and foraging ecology of humpback whales Megaptera Novaeangliae.” Marine Ecology Progress Series 395: 75–89.

 

Jennifer Allen and Michael Thompson calculated and provided the whale locations based on boat GPS and range and bearing data. Thank you!

The spatial problem I explored this quarter was about quantifying the extent of the foraging ranges of Adelie and Gentoo penguins breeding at Palmer Station, Antarctica. My original research question was whether interspecific competition could be a possible mechanism driving penguin population trends at Palmer Station. In retrospect, this question was a bit beyond the scope of the spatial analysis I proposed to conduct. However, the approach I used to test my hypothesis (that the foraging ranges of these two species would overlap) is an important first step in starting to answer this question.

The dataset I used to conduct this analysis consisted of location data (lat/long coordinates) obtained from platform terminal transmitters (PTTs). Over the course of the 2015 breeding season (5 January-2 February), 20 penguins (n=5 Adelie, n=15 Gentoo) were outfitted with PTT tags for roughly 3 days each. Over these three days, tags transmitted location data to the ARGOS satellite system. With the specific purpose of learning spatial analysis techniques in mind, all datapoints were treated as foraging locations; further analysis of PTT data combined with TDR (time depth recorder) data would need to be conducted in order to separate foraging locations from travelling locations. Location data from individual birds were grouped together by species (n=522 Adelie, n=147 Gentoo) so that each species' foraging distribution could be analyzed as a whole rather than as individual tracks.

I used a kernel density (KD) approach to answer my question of interest. I chose this approach because it is one of the most widely used techniques for hot spot analysis of tracking data, and because it appeared to be relatively easy and quick to learn. My goal was to create isopleths of utilization in order to identify areas used for foraging (95% KDE) and core use areas (50% KDE); the general idea is that the area contained within the 50% contour line is the smallest area encompassing 50% of the datapoints used to create the entire KDE. I also sought to determine the area (km²) within each of these contour lines and calculate the proportion of overlap between the two species' ranges.
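For comparison, here is a minimal sketch of the same 50%/95% isopleth workflow in R with the adehabitatHR package (the analysis reported here was done in ArcMap); the penguin locations below are simulated placeholders in projected coordinates.

```r
library(adehabitatHR)
library(sp)

# Simulated foraging locations for the two species (projected coordinates, m)
set.seed(3)
locs <- data.frame(species = rep(c("Adelie", "Gentoo"), times = c(200, 100)),
                   x = c(rnorm(200, 0, 2000), rnorm(100, 1500, 3500)),
                   y = c(rnorm(200, 0, 2000), rnorm(100, 1500, 3500)))
sp_pts <- SpatialPointsDataFrame(coords = locs[, c("x", "y")],
                                 data = locs["species"])

# Kernel utilization distribution per species (reference bandwidth)
kud <- kernelUD(sp_pts[, "species"], h = "href", grid = 200, extent = 2)

# Areas (km^2) enclosed by the 50% (core) and 95% (overall) isopleths
kernel.area(kud, percent = c(50, 95), unin = "m", unout = "km2")

# Contour polygons for mapping or overlap calculations (e.g. rgeos::gIntersection)
iso95 <- getverticeshr(kud, percent = 95)
iso50 <- getverticeshr(kud, percent = 50)
```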

My results are summarized in Table 1 (below). Gentoo penguins have a larger foraging range (core use and overall), concentrated around the colony where they were tagged as well as near the head of Palmer Deep canyon (Figure 1). Adelie penguins have a more densely concentrated (nearshore) range centered around the colonies where they were tagged. These results provide evidence to support my hypothesis that the ranges of the two species overlap. A greater percentage of the Adelie foraging area overlaps with the Gentoo area because the Adelie range is smaller.

1

Table 1. Estimates of core use (50% KDE) and total (95% KDE) foraging areas used by Adelie and Gentoo penguins with associated overlap between species.

 

 

Picture1

 

Figure 1. Map depicting kernel density contours for 95% and 50% KDEs. Dark blue and red symbolize overall and core use areas of Adelies and lighter blue and red represent Gentoo ranges.

The significance of these results is questionable due to issues I was unable to address by the end of the quarter. Kernel density estimates are influenced significantly by the smoothing factor (search radius) used, which in turn is influenced by the density of datapoints considered. Therefore, sample size, or the number of datapoints used in each kernel density estimate, has a big effect on the final KDEs. In this analysis, I used a much larger sample of Adelie locations than Gentoo locations. I have begun testing the effect of sample size on these KDEs, but have yet to come to any conclusions about the appropriate number of datapoints needed to gain an accurate estimate of foraging range.

Once I’ve addressed this issue of unbalanced sampling, I will be more confident in drawing conclusions about the foraging ranges of these two species. In the future I intend to use this information to make comparisons of these ranges between species and across years of variable prey availability. Ultimately, this knowledge will inform larger questions of the Palmer Long Term Ecological Research (LTER) project (e.g. how do changes in the marine environment affect the behavior and distribution of penguins? Are penguins competing with each other and/or other krill predators (e.g. whales) in the Palmer area? How does prey variability affect these relationships?).

Over the quarter I’ve gained more knowledge of ArcMap, specifically the spatial analyst toolbox and the kernel density tool. I’ve also begun to learn these same techniques in R and I hope to continue to expand on that in the future.  Thanks Julia and Mark!

The question I examined for this project is: how does the frequency of reported red-tailed hawk observations in the eBird database change between 2000 and 2014 in NW Oregon? For most of this term, however, I was trying to determine patterns in red-tailed hawk residency (defined as the ratio of the number of days in a calendar year a red-tailed hawk was observed at a given location to the number of days any other species was observed there). I attempted to do so using a multiple regression model. After multiple failed attempts to construct an accurate model, I determined that the eBird data I was using were not suited to such analyses, or at least not to an approach reasonable within the scope of this course. By first investigating the patterns in red-tailed hawk observation frequency, I thought I might be able to figure out why the data were producing poor results and find ways to mitigate these weaknesses.

 

The bird observations I used to examine this question are from the eBird database, an online citizen science monitoring and reporting system. For each eBird observation, an observer submits the species, the geographic location, and the date/time. These data exist at a range of spatial resolutions, from several meters to several kilometers. Because I will analyze habitat colonization at the patch scale, I wrote a Python script to select observations where lat/long is reported at or above a specified precision. The extent of these data stretches beyond the continental US. Temporally, eBird data are reported with a timestamp specified to the minute; however, I analyzed observations by year. I used data from 2000 to 2014. Data exist for observations before 2000, but they are much sparser.

 

For environmental covariate data, I used various land cover parameters, population data, and climate data. The land cover parameters were derived from the National Land Cover Database (NLCD) data from 2011, which exist at a 30m resolution. I also used US Census tract data from 2010 to estimate population density. Lastly I used PRISM climate data, including average minimum temperature and average precipitation. The PRISM data used in this project were produced at an 800m resolution and represent averages from 1981-2014.

 

I expected to find that yearly variation in red-tailed hawk frequency would be minimal. I anticipated that frequency would be highest on the edges of urban areas near large patches of open space such as agricultural land, sagebrush, urban parks, etc. However, I also expected these effects to be tempered by spatial bias in the eBird data (i.e., high densities of observations in and near urban areas and relatively sparse observations in rural areas). This would be represented, in part, by a negative relationship between canopy density and frequency and between impervious surface coverage and frequency.

 

My analysis process to address this problem was as follows:

  • Create a grid of points at 500m intervals
  • Snap red-tailed hawk observations to the closest grid point
  • Count points within every cell of a 500m raster
  • Extract the value of the count raster and append it to the snapped observation points
  • Divide the count for each record by the maximum count for that year to get a relative value of frequency (values range between 0 and 1)
  • Extract the value of each independent variable and append it to the snapped observation points.
  • Select 1,000 stratified random samples from snapped observation points with a Python script
  • Run Ordinary Least Squares on the random samples (see the R sketch after this list)
  • Use coefficients from the regression equation to weight environmental variables in a Raster Calculator expression to produce a prediction of frequency for the entire study area.
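A minimal R sketch of the OLS step referenced in the list above (the analysis reported here used the ArcGIS OLS tool); the data frame, covariate names, and values are illustrative placeholders for the 1,000 stratified random sample points.

```r
# OLS of relative observation frequency against extracted covariates
# (synthetic stand-in data).
set.seed(11)
n <- 1000
samples <- data.frame(freq       = runif(n),            # relative frequency (0-1)
                      lowDev500m = rbinom(n, 1, 0.3),
                      water500m  = rbinom(n, 1, 0.05),
                      precip     = runif(n, 800, 2500),
                      minTemp    = runif(n, 0, 8),
                      imperv500m = runif(n, 0, 60))

ols <- lm(freq ~ lowDev500m + water500m + precip + minTemp + imperv500m,
          data = samples)
summary(ols)  # the fitted coefficients feed the Raster Calculator expression

# The same coefficients can be applied to the covariate rasters (e.g. with the
# raster package) to produce the prediction surface:
#   pred <- coef(ols)[1] + coef(ols)["lowDev500m"] * lowDev_raster + ...
```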

 

The result of this process is a raster of predicted frequency values for every cell in the study area. I repeated this process for years 2000, 2005, and 2010–2014. The total number of observations was 172 for 2000 and 627 for 2005, so all observation records were used for those years. The OLS tool failed to run for the year 2000 data due to multicollinearity (probably a result of the small sample size drawn mostly from highly populated areas). For 2005, the model performed poorly and the prediction raster was almost completely opposite to all other predictions.

 

For all other years, overall model performance was good, with an R-squared value greater than 0.9 for 2010-2013 and an R-squared value of 0.38 for 2014. Akaike’s Information Criterion was below 500 for 2010-2014. However, the residuals were not normally distributed for any of the models, so the models are not necessarily reliable. Because the Raster Calculator expressions used to produce the prediction rasters are essentially the regression model equations, the following expressions indicate the strength and nature of the relationship with each variable for each year. They also show which variables were significant for each year, as any variables that were not significant in a model were omitted from the corresponding Raster Calculator expression. (NOTE: dependent variable values ranged from 0-1, so coefficient values are proportionately small.)

2010:     -0.113637 + (0.109701 * “lowDev500m”) +  (-0.058497* “shrub500m”) + (0.498112 * “water500m”) + (-0.00052 * “precip”) + (0.014086 * “minTemp”) + (0.095448 * “lcdivrsty_0_1”) + (0.001215 * “imperv500m”)

2011:   -0.158069 + (0.169261 * “lowDev500m”) + (-0.039183 * “ag500m”) + (0.000008 * “pop”) + (-0.094665 * “highdev500m”) + (-0.035761* “shrub500m”) + (0.403930 * “water500m”) + (-0.00033 * “precip”) + (0.018631 * “minTemp”) + (0.03855 * “lcdivrsty_0_1”) + (0.002261 * “imperv500m”)

2012:  -.212994 + (0.224267 * “lowDev500m”) + (-0.065403 * “ag500m”) + (0.000008 * “pop”) + (-0.172365 * “highdev500m”) + (-0.111544 * “shrub500m”) + (0.721691 * “water500m”) + (-0.00008 * “precip”) + (0.031621 * “minTemp”) + (0.117999 * “lcdivrsty_0_1”) + (0.002558 * “imperv500m”)

2013:  -0.113637 + (0.109701 * “lowDev500m”) + (-0.009148 * “ag500m”) + (-0.058497 * “shrub500m”) + (0.498117 * “water500m”) + (-0.00052 * “precip”) + (0.01406 * “minTemp”) + (0.095448 * “lcdivrsty_0_1”) + (0.001768 * “can500m”) + (0.001215 * “imperv500m”)

2014:  -0.116055 +  (0.069739 * “lowDev500m”) + (0.118068 * “ag500m”) + (0.000056 * “pop”) + (0.092787 * “shrub500m”) + (0.020673 * “water500m”) + (-0.00028 * “precip”) + (0.089624 * “minTemp”) + (0.277722 * “lcdivrsty_0_1”) + (-0.003689 * “can500m”) + (-0.007453 * “imperv500m”)

 

Variable definitions:

lowDev500m – value of 1 if low intensity urban development is the dominant land cover type within 500m, 0 if not

ag500m – value of 1 if pasture or cultivated crops are the dominant land cover type within 500m, 0 if not

pop – population density from 2010 US Census tract data (calculated by total population/area)

highdev500m – value of 1 if high intensity urban development is the dominant land cover type within 500m, 0 if not

shrub500m – value of 1 if shrub is the dominant land cover type within 500m, 0 if not

water500m – value of 1 if water is the dominant land cover type within 500m, 0 if not

precip – average annual precipitation from 1981-2014

minTemp – average daily minimum temperature from 1981-2014

lcdivrsty_0_1 – value of 1 if 3 < diversity < 9, 0 otherwise

can500m – average canopy density within 500m

imperv500m – average impervious surface cover within 500m

 

Some expected patterns were present in all other prediction rasters. For example, predicted frequency was lowest in forested areas and high-elevation areas (with low average minimum temperatures), and it was highest in low-lying open space and in areas of low-intensity urban development. At finer scales, however, the differences between years were more pronounced. For instance, local differences in urban development had a noticeably stronger influence in 2012 and 2014, and the low values in forested areas were also less uniform for these years. Below are the prediction rasters for 2010-2014.

2014

2013

2012

2011

2010

If the regression models can be trusted, these results suggest that any assessment of some other metric based on frequency values (e.g. species residency) is also likely to be highly variable from year to year. Additionally, the parameters used in my regression models are still not sufficient to properly explain the patterns in frequency of red-tailed hawks. A more complex model is likely necessary to account for complex spatiotemporal variation and eBird data biases.

 

Throughout this exploration of eBird data, I have learned a lot about the statistical tools available in ArcGIS and the limitations of some of them. I was also challenged to write a couple of Python scripts that forced me to expand my knowledge of the Pandas package for handling large data tables. While I used ModelBuilder to automate some ArcGIS processes, I don’t think I learned anything new. I did not use R at all for this project.