The question I examined for this project is: how did the frequency of reported red-tailed hawk observations in the eBird database change between 2000 and 2014 in NW Oregon? For most of this term, however, I was trying to determine patterns in red-tailed hawk residency (defined as the ratio of the number of days in a calendar year on which a red-tailed hawk was observed at a given location to the number of days on which any other species was observed there) using a multiple regression model. After multiple failed attempts to construct an accurate model, I determined that the eBird data I was using were not suited to such analyses, or at least not to an approach reasonable for the scope of this course. By first investigating the patterns in red-tailed hawk observation frequency, I thought I might be able to figure out why the data were producing poor results and find ways to mitigate these weaknesses.
The bird observations I used to examine this question are from the eBird database, an online citizen science monitoring and reporting system. For each eBird observation, an observer submits the species, the geographic location, and the date/time. These data exist at a range of spatial resolutions from several meters to several kilometers. Because I will analyze habitat colonization at the patch scale, I wrote a Python script to select observations where lat/long is reported at or above a specified precision. The extent of these data stretches beyond the continental US. Temporally, eBird data are reported with a timestamp specified to the minute; however, I analyzed observations by year. I used data from 2000 to 2014. Data exist for observations before 2000, but they are much sparser.
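The precision filter mentioned above can be sketched roughly as follows. This is not the script I actually used, only a minimal illustration: it assumes coordinates are read as text (so trailing digits are preserved) and uses the number of decimal places as the measure of precision. The function names, the sample records, and the four-decimal threshold are all hypothetical.

```python
def decimal_places(coord_text):
    """Count the digits after the decimal point in a coordinate string."""
    text = coord_text.strip()
    if "." not in text:
        return 0
    return len(text.split(".", 1)[1])


def is_precise(lat_text, lon_text, min_places=4):
    """True only if both coordinates have at least min_places decimals."""
    return (decimal_places(lat_text) >= min_places
            and decimal_places(lon_text) >= min_places)


# Hypothetical records: (species, latitude, longitude), coordinates as text.
records = [
    ("Red-tailed Hawk", "45.5231", "-122.6765"),
    ("Red-tailed Hawk", "45.5", "-122.7"),
    ("Red-tailed Hawk", "44.94291", "-123.0351"),
]

# Keep only records whose coordinates meet the assumed precision threshold.
precise = [r for r in records if is_precise(r[1], r[2])]
```

For scale, 0.0001 degrees of latitude is roughly 11 m, so a four-decimal threshold would be comfortably finer than a 500 m analysis grid.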
For environmental covariate data, I used various land cover parameters, population data, and climate data. The land cover parameters were derived from the 2011 National Land Cover Database (NLCD), which has a 30 m resolution. I also used 2010 US Census tract data to estimate population density. Lastly, I used PRISM climate data, including average minimum temperature and average precipitation. The PRISM data used in this project were produced at an 800 m resolution and represent averages from 1981-2014.
I expected to find that yearly variation in red-tailed hawk frequency would be minimal. I anticipated that frequency would be highest on the edges of urban areas near large patches of open space such as agricultural land, sagebrush, urban parks, etc. However, I also expected that these effects would be tempered by spatial bias in the eBird data (i.e., high densities of observations in and near urban areas and relatively sparse observations in rural areas). This would be represented, in part, by negative relationships between canopy density and frequency and between impervious surface coverage and frequency.
My analysis process to address this problem was as follows:
- Create a grid of points at 500m intervals
- Snap red-tailed hawk observations to the closest grid point
- Count points within every cell of a 500m raster
- Extract the value of the count raster and append it to the snapped observation points
- Divide the count for each record by the maximum count for that year to get a relative value of frequency (values range between 0 and 1)
- Extract the value of each independent variable and append it to the snapped observation points
- Select 1,000 stratified random samples from snapped observation points with a Python script
- Run Ordinary Least Squares on random samples
- Use coefficients from the regression equation to weight environmental variables in a Raster Calculator expression to produce a prediction of frequency for the entire study area.
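The snapping, counting, and normalizing steps above were done with ArcGIS tools, but the logic can be sketched in plain Python. The sketch below is only an illustration under stated assumptions: coordinates are already projected to meters, and the function names and sample points are hypothetical.

```python
import math
from collections import Counter

CELL = 500.0  # grid spacing in meters, as in the steps above


def snap(x, y, cell=CELL):
    """Snap a projected coordinate (meters) to the nearest grid point."""
    return (math.floor(x / cell + 0.5) * cell,
            math.floor(y / cell + 0.5) * cell)


def relative_frequency(points, cell=CELL):
    """Map each occupied grid point to its count divided by the max count."""
    counts = Counter(snap(x, y, cell) for x, y in points)
    max_count = max(counts.values())
    return {pt: n / max_count for pt, n in counts.items()}


# Hypothetical projected coordinates (meters) for one year of observations.
obs_one_year = [
    (1020.0, 980.0),
    (1100.0, 1049.0),
    (990.0, 1010.0),
    (2510.0, 1490.0),
]
freq = relative_frequency(obs_one_year)
```

Because each year is divided by its own maximum count, the busiest grid point in any year always has a value of 1, which means values are relative within a year rather than directly comparable between years.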
The result from this process is a raster of predicted frequency values for every cell in the study area. I repeated this process for years 2000, 2005, and 2010–2014. The total number of observations was 172 for year 2000 and 627 for year 2005; because both totals fall below the 1,000-sample target, all observation records were used for each of those years. The OLS tool failed to run for year 2000 data due to multicollinearity (probably a result of the small sample size, drawn mostly from highly populated areas). For year 2005, the model performed poorly and the prediction raster was almost completely opposite to all other predictions.
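Multicollinearity of the kind that stopped the year 2000 model can be screened for before fitting, for example with variance inflation factors (VIF). The following is a minimal NumPy sketch of that general diagnostic on made-up data, not a reproduction of what the ArcGIS tool does; the variable names and simulated values are assumptions for illustration only.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (rows are samples).

    Each column is regressed on all the others (plus an intercept);
    VIF = 1 / (1 - R^2) of that regression. Columns are assumed non-constant.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        factors.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return factors


# Simulated covariates: the second is almost a copy of the first,
# the third is independent of both.
rng = np.random.default_rng(0)
pop = rng.normal(size=200)
imperv = 0.95 * pop + rng.normal(scale=0.05, size=200)
precip = rng.normal(size=200)

values = vif(np.column_stack([pop, imperv, precip]))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign that a variable is largely redundant with the others. In this simulated example the two near-duplicate columns score far above that range, while the independent one stays close to 1.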
For the remaining years (2010-2014), overall model performance appeared good by some measures: R-squared was above 0.9 for 2010-2013, although it dropped to 0.38 for 2014, and Akaike’s Information Criterion was below 500 for every year. However, the residuals were not normally distributed for any of these models, so the results are not necessarily reliable. Because the Raster Calculator expressions used to produce the prediction rasters are essentially the regression model equations, the following expressions indicate the strength and nature of the relationship with each variable for each year. They also show which variables were significant in each year, as any variable that was not significant in a model was omitted from the corresponding Raster Calculator expression. (NOTE: dependent variable values ranged from 0 to 1, so coefficient values are proportionately small):
2010: -0.113637 + (0.109701 * "lowDev500m") + (-0.058497 * "shrub500m") + (0.498112 * "water500m") + (-0.00052 * "precip") + (0.014086 * "minTemp") + (0.095448 * "lcdivrsty_0_1") + (0.001215 * "imperv500m")
2011: -0.158069 + (0.169261 * "lowDev500m") + (-0.039183 * "ag500m") + (0.000008 * "pop") + (-0.094665 * "highdev500m") + (-0.035761 * "shrub500m") + (0.403930 * "water500m") + (-0.00033 * "precip") + (0.018631 * "minTemp") + (0.03855 * "lcdivrsty_0_1") + (0.002261 * "imperv500m")
2012: -0.212994 + (0.224267 * "lowDev500m") + (-0.065403 * "ag500m") + (0.000008 * "pop") + (-0.172365 * "highdev500m") + (-0.111544 * "shrub500m") + (0.721691 * "water500m") + (-0.00008 * "precip") + (0.031621 * "minTemp") + (0.117999 * "lcdivrsty_0_1") + (0.002558 * "imperv500m")
2013: -0.113637 + (0.109701 * "lowDev500m") + (-0.009148 * "ag500m") + (-0.058497 * "shrub500m") + (0.498117 * "water500m") + (-0.00052 * "precip") + (0.01406 * "minTemp") + (0.095448 * "lcdivrsty_0_1") + (0.001768 * "can500m") + (0.001215 * "imperv500m")
2014: -0.116055 + (0.069739 * "lowDev500m") + (0.118068 * "ag500m") + (0.000056 * "pop") + (0.092787 * "shrub500m") + (0.020673 * "water500m") + (-0.00028 * "precip") + (0.089624 * "minTemp") + (0.277722 * "lcdivrsty_0_1") + (-0.003689 * "can500m") + (-0.007453 * "imperv500m")
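To make explicit what these expressions compute, here is a small NumPy sketch that applies the 2010 coefficients cell by cell. The coefficients are copied from the 2010 expression; the `predict` function, the tiny 1 x 2 "rasters", and the covariate values in them are hypothetical stand-ins for the real layers.

```python
import numpy as np

# Intercept and coefficients copied from the 2010 expression.
INTERCEPT_2010 = -0.113637
COEF_2010 = {
    "lowDev500m": 0.109701,
    "shrub500m": -0.058497,
    "water500m": 0.498112,
    "precip": -0.00052,
    "minTemp": 0.014086,
    "lcdivrsty_0_1": 0.095448,
    "imperv500m": 0.001215,
}


def predict(layers, intercept, coefs):
    """Cell-by-cell linear combination of equally shaped covariate arrays."""
    first = next(iter(layers.values()))
    result = np.full(np.shape(first), intercept, dtype=float)
    for name, weight in coefs.items():
        result += weight * np.asarray(layers[name], dtype=float)
    return result


# Hypothetical two-cell study area: every covariate is 0 in the first cell;
# the second cell is low-intensity development with 20 units of impervious cover.
layers = {name: np.zeros((1, 2)) for name in COEF_2010}
layers["lowDev500m"][0, 1] = 1.0
layers["imperv500m"][0, 1] = 20.0

pred = predict(layers, INTERCEPT_2010, COEF_2010)
```

One consequence worth noting is that a linear expression like this is not constrained to the 0 to 1 range of the dependent variable: a cell where every covariate is zero simply receives the (negative) intercept.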
Variable definitions:
lowDev500m – value of 1 if low intensity urban development is the dominant land cover type within 500m, 0 if not
ag500m – value of 1 if pasture or cultivated crops are the dominant land cover type within 500m, 0 if not
pop – population density from 2010 US Census tract data (calculated by total population/area)
highdev500m – value of 1 if high intensity urban development is the dominant land cover type within 500m, 0 if not
shrub500m – value of 1 if shrub is the dominant land cover type within 500m, 0 if not
water500m – value of 1 if water is the dominant land cover type within 500m, 0 if not
precip – average annual precipitation from 1981-2014
minTemp – average daily minimum temperature from 1981-2014
lcdivrsty_0_1 – value of 1 if 3 < diversity < 9, 0 otherwise
can500m – average canopy density within 500m
imperv500m – average impervious surface cover within 500m
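As an illustration of how a "dominant land cover within 500m" indicator could be derived, the sketch below takes the NLCD class codes of the cells in a neighborhood and checks whether the most common class belongs to a target group. The class codes follow the standard NLCD legend, but the function, the sample neighborhood, and the decision to test pasture and cultivated crops as separate classes are assumptions rather than a description of my actual workflow.

```python
from collections import Counter

# NLCD class codes from the standard legend.
WATER = {11}          # Open Water
LOW_DEV = {22}        # Developed, Low Intensity
HIGH_DEV = {24}       # Developed, High Intensity
SHRUB = {52}          # Shrub/Scrub
AG = {81, 82}         # Pasture/Hay, Cultivated Crops


def dominant_indicator(window_codes, target_codes):
    """Return 1 if the most common class in the window is a target class.

    Ties are resolved in favor of the class encountered first, which is
    how Counter.most_common orders equal counts.
    """
    dominant, _count = Counter(window_codes).most_common(1)[0]
    return 1 if dominant in target_codes else 0


# Hypothetical class codes for the cells within 500 m of one grid point.
window = [22, 22, 22, 41, 81, 82, 82, 22, 11]

low_dev_500m = dominant_indicator(window, LOW_DEV)
ag_500m = dominant_indicator(window, AG)
```

In this made-up neighborhood, class 22 is the most common, so the low-development indicator is 1 and the agriculture indicator is 0, even though the two agricultural classes together cover a third of the cells.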
Some expected patterns were present in the prediction rasters for 2010-2014. For example, predicted frequency was lowest in forested areas and high elevation areas (with low average minimum temperatures), and it was highest in low-lying open space and in areas of low urban development. At finer scales, however, the differences between years were more pronounced. For instance, local differences in urban development had a noticeably stronger influence for 2012 and 2014. The low values in forested areas were also less uniform for these years. Below are the prediction rasters for 2010-2014.
If the regression models can be trusted, these results suggest that any assessment of some other metric based on frequency values (e.g. species residency) is also likely to be highly variable from year to year. Additionally, the parameters used in my regression models are still not sufficient to properly explain the patterns in frequency of red-tailed hawks. A more complex model is likely necessary to account for complex spatiotemporal variation and eBird data biases.
Throughout this exploration of eBird data, I have learned a lot about the statistical tools available in ArcGIS and the limitations of some of them. I was also challenged to write a couple of Python scripts that forced me to expand my knowledge of the Pandas package for handling large data tables. While I used ModelBuilder to automate some ArcGIS processes, I don’t think I learned anything new about it. I did not use R at all for this project.