hoopersa, Author at GEO599/GEO584-Advanced Spatial Statistics and GIS, 2013-2016

The question I examined for this project is how does the frequency of reported red-tailed hawk observations from the eBird database change between 2000 and 2014 in NW Oregon? For most of this term, however, I was trying to determine patterns in red-tailed hawk residency (as defined by the ratio of days in a calendar year a red-tailed hawk was observed at a given location to the number of days any other species was observed). I attempted to do so using a multiple regression model. After multiple failed attempts to construct an accurate model, I determined that the eBird data I was using were not suited for such analyses or at least not suited to an approach reasonable for the scope of this course. By first investigating the patterns in red-tailed hawk observation frequency, I thought I might be able to figure out why the data were producing poor results and ways to mitigate these weaknesses.

The bird observations I used to examine this question are from the eBird database, an online citizen science monitoring and reporting system. For each eBird observation, an observer submits the species, the geographic location, and the date/time. These data exist at a range of spatial resolutions from several meters to several kilometers. Because I will analyze habitat colonization at the patch scale, I wrote a Python script to select observations where lat/long is reported at or above a specified precision. The extent of these data stretches beyond the continental US. Temporally, eBird data are reported with a timestamp specified to the minute, however, I analyzed observations by year. I used data from 2000 to 2014. Data exist for observations before 2000, but they are much more sparse.

For environmental covariate data, I used various land cover parameters, population data, and climate data. The land cover parameters were derived from the National Land Cover Database (NLCD) data from 2011, which exist at a 30m resolution. I also used US Census tract data from 2010 to estimate population density. Lastly I used PRISM climate data, including average minimum temperature and average precipitation. The PRISM data used in this project were produced at an 800m resolution and represent averages from 1981-2014.

I expected to find that yearly variation in red-tailed hawk frequency would be minimal. I anticipated that frequency would be highest on the edges of urban areas near large patches of open space such as agricultural land, sagebrush, urban parks, etc. However, I also expected that these effects would be tempered by spatial bias in the eBird data (i.e. high densities of observations in and near urban areas and relatively sparse observations in rural areas). This would be represented, in part, by negative relationship between canopy density to frequency and impervious surface coverage to frequency.

My analysis process to address this problem was as follows:

Create a grid of points at 500m intervals
Snap red-tailed hawk observations to the closest grid point
Count points within every cell of a 500m raster
Extract the value of the count raster and append it to the snapped observation points
Divide the count for each record by the maximum count for that year to get a relative value of frequency (values range between 0 and 1)
Extract the value of each independent variable and append it to the snapped observation points.
Select 1,000 stratified random samples from snapped observation points with a Python script
Run Ordinary Least Squares on random samples
Use coefficients from the regression equation to weight environmental variables in a Raster Calculator expression to produce a prediction of frequency for the entire study area.

The result from this process is a raster of predicted frequency values for every cell in the study area. I repeated this process for years 2000, 2005, and 2010–2014. The total number of observations for year 2000 was 172 and the total for year 2005 was 627 so all observation records were used for each. The OLS tool failed to run for year 2000 data due to multicolinearity (probably a result of the small sample size mostly from highly populated areas). For year 2005, the model performed poorly and the prediction raster was almost completely opposite of all other predictions.

For all other years, overall model performance was good with an R-squared value for years 2010-2013 >.9 and an R-squared valued of .38 for 2014. Akaike’s Information Criterion was < 500 for 2010-2014. However, the residuals for all years were not normally distributed for any of the models, so they are not necessarily reliable. Because the Raster Calculator expressions used to produce the prediction rasters are essentially the regression model equations, the following expressions indicate the strength and the nature of the relationship with each variable for each year. They also show which variables were significant for each year, as any variables that were not significant in a model were omitted from the corresponding Raster Calculator expression. (NOTE: dependent variable values ranged from 0-1 so coefficient values are proportionately small):

2010: -0.113637 + (0.109701 * “lowDev500m”) + (-0.058497* “shrub500m”) + (0.498112 * “water500m”) + (-0.00052 * “precip”) + (0.014086 * “minTemp”) + (0.095448 * “lcdivrsty_0_1”) + (0.001215 * “imperv500m”)

2011: -0.158069 + (0.169261 * “lowDev500m”) + (-0.039183 * “ag500m”) + (0.000008 * “pop”) + (-0.094665 * “highdev500m”) + (-0.035761* “shrub500m”) + (0.403930 * “water500m”) + (-0.00033 * “precip”) + (0.018631 * “minTemp”) + (0.03855 * “lcdivrsty_0_1”) + (0.002261 * “imperv500m”)

2012: -.212994 + (0.224267 * “lowDev500m”) + (-0.065403 * “ag500m”) + (0.000008 * “pop”) + (-0.172365 * “highdev500m”) + (-0.111544 * “shrub500m”) + (0.721691 * “water500m”) + (-0.00008 * “precip”) + (0.031621 * “minTemp”) + (0.117999 * “lcdivrsty_0_1”) + (0.002558 * “imperv500m”)

2013: -0.113637 + (0.109701 * “lowDev500m”) + (-0.009148 * “ag500m”) + (-0.058497 * “shrub500m”) + (0.498117 * “water500m”) + (-0.00052 * “precip”) + (0.01406 * “minTemp”) + (0.095448 * “lcdivrsty_0_1”) + (0.001768 * “can500m”) + (0.001215 * “imperv500m”)

2014: -0.116055 + (0.069739 * “lowDev500m”) + (0.118068 * “ag500m”) + (0.000056 * “pop”) + (0.092787 * “shrub500m”) + (0.020673 * “water500m”) + (-0.00028 * “precip”) + (0.089624 * “minTemp”) + (0.277722 * “lcdivrsty_0_1”) + (-0.003689 * “can500m”) + (-0.007453 * “imperv500m”)

Variable definitions:

lowDev500m – value of 1 if low intensity urban development is the dominant land cover type within 500m, 0 if not

ag500m – value of 1 if pasture or cultivated crops are the dominant land cover type within 500m, 0 if not

pop – population density from 2010 US Census tract data (calculated by total population/area)

highdev500m – value of 1 if high intensity urban development is the dominant land cover type within 500m, 0 if not

shrub500m – value of 1 if shrub is the dominant land cover type within 500m, 0 if not

water500m – value of 1 if water is the dominant land cover type within 500m, 0 if not

precip – average annual precipitation from 1981-2014

minTemp – average daily minimum temperature from 1981-2014

lcdivrsty_0_1 – value of 1 if 3 < diversity < 9, 0 otherwise

can500m – average canopy density within 500m

imperv500m – average impervious surface cover within 500m

Some expected patterns were present in all other prediction rasters. For example, predicted frequency was lowest in forested areas and high elevation areas (with low average minimum temperatures), and it was highest in low-lying open space and in areas of low urban development. At finer scales, however, the differences between years were more significant. For instance, local differences in urban development had a noticeably stronger influence for 2012 and 2014. The low values in forested areas were also less uniform for these years. Below are the prediction rasters for 2010-2014.

If the regression models can be trusted, these results suggest that any assessment of some other metric based on frequency values (e.g. species residency) is also likely to be highly variable from year to year. Additionally, the parameters used in my regression models are still not sufficient to properly explain the patterns in frequency of red-tailed hawks. A more complex model is likely necessary to account for complex spatiotemporal variation and eBird data biases.

Throughout this exploration of eBird data, I have learned a lot about the statistical tools available in ArcGIS and the limitations of some of them. I was also challenged to write a couple of Python scripts that forced me to expand my knowledge of the Pandas package for handling large data tables. While I used ModelBuilder to automate some ArcGIS processes, I don’t think I learned anything new. I did not use R at all for this project.

My goal over the last few weeks has been to determine the relationship between red-tailed hawk residency and environmental variables. I realized though that before I could do so, there were some data quality issues that I needed to address. Specifically, I realized that the 0 values in my residency raster (a ratio of the number of days red-tailed hawks were observed to the number of days any other species were observed) represented locations where there weren’t any hawks. Since only the locations where hawks were observed are relevant to my analyses, I removed records with a ratio of 0.

In removing these 0 values from my data (which cut the data roughly in half), I realized that they probably had a significant influence on the hotspot analyses I ran in the first few weeks of the class. I re-ran the hotspot analysis, and as expected, the hotspots were much more finely articulated. I decided to see what influence other portions of the data might have on the hotspot analysis so I tried iterations with <100%(meaning a hawk was seen on every day that any bird was seen), 0< and <100%, and 0< and <50%. The latter two produce nearly identical hotspot maps so it seemed appropriate to use all data 0< and <100%.

Hotspot analyses from upper right, clockwise: all data, <100%, 0< and <100%, and 0%<.

Next I prepared my environmental variable data. For my regression model, I used 8 variables:

Population – value of containing census tract from US Census data
Average precipitation – value of cell from 2014 PRISM data
Minimum Temperature – ibid
Percent open space with 1km radius – reclassified NLCD data (Herbaceous Upland, Grasslands/Herbaceous, Planted/Cultivated, Pasture/Hay, Row Crops. Small Grains, Fallow, Urban/Recreational Grasses = 1, everything else 0) > Focal statistics mean
Dominant land cover in 500m radius– focal statistics majority on NLCD
Avg percent canopy cover in 500m radius – focal statistics mean
Avg percent impervious surface in 500m radius – ibid
Land cover diversity in 500m radius – focal statistics variety

I then ran the Ordinary Least Squares regression tool. My R-squared value was .214. From the report the tool produced, I concluded that the residuals were not randomly distributed.

To be sure, I also ran the Spatial Autocorrelation tool on the residuals, and there is a 1% chance that the distribution could be random. I also ran a hotspot analysis on the residuals at two scales, 1,000 ft (the minimum distance band that wouldn’t produce an error) and 47,891 ft (the calculated distance band from my previous analyses). While the 1,000 ft distance band did not produce anything interpretable, the 47,891 ft distance band may point to flaws in model design. That is, the distinct locations where the model over-predicted and under-predicted may suggest other environmental variables I should include or ones I should modify/drop from my model. I haven’t figured out what these are yet though.

Hotspot analyses on residuals at 1,000 ft (left) and 47,891 ft (right).

The Koenker (BP) Statistic indicated that my model is heterscedastic (i.e. the model is not evenly fit for high and low dependent variable values). To try to understand why, I re-ran the OLS tool on a subset of my data where the residency ratio < 5% and another subset where residency > %50. Both of these subsets totaled about 3500 records each. The R-square value for > 50% was .29 and .14 for < 5%. The differences in the histograms are also telling.

Screen Shot 2015-05-12 at 9.15.20 PM — < 5%, population, avg precip, min temperature, percent open space/km, dominant land cover

> 50%, population, avg precip, min temperature, percent open space/km, dominant land cover

Screen Shot 2015-05-12 at 9.16.10 PM — < 5%, avg percent canopy per 500m, avg percent impervious surface per 500m, land cover diversity

Screen Shot 2015-05-12 at 9.48.43 PM — > 50%, avg percent canopy per 500m, avg percent impervious surface per 500m, land cover diversity

From these plots, I will try to develop other environmental variables that may be better predictors of residency.

As described in my previous blog post, my original intent was to investigate successional patterns of urban re-colonization by diurnal raptors using eBird data. I soon realized that Exercises 2 and 3 would not be possible with point data that I have because the data alone do not have any quantitative attributes that vary in any meaningful way in relation to my research question. After discussing other potential research questions with Julia, we decided that analyzing differing patterns of year-round residency across an urban/rural gradient would be a reasonable alternative that still addresses some of the same overarching themes.

My general approach to conducting this analysis was to calculate the ratio of the number of days a particular species was observed within a given year to the number of days observations of any species were made. This ratio is calculated per cell in a raster covering the extent of the study area. If the ratio is closer to 1 in certain places, then that species of bird is likely staying in that area for more of the year. A value of 0 would indicate that at least one other species was observed, but the species of interest was not observed at all at that location.

I initially thought I would get more meaningful results with a larger sample size so for this initial analysis, I chose to analyze the residency of red-tailed hawks (n = 6,607), one of the most commonly reported species of raptor. The year-round range of red-tailed hawks, however, overlaps with my study site in NW Oregon which might have influenced my results. I therefore repeated the analysis with observations of merlins (n = 589), a species without a year-round range overlapping the study site, to see if patterns in the data were different. Of course, merlins and red-tailed hawks may respond to urban and rural environmental characteristics differently. For both species, I used observations from 2014 only.

An overview of the workflow for this analysis is shown below , but I will also describe each step in detail.

Step 1 Dissolve all observations dataset and species observation dataset by date and observer ID fields.

Step 2 Point Density on both sets of dissolved observations with a cell size of 200 m and a neighborhood of 1 cell

Step 3 Raster Caculator on both points density rasters to multiply each by 200. This produces a raster where each cell is a count of the number of points within that cell. (Not shown in workflow schematic due to space limitations.)

Screen Shot 2015-04-14 at 10.22.32 PM — Observations of all species per 200 m

Screen Shot 2015-04-14 at 10.23.16 PM — Red-tailed hawk observations per 200 m

Step 4 Raster Calculator to compute raster where each cell is a ratio of species observation count to total observation count

Step 5 Extract Values to Points with all observation dataset as input points and ratio raster as values to sample.

Step 6 Getis-Ord Gi* Hotspot analysis on sampled points with Fixed Distance Band as the Conceptualization of Spatial Relationships parameter.

Screen Shot 2015-04-14 at 11.21.53 PM — Hot and cold spots of red-tailed hawk residency. Black polygons are 2010 Census delineated urbanized areas.

While the results seem to reflect a pattern I would expect, I’m not sure that I trust them entirely. This is for several reasons:

The merlin hotspot map is very similar despite the fact that the observations were much more sparse than red-tailed hawk observations. The merlin map also shows hotspots in locations where there aren’t any merlin observations too, and the sampled value of many of the points is 0 (meaning other birds were observed but no merlins were).

Screen Shot 2015-04-14 at 11.37.19 PM — Merlin hotspots

I conducted the same sampling procedures using Extract Values to Points but with a grid of points even spaced 1km apart. The hotspot map is very different and seems to be mostly noise. There is only faint evidence of a discernible pattern.

Screen Shot 2015-04-14 at 11.42.56 PM — Hot spot analysis on a 1 km grid of sampled red-tailed hawk residency ratio

There are some data quality issues that I did not address here. For instance, some observers may have only been looking for certain species and might not have reported the species of interest even if it were present. The converse could be true as well. This is usually reported with those kinds of observations but I didn’t filter these out.
Also, the data are very clustered to begin with and I may not have selected the right Conceptualization of Spatial Relationships for the data.

Is there a pattern evident in the successional order of habitat types recolonized by certain diurnal raptor species within the last twenty years along urban/rural gradients in Washington, Oregon, and California? This is the question I hope explore throughout this course. I have yet to finalize which raptor species I will include in this analysis, but I will likely use observations of red-tailed hawks, merlins, American kestrels, and northern harriers. This analysis will be automated with a Python script so the number of species I choose to include is not critical at this stage of the research.

To answer this question, I will use citizen science bird observations and land cover data. The bird observations will come from the eBird database, an online citizen science monitoring and reporting system. For each eBird observation, an observer submits the species, the geographic location, and the date/time. These data exist at a range of spatial scales from several meters to several kilometers. Because I will analyze habitat colonization at the patch scale, I have already written Python code to select observations where lat/long is reported at or above a specified precision. The extent of these data stretches beyond the continental US. Temporally, eBird data are reported with a timestamp specified to the minute, however, my analysis will only require resolution to the month. I will use data from 1995 to 2014. Data exist for observations before 1995, but they are much more sparse.

The land cover data I will use is derived from LandTrendr, a series of algorithms for extracting land cover change information from Landsat time series imagery. For this project, I will use only land cover data from Washington, Oregon, and California as these data already exist. Land cover data for other parts of the country would need to be processed, a time-consuming endeavor that is unnecessary for the purposes of this class. Land cover data for Washington, Oregon, and California exist for all years from 1995 to 2014.

From this research, I expect to see two general patterns: 1) species prevalence for each selected species will increase inside highly developed patches from the beginning of the study period to the end, and 2) a successional order of colonized habitat types will be present for each species. A number of longitudinal studies have documented urban re-colonization by various diurnal raptor species. These re-colonization events may follow successional patterns. This process can be understood in the following way: when an urban area is first developed, existing raptor species likely retreat to remaining high quality habitat. As this high quality habitat reaches carrying capacity, pioneering individuals must find marginal habitat that still suits their ecological needs. This same process of demographic saturation is repeated, likely resulting in a successional pattern, as more marginal habitat is re-colonized.

From my limited knowledge of spatial statistics, it seems that a geographically weighted regression or ordinary least squares would be appropriate to tease out such patterns. eBird data are also notorious for spatial biases where observations are more numerous near easily accessible places (e.g. roads, trails, etc.) Perhaps a spatial autocorrelation analysis may work to assess the accuracy of successional pattern analysis results. To perform these analyses, I am proficient with ArcInfo, ModelBuilder, and Python. However, I don’t have any experience with R.

Within the last several decades, urbanization has rapidly altered available habitat for diurnal raptor species, leading to changes in geographic distribution. Greater understanding of urban raptor habitat selection may lead to more effective conservation. To date, however, data on urbanization patterns and species response are not available at the spatial and temporal scales needed to build this understanding. Answering the above research question will hopefully shed light on the effects of development on raptor distributions and aid city planners and conservationists in building more ecologically habitable cities.

GEO599/GEO584-Advanced Spatial Statistics and GIS, 2013-2016

Just another blogs.oregonstate.edu site

Author Archives: hoopersa

Final Project: Regression Analysis of Red-Tailed Hawk Frequency in Northwest Oregon

Regression Analysis of Red-tailed hawk residency

Red-tailed hawk residency in Northwest Oregon

Determining Patterns of Urban Colonization by Diurnal Raptors

Contact Info