Regression Analysis of Red-tailed hawk residency « GEO599/GEO584-Advanced Spatial Statistics and GIS, 2013-2016

My goal over the last few weeks has been to determine the relationship between red-tailed hawk residency and environmental variables. Before I could do so, though, I realized there were some data quality issues I needed to address. Specifically, the 0 values in my residency raster (a ratio of the number of days red-tailed hawks were observed to the number of days any other species were observed) represented locations where no hawks were observed. Since only the locations where hawks were observed are relevant to my analyses, I removed records with a ratio of 0.
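Outside of ArcGIS, the ratio and the zero-filtering step can be sketched in a few lines of Python. The records below are made up for illustration, and the function name and the handling of locations with no observation days are assumptions, not part of the original workflow.

```python
def residency_ratio(hawk_days, other_days):
    """Ratio of hawk-observation days to days any other species was observed."""
    if other_days == 0:
        return None  # assumption: ratio is undefined where nothing else was observed
    return hawk_days / other_days

# Hypothetical (location id, hawk days, other-species days) records.
records = [("a", 0, 12), ("b", 3, 10), ("c", 5, 5), ("d", 0, 4)]

ratios = {loc: residency_ratio(h, o) for loc, h, o in records}

# Keep only locations where hawks were actually observed (ratio > 0).
kept = {loc: r for loc, r in ratios.items() if r is not None and r > 0}
```

Here locations "a" and "d" drop out because their ratio is 0, which mirrors the removal described above.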

Removing these 0 values cut the data roughly in half, and I realized that they had probably had a significant influence on the hotspot analyses I ran in the first few weeks of the class. I re-ran the hotspot analysis and, as expected, the hotspots were much more finely articulated. To see what influence other portions of the data might have on the hotspot analysis, I tried iterations with ratios below 100% (a ratio of 100% meaning a hawk was seen on every day that any bird was seen), ratios above 0 and below 100%, and ratios above 0 and below 50%. The latter two produced nearly identical hotspot maps, so it seemed appropriate to use all data with ratios above 0 and below 100%.

Hotspot analyses from upper right, clockwise: all data, <100%, 0< and <100%, and 0%<.
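For readers unfamiliar with what a hotspot analysis computes, here is a minimal sketch of the Getis-Ord Gi* statistic with a fixed distance band. It assumes binary weights (1 inside the band, including the point itself, 0 outside) and uses a made-up 5 x 5 grid of values; it does not reproduce the ArcGIS tool's options or its significance corrections.

```python
import numpy as np

def getis_ord_gi_star(coords, values, band):
    """Gi* z-scores using binary fixed-distance-band weights (self included)."""
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(values, dtype=float)
    n = len(x)
    # Pairwise Euclidean distances between all points.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    w = (dist <= band).astype(float)  # distance 0 counts, so each point is its own neighbor
    mean = x.mean()
    s = np.sqrt((x ** 2).mean() - mean ** 2)
    wsum = w.sum(axis=1)
    numerator = w @ x - mean * wsum
    denominator = s * np.sqrt((n * (w ** 2).sum(axis=1) - wsum ** 2) / (n - 1))
    return numerator / denominator

# Synthetic example: high values clustered in one corner of a 5 x 5 grid.
coords = [(cx, cy) for cx in range(5) for cy in range(5)]
values = [10.0 if (cx < 2 and cy < 2) else 1.0 for cx, cy in coords]
gi = getis_ord_gi_star(coords, values, band=1.0)
```

Large positive scores mark clusters of high values (hot spots) and large negative scores mark clusters of low values. In this toy grid the corner holding the high values scores well above the others.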

Next I prepared my environmental variable data. For my regression model, I used 8 variables:

Population – value of containing census tract from US Census data
Average precipitation – value of cell from 2014 PRISM data
Minimum Temperature – ibid
Percent open space within 1 km radius – reclassified NLCD data (Herbaceous Upland, Grasslands/Herbaceous, Planted/Cultivated, Pasture/Hay, Row Crops, Small Grains, Fallow, Urban/Recreational Grasses = 1; everything else = 0), then focal statistics mean
Dominant land cover in 500m radius – focal statistics majority on NLCD
Avg percent canopy cover in 500m radius – focal statistics mean
Avg percent impervious surface in 500m radius – ibid
Land cover diversity in 500m radius – focal statistics variety
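The reclassify-then-focal-statistics steps in the list above can be sketched roughly as follows. This is an illustration under simplifying assumptions: the window is square rather than circular, edge cells use a clipped window, the tiny grid is made up, and the class codes in `OPEN_CODES` are placeholders to be replaced with the actual NLCD codes for the open-space classes.

```python
import numpy as np

# Placeholder class codes standing in for the open-space land cover classes.
OPEN_CODES = [71, 81, 82]

def focal(grid, radius, func):
    """Apply func to the square window around every cell, clipped at the edges."""
    rows, cols = grid.shape
    out = np.empty((rows, cols), dtype=float)
    for r in range(rows):
        for c in range(cols):
            window = grid[max(r - radius, 0): r + radius + 1,
                          max(c - radius, 0): c + radius + 1]
            out[r, c] = func(window)
    return out

# Made-up 3 x 3 land cover grid.
landcover = np.array([[71, 71, 41],
                      [81, 41, 41],
                      [82, 41, 21]])

# Reclassify: open-space classes become 1, everything else 0.
open_space = np.isin(landcover, OPEN_CODES).astype(float)

pct_open = focal(open_space, 1, np.mean) * 100              # focal mean, as a percentage
variety = focal(landcover, 1, lambda w: np.unique(w).size)  # focal variety
```

The focal mean of a 0/1 raster is the fraction of open-space cells in the neighborhood, which is why it can be read as percent open space. A focal majority would follow the same pattern with a function that returns the most frequent value in the window.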

I then ran the Ordinary Least Squares regression tool. My R-squared value was .214. From the report the tool produced, I concluded that the residuals were not randomly distributed.
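For reference, the core of what an OLS fit and its R-squared value compute can be sketched with NumPy. The data here are synthetic, not the hawk data, and the helper name `ols_fit` is hypothetical; the ArcGIS tool reports many more diagnostics than this.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares fit with an intercept; returns coefficients, residuals, R-squared."""
    A = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = np.sum(resid ** 2)                # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)       # total variation
    return coef, resid, 1.0 - ss_res / ss_tot

# Synthetic data: two explanatory variables plus random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200)

coef, resid, r2 = ols_fit(X, y)
```

An R-squared of .214 would mean that the explanatory variables account for about 21% of the variation in the dependent variable. The residuals returned here are the values that the follow-up spatial checks examine.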

To be sure, I also ran the Spatial Autocorrelation tool on the residuals, which indicated only a 1% chance that their spatial pattern could be random. I also ran a hotspot analysis on the residuals at two scales: 1,000 ft (the minimum distance band that wouldn’t produce an error) and 47,891 ft (the calculated distance band from my previous analyses). While the 1,000 ft distance band did not produce anything interpretable, the 47,891 ft distance band may point to flaws in model design. That is, the distinct locations where the model over-predicted and under-predicted may suggest other environmental variables I should include, or ones I should modify or drop from my model. I haven’t figured out what these are yet, though.

Hotspot analyses on residuals at 1,000 ft (left) and 47,891 ft (right).
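The spatial autocorrelation check on the residuals is based on Global Moran's I. A minimal sketch of the index, assuming binary distance-band weights and made-up grid data, looks like this. It computes only the index itself, not the z-score and p-value that the ArcGIS tool also reports.

```python
import numpy as np

def morans_i(coords, values, band):
    """Global Moran's I with binary weights: w_ij = 1 if 0 < distance <= band."""
    coords = np.asarray(coords, dtype=float)
    z = np.asarray(values, dtype=float)
    z = z - z.mean()                      # deviations from the mean
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    w = ((dist > 0) & (dist <= band)).astype(float)
    return len(z) / w.sum() * (z @ w @ z) / (z @ z)

# Synthetic 4 x 4 grid with two contrasting patterns of "residuals".
coords = [(cx, cy) for cx in range(4) for cy in range(4)]
clustered = [1.0 if cx < 2 else -1.0 for cx, cy in coords]           # like values side by side
checker = [1.0 if (cx + cy) % 2 == 0 else -1.0 for cx, cy in coords]  # alternating values

i_clustered = morans_i(coords, clustered, band=1.0)
i_checker = morans_i(coords, checker, band=1.0)
```

Values well above 0 indicate that similar residuals sit near each other (clustering), and values well below 0 indicate alternation. Clustered residuals are the pattern that suggests a model is missing a spatially structured explanatory variable.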

The Koenker (BP) Statistic indicated that my model is heteroscedastic (i.e., the model is not evenly fit for high and low dependent variable values). To try to understand why, I re-ran the OLS tool on a subset of my data where the residency ratio was < 5% and on another subset where it was > 50%. Each of these subsets contained about 3,500 records. The R-squared value was .29 for the > 50% subset and .14 for the < 5% subset. The differences in the histograms are also telling.
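As a rough illustration of what the Koenker (BP) statistic measures, the sketch below computes the studentized Breusch-Pagan statistic as n times the R-squared from regressing the squared OLS residuals on the explanatory variables. The data are synthetic and the helper names are hypothetical; this is not ArcGIS's implementation.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals from a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef

def koenker_bp(X, resid):
    """Koenker's studentized Breusch-Pagan statistic: n * R^2 of the
    auxiliary regression of squared residuals on the explanatory variables."""
    n = len(resid)
    A = np.column_stack([np.ones(n), X])
    u2 = resid ** 2
    coef, *_ = np.linalg.lstsq(A, u2, rcond=None)
    fitted = A @ coef
    r2 = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return n * r2

# Synthetic data: one case with even error variance, one where it grows with x.
rng = np.random.default_rng(1)
x = rng.uniform(1, 5, size=500)
X = x.reshape(-1, 1)
noise = rng.normal(size=500)
y_even = 2 * x + noise        # constant error variance
y_uneven = 2 * x + noise * x  # error variance increases with x

bp_even = koenker_bp(X, ols_residuals(X, y_even))
bp_uneven = koenker_bp(X, ols_residuals(X, y_uneven))
```

Under the null hypothesis of constant variance, the statistic is compared against a chi-squared distribution with as many degrees of freedom as there are explanatory variables (critical value about 3.84 at the 5% level for one variable). A large value, as in the second case, signals that the size of the residuals depends on the explanatory variables.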