Exercise 3: Poisson regression analysis of Texas rural index scores and colorectal cancer mortality counts

Research Question

What is the association between Texas county rurality index score and colorectal cancer (CRC) mortality?

In Exercise 1, I created a rurality index for Texas counties using rural indicator variables, while in Exercise 2, I selected Texas counties with the highest CRC death counts and created multi-ring buffers around them to visualize and measure how rural indicator variables shift as ones moves away from the counties. In this exercise, I am doing a more direct analysis of CRC mortality and my rurality index by utilizing a Poisson regression model. I expect the results to show that as index scores become more “rural,” county CRC death counts will increase.

Tools and Data Sources Used

The analysis, diagnostic, and graphing procedures for this exercise were all completed using the following R functions/packages: glm(), vcd, AER, and car. More specifically on the R tools, a generalized linear model (GLM) of the family Poisson was utilized for modeling, a rootogram from the vcd package was used for Poisson goodness-of-fit, a test for model over-dispersion was utilized from the AER package, and an influence plot was created using the car package. Other goodness-of-fit diagnostics and statistical measures were ascertained via the base glm() procedure.The rural indicators utilized for the index in this analysis are from the same sources I used in Exercise 1: Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database). Like in my Exercise 2, aggregated CRC mortality counts per 100,000 population for Texas counties were obtained for the years 2011-2015 from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program U.S. Population Data – 1969-2015.

Methods

Attribute Table Conversion to R: First, the attribute table of CRC death counts and rural index scores by county (created as part of Exercise 2) was exported to an Excel file using the Table to Excel tool in Arc. The Excel file was then loaded into R using the “readxl” package.

Univariate Poisson Diagnostics: Before introducing a Poisson regression, I wanted confirm the outcome variable follows a Poisson distribution. To do this, I used the rootogram() function from the vcd package in R to visualize the distribution of the CRC mortality count data. The figure below show that the data seems to roughly follow a Poisson distribution. Based on this, I proceeded to fitting the model.

Model Fitting: I fit a GLM of the Poisson family using the following formula in R: glm(formula = CRCDeathrate ~ PCAWeighted_Index, family = “poisson”). Initially, I attempted to construct a linear OLS regression with this data, but after many attempted transformations, I realized the outcome count data was likely following a Poisson distribution. In subsequent steps, I walk through the post-fitting Poisson diagnostics about the appropriateness of this fit.

Poisson Regression Diagnostics: In Poisson regressions, one of the main concerns is over-dispersion of the data, as Poisson distributions have only one free parameter and do not allow for variance to be adjusted individually from the mean. Therefore, if over-dispersion exists, the data has more variation than the Poisson distribution allows, biasing the variance of the model. To ensure the Poisson regression is not over-dispersed, I utilized the dispersiontest() function from the AER package. Equi-dispersion from this function is displayed when alpha=0. The alpha value of -0.20 and p=0.99 displayed below show that model is not over-dispersed.

Further proof that Poisson is the correct model for the data can be shown through an influence plot from the car package, where only a few data points (101,57) are highly influential (shown below).

I also used a chi-squared test to confirm that a Poisson model is a good fit. The resulting p-value (0.95) indicates that a Poisson model is a very good fit for the data (shown below).

Exporting to Arc for Mapping: After creation the model, I computed the standardized residuals of the model in R for each county using the studres() function. I then exported this data to Arc and added it to the attribute table I mentioned previously in this exercise. I then used symbology to represent the residuals on the Texas map and created a layout

Results

Because Poisson regression of counts utilized a log link, the resulting coefficient(s) need to be exponentiated. The exponentiated coefficient for this regression display that as county index scores increase (become more urban) by 1 unit, the intercept (mean of CRC counts) is multiplied by 0.99. In other words, for every unit increase in rurality, county CRC deaths counts increase by 1% (p<0.001). The narrow confidence intervals for this estimate indicate that the coefficients are accurate.

The map of standardized residuals can be seen above and indicates that there is significant regionality to the association between index scores and CRC death counts, where central Texas is much more green (negative standardized residuals) and counties near the edges are orange and red (positive standardized residuals). Also, as expected, the high CRC death count Texas counties from Exercise 2 (Anderson, Howard, Gonzales, and Newton) have standardized residuals significantly higher than expected in the Poisson model.

Critique

This process of regression construction was very helpful for determining how to best model my data. I initially attempted to use a linear regression and after laboring to find a transformation for my outcome data, I finally decided on Poisson regression (count data duh!). Based on the model diagnostics I performed, it seems that the data works perfectly for Poisson regression. In my opinion, the most difficult part of this exercise was interpreting the coefficients of the regression. Because the coefficients are on the log scale, they have to be exponentiated, which greatly confused my interpretation at first. Because there is an extensive amount of county CRC death data missing from this analysis (confidentiality), these results may be less indicative of the true rurality-CRC mortality associations in Texas. Further, this analysis suffers from the modifiable areal unit problem, where counties may not an appropriate boundary for comparison. In future analyses, I would like to use more complete and spatially defined data from the state cancer registry to shed more light on the relationship.

Print Friendly, PDF & Email