
Final Project: Examining the relationship between rurality and colorectal cancer mortality in Texas counties

1. The research question that you asked.

My initial research question was about the creation and comparison of rural indices for Texas. I did end up making a rural index for Texas, but instead of creating multiple indices and comparing the importance of different rural indicators, I incorporated an outcome variable, colorectal cancer (CRC) mortality, for Exercises 2 and 3. So my final research question ended up being as follows: how does the spatial pattern of CRC mortality in Texas relate to proxy measures of rurality and a constructed rural index for counties in Texas?

 

2. A description of the dataset you examined, with spatial and temporal resolution and extent.

For my project, I utilized Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database). Aggregated CRC mortality rates for Texas counties were obtained for the years 2011-2015 from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program U.S. Population Data – 1969-2015.

 

3. Hypotheses: predictions of patterns and processes you looked for.

Based on the research question I revised before Exercise 2, I expected 2 major spatial patterns in my analyses: 1) areas immediately surrounding high CRC mortality counties would have more “rural” values in 3 rural indicator variables (household income, population density, and percent land developed), with more “urban” values as distance from those counties increases, and 2) CRC mortality counts would increase significantly as county rural index values (built from the 3 indicator variables) became more rural. I chose rural indicator variables based both on true measures of where population centers in Texas are located (population density and land development) and a well-established marker of rural status (income). For spatial pattern 1, from my neighborhood analysis, I expected uniform shifts in buffer means towards more “urban” values in each of the 3 indicator variables as distance from the high CRC counties increased. I expected this effect because counties with high CRC mortality are commonly rural, so areas in surrounding counties with lower CRC mortality may show increasingly urban indicator values. For spatial pattern 2, I expected CRC mortality to increase as county rural index values became more rural. I expected this pattern because CRC mortality has been linked to rural status and various indicator variables in previous research, though to my knowledge a weighted rural index has never been used for Texas cancer data.

 

4. Approaches: analysis approaches you used.

For my analyses I used the following approaches:

Exercise 1: I utilized principal component analysis (PCA) to construct a rural index for Texas counties using 3 indicator variables.

Exercise 2: I did a neighborhood analysis of high CRC mortality counties by creating multi-ring buffers around centroids of 4 Texas counties, intersected the buffers with county polygons containing rural indicator variable values, and calculated buffer means for indicator variables for each donut around each county.

Exercise 3: I completed a Poisson regression of CRC mortality counts (Y) and my constructed rural index (X) to examine the effect of rural status on CRC mortality for Texas counties.
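As a rough illustration of the Exercise 1 idea, the sketch below builds an index from the first principal component of 3 standardized indicators. It is written in plain Python rather than the R I actually used, the county values are made up, and using only the first component is an assumption for illustration; it is not the exact weighting scheme from Exercise 1.

```python
import math

def standardize(column):
    """Scale a list of numbers to mean 0 and sample standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def first_component(rows, iterations=500):
    """Approximate the first principal component loadings by power iteration."""
    n, p = len(rows), len(rows[0])
    # Covariance of standardized columns, i.e. the correlation matrix.
    cov = [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
           for i in range(p)]
    vec = [1.0] * p
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec

# Made-up values for five hypothetical counties (NOT the Texas data).
income = [38000, 42000, 51000, 60000, 75000]
density = [5, 12, 40, 150, 900]
developed = [1.0, 2.5, 6.0, 15.0, 45.0]

cols = [standardize(c) for c in (income, density, developed)]
rows = list(zip(*cols))
loadings = first_component(rows)
# Higher score = more "urban" in this toy setup (higher income, density, development).
index = [sum(w * v for w, v in zip(loadings, r)) for r in rows]
print([round(s, 2) for s in index])
```

Standardizing first keeps a variable with large raw units, such as income in dollars, from dominating the component.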

 

5. Results: what did you produce — maps? statistical relationships? other?

For all 3 exercises, my results included maps (Texas county maps, buffer maps, standardized residual maps), statistical relationships (p-values, coefficients, etc.), and plotting of statistical relationships (biplots, influence plots, line plots). I produced the maps in Arc Pro, while all other visualizations were produced in R.

 

6. What did you learn from your results? How are these results important to science? to resource managers? + software/technical limitations

I believe exercise 1 displays the effectiveness of PCA for construction of rural indices. More deductive methods for rural classification are very much needed in rural health, and I believe this method could improve detection and prevention of rural health disparities.

In exercise 2, the 4 high CRC mortality counties did not all follow the rural indicator spatial “gradient” that I expected. Two of the counties exhibited increasingly urban scores as distance from the counties increased, while the other two showed the opposite pattern. I think this result could be due to the arbitrary distances I chose for the buffers around the counties and the modifiable areal unit problem of utilizing county indicator variable data instead of more spatially defined values. Also, significant regionality in Texas could exist for the indicator variables I chose, as the 4 counties were not located in the same regions of Texas. This could have affected the relationships I found in each of the counties. For example, certain rural regions in Texas may have lower or higher household income than other rural regions of Texas due to factors such as available jobs or regional/local governmental policies. Also, there are likely other indicators of rurality that I could have included in the analysis that may have more consistent spatial patterns.

In exercise 3, the results from the Poisson regression followed the statistical pattern I expected: as rural index scores become more rural, county CRC mortality increases. I believe my results show important introductory associations between CRC mortality and rurality in Texas that indicate further and deeper study into these associations should be considered. To show the exercise 3 results, I utilized a map of standardized residuals in Arc and statistical plots in R.

The technical limitations of my analyses were mainly due to extensive missing cancer data for the state of Texas. This missingness was due to data suppression by the CDC’s National Vital Statistics System in order to maintain confidentiality of cancer cases. I did not have any obvious software issues besides difficulty with the Calculate Field tool in Arc Pro, where the tool would consistently fail when using more complicated formulas for data transformations. I also had some problems when joining Excel tables to attribute tables, where Arc would not show symbology for the newly joined data until I exported to a geodatabase and restarted the program.

 

7. Your learning: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other?

I believe I greatly improved my skills in ArcGIS this term, especially in data wrangling and cleaning to convert my data to formats that maximize the visualization. Also, I feel I became much better at using key plotting packages in R, such as ggplot2.

 

8. What did you learn about statistics, including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA), and (e) or other techniques?

I believe the major statistical and spatial skills I improved in this course include PCA, neighborhood analysis, Poisson regression, and (especially) regression diagnostics. I had not used PCA in R before this term and feel very comfortable using it going forward in my research on rural health. The many assumptions I had to follow for the Poisson regression in exercise 3 improved my ability to run diagnostics in R and create intuitive assumption plots.

 

Exercise 3: Poisson regression analysis of Texas rural index scores and colorectal cancer mortality counts

Research Question

What is the association between Texas county rurality index score and colorectal cancer (CRC) mortality?

In Exercise 1, I created a rurality index for Texas counties using rural indicator variables, while in Exercise 2, I selected Texas counties with the highest CRC death counts and created multi-ring buffers around them to visualize and measure how rural indicator variables shift as one moves away from the counties. In this exercise, I am doing a more direct analysis of CRC mortality and my rurality index by utilizing a Poisson regression model. I expect the results to show that as index scores become more “rural,” county CRC death counts will increase.

Tools and Data Sources Used

The analysis, diagnostic, and graphing procedures for this exercise were all completed using the following R functions/packages: glm(), vcd, AER, and car. More specifically on the R tools, a generalized linear model (GLM) of the family Poisson was utilized for modeling, a rootogram from the vcd package was used for Poisson goodness-of-fit, a test for model over-dispersion was utilized from the AER package, and an influence plot was created using the car package. Other goodness-of-fit diagnostics and statistical measures were ascertained via the base glm() procedure.

The rural indicators utilized for the index in this analysis are from the same sources I used in Exercise 1: Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database). Like in my Exercise 2, aggregated CRC mortality counts per 100,000 population for Texas counties were obtained for the years 2011-2015 from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program U.S. Population Data – 1969-2015.

Methods

Attribute Table Conversion to R: First, the attribute table of CRC death counts and rural index scores by county (created as part of Exercise 2) was exported to an Excel file using the Table to Excel tool in Arc. The Excel file was then loaded into R using the “readxl” package.

Univariate Poisson Diagnostics: Before introducing a Poisson regression, I wanted to confirm that the outcome variable follows a Poisson distribution. To do this, I used the rootogram() function from the vcd package in R to visualize the distribution of the CRC mortality count data. The figure below shows that the data seem to roughly follow a Poisson distribution. Based on this, I proceeded to fitting the model.
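The comparison a rootogram draws can be sketched in a few lines. This is plain Python with made-up counts, not the vcd call I ran: for each count value it pairs the square root of the observed frequency with the square root of the frequency a Poisson distribution (with mean equal to the sample mean) would predict.

```python
import math
from collections import Counter

def poisson_pmf(k, lam):
    """Poisson probability of observing exactly k events when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def rootogram_table(counts):
    """Return (k, sqrt(observed freq), sqrt(expected freq)) for k = 0..max(counts)."""
    n = len(counts)
    lam = sum(counts) / n  # maximum-likelihood estimate of the Poisson mean
    observed = Counter(counts)
    table = []
    for k in range(max(counts) + 1):
        expected = n * poisson_pmf(k, lam)
        table.append((k, math.sqrt(observed.get(k, 0)), math.sqrt(expected)))
    return table

# Made-up toy counts for illustration only (NOT the CRC data).
toy_counts = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]
for k, obs_root, exp_root in rootogram_table(toy_counts):
    print(k, round(obs_root, 2), round(exp_root, 2))
```

Taking square roots makes discrepancies in the sparse tails easier to see than they would be on the raw frequency scale.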

Model Fitting: I fit a GLM of the Poisson family using the following formula in R: glm(formula = CRCDeathrate ~ PCAWeighted_Index, family = "poisson"). Initially, I attempted to construct a linear OLS regression with this data, but after many attempted transformations, I realized the outcome count data was likely following a Poisson distribution. In subsequent steps, I walk through the post-fitting Poisson diagnostics that assess the appropriateness of this fit.
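For readers curious about what a Poisson GLM with a log link estimates, here is a minimal Python sketch that fits log(E[y]) = b0 + b1*x by Newton-Raphson for a single predictor. The function name and toy data are hypothetical, and this is not how glm() is implemented internally; it only illustrates the model being fit.

```python
import math

def fit_poisson(x, y, iterations=50):
    """Fit log(E[y]) = b0 + b1*x by Newton-Raphson; returns (b0, b1)."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the Poisson log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix entries.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + inverse(information) * score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up toy data: x stands in for an index score, y for a death count.
x = [-2, -1, 0, 1, 2]
y = [9, 7, 5, 4, 3]
b0, b1 = fit_poisson(x, y)
rate_ratio = math.exp(b1)  # multiplicative change in expected count per unit of x
print(round(b1, 3), round(rate_ratio, 3))
```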

Poisson Regression Diagnostics: In Poisson regressions, one of the main concerns is over-dispersion of the data, as the Poisson distribution has only one free parameter and does not allow the variance to be adjusted independently of the mean. Therefore, if over-dispersion exists, the data have more variation than the Poisson distribution allows, and the model's standard errors will be underestimated. To check that the Poisson regression is not over-dispersed, I utilized the dispersiontest() function from the AER package. Equi-dispersion in this test corresponds to alpha=0. The alpha value of -0.20 and p=0.99 displayed below show no evidence that the model is over-dispersed.
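A simpler way to get a feel for dispersion is the Pearson dispersion ratio, sketched below in Python. Note that this is a different check from the one dispersiontest() performs, and the observed and fitted values here are made up; it is only meant to illustrate the idea that the variance should roughly match the mean.

```python
def pearson_dispersion(y, mu, n_params):
    """Sum of squared Pearson residuals divided by the residual degrees of freedom."""
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

# Made-up observed counts and hypothetical fitted means from a 2-parameter model.
observed = [2, 5, 4, 8, 11]
fitted = [3.0, 4.0, 5.0, 7.0, 11.0]
ratio = pearson_dispersion(observed, fitted, n_params=2)
print(round(ratio, 3))
```

Ratios near 1 are consistent with equi-dispersion, while ratios well above 1 suggest over-dispersion.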

Further support for the Poisson model comes from an influence plot from the car package, where only a few data points (observations 101 and 57) are highly influential (shown below).

I also used a chi-squared goodness-of-fit test to check the Poisson model. The resulting p-value (0.95) gives no evidence of lack of fit, indicating that a Poisson model fits the data well (shown below).

Exporting to Arc for Mapping: After creating the model, I computed the standardized residuals of the model in R for each county using the studres() function. I then exported this data to Arc and added it to the attribute table I mentioned previously in this exercise. I then used symbology to represent the residuals on the Texas map and created a layout.

Results

Because Poisson regression of counts utilizes a log link, the resulting coefficient(s) need to be exponentiated. The exponentiated coefficient for this regression shows that as county index scores increase (become more urban) by 1 unit, the expected CRC death count is multiplied by 0.99. In other words, for every unit increase in rurality, county CRC death counts increase by about 1% (p<0.001). The narrow confidence intervals for this estimate indicate that the coefficient is precisely estimated.
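The arithmetic behind this interpretation can be checked directly. The short Python sketch below starts from the exponentiated coefficient of 0.99 reported above; the variable names are only for illustration.

```python
import math

rate_ratio = 0.99               # exponentiated coefficient reported above
beta = math.log(rate_ratio)     # the same coefficient on the log scale
inverse_ratio = 1 / rate_ratio  # multiplier for a one-unit move toward "rural"

pct_change_urban = (rate_ratio - 1) * 100     # percent change per unit more urban
pct_change_rural = (inverse_ratio - 1) * 100  # percent change per unit more rural

print(round(beta, 4), round(pct_change_urban, 2), round(pct_change_rural, 2))
```

Because 1/0.99 is only slightly above 1.01, describing the effect as roughly a 1% change in either direction is a reasonable approximation for a coefficient this close to zero.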

The map of standardized residuals can be seen above and indicates that there is clear regionality to the association between index scores and CRC death counts, where central Texas is much more green (negative standardized residuals) and counties near the edges are orange and red (positive standardized residuals). Also, as expected, the high CRC death count Texas counties from Exercise 2 (Anderson, Howard, Gonzales, and Newton) have large positive standardized residuals, meaning their death counts are well above what the Poisson model predicts.

Critique

This process of regression construction was very helpful for determining how to best model my data. I initially attempted to use a linear regression and after laboring to find a transformation for my outcome data, I finally decided on Poisson regression (count data duh!). Based on the model diagnostics I performed, it seems that the data work well for Poisson regression. In my opinion, the most difficult part of this exercise was interpreting the coefficients of the regression. Because the coefficients are on the log scale, they have to be exponentiated, which greatly confused my interpretation at first. Because there is an extensive amount of county CRC death data missing from this analysis (confidentiality), these results may be less indicative of the true rurality-CRC mortality associations in Texas. Further, this analysis suffers from the modifiable areal unit problem, where counties may not be an appropriate boundary for comparison. In future analyses, I would like to use more complete and spatially defined data from the state cancer registry to shed more light on the relationship.

Exercise 2: Neighborhood analysis of Texas counties with high colorectal cancer mortality rates

Question being asked

How do rurality indicator variables shift as distance increases away from Texas counties with high colorectal cancer (CRC) mortality?

In exercise 1, I used principal component analysis (PCA) to create a PCA-weighted rural index of the state of Texas using 3 scaled variables: population density, land development percentage, and median income. In this exercise I applied these same variables to determine how they change as distance increases away from the 4 Texas counties with the highest CRC mortality rates. To do this, I created multi-ring buffers around Anderson, Gonzales, Howard, and Newton counties and computed averages of each rural indicator variable for each successive buffer “donut.” I hypothesize that as distance increases away from high CRC county centroids, rurality indicator measures will have more “urban” values (i.e. higher population density, higher percent developed, higher median income) and CRC mortality rates will decrease.

Tools and Data Sources Used

For this exercise, I utilized the intersection, feature-to-point, and multi-ring buffer tools in ArcGIS along with the latticeExtra/gridExtra plotting packages in R. The rural indicator data used in this analysis are from the same sources I used in Exercise 1: Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database). These data were then scaled using the same procedure described in my Exercise 1 post. Aggregated CRC mortality rates for Texas counties were obtained for the years 2011-2015 from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program U.S. Population Data – 1969-2015.

Methods

Attribute Table Wrangling: The Texas county indicator variables were linked to county polygons in my Exercise 1, but cancer mortality data were not. For polygon linkage in this exercise, I imported the mortality data Excel sheet into Arc and used the join procedure to insert the data into the existing attribute table (with indicator variables) for county polygons.

Centroid & Multi-ring Buffer Creation: First, I utilized the feature-to-point tool in Arc to create a layer of county centroid points from the county polygon layer. Once the county polygons had been converted to centroids, I identified the 4 Texas counties with the highest CRC mortality rates. Then, using the select features tool and multi-ring buffer procedure, I selected each of the 4 counties separately and created multi-ring buffers at 50, 75, and 100 km. These distances were chosen based on the size of the selected counties and the size of the full state of Texas.
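The ring logic can be sketched outside of Arc as well. The Python below is a simplification that assigns a point to a ring by its great-circle distance from a center point, whereas the actual workflow intersected buffer polygons with county polygons. The coordinates and function names are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def ring_for(distance_km, edges=(50, 75, 100)):
    """Return the outer edge (km) of the ring containing the distance, or None if outside."""
    for edge in edges:
        if distance_km <= edge:
            return edge
    return None

# Hypothetical center and neighbor coordinates (NOT real county centroids).
center = (31.0, -97.0)
neighbor = (31.5, -97.0)
d = haversine_km(center[0], center[1], neighbor[0], neighbor[1])
print(round(d, 1), ring_for(d))
```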

Intersection & Donut Summary Statistics: Once the multi-ring buffer layers were created, I intersected each of the 4 buffer layers with the original county polygon layer containing all relevant variables. Then, means were computed in Arc via the summary statistics tool for population density, percent developed land, and median income for each successive donut in the multi-ring buffers. The computed tables of buffer donut means for each variable and county were then exported to Excel files.
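The donut summary step amounts to a grouped mean. The Python sketch below uses made-up records and hypothetical field names (ring_km, pop_density) to show the calculation; it takes a simple unweighted mean of the pieces in each ring and does not weight by area.

```python
from collections import defaultdict

def ring_means(records, variable):
    """Average one variable over the records that fall in each buffer ring."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        totals[rec["ring_km"]] += rec[variable]
        counts[rec["ring_km"]] += 1
    return {ring: totals[ring] / counts[ring] for ring in sorted(totals)}

# Made-up intersected pieces with scaled population density (NOT the Texas data).
records = [
    {"ring_km": 50, "pop_density": 0.2},
    {"ring_km": 50, "pop_density": 0.4},
    {"ring_km": 75, "pop_density": 0.1},
    {"ring_km": 100, "pop_density": 0.5},
    {"ring_km": 100, "pop_density": 0.7},
]
means = ring_means(records, "pop_density")
print({ring: round(m, 2) for ring, m in means.items()})
```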

R Plotting: The Excel files were then imported into R and line plots of buffer means by distance were created using the xyplot function (from lattice, which latticeExtra builds on). Plots were then combined into figures by county using the gridExtra package.

Results

This figure of scaled population density means for county multi-ring buffer donuts indicates varying trends for population density between the 4 high CRC counties. Two counties have increasing population density as distance from the centroids increases, and two have decreasing population density as distance from the centroids increases. Only the buffer map for population density is presented above, to keep the post concise and because of space limitations on this blog site. More specific neighborhood relationships between CRC death rates and all indicator variables can be seen in the line plots and explanations below.

Line Plots of County Indicator Variables Over Buffer Distances

The above plots display with more specificity than the buffer map that areas surrounding the 4 counties have differences in indicator variable trends as distance increases away from county centroids. Both Anderson and Newton counties largely follow the trend hypothesized: as distance from the county centroids increases, rural indicator variables have more “urban” values and CRC mortality rate decreases. For Newton county, this trend does not hold for median income, because as CRC mortality decreases away from the county centroid, median income also decreases. For the other 2 counties, Gonzales and Howard, the hypothesized relationship does not hold, because as distance increases away from county centroids, the rural indicator variables become more “rural” as CRC mortality decreases. This indicates that the associations between CRC mortality and rural indicator variables are complex and that neighborhood analysis does not capture all relationships.

Critique

This sort of neighborhood analysis was effective at determining trends in rural indicator variables in the areas surrounding high CRC mortality counties in Texas. The buffer map produced great broad results, where more generalized trends can be determined. The line plots specifically were highly useful for visualizing more specific changes in indicator variables and CRC mortality rates over distance. These results should be considered in the light of some limitations. First, county-level data were used for all variables and the buffer donuts may be too large or too small to capture the true neighborhood relationships in the analysis, as statistical procedures were not utilized to determine the distances. Further, more buffer donuts may have been useful to see more nuanced trends over distance. Second, as can be seen in the buffer map, there is a lack of CRC mortality data for many counties in west and southern Texas due to data suppression in order to preserve patient confidentiality. This presents significant bias in result interpretation, especially in Howard county, where many of the surrounding counties have suppressed CRC mortality data.

My future analysis of this data will likely be a comparative confusion matrix of the PCA-weighted index I created in Exercise 1 and CRC mortality data used in this exercise.