Tag Archives: rural disparities

Exercise 2: Neighborhood analysis of Texas counties with high colorectal cancer mortality rates

Question being asked

How do rurality indicator variables shift as distance increases away from Texas counties with high colorectal cancer (CRC) mortality?

In exercise 1, I used principal component analysis (PCA) to create a PCA-weighted rural index of the state of Texas using 3 scaled variables: population density, land development percentage, and median income. In this exercise I applied these same variables to determine how they change as distance increases away from the 4 Texas counties with the highest CRC mortality rates. To do this, I created multi-ring buffers around Anderson, Gonzales, Howard, and Newton county and computed averages of each rural indicator variable for each successive buffer “donut.” I hypothesize that as distance increases away from high CRC county centroids, rurality indicator measures will have more “urban” values (i.e. higher population density, higher percent developed, higher median income) and CRC mortality rates will decrease.

Tools and Data Sources Used

For this exercise, I utilized the intersection, feature-to-point, and multi-ring buffer tools in ArcGIS along with the latticeExtra/gridExtra plotting packages in R. The rural indicator data used in this analysis are from the same sources I used in Exercise 1: Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database). These data were then scaled using the same procedure from Exercise 1, which can be found here.  Aggregated CRC mortality rates for Texas counties were obtained for the years 2011-2015 from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program U.S. Population Data – 1969-2015.

Methods

Attribute Table Wrangling: The Texas county indicator variables were linked to county polygons in my Exercise 1, but cancer mortality data was not. For polygon linkage in this exercise, I imported the mortality data excel sheet into Arc and used the join procedure to insert the data into the existing attribute table (with indicator variables) for county polygons.

Centroid & Multi-ring Buffer Creation: First, I utilized the point-to-feature tool in Arc to create a layer of county centroid points from the county polygon layer. Once the county polygons had been converted to centroids, I identified the 4 Texas counties with the highest CRC mortality rates. Then, using the select features tool and multi-ring buffer procedure, I selected each of the 4 counties separately and created multi-ring buffers at 50, 75, and 100 Km. These distances were chosen based on the size of the selected counties and the size of the full state of Texas.

Intersection & Donut Summary Statistics: Once the multi-ring buffer layers were created, I intersected each of the 4 buffer layers with the original county polygon layer containing all relevant variables. Then, mean via the summary statistics tool were computed in Arc for population density, percent developed land, and median income for each successive donut in the multi-ring buffers. The computed tables of buffer donut means for each variable and county were then exported to Excel files.

R Plotting: The Excel files were then imported into R and line plots of buffer means by distance were created using the xyplot function within the latticeExtra package. Plots were then combined into figures by county using the gridExtra package.

Results

This figure of scaled population density means for county multi-ring buffer donuts indicates varying trends for population density between the 4 high CRC counties. Two counties have increasing population density as distance increases away from centroids, and two have decreasing population density as distance increases away from centroid. Only the buffer map for population density was presented above for post conciseness and space limitations on this blog site. More specific neighborhood relationships between CRC death rates and all indicator variables can be seen in the line plots and explanations below.

Line Plots of County Indicator Variables Over Buffer Distances

The above plots display with more specificity than the buffer map that areas surrounding the 4 counties have differences in indicator variable trends as distance increases away from county centroids. Both Anderson and Newton counties largely follow the trend hypothesized: as distance from the county centroids increases, rural indicator variables have more “urban” values and CRC mortality rate decreases. For Newton county, this trend does not hold for median income, because as CRC mortality decreases away from the county centroid, median income also decreases. For the other other 2 counties, Gonzales and Howard, the hypothesized relationship does not hold, because as distance increases away from county centroids, the rural indicator variables become more “rural” as CRC mortality decreases. This indicates that the associations between CRC mortality and rural indicator variables are complex and that neighborhood analysis does not capture all relationships.

Critique

This sort of neighborhood analysis was effective at determining trends in rural indicator variables in the areas surrounding high CRC mortality counties in Texas. The buffer map produced great broad results, where more generalized trends can be determined. The line plots specifically were highly useful for visualizing more specific changes in indicator variables and CRC mortality rates over distance. These results should be considered in the light of some limitations. First, county level data was used for all variables and the buffer donuts may be too large or too small to capture the true neighborhood relationships in the analysis, as statistical procedures were not utilized to determine the distances. Further, more buffer donuts may have been useful to see more nuanced trends over distance. Secondly, as can be seen in the buffer map, there is a lack of CRC mortality data for many counties in west and southern Texas due to data suppression in order to preserve patient confidentiality. This presents significant bias in result interpretation, especially in Howard county, where many of the counties surrounding it have suppressed CRC mortality data.

My future analysis of this data will likely be a comparative confusion matrix of the PCA-weighted index I created in Exercise 1 and CRC mortality data used in this exercise.

Exercise 1: Principal component analysis for rural index weighting & rural indicator measure contributions for Texas counties

Questions Being Asked:

How much do different indicator measures of rurality in Texas counties each contribute to a basic index of rurality both statistically and spatially? How much does the weighted basic index differ from a simple calculated mean for rurality?

Tools and Approaches Used

The major statistical method I used to answer these questions was principal component analysis (PCA) via the prcomp function in R and subsequent rurality indicator variable weighting for counties in Texas. The raw indicator data used in this analysis are as follows: Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database).

Description of Analysis Steps

1) Data Wrangling: Both income and population density data were natively in county-level polygon form, so no conversion was required in ArcGIS. Land use required extensive conversion in order to produce the indicator variable that would both be at the appropriate spatial definition and measure the correct aspect of land use. The raw land use discrete data was first reclassified from many categories of land, such as “deciduous forest,” “barren land,” and “scrub/shrub,” into binary “developed” and “undeveloped” classification groups for the entire state of Texas. Then, the resulting binary raster was converted to unsimplified polygons and the “tabulate intersection” ArcGIS tool was utilized in order to intersect Texas county boundaries with the produced unsimplified polygons. This tool allowed for calculation of percent of area of each Texas county that is considered developed land by the USGS.

2) Data Scaling: After the 3 variables were in analyzable form, each were scaled on a 0 to 100-point scale for visual comparison, PCA analysis, and computation of index weights.

3) Creation of Maps of Scaled Variables for Comparison: ArcGIS was utilized to create maps of each of the scaled variables to visually compare the 3 factors to one another.

4) PCA and Weight Computation: The PCA was completed in R using the prcomp function and variable weights were extracted from the procedure by determining the proportion of variance each variable contributed to. This specific step helps answer how much each indicator variable contributes to rurality in Texas.

5) Simple Means Map, Weighted Index Map, and Statistical Comparison: The scaled variables were used to calculate both an unweighted mean rurality score and weighted mean rurality score for all Texas counties. Maps of each of these measures were produced for visual comparison. A student’s t-test was then completed to compare unweighted and weighted rurality scores for counties.

Results

Maps of Scaled Variables of Texas Counties for Comparison

Rough but similar spatial patterns can be seen in all three maps. Counties with large cities and generally more urbanized counties have higher median income, population density, and percent of land area developed (Green=high, yellow=medium, red=low). There are more extreme values in the population density and percent of land area developed maps than median income.

PCA and Weighting Computation

The biplot of the PCA procedure helps visualize how the samples of the 3 variables relate to one another and how much each variable contributes to the first principal component. Important factors to notice in this plot are the direction of the arrows and clustering of the points. Typically, the more the variable arrow points toward the principal component 1 (PC1) axis, the more it contributes to the proportion of variance. The population density and percent land area developed variables happen to overlap one another in this plot, as they have highly similar contributions to the variance in PC1.

This table shows that population density contributes to 44.4% of the variance in PC1, income contributes to 11.11%, and percent land area developed contributes 44.5%. This PCA procedure indicates that when weighting rurality scores by county in Texas using these 3 variables, percent land area developed should have the highest weight, population density should have the second highest weight, and income should have the lowest. The weighting of variables in the weighted county rurality mean score follows these proportions.

Simple Means Map, Weighted Index Map, and Statistical Comparison

Visually comparing the two maps of rurality scores, there is is some overlap, but quite stark differences. County scores overall have become more rural (lower scores) based on simple visual comparison. Statistical analysis will allow a more nuanced view of the actual differences between the two.

The computed t-test indicates that the weighted index, on average, produces county rurality scores that are significantly more rural (lower scores on the 0-100 scale) than the unweighted index (p<0.001). The overall unweighted rurality score mean for all Texas counties is ~17, while the overall weighted rurality score mean is ~7.5. This indicates that the basic weighted index may better capture the multidimensionality of rurality for Texas counties than unweighted mean scores.

Critique

Based on my understanding, PCA is quite useful for creation of weighted indexes. Unfortunately, the variables I utilized for this analysis have somewhat varying patterns of clustering and possible outliers, which can be seen in the above biplot. These possible outliers could have an effect on the weighting of variables in the index. Due to the small sample size of counties and in order to produce a complete picture of all counties in Texas, I did not want to remove any counties from the analysis. PCA as it relates to index weighting also requires that there is either an a priori understanding or statistical proof of a correlation between variables. Based on the directions of the arrows in the biplot, there is likely correlation between the variables, but more considerations may be needed to confirm this. Also, the scaling of the median income variable is slightly off because the maximum value is over 100. This will need to be corrected in a later analysis.

My future analysis will include more health-related variables in the PCA procedure to determine how health variables contribute to rurality in comparison to the basic rural indicator variables analyzed in this exercise.