
Spatial and Temporal Patterns of Oregon Salmonella Cases

  1. The research question that you asked.

My original question focused solely on the spatial patterns of Salmonella cases in Oregon. However, after realizing that my dataset contained time series information, I pivoted my analysis and my research question became: How are the spatial and temporal patterns of reported Salmonella infections associated with sociodemographic factors in Oregon counties from 2008 to 2017? While the eventual discovery of causal drivers of disease will be important for disease prevention efforts, the focus here is on which factors are associated with disease at the county level. Through this analysis I hoped to identify sub-populations that are at higher risk of becoming infected with Salmonella in Oregon.

  2. A description of the dataset you examined, with spatial and temporal resolution and extent.

The data used in my analysis came from a variety of sources. Data on the number of Salmonella cases per county came from the Oregon Public Health Epidemiology User System (ORPHEUS). These data contained individual-level information that was de-identified for privacy reasons. Information on age, sex, disease onset, and county of residence was available in these data. Population estimates came from Portland State University’s Population Research Center, which produces yearly county-level population estimates. Other sociodemographic information came from the American Community Survey, a yearly survey that assesses various county-level characteristics such as poverty, high school graduation rates, and percentage of foreign-born residents. All data used in this analysis were from the years 2008-2017.

  3. Hypotheses: predictions of patterns and processes you looked for.

Hypothesis 1: I expect counties with higher proportions of females to have higher levels of Salmonella infections due to findings in prior literature concluding that females have a higher incidence rate of Salmonella infections compared to males. The underlying causal mechanism for this pattern is unknown, but given the results of other studies of foodborne illness I expect Oregon’s population to be similar. As a result I would expect the percentage of a county that is female to be significantly associated with Salmonella incidence.

Hypothesis 2: I expect counties with higher proportions of infants and newborns to also have higher levels of Salmonella infections compared to counties with a higher proportion of older age groups. Other findings from the Oregon Health Authority indicate that young children are at high risk for developing foodborne illnesses. The reasoning here is that the immune systems of young children are not fully developed and less likely to effectively fight off infection resulting in a disproportionately high disease incidence in this group. I expect to see that the percentage of a county that belongs to the age group 0-4 years old will be significantly associated with Salmonella incidence.

Hypothesis 3: I expect to find a significant time trend in Salmonella incidence in Oregon counties. Disease incidence varies from year to year and can sometimes be volatile in sparsely populated counties or counties experiencing a major outbreak. Because of this natural variance over time I expect to see a significant time trend of reported cases in Oregon. Some counties will show a positive trend over time, others will show decreasing disease rates, and others will be relatively level. However, I expect time will be a significant factor associated with disease incidence.

  4. Approaches: analysis approaches you used.

I used auto- and cross-correlation, longitudinal trend analysis, hotspot analysis, geographically weighted regression, and Principal Component Analysis. All of these analytical approaches were performed in the R statistical software with the aid of various packages which allowed me to perform all of my spatial analysis.

  5. Results: what did you produce: maps? statistical relationships? other?

From these analyses I was able to create maps, plots of auto- and cross-correlation over time, and a biplot of principal components which were associated with different regions of Oregon. There was also considerable numerical output from my time trend regression and GWR, which provided evidence of statistically significant associations between the variables in my model and the outcome. These outputs can be seen in my second and third blog posts.

Hypothesis 1: Findings from my GWR support my first hypothesis that the proportion of a county which is female is significantly associated with Salmonella incidence. While coefficient estimates vary by county, most often there is a positive association between the proportion of a county that is female and disease incidence. Cross-correlation analysis found large areas of Oregon, particularly in the western part of the state, where the association between the two variables was positive, and some areas clustered in the eastern part of the state where this pattern was reversed.

Hypothesis 2: GWR findings and cross correlation analyses did not support my second hypothesis. The proportion of a county which is aged 0-4 is not significantly associated with county Salmonella incidence. However, it was found that county percentage of child poverty was significantly associated with disease incidence. Perhaps the reason for this finding is that the proportion of children alone may not be significantly associated with disease, but poverty is. Thus children in poverty partly explain the association seen for children as a whole.

Hypothesis 3: Longitudinal analysis and regression supported my third hypothesis that a significant time trend existed for Oregon’s Salmonella incidence. As expected, there was some variance in county disease rates over time due to a large host of factors. Larger populations are likely to have more cases of disease reported due simply to the fact that there are more people to possibly infect. Overall, there is an increasing trend in Oregon’s Salmonella incidence. There may be a higher amount of disease occurring in Oregon over time. Another possible explanation for some of the increase could be due to improvements in Oregon’s disease monitoring abilities and infrastructure over time.

  6. What did you learn from your results? How are these results important to science? to resource managers?

This analysis found statistically significant associations between some population characteristics (like median age, childhood poverty levels, and time) and the reported rate of Salmonella. These results are important to researchers because the focus to find causal exposures for Salmonella can be narrowed to groups or areas more associated with the disease. These results can inform future disease prevention research. Resource managers would also be interested in this analysis as the counties identified here as being more associated with disease than others can be targeted for disease surveillance.

  7. Your learning: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other?

My analysis was completed solely in R statistical software. Over this course I learned a lot about different spatial analysis packages, which made my R work more robust. Some spatial packages have a steep learning curve and require a fair degree of technical knowledge to implement appropriately in an analysis. I would say I learned a lot about troubleshooting from discussion threads and GitHub posts.

  8. What did you learn about statistics, including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA), and (e) or other techniques?

I learned how hotspot analysis is performed and what a Getis-Ord Gi* score is and how they are compared to yield hot and cold spots. As for autocorrelation I learned how to perform cross-correlation analysis through both space and time, as well as how to cluster different values into a map which is easier to interpret. During my geographically weighted regression analysis I learned how to transform my data and make it compatible with the regression technique as well as figure out which variables to put in my GWR based on an ordinary least squares regression. Prior to this project I had never heard of PCA, so here I learned introductory skills about how to apply this analytical technique to my dataset.

Manipulating salinity to create a better fit habitat suitability model for Olympia oysters

Follow-up from Exercise 2
In Exercise 2, I compared Olympia oyster location data to the model of predicted suitable habitat that I developed in Exercise 1. Results from that analysis showed that 13 of the 18 observations within or near areas of high-quality habitat (type 4) indicated the presence of Olympia oysters (72%) versus 5 locations where oysters were not found (28%). No field survey locations fell within areas of majority lowest quality habitat (type 1). Seven observations were found within the second lowest quality habitat type (2), with 2 of those observations indicating presence (29%) and 5 indicating absence (71%).

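Observations by habitat suitability type, as count [proportion of column total]:

| Observation | Type 4 | Type 3 | Type 2 | Type 1 |
|---|---|---|---|---|
| Presence | 13 [0.72] | 4 [0.4] | 2 [0.29] | 0 [0] |
| Absence | 5 [0.28] | 6 [0.6] | 5 [0.71] | 0 [0] |
| Total (n = 35*) | 18 [1.0] | 10 [1.0] | 7 [1.0] | 0 [0] |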

*3 data points removed from analysis due to inconclusive search results.

To expand on this analysis, I used a confusion matrix to further examine the ‘errors’ in the data, or the observations that did not agree with my model of predicted suitable habitat. For ease of interpretation, I removed habitat suitability type 1 since there were no observations in this category, and type 3 since it fell in between high and low-quality habitat.

Confusion matrix by habitat suitability type:

| Observation | Type 4 (high) | Type 2 (low) |
|---|---|---|
| Presence | 0.37 | 0.06 |
| Absence | 0.14 | 0.14 |

Decimals indicate the proportion of total observations (n = 35) that fell within each category. The habitat suitability model predicted that oysters would be present within the highest quality habitat type and absent in low-quality habitat. The confusion matrix shows that the model agreed with the field data for 37% of the total observations (oysters present in habitat type 4) and for 14% of the total observations (oysters absent in habitat type 2).
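The proportions in both tables can be reproduced from the reported counts. The following is a minimal R sketch that uses only the counts given above; the object names are illustrative and not taken from the original analysis.

```r
# Counts of observations by majority habitat type, from the Exercise 2 table
counts <- matrix(c(13, 4, 2, 0,
                    5, 6, 5, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Presence", "Absence"),
                                 c("4", "3", "2", "1")))

n_total <- sum(counts)  # total number of observations

# Column-wise proportions (the bracketed values); type 1 is skipped
# because it has no observations
col_prop <- round(prop.table(counts[, c("4", "3", "2")], margin = 2), 2)

# Proportions of all observations for types 4 and 2 (the confusion matrix)
confusion <- round(counts[, c("4", "2")] / n_total, 2)

print(col_prop)
print(confusion)
```

Dividing by the grand total rather than the column total is what distinguishes the confusion matrix from the bracketed values in the first table.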

In the type 4 habitat, 14% of the total observations found that oysters were absent, which goes against the model prediction. I suspect this is partly due to the patchy nature of substrate availability in Yaquina Bay and the low-resolution quality of the substrate raster layer used for analysis. For the 6% of observations that show oyster presence within habitat type 2, it’s possible that these points were juvenile oysters that were able to settle in year-1, but are less likely to survive into adulthood. Both of these errors could also indicate issues with the weights assigned in the model back in Exercise 1.

Question asked
For exercise 3, I wanted to expand on the habitat suitability analysis to see if I could more accurately predict oyster locations and account for the errors found in exercise 2. Here I asked:

Can the spatial pattern of Olympia oyster location data be more accurately described by manipulating the spatial pattern of one of the parameters of suitable habitat (salinity)?

I decided to modify the rank values of one of the model parameters: salinity. Based on my experience collecting oyster location data in the field, it seemed that salinity was the biggest influence in where oysters would be found. It was also the easiest parameter to change since it had the fewest rank categories. The excerpt below comes from the ranking value table I established for the habitat parameters in Exercise 1. Changes to rank value for salinity are indicated in the right-most column.

| Habitat parameter | Subcategories | Subcategory variable range | Olympia oyster tolerance | Rank value applied (old -> new) |
|---|---|---|---|---|
| Mean wet-season salinity (psu) | Upper estuary | < 16 psu | somewhat, but not long-term | 1 -> 2 |
| | Upper mid estuary | 16.1 – 23 psu | yes | 4 -> 3 |
| | Lower mid estuary | 23.1 – 27 psu | yes | 3 -> 4 |
| | Lower estuary | > 27 psu | somewhat | 2 -> 1 |
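As an illustration of what the re-ranking does, here is a small R sketch that assigns the old and new rank values to a few salinity readings. The readings are made up, and the class boundaries are approximated as 16, 23, and 27 psu; the actual reclassification was done on a raster in ArcGIS Pro, not in R.

```r
# Illustrative mean wet-season salinity values (psu); not real data
salinity <- c(12, 18, 25, 30)

# Class breaks approximating the table: <16, 16.1-23, 23.1-27, >27 psu
breaks <- c(-Inf, 16, 23, 27, Inf)
zones  <- c("Upper estuary", "Upper mid estuary",
            "Lower mid estuary", "Lower estuary")

old_rank <- c(1, 4, 3, 2)  # rank values from Exercise 1
new_rank <- c(2, 3, 4, 1)  # rank values applied in Exercise 3

# Index (1 to 4) of the salinity class each reading falls into
zone_id <- cut(salinity, breaks = breaks, labels = FALSE)

reclass <- data.frame(salinity,
                      zone     = zones[zone_id],
                      old_rank = old_rank[zone_id],
                      new_rank = new_rank[zone_id])
print(reclass)
```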

Name of tool or approach
I combined my approach from exercise 1 and exercise 2 to create a different model output based on the new rank values applied to the salinity parameter. The analysis was completed in ArcGIS Pro and the table of values generated was reviewed in Excel.

Brief description of steps to complete the analysis

  1. After assigning new rank values to the salinity parameter, I applied a new ‘weighted overlay’ to the salinity raster layer in ArcGIS. As I did in exercise 1, I used the ‘weighted overlay’ tool again to combine the weighted substrate and bathymetry layers with the updated salinity layer. A new map of suitable habitat was created based on these new ranking values.
  2. Then, I added the field observation data of oyster presence/absence to the map and created a new map of all the data points overlaid on habitat suitability.
  3. I then created buffers around each of the points using the ‘buffer’ tool. In the last analysis, I used the ‘multiple ring buffer’, but was only able to analyze the largest buffer (300m). This time, I created only the one buffer around each point.
  4. Using the ‘Zonal Statistics’ tool, I overlaid the newly created buffers on the updated raster of habitat suitability and viewed the results. I again chose ‘majority’ as the statistic to display, which categorizes each buffer by the habitat suitability type occupying the largest area within it.
  5. I also created a results table using the ‘Zonal Statistics as Table’ tool, then copied it over to Excel for additional analysis.
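The steps above were run with ArcGIS Pro tools, but the underlying arithmetic can be sketched in a few lines of R. In this sketch the 3x3 rank grids are invented, the equal weights are a placeholder assumption (the weights used in the actual model are not stated here), and `zonal_majority` is a hypothetical helper that only loosely mimics the ‘majority’ zonal statistic.

```r
# Tiny 3x3 rank rasters (values 1-4); purely illustrative, not the Yaquina Bay layers
salinity   <- matrix(c(3, 3, 4,  3, 4, 4,  2, 3, 3), nrow = 3, byrow = TRUE)
substrate  <- matrix(c(4, 3, 4,  2, 3, 4,  2, 3, 2), nrow = 3, byrow = TRUE)
bathymetry <- matrix(c(3, 3, 4,  4, 3, 3,  1, 4, 3), nrow = 3, byrow = TRUE)

# Hypothetical influence weights that sum to 1 (assumed, not the original values)
w <- c(salinity = 1/3, substrate = 1/3, bathymetry = 1/3)

# Weighted overlay: weighted sum of ranks, rounded to the nearest class
suitability <- round(w[["salinity"]]   * salinity +
                     w[["substrate"]]  * substrate +
                     w[["bathymetry"]] * bathymetry)

# "Majority" statistic: the most frequent class among the cells in a zone
# (ties go to the lowest class because which.max returns the first maximum)
zonal_majority <- function(values) {
  tab <- table(values)
  as.integer(names(tab)[which.max(tab)])
}

# Treat the whole 3x3 window as a single buffer for this example
buffer_class <- zonal_majority(suitability)
print(suitability)
print(buffer_class)
```

Because the overlay rounds a weighted average, changing the ranks of a parameter with only four classes can shift many cells into the same output class at once.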

Results
An updated table based on the manipulated salinity rank values was generated to compare to the table created in exercise 2 and displayed at the top of this blog post. Results from this analysis showed that only 2 of the 35 total observations fell within or near areas of high-quality habitat (type 4): one indicated presence and the other absence. The adjustments to the salinity rank values allowed habitat type 3 to dominate the map, with 31 of the 35 observations falling in this category. Of these 31 points, 18 showed presence (58%) and 13 showed absence (42%). Again, no field survey locations fell within areas of majority lowest quality habitat (type 1). Two observations were found within the second lowest quality habitat type (2), both indicating absence (100%).

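Observations by habitat suitability type, as count [proportion of column total]:

| Observation | Type 4 | Type 3 | Type 2 | Type 1 |
|---|---|---|---|---|
| Presence | 1 [0.5] | 18 [0.58] | 0 [0] | 0 [0] |
| Absence | 1 [0.5] | 13 [0.42] | 2 [1.0] | 0 [0] |
| Total (n = 35) | 2 [1.0] | 31 [1.0] | 2 [1.0] | 0 [0] |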

Again, I used a confusion matrix to further examine the ‘errors’ in the data, or the observations that did not agree with my model of predicted suitable habitat. I removed habitat suitability type 1 since there were no observations in this category.

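Confusion matrix by habitat suitability type:

| Observation | Type 4 (high) | Type 3 | Type 2 (low) |
|---|---|---|---|
| Presence | 0.03 | 0.51 | 0 |
| Absence | 0.03 | 0.37 | 0.06 |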

Decimals reported indicate the proportion of total observations (n = 35) that fell within this category. The confusion matrix shows that the model fit nearly all observations (31) into the type 3 habitat category, with a near even split between presence (18) and absence (13). In reference to the confusion matrix from exercise 2 at the top of this blog, it is difficult to make a direct comparison of the errors since most of the observations fell into type 3.

Critique of the method
I was surprised to see how drastically the map of suitable habitat changed by manipulating only one of the habitat parameters. The adjustment of the rank values for salinity resulted in a vast reduction in area attributed to the highest quality habitat (type 4). The results indicate that choosing the salinity parameter to manipulate did not result in a better fit model and that changes to salinity rank values were too drastic. Since the salinity parameter contains only 4 subcategories, or 4 different weighted salinity values, the impacts to the habitat suitability map were greater than if the parameter had had more nuance. For example, the bathymetry parameter has 10 subcategories and a reworking of the ranking values within could have made more subtle changes to the habitat suitability map.

The next steps would be to examine another parameter, either substrate or bathymetry, to see if adjustments to ranking values result in a more appropriate illustration of suitable habitat. Additionally, the collection of more oyster location data points will help in creating a better fit model and understanding the nuances of suitable habitat in Yaquina Bay.

 

Spatial and Temporal Patterns of Reported Salmonella Rates in Oregon

Question: How are values of reported yearly Salmonella rates related to predictors found in previous OLS and GWR analysis at different temporal and spatial lags? Also, to what extent are there regional groupings in Oregon found through Principal Components Analysis (PCA)?

The data used here is the same used in all prior blog posts.

Names of Analytical Tools/Approaches Used

Both temporal and spatial cross-correlation tools were used in my analysis to visualize how values of the predictors I identified in previous analyses varied at higher/lower rates of reported Salmonella. The time series analysis was limited to 2008-2017 which was the extent of my data. Temporal cross-correlation allowed me to visualize how the values of predictors varied at different time lags of Salmonella rates and the spatial cross-correlation allowed me to visualize the variation in the predictors at different spatial lags of Salmonella rates. Results for the temporal and spatial cross-correlation were visualized with ACF plots and cluster plots respectively. Finally, PCA was used to identify noticeable regional groupings as a function of the different variables in my dataset. These results were visualized on a biplot.

Description of the Analytical Process

All cross-correlation analysis was limited to the significant predictors identified in previous regression analysis. Specifically, county Salmonella rate as a function of: county % female, county % child poverty, and county median age.

  1. Temporal Autocorrelation: Data were summarized by year, creating a data frame of median values for Salmonella rates and the predictors for each of the 10 years in my dataset. ACF plots were created for each of the predictors as they varied over different time lags of Salmonella rates.
  2. Spatial Autocorrelation: Data were summarized by county, creating a data frame of median values for Salmonella rates and the predictors for all 36 counties in Oregon. Spatial cross-correlation was carried out using a local spatial weights matrix to create clusters of local indicators of spatial association. The clusters for each of the predictors were visualized on maps of Oregon.
  3. PCA: Oregon was divided into three broad regions based on population distribution and geographical barriers: 1) the Portland metro area consisting of Multnomah, Clackamas, Washington, and Columbia counties, 2) West Oregon consisting of all other counties west of the Cascade Mountain range, and 3) East Oregon consisting of all counties east of the mountain range. Data were summarized by county and every county was assigned to one of these regions. Only the first two principal components were retained because together they explained approximately 70% of the variation in the data. The results of the PCA were visualized in a biplot.
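The temporal cross-correlation and PCA steps can be sketched with base R functions. This is a minimal sketch on synthetic data: the variable names and values below are stand-ins, not the actual county data, and the original analysis may have used different functions or packages.

```r
set.seed(1)

# Synthetic yearly medians for 10 years (stand-ins for the 2008-2017 data)
years      <- 2008:2017
sal_rate   <- 10 + 0.3 * seq_along(years) + rnorm(10)
pct_female <- 50 + 0.05 * seq_along(years) + rnorm(10, sd = 0.1)

# Temporal cross-correlation between a predictor and Salmonella rates
cc <- ccf(pct_female, sal_rate, lag.max = 4, plot = FALSE)
print(data.frame(lag = as.vector(cc$lag), ccf = round(as.vector(cc$acf), 2)))

# Synthetic county-level table: 36 counties, 5 demographic variables
county_vars <- matrix(rnorm(36 * 5), nrow = 36,
                      dimnames = list(NULL, c("female", "childpov", "medianage",
                                              "income", "foreignborn")))

# PCA on standardized variables, then cumulative variance explained by PC1 and PC2
pca     <- prcomp(county_vars, center = TRUE, scale. = TRUE)
cum_var <- summary(pca)$importance["Cumulative Proportion", 1:2]
print(cum_var)

# biplot(pca) would draw the biplot of the first two components
```

With only 10 yearly values, correlations at the larger lags rest on very few pairs of observations, so they should be read cautiously.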

Brief Description of Results

Temporal Cross Correlation: results show that county percent female was positively associated with Salmonella rates at time lags 1-4 and slightly negatively associated with disease rates at all other lags. Child poverty was also positively associated with Salmonella rates at lags 1-4 and was otherwise negatively associated with disease rates, except at distantly negative time lags. Temporal cross-correlation analysis of median age and Salmonella incidence rates yielded a similar pattern. The temporal cross-correlation patterns across all three of these variables appear to follow a somewhat sinusoidal curve.

Female:

Child Poverty:

Median Age:

Spatial Cross Correlation: results from a cross correlation analysis of county percent female and Salmonella rates show a large cluster of high county percent female and high rates of Salmonella clustered in the Northwest region of the state. Other clusters are fairly well dispersed around the state. From an analysis of child poverty and Salmonella rates there is a large cluster of high Salmonella rates and high child poverty in the Northeastern area of the state and a large cluster of high Salmonella rates but low child poverty in the Northwest. A similar pattern can be seen in the median age cross-correlation plot.

Female:

Child Poverty:

Median Age:

PCA: Principal Component Analysis showed that the counties comprising the Portland metro area were all characterized by relatively higher income compared to the rest of the state. The Eastern portion of the state can somewhat be characterized by higher median ages and higher proportions of elderly residents. The rest of the western portion of Oregon is not characterized by particularly high values of any of the other variables of interest.

Critique of Methods Used

These analyses support the findings of Exercise 2, where there was evidence of a time trend in the rates of reported Salmonella in Oregon counties. The results of the cross-correlation and principal component analysis also support the findings of the GWR analysis, where different predictors were positively or negatively associated with Salmonella rates depending on the county in which the data were measured. One main critique of the spatial cross-correlation analysis is that, because a local spatial weights matrix was used, only local indicators of spatial association were determined. This analysis did not include a global spatial weights matrix, which could change the spatial associations seen in my results. Also, while PCA was useful in showing that different regions in Oregon were more strongly associated with certain predictors than others, there is considerable overlap between the regions. Thus, it is unknown whether these results are significant.

Spatial and Temporal Patterns of Reported Salmonella Rates in Oregon

  1. Question Asked

Here I asked if there was evidence supporting temporal autocorrelation of age-adjusted Salmonella county rates within Oregon from 2008-2017 and if so what type of correlation structure is most appropriate. I also investigated spatial patterns of reported Salmonella rates as they related to various demographic variables like: Percent of county which is aged 0-4, percent of county which is aged 80+, percent of county which is female, median age, percentage of county residents who have completed high school, median county income, percent of county who is born outside the US, percent of county who speaks a language other than English at home, percentage of county estimated to be in poverty, and percent of children in a county estimated to be in poverty.

To answer these questions, I used the same data outlined in the first exercise blog post with newer demographic variables being taken from the American Community Survey and published on AmericanFactFinder which provides yearly estimates of demographic information at the county level. Unfortunately, yearly data prior to the year 2009 is unavailable which shortened the window of analysis by a year.

  2. Names of analytical tools/approaches used

To answer these questions, I first created an exhaustive general linear model in which county Salmonella rates were a function of the demographic variables listed above. A Ljung-Box test was used to assess whether there was evidence of non-zero autocorrelation in the residuals of the model at various time lags. Following this, an ideal linear model was selected using a stepwise AIC selection process, and then different variance structures were compared by AIC, BIC, and log-likelihood metrics as well as ANOVA testing. After selecting an appropriate base model and variance structure, I allowed for interaction between all variables and time and performed more ANOVA testing to select the best model that allowed for variable-by-time interaction. A version of this model would later be used in geographically weighted regression, in which I allowed the coefficients to vary across space in Oregon.

  3. Description of the analytical process

I created a lattice plot of reported age-adjusted Salmonella rates over the 10-year period in every county to visually assess whether Salmonella rates were changing over time. After seeing that rates of reported Salmonella were changing over time in many Oregon counties, I created a “full” linear model in which the rate of reported Salmonella cases in a county was a function of the demographic variables described above. Because these are longitudinal data, I wanted to see if rates of Salmonella were correlated over time, meaning that a county’s rate in one year could be predicted from the rate found in another year in the same county. I first performed a Ljung-Box test to assess the need to account for temporal autocorrelation, tested normality assumptions, and log-transformed my outcome (Salmonella rates) based on those tests. A simple backward stepwise AIC approach was used on my full model to identify and remove extraneous variables. This worked by removing variables from the full model one at a time, comparing the AIC values of the models with and without each variable, and continuing until removing a further variable no longer improved the AIC. I then used this model to select an ideal variance structure for Salmonella rates autocorrelated at different time lags. The structures compared were: independent, compound symmetric, AR(1), MA(1), ARMA(1,1), unstructured, and compound symmetric with different variances across time. After AIC, BIC, log-likelihood, and ANOVA testing, an ideal variance structure was determined and the model using this structure was evaluated for basic interaction with time. All variables in the model were allowed to interact with time, including time itself (i.e. allowing for a quadratic time trend). Once again AIC, BIC, log-likelihood, and ANOVA testing were used to select the most ideal model.
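The first two modeling steps (a Ljung-Box test on model residuals, then backward stepwise selection by AIC) can be sketched with base R. The data below are synthetic and treated as a single series for simplicity, `noise_var` is a made-up variable, and `step()` is used here as an assumption; the original analysis may have used a different stepwise function.

```r
set.seed(42)

# Synthetic stand-in data: 60 observations of the outcome and predictors
n <- 60
dat <- data.frame(
  female    = rnorm(n, 50, 1),
  childpov  = rnorm(n, 20, 5),
  medianage = rnorm(n, 40, 4),
  noise_var = rnorm(n)   # hypothetical variable unrelated to the outcome
)
dat$logsal <- 0.2 * dat$female + 0.03 * dat$childpov + rnorm(n, sd = 0.5)

# "Full" linear model with every candidate predictor
full_m <- lm(logsal ~ female + childpov + medianage + noise_var, data = dat)

# Ljung-Box test for autocorrelation in the residuals up to lag 3
lb <- Box.test(residuals(full_m), lag = 3, type = "Ljung-Box")
print(lb$p.value)

# Backward stepwise selection: drop terms while doing so lowers the AIC
reduced_m <- step(full_m, direction = "backward", trace = 0)
print(formula(reduced_m))
```

A small Ljung-Box p-value would suggest the residuals are autocorrelated, which is the motivation for comparing correlation structures afterwards.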

Following this I moved on to GWR, where I was able to use the model identified above to create a new data frame containing Beta coefficient estimates of the significant variables in my final model for every county in Oregon. This data frame of coefficients was merged with a spatial data frame containing county level information for all of Oregon. Plots of the coefficient values for every different county were created.

  4. Brief Description of Results

In the lattice plot, every panel represents one of Oregon’s 36 counties; panels containing an error did not have any cases of reported Salmonella. Rates in some counties decrease with time, others show slightly increasing trends, and others are fairly level over time. There is clearly evidence of a time trend in some counties.

Normality testing found that age-adjusted rates of reported Salmonella in Oregon were not normally distributed. For ease of visualization, and to address the failure to meet the normality assumption of linear modeling, I log-transformed the rates of reported Salmonella. The Ljung-Box test on my full model provided evidence of non-zero temporal autocorrelation in the data, and a visual inspection of the lattice plot supports this, with most counties showing a change in rates over time.

The stepwise AIC model selection approach yielded the following model:

logsal ~ %female + %childpov + medianage +year

Covariance structure comparison:

| Covariance Model | logLikelihood | AIC | BIC |
|---|---|---|---|
| Independent | -431 | 874 | 896 |
| Compound Symmetry | -423 | 860 | 887 |
| AR(1) | -427 | 869 | 895 |
| MA(1) | -427 | 869 | 895 |
| ARMA(1,1) | -423 | 862 | 892 |
| Unstructured | -387 | 859 | 1017 |
| Compound Symmetry Different Variance Across Time | -412 | 854 | 910 |

For the most part, AIC, BIC, and log-likelihood values were clustered together across the different models. I decided to base my choice primarily on the lowest AIC because that is how I had done variable selection up to this point. This led me to choose a compound symmetric model which allowed for different variances across time.
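To make the dependence on the selection criterion concrete, the short R sketch below re-enters the values from the comparison table and picks the best structure under each criterion. Only the numbers reported above are used; the `k_implied` column is an approximate back-calculation from AIC = 2k - 2logLik and is affected by rounding in the table.

```r
# Values copied from the covariance structure comparison table
fit <- data.frame(
  model  = c("Independent", "Compound Symmetry", "AR(1)", "MA(1)",
             "ARMA(1,1)", "Unstructured",
             "Compound Symmetry Different Variance Across Time"),
  logLik = c(-431, -423, -427, -427, -423, -387, -412),
  AIC    = c(874, 860, 869, 869, 862, 859, 854),
  BIC    = c(896, 887, 895, 895, 892, 1017, 910)
)

# Best structure under each criterion
best_aic    <- fit$model[which.min(fit$AIC)]
best_bic    <- fit$model[which.min(fit$BIC)]
best_loglik <- fit$model[which.max(fit$logLik)]

# Approximate number of estimated parameters implied by AIC and logLik
fit$k_implied <- (fit$AIC + 2 * fit$logLik) / 2

print(c(AIC = best_aic, BIC = best_bic, logLik = best_loglik))
print(fit[, c("model", "k_implied")])
```

The three criteria point to three different structures, because BIC penalizes the extra variance parameters more heavily than AIC and the log-likelihood alone does not penalize them at all.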

Next, I built models which allowed for simple interaction with time meaning that any three-way interaction with time was not evaluated for. Subsequent ANOVA testing comparing the different interaction models to each other, to a model where no interaction was present, and a model where time was absent were used in my selection of a final model.

Final Model: 

logsal ~ %female + %childpov + medianage + year +(medianage*year)

This model follows a 5×5 compound symmetric correlation variance model which allows for variance to change over time.

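Code:

library(nlme)  # provides gls(), corCompSymm(), and varIdent()
# Compound symmetric correlation within county, separate variance for each year
interact_m <- gls(logsal ~ female + childpov + medianage + year + (medianage*year),
                  na.action = na.omit,
                  correlation = corCompSymm(form = ~ 1 | county),
                  weights = varIdent(form = ~ 1 | year),
                  data = alldata)
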
Within County Standard Error (95% CI): 0.92

| Estimate Name | Estimate (log-scale) | Std. Error | p-value |
|---|---|---|---|
| Intercept | -759.42 | 237.79 | 0.002 |
| % Female | 18.06 | 4.46 | <0.001 |
| % Child Poverty | 0.03 | 3.27 | 0.001 |
| Median Age | 17.16 | 3.11 | 0.002 |
| Year | 0.38 | 3.18 | 0.002 |
| Median Age*Year | -0.01 | -3.12 | 0.002 |

Estimates are on the log scale, making them difficult to interpret without exponentiation. However, it can be seen that a one-unit change in the percentage of a county that is female or a one-year change in median age is associated with much larger changes in rates of reported Salmonella than changes in the percentage of child poverty or the year. Overall, incidence rates of reported Salmonella were shown to increase with time, county percentage female, county percentage of child poverty, and county median age, with a significant protective interaction effect between time and median age.
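As a small worked example of the exponentiation mentioned above, the R sketch below converts the child poverty estimate into a percent change in the rate. It assumes the outcome was transformed with the natural log and that the predictor is measured in the units implied by the table; terms involved in the median age by year interaction are left out because their effect depends on the value of the other variable.

```r
# Log-scale estimate copied from the coefficient table
est <- c(childpov = 0.03)

# Assuming a natural-log outcome, exp(beta) is the multiplicative change in the
# rate for a one-unit increase in the predictor, holding the others fixed
rate_ratio <- exp(est)
pct_change <- 100 * (rate_ratio - 1)

print(round(rate_ratio, 3))
print(round(pct_change, 1))
```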

For my GWR analysis I used a function adapted from the rspatial.org website, which looks like:

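# Fit the final model separately for one county and return its coefficients
regfun <- function(x) {
  dat <- alldata[alldata$county == x, ]  # rows belonging to county x
  # glm() with the default gaussian family is equivalent to an ordinary linear model
  m <- glm(logsal ~ female + childpov + medianage + year + (medianage*year), data = dat)
  coefficients(m)
}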

As can be seen, I retained the same significant variables found in the OLS regression for my time series analysis. GWR in this case allows for the coefficient estimates to vary by county.

This allowed me to create a data frame of all coefficient estimates for every county in Oregon. Subsequent dot charts showed the direction and magnitude of all covariates varied across the counties. Plots and dot charts of Oregon for the different coefficient estimates were made.
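One way such a coefficient data frame could be assembled is sketched below. The `alldata` object here is a synthetic stand-in with three hypothetical counties, and the use of `sapply` is an assumption about how the function was applied, not a record of the original code.

```r
set.seed(7)

# Synthetic stand-in for alldata: 3 hypothetical counties x 10 years
alldata <- expand.grid(year = 2008:2017, county = c("A", "B", "C"),
                       stringsAsFactors = FALSE)
alldata$female    <- rnorm(30, 50, 1)
alldata$childpov  <- rnorm(30, 20, 5)
alldata$medianage <- rnorm(30, 40, 4)
alldata$logsal    <- rnorm(30, 2.5, 0.5)

# Per-county regression, as in the function shown above
regfun <- function(x) {
  dat <- alldata[alldata$county == x, ]
  m <- glm(logsal ~ female + childpov + medianage + year + (medianage*year), data = dat)
  coefficients(m)
}

# Apply to every county and arrange as one row per county
counties <- unique(alldata$county)
coefs    <- t(sapply(counties, regfun))
coef_df  <- data.frame(county = counties, coefs, check.names = FALSE)
print(coef_df)
```

With only 10 yearly observations per county and six coefficients to estimate, each county-level fit has very few residual degrees of freedom, which may help explain why some estimates are extreme.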

For % Female

For % Child Poverty

For Median Age

For Year

For Median Age * Year

For the most part, coefficients for county % female, median age, year, and the interaction term clustered close to 0. Some counties showed highly positive or negative coefficient estimates, though no consistently high or low counties could be identified. The maps for the coefficients of median age and year are very similar, though I do not have a clear idea why this is the case. The map of the coefficients of child poverty showed the most varied distribution across space. Autocorrelation analysis of the residuals from this GWR model using Moran’s I found no evidence of a significant non-random spatial pattern.
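For readers unfamiliar with Moran’s I, the statistic and a simple permutation test can be written in a few lines of base R. This is a sketch on a toy layout of four areas in a row with made-up values; `morans_i` is a hypothetical helper, and the original analysis most likely relied on a spatial package rather than this hand-rolled version.

```r
# Moran's I for values x and a spatial weights matrix W with a zero diagonal
morans_i <- function(x, W) {
  z <- x - mean(x)
  n <- length(x)
  (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

# Toy layout: 4 areas in a row, where adjacent areas are neighbors
W <- matrix(0, 4, 4)
W[cbind(1:3, 2:4)] <- 1
W <- W + t(W)

resid_toy <- c(1, 2, 3, 4)  # stand-in for model residuals
I_obs <- morans_i(resid_toy, W)

# Permutation test: shuffle values across areas to approximate the null distribution
set.seed(123)
I_perm  <- replicate(999, morans_i(sample(resid_toy), W))
p_value <- (sum(I_perm >= I_obs) + 1) / (999 + 1)

print(I_obs)
print(p_value)
```

Values of I near zero, with a large permutation p-value, correspond to the "no significant spatial pattern" conclusion described above.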

  1. Critique of Methods Used

The temporal autocorrelation analysis was useful in that it provided evidence of temporal autocorrelation in the data, and the prior univariate spatial autocorrelation analysis provided limited evidence of variables being spatially autocorrelated at different spatial lags. However, I was unable to perform cross-correlation analysis between my outcome and predictor variables. I do plan on performing this analysis; I just need to figure out how the “ncf” package works in R. This is one of the more glaring shortcomings of my analysis so far: I do not have evidence that my outcome is correlated with my predictor variables at various distances. Another critique is that the choice of an ideal temporal correlation structure was fairly subjective, since I used AIC as my model selection criterion. Basing my decision on other criteria would likely change my covariance structure. A similar argument could be made for my variable selection, which used a backwards stepwise AIC approach; other selection criteria or methods would likely retain different variables in the model.

Finally, the results of my GWR analysis do not show actual drivers of reported Salmonella rates. Rather, they show the demographic characteristics associated with higher reported rates. While this information is useful, it does not identify any direct drivers of disease incidence. Further analysis will be required to see if these populations have different exposures or severity of risky exposures.

Determining suitable habitat for Olympia oysters in Yaquina Bay, OR

Exercise #1

Question that you asked:
My goal for my thesis work is to evaluate the distribution of native Olympia oysters in Yaquina Bay, Oregon by assessing habitat suitability through spatial analysis of three habitat parameters: salinity, substrate availability, and elevation. A map of predicted suitable habitat as a result of the spatial analysis will be compared with field observations of oyster locations within Yaquina Bay. The main research question I am examining in this project is:

How is the spatial pattern of three habitat parameters (salinity, substrate, elevation) [A] related to the spatial pattern of Olympia oysters in the Yaquina estuary [B] over time [C]?

For this blog post, I will be evaluating the [A] portion of this question and the three habitat parameters simultaneously to identify where habitat is least suitable to most suitable. To better understand the spatial pattern of the habitat parameters, I am evaluating a raster layer for each parameter, then combining them to determine where overlap between the layers shows the best environmental conditions for oysters to survive.

Name of the tool or approach that you used:
For this portion of my research analysis, I wanted to be able to make an educated guess about where the best and worst habitat for Olympia oysters would be located within Yaquina Bay by ranking different subcategories within each of the salinity, substrate, and elevation datasets.

To do this, I started by looking through the available literature on the subject and consulting with shellfish biologists to get an idea of what conditions oysters prefer in order to apply a ranking value. The following table is a compilation of that information:

| Habitat parameter | Subcategories | Subcategory variable range | Olympia oyster tolerance | Rank value applied |
|---|---|---|---|---|
| Mean wet-season salinity (psu) | Upper estuary | < 16 psu | somewhat, but not long-term | 1 |
| | Upper mid estuary | 16.1 – 23 psu | X | 4 |
| | Lower mid estuary | 23.1 – 27 psu | X | 3 |
| | Lower estuary | > 27 psu | somewhat | 2 |
| Substrate availability | 1.2 | Unconsolidated mineral substrate | possible | 3 |
| | 1.2.1.3.3 | Gravelly mud | unlikely | 2 |
| | 1.2.2.4 | Sandy mud | no | 1 |
| | 2 | Biogenic substrate | yes | 4 |
| | 3 | Anthropogenic substrate | yes | 4 |
| | 3.1 | Anthropogenic rock | yes | 4 |
| | 3.1.2 | Anthropogenic rock rubble | unlikely | 2 |
| | 3.1.3 | Anthropogenic rock hash | no | 1 |
| | 9.9.9.9.9 | Unclassified | uncertain | |
| Bathymetric depth (compared to MLLW) | 1.5 – 2.5m | supratidal | no | 1 |
| | 1 – 1.5m | supratidal | no | 1 |
| | 0.5 – 1m | intertidal | maybe | 2 |
| | 0 – 0.5m | intertidal | yes | 3 |
| | -2 – 0m | intertidal | yes | 4 |
| | -3 – -2m | subtidal | yes | 4 |
| | -4 – -3m | subtidal | yes | 4 |
| | -6 – -4m | subtidal | yes | 4 |
| | -8 – -6m | subtidal | yes | 3 |
| | -12.5 – -8m | subtidal | yes | 3 |

Once I established my own ranking values, I decided to use the ‘weighted overlay’ function, found within the Spatial Analyst toolbox in ArcGIS Pro. Weighted overlay applies a numeric rank to values within the raster inputs on a scale that the ArcGIS user is able to set, for example a scale from 1-9 where 1 marks areas of least fit and 9 marks areas of best fit. This can be used to determine the most appropriate site or location for a desired product or phenomenon. I used the ranking value scale 1-4, where 1 indicates the lowest suitability of subcategories for that parameter and 4 indicates the highest suitability.

Brief description of steps you followed to complete the analysis:

To apply the weighted overlay function:

  1. Open the appropriate raster layers for analysis in ArcGIS Pro. Weighted overlay will only work with a raster input, specifically integer raster data. Here, I pulled all three of my habitat parameter raster layers from my geodatabase into the Contents pane and made each one visible in turn as I applied the weighted overlay function.
  2. In the Geoprocessing pane, type ‘weighted overlay’ into the search box. Weighted overlay can also be found in the Spatial Analyst toolbox.
  3. Once in the weighted overlay window within the Geoprocessing pane, determine the appropriate scale or ranking values for the analysis. I used a scale from 1-4, where 1 was low suitability and 4 was high suitability.
  4. Add raster layers for analysis by selecting them from your geodatabase and adding them into the window at the top left. To add more than one raster, click ‘Add raster’ at the bottom of the window.
  5. Select one of the raster inputs and see the subcategories for that raster appear on the upper right. Here, ranking values within the predetermined scale can be individually applied to the subcategories by choosing from a drop-down list. Do this for each subcategory within each raster input. I ranked each subcategory within each of my habitat rasters according to the ranks listed in the table above.
  6. Determine the weights of each raster input. The weights must add up to 100, but can be manipulated according to the needs of the analysis. A raster input can be given greater or lesser influence if that information is known. For my analysis, I gave all three of my habitat raster inputs nearly equal weight (two inputs were assigned a weight of 33, and one was weighted 34 to equal 100 total).
  7. Finally, run the tool. Assuming no errors, an output raster will appear in the Contents pane and in the map window.
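
The arithmetic behind the steps above can be sketched outside of ArcGIS. The following base R example is only an approximation of what the tool does, as I understand it: each ranked input is multiplied by its percent weight, the products are summed per cell, and the result is rounded to an integer. The three small grids are hypothetical rank values, not the Yaquina Bay rasters.

```r
# Tiny hypothetical 2 x 3 grids of rank values (1 = least, 4 = most suitable)
salinity  <- matrix(c(1, 4, 4, 3, 3, 2), nrow = 2, byrow = TRUE)
substrate <- matrix(c(3, 4, 1, 2, 4, 4), nrow = 2, byrow = TRUE)
elevation <- matrix(c(1, 3, 4, 4, 4, 3), nrow = 2, byrow = TRUE)

# Percent influence of each input; the weights must sum to 100
weights <- c(salinity = 33, substrate = 33, elevation = 34)
stopifnot(sum(weights) == 100)

# Weighted sum per cell, rounded to an integer suitability class
overlay <- round((weights[["salinity"]]  * salinity +
                  weights[["substrate"]] * substrate +
                  weights[["elevation"]] * elevation) / 100)
overlay
```

With nearly equal weights, a cell only reaches the top class when at least two of the three inputs rank it highly, which is why a single poorly ranked parameter can pull an otherwise suitable cell down a class.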

Brief description of results you obtained:

The first three images show each habitat parameter weighted by suitability, with green indicating most suitable and red indicating least suitable.

Salinity —

Bathymetry —

Substrate —

The results of the final weighted overlay show that the oysters are most likely to be in the mid estuary where salinity, bathymetry, and substrate are appropriate.

 

Critique of the method – what was useful, what was not?:

The weighted overlay was a simple approach to combining all of the raster layers for habitat and creating something spatially meaningful for my research analysis. The areas indicated in green in the resulting map generally reinforce what was found in the literature and predicted by local shellfish biologists. While the weighted overlay tool did generate a useful visual, it is highly dependent on the quality of the raster inputs. In my analysis, the detailed resolution of the bathymetry layer was very helpful, but the substrate layer is a more generalized assessment of sediment types within Yaquina Bay. It doesn’t show the nuances of substrate availability that might be important for finding exactly where an opportunistic species like Olympia oysters might actually have settled. For example, in Coos Bay Olympia oysters have been found attached to shopping carts that have been dumped. The substrate raster is a generalized layer that uses standardized subcategories and does not pinpoint such small features.

Additionally, the salinity layer is an average of wet-season salinity, but salinity can change dramatically throughout the year. Some in situ measurements from Yaquina Bay this April showed that surface salinity in the area mapped to the 16-23 psu subcategory was actually <10 psu. While it is reasonable to generalize salinity for the purposes of this analysis, it is important to note that the oysters are exposed to a greater range over time.

This spatial information serves as a prediction of suitable oyster habitat. The next step is to compare this predicted suitability to actual field observations. I’ve recently completed my first round of field surveys and will be analyzing how closely the observations align with the prediction in Exercise #2.

Spatial Patterns of Salmonella Rates in Oregon

  1. Question Asked

At this stage I asked several questions regarding the spatial distribution of population characteristics in all counties in Oregon in 2014: What are the county level spatial patterns of reported age-adjusted Salmonella rates within Oregon in 2014? County level spatial patterns of proportions of females? Median Age? Proportion of infants/young children aged 0-4 years?

To answer these questions I used several different datasets. The first dataset used is a collection of all reported Salmonella cases in Oregon from 2008-2017 which includes information like sex, age group, county in which the case was reported, and onset of illness. The information in this dataset was deidentified by Oregon Health Authority. The second dataset used was a collection of Oregon population estimates over the same time period. This dataset includes sex and age group specific county level population information. I also obtained county level median ages from AmericanFactFinder. The last dataset used is a shapefile from the Oregon Spatial Data Library containing polygon information of all Oregon counties.

  1. Names of analytical tools/approaches used

I used a direct age adjustment (using the 2014 statewide population as the standard population) to obtain county level age-adjusted Salmonella rates. After calculating county level summary data e.g. proportion of females, proportion of children aged 0-4, median age, and age-adjusted Salmonella rates, I merged this information with a spatial dataframe containing polygonal data of every county in Oregon. After doing this I did both local (between 0-150 km) and global (statewide) spatial autocorrelation to get a Moran’s I statistic for each of the population variables listed above. I produced choropleth maps of each of the variables for Oregon as well. Finally, I produced a heatmap for county-level age-adjusted Salmonella rates using a Getis-Ord Gi* local statistic to evaluate statistically significant clustering of high/low rates of reported Salmonella cases.

  1. Description of the analytical process

After extensive reformatting, I was able to organize cases of Salmonella by age group and by county for the year 2014. After this I formatted 2014 county level population estimates in the same way. I then divided the Salmonella case dataframe by the population estimate dataframe to get rates by the different age groups. To get county age-adjusted rates I created a “standard population”; in this case I used Oregon’s statewide population broken down into the same age groups as above. I then multiplied each county’s age-specific rates by the standard population’s matching age groups to create a dataframe of hypothetical cases. This dataframe represents the number of cases we would expect in each of the counties if they had the same population and age distribution as Oregon as a whole. I summed the expected Salmonella cases by county and divided this number by the 2014 statewide population. This yielded age-adjusted reported Salmonella rates by county.
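
The direct age adjustment described above can be sketched for a single county in a few lines of R. All counts below are invented for illustration (three broad age groups rather than the groups used in the analysis) and are not ORPHEUS or Population Research Center figures.

```r
# Hypothetical counts for one county, three age groups
cases      <- c(`0-4` = 4,      `5-64` = 10,      `65+` = 6)
county_pop <- c(`0-4` = 2000,   `5-64` = 40000,   `65+` = 8000)

# Hypothetical statewide standard population, same age groups
std_pop    <- c(`0-4` = 230000, `5-64` = 3100000, `65+` = 640000)

# Age-specific rates in the county
age_rates <- cases / county_pop

# Cases expected if the county had the standard population's age structure
expected <- age_rates * std_pop

# Age-adjusted rate: total expected cases over the total standard population
adj_rate   <- sum(expected) / sum(std_pop)
crude_rate <- sum(cases) / sum(county_pop)

c(crude = crude_rate, adjusted = adj_rate) * 1e5  # per 100,000
```

In this made-up example the crude rate is 40 per 100,000 while the adjusted rate is about 43.2 per 100,000, because the hypothetical county has relatively fewer young children than the standard population.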

Given that the population data contained county level populations broken down by age group and by sex I was able to calculate proportions of county populations which were female, and which were young children aged 0-4 years by dividing those respective group populations by the total county population.

After this I performed local and global spatial autocorrelation with Moran’s I using the county level median age, proportion of children, proportion of females, and age adjusted Salmonella rates which were associated with centroid points for each county. The global Moran’s I was calculated using the entire extent of the state and the local Moran’s I was calculated by limiting analysis to locations within 150 km of the centroid. Both global and local Moran’s I statistics were calculated using the Monte-Carlo method with 599 simulations.
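
For readers unfamiliar with the Monte-Carlo approach, here is a minimal base R sketch of Moran’s I with a permutation test. It is not the code used for the results below: the centroids and values are simulated, the weights are simple binary indicators for centroids within 150 km, and the morans_i helper is a hypothetical name written for this example.

```r
set.seed(42)

# Simulated centroids (in km) and values for 12 hypothetical areas
n  <- 12
xy <- cbind(x = runif(n, 0, 400), y = runif(n, 0, 400))
z  <- rnorm(n)

# Binary neighbour weights: 1 if centroids are within 150 km, 0 otherwise
d <- as.matrix(dist(xy))
w <- (d > 0 & d <= 150) * 1

# Moran's I for values z and a weights matrix w
morans_i <- function(z, w) {
  zc <- z - mean(z)
  (length(z) / sum(w)) * sum(w * outer(zc, zc)) / sum(zc^2)
}

obs <- morans_i(z, w)

# Permutation test: shuffle the values over locations 599 times
sims <- replicate(599, morans_i(sample(z), w))

# One-sided pseudo p-value for positive autocorrelation
p_val <- (sum(sims >= obs) + 1) / (599 + 1)
c(I = obs, p = p_val)
```

Setting the distance threshold large enough to include every pair of centroids would give the global version of the same statistic.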

Finally, I completed a Hot Spot Analysis using Getis-Ord Gi* to assess for any statistically significant hot or cold spots in Oregon. This was only done for the age-adjusted Salmonella rates, using the same county centroid points as above. I completed this analysis with a local weights matrix using Queen Adjacency for neighbor connectivity. The weighting scheme was set so that, for each county, the neighbor weights summed to 1.
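
To make the Gi* statistic more concrete, the sketch below computes it in base R on a hypothetical 4 x 4 grid of areas with made-up rates. It uses binary queen weights that include each area itself, which is an assumption of this example rather than the exact setup above; rescaling each row of weights by a constant should leave the statistic unchanged.

```r
# Made-up rates on a hypothetical 4 x 4 grid, listed row by row,
# with the highest values in the top-left corner
rates <- c(30, 28, 12, 10,
           29, 27, 11,  9,
           12, 11, 10,  8,
           10,  9,  8,  9)
rc <- cbind(row = rep(1:4, each = 4), col = rep(1:4, times = 4))
n  <- length(rates)

# Chebyshev distance <= 1 gives queen neighbours (shared edge or corner);
# the diagonal is kept, so each area counts itself, as Gi* requires
w <- (as.matrix(dist(rc, method = "maximum")) <= 1) * 1

xbar <- mean(rates)
s    <- sqrt(sum(rates^2) / n - xbar^2)
wi   <- rowSums(w)     # sum of weights for each area
s1   <- rowSums(w^2)   # sum of squared weights for each area

# Gi* is a z-score: local weighted sum compared with its expectation
gi_star <- (as.vector(w %*% rates) - xbar * wi) /
  (s * sqrt((n * s1 - wi^2) / (n - 1)))

matrix(round(gi_star, 2), nrow = 4, byrow = TRUE)
```

Values above roughly 1.96 or below roughly -1.96 correspond to hot or cold spots at about 95% confidence, before any correction for multiple testing.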

  1. Brief description of results you obtained

Choropleth Maps of Oregon:

From the median age map, we can see that there are some clusters of older counties in the northeastern portion of the state and along the west coast. Overall, the western portion of Oregon is younger than the eastern portion of the state.

From the proportion of children map there are a few clusters of counties in the northern portion of the state with high proportions of children compared to the rest of the state. Overall, the counties surrounding the Portland metro area have higher proportions of children compared to the rest of the state.

From the proportion of females map, we can see that the counties with the highest proportion of females are clustered in the western portion of the state.

Finally, from the age-adjusted county Salmonella rates map we can see that the highest rates of Salmonella occur mostly in the western portion of the state with a few counties in the northeast having high rates as well. Overall, the counties surrounding Multnomah county have the highest rates of Salmonella.

The global Moran’s I statistics:

  • County proportions of females: 0.053 with a p-value of 0.15. This suggests slight clustering that is not statistically significant.
  • County median age: 0.175 with a p-value of 0.02. This provides evidence of significant mild clustering.
  • County proportions of children: 0.117 with a p-value of 0.05. This provides evidence of significant mild clustering.
  • County age-adjusted Salmonella rates: -0.007 with a p-value of 0.32. This suggests slightly more dispersion than would be expected, though not significantly so.

Local Moran’s I Statistics:

  • County proportions of females: 0.152 with a p-value of 0.02. This provides evidence of significant mild clustering.
  • County median age: 0.110 with a p-value of 0.07. This suggests mild clustering that is not statistically significant.
  • County proportions of children: 0.052 with a p-value of 0.1617. This suggests slight clustering that is not statistically significant.
  • County age-adjusted Salmonella rates: -0.032 with a p-value of 0.5083. This suggests slightly more dispersion than would be expected, though not significantly so.

Getis-Ord Gi*:

  • The heatmap shows a significant hotspot (with 95% confidence) in Clackamas county with another hotspot (with 90% confidence) in Hood River County. Three cold spots (with 90% confidence) are seen in Malheur, Crook, and Morrow counties.

  1. Critique of Methods

The choropleth maps were very useful for showing areas with high/low values; however, this method was not able to detect counties with significantly different values compared to their neighbors. Overall, it was useful as an exploratory tool. The global and local Moran’s I calculations were able to detect whether high/low values were closely clustered or more dispersed than expected. However, I am unsure if this method was completely appropriate given the coarseness of this county level data. At a local scale, only the proportion of women showed a significant amount of clustering, and globally median age and proportion of children showed some amount of significant clustering. Given that most of the Moran’s I statistics were not associated with significant values, I don’t believe this analytical method highlighted a particularly meaningful spatial pattern in my data. The heatmap provided evidence of some significant hot and cold spots in Oregon; however, this was based on immediate neighbor weights, and perhaps global weights would be more appropriate. Overall, this tool was very useful in detecting significantly high/low Salmonella rates.

Seth Rothbard My Spatial Problem

A description of the research question that you are exploring

Of the 31 pathogens known to cause foodborne illness, Salmonella is estimated to contribute to the second highest number of illnesses, the most hospitalizations, and the highest number of deaths in the US when compared to other domestically acquired foodborne illnesses [1]. Salmonellosis is the bacterial illness caused by Salmonella infection. It is estimated there are approximately 1.2 million cases of salmonellosis and around 450 deaths every year in the US due to Salmonella [1]. Over time there has been marked variability in the number of reported cases per year. Salmonellosis is a mandatory reportable illness in Oregon and available information indicates that incidence rates of this disease have been stable since the new millennium [2]. The objective of this study is to perform spatial analysis of lab-confirmed Salmonella in Oregon counties for the years 2008-2017, for which county level data are available, and determine whether some counties have a higher risk of Salmonella infection compared to others. I also wish to explore the socioeconomic factors associated with high incidence rate counties. The research question I wish to explore is: How are spatial patterns of Salmonella related to spatial patterns of socioeconomic factors? Certain socioeconomic conditions such as lower levels of education and income may increase rates of Salmonella in these populations as a result of improperly preparing/cooking foods, less strict sanitation practices, and/or higher rates of eating high risk foods.

A description of the dataset you will be analyzing, including the spatial and temporal resolution and extent

The Oregon Health Authority has created a database called the Oregon Public Health Epidemiology User System (ORPHEUS) as a repository for relevant exposure and geospatial data related to disease cases reported to public health departments all across the state. This database has been maintained by the state since 1989 and includes information regarding various diseases. The dataset I will be using is a collection of every single reported non-typhoidal Salmonella case within Oregon from 2008-2017. The distinction between typhoidal Salmonella and non-typhoidal is that the typhoidal variety of Salmonella causes typhoid fever while non-typhoidal Salmonella causes salmonellosis (a common gastrointestinal disease and a type of “food poisoning” as it is usually referred to). The spatial resolution of this data has been obscured to the county level to protect personal privacy and confidentiality. I will also be using data from the American Community Survey and the CDC’s Social Vulnerability Index. These datasets contain social vulnerability related variables for Oregon at the county level. In the case of the American Community Survey, data is available for the years 2009-2017 and the Social Vulnerability Index has data available for 2014 and 2016. Yearly county population estimates will also be used from Portland State University’s Population Research Center. Because of the high amounts of available data I will choose to start my exploratory analysis for Oregon in 2014 as all data is reported for that year.

Hypotheses: predict the kinds of patterns you expect to see in your data, and the processes that produce or respond to these patterns.

I expect counties with younger populations (higher proportions of infants and newborns) as well as counties with higher proportions of females to have higher adjusted incidences of Salmonella. Prior surveillance suggests that children under the age of 5 are at the highest risk for Salmonella infection, likely due to their developing immune system and how they interact with their environment. Specifically, many young children do not or are unable to wash their hands prior to touching their mouths. Females are also known to have a higher risk of Salmonella infection; however, the mechanism behind this is relatively unknown, with some explanations suggesting that females are more likely to have more interactions with young children. I also expect more socially vulnerable counties to have higher rates of Salmonella infections. Higher rates of poverty and lower amounts of education are often associated with more negative health outcomes.

Approaches: describe the kinds of analyses you ideally would like to undertake and learn about this term, using your data.

I would like to calculate age and sex adjusted rates of disease for each county in Oregon. I am also interested in undertaking cluster analysis and calculating spatial autocorrelation among Oregon counties over time. Finally, I would like to perform a regression of county disease incidence rates on the different socio-economic factors found in the American Community Survey and Social Vulnerability Index. I would be interested in learning about spatial Poisson regression to assess which variables are significantly associated with the presence of disease. I would also be interested in learning about hotspot analysis to evaluate if there are areas of Oregon with significantly higher disease rates. Ideally, all of my analyses will be performed in R and ArcGIS.

Expected outcome: what do you want to produce — maps? statistical relationships? other?

I would like to produce choropleth maps of adjusted Salmonella infection rates as well as for hotspot analysis. I want to produce regression models to describe how incidence rates of Salmonella vary across different socioeconomic indicators. I also want to create graphs to describe spatial autocorrelation patterns as well as to show disease rates over time.

Significance. How is your spatial problem important to science? to resource managers?

This analysis will be helpful to identify county populations which are at higher risk for Salmonella infections. The inclusion of social vulnerability variables will be useful for state/local policy makers. Reforms can be proposed or further studied to assess how addressing the needs of particularly vulnerable populations will affect the incidence of Salmonella. This research will be beneficial for further public health research as trends found here may also hold true for other foodborne illness. The aim of this research is to benefit the health of communities in Oregon by highlighting the association between social vulnerability and the risk of foodborne illness.

Your level of preparation: how much experience do you have with (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) image processing, (e) other relevant software

I have no experience with Arc-Info, programming in Python, or image processing. I have some limited experience with Modelbuilder. I am very comfortable performing statistical analyses within R and have some experience using the software to create maps using various packages.

References

  1. Estimates of Foodborne Illness in the United States. Centers for Disease Control and Prevention. https://www.cdc.gov/foodborneburden/2011-foodborne-estimates.html#modalIdString_CDCTable_0. Published July 15, 2016. Accessed July 31, 2018.
  2. Oregon Health Authority. Salmonellosis 2016 Report. Oregon Public Health Division. Available at: https://www.oregon.gov/OHA/PH/DISEASESCONDITIONS/COMMUNICABLEDISEASE/DISEASESURVEILLANCEDATA/ANNUALREPORTS/Documents/2016/2016-Salmon.pdf. Accessed July 31, 2018.