
Final

 

Question

How are the spatial patterns of individuals’ perceptions of natural resource governance related to the spatial patterns of environmental restoration sites, via the distance to and abundance of improved sites?

Datasets

 Puget Sound Partnership Social Data

My data came from a survey on subjective human wellbeing related to natural environments. The sample was a stratified random sample (28% response, n= 2323) of the general public from the Puget Sound in Washington from one time period. I am specifically examining questions related to perceptions of natural resource governance. Around 1730 individuals gave location data (cross street and zip code) which have been converted to GPS points. I also have demographic information for individuals, their overall life satisfaction, and how attached and connected they feel to the environment.

Environmental Data

I have three different forms of environmental data: (1) point locations of restoration sites in the Puget Sound; (2) a shapefile of census-tract-level average environmental effects (environmental condition ranked from 1, good, to 10, very poor); and (3) land cover raster files from 1992 and 2016.

  • Individual restoration sites numbered over 12,000. They spanned many years, and point locations overlapped significantly through time.
  • The environmental effects data come from an online mapping tool created through a collaboration between the University of Washington, the Washington Department of Ecology, and the Washington Department of Health. Environmental effects are based on lead risk from housing, proximity to hazardous waste treatment, storage, and disposal facilities (TSDFs), proximity to National Priorities List facilities (Superfund sites), proximity to Risk Management Plan (RMP) facilities, and wastewater discharge. These effects were aggregated within census tracts to produce an environmental effects ‘Rank’ from low effects (Rank of 1) to high effects (Rank of 10).
  • The land cover files contained 25 different types of land cover. I reclassified these land cover types by combining similar types of land cover (e.g. deciduous forest and conifer forest to ‘forest’).

 

Hypotheses

 1) Shorter distances between individuals and restoration sites, and a greater number of restoration sites near individuals, will correlate positively with governance perceptions

2) Positive environmental conditions will correlate positively with governance perceptions

Richard Petty and John Cacioppo developed the elaboration likelihood model (1980), which describes how individuals change their attitudes in response to persuasive stimuli. Petty and Cacioppo develop this idea through two routes of persuasion, the central and peripheral paths, where the route taken is determined by motivation, ability, and the nature of mental processing. The purpose of this theory is to help explain how individuals elaborate on ideas to form attitudes, and how strong, or long lasting, those attitudes are. This theory may suggest that individuals form stronger attitudes when the reasons to form attitudes are stronger and more immediate. Stimuli of this kind tend to be nearer to individuals, as processing of issues far away is cognitively difficult. Research has shown that individuals have a difficult time thinking about problems at larger spatial scales, and there is an inverse relationship between feelings of individual responsibility for environmental problems and spatial scale: as scale gets larger, feelings of responsibility go down (Uzzell, 2000).

Cognitive biases also suggest that people use heuristics to answer questions when they are unsure. One common heuristic is the saliency bias, in which individuals overemphasize things that are emotionally more important and ignore less interesting things (Kahneman, 2011). This may suggest that spatial patterns are more important than temporal patterns, because what is happening now is more salient to individuals than what happened in the past. It may also imply that even if environmental outcomes have improved over time, the immediacy of the current environment, or of things influencing the environment, has a greater effect on individual perceptions.

 

Approaches

Statistical Analyses

Spatial Autocorrelation Analyses

Moran’s I: To compute Moran’s I, I used the “ape” library in R. I also subset my data to examine spatial autocorrelation by demographics, including area (urban, suburban, rural), political ideology, life satisfaction, income, and cluster (created by running a cluster analysis on the seven variables that comprise the governance index).
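
A minimal sketch of this step, assuming the survey points sit in a data frame dat with lon/lat coordinates and a governance index column (all placeholder names):

library(ape)

dists <- as.matrix(dist(cbind(dat$lon, dat$lat)))   # pairwise distances between respondents
w <- 1 / dists                                      # inverse-distance weight matrix (duplicate locations would need handling)
diag(w) <- 0                                        # no self-weighting
Moran.I(dat$governance, w)                          # returns observed I, expected I, sd, and p-value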

Correlograms: I created correlograms for the variables that were significant (urban, conservative, and liberal) using the “ncf” library and the correlog() function. These figures give a better picture of spatial autocorrelation at various distances.
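
A hedged sketch of the correlogram step, using the same placeholder names and an assumed distance increment:

library(ncf)

cg <- correlog(x = dat$lon, y = dat$lat, z = dat$governance,
               increment = 0.05,   # width of each distance class (degrees); an assumption
               resamp = 100)       # permutations used to test significance
plot(cg)                           # correlation versus distance class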

Semivariograms: To create semivariograms, I used the “gstat” and “lattice” libraries which contain a function called variogram. This function takes the variable of interest along with latitude and longitude locations. The object created can then be plotted. For this analysis I used the same subsets as in the Moran’s I analysis. These figures are excluded from the results as they do not provide information beyond what the correlograms show.
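
A minimal sketch of the semivariogram step, assuming the same placeholder data frame is first promoted to a SpatialPointsDataFrame:

library(sp)
library(gstat)

spdf <- dat
coordinates(spdf) <- ~ lon + lat        # promote to a SpatialPointsDataFrame
v <- variogram(governance ~ 1, spdf)    # empirical semivariogram of the governance index
plot(v)                                 # plotted with lattice graphics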

Nearest Neighbor analysis: A file of restoration sites (points) was loaded into R. This analysis required four libraries because the points came from different files: nlme, rpart, spatstat, and sf. I first had to turn my points into point-pattern objects in R (.ppp objects). I used the convexhull.xy() function to create the ‘window’ for my point object, then used that to run the ppp() function. I then used the nncross() function, which produced the minimum, maximum, average, and quartile distances from individuals to the nearest restoration site. I added the distance to the nearest restoration site as a variable in a linear regression to test for an association. I used these distances (1st quartile, median, and 3rd quartile) to produce buffers and ran spatial joins between the buffers and the restoration sites. This produced an attribute table with a join count, the number of restoration sites within the buffer. In R, I ran a linear regression with join count as an independent variable.
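
An approximate reconstruction of those steps in spatstat (the respondent and restoration-site data frames, dat and rest, are placeholders):

library(spatstat)

ind_win  <- convexhull.xy(dat$lon, dat$lat)           # window around respondent points
ind_ppp  <- ppp(dat$lon, dat$lat, window = ind_win)   # respondents as a point pattern
rest_win <- convexhull.xy(rest$lon, rest$lat)
rest_ppp <- ppp(rest$lon, rest$lat, window = rest_win)

nn <- nncross(ind_ppp, rest_ppp)        # distance from each respondent to the nearest site
summary(nn$dist)                        # min, quartiles, median, max

dat$nn_dist <- nn$dist
summary(lm(governance ~ nn_dist, data = dat))   # is distance associated with perception?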

Geographically weighted regression: I joined individual points to my rank data, a shapefile at the census tract level. This gave individuals a rank for environmental effects at their location. Initially I used the GWR tool in Arc to run the regression, but I wanted the flexibility of changing parameters and the easily readable output that R provides. In R, I used two different functions to run GWR: gwr() and gwr.basic(). The gwr() function required creating a bandwidth using gwr.sel(), and gwr.basic() required creating a bandwidth using bw.gwr(). The difference between these functions is that gwr.basic() produces p-values for the betas. I ran GWR on both my entire data set and a subset based on perceptions. The subset was the ‘most agreeable’ and ‘least agreeable’ individuals, whom I defined as those one standard deviation above and below the mean perception.
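
A sketch of the two R routes (gwr()/gwr.sel() come from the spgwr package and gwr.basic()/bw.gwr() from GWmodel), assuming spdf is the SpatialPointsDataFrame of respondents joined to the rank data; the formula and variable names are placeholders based on Table 2:

library(spgwr)
library(GWmodel)

f <- governance ~ rank + lifesat + years + sex + area + educ + income + race + ideology

# spgwr: select a bandwidth with gwr.sel(), then fit with gwr()
bw1  <- gwr.sel(f, data = spdf)
fit1 <- gwr(f, data = spdf, bandwidth = bw1)

# GWmodel: select a bandwidth with bw.gwr(), then fit with gwr.basic(), which reports p-values for the betas
bw2  <- bw.gwr(f, data = spdf, approach = "CV", kernel = "bisquare")
fit2 <- gwr.basic(f, data = spdf, bw = bw2, kernel = "bisquare")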

Neighborhood analysis: I created a dataset of the upper and lower values of my governance perceptions (one standard deviation above and below the mean). I then added buffers to these points at 1 and 5 miles. I joined the buffers to the rank values to get an average rank within those buffers and exported the files for each buffer to R. I created density plots of average rank for the low governance values at each buffer and for the high governance values at each buffer.
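
The density plots themselves can be drawn with base R; a sketch with placeholder objects (low_gov, high_gov) and assumed column names for the buffer-averaged ranks:

plot(density(low_gov$rank_1mi, na.rm = TRUE), col = "black", lty = 1,
     main = "Average environmental-effects rank within buffers", xlab = "Average rank")
lines(density(low_gov$rank_5mi, na.rm = TRUE), col = "grey", lty = 2)
lines(density(high_gov$rank_1mi, na.rm = TRUE), col = "darkgreen", lty = 1)
lines(density(high_gov$rank_5mi, na.rm = TRUE), col = "darkgreen", lty = 2)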

T-test: To test whether land cover change differed between those with high and low perceptions, I exported the tables and calculated the change in land cover for the two samples. I then ran two-sample t-tests for each land cover type change between the two groups.
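
A compact way to run these tests, assuming a long-format table lc with placeholder columns cover_type, change, and group (above- vs. below-average perception):

results <- lapply(unique(lc$cover_type), function(type) {
  t.test(change ~ group, data = lc[lc$cover_type == type, ])   # two-sample t-test for one cover type
})
names(results) <- unique(lc$cover_type)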

Mapping

Kriging and IDW: I used the Spatial Analysis toolbox to perform Kriging and IDW and compare the outputs of the two techniques. I used my indexed variable of governance perceptions, whose values range from 1 to 7. I also uploaded a shapefile bounding the sample area.

Hotspot Analysis: I used the Spatial Analysis toolbox to perform a hotspot analysis, using my indexed variable of governance perceptions with values from 1 to 7.

Tabulate area: I used two land cover maps from 1992 and 2016 to approximate environmental condition. I loaded the raster layers into ArcGIS Pro, and reclassified the land types (of which there were 255, but only 25 present in the region) down to 14. I then created buffers of 5km for my points of individuals with governance scores one standard deviation above and below the average. I then tabulated the land cover area within the buffers for each time period.

 

Results

What are the spatial patterns of perceptions of natural resource governance?

Moran’s I for overall governance perceptions was not significant, indicating a lack of spatial autocorrelation at the regional scale (p = 0.51). Significant autocorrelation was detected when the data were subset by demographic variables. Individuals residing in urban areas exhibited spatial autocorrelation of perceptions, while individuals in suburban and rural areas did not. Lastly, individuals classified as ‘conservative’ (political ideology < 0.5) exhibited significant spatial autocorrelation for governance perceptions (I = -0.0062, p = 0.0018). The spatial autocorrelation for governance perceptions was mainly restricted to short distances, as shown in the correlograms below. Significant correlations for the entire study population and for individuals in urban areas were detected at near distances. Individuals classified as conservative exhibited significant correlation across multiple distances. No significant correlations were found among liberal individuals, suggesting a non-spatial mechanism may be at play.

 

Local patches of significantly lower perceptions (cold spots) and significantly higher perceptions (hot spots) were identified as shown in the map below. Three ‘cold spots’ were located in the areas around Port Angeles, Shelton, and the greater Everett area. Two hot spots were identified surrounding Bainbridge Island, and the city of Tacoma, which is where the Puget Sound Partnership is located.

Blue points are “cold spots” at 99%, 95%, and 90% confidence that the point resides in a cold spot. Cold spots are areas where perception is significantly lower than in the surrounding areas. Red points are “hot spots” at 99%, 95%, and 90% confidence that the point resides in a hot spot. Hot spots are areas where perception is significantly higher than in the surrounding areas. Small grey points are not significantly different. Bounding lines are counties in northwest Washington.

What are the spatial patterns of natural resource governance perceptions in relation to environmental condition?

The median distance from individuals to the nearest restoration site was 0.037 degrees; the 1st quartile was 0.020 degrees and the 3rd quartile was 0.057 degrees.

The regression on whether distance correlates with individual perception was not significant (p = 0.198), suggesting that distance from the nearest restoration site is not associated with perceptions. For each regression on the number of sites near an individual, all coefficients were negative, meaning the more sites near an individual, the more disagreeable their perception. All produced significant results, but the effect sizes were very small (Table 1).

Table 1. Regression results for ‘nearest neighbor’ of individuals to restoration sites.

Buffer Size               b        p-value   Effect Size (rpb)
Buffer 1 (1st quartile)   -0.003   0.0181    .003
Buffer 2 (median)         -0.002   0.045     .002
Buffer 3 (3rd quartile)   -0.001   0.002     .003

In a geographically weighted regression (GWR) model, rank, life satisfaction, years lived in the Puget Sound, and race are significant (Table 2). All other variables were insignificant. For rank (the combination score of environmental effects) the coefficient was positive. Higher rank values are worse environmental effects, so as agreeable perceptions increase, environmental condition decreases. Overall, the effect size of this model on governance perceptions was small, explaining about 10% of the variance in the data (Table 2).

A map of residuals from the GWR confirms the results from the hotspot analysis about where high and low responses congregate.

Table 2. Regression results for environmental effects at individuals’ locations.

Variable1                        b        p-value2
Rank                             0.077    0.007**
Life Satisfaction                0.478    <0.001**
Years Lived in the Puget Sound   -0.010   0.002**
Sex                              -0.156   0.289
Area                             -0.150   0.179
Education                        -0.002   0.922
Income                           -0.022   0.622
Race                             0.006    0.034*
Ideology                         0.103    0.533

1R2 = 0.094

2 ** = significant at the 0.01 level, * = significant at the 0.05 level

The plot of high and low governance values at the two buffers is presented below. The black and grey curves represent respondents who were at least one standard deviation below the mean (the mean was neutral). The green curves represent respondents who were at least one standard deviation above the mean. The solid curves are the average rank at the one-mile buffer, and the dotted curves are the average rank at the five-mile buffer.

This figure indicates there are two peaks in average rank, at low environmental effects (~1) and at mid-environmental effects (~4.75). Those with lower perceptions of environmental governance had higher peaks at low environmental effects for each buffer size. Those with higher perceptions of environmental governance had a bimodal distribution with peaks at low and mid-environmental effects.

 

What are the spatial patterns of natural resource governance perceptions in relation to environmental condition over time?

I used land cover change as a proxy for environmental condition over time. Land cover change differed significantly between the two groups for four land cover types: grassland, forest, shrub/scrub, and bare land. Grassland cover decreased for both groups, but decreased by 5% more in areas with individuals with below-average governance perceptions. Forest cover also decreased for both groups, but decreased by 1% less in areas with individuals with below-average governance perceptions (the amount of forest cover in the region is very large, which accounts for why a 1% difference is significant). Shrub and scrub cover increased for both groups, but increased by 3% more in areas with individuals with below-average governance perceptions. Lastly, bare land decreased for both groups, but decreased by 4% more in areas with individuals with below-average governance perceptions (Table 3). The effect size of land cover change on perception was small for each of the significant variables, with point-biserial correlations between .07 and .09 (Table 3).

Table 3. Differences in land cover change between individuals with above-average and below-average perceptions of natural resource governance.

                       Governance Perception1
Land Cover Type        Above Average   Below Average   t-value   p-value   Effect size (rpb)
High Development       17              16              0.80      .426      .02
Medium Development     19              19              0.67      .500      .02
Low Development        11              11              0.22      .829      .00
Developed Open Space   12              13              0.21      .837      .00
Agriculture            -1              0               0.74      .461      .02
Grassland              -13             -18             3.03      .002      .09
Forest                 -10             -9              2.61      .009      .08
Shrub/Scrub            40              43              3.04      .002      .09
Wetlands               0               0               1.67      .095      .05
Unconsolidated Shore   2               2               0.10      .918      .00
Bare Land              -21             -25             2.43      .015      .07
Water                  0               0               0.90      .368      .03
Aquatic Beds           -1              -5              1.19      .233      .03
Snow/Tundra            0               0               0.00      .997      .00

1Cell entries are percent changes of land cover from 1992 to 2016

As one land cover type goes up, another land cover type must go down. While forest cover overall decreased in the region for both groups over time, areas with lower perceptions saw significantly less decrease. The percentage is small, but much of the region is forested, so the actual amount of land is large. The two maps below show land cover in 2016 (top) and 1992 (bottom) around Port Angeles, an area with a significant cold spot (low perceptions). In these maps you can see a spread of the dark green color (forest) and a decrease of the yellow (grassland). Port Angeles was historically a timber community (the same could be said for the areas around the other cold spots). This change in forest and grassland could be a source of negative perceptions if these communities believe the governance structures are responsible for the change in forest cover.

Significance

Overall, in this analysis, poor environmental condition relates to positive perceptions, and land cover change may provide insight into the reasons for poor perceptions. Areas with negative perceptions may not directly indicate poor natural resource governance. Elsewhere, trust in natural resource governance was found to be primarily driven by individual value orientations (Manfredo et al., 2017). The three areas identified as cold spots in this study could be areas where individuals do not feel their values are represented by the natural resource governing systems. Conversely, hot spot Area A is located near the headquarters of the Puget Sound Partnership, potentially facilitating trust, representation, and access to information that may influence the perceptions of individuals located nearby. This urban, liberal, developed area contrasts with the area to the west, which contains a highly significant cold spot (Area 2). Individuals from Area 2, a more rural, conservative, historic logging area, likely have different value orientations than those from Area A. Similar comparisons could be made between the other cold spots and the areas near them. More research should be conducted to determine whether value orientations have a significant influence on perceptions of natural resource governance.

 

Your learning

 

During this course I significantly advanced my knowledge of ArcGIS Pro and RStudio. I learned how to run many new tools in Arc, including hotspot analysis, geographically weighted regression, and Tabulate Area. I also learned new skills in troubleshooting problems. For example, I was having difficulty getting the rank data to display on my map. Another student taught me that because I imported the data from a .csv and joined it to a shapefile, I needed to create a new feature class from it before it would display. I also gained knowledge of how to run more spatial analyses in R, including many spatial autocorrelation analyses and geographically weighted regression (which involved creating a point object!).

 

What did you learn about statistics?

 

I first and foremost learned of the existence of many types of spatial statistics, including hotspot analysis (Getis-Ord Gi*), spatial autocorrelation (correlograms, semivariograms, Moran’s I), and geographically weighted regression (GWR). For hotspot analysis, I learned what the hotspot clusters mean and the best ways to pick the fixed-distance band of neighbors the tool looks at (either by selecting a minimum number of neighboring points, or by looking at the z-score at the distance that shows the greatest spatial autocorrelation). Speaking of spatial autocorrelation, I learned what that means in terms of the correlation of perceptions of governance between individuals depending on their distance from each other. Finally, I learned a little about GWR, including that it is necessary to plot the output to examine patterns over the space in question, and that the reason Arc does not display p-values for the independent variables is not something to do with the equation, but actually due to the computing power of Arc.

Deaggregation of multi-hazard risks, losses, and connectivity: An application to the joint earthquake-tsunami hazard at Seaside, OR

Research Question

The Pacific Northwest is subject to the rupture of the Cascadia Subduction Zone (CSZ) which will result in an earthquake and nearfield tsunami. Low-lying communities along the coast (such as Seaside, OR) are susceptible to infrastructure damage from both earthquake ground shaking and tsunami inundation. Given resource constraints (budget, personnel, time, etc.), it is not feasible for city planners and resource managers to expect to mitigate all possible damages resulting from an earthquake/tsunami event; however, it is possible to optimize resources in order to minimize the expected damages. Consequently, a first step toward resource optimization is a thorough understanding of the risks posed by the CSZ.

For this project, I investigated how the spatial pattern of tax-lot damage and connectivity to critical infrastructure following a multi-hazard earthquake/tsunami in Seaside, OR can provide insight into possible hazard mitigation measures. The damages and connectivity were deaggregated by both hazard (earthquake vs. tsunami) as well as intensity of the event. An understanding of the deaggregated hazard provides insight into vulnerable areas within Seaside.

Description of Dataset

My dataset consists of maps of the built environment (or infrastructure) and hazard maps (earthquake ground shaking and tsunami inundation) for Seaside, OR. The infrastructure is composed of buildings, an electric power network, a transportation network, and a water supply network (Figure 1). The earthquake and tsunami hazard maps were previously generated through numerical modeling (Park, Cox, Alam, & Barbosa, 2017). Due to computational limitations, tsunami inundation is modeled using a “bare-earth” model indicating that no infrastructure is included.

In addition to the infrastructure and hazard maps, a suite of probabilistic damage codes was utilized. These codes implement Monte Carlo methods to evaluate the expected damages, economic losses, and connectivity associated with earthquake/tsunami hazards.

Figure 1: Infrastructure at Seaside (Kameshwar et al., n.d.)

 

Hypotheses

The project was divided into three exercises. The questions posed by each exercise, and hypotheses are outlined below:

  1. What is the spatial pattern of economic losses resulting from a joint earthquake/tsunami event? How does each hazard contribute to this spatial pattern?
    1. Earthquakes and tsunamis pose different hazards to the built environment. The former results in strong ground shaking whereas the latter results in inundation. Tsunami inundation is much more spatially variable due to limiting characteristics such as ground elevation and friction losses. Conversely, earthquake ground shaking is not limited by elevation or friction and is subsequently not as spatially variable within a geographic region (especially at scales the size of Seaside). Because of these differences in hazardous conditions, I expect the spatial pattern of tsunami losses to be concentrated along the coast, whereas the spatial pattern of economic losses will be dispersed around Seaside.

 

  2. How does the spatial pattern of economic losses relate to the spatial pattern of tsunami momentum flux?
    1. Because there is significantly more spatial variability in the tsunami hazard compared to the earthquake hazard, I wanted to investigate how this spatial pattern relates to the spatial pattern of tsunami economic losses. Economic losses are driven by the intensity of the hazard, therefore, I expect a significant correlation between economic losses and tsunami momentum flux (a measure of both inundation depth and velocity).

 

  3. How vulnerable is Seaside’s networked infrastructure to a joint earthquake/tsunami event?
    1. While economic losses to infrastructure can play a role in hazard mitigation planning, they should not serve as the only driver. Mitigation planning should also (and perhaps more importantly) consider resident accessibility to critical services and facilities immediately following a disaster. Given the importance of resident accessibility, I wanted to perform a connectivity analysis of networked infrastructure (electricity, transportation, and water) following a joint earthquake/tsunami event. The networked infrastructure in Figure 1 shows that failure of a few key links could severely limit access for large populations of Seaside (e.g. the network is not highly “parallel”, but exhibits some “series” features). For example, failure of the pipes leading to the water treatment plant (Figure 1-d) would result in complete disconnection of water to Seaside. Given the structure of the networks at Seaside, I expect sharp increases in disconnectivity if some of the key links fail.

Approaches/Methods

  1. Spatial pattern of economic losses:
    1. To evaluate the spatial pattern of economic losses, I created real market value and economic loss heatmaps. Economic loss heatmaps provide insight into densely populated regions within Seaside that could potentially benefit from a single mitigation option. Heatmaps were generated by using the kernel density estimation tool within the QGIS processing toolbox.
  2. Relationship between economic losses and tsunami momentum flux:
    1. Performing a geographically weighted regression (GWR) provided insight into the relationship between economic losses and tsunami momentum flux. Here, the percent of economic loss depended on the tsunami momentum flux. The GWR was performed using the python spatial analysis library PySal.
  3. Connectivity of networked infrastructure:
    1. Connectivity analyses between tax lots and critical infrastructure provided insight into the probability of tax lots becoming disconnected from critical infrastructure. Additionally, maps showing the probability of link failure were generated to isolate vulnerable links within the networks. The connectivity analysis was performed using the python network analysis package python-igraph.

Results

Spatial pattern of economic losses

Heatmaps showing the spatial pattern of economic losses deaggregated by both hazard and intensity are shown in Figure 2. A hot spot of damages is located at the central business district (CBD) and appears for low-magnitude earthquake events. The tsunami damages are more evenly distributed along the coast relative to the earthquake damages. Figure 3 shows the total economic risk for all of Seaside deaggregated by hazard and intensity. Here, risk is defined as the economic losses multiplied by the annual probability of occurrence (the inverse of the return period); for example, losses from a 500-year event are weighted by an annual probability of 1/500 = 0.002. Quantifying the economic losses as risk allows for the isolation of events that are both likely to occur and produce significant damages. It can be seen in Figure 3 that the 250- to 1000-year events pose the highest economic risks. If economic losses are a priority, a city planner could use Figures 2 and 3 to identify regions of buildings within Seaside that are vulnerable to earthquake damage; these regions would benefit the most from mitigation options.

Figure 2: Deaggregated heatmaps of economic losses

Figure 3: Deaggregated economic risks for all of Seaside

Relationship between economic losses and tsunami momentum flux

The relationship between percent economic loss and tsunami momentum flux was measured by performing a geographically weighted regression (GWR). A key parameter of the GWR is the bandwidth, which describes the area over which each local regression is performed. A bandwidth of 200 was initially used (see Exercise 2). The bandwidth was then optimized by comparing the resulting r2 values of multiple GWRs (Figure 4). A bandwidth of 75 results in the largest r2 value and was subsequently used for further analysis. Bandwidths below this value resulted in errors in the GWR code.

The results from the GWR with the updated bandwidth are shown in Figure 5. The slope is small near the coast, with hotspots of larger values located further inland. Conversely, the intercept is large near the coast and decreases further inland. The intercept pattern can be explained by the momentum flux, which is large near the coast and decreases as the tsunami propagates landward.

Interestingly, some of the slope values are less than 0, indicating that the damages decrease as momentum flux increases. This is likely explained by the heterogeneity of building types within Seaside. For example, concrete buildings are more resistant to tsunami damage than wood buildings. Consequently, a concrete building that experiences a large momentum flux may sustain less damage than a wood building that experiences a small momentum flux. To validate this, the buildings in Seaside were deaggregated by their characteristics and a linear regression was performed for each class (Figure 6). The buildings were classified according to construction material (wood vs. concrete), year built, and number of floors. Here, W1 corresponds to one-story wood buildings, W2 corresponds to two-story wood buildings, and C1 corresponds to concrete buildings with a moment frame. The concrete moment-frame buildings can be further divided into those with fewer than or greater than four stories. It can be seen in Figure 6 that all regression slopes are positive, with relatively high r2 values. Considering the most extreme case, if a W1 building built before 1979 is next to a C1 building built after 2003, the resulting local regression slope would likely be negative.

Figure 4: GWR bandwidth vs. r^2. Used for bandwidth selection.

Figure 5: Slope and intercept results from GWR analysis

Figure 6: Linear regression deaggregated by building characteristics

Connectivity of networked infrastructure

The probability of each tax lot being connected to critical facilities was evaluated by performing a connectivity analysis using the EPN, transportation, and water networks (see networks in Figure 1). The total fraction of tax lots disconnected from critical infrastructure is shown in Figure 7 (left-hand side). Similar to the economic risks, the fraction disconnected was multiplied by the annual probability of occurrence to determine the disconnectivity risk. It can be seen that for the 1000-year event, nearly all of Seaside becomes disconnected. Furthermore, the 250- to 500-year events pose the highest risk across the three networked infrastructures.

The network performance can also be characterized spatially by creating maps indicating the probability of link failure and tax lot disconnection. An example is shown in Figure 8 for the tsunami damage resulting from a 500-year event. This case was selected to view spatially because it drives the risk associated with the transportation network. Figure 8 shows that tax lots west of the Necanicum River have a high probability of becoming disconnected. Viewing the link failure map, it becomes clear that the disconnection is not driven by bridge failure, but rather from road washout.

Figure 7: Disconnection results: (left) fraction of tax lots disconnected from critical infrastructure and (right) disconnection risk

Figure 8: Probability of (left) link failure and (right) tax-lot disconnection

Significance

The ability to spatially isolate vulnerable infrastructure within a community serves as a first step in optimizing resource management. Being able to identify highly vulnerable areas ensures that resources are not spent in areas that are already relatively resilient to the earthquake/tsunami hazard. The risk results from this analysis show that Seaside is particularly vulnerable to the 250- to 1000-year rupture of the CSZ. There is a large concentration of economic value located at the city center of Seaside. If resource managers and city planners place an emphasis on reducing economic losses, then they should focus on mitigation measures to reduce damages within this region. However, more important than economic losses is resident and tourist accessibility to critical facilities following a natural disaster. Network mitigation measures should focus on reducing the damages associated with the 250- to 500-year events.

The results from this analysis are not comprehensive and future work should include:

  1. Spatial analysis of mitigation measures – Additional damage modeling should be performed with various mitigation measures in place. For example, the effect on transportation connectivity resulting from earthquake retrofitting of bridges can be incorporated into the damage model. Similar spatial maps as those produced in this project could be created in order to determine the spatial effect that mitigation measures have.
  2. Incorporate population in analyses – Climate change and natural disasters tend to have a disproportionate effect on populations within a geographic region. Future work should include spatially mapping socio-economic vulnerability indices to identify vulnerable populations within Seaside.
  3. Perform additional GWRs – It was shown that the building classification led to counterintuitive results in the GWR. Additional GWRs could be performed by only considering similar building types within a geographic region.

Learning

Throughout this term, I learned how to both use new software as well as apply new statistical methods.

  1. Software
    1. Python – Prior to this course, I was familiar with both QGIS and python; however, I had not used the two together. A personal goal was to learn how to perform geospatial analysis in python. This was accomplished by performing the GWR using the python library PySAL. I additionally learned how to read and modify shapefiles in python using the geospatial library geopandas.
    2. QGIS processing toolbox – I learned how to use multiple tools within the QGIS processing toolbox (e.g. heatmap interpolation and hotspot analysis).
  2. Statistics –
    1. Hot spot analysis – Although I did not use hotspot analysis for my project, I learned about this method and did a test case for exercise 1. The hotspot analysis was used to isolate areas within Seaside that resulted in disproportionate damages relative to the surrounding area.
    2. Geographically weighted regression – Geographically weighted regression was performed to evaluate the relationship between tsunami momentum flux and percent damage.

As a significant portion of my research will be spatially oriented, the tools and skills I learned during this term will be beneficial for future work. Furthermore, this course introduced additional geospatial statistical methods that were not implemented for this project but could be relevant for additional work.

References

Kameshwar, S., Farokhnia, K., Park, H., Alam, M. S., Cox, D., Barbosa, A., & van de Lindt, J. (n.d.). Probabilistic decision-support framework for community resilience: Incorporating multi-hazards, infrastructure interdependencies, and target objectives in a Bayesian network. Reliability Engineering and System Safety.

Park, H., Cox, D. T., Alam, M. S., & Barbosa, A. R. (2017). Probabilistic Seismic and Tsunami Hazard Analysis Conditioned on a Megathrust Rupture of the Cascadia Subduction Zone. Frontiers in Built Environment, 3(June), 1–19. https://doi.org/10.3389/fbuil.2017.00032

 

Spatial and Temporal Patterns of Reported Salmonella Rates in Oregon

  1. Question Asked

Here I asked whether there was evidence supporting temporal autocorrelation of age-adjusted Salmonella county rates within Oregon from 2008-2017 and, if so, what type of correlation structure is most appropriate. I also investigated spatial patterns of reported Salmonella rates as they relate to various demographic variables: percent of the county aged 0-4, percent aged 80+, percent female, median age, percent of county residents who have completed high school, median county income, percent born outside the US, percent who speak a language other than English at home, percent estimated to be in poverty, and percent of children estimated to be in poverty.

To answer these questions, I used the same data outlined in the first exercise blog post, with newer demographic variables taken from the American Community Survey and published on American FactFinder, which provides yearly estimates of demographic information at the county level. Unfortunately, yearly data prior to 2009 are unavailable, which shortened the window of analysis by a year.

  2. Names of analytical tools/approaches used

The approach used to answer these questions was first to create an exhaustive general linear model in which county Salmonella rates were a function of the demographic variables listed above. A Ljung-Box test was used to assess whether there was evidence of non-zero autocorrelation of the model residuals at various time lags. Following this, a linear model was selected using a stepwise AIC selection process, and different variance structures were compared using AIC, BIC, and log-likelihood metrics as well as ANOVA testing. After selecting an appropriate base model and variance structure, I allowed for interaction between all variables and time and performed more ANOVA testing to select the best model allowing for variable-time interaction. A version of this model was later used in the geographically weighted regression analysis, in which the coefficients were allowed to vary across space in Oregon.

  3. Description of the analytical process

I created a lattice plot of reported age-adjusted Salmonella rates over the 10-year period in every county to visually assess whether Salmonella rates were changing over time. After seeing that rates of reported Salmonella were changing over time in many Oregon counties, I created a “full” linear model in which the rate of reported Salmonella cases in a county was a function of the demographic variables described above. Because this is longitudinal data measured over time, I wanted to see if rates of Salmonella were correlated over time, meaning that a county’s rate in one year could be predicted from the rate found in another year in the same county. I first performed a Ljung-Box test to assess the need to evaluate temporal autocorrelation, tested normality assumptions, and log-transformed my outcome (Salmonella rates) based on those normality tests. A simple backward stepwise AIC approach was used on my full model to identify and remove extraneous variables. This works by removing variables from the full model in a stepwise fashion, comparing the AIC values between the two models, and continuing this process until the AIC values between the models being compared are not significantly different. I then used this model to select an ideal variance structure to compare Salmonella rates autocorrelated at different time lags. The variance structures compared were: independent, compound symmetric, AR(1), MA(1), ARMA(1,1), unstructured, and compound symmetric with different variances across time. After AIC, BIC, log-likelihood, and ANOVA testing, an ideal variance structure was determined, and the model using this variance structure was evaluated for basic interaction with time. All variables present in the model were allowed to have a time interaction, including time itself (i.e., allowing for a quadratic time trend). Once again, AIC, BIC, log-likelihood, and ANOVA testing were used to select the most ideal model.
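
A hedged sketch of these model-selection steps in R (variable and object names are placeholders; the correlation and variance structures mirror the ones compared above):

library(MASS)   # stepAIC()
library(nlme)   # gls(), corCompSymm(), corAR1(), varIdent()

full <- lm(logsal ~ pct04 + pct80 + female + medianage + hs + income +
             foreignborn + language + poverty + childpov + year, data = alldata)
Box.test(resid(full), lag = 5, type = "Ljung-Box")   # evidence of non-zero autocorrelation? (lag is an assumption)

base <- stepAIC(full, direction = "backward")        # drop extraneous variables

m_cs  <- gls(logsal ~ female + childpov + medianage + year, data = alldata,
             correlation = corCompSymm(form = ~ 1 | county), na.action = na.omit)
m_ar1 <- update(m_cs, correlation = corAR1(form = ~ year | county))
m_csv <- update(m_cs, weights = varIdent(form = ~ 1 | year))   # different variance per year

AIC(m_cs, m_ar1, m_csv)      # compare structures
anova(m_cs, m_csv)           # likelihood ratio comparison of nested structures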

Following this I moved on to GWR, where I was able to use the model identified above to create a new data frame containing Beta coefficient estimates of the significant variables in my final model for every county in Oregon. This data frame of coefficients was merged with a spatial data frame containing county level information for all of Oregon. Plots of the coefficient values for every different county were created.

  4. Brief Description of Results

Every panel represents one of Oregon’s 36 counties; panels containing an error correspond to counties with no reported cases of Salmonella. Some counties show rates decreasing with time, others show slightly increasing trends, and others show a fairly level rate over time. Clearly there is evidence of some time trend for some counties.

Results of normality testing found that age-adjusted rates of reported Salmonella in Oregon were not normally distributed. For ease of visualization, and as an attempt to address the failure to meet the normality assumption of linear modeling, I log-transformed the rates of reported Salmonella. Results of the Ljung-Box test on my full model provided evidence of non-zero temporal autocorrelation in the data, and a visual inspection of the lattice plot supports this, with most counties showing a change in rates over time.

The stepwise AIC model selection approach yielded the following model:

logsal ~ %female + %childpov + medianage +year

Covariance structure comparison:

Covariance Model                                    logLikelihood   AIC   BIC
Independent                                         -431            874   896
Compound Symmetry                                   -423            860   887
AR(1)                                               -427            869   895
MA(1)                                               -427            869   895
ARMA(1,1)                                           -423            862   892
Unstructured                                        -387            859   1017
Compound Symmetry, Different Variance Across Time   -412            854   910

 

The AIC, BIC, and log-likelihood values were mostly clustered together across the different models. I decided to base my choice primarily on the lowest AIC because that is how I did variable selection to this point. This resulted in choosing a compound symmetric model that allowed for different variances across time.

Next, I built models that allowed for simple interaction with time, meaning that three-way interactions with time were not evaluated. Subsequent ANOVA testing comparing the different interaction models to each other, to a model with no interaction, and to a model without time was used to select a final model.

Final Model: 

logsal ~ %female + %childpov + medianage + year +(medianage*year)

This model follows a 5×5 compound symmetric correlation variance model which allows for variance to change over time.

Code:

library(nlme)  # gls(), corCompSymm(), varIdent()

interact_m <- gls(logsal ~ female + childpov + medianage + year + (medianage * year),
                  data = alldata,
                  na.action = na.omit,
                  correlation = corCompSymm(form = ~ 1 | county),  # compound symmetry within county
                  weights = varIdent(form = ~ 1 | year))           # separate variance for each year
Within County Standard Error  (95% CI): 0.92
Estimate Name     Estimate (log-scale)   Std. Error   p-value
Intercept         -759.42                237.79       0.002
% Female          18.06                  4.46         <0.001
% Child Poverty   0.03                   3.27         0.001
Median Age        17.16                  3.11         0.002
Year              0.38                   3.18         0.002
Median Age*Year   -0.01                  -3.12        0.002

 

Estimates are on the log scale, making them difficult to interpret without exponentiation; however, it can be seen that a one-percent change in a county’s female population or a one-year change in median age is associated with a much larger change in rates of reported Salmonella than changes in percent child poverty or in year. Overall, incidence rates of reported Salmonella were shown to increase with time, county percent female, county percent child poverty, and county median age, with a significant protective interaction effect between time and median age.

For my GWR analysis I used a function derived from the rspatial.org website, which looks like:

regfun <- function(x) {
  # Fit the selected model using only the observations from county x
  dat <- alldata[alldata$county == x, ]
  m <- glm(logsal ~ female + childpov + medianage + year + (medianage * year), data = dat)
  coefficients(m)  # return the county-specific coefficient estimates
}
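
Applying this function to every county then builds the coefficient data frame that is merged with the Oregon spatial layer (a sketch; object names are assumptions):

counties <- unique(as.character(alldata$county))
coefs <- t(sapply(counties, regfun))            # one row of coefficient estimates per county
coefs <- data.frame(county = counties, coefs, row.names = NULL)
head(coefs)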

As can be seen, I retained the same significant variables found in the OLS regression for my time series analysis. GWR in this case allows for the coefficient estimates to vary by county.

This allowed me to create a data frame of all coefficient estimates for every county in Oregon. Subsequent dot charts showed the direction and magnitude of all covariates varied across the counties. Plots and dot charts of Oregon for the different coefficient estimates were made.

(Coefficient maps are shown for % female, % child poverty, median age, year, and median age * year.)

For the most part, the coefficients for county % female, median age, year, and the interaction term clustered close to 0 for most counties. Some counties showed highly positive or negative coefficient estimates, though no consistently high or low counties could be identified. The maps of the coefficients for median age and year are very similar, though I do not have a clear idea why this is the case. The map of the coefficients for child poverty showed the most varied distribution across space. Moran’s I analysis of the residuals from this GWR model did not find any evidence of significant autocorrelation; I could not find evidence of a significant non-random spatial pattern in the residuals of my model.

  5. Critique of Methods Used

While the temporal autocorrelation analysis was useful in that it provided evidence of temporal autocorrelation in the data, and prior univariate spatial autocorrelation analysis provided limited evidence of variables being spatially autocorrelated at different spatial lags, I was unable to perform cross-correlation analysis between my outcome and predictor variables. One important note: I do plan on performing this analysis; I just need to figure out how the “ncf” package works in R. One of the more glaring shortcomings of my analysis so far is that I do not have evidence that my outcome is correlated with my predictor variables at various distances. Another critique is that the choice of an ideal temporal correlation structure was fairly subjective, with my model selection criterion being AIC; basing the decision on other criteria would likely change the covariance structure. A similar argument applies to my variable selection, which was based on a backward stepwise AIC approach; other selection criteria or methods would likely retain different variables in the model.

Finally, the results of my GWR analysis do not show actual drivers of reported Salmonella rates. Rather, they show the demographic characteristics associated with higher reported rates. While this information is useful, it does not identify any direct drivers of disease incidence. Further analysis will be required to see whether these populations have different exposures or severity of risky exposures.

Exercise 2: Geographically weighted regression on two forested stands.

Bryan Begay

  1. Initial Spatial Question: How does the spatial arrangement of trees relate to forest aesthetics in my areas of interest?

Context:

To understand forest aesthetics in my stand, called Saddleback, I did a Ripley’s K analysis on Saddleback and on a riparian stand called Baker Creek to determine whether the stands are clustered or dispersed. The Baker Creek site is a mile west of the Saddleback stand.
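
For reference, the Ripley’s K check can be sketched in R with spatstat (tree coordinates and the rectangular window are placeholders; the original analysis may have used different tools):

library(spatstat)

trees_ppp <- ppp(trees$x, trees$y,
                 window = owin(xrange = range(trees$x), yrange = range(trees$y)))
K <- Kest(trees_ppp)
plot(K)   # observed K above the theoretical Poisson curve suggests clustering; below suggests dispersion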

  2. Geographically Weighted Regression:

I performed a geographically weighted regression on both the Saddleback and Baker Creek stands. The dependent variable was a density raster value and the explanatory variable was tree height.

  3. Tools and Workflow

Figure 1. The workflow for creating the Geographically Weighted Regression for the Saddleback Stand. The Baker Creek stand followed the same workflow as well.

Results:

 

Figure 2. Geographically Weighted Regression showing the explanatory-variable coefficients in the Saddleback and Baker Creek stands near Corvallis, Oregon. Yellow indicates negative relationships and hotter colors indicate positive relationships between tree height and density.

Figure 3. Geographically Weighted Regression showing the Local R2 values in the Saddleback and Baker Creek stands near Corvallis Oregon. Yellow color indicates that the local model is performing poorly, while hotter colors indicate better performance locally.

Table 1. Summary table output for the Saddleback stand’s geographically weighted regression.

Table 2. Summary table output for the Baker Creek stand’s geographically weighted regression.

4. Interpretation/Discussion:

Having done the Ripley’s K analysis, I wanted this exercise to connect to it, so I created a point density raster for both of my stands (Figure 1). The point density raster calculates a magnitude per unit area from my tree points and outputs a density for the neighborhood around each tree point. The raster values then describe each tree’s neighborhood density. Having the neighborhood density values describes the stands’ tree spatial arrangement and relates to the Ripley’s K outputs, which tell whether a stand is spatially clustered or dispersed.

Figure 2 shows that there is a spatial pattern in the Saddleback stand between density and height. There is a positive relationship on the edges of the stand and a negative relationship in the middle of the stand between the two variables. This makes sense when thinking about how the stand would have denser and taller trees on the edges of the managed stand to screen the forest operations. The coefficient values for Baker Creek showed a positive relationship in the northeastern portion of the stand, which would need further investigation to understand the relationship between density and height. Overall, the relationship between density and height in the Baker Creek stand was negative, but this may be attributed to the low local R2 values that indicate poor model performance (Figure 3). Table 2 also shows that the Baker Creek model only accounted for 50% of the variance (adjusted R2), which suggests that more variables are needed for the riparian stand. Table 1 shows the summary table for the GWR in the Saddleback stand.

  5. Critiques

One critique of this exercise is that I only looked at height and density. If I had more knowledge of working with LAS data sets, I would have liked to use the LiDAR return values as an indicator of density. Another critique is that I used density as the dependent variable and height as the explanatory variable. Using density as the dependent variable lets me see the spatial patterning of my trees when plotted in ArcMap, so I can reference the Ripley’s K outputs for further analysis. However, having height as the response variable with density as the explanatory variable would have been easier to understand and explain. Density can affect tree height in a stand, but understanding tree height as a factor that affects density is not as intuitive. Looking at how tree height responds to density in my stand would tell something about tree height, but that relationship has already been explored in great depth.

On the relationship between spatial variation of economic losses and tsunami momentum flux

Question

For Exercise 1, I evaluated the spatial pattern of economic losses resulting from a joint earthquake and tsunami event. I then deaggregated the losses by hazard (earthquake only, tsunami only, and combined) as well as by the intensity of the event.

For Exercise 2, I evaluated how the spatial pattern of economic losses resulting from a tsunami relates to the spatial pattern of tsunami momentum flux (a measure of velocity and inundation depth) by performing a geographically weighted regression (GWR). For this analysis, I only considered the tsunami because there is significant spatial variation in the hazard, whereas the spatial variation for the earthquake is minimal.

Tool and approach

I performed the GWR using the python library PySAL (Python Spatial Analysis Library). The independent variable was defined as the momentum flux, and the dependent variable defined as the percent loss of economic value.

Description of steps

The average losses at each building resulting from an earthquake/tsunami loss model were first converted to percent loss (loss divided by real market value), and added as an attribute to a GIS shapefile. The percent loss was used as opposed to the economic losses because each building has a different initial value. Consequently, the percent loss serves to normalize the economic losses across all buildings within Seaside. For this analysis, the results from the “1000-year” tsunami event were analyzed.

The GWR was then performed using PySAL with the momentum flux defined as the independent variable and the percent loss defined as the dependent variable. The GWR resulted in a slope and intercept at each tax lot, as well as a global r2 value. Two separate maps were generated wherein each tax lot was color coded based on values of the slope and intercept.

Description of results

The results from the GWR and a global regression are shown in Figures 1 and 2, respectively. A global r-squared value of 0.575 was obtained, indicating that the data are moderately correlated. In Figure 1, it can be seen that the intercept is larger near the ocean and decreases as the distance to the shore increases. This can be explained by the fact that the momentum flux is largest near the coast and decreases as the tsunami propagates over land.

Similar trends would be expected for the slope coefficients; however, along the coast the slopes are negative, indicating that the economic losses decrease as the momentum flux increases. This can likely be explained by inconsistent building types within Seaside. For example, concrete buildings can better withstand the impact of a tsunami than their wood counterparts. Similarly, buildings of different heights (numbers of floors) have different damage properties. Consequently, because the building types are not consistent within Seaside, significant variations in percent loss can occur within a small spatial region (e.g. a wood building located next to a concrete building). This can produce a lower percent loss at a larger momentum flux.

Figure 1: Spatial variation of slope and intercept resulting from the GWR

Figure 2: Global regression and line of best fit

Critique

While the GWR does provide a means to evaluate correlation between two variables that are within the same geographical region, there are limitations for this particular application. The results showed negative slopes in some locations, which is likely caused by the large variation in the percent loss. To alleviate this, alternative statistical models could be developed using GWR that only consider similar building types. An example of a non-spatial regression for wood buildings with 2 and 3 floors can be seen in Figure 3. The improvement in r-squared values can be observed, and would likely translate to the GWR.

Figure 3: Example of global regression considering specific building types

 

Fire Refugia’s Effects on Clustering of Infected and Uninfected Western Hemlock Trees

Overview

For Exercise 1, I wanted to know about the spatial pattern of western hemlock trees infected with western hemlock dwarf mistletoe. I used a hotspot analysis to determine where clusters of infected and uninfected trees were in my 2.2 ha study area (Map 1). I discovered a hot spot and a cold spot, indicating two clusters, one of high values (infected) and one of low values (uninfected).

In my study site, two fires burned: once in 1829, burning most of the stand, and again in 1892, burning everywhere except the fire refugia (polygons filled in blue). This created a multi-storied forest with remnant trees located in the fire refugia. One component of the remnant forest is infected western hemlock. These remnant hemlocks serve as the source of inoculum for the hemlocks regenerating after the 1892 fire.

For Exercise 2, my research question was: How does the spatial pattern of fire refugia affect the spatial pattern of western hemlock dwarf mistletoe?

I predicted that a cluster of infected western hemlocks is more likely to be next to a fire refugia than a cluster of uninfected trees. In order to assess this relationship, I used the geographically weighted regression tool in ArcMap.

Geographically Weighted Regression

Geographically weighted regression (GWR) works by creating a local regression equation for each feature in the data set you want to analyze, using one or more explanatory variables to predict values of the response variable via the least squares method. The Ordinary Least Squares (OLS) tool differs from GWR in that OLS creates a global regression model (one model for all features), whereas GWR creates local models (one model per feature) to account for the spatial relationships of the features to each other. Because the method of least squares is still used, its assumptions should still be met for statistically rigorous testing. The output of the GWR tool is a feature class of the same type as the input, with a variety of attributes for each feature. These attributes summarize the ability of the local regression model to predict the actual observed value at that feature’s location. If you have an explanatory variable that explains a significant amount of the variation in the response variable, GWR is useful for seeing how its coefficient varies spatially.

Execution of GWR

To use this tool, I first quantified the relationship between the trees and the fire refugia. I used the “Near” tool to calculate each tree’s nearest distance to a fire refugia polygon’s edge; this was my explanatory variable. My response variable was the z-score output for each tree from the Optimized Hot Spot Analysis. I then ran the GWR tool and used the Moran’s I tool to check for spatial autocorrelation, i.e. clustering, of the residuals. Clustering indicates I may have left out a key explanatory variable. The figure below displays my process.

I tested the relationship between nearest distance to a fire refugia polygon’s edge and the z-score that was output for each tree from the Optimized Hot Spot Analysis using OLS, which is necessary to develop a well specified model. My R2 value for this global model was 0.005, which is incredibly small. Normally I would have stopped here and sought out other variables to explain this pattern, but for this exercise I continued the process. 

Results

This GWR produced a high global R2 value of 0.98 (Adj R2 0.98) indicating that distance to refugia does a good job of explaining variance in the spatial pattern of infected and uninfected trees. However, examining the other metrics for the local model performance gives a different picture of model performance.

Map 2 displays the coefficients for the explanatory variable, distance to nearest refugia. As this variable changes, the z-score increases or decreases; changes in z-scores indicate clustering of high or low values. The range of coefficient values is quite small, -0.513 to 0.953, meaning that across my study site the coefficient only changes slightly from positive to negative. In the northwestern corner, we see a cluster of positive coefficient values: here, as distance to refugia increases, the z-score of trees increases, predicting a clustering of infected trees. These values are associated with high local R2 values (Map 4). In other places in the stand we see slight clustering of negative coefficients, indicating that greater distance to refugia decreases the z-score of trees, predicting a clustering of uninfected trees.

Map 3 displays the standardized residuals for each tree. Blue values indicate where the local model over-predicted the observed value, and red values indicate under-predictions. When the residuals from the local regression models are distributed randomly (i.e. not clustered or dispersed) over the study area, the geographically weighted regression model is well fit, or well specified. Here, the residuals of the local regression models were significantly clustered (Moran’s I of 0.265, p-value of 0.000, z-score of 24.344). Because we can observe clustering of residuals in my study area, there is another phenomenon driving the changes in z-scores; in other words, driving the clustering of infected and uninfected trees.

From the previous two map evaluations, I saw that the distance of a tree to fire refugia was not the only explanatory variable needed to explain why infected and uninfected trees cluster. Map 4 displays the local R2 values for each feature. The areas in red have high local R2 values. The northwestern corner has a large number of high values, which correspond to a cluster of small residuals and positive coefficients; here, distance to fire refugia explains the clustering of infected trees well. The reverse is observed in several other places (clusters of blue), where distance to fire refugia does not explain why infected or uninfected trees cluster. In fact, the majority of observations had a local R2 of 0.4 or less. From this evaluation, I believe this GWR model using distance to refugia does a good job of explaining the clustering of infected trees, but not much else.

Critique

GWR is useful for determining how the coefficient of an explanatory variable changes across an area. One feature in a specified area may have a slightly different coefficient from another feature, indicating that these two features experience different conditions in space. This allows the user to make decisions about where the explanatory variable has the most positive or negative impact. This result is not something you can derive from a simple OLS global model. This local regression process is something you could do manually, but the tool in ArcMap makes the process easy. The output of GWR is also easy to interpret visually.

Some drawbacks are that you need to run the OLS model first on your data to determine which variables are significant in explaining your response variable; if not, a poorly specified model can lead to inappropriate conclusions about the explanatory variable (i.e. high R2 values). Also, the evaluation of how the features interact in space is not totally clear: the features are evaluated within a fixed distance or number of neighbors, but there is no description of how weights are applied to each neighboring feature. Lastly, for incidence data, this tool is much harder to use if you want to determine what is driving the spatial pattern of your incidence data; some other continuous metric (in my case a z-score) must be used as the response variable, making results harder to interpret.

Model Results Follow-Up

After finding that distance to a refugia was not a significant driver for the majority of trees, I examined my data for other spatial relationships. After a hotspot analysis on solely the infected trees, I found that the dispersal of infected trees slightly lined up with the fire refugia drawn on the map (Map 5).

Among other measures, forest structure was used to determine where fire refugia were located. Old forest structure is typically more diverse vertically and less clustered spatially. Also, infected western hemlocks are good indicators of fire refugia boundaries because, as a fire-sensitive tree species, they would not survive most fire damage, and the presence of dwarf mistletoe indicates they have been present on the landscape for a while. From the map we can see that the dispersal of infected trees only lines up with the refugia in a few places. This mis-drawing of fire refugia bounds may be a potential explanation for the under-performance of the GWR model.