Category Archives: Exercise 2

Multiple Buffer Distance Analysis on Woodpecker Nest Buffers Intersecting Salvage Harvest Areas

Exercise 2: Multiple Buffer Distance Analysis

Question

How is the spatial presence of postfire woodpecker nests related to the spatial presence of salvage-logged forest stands?

How are woodpecker nests clustered within survey units? (Exercise 1 and 3)
How does this clustering relate to salvage treatment units within the survey units? (Exercise 2 and 3)

Tool

Multiple Ring Buffer in ArcMap

This tool creates shapefiles of concentric circles around features based on user-input values. Users can dissolve buffers to create a single feature class per buffer distance. Buffer widths overlapping with features of interest can indicate spatial relationships between target subjects. In this case, intersecting woodpecker nest buffers with salvage harvest polygons may reveal trends in woodpeckers selecting nest sites closer to or farther from salvage units. An equation producing percent area of each nest buffer intersected by a salvage unit serves as an index.

Above: Buffers created in a test analysis at 15 – 50 meter intervals around nest point features.

Data

For this exercise I used 2016 and 2017 woodpecker nest point shapefiles. I created multiple ring buffer outputs for each shapefile to use in the analysis. I also used a polygon shapefile of 35 salvage harvest units included within the treatment woodpecker survey units. I used another polgyon shapefile of the woodpecker survey units and a WorldView-3 1 m raster for supplementary data.

Multiple Buffer Distance Analysis Steps

Create separate point shapefiles for 2016 and 2017 nest points.
Run the Multiple Ring Buffer tool in ArcMap on each shapefile at 50 meter intervals from 50 – 300 meters. The tool creates
Use the Intersect tool in ArcMap to create a polygon shapefile of the buffers clipped to the salvage harvest units.
Use the Dissolve tool in ArcMap to merge overlapping buffer polygons of the same type for the same nest.

Above: The green polygons indicate areas where the Intersect tool identified overlap between the buffers and salvage unit polygons. The tool creates a polygon shapefile of the overlapping areas.

Above: The final result of six buffers at 50 m intervals around nest points intersected with salvage harvest units. Each color represents an individual polygon in a shapefile. The polygons are given size information and used to calculate percent overlap of nest buffers with salvage units.

5. Use the XTools Pro extension to attribute size information to each intersected buffer shapefile for area in square meters.

6. Create a new field in each intersected buffer attribute table for percent area of the buffer intersected by a salvage unit.

7. Use the field calculator to create a formula for percent buffer area intersected: (Area field/Complete buffer size)*100

8. Export each intersected buffer table to Excel, including the salvage treatment type column and the percent buffer area intersected column for further analysis. Transfer the salvage treatment type and corresponding percent buffer area intersected columns to a combined Excel file with a sheet for every buffer distance.

9. Add zero values to each sheet for the nests falling within a control unit. The control nests act as a fourth treatment and should be included in the results.

10. Use ggplot2 in R to extract data from each sheet and create box plots for each buffer distance.

Above: Excel table of X and Y inputs for boxplots showing percent of the complete buffer intersected by a salvage harvest unit. Each buffer distance from 50 – 300 m has a sheet containing columns for these values.

Results

I generated graphs for each of the six buffer intervals (50-300 m at 50 m intervals) in 2016 (pre-salvage) and 2017 (post-salvage) for 12 total graphs. I presented the 2016 and 2017 results for each buffer distance side by side below. In each graph, the unlabeled grouping of points to the left of salvage treatment type 1 represents nests in the control area, or nests 100% inside salvage treatment 0. Visual analysis reveals interesting patterns about woodpecker nest site selection when considering the silvicultural prescriptions designating salvage treatment types. Below is a reminder of the salvage treatments:

Treatment 1 harvests the most trees overall but retains the most large diameter trees with spacing for Lewis’s woodpeckers. Treatment 2 harvests a moderate number of trees, retaining less large diameter trees but more medium and small diameter trees. Treatment 3 harvests a limited number of trees but retains barely any large diameter with a heavy focus on small diameter for white-headed woodpeckers. The control treatments do not harvest any trees.

An obvious downfall of the following graphs is that as distance from the nest increases, percent of the nest buffer intersecting a salvage polygon will decrease. There is an overall decrease in mean and midrange values for intersecting area as the buffer distance increases. However, some significant points emerge:

Nest distribution in Treatment 1 units generally increases in breadth between 2016 and 2017. Meaning, in 2016 the distribution of woodpecker nest distances from a salvage unit centered more closely around a mean distance near 30 – 50%. In 2017 the values appear to spread out with less preference towards a specific distance from a salvage unit.
Nests in Treatments 2 and 3 units generally decrease in percent area intersected by a salvage unit from 2016 to 2017. Meaning, after the salvage harvest, woodpeckers are selecting nest sites farther from Treatments 2 and 3.

50 Meter Buffer Results

______________________________________________________________________________________________________

100 Meter Buffer Results

______________________________________________________________________________________________________

150 Meter Buffer Results

______________________________________________________________________________________________________

200 Meter Buffer Results

______________________________________________________________________________________________________

250 Meter Buffer Results

______________________________________________________________________________________________________

300 Meter Buffer Results

Problems and Critique

I would perform this analysis again with larger buffer sizes and greater buffer intervals. I am not convinced the buffer size I chose for this exercise captured significant information at this scale. I created 10 buffers from 100 – 1000 m for a future analysis. Buffer size should be determined by assumed travel and foraging distances for each woodpecker species. The 50 – 300 m scale may not be a great enough distance for local trends to develop in the data. I chose these distances because the belt transects for the woodpecker point count surveys are 200 – 300 m apart to avoid interfering with birds on neighboring transects. I thought this would be a comparable scale for the buffer analysis.

This analysis fails to address or quantify the control nests because they are too far away from salvage units for their buffers to intersect. In Exercise 3, a near analysis in ArcMap can quantify trends in control nest distances from salvage areas.

In the future I would like to produce a statistics table for each 2016 and 2017 buffer distance displaying mean, standard deviation, and other metrics for more than a visual analysis.

Watershed recession behavior as a function of watershed environmental variable

Question asked: How is the spatial pattern of watershed recession coefficients related to the spatial patterns of watershed elevation, soils, geology, basin area, and precipitation?

Name of the tool or approach used: To perform this analysis, I used recession analysis to quantify low flow metrics for 12 streams and rivers in the Oregon Coast Range. I classified the a and b recession coefficients using 2 different approaches in ArcGIS: equal interval and natural jenks. The resulting classifications are quite different.

Methods of procedure:

Recession Analysis: the methods employed for the recession analysis are described in Exercise 1. For Exercise 2, I calculated recession coefficients for an additional seven sites in the Oregon Coast Range for a total of nine sites. Data for the present analysis was analyzed on the hourly timestep.
Watershed Delineation: I delineated watersheds in ArcGIS based on each gaged pour point.
Spatial data: I downloaded relevant spatial data from various geospatial databases including Oregon Geospatial Enterprise Office, USGS, USDA, and OSU PRISM Climate Group.
Data Visualization and Classification: I used ArcGIS to visualize my spatial data and to classify the recession analysis results. I used two classification methods: equal interval and natural jenks. Both methods had 5 classes.
1. Equal interval classification: classification categories using this method is calculated by dividing the range of data by the designated number of classes.
2. Natural jenks classification: this is an optimization method of classification that minimizes variance within classes and maximizes variance between classes.
The results of my classification were analyzed by visual comparison. That is comparison between the two classification methods, as well as how the results of each classification methods compared to the environmental variables across the watersheds.

Results:

Results from this analysis demonstrate that the recession coefficient classification method influences the apparent spatial variability of recession metrics. The natural jenks method results in more spatial heterogeneity across the latitudinal gradient (Figure 1). The a and b recession coefficients seem to be related in the way that they vary across space, however coefficient b is slightly more variable than coefficient a. The equal interval classification method results in a more homogenous representation of both a and b coefficients (Figure 2). Coefficient a seems to be largely controlled by basin size using this method, which is apparent because the smallest basin that is in the class with high values. Coefficient b is more variable, and also seems to be controlled by basin area. Precipitation may also be a primary control. Overall, between the two classification methods, the Nehalem watershed (furthest north) and the Chetco watershed (furthest south) seem to demonstrate similar recession coefficients. The watersheds between the Chetco and the Nehalem, in general, demonstrate similar recession behavior.

Figure 1. Classification of recession coefficients using the natural jenks method in ArcGIS, compared to five environmental variables. Coefficient a is in the first panel and coefficient b is in the second panel.

Figure 2. Classification of recession coefficients using the equal interval method in ArcGIS, compared to five environmental variables. Coefficient a is in the first panel and coefficient b is in the second panel.

Critique of method:

This was a simple method of visual comparison across space, however it was quite useful for both: 1) considering the spatial patterns of watershed recession behavior, and 2) comparing classification methods and how they influence the outcome of the analysis. Because it is just a visual comparison, there are no quantifiable differences presented here, which will be important moving forward. Additionally, this was an important exercise to understand the mechanical steps necessary for making this comparison.

Does the presence of Olympia oysters correspond with predicted suitable habitat?

Question explored
In my last blog post, I mapped habitat suitability for Olympia oysters in Yaquina Bay, OR by assessing three environmental parameters: salinity, substrate availability, and elevation. In exercise 2, I brought in oyster location data points collected from field surveys of the intertidal zone to compare against the map of suitable habitat. The question I am examining in this exercise is:

How does the spatial pattern of Olympia oyster location data correspond to the spatial pattern of suitable habitat in Yaquina Bay?

Field surveys of the intertidal zone of Yaquina Bay were conducted on April 19-20 and May 17, 2019 during low tides. Oysters were characterized as ‘present’ if evidence of at least one oyster (living or dead) was detected within the predefined search area.

Name of tool or approach
I first uploaded the data points collected in the field into Google Earth where I could easily verify the locations against my field notes and photos, as well as perform some minor quality control. The points were imported into ArcGIS Pro for spatial analysis. Statistical information was reviewed and plotted in Excel.

Brief description of steps to complete the analysis

After validating the data point locations and performing some minor quality control in Google Earth, I saved the points as a KML file. In ArcGIS Pro, I used the geoprocessing tool ‘KML to layer’ to convert them for analysis. Once I added the data points onto the map as a layer, I edited the symbology to display the points as ‘Present’ or ‘Absent’.
To assess the neighboring habitat, defined by 4 class types from least suitable to most suitable, surrounding each of the data points, I used the ‘Multiple Ring Buffer’ tool to create 3 buffer rings around each of the points at distances of 75, 150, and 300 meters. For the results in this blog post, only the 300-meter buffer was used. I selected ‘overlapping (disks)’ in the dissolve option to assess the habitat around each data point individually.
Once the buffers were created, I used the ‘Zonal Statistics’ tool to overlay the buffered areas onto the raster of habitat suitability. This tool allows the user to select by statistic desired (mean, median, etc.) to generate a spatial output. I chose ‘majority’, which categorizes the buffer zones based on dominant habitat suitability type within the buffer. Majority also represented the median in this output. For example, if the majority of the suitable habitat within the buffer area is class 4 (most suitable) then the buffer display is shown as ‘4’.
In addition to the ‘Zonal Statistics’ spatial output, I used the tool ‘Zonal Statistics as Table’ to generate a table of all the statistical information relevant to this analysis. The same input data is used (overlaying the buffered zones on the habitat suitability raster) to create this table.
I copied the table generated into Excel where I split up the data on a couple levels: 1) Presence vs. Absence and 2) North shore vs. South shore for comparison. The north and south shores are managed very differently: the north shore is largely composed of rip rap and steeper slopes because Yaquina Bay Road runs right along the edge, whereas the south shore is more natural and less developed. I created box and whisker plots of the majority habitat suitability type, minority, variety, and mean.

Results

The results show some mixed information. When looking ‘Presence’ data points, the majority habitat types surrounding these points on both the north and south shore are 3 or 4, most suitable areas. The minority habitat type is very different between the north and south shore, with the south, more natural shoreline showing a stronger correlation with low suitability habitat being found less often around presence data points. The means for presence data points generally correspond with greater habitat suitability.

However, the absence data points show that the majority habitat type tends to be more closely aligned with predicted most suitable habitat, especially on the south shore. Additionally, the minority habitat types surrounding the absence data points are 1, 2, and 3, indicating that the least suitable habitat does not constitute much of the area. This could be partly due to low coverage by the least suitable habitat type overall (see maps). There appears to be very weak correlation between absence of oysters and location of suitable habitat. Absence of oysters is likely to be recorded in both suitable and unsuitable habitat.

Critique of the method
This method further revealed to me that the importance of the resolution of the baseline data. Based on field observations and conversations with shellfish biologists, the distribution of Olympia oysters is very patchy due to substrate availability. The oysters may be found attached to a pile of rocks in the middle of the mud flat, but will not be found elsewhere in the mud flat. The raster layer I have available for substrate has classified substrate into large generalized categories, which does not reflect the nuanced nature of their opportunistic settling strategy. Dividing habitat suitability into only 4 categories limits the complexity of the analysis which can be helpful, but also means that there’s not a lot of distinction between suitable and unsuitable. Additionally, more data points will help make this analysis more robust.

Using the buffers and the ‘Zonal Statistics’ tools created a generalized output that provides some useful information for analyzing habitat suitability for the oysters. The approach is easily duplicated, which was helpful as I needed to add my field data points in batches as I collected them. What would be more informative for the next iteration is to be able to analyze multiple buffers side-by-side; how does the smallest neighborhood around each point compare to the larger ones?

Exercise 2: Possible Influence of ENSO Index on Dolphin Sighting Latitudes

Exercise 2

Question Asked: Are latitudinal differences in dolphin sightings in the San Diego, CA survey area related to El Niño Southern Oscillation (ENSO) index values on a monthly temporal scale?

My previous question for Exercise 1 was: do the number of dolphin sightings in the San Diego, CA survey region differ latitudinally? I was finally able to answer this question with a histogram of sighting count by latitudinal difference. I defined latitudinal difference as the difference from the highest latitude of dolphin sightings (the Northernmost sighting point along the San Diego transect line) to the other sighting points, in decimal degrees. Therefore it becomes a simple mathematical subtraction in ArcMap. Smaller differences would be the result of a small difference and therefore mean more Northerly sighting, with large differences being from more Southerly areas. I used all sightings in the San Diego region (from 1981 through 2015). As you can see from below, there is an unequal distribution of sightings at different latitudes. Because I had visual confirmation of differences at least when all sightings are binned (in terms of all years from 1981-2015 treated the same), I looked for what process could be affecting these differences in latitude.
Comparing the Latitudes with the frequency of dolphin sightings in San Diego, CA

ENSO is a large-scale climate phenomena where the climate modes periodically fluctuate (Sprogis et al. 2018). The climate variability produced by ENSO affects physical oceanic and coastal conditions that can both directly and indirectly influence ecological and biological processes. ENSO can alter food webs because climate changes may impact animal physiology, specifically metabolism. This creates further trophic impacts on predator-prey dynamics, often because of prey availability (Barber and Chavez 1983). During the surveys of bottlenose dolphins in California, multiple ENSO cycles have caused widespread changes in the California Current Ecosystem (CCE), such as the squid fishery collapse (Nezlin, Hamner, and Zeidberg 2002). With this knowledge, I wanted to see if the frequency of dolphin sightings in different latitudes of the most-consistently studied area was driven by ENSO.

Tool/Approach:

Primarily R Studio, some ArcMap 10.6 and Excel

Step by Step:

1.For this portion of the analysis, I exported my table of latitudinal differences within my attribute table for dolphin sightings from ArcMap 10.6. I saved this as a .csv and imported it into R Studio.
Some of the sighting data needed to be changed because R didn’t recognize the dates as dates, rather as factors. This is important in order to join ENSO data by month and year.
Meanwhile, I found NOAA data on a publicly-sourced website that had months as the columns and years as the rows for a matching ENSO index value of either: 1, 0, or -1 for each month/year combination. A value of 1 is a positive (warm) year, a value of 0 is a neutral year, and a value of -1 is a negative (cold) year. This is a broad-value, because indices range from 1 to -1. But, to simplify my question this was the most logical first step.
I had to convert the NOAA data into two-column data with the date in one column by MM/YYYY and then the Index value in the other column. After multiple attempts in R studio, I hand-corrected them in Excel. Then, imported this data into R studio.
I was then able to tell R to match the sighting date’s month and year to the ENSO data’s month and year, and assign the respective ENSO value. Then I assigned the ENSO values as factors.
I created a boxplot to visualize if there were differences in distributions of latitudinal differences and ENSO index. (See figure)Illustrating the number of sightings grouped by ENSO index values (1, 0, and -1).
Then I ran an ANOVA to see if there was a reportable, strong difference in sighting latitudinal difference and ENSO index value.

Results:

From the boxplot, it appears that in warm years (ENSO index level of “1”), the dolphins are sighted more frequently in lower latitudes, closer to Mexican waters when compared to the neutral (“0”) and cold years (“-1”). This result is intriguing because I would have expected dolphins to move northerly during warm months to maintain similar body temperatures in the same water temperatures. However, warm ENSO years could shift prey availability or nutrients southerly, which is why there are more sightings further south. The result of the ANOVA, was a p-value of <2e^-16, providing very strong evidence to reject the null of hypothesis of no difference. I followed up with a Tukey HSD and found that there is strong evidence for differences between both the 0 and -1, -1 and 1, and 1 and 0 values. Therefore, the different ENSO indices on a monthly scale are significantly contributing to the differences in sighting latitudes in the San Diego study area.

Tukey HSD output:

diff lwr upr p adj

0–1 0.01161047 0.004250827 0.01897011 0.0006422

1–1 0.04101170 0.030844193 0.05117920 0.0000000

1-0 02940123 0.020689737 0.03811272 0.0000000

Critique of the Method(s):

These methods worked very well for visualization and finally solidifying that there was a difference on sighting latitude related to ENSO index value on a broad level. Data transformation and clean-up was challenging in R, and took much longer than I’d expected.

References:

Barber, Richard T., and Francisco P. Chavez. 1983. “Biological Consequences of El Niño.” Science 222 (4629): 1203–10.

Sprogis, Kate R., Fredrik Christiansen, Moritz Wandres, and Lars Bejder. 2018. “El Niño Southern Oscillation Influences the Abundance and Movements of a Marine Top Predator in Coastal Waters.” Global Change Biology 24 (3): 1085–96. https://doi.org/10.1111/gcb.13892.

Contact information: this post was written by Alexa Kownacki, Wildlife Science Ph.D. Student at Oregon State University. Twitter: @lexaKownacki

Spatial and Temporal Patterns of Reported Salmonella Rates in Oregon

Question Asked

Here I asked if there was evidence supporting temporal autocorrelation of age-adjusted Salmonella county rates within Oregon from 2008-2017 and if so what type of correlation structure is most appropriate. I also investigated spatial patterns of reported Salmonella rates as they related to various demographic variables like: Percent of county which is aged 0-4, percent of county which is aged 80+, percent of county which is female, median age, percentage of county residents who have completed high school, median county income, percent of county who is born outside the US, percent of county who speaks a language other than English at home, percentage of county estimated to be in poverty, and percent of children in a county estimated to be in poverty.

To answer these questions, I used the same data outlined in the first exercise blog post with newer demographic variables being taken from the American Community Survey and published on AmericanFactFinder which provides yearly estimates of demographic information at the county level. Unfortunately, yearly data prior to the year 2009 is unavailable which shortened the window of analysis by a year.

Names of analytical tools/approaches used

The methods used to answer these questions was first to create an exhaustive general linear model where county Salmonella rates were a function of the above listed demographic variables. A Ljung-Box Test was used to assess if there was evidence of non-zero autocorrelation of residuals of the model at various time lags. Following this, an ideal linear model was selected using a stepwise AIC selection process and then different variance structures were compared by AIC, BIC, and log Likelihood metrics as well as ANOVA testing. Following the selection of an appropriate base model and variance structure, I allowed for interaction between all variables and time and performed more ANOVA testing to select the best model which allowed for variable time interaction. A version of this model would later be used in geographically weighted regression analysis. I performed geographically weighted regression allowing the coefficients to vary across space in Oregon.

Description of the analytical process

I created a lattice plot of reported age-adjusted Salmonella rates over the 10-year period in every county to visually assess whether Salmonella rates were changing over time. After seeing that rates of reported Salmonella were changing over time in many Oregon counties I created a “full” linear model in which the rate of reported Salmonella cases in a county was a function of the demographic variables described above. Because this is longitudinal data measured over time I wanted to see if rates of Salmonella were correlated over time, meaning that a county’s rate one year could be predicted from the rate found in another year in the same county. I first performed a Ljung-Box test to assess the need to evaluate for temporal autocorrelation as well as tested normality assumptions, and log-transformed my outcome (Salmonella rates) based on those normality assumption tests. A simple backward step-wise AIC approach was used on my full model to identify and remove extraneous variables. This worked by removing variables in the full model in a stepwise fashion, comparing the AIC values between the two models, and continuing this process until the AIC values between the two models being compared are not significantly different. I then used this model to select an ideal variance structure to compare Salmonella rates autocorrelated at different time lags. The types of variance compared were: Independent variance, compound symmetric, AR1, MA1, ARMA (1,1), unstructured, and allowing for different variance structures across time. After AIC, BIC, log Likelihood, and ANOVA testing an ideal variance structure was determined and the model using this variance structure was evaluated for basic interaction with time. All variables present in the model were allowed to have time interaction including time itself (i.e. allowing for a quadratic time trend). Once again AIC, BIC, log Likelihood, and ANOVA testing were used to select the most ideal model.

Following this I moved on to GWR, where I was able to use the model identified above to create a new data frame containing Beta coefficient estimates of the significant variables in my final model for every county in Oregon. This data frame of coefficients was merged with a spatial data frame containing county level information for all of Oregon. Plots of the coefficient values for every different county were created.

Brief Description of Results

Every panel represents one of Oregon’s 36 counties and panels containing an error did not have any cases of reported Salmonella. Some counties are seen decreasing with time, others show slightly increasing trends, and others show a fairly level rate over time. Clearly there is evidence of some time trend for some counties.

Results of normality testing found that age- adjusted rates of reported Salmonella in Oregon were not normally distributed, for the ease of visualization and as an attempt to address the failure to meet the assumption of normality in linear modeling I log-transformed the rates of reported Salmonella. Results of the Ljung-Box test with my full model provided evidence of non-zero time autocorrelation in the data and a visual inspection of the lattice plot supports this with most counties showing a change in rates over time.

The stepwise AIC model selection approach yielded the following model:

logsal ~ %female + %childpov + medianage +year

Covariance structure comparison:

Covariance Model	logLikelihood	AIC	BIC
Independent	-431	874	896
Compound Symmetry	-423	860	887
AR(1)	-427	869	895
MA(1)	-427	869	895
ARMA(1,1)	-423	862	892
Unstructured	-387	859	1017
Compound Symmetry Different Variance Across Time	-412	854	910

Mostly AIC, BIC, and Log Likelihood values were clustered together for the different models. I decided to base my choice primarily on the lowest AIC because that’s how I did variable selection to this point. This resulted in me choosing a compound symmetric model which allowed for different variances across time.

Next, I built models which allowed for simple interaction with time meaning that any three-way interaction with time was not evaluated for. Subsequent ANOVA testing comparing the different interaction models to each other, to a model where no interaction was present, and a model where time was absent were used in my selection of a final model.

Final Model: *logsal ~ %female + %childpov + medianage + year +(medianageyear)** This model follows a 5×5 compound symmetric correlation variance model which allows for variance to change over time.
Code: interact_m <- gls(logsal ~ female + childpov + medianage + year +(medianage*year), na.action=na.omit, correlation= corCompSymm(form= ~1\|county),weights=varIdent(form=~1\|year),data=alldata)
Within County Standard Error (95% CI): 0.92
Estimate Name	Estimate (log-scale)	Std. Error	p-value
Intercept	-759.42	237.79	0.002
% Female	18.06	4.46	<0.001
% Child Poverty	0.03	3.27	0.001
Median Age	17.16	3.11	0.002
Year	0.38	3.18	0.002
Median Age*Year	-0.01	-3.12	0.002

Estimates are on the log scale making them difficult to interpret without exponentiation, however it can be seen that a percent change in the number of females in a county or a year change in median age are associated with much larger changes in rates of reported Salmonella incidence compared to changes in percent of child poverty and the year. Overall, incidence rates of reported Salmonella were shown to increase with time, county percentage females, county percentage of child poverty, and county median age with a significant protective interaction effect between time and median age.

For my GWR analysis I used a function derived from the rspatial.org website and looks like:

regfun <- function(x) {
dat <- alldata[alldata$county == x, ]
m <- glm(logsal ~ female + childpov + medianage + year + (medianage*year), data=dat)
coefficients(m)
}

As can be seen, I retained the same significant variables found in the OLS regression for my time series analysis. GWR in this case allows for the coefficient estimates to vary by county.

This allowed me to create a data frame of all coefficient estimates for every county in Oregon. Subsequent dot charts showed the direction and magnitude of all covariates varied across the counties. Plots and dot charts of Oregon for the different coefficient estimates were made.

For % Female

For % Child Poverty

For Median Age

For Year

For Median Age * Year

For the most part, county % female, median age, year, and the interaction term clustered close to 0 for most counties. Some counties were showed highly positive/negative coefficient estimates though no consistently high/low counties could be identified. The maps for the coefficients of median age and year are very similar though I do not have a clear idea why this is the case. The map of the coefficients of child poverty showed the most varied distribution across space. Autocorrelation analysis using Moran’s I of the residuals from this GWR model did not find any evidence of significant autocorrelation. I could not find evidence of a significant non-random spatial pattern for the residuals of my model.

Critique of Methods Used

While the temporal autocorrelation analysis was useful in that it provided evidence of temporal autocorrelation present in the data and prior univariate spatial autocorrelation provided limited evidence of variables being spatially autocorrelated at different spatial lags, I was unable to perform cross correlation analysis between my outcome and predictor variables. One important note: I do plan on performing this analysis I just need to figure out how the “ncf” package works in R. This is one of the more glaring shortcomings of my analysis so far is that I do not have evidence that my outcome is correlated with my predictor variables at various distances. Another critique is that the choice of an ideal temporal correlation structure was fairly subjective with my choice of model selection criteria being AIC. Basing my decision on other criteria would likely change my covariance structure. A similar argument could be said for my choice of variable selection being based on a backwards stepwise AIC approach where other selection criteria/methods would likely have different variables in the model.

Finally, the results of my GWR analysis do not show actual drivers of reported Salmonella rates. Rather it shows the demographic characteristics associated with higher reported rates. While this information is useful it does not identify any direct drivers of disease incidence. Further analysis will be required to see if these populations have different exposures or severity of risky exposures.

Determining accuracy of segmentation versus ground truth disease patches of Blackleg on Turnip

Background
My research involves the classification of a leaf spots on turnip derived from the pathogen blackleg. I had hypothesized that spatial patterns of pixels based on image classification are related to manually classified spatial patterns of observed disease on turnip leaves because disease has a characteristic spectral signature of infection on the leaves. Here I focus on the accuracy of previously determined disease patches through segmentation and ground truth classification. To do this a confusion matrix is used and allows for detection of true positives, true negatives, false positives and false negatives. All of the image processing took place in ArcGIS Pro. The next step involved the accuracy assessment which was conducted in R.

Questions
How accurate is the computer image classification?
• Can the accuracy be quantified?
• How can the image classification error be visually represented?

Tools and Methodology
I began in ArcGIS Pro to help visually represent the false negative and false positive regions of the disease patches through the segmentation process of classification. To start I turned on both layers where you can see the overlap between the two classification methods in Image 1.

I then went to the symbology of the segmented image and changed the color to white. I placed this layer on top, so it covered all the raster cells of the manually classified patches leaving only the false negatives, seen in Image 2. Next, I went back into the symbology, but for the manually classified image and changed the color scheme to white. I moved this layer to the top and changed the segmented image color scheme back to unique values. I was left with Image 3 showing the false positives. This was an easy way to visualize these disease patches and how well the classification method was working.

Next, I exported a Tiff file of both the manually classified patches and the segmented patches. To ensure each cell between the two layers lined up in R I had to make sure the extent was the same for both layers when I exported. To do this I right clicked on the layer and went down to data and selected export raster. A pane appears in right side of ArcGIS Pro where I hit the drop-down arrow for clipping geometry and selected current display. I did this with both layers and had one Tiff file for computer classified image through segmentation and one for the manually classified disease patches.

Using the raster, rgdal and sp packages I was able to upload my two Tiff files in raster format to R. I gave the two files each a name and used the plot function to view the two images. I noticed they both had values associated with each patch which were on a gradient scale. To correct for this, I converted my two raster layers in R to tables. This provided a coordinate for each cell and the value associated with it. In the image segmentation raster table I had 0 to 2 and the manually classified image I had 1 to 6. For all the white space I was given ‘N/A’, which was another issue. I used the xlsx package to export my data tables to excel files. I opened the two files in excel and used the sort smallest to largest function. From here I was able to use the replace function and change all the ‘N/A’ values to 0’s and all the values associated with pixels to 1’s. The values were arbitrary associated with the pixels and I needed the raster in 1’s and 0’s format. After doing this with both excel sheets I copy and pasted the two side by side and deleted all the associated coordinates. These were also unnecessary because each pixel from the same coordinate between the layers were in the same row. I saved this excel file and uploaded it into R. I downloaded the caret package and performed a confusion matrix which can be seen below in Table 1.

Results
The visual representation for my false positive and false negative results can be seen below in Image 2 & 3 with Image 1 for comparison. You can see the false negatives for disease covers a much larger area than does the false positives. This may imply that the segmentation is limited in its assessment of diseased area. What it tends to miss is the margins of the disease but does a fair job of predicting the center of the disease where it likely originated and is most severe. To correct this, setting a larger threshold may allow for less severe regions of the disease to be classified. Because the segmentation is based on pixel reflectance value at the red band, this would mean the threshold value needs to be slightly lowered.

Additionally, an entire patch of disease was missed which can be seen in the right corner of Image 1. Currently, the classification system is set to only create segmented patches of 10 or more pixels. This patch is 9 pixels and therefore just missed the cutoff. Even though it was just shy of this requirement, we are unsure if the segmentation would have detected a difference in this diseased patch or if it was also out of the threshold for classification. If it is common for disease patches to be this small, it may be an indicator to lower the value to 5 or 6 for what it is allowed in the segmentation blocking.

There are multiple steps in the processing of the image where different routes could have been taken and potentially increased classification accuracy. The objective of this classification method is to have as little percent error as possible or at least determine if you’d rather have more false positive than false negatives, vice versa or equal. Here we have greater percent of cells which are false negatives and modest assessment of diseased pixels when simply visualizing the images.

To help quantify these images and determine how accurate the model was, a confusion matrix was used in R, seen in Table 1. The segmented classification correctly identified non-diseased regions very well and did a pretty good overall job of predicting disease that was confirmed with ground truth. There were 2075 true positives, 55 false negatives, 41 true negatives, and 12 false positives. The model correctly identified 97.1% of the 2183 total cells. The precision of the model was 42.7%. The sensitivity of the model was 77.4% and the specificity was 97.4%. The accuracy was very good for this model and is essentially a percent error calculation of the model. The sensitivity measures the proportion of actual positives that are correctly identified while the specificity is the opposite and measures the proportion of true negatives identified. The precision gives us a sense of how useful or complete the results are. The model did well overall but provided insights to possible adjustments that could be made which would increase the predictive power here.

Critique of method
One critique I have is the small sample size. While I simply intended to only lay down the framework for creating a stepwise process in disease classification, it supplies results that can hardly be statistically backed. I would like to increase my sample size to five images for part three and look for similarities and differences between the five. I also intend to make some adjustments to the process to try and increase overall accuracy, precision, specificity and sensitivity. So essentially this critique is the limited conclusion that can be drawn from a sample size of one and the need to increase that to five for now.

The second critique I have is the number of steps I have used to get to this point. I would like to find a more manageable way to do the segmentation process and the image processing steps to get to this point. I have found small changes I can make along the way. Ideally, I can use the Modelbuilder in ArcGIS Pro where most of the processing is done. This will streamline the process when I find a way to do this.

An error which is present in my results is the confusion matrix. The matrix is considering 2183 raster cells in order to perform the matrix. These cells were determined by a defined rectangle when exporting a Tiff file from ArcGIS Pro. Many of these cells are not even of the leaf and is simply classifying regions outside of the leaf. To correct this, I would need to export a Tiff file which is symbolic of the leaf shape. The confusion matrix results provided were therefore erroneous in a sense or a bit misleading.

Partner ideas
My partner talked about how they were doing a neighborhood analysis and it could be practical for me to do. She had mentioned doing it in earth engine which I haven’t used but could get some help from her. There is a multiple rings buffer and I could look at the false positives in this light. She also mentioned using geographically weighted regression. We didn’t discuss much about it, but it seemed like a good regression to perform on my error analysis. We related on some level with our projects data and issues but at the time didn’t have any clear resolves. I will be curious to follow up on our chat and see what type of analysis was performed and share my results as well.

Appendix

Image 1. Overlap between segmentation on top and manual classification below

Image 2. False negatives after subtracting segmented regions from manually classified.

Image 3. False positives after manually classified cells are subtracted from segmented.

Table 1. Confusion matrix

R Code
##raster upload

install.packages(“raster”)
install.packages(“rgdal”)
install.packages(“sp”)

library(raster)
library(rgdal)
library(sp)

img20_seg <- raster(“E:/Exercise1_Geog566/MyProject3/RasterT_afr7_Polygon_1.tif”)

img20_ground <- raster(“E:/Exercise1_Geog566/MyProject3/diseased_20_PolygonT_1.tif”)

img20_seg
img20_ground

plot(img20_ground)
plot(img20_seg)

##export data

raster.table <- as.data.frame(img20_seg, xy=T)
truth <- as.data.frame(img20_ground, xy=T)

install.packages(“xlsx”)
library(xlsx)
setwd(“E:/”)

tableimg <- raster.table

write.table(tableimg, file = “dataexport.csv”, sep = “,”)
write.table(truth, file = “truth1.csv”, sep = “,”)

##confusion matrix

#######this isn’t working

install.packages(“caret”)
library(caret)

table(confusionM$reference, confusionM$predicted)

confusionMatrix(confusionM$reference, confusionM$predicted)

##For the confusion matrix if above doesn’t work

myconfusionM <- table(confusionM$predicted, confusionM$reference)
print(myconfusionM)

##accuracy of matrix

2075+41+12+55 #total
(12+55)/2183 #misclassified/total

##Precision

41/(55+41)

##Sensitivity

41/(12+41)

##Specificity

2075/(2075+55)

Ex2: Incremental pixel greenness while moving away from refugee settlement boundaries

Exercise 2: Incremental pixel greenness while moving away from refugee settlement boundaries

The question I asked centered on how the pixel greenness / NDVI varied in buffered increments around settlements within BidiBidi, Imvepi, and Rhino Refugee camps. I wanted to compare the settlements in Bidi Bidi and Imvepi, which have a larger settlements, to Rhino, which tended to have smaller settlements more uniformly spaced and spread out. Given the more uniformed and wider spacing of Rhino, I expect that green-up will happen more quickly in comparison to the settlements in Bidi Bidi and Imvepi, which are closer together and varying in size and development pattern. This leads me to believe that there’s more cleared or available land and that
Map of Regions of Interest & Buffers

settlement size is based on political organization of camp blocks rather than natural boundaries that might exist already. One of the questions that I’m asking with these settlements and the geography of exclusion is essentially why different settlement areas and parts of these areas are included or excluded from global settlement datasets. One of the factors that contributes to this is a spectral and spatial distinction – that is, how might the green space in and around a settlement change as you move away from said settlement? With this exercise, I wanted to compare the settlements in Rhino to the settlements in Bidi Bidi and Imvepi to see if the pixel greenness changed at a different rate or in a different pattern as one moved away from the settlement center.

I used multiple different tools, including the Multi-Ring Buffer tool in QGIS (since I was working with export JSONs), EarthEngine to extract NDVI mean values from Landsat 8 satellite images at these settlement locations, and the smoothing factor in ggplot in R to plot and statistically examine the way NDVI changed.
I first needed to buffer the regions of interest that I wanted to study, of which there were 44 in Bidibidi and Impvipi and 41 in Rhino. I performed this buffering in QGIS using the Multi-Ring Buffer plugin and made buffers at 100-meter increments from 0 to 1000 meters. Some of the buffers overlapped, but for the sake of simplicity of this assessment, I ignored this. Within EarthEngine, I pulled Landsat8 images from 2018 that covered these settlements. After adding NDVI and NDBI calculated bands to the image collection of 2018 images, I performed a quality mosaic to compress the image collection into one image. I based this quality mosaic on NDVI, meaning that the pixel chosen from the image collection would be the pixel with the highest NDVI. While this can sometimes pull pixels from different dates, it does exclude the possibility of clouds and seasonality affecting the dataset by comparing just the most vegetated pixel that occurred in that area. If I were to re-do this, I might choose a single date image to capture phenological nuance. After reducing the image across all of the buffers (that is, calculating the mean NDVI within each buffer), I exported the geoJSONs, brought them back into QGIS, ensured that there was a spatial selection component linking all of the buffers and regions of interest together, and brought this data into R to plot the NDVI change over distance for all regions of interest.

Bidi Bidi Settlement; Imveppi Settlement Buffers

Rhino Settlement Buffers

The pattern of greening in the buffers around settlements in Rhino versus Imvipi and BidiBidi did present different patterns, but not particularly significant different patterns. It appears that the Rhino settlements had a faster rate of increase in greenness while moving away from the settlements especially in the first 500 meters, whereas Bidi Bidi and Imveppi showed a more gradual green-up, although there also seems to be a small shift at 500 meters. These results are somewhat expected, but also not very drastic. It would be interesting to see how the green-up changes if I increased my buffer extent or decreased my buffer increments.

I think that looking at NDVI in buffers was an interesting approach, but as I said above, my choice of pixel quality selection (highest NDVI) could alter a neutral selection of data. Also, what buffers I chose were relatively arbitrary – I chose equal intervals, but this does mean that when the mean NDVI is calculated, the mean is reduced across a larger area as each buffer gets further from the center. I could also try testing with larger buffers (200 or 500 meter buffers) that extend beyond 1000 meters from the settlement edge. Further, some of my buffers overlapped and encroached on other actual boundaries: this means that the buffers sometimes contained pixels from other identified settlements. For this reason, I chose to present the data in a smoothing trend. I will likely need to fix some of these errors for the final project, because I do think that this is a typical and useful spatial analysis to perform on this type of data and some of the errors are relatively easy to fix and would show stronger data integrity.

Exercise 2: Geographically weighted regression on two forested stands.

Bryan Begay

Initial Spatial Question: How does the spatial arrangement of trees relate to forest aesthetics in my areas of interest?

Context:

To understand forest aesthetics in my stand called Saddleback, I did a Ripley’s K analysis for Saddleback and on a riparian stand called Baker Creek to determine if the stands are clustered or dispersed. The Baker Creek location is a mile west of the Saddleback stand.

Geographically weighted Regression:

I performed a geographically weighted regression on both the Saddleback and the Baker Creek stands. The dependent variable was a density raster value and the explanatory value was tree height.

Tools and Workflow

Figure 1. The workflow for creating the Geographically Weighted Regression for the Saddleback Stand. The Baker Creek stand followed the same workflow as well.

Results:

Figure 2. Geographically Weighted Regression showing the explanatory variable coefficients in the Saddleback and Baker Creek stands near Corvallis Oregon. Yellow color indicates negative relationships and the hotter colors indicate positive relationships between tree height and density.

Figure 3. Geographically Weighted Regression showing the Local R2 values in the Saddleback and Baker Creek stands near Corvallis Oregon. Yellow color indicates that the local model is performing poorly, while hotter colors indicate better performance locally.

Table 1. Summary table output for the Saddleback stand’s geographically weighted regression.

Table 2. Summary table output for the Back Creek stand’s geographically weighted regression.

4. Interpretation/Discussion:

Having done the Ripley’s K analysis, I wanted to have a connection with this exercise, so I created a point density raster on both my stands (Figure 1). The point density raster calculates a magnitude-per-unit area from my tree points and outputs a density for the neighborhood around each tree point. The raster values would then be a descriptor of the trees neighborhood density. Having the density neighborhood values describes the stands tree spatial arraignment and relates to the Ripley’s K analysis outputs of telling if a stand is spatially clustered or dispersed.

Figure 2. shows that there is a spatial pattern in the Saddleback stand between density and height. There is a positive relationship on the edges of the stand and a decreasing relationship in the middle of stand between the two variables. This makes sense when thinking about how the stand would have denser and higher trees on the edges of the managed stand to screen the forest operations. The coefficient values for the baker creek showed a positive relationship on the north eastern portion of the stand, which would need further investigation to understand the relationship between density and height. Overall the relationship was negative in the Baker creek stand between density and height, but this may be attributed to the low local R2 values that indicate poor modeling (figure 3). Table 2. also shows that the Baker Creek model only accounted for 50% of the variance for the adjusted R2 values, which would indicate that more variables would be needed for the riparian stand. Figure 1. shows the summary table for GWR in the Saddleback stand.

Critiques

The critiques for this exercise is that I only look at height and density. If I had more knowledge of working with LAS data sets I would have liked to have implemented the return values on the LiDAR data as an indicator of density. Another critique would be that I used density as a dependent variable and height as an explanatory variable. Using density as the dependent value allows me to see the spatial patterning of my trees when plotted in ArcMap so I can reference the Ripley’s K outputs for further analysis. Having height as a response variable with density as an explanatory is something that would have been easier for me understand and explain that relationship. Density can affect tree height in a stand but understanding tree height as a factor that affects density is not as intuitive. Looking at how tree height responds to density in my stand would tell something about tree height, but that relationship has already been explored in great depth.

Spatial Distribution of Trees by Height Class, Slope and Elevation in the HJ Andrews Forest

Guiding Questions: How do distances between trees differ depending on tree height? How does the spatial pattern of tall trees relate to the spatial pattern of slope and elevation?

Methods: I used a combination of ArcMap, QGIS and R to perform analyses and view results. I used the results of my previous distance analysis within the HJ Andrews Forest, which grouped individual trees into ten height classes and calculated the mean distance between trees within the same height class, to correlate tree spacing with other spatial phenomena. I wanted to know if hot spots in within class tree spacing correlated with hot spots in tree height, so I examined hot spots and cold spots of tree distances within each height class and compared them to tree heights, slope and elevation. Height class 1 is the shortest class of trees and height class 10 is the tallest class of trees.

I used the Hot Spot Analysis Tool in the Arc Toolbox > Spatial Statistics Tools > Mapping Clusters > Hot Spot Analysis (Getis-Ord Gi*) to perform a Hot Spot Analysis on each of the ten height classes by mean distance to the 30 closest trees of the same height class. In the context of this analysis, the interpretation of a hot spot is that it is a region of greater than expected distances between trees of the same height class. For example, in the shortest height class, 1, hot spots are regions of greater than expected mean distance between a short tree and the 30 closest short trees. Cold spots would then be regions of closer than expected mean distance between short trees.

The Hot Spot Analysis in ArcMap used a self-generated distance band of 113m for my original hot spot analysis of the global dataset (not broken up by height class), so I decided to use a distance band of 100m for each subsequent hot spot analysis. Each height class has a different number of total trees in it, so by holding the distance band constant, I hoped to avoid influence from any differences in total number of trees between height classes.

After viewing the hot spot results, I plotted the z-scores of heights for each height class against the z-scores of the distances between trees to visually examine their relationship. If both heights and distances between trees were perfectly normally distributed, one would expect a circular distribution on the density plots with a slope of zero.

I then compared the mean slopes, elevations, and standard deviation of slopes and elevation within height classes across the entire forest. Since HJA is a research forest with many different management areas, including harvested patches and research plots, I limited the next part of the analysis to only within control areas of the forest. I downloaded the most recent (2014) land use designations from the HJA data repository (http://andlter.forestry.oregonstate.edu/data/abstract.aspx?dbcode=GI008). For this analysis, I used Entity Title 3: Reserved areas (controls) within the Andrews Experimental Forest. I compared slopes and elevations within the control plots only by height class, to see if there were differences between the global dataset and the control regions of the forest.

Results:

The density plots of height z-score versus distance z-score revealed a different pattern between smaller height classes of trees and tree spacing than the relationship between larger height classes of trees and tree spacing. As we go from shorter height classes of trees to taller height classes, the density plot distributions change (Figures 1-10). There is strong evidence of positive correlation between hot spots of short trees and hot spots of distance between short trees, but from height class 6-10, there is little to no evidence of a relationship between hot spots of trees and distance between them. Tall trees are more or less distributed randomly throughout the forest.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

Figure 9

Figure 10

There is clearly some structure to the density plots (especially in height classes 1-5), so we can assume that the trees are not randomly distributed and that there is a relationship between height and distance between trees. I compared mean and standard deviation of slope, as well as mean and standard deviation of elevation for each height class of trees (Table 1). Mean slopes do not significantly differ between height classes, so slope is likely not a main driver of tree height. However, there is some evidence that tree heights differ at lower and higher elevations, with the shortest height class of trees at a mean elevation of 1014 m, and the tallest height class at a lower mean elevation of 831 m. It’s important to note that mean elevations have large standard deviations, so the trend may not be as strong as it first appears. I wanted to know if there was more evidence for this pattern, so I calculated the same statistics for subsets of the hotspot analyses constrained to only the control areas of the forest (Table 2) to get an idea of how management may or may not influence the relationship between tree height and tree spacing throughout the forest. Mean slopes and elevations, accounting for standard deviation from the mean, do not differ significantly between the global and control datasets, meaning that the control regions are reasonable representations of the rest of the forest. To examine this further, I examined the same data for the entire forest excluding the control areas (Table 3). The same pattern holds between the three datasets; class 10 is the only class that is consistently at lower elevations across the forest. I made two density plots to display this relationship. Figure 11 shows the distribution of elevation for the shortest height class (class 1) of trees (green) versus the global dataset of all trees (red). The short trees follow the same distribution as the rest of the dataset, meaning that they are dispersed more or less evenly across elevations. Figure 12 shows the distribution of the tallest height class of trees (class 10; light blue) by elevation versus the global dataset of trees by elevation (red). This clearly shows that tall trees are not distributed at higher elevations.

Figure 11. Density of trees by elevation in height class 1 (shortest trees; green) versus global dataset of all trees (red).

Figure 12. Density of trees in height class 10 (tallest trees; light blue) versus global dataset of all trees (red) by elevation.

Table 1: Global dataset

Height Class	Mean Slope	SD Slope	Mean Elevation (m)	SD Elevation
1	23	11.7	1014	294
2	23	11.6	1008	291
3	25	11.6	990	295
4	25	11.5	948	291
5	25	11.3	931	293
6	26	10.9	972	291
7	27	10.5	982	250
8	26	10.6	959	221
9	25	10.7	926	207
10	23	11.2	831	197

Table 2: Subset of control regions

Height Class	Mean Slope	SD Slope	Mean Elevation (m)	SD Elevation
1	27	11.9	1104	317
2	27	11.5	1136	328
3	28	11	1157	329
4	28	10.8	1143	311
5	28	10.2	1133	285
6	28	10	1071	262
7	28	10	992	228
8	27	10.6	955	193
9	26	10.8	935	187
10	24	11	869	189

Table 3: Global dataset excluding control regions

Height Class	Mean Slope	SD Slope	Mean Elevation (m)	SD Elevation
1	22	11.4	991	284
2	22	11.4	979	273
3	24	11.6	952	272
4	24	11.5	902	266
5	24	11.5	867	265
6	25	11.3	908	290
7	26	10.7	973	267
8	25	10.5	964	242
9	23	10.4	919	221
10	22	11.2	793	198

Critique of the method:

A criticism of hot spot analysis is that it’s basically a smoothing function that places a focal around an area but does not account for the distribution of values within that area. So, the tallest tree in the dataset could be in cold spot (region of shorter than expected trees) and the hot spot analysis would give you no indication of that, so one may miss out on potentially useful/interesting information.

This is only a cursory look at the data and a next step is to more closely examine how slope, elevation and aspect influence distribution and height of trees, particularly within the control areas of the forest.

Ex 2: Relationship between stream cross-sectional change and across-channel slope

For Exercise 1, I investigated autocorrelation in patterns of cross-sectional change across and along stream reaches.

For Exercise 2, I wanted to know whether these patterns of change were related to channel geometry.

Specifically, I wanted to know if erosion or deposition happened more frequently close to cut banks than further away from them.

I was originally going to investigate change in both the along-channel and across-channel directions, but after consulting with my classmates, I decided that I would not be able to draw as many conclusions in the along-channel direction because I don’t have good information about the spacing and orientation of my cross sections.

Methods:

To find cross-sectional change, I paired sequential years and calculated change along each cross section for each pair of years as I did in exercise #1.

I wanted to compare the change to the location of the steepest bank, but I couldn’t figure out how to identify what parts of a cross section counted as a “bank” using a computer. Instead, I looked at the spatial pattern of change in relationship to the point of steepest slope in the across-channel direction.

I used the original cross section profiles to identify the point of steepest across-channel slope. Since the point of steepest slope may move from year to year, I used the steepest slope from the first year in every year pair. I used the loess function in R to lightly smooth each cross-sectional profile before extracting slope in hopes of reducing the effects of some small bed features. This worked well in most cross sections, but in some cross sections, especially those with prominent midchannel features, the point of steepest slope occurred in the middle of the channel.

Once I had identified the steepest point in each of the cross section for each year pair, I calculated how far every other point in the cross section was from the steep point. Then, within each reach, I aggregated the data by distance, rounding to the nearest decimeter, and calculated the mean absolute elevation change (that is, counting both erosion and deposition as positive values). I wanted to see broad patterns overall, so the aggregate combines data for every cross section and every pair of sample years.

I plotted the resulting data from each reach. In the figure below, the colors represent how many horizontal centimeters of reach are aggregated into each point on the line graph. Bigger numbers and more blue colors represent averages from more cross sections and years while smaller numbers and more yellow colors represent distances where fewer cross sections or fewer years had data at that distance from the steep point.

Results:

The figure implies that perhaps a lot of channel change tends to happen very close to the steepest point, but then stabilizes. Far from the point, the average vertical change values become very unsteady, perhaps because fewer data points are integrated into the average.

Critique:

I thought that this was an interesting and fairly straightforward analysis to conduct, but I am not sure how physically meaningful the results are, since the steepest points are not placed in the same location in each cross section. The results figure looks a bit like a channel cross section itself, which I thought was very interesting! I wonder if this is because the averaging falls apart at a distance roughly equal to the average channel width or if there really is more change happening near channel banks on average in these streams.

Ex 2: A stain on the neighborhood… How does management relate to infection for black stain root disease?

Sorry for the bad puns in the title… Could not help myself.

QUESTION

How do spatial patterns of black stain root disease infection probabilities relate to spatial patterns of forest management classes? How do these spatial relationships differ between landscapes where similar management classes are clustered and landscapes where management classes are randomly distributed?

TOOLS AND APPROACH

I used neighborhood analysis to analyze the spatial relationship between hot and cold spots of infection probabilities and the three forest management classes simulated in my model (extensively managed plantations, intensively managed plantations, and passively managed old-growth stands).

I used a combination of ArcMap (to perform the majority of the procedure) and R (mostly for spatial data wrangling and analyzing and plotting results).

DESCRIPTION OF STEPS YOU FOLLOWED TO COMPLETE THE ANALYSIS

1. Compare the distribution of infection probabilities between landscapes and management classes

I performed this step in R to determine whether there was evidence of significant differences in the infection probabilities (a) between the clustered and the random landscapes and (b) between management classes both within and among landscapes.

2. Hotspot analysis

Converted my raster data of infection probabilities to point data (in R, using the “raster” package) and perform a hotspot analysis (hotspot tool in ArcMap) (Fig. 1). For the hotspot analysis, I used inverse squared distance weighting to conservatively include trees within hotspots.

I also created polygon shapefiles for areas identified as hotspots to the 99% confidence level in each of the landscapes and calculated the area of these hotspots for each landscape, management class, and landscape X management class.

Figure 1. Hotspot analysis overlaid on forest management classes for the “clustered” landscape. The same analysis was performed on the “random” landscape.

3. Select point for neighbor analysis

For this step, I chose one point in a hot spot and one point in a cold spot for each of the two landscapes to perform the neighborhood analysis. In the future, I would write a script to repeat this procedure with a random sample of a large number of points, but for this exercise, I just used one point for each.

For the hotspots, I only used points identified in hotspots at the 99% confidence level. For the cold spots, I used a point from the 99% confidence level for the clustered landscape, but my only options for the random landscape were points identified at the 90% confidence level. I visually selected the points for analysis (non-random), but I would not do so for my full analysis.

4. Create concentric ring buffers for the neighborhood analysis

I used the Multiple Ring Buffer tool in ArcMap to generate hollow ring buffers at a series of distances around each hotspot and cold spot point selected for analysis (Fig. 2). I then intersected these buffers with the management class shapefile and calculated the proportion of each buffer ring composed of each management class (by area). I plotted these proportions as a function of distance to complete the neighborhood analysis.

Figure 2. Concentric ring buffers used for the neighborhood analysis, with outer ring distances labeled. Example shown for the “random” landscape for both hot and cold spots.

RESULTS

Non-spatial analysis

Before performing the neighborhood analysis, I wanted to know whether there was any difference in the infection probabilities between the two landscapes and the three management classes. Visual assessment of the box plots (Fig. 3) led me to believe that there were significant differences between the means of the infection probabilities between the landscapes and management classes, which was supported by the results of student t-tests (p << 0.01). There were also significant differences in the infection probabilities between management classes both within and among the two landscapes. Since the outputs I am analyzing are “dummy data” because my model is not fully complete, this does not surprise me and I did not perform further, more rigorous statistical analysis. The higher infection probabilities simply reflect the density of trees in each of these management classes. Extensive management had the highest infection probabilities (highest initial planting density, evenly spaced), followed by old-growth (lower density but clustered trees) and finally intensive management.

Figure 3. Comparisons of infection probability distribution between (a) landscapes, (b) management classes, (c) management classes by landscape; (d) the proportion of each management class covered by an infection hot spot in each of the two landscapes (by area).

Neighborhood analysis

For the clustered landscape, the management class in which the point was located made up the largest proportion of the first several distances analyzed (cold spot: old-growth, hot spot: extensive). This makes sense given that the management classes were spatially clustered in this landscape. At a certain threshold, the proportion of this management class started to decrease, as the other two management classes increased in proportion at a similar rate. For the random landscape, the decrease in the starting management class proportion was relatively rapid with distance, and all three management classes converged at their landscape proportion by 150 meters, with some fluctuation (in both landscapes, each of the three management classes makes up about 1/3 of the total landscape).

Figure 4. Neighborhood analysis showing the proportion of each (hollow) ring buffer accounted for by each management class for hot and cold spot points in each of the two landscapes.

Critique of the method – what was useful, what was not?

Most of the drawbacks of this analysis were due to my process and the nature of my data. Because I only used one hotspot and one cold spot point for each of the two landscapes, it is difficult to say much from this analysis. If I were to automate the analysis and run it on a series of random points drawn from the hot and cold spots, I would get a lot more insight as to the patterns of neighborhood effects if any exist. In addition, the “dummy data” used for analysis come from model runs that only have local disease transmission (between neighbors at a radius of several meters). However, the full model will also incorporate long-distance dispersal by insect vectors (on the scale of kilometers), which will likely be much more interesting and less predictable when neighborhood analysis is performed.

Issues with the method itself is that it is quite time intensive – an issue that could be cleared up with a script written in Python or R to automate this process given a set of input files. If there is a way to do this already built into either of these platforms (or Arc/QGIS), I was not able to find it. Also, there is a lack of clear quantitative interpretation for these plots to separate statistically significant variability in the proportion of each management class at each distance from non-significant variability. A means of doing so would enrich the analysis.

Exercise 2: Using Neighborhood Analysis to Identify Relationships Between Seascape Classes and Rockfish Abundance Hotspots

Background and Question Asked

In Exercise 1, different interpolation methods were used to create a heat map of rockfish abundance based off of a large collection of point data. That blog discussed some of the challenges that arose while attempting to use a time series of point data with many points in close proximity to one another (if not overlapping). The exploring was in many ways successful: it was discovered that the Kriging method provided a more robust representation of the data than Inverse Distance Weighting. However, in the time since that post was published, my interpolation methods have been refined:

Instead of using the entire time series as an input for the interpolation, four individual years were selected to represent the whole dataset (2003, 2007, 2011, 2015). Kriging was then used to create heat maps for each individual year.
Additionally, the union tool was used to remove the land boundaries from the environment so that the interpolation only affected parts of the ocean
The symbology of the abundance point data was synced across all four years being used in the analyses so that they could be easily compared to one another
The symbology of the interpolated heat maps was also modified to be consistent throughout the analyses

For this exercise, I plan to compare my new, interpolated data to an already existing set of data, effectively comparing my two variables. Specifically, I hope to answer the question “Is there a spatial relationship between areas of significantly high and low rockfish abundances and specific seascape classes?”

An example of the most recent Kriging output using the 2007 data.

Name of Tool or Approach Used

I will be using a neighborhood analysis to seek an answer to this question. The neighborhood analysis requires taking areas of interest and examining the environmental conditions around that area from the perspective of another variable. By varying the distance from your original point of interest, a researcher is able to infer about the spatial relationship between the two variables.

Methods

Data Used

“Points of Interest” chosen from plot below
Buffers created around points of interest at 5km, 10km, and 25km radii
YOYRock Kriging Abundance Interpolation for 2007
Seascape NetCDF Raster File for May 5th, 2007

Rockfish abundance plotted against water column depth for trawls from 2007.

The first thing that was needed to complete this analysis was points of interest. I chose to use four points form the year 2007, as the data from this year provided the largest spatial footprint of all of the years of interest. Two of the points represented trawls that found significantly high rockfish abundance, and the other two represent trawls which found no rockfish. All four points vary spatially and physically (latitude, longitude, water column depth, etc). All points were selected from interpolated areas with different modeled outputs. Next, circular buffers were created around each point of interest with 5km, 10km, and 25km radii.

Map showing Points of Interest with circular buffers overlaid on seascape NetCDF file.

In order to use the overlay tool in ArcGIS, two polygon features are needed. In order to convert my NetCDF Raster files into a polygon, I used the Raster to Polygon tool. Once the seascape classes were converted to polygons, the Intersect tool was used to measure the shape area of each seascape class within each buffer. Those statistics were then converted to .xlsx files and summarized in Excel.

Results and Discussion

Example of data after Raster to Polygon and Intersect tools used

The neighborhood analysis found evidence that specific seascape classes may have impacted young of the year rockfish abundances in the locations selected to be a part of this analysis.

The low-abundance trawls were dominated by three seascape classes: Class 14 (Temperate Upwelling Blooms), Class 19 (Subpolar Shelves), and Class 21 (Warm, Blooms, high Nutrients). While there were more classes represented overall by the high abundance trawls, those areas were mostly dominated by two seascape classes: Class 7 (Unnamed) and Class 12 (Subpolar Nutrient). Additionally, there was very little overlap between the two areas – the only seascape class that appeared in both the high abundance radii and the low abundance radii was Class 14. Further analyses would be needed to determine if these trends are representative to the entire region or year, but this neighborhood analysis provides results that give us a place to start. Overall, I found this analysis to be extremely useful despite the number of steps needed to make it work. In addition to working in GIS normally, the data type of my seascapes had to be changed and much of my analysis had to be done in Excel, as ArcGIS cannot summarize key statistics. However, I feel as though streamlining this method could be done now that I am familiar with it.

Exercise 2: Cross Variogram and Kriging

In Volcanic Regions fluid flow paths are limited to the rubbley bases and flow tops of lava flows where permeability promotes transitivity.

Hypothesis:

The depths to first water will corresponds to depths located near lava flows.

Definitions:

Contact: location on the surface, or at depth where two different rock types touch.

Depth to first water: the depth from a particular well log where water was first noted, this is not always listed.

Depth to first lava: the depth at which the first lava is noted in a particular well log. There can be multiple contacts to a lava in a well log, which is why I specified first.

Figure 1: Block diagram of what well log depicts. The green and red planes represent contacts between the two rock types. Well (grey) and well logs record these contacts as depths from the surface (brown).

Question:

Does the depth to first water correspond to the depth to first lava in my data set?

Tool:

Cross Variogram and Kriging

Like the variogram, the cross variogram is a tool that allows you to compare spatial data at multiple scales. Unlike the variogram, the cross variogram compares one data set to another data set at multiple scales.

Kriging uses the variogram to interpolate a surface.

Brief Description:

In order to use both the variogram and the cross variogram you must normalize the data you are working with. Otherwise the semivariance values can range from 0 to infinity. Normalizing the data allows you to distinguish data that are correlated (semivariance<1) from data that have no correlation with eachother (semivariance>1).

In order to use the R function gstat, I had to turn the data into a spatial data frame.

The function gstat allows you to simultaneously create variograms for each of the induvidal data sets you are working with and compares them with each other. In this case I just compared the depth to first lava with the depth to first water.

I used Arc GIS kriging formula (ordinary) to krige a surface that represented the difference between between the depth to first lava and the depth to first water. In other words I subtracted the depth to first lava from the depth to first water and kriged that “surface”. I wanted to see if there was a spatial distribution around which those difference were low. I tried to use R to krige, but did not have the time to work out the krige function.

Results:

My results were strange, though ultimately unsurprising.

Figure 2: Variograms of my two variables water1 and lava1 as well as the cross variogram that comes from comparing the two. Note that the water1lava1 cross variogram has negative values for the semivariance.

Figure 3: Plot of well log data with the difference between first lava and first water plotted on top of its kriged surface.

The strangest thing to resolve from this exercise are the negative semivariance values for the cross variogram. Semivariance is a squared values and therefore should not have negative values. I have no idea what is happening here. I need to ruminate upon it. Either way, the data does not appear to be well correlates, or at the very least, I am not comfortable making conclusions about it with the negative semivariance values.

The ordinary Kriged surface interpolated from the difference between lava and water lets me know that the highest Kriged surface (Fig 3, white) between water and contact lies in the middle of the study area . In geographic and geologic space this corresponds to a basin filled with sediment and inter-fingered with lava. Many of the well logs in the region are not deep, they don’t have to be because the water here is near the surface, and close to some of these buried basalt flows. The data at the far edges of the map are spatial outlies, and thus we can’t look at any of the map that lies far from the main cluster of data points.

From physically looking at the well logs I know that while the well logs do often correspond to a lava flow, it is almost never the first lava flow. I am not surprised that the semivariance indicates that the data are not correlated.

Critique of the method:

what was useful, what was not?

It was not particularly useful, because it told me what I already know and left me with more questions than answers. However, I did walk away with some considerations. The cross variogram (and variogram) might work at a smaller scale.

In other words, if I broke my field area up in regions where I think the lava layers might source from the same place (Lassen Peak or Medicine Lake Volcano) I would be able to make the assumption that lava layers that are at similar depths in the well log correspond to the same lava flow in space. If we consider figure 1, in a small area we would be able to link the green layer to other locals where the green layers lies at depth, and would be able to spatially autocorrelate them with the variogram.

The next step, one I narrowed down my area, would be to correlate the depth to water with the depth to every lava flow I found in the well log. This would allow me to see which lava layer best corresponds to the depth to first water.

One of the things I discussed with my partner was trying to figure out what the negative values meant in my variogram. As I stated above, I still need to think about this, or figure out what I did wrong. I also discussed taking data out of lat-long space and into UTM space; that is something I am also still thinking about.

One final note: At the moment my data is both clustered around certain spots, and I do not have much of it. Every time I add a few data points, the shape of the variogram changes. Some of the spikiness I am seeing is likely from that.

Exercise 2: Neighborhood analysis of Texas counties with high colorectal cancer mortality rates

Question being asked

How do rurality indicator variables shift as distance increases away from Texas counties with high colorectal cancer (CRC) mortality?

In exercise 1, I used principal component analysis (PCA) to create a PCA-weighted rural index of the state of Texas using 3 scaled variables: population density, land development percentage, and median income. In this exercise I applied these same variables to determine how they change as distance increases away from the 4 Texas counties with the highest CRC mortality rates. To do this, I created multi-ring buffers around Anderson, Gonzales, Howard, and Newton county and computed averages of each rural indicator variable for each successive buffer “donut.” I hypothesize that as distance increases away from high CRC county centroids, rurality indicator measures will have more “urban” values (i.e. higher population density, higher percent developed, higher median income) and CRC mortality rates will decrease.

Tools and Data Sources Used

For this exercise, I utilized the intersection, feature-to-point, and multi-ring buffer tools in ArcGIS along with the latticeExtra/gridExtra plotting packages in R. The rural indicator data used in this analysis are from the same sources I used in Exercise 1: Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database). These data were then scaled using the same procedure from Exercise 1, which can be found here. Aggregated CRC mortality rates for Texas counties were obtained for the years 2011-2015 from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program U.S. Population Data – 1969-2015.

Methods

Attribute Table Wrangling: The Texas county indicator variables were linked to county polygons in my Exercise 1, but cancer mortality data was not. For polygon linkage in this exercise, I imported the mortality data excel sheet into Arc and used the join procedure to insert the data into the existing attribute table (with indicator variables) for county polygons.

Centroid & Multi-ring Buffer Creation: First, I utilized the point-to-feature tool in Arc to create a layer of county centroid points from the county polygon layer. Once the county polygons had been converted to centroids, I identified the 4 Texas counties with the highest CRC mortality rates. Then, using the select features tool and multi-ring buffer procedure, I selected each of the 4 counties separately and created multi-ring buffers at 50, 75, and 100 Km. These distances were chosen based on the size of the selected counties and the size of the full state of Texas.

Intersection & Donut Summary Statistics: Once the multi-ring buffer layers were created, I intersected each of the 4 buffer layers with the original county polygon layer containing all relevant variables. Then, mean via the summary statistics tool were computed in Arc for population density, percent developed land, and median income for each successive donut in the multi-ring buffers. The computed tables of buffer donut means for each variable and county were then exported to Excel files.

R Plotting: The Excel files were then imported into R and line plots of buffer means by distance were created using the xyplot function within the latticeExtra package. Plots were then combined into figures by county using the gridExtra package.

Results

This figure of scaled population density means for county multi-ring buffer donuts indicates varying trends for population density between the 4 high CRC counties. Two counties have increasing population density as distance increases away from centroids, and two have decreasing population density as distance increases away from centroid. Only the buffer map for population density was presented above for post conciseness and space limitations on this blog site. More specific neighborhood relationships between CRC death rates and all indicator variables can be seen in the line plots and explanations below.

Line Plots of County Indicator Variables Over Buffer Distances

The above plots display with more specificity than the buffer map that areas surrounding the 4 counties have differences in indicator variable trends as distance increases away from county centroids. Both Anderson and Newton counties largely follow the trend hypothesized: as distance from the county centroids increases, rural indicator variables have more “urban” values and CRC mortality rate decreases. For Newton county, this trend does not hold for median income, because as CRC mortality decreases away from the county centroid, median income also decreases. For the other other 2 counties, Gonzales and Howard, the hypothesized relationship does not hold, because as distance increases away from county centroids, the rural indicator variables become more “rural” as CRC mortality decreases. This indicates that the associations between CRC mortality and rural indicator variables are complex and that neighborhood analysis does not capture all relationships.

Critique

This sort of neighborhood analysis was effective at determining trends in rural indicator variables in the areas surrounding high CRC mortality counties in Texas. The buffer map produced great broad results, where more generalized trends can be determined. The line plots specifically were highly useful for visualizing more specific changes in indicator variables and CRC mortality rates over distance. These results should be considered in the light of some limitations. First, county level data was used for all variables and the buffer donuts may be too large or too small to capture the true neighborhood relationships in the analysis, as statistical procedures were not utilized to determine the distances. Further, more buffer donuts may have been useful to see more nuanced trends over distance. Secondly, as can be seen in the buffer map, there is a lack of CRC mortality data for many counties in west and southern Texas due to data suppression in order to preserve patient confidentiality. This presents significant bias in result interpretation, especially in Howard county, where many of the counties surrounding it have suppressed CRC mortality data.

My future analysis of this data will likely be a comparative confusion matrix of the PCA-weighted index I created in Exercise 1 and CRC mortality data used in this exercise.

Exercise 2: Spatial Patterns of Perceptions of Natural Resource Governance and Environmental Condition

Questions

What are the spatial patterns of natural resource governance perceptions in relation to environmental condition?

Environmental condition was represented as: (1) Environmental restoration sites, and (2) aggregate ‘environmental effects’ scores by census tract on a scale from (1) low to (10) high that include lead risk from housing, proximity to hazardous waste treatment storage and disposal facilities, proximity to nation priorities list facilities (Superfund Sites), proximity risk management plan facilities, and wastewater discharge.

Sub-questions related to the primary question are:

a1. Does being near a restoration site associate with individual perception? a2. Does being near more restoration sites associate with individual perception?

b1. Do the environmental effects where an individual lives associate with individual perception? b2. Do the environmental effects around an individual associate with individual perception?

Tools and Approaches

Nearest Neighbor analysis in ArcGIS Pro and R studio
Geographically weighted regression in R studio
Neighborhood analysis in ArcGIS Pro

Analysis Steps

Nearest neighbor analysis was used to examine questions a1 and a2. First, a file of restoration sites was loaded into R. The sites were points in the same projection as my participant location points. This analysis required four libraries (as the points were from different files. The libraries were: nlem, rpart, spatstat, and sf. To use the tool I needed, I first had to turn my points into point objects in R (.ppp objects). First I used the convenxhull.xy() function to create the ‘window’ for my point object, then I used that to run the ppp() function. After doing this for both sets of points, I was able to use the nncross() function. This function produced min, max, average, and quartile distances from individuals to the nearest restoration site. I added the ‘distance’ from the nearest neighbor as a variable in a linear regression to determine an association for a1.

To examine a2, I used these distances (1^st quartile, median, and 3^rd quartile) to produce buffers in ArGIS Pro. After creating the buffers, I ran spatial joins between them and the restoration sites. This produced an attribute table that had a join count—the number of restoration sites within the buffer. I exported the three attribute tables from the three buffer distances back to R. In R I ran a linear regression with join count as an independent variable.

To test b1, I preformed geographically weighted regression. I used both ArcGIS Pro and R studio to run this analysis. Initially I used the GWR tool in Arc to run the regression, but wanted the flexibility of changing parameters and an easily readable output that R provides. First I joined individual points to my rank data, a shapfile at census tract level. This gave individuals a rank for environmental effects at their location. In R, I used two different functions to run GWR, gwr(), and gwr.basic(). The gwr() required creating a band using gwr.sel, and the gwr.basic required creating a band using bw.gwr. The difference between these functions is that gwr.basic produces p-values for the betas. I ran gwr on both my entire data set and a subset based on perceptions. The subset was the ‘most agreeable’ and ‘least agreeable’ individuals who I defined as those one standard deviation above and below the mean perception.

I preformed neighborhood analysis to test the final question, b2. First, I created a dataset that was just the upper and lower values of my governance perceptions (one standard deviation above and below the means. I then added buffers to these points at 1 and 5 miles. I then joined the buffers in ArcGIS Pro to the rank values to get an average rank within those buffers. I exported the files for each buffer to R. I R I created a density plot of average rank for the low governance values at each buffer, and for the high governance values at each buffer.

Results

The median distance for individuals from restoration site was 0.037 degrees, 1^st quartile was 0.020 degrees, and 3^rd quartile was 0.057 degrees.

The regression on whether distance correlates with individual perception was insignificant (p = 0.198). This led to the conclusion that distance from the nearest restoration site does not influence perceptions.

For each regression on the number of sites near an individual, all coefficients were negative. This implies that the more sites near an individual, the more disagreeable their perception was. All produced significant results, but the effect size of number of sites near individuals was very minimal (Table 1).

Table 1. Regression results for ‘nearest neighbor’ of individuals to restoration sites.

Buffer Size	b	p-value	Effect Size (r_pb)

Buffer 1 (1^st quartile)	-0.003	0.0181	.003
Buffer 2 (median)	-0.002	0.045	.002
Buffer 3 (3^rd quartile)	-0.001	0.002	.003

For the geographically weighted regression, I am presenting the results from the gwr.basic() model. For this model, I included demographic variables to control for these factors. In the model, rank, life satisfaction, years lived in the Puget Sound, and Race are significant (Table 2). All other variables were insignificant, so I will not discuss their trends. For rank (the main variable of interest), the coefficient was positive. In this case higher rank values are worse environmental effects, so as agreeable perceptions increase, environmental condition decreases. Life satisfaction is a variable of how satisfied individuals are with their life overall, which correlated positively with perception (Table 1). Years lived in the Puget Sound correlated negatively, or perception decreased the longer someone lives there (Table 1). Race, a dummy variable of white or not-white, indicated higher perceptions were held by white individuals.

Overall, the effect size of this model on governance perceptions was small, explaining about 10% of the variance in the data (Table 1).

Table 1. Regression results for environmental effects at individuals’ locations.

Variable¹	b	p-value²

Rank	0.077	0.007**
Life Satisfaction	0.478	<0.001**
Years Lived in the Puget Sound	-0.010	0.002**
Sex	-0.156	0.289
Area	-0.150	0.179
Education	-0.002	0.922
Income	-0.022	0.622
Race	0.006	0.034*
Ideology	0.103	0.533

¹R² = 0.094

²** = significant at the 0.01 level, * = significant at the 0.05 level

The plot of high and low governance values at the two buffers is presented below. The black and red curves represent respondents from the survey that were at least one standard deviation lower than the mean (the mean was neutral). The black curve is average rank with a one mile buffer, and the red curve is average rank at the five mile buffer. The green and blue curves represent respondents from the survey that were at least one standard deviation higher than the mean. The green curve is the average rank at the one mile buffer, and the blue curve is the average rank at the five mile buffer.

What this figure indicates is that there are two peaks in average rank at low environmental effects (~1) and at mid-environmental effects (~4.75). Those with lower perceptions of environmental governance had higher peaks at low environmental effects for each buffer size. Those with higher perceptions of environmental governance had a bimodal distribution with peaks at low and mid-environmental effects. The bimodal nature of these density functions leads me to believe there is some variable moderating governance perception related to environmental effects.

Critique

The methods I used were all helpful in determining the spatial relation of environmental condition to perceptions of natural resource governance. However, switching between the two programs (ArcGIS and R) was a little bit of a hassle. R has greater flexibility when running analyses, as in running is easier. This meant it was ideal for the analyses that required me to tweak my models multiple times. I also ran into little problems with Arc in terms of loading the data which made everything run slower, but Arc is more intuitive in terms of finding and executing analyses/

Does the spatial arrangement of vegetation cover influence ventenata invasion?

Question Asked

To predict the invasion potential of a species, it is necessary to understand the spatial pattern of the invasion in relation to landscape scale variables. For exercise 2, I explored how the spatial pattern of invasion by the recently introduced annual grass, Ventenata dubia (ventenata) relates to the spatial pattern of vegetation cover categories throughout the Blue Mountain Ecoregion of eastern Oregon.

Tools and Approaches Used

To unpack this question, I performed a neighborhood analysis to explore how the proportion of different vegetation type cover differ at increasing distances from plots with high versus low ventenata cover.

The neighborhood analysis required several steps performed in ArcGIS and in R:

I split my sample plot layer in ArcGIS into two layers – one containing plots with only high ventenata cover (>50%) and one containing plots with only low cover (<5%).
I buffered each plot by 10m, 50m, 100m, 200m, 400m, and 800m using the “buffer” tool in ArcGIS and then erased each buffer layer by the preceding buffer layer to create “donuts” surrounding each sample points using the “erase” tool in ArcGIS (Fig. 1).
I brought in a vegetation cover categories raster file (Simpson 2013) that overlaps with my study area and used the “tabulate area” tool in ArcGIS to calculate the total cover of each vegetation type (meadow, shrub steppe, juniper, ponderosa pine, Douglas-fir, grand-fir, hardwood forest/ riparian, and subalpine parkland) that fell within each buffer for every point. I repeated this for high and low ventenata points.
Finally, I consolidated the tables in R and created a line graph with the ggplot2 package to plot how the proportion of vegetation type differed by buffer distance from point (Fig. 2). Cover represents percent cover of each vegetation type at each buffer distance. Error bars at each distance represent standard error. VEDU refers to the plant code for Ventenata dubia (ventenata).

I was also curious to how the high and low points differed from random points in the same area. To explore this I:

Created 110 random points that followed the same selection criteria of 1000m proximity to fire perimeter used to select the ventenata sampling points.
Repeated steps 2 through 4 above to graphically represent how vegetation cover differs as a function of distance from these random points in relation to low and high ventenata points (Fig. 3).

Results & Discussion

My analysis revealed that vegetation type differs between high and low ventenata sites and random sites within the study area. The high ventenata plots were located entirely in ponderosa pine and shrub steppe vegetation types, but as distance increased from the plots, the distribution of about half of the vegetation types became more evenly distributed (Fig. 2). Ponderosa covers over 75% of the high ventenata 10m buffer areas with shrub steppe making up the remaining 25%. However, as distance increased, ponderosa cover dropped sharply to under 35% at 400m. Shrub steppe gradually declined throughout the 800m distance, and was surpassed by grand fir and Douglas fir by 800m. Meadows covered about 10% of the 50m buffer but declined to about 5% by the 400m buffer. The remaining vegetation types, juniper, riparian, and subalpine fir, were consistently under 5% cover throughout the buffer analysis.

In the low ventenata sites, shrub steppe vegetation was the most dominant, but the distribution was spread more evenly across the vegetation types than in the high ventenata sites (Fig. 2). Shrub steppe vegetation droped from 45% to 30% from the 10m to the 50m buffer, and then remained relatively constant throughout the remaining buffer distances. Like the high ventenata sites, grand-fir gradually increased in cover throughout, becoming the most dominant vegetation type of the 800m buffer. Unlike the high sites, ponderosa pine made up only about 10% of each buffer. Riparian vegetation was the only cover type that remained 0 throughout all the buffers.

In the random sites, the distributions of vegetation type were steady throughout the 800m, with only small fluctuations in cover with increasing distances (Fig. 3). Shrub steppe vegetation type was the highest at about 30% throughout, followed by juniper, ponderosa pine, and grand fir at about 20% cover.

This analysis demonstrates that ventenata could be dependent on specific vegetation types not only at the sample location, but also in the vicinity surrounding the sample area. This is evident in the high ventenata analysis where ponderosa pine cover remains much higher than the low sites and the random sites throughout the 800m buffered area. This analysis also depicts my sample bias as it demonstrates which community types I was targeting for sampling (shrub steppe and dry forest communities), which may not be representative of the area as a whole (as demonstrated in the random points analysis).

Critique of Method

The neighborhood analysis was a useful way of visualizing how vegetation type changes with distance from high and low ventenata points and may have helped uncover the importance of large areas of ponderosa pine as a driver of invasion; however, the results of the analysis could be a relic of my sampling bias towards shrub steppe and dry forest communities rather than an absolute reflection of community drivers of ventenata. The vegetation layer that I used was also not as accurate or as detailed as I would have liked to capture the nuance of the different shrub steppe and forest community types that I was attempting to differentiate in my sampling. If I were to do this again, I would try to find and use a more accurate potential vegetation layer with details on specific community attributes. Additionally, the inclusion of error bars was not possible using the “multiple ring buffer” tool in ArcGIS, so, I instead had to make each buffer distance as a separate layer and erase each individually to maintain the variation in the data. I like the idea of the random points as a sort of randomization test; however, more randomizations would make this a more robust test. With more time and more knowledge of coding in ArcGIS/ python, I would attempt a more robust randomization test.

Simpson, M. 2013. Developer of the forest vegetation zone map. Ecologist, Central Oregon Area Ecology and Forest Health Program. USDA Forest Service, Pacific Northwest Region, Bend, Oregon, USA

On the relationship between spatial variation of economic losses and tsunami momentum flux

Question

For Exercise 1, I evaluated the spatial pattern of economic losses resulting from a joint earthquake and tsunami event. I then deaggregated the losses by hazard (earthquake only, tsunami only, and combined) as well the intensity of the event.

For Exercise 2, I evaluated how the spatial pattern of economic losses resulting from a tsunami relates to the spatial pattern of tsunami momentum flux (a measure of velocity and inundation depth) by performing a geographically weighted regression (GWR). For this analysis, I only considered the tsunami because there is significant spatial variation in the hazard, whereas the spatial variation for the earthquake is minimal.

Tool and approach

I performed the GWR using the python library PySAL (Python Spatial Analysis Library). The independent variable was defined as the momentum flux, and the dependent variable defined as the percent loss of economic value.

Description of steps

The average losses at each building resulting from an earthquake/tsunami loss model were first converted to percent loss (loss divided by real market value), and added as an attribute to a GIS shapefile. The percent loss was used as opposed to the economic losses because each building has a different initial value. Consequently, the percent loss serves to normalize the economic losses across all buildings within Seaside. For this analysis, the results from the “1000-year” tsunami event were analyzed.

The GWR was then performed using PySAL with the momentum flux defined as the independent variable and the percent loss defined as the dependent variable. The GWR resulted in a slope and intercept at each tax lot, as well as a global r² value. Two separate maps were generated wherein each tax lot was color coded based on values of the slope and intercept.

Description of results

The results from the GWR and a global regression are shown in Figures 1 and 2 respectively. A global r-squared value of 0.575 was obtained, indicating that the data is moderately correlated. In Figure 1, it can be seen that the intercept is larger near to the ocean, and decreases as the distance to the shore increases. This can be explained by the fact that the momentum flux is the largest near to the coast, and decreases as the tsunami propagates over the land.

Similar trends would be expected for the slope coefficients; however, it can be seen that along the coast the results are negative indicating that the economic losses decrease as the momentum flux increases. This can likely be explained by inconsistent building types within Seaside. For example, concrete buildings are able to better withstand the impact of a tsunami compared to their wood counterparts. Similarly, buildings of different heights (number of floors) have different damage properties. Consequently, because the building types are not consistent within Seaside, significant variations in the percent of loss within a small spatial region can occur (e.g. a wood building is located next to a concrete building). This would lead to a decrease in percent loss for a larger momentum flux.

Figure 1: Spatial variation of slope and intercept resulting from the GWR

Figure 2: Global regression and line of best fit

Critique

While the GWR does provide a means to evaluate correlation between two variables that are within the same geographical region, there are limitations for this particular application. The results showed negative slopes in some locations, which is likely caused by the large variation in the percent loss. To alleviate this, alternative statistical models could be developed using GWR that only consider similar building types. An example of a non-spatial regression for wood buildings with 2 and 3 floors can be seen in Figure 3. The improvement in r-squared values can be observed, and would likely translate to the GWR.

Figure 3: Example of global regression considering specific building types

Fire Refugia’s Effects on Clustering of Infected and Uninfected Western Hemlock Trees

Overview

For Exercise 1, I wanted to know about the spatial pattern of western hemlock trees infected with western hemlock dwarf mistletoe. I used a hotspot analysis to determine where clusters of infected and uninfected trees were in my 2.2 ha study area (Map 1). I discovered a hot spot and a cold spot, indicating two clusters, one of high values (infected) and one of low values (uninfected).

In my study site, 2 fires burned. Once in 1829, burning most of the stand, and then again in 1892, burning everywhere except the fire refugia (polygons filled in blue). This created a multi-storied forest with remnant trees located in the fire refugias. One component of the remnant forest are infected western hemlocks. These remnant hemlocks serve as the source of inoculum for the hemlocks regenerating after the 1892 fire.

For Exercise 2, my research question was: How does the spatial pattern of fire refugia affect the spatial pattern of western hemlock dwarf mistletoe?

I predicted that a cluster of infected western hemlocks are more likely to be next to a fire refugia than a cluster of uninfected trees. In order to assess this relationship, I used the geographically weighted regression tool in ArcMap.

Geographically Weighted Regression

Geographically weight regression (GWR) works by creating a local regression equation for each feature in a data set you want to analyze, using an explanatory variable(s) to predict values for the response variable, using the least squares method. The Ordinary Least Squares (OLS) tool differs from GWR because OLS creates a global regression model (one model for all features) whereas GWR creates local models (one model per feature) to account for the spatial relationship of the features to each other. Because the method of least squares is still used, assumptions should still be met for statistically rigorous testing. The output of the GWR tool is a feature class of the same type as the input, with a variety of attributes for each feature. These attributes summarize the ability of the local regression model to predict the actual observed value at that feature’s location. If you have an explanatory variable that explains a significant amount of the variation of the response variable, this is useful for seeing how its coefficient varies spatially.

Execution of GWR

To use this tool, I quantified the relationship between the trees and the fire refugia. I used the “Near” tool for this to calculate the nearest distance to a fire refugia polygon’s edge. This was my explanatory variable. My response variable was the z-score that was output for each tree from the Optimized Hot Spot Analysis. Then I ran the GWR tool. I then used the Moran’s I tool to check for spatial autocorrelation of the residuals. This is to check the clustering of residuals. Clustering indicates I may have left out a key explanatory variable. The figure below displays my process.

I tested the relationship between nearest distance to a fire refugia polygon’s edge and the z-score that was output for each tree from the Optimized Hot Spot Analysis using OLS, which is necessary to develop a well specified model. My R² value for this global model was 0.005, which is incredibly small. Normally I would have stopped here and sought out other variables to explain this pattern, but for this exercise I continued the process.

Results

This GWR produced a high global R²value of 0.98 (Adj R² 0.98) indicating that distance to refugia does a good job of explaining variance in the spatial pattern of infected and uninfected trees. However, examining the other metrics for the local model performance gives a different picture of model performance.

Map 2 displays results for the coefficients for the explanatory variable of distance to nearest refugia. As this variable changes, the z-score increases or decreases. These changes in z-scores indicate a clustering of high or low values. From examining the range of coefficient values, the range is quite small, -0.513 to 0.953. This means that across my study site, the coefficient only changes slightly from positive to negative. In the north western corner, we see a cluster of positive coefficient values. Here, as distance to refugia increases, the z-score of trees increases, predicting a clustering of infected trees. These values are associated with high local R² values (Map 4). In other places of the stand we see slight clustering of negative coefficients, indicating distance to refugia decreases the z-score of trees, predicting a clustering of uninfected trees.

Map 3 displays the standardized residuals for each tree. Blue values indicate where the local model over-predicted what the actual observed value was, and red values are under-predictions. When residuals from the local regression models are distributed randomly (i.e. not clustered or dispersed) over the study area, then the geographically weighted regression model is fit well, or well specified. The residuals of the local regression models were significantly clustered. (Moran’s Index of 0.265, p-value of 0.000, z-score of 24.344). Because we can observe clustering in my study area of residuals, there is another phenomenon driving the changes in z-scores; in other words, driving the clustering of infected and uninfected trees.

From the previous two map evaluations I saw that the distance of a tree to fire refugia was not the only explanatory variable necessary to explain why infected and uninfected trees clustered. Map 4 displays the local R² values for each feature. The areas in red are high local R²values. We see the northwestern corner has a large number of large values which correspond to a cluster of small residuals and positive coefficients. Here, distance to fire refugia explains the clustering of infected trees well. The reverse is observed in several other places (clusters of blue) where distance to fire refugia does not explain why infected or uninfected trees cluster. In fact the majority of observations had a local R² of 0.4 or less. From this evaluation, I believe this GWR model using distance to refugia does a good job of explaining the clustering of infected trees, but not much else.

Critique

GWR is useful for determining how the coefficient of an explanatory variable can change across an area. One feature in a specified area may have a slightly different coefficient from another feature, indicating these two features are experiencing different conditions in space. This allows the user to make decisions about where the explanatory has the most positive or negative impact. This result is not something you can derive from a simple OLS global model. This local regression process is something you could do manually but the tool in ArcMap makes this process easy. The output of GWR is also easy to interpret visually.

Some drawbacks are that you need to run the OLS model first for your data to determine which variables are significant in determining your response variable. If not, then a poorly specified model can lead to inappropriate conclusions about the explanatory variable (i.e. high R² values). Also, the evaluation of how the features interact in space is not totally clear. The features are evaluated within a fixed distance or number of neighbors, but there is no description for how weights are applied to each neighboring feature. Lastly, for incidence data, this tool is much harder to use if you want to determine what is driving the spatial pattern of your incidence data. Some other continuous metric (in my case a z-score) must be used as the response variable, making results harder to interpret.

Model Results Follow-Up

After finding that distance to a refugia was not a significant driver for the majority of trees, I examined my data for other spatial relationships. After a hotspot analysis on solely the infected trees, I found that the dispersal of infected trees slightly lined up with the fire refugia drawn on the map (Map 5).

Among other measures, forest structure was used to determine where fire refugia were located. Old forest structure is typically more diverse vertically and less clustered spatially. Also infected western hemlocks are good indicators of fire refugia boundaries because as a fire sensitive tree species, they would not survive most fire damage and the presence of dwarf mistletoe indicates they have been present on the landscape for a while. From the map we can see that the dispersal of infected trees only lines up with the refugia in a few places. This mis-drawing of fire refguia bounds may be a potential explanation for under-performance of the GWR model.

Courtney’s EX2: Comparing faults and principal components

Question asked in this exercise:

How does ion principal component 2 at a well vary with the well’s distance from faults along the groundwater flow path?

In EX1, I used principal component analysis to evaluate how parameters accounted for variance between the wells I studied. Based on my knowledge of how chemistry varies as water flows through the basalt, ion PC2 accounts for variance caused by ion exchange between the basalt and the groundwater, with increasing sodium ion concentration/pH and decreasing calcium/magnesium ion concentrations as the water spends more time underground.

In this exercise, I estimated the groundwater flow directions in my study area using interpolation, calculated fault incidence direction, calculated angular difference between flow direction and fault direction, and then manual measurement of the distance between each well and the distance to the nearest fault segment that had flow direction within 45 degrees of parallel to the fault along the estimated flow path.

Name of tool or approach used:

Interpolation, reclassification, raster math, distance measurement in ArcGIS Pro

Methods:

Input data:

2018 static water levels in wells provided by Oregon Water Resource Department (OWRD)
Well lithology from OWRD groundwater information system (GWIS)
Well seal depth from well logs accessed through OWRD GWIS
Fault polyline shapefile from Madin and Geitgey 2017
Well locations from OWRD database, with ion concentration information based on my sampling in the summer of 2018

Steps:

Classified wells in the static water level dataset by the basalt formation that they were open to, based on lithology and seal depth. Excluded wells that lacked this information.
1. Output: wells classified as open to Saddle Mountain Basalt (Smb) Wanapum Basalt (Wb), Grande Ronde Basalt (Grb), or both Wanapum and Grande Ronde Basalt (WbGrb). These classifications were joined to the static water level information.
Created an interpolation surface for static water levels of wells open to both the Wanapum and Grande Ronde. I ignored the Saddle Mountain formation, since the wells that I had sampled were not open to it. I used kriging with a cell size of 200 ft, and this created an estimated potentiometric surface for these wells. The interpolation was a bit ugly because my wells were not ideally distributed for this.
1. I tried creating interpolations based on other combinations of formations to better approximate the potentiometric surfaces posited by past studies in the region. I ended up creating two potentiometric surfaces: one using wells that were only open to the Wanapum Basalt (Wb), and a second using wells that were open only the Grande Ronde Basalt as well as those that were open to both the Grande Ronde Basalt and Wanapum Basalt(WbGrb_Grbonly).
Calculated flow direction in the two aquifer groups – Wb and WbGrb_Grbonly
1. Used the Hydrology toolbox to fill sinks and then calculate flow direction.
2. This creates out a raster with eight possible values between 1 and 256.
3. I then reclassified it so the values corresponded to the eight primary cardinal directions (N, NE, E, SE, S, SW, W, NW), which range from 0 to 360 degrees
Calculated fault incidence angle
1. Added a cell to the polyline attribute table called “angle”
2. Split the polyline at each vertex, which creates a new shapefile
3. Used the field calculator and a python script to assign an angular value between 0 and 359 to each fault polyline segment
4. Performed the polyline to raster function, using the angle as the cell value. I used a cell size of 200 ft.
5. This created a raster where only pixels that include part of a fault polyline had values, and those values ranged between 0 and 359.
Subtracted the flow direction raster from the fault incidence angle raster, in order to create fault line pixels that had values that reflected the difference between the fault incidence angle and flow direction. I did this twice, once each for the Wb and WbGrb_Grbonly rasters
1. Reclassified this raster so that pixels with fault direction within 45 degrees of parallel to the flow direction were 0, and the pixels with fault direction with 45 degrees of perpendicular to the flow direction were 1.
2. I then added the WB and WbGrb_Grbonly results together, so that pixels with fault direction within 45 degrees of parallel to the flow direction in both were 0, and the pixels with fault direction within 45 degrees of perpendicular in either raster to the flow direction were 1, and pixels with fault direction within 45 degrees of perpendicular in both rasters were 2. I named this allwbgrb_ff.
On a map layout, I added my sampled sites with PCA data, the interpolated potentiometric surface for wells open to the Wanapum and Grande Ronde aquifers, and allwbgrb_ff.
1. I added a field to the sampled sites with PCA data called “dist_from_fault”
2. Using the measure tool, I measure the distance from each well along the path most perpendicular to potentiometric contours to the nearest allwbgrb_ff pixel with a value of 1 or 2.
3. This was subjective because the potentiometric surface is imperfect because of the erratic spacing of wells. In areas where the potentiometric surface had noticeable glitches, I used my own judgement based on topography and literature about groundwater flow direction in the region.
I then graphed the “dist_from_fault” against the ionsPC2 category.

Discussion and Results:

I hypothesized that ion PC2 would increase with decreasing potentiometric surface elevation. An increased score in ionPC2 indicates an elevation sodium concentrations and pH and a decrease in calcium and magnesium caused by a progressive ion exchange reaction between the groundwater and the basalt. Because water flows from higher potentiometric elevations to lower potentiometric elevations, I would expect water samples from lower potentiometric elevations to show chemical evidence of increased interaction with the basalt. If this process of down-gradient groundwater flow were the only process influencing the ion exchange reactions, the well symbols on the map below would become progressively darker as the interpolated well level surface elevation decreased and the wells were further from the up-gradient recharge zone.

However, upon examining a map of ion PC2 values this is not the case – there are anomalously low values of PC2 in the valley, where one would expect to see increased values if groundwater flowed uninterrupted from the up-gradient recharge zones. In this study I introduced the variable of distance from faults in order to test another hypothesis: that faults compartmentalized groundwater flow, blocking lateral flow through the aquifer while promoting vertical permeability and modern recharge into the down-gradient aquifer. I also hypotehsized that if a fault was a barrier, PC2 values up-gradient of the fault would be elevated as the fault trapped water behind it. This would result in more chemically evolved groundwater backed up behind the fault, and less chemically evolved groundwater down-gradient of the fault.

The results of this study tentatively support the conceptual model of fault compartmentalization. In particular, water samples from wells in the valley down-gradient of the fault zones have evidence of less exposure to the basalt than wells further towards the recharge zones. 15 of the 18 wells sampled show a positive correlation between distance from a fault along a flow path and ion PC2 score, especially when graphed points are compared to their up-gradient and down-gradient neighbors (i.e. 57946 being up-gradient from 57235 and down-gradient from 54277).

Because of the width of the raster cells indicating faults and their flow direction in this model, four wells ended up have a distance from faults of 0. This does not seem unreasonable, because examination of exposed fault zones in the area indicated that many are up to a couple hundred meters wide. Additionally, while geological studies of the area indicate that the faults are close to vertical, their exact dip angles are unknown and this introduces a certain amount of uncertainty about their location at depth. Ion PC2 values of wells with dist_from_fault = 0 show an interesting dichotomy of either very high (4167, 4179) or very low (3929, 3962) values. I believe this indicates that wells 4167 and 4179 are up-gradient of faults that are acting as barrier, while 3929 and 3962 are slightly down-gradient or within the fault damage zone itself.

map of fault locations, potentiometric surface, and wells, plus a chart of distance vs ion PC2

Critique of Method:

This is admittedly a crude method of estimating groundwater flow direction. However, a certain amount of estimation is necessary to model a system with relatively few and unevenly spaced measured water level points in a hydrogeologic regime with many individual water-bearing layers of lava interflow zones that are unpredictably connected. I wish I had been able to find an automated way to measure well distance from faults along the groundwater flow path, which would have taken away some of the subjectivity of measuring the flow paths by hand the old-fashioned way.

Because of the uneven spacing of wells used for interpolation, the flow direction rasters that I created have glitchy areas and some of these areas of physically improbable values influenced the Fault and GW Flow Direction raster. I reclassified that raster into broad categories to avoid creating a false sense of precision.

Spatial Pattern of NDVI change from the Center of an Artisanal Gold Mine

For Exercise 2, I wanted to explore how the spatial distribution of land use/land cover (LULC) varied in relation to an artisanal, small-scale gold mine (ASGM). To do this, I took the NDVI change maps which I had generated for Exercise 1 (modified slightly to produce better, more accurate results), as well as my dataset of mapped footprints of known areas of ASGM activity, found the median center of those areas, generated buffers/donuts at 250m apart, from 250m to 2000m, clipped the NDVI change layer to each buffer, and counted the amount of each type of NDVI pixel contained within each buffer. In addition, I performed this same analysis around non-mining villages, to examine how the spatial pattern of NDVI loss changes with distance from the center of non-mining villages. With this data, I could generate a chart showing how the percentage of pixels representing loss or decrease in NDVI changed as you moved further away from the center of mining activity. This examination looked at nine mining villages and nine non-mining villages

In order to perform this analysis, I continued my work using ArcGIS Pro, in addition to Excel for a bit of charting work.

To begin, I imported my NDVI change map, which detailed increases and decreases in NDVI values between imagery taken in April 2018 with Landsat 8, and imagery taken in April 2007 with Landsat 5, representing 11 years of NDVI change. I also imported my shapefile containing polygons which I had digitized over my high-resolution satellite imagery depicting areas of known ASGM activity. With this shapefile, I used ArcGIS’s Median Center tool, which found the median center of the mining areas near the village of Douta (fig. 1). From there, I generated buffers/rings at 250m intervals (e.g. 0-250m, 251-500m, 501-750m, etc.), from 250m to 2000m around this median center. I then used the clip tool to clip the overall NDVI change layer to each buffer, resulting in NDVI change for 0m to 250m from center, NDVI change for 251m to 500m from center, and so on.

Fig 1: Center of the village of Douta with a 2km NDVI buffer; the blue buffer shows the 1750m extent, whose values were subtracted from the 2000m values, to only give values between 1751 and 2000m

Once I had accomplished this, I generated histograms for each buffered NDVI change layer in order to count the amount of pixels contained within each buffer, and assign them to one of two classes: negative values, representing overall NDVI decrease from 2007 to 2018, and positive values, representing overall NDVI increase from 2007 to 2018. I did not account for magnitude of change, as I wanted a general idea of how NDVI was changing from the center. Fig. 2 shows an example of these histograms, specifically for the 500m buffer. The values from the previous buffers, e.g. the one to 250m, were subtracted to only show values from 251 to 500m.

Fig. 2: Douta histogram

I entered all the pixel values for negative change at each distance into Excel, as well as all of the pixel values for positive change, and was able to generate a chart showing how the percentage of the overall pixels in each subsequent buffer from center representing NDVI decrease change over distance (fig. 3)

Fig. 3: Pale blue lines indicate individual mining villages. The dark blue line indicates the average of those villages. Thin red lines indicate non-mining villages. The dark red line indicates the average for those villages.

On the whole, this exercise was useful for illustrating the problem I’m attempting to grapple with. I was frustrated with the lack of Landsat imagery from the late 2000s — I was unable to find any Landsat 5 imagery corrected for surface reflection aside from the year 2007. Additionally, there are problems with comparing this 2007 image to the 2018 image. I found that the rainy season before the 2018 image was taken, in 2017, was wetter than average, while I was unable to determine the rain fall that preceded the image in 2007. As such, it is possible that the 2018 Landsat 8 image shows a non-normal vegetative pattern — or it’s even possible that the 2007 image is showing a non-normal vegetative pattern! I require some more investigation into the historical meteorology of the area before I can say either way. Regardless, I feel that this is a useful first step in investing how LULC change relates to the establishment of ASGM.