Category Archives: Exercise 1

Multi-Distance Spatial Cluster Analysis for Woodpecker Nests (Ripley’s K)

Exercise 1: Multi-Distance Spatial Cluster Analysis (Ripley’s K)

Questions 

How is the spatial presence of postfire woodpecker nests related to the spatial presence of salvage-logged forest stands?

  • How are woodpecker nests clustered within survey units? (Exercise 1 and 3)
  • How does this clustering relate to salvage treatment units within the survey units? (Exercise 2 and 3)

Tool

Multi-Distance Spatial Cluster Analysis (Ripley’s K) in R

Multi-Distance Spatial Cluster Analysis (Ripley’s K) analyzes point data clustering over a range of distances. Ripley’s K indicates how spatial clustering or dispersion changes with neighborhood size. If the user specifies distances to evaluate, starting distance, or distance increment, Ripley’s K identifies the number of neighboring features associated with each point if the neighboring features are closer than the distance being evaluated. As the evaluation distance increases, each feature will typically have more neighbors. If the average number of neighbors for a particular evaluation distance is higher/larger than the average concentration of features throughout the study area, the distribution is considered clustered at that distance.

K Function Graphic

 

 

 

 

 

 

 

 

 

 

 

 

 

Referenced from ArcGIS Desktop 9.3 Help (http://resources.esri.com/help/9.3/arcgisdesktop/com/gp_toolref/spatial_statistics_tools/how_multi_distance_spatial_cluster_analysis_colon_ripley_s_k_function_spatial_statistics_works.htm)

Data Description

My polygon dataset includes 10 survey units (6 treatment, 4 control) totaling over 7,000 ac. Each survey unit is between 397 and 1154 ac2. 34 salvage logging units are dispersed throughout the 6 treatment units. The salvage units are characterized by three silvicultural prescriptions replicating preferred postfire habitat for the three target woodpecker species: black-backed, white-headed, and Lewis’s woodpeckers. The 4 control survey units and the unlogged landscape in the 6 treatment units act as the undisturbed control habitat.

In 2016 and 2017, belt transects with corresponding avian point counts were surveyed in each of the 10 survey units. These surveys detected woodpecker occupancy and nest locations based on audio playback and nest searching methodology. Woodpecker nest datasets were developed from these surveys for pre-salvage (2016, n = 71) and post-salvage (2017, n = 77) conditions. I wanted to run Ripley’s K on the two datasets separately to determine differences in nest clustering before and after the salvage treatments. In the future, with successive datasets collected in 2018 and 2019, I can analyze clustering trends up to three years after salvage logging. I will also integrate 3D forest structure variables from pre- and post-salvage lidar datasets.

A subset of the 2016 and 2017 nest datasets picturing clockwise from top right: East Fork Canyon Creek (Ctrl), Wall Creek (Ctrl), Alder Gulch (Trt), Lower Fawn Springs (Trt), Upper Fawn Springs (Trt).

By using Exercise 1 to determine how nests are clustered within the study units, I can use this clustering to inform my Exercises 2 and 3, which should reveal how the clustering relates to the salvage harvest units.

Ripley’s K Steps

  1. Export nest points as their own shapefile. My preliminary point dataset includes both nest points and vegetation survey points, so I needed to isolate only nest points. This will indicate how nest points cluster within study units.
  2. Further export nest points by survey unit. Isolate nests in each survey unit as their own shapefiles. Running Ripley’s K on 2016 and 2017 nests as a whole will not be useful, since clustering across an entire nest dataset will reflect the 10 survey units selected for this study. In that case, Ripley’s K would falsely demonstrate high clustering within those 10 survey units. In reality, the nests are only clustered here because the surveys intentionally restricted nest detection to these areas instead of across the entire Canyon Creek fire complex. I ran Ripley’s K on all the 2016 and 2017 nests and the output proved true to this phenomenon, indicating statistically significant clustering at smaller distances:

  1. Use R to run Ripley’s K on each survey unit’s 2016 nest shapefile.
  2. Use R to run Ripley’s K on each survey unit’s 2017 nest shapefile.

Below is the example script for Ripley’s K using the 2016 Alder Gulch survey unit. For 2017 I  changed the variable names to 2017:

library(spatstat)
library(rgdal)
library(maptools)
library(ggplot2)
library(sp)
library(nlme)
library(rpart)
library(readxl)
library(raster)
setwd(“N:/AIS_Blue/Woodpeckers”)
getwd()
AGnests2016 <- readOGR(“./data”, layer = “AG_2016_nests”)
plot(AGnests2016)
class(AGnests2016)
AGnests2016.ppp <- as(AGnests2016,”ppp”)
n <- 100
AGnests2016RK <- envelope(AGnests2016.ppp, fun= Kest, nsim=n, verbose=F)
plot(AGnests2016RK)

I gave the option to run either 100 or 1000 iterations. The difference is that the confidence interval (grey area in the output graphs) widened with 1000 iterations, shown below for the Big Canyon survey unit:

Below are the Ripley’s K results for each survey unit in 2016 and 2017 over 100 iterations. The accompanying visual plots display nest point distribution in each survey unit. Some survey units did not have large enough sample sizes for the tool to function correctly. Notable results from a visual analysis of each graph include nest clustering in Big Canyon (Treatment 1) and Crawford Gulch (Control) in 2016 and Lower Fawn Creek (Treatment 2) in 2017.

Alder Gulch 2016 (n = 5)

 

Alder Gulch 2017 (n = 7)

______________________________________________________________________________________________________________________________________

Big Canyon 2016 (n = 13)

 

Big Canyon 2017 (n = 10)

______________________________________________________________________________________________________________________________________

Crazy Creek 2016 (n = 10)

Crazy Creek 2017 (n = 11)

______________________________________________________________________________________________________________________________________

Crawford Gulch 2016 (n = 12)

 

Crawford Gulch 2017 (n = 8)

 

______________________________________________________________________________________________________________________________________

East Fork Canyon Creek 2016 (n = 3); Not surveyed in 2017

 

______________________________________________________________________________________________________________________________________

Lower Fawn Creek 2016 (n = 8)

 

Lower Fawn Creek 2017 (n = 14)

______________________________________________________________________________________________________________________________________

Overholt Creek 2017 (n = 4); Not surveyed in 2016

______________________________________________________________________________________________________________________________________

Sloan Gulch 2016 (n = 7)

 

Sloan Gulch 2017 (n = 7)

 

______________________________________________________________________________________________________________________________________

Upper Fawn Creek 2016 (n = 4)

 

Upper Fawn Creek 2017 (n = 5)

______________________________________________________________________________________________________________________________________

Wall Creek 2016 (n = 9)

Wall Creek 2017 (n = 11); Cannot import figures due to unresolved maximum file size error.

Problems/Critique

Because of the small sample size for each of the survey units (as few as 4 nests in some years/units), my Ripley’s K output graphs appeared blocky instead of continuous. Notice how continuous the graphs appeared when I ran all 2016 nests. Overall this analysis did not perform well with these datasets. I am also unclear how edge effects were factored into this particular analysis. There may be more parameters I could define in the code when running this analysis in R. For example, I did not manually specify distances in the R code.

Ripley’s K requires (X,Y) coordinates for each point location. I first tried to perform this analysis in ArcMap. However, my woodpecker nest point shapefiles contain UTM coordinates divided into two separate fields for northing and easting. This caused a problem when the Ripley’s K tool in ArcMap asked for the dependent variable and I could only select one field, northing or easting. Therefore, I needed to run the Ripley’s K tool in RStudio instead.

The above Step 2 explains another issue with trying to analyze all nests at once, but I was able to resolve it by individually isolating each survey unit. Nest dataset analyses must also address a temporal component related to the salvage logging. 2016 nests must be analyzed separately from 2017-2019 nests, but 2017-2019 nests can be analyzed either by year or as a group.

Determining suitable habitat for Olympia oysters in Yaquina Bay, OR

Exercise #1

Question that you asked:
My goal for my thesis work is to evaluate the distribution of native Olympia oysters in Yaquina Bay, Oregon by assessing habitat suitability through spatial analysis of three habitat parameters: salinity, substrate availability, and elevation. A map of predicted suitable habitat as a result of the spatial analysis will be compared with field observations of oyster locations within Yaquina Bay. The main research question I am examining in this project is:

How is the spatial pattern of three habitat parameters (salinity, substrate, elevation) [A] related to the spatial pattern of Olympia oysters in the Yaquina estuary [B] over time [C]?

For this blog post, I will be evaluating the [A] portion of this question and the three habitat parameters simultaneously to identify where habitat is least suitable to most suitable. To better understand the spatial pattern of the habitat parameters, I am evaluating a raster layer for each parameter, then combining them to determine where overlap between the layers shows the best environmental conditions for oysters to survive.

Name of the tool or approach that you used:
For this portion of my research analysis, I wanted to be able to make an educated guess about where the best and worst habitat for Olympia oysters would be located within Yaquina Bay by ranking different subcategories within each of the salinity, substrate, and elevation datasets.

To do this, I started by looking through the available literature on the subject and consulting with shellfish biologists to get an idea of what conditions oysters prefer in order to apply a ranking value. The following table is a compilation of that information:

Habitat parameter Subcategories Subcategory variable range Olympia oyster tolerance Rank value applied
Mean wet-season salinity (psu) Upper estuary < 16 psu somewhat, but not long-term 1
Upper mid estuary 16.1 – 23 psu X 4
Lower mid estuary 23.1 – 27 psu X 3
Lower estuary > 27 psu somewhat 2
 
Substrate availability 1.2 Unconsolidated mineral substrate possible 3
1.2.1.3.3 Gravelly mud unlikely 2
1.2.2.4 Sandy mud no 1
2 Biogenic substrate yes 4
3 Anthropogenic substrate yes 4
3.1 Anthropogenic rock yes 4
3.1.2 Anthropogenic rock rubble unlikely 2
3.1.3 Anthropogenic rock hash no 1
9.9.9.9.9 Unclassified uncertain
 
Bathymetric depth (compared to MLLW) 1.5 – 2.5m supratidal no 1
1 – 1.5m supratidal no 1
0.5 – 1m intertidal maybe 2
0 – 0.5m intertidal yes 3
-2 – 0m intertidal yes 4
-3 – -2m subtidal yes 4
-4 – -3m subtidal yes 4
-6 – -4m subtidal yes 4
-8 – -6m subtidal yes 3
-12.5 – -8m subtidal yes 3
 

Once I established my own ranking values, I decided to use the ‘weighted overlay’ function, found within the Spatial Analyst toolbox in ArcGIS Pro. Weighted overlay applies a numeric rank to values within the raster inputs on a scale that the ArcGIS user is able to set. For example, on a scale from 1-9 ranking 1 as areas of least fit and 9 as areas of best fit. This can be used to determine the most appropriate site or location for a desired product or phenomenon. I used the ranking value scale 1-4 where 1 indicates the lowest suitability of subcategories for that parameter and 4 indicates the highest suitability.

Brief description of steps you followed to complete the analysis:

To apply the weighted overlay function:

  1. Open the appropriate raster layers for analysis in ArcGIS Pro. Weighted overlay will only work with a raster input, specifically integer raster data. Here, I pulled all three of my habitat parameter raster layers from my geodatabase into the Contents pane and made each one visible in turn as I applied the weighted overlay function.
  2. In the Geoprocessing pane, type ‘weighted overlay’ into the search box. Weighted overlay can also be found in the Spatial Analyst toolbox.
  3. Once in the weighted overlay window within the Geoprocessing pane, determine the appropriate scale or ranking values for the analysis. I used a scale from 1-4, where 1 was low suitability and 4 was high suitability.
  4. Add raster layers for analysis by selecting them from your geodatabase and adding them into the window at the top left. To add more than one raster, click ‘Add raster’ at the bottom of the window.
  5. Select one of the raster inputs and see the subcategories for that raster appear on the upper right. Here, ranking values within the predetermined can be individually applied to the subcategories by clicking from a drop-down list. Do this for each subcategory within each raster input. I ranked each subcategory within each of my habitat rasters according to the ranks listed on the table above.
  6. Determine the weights of each raster input. The weights must add up to 100, but can be manipulated according to the needs of the analysis. A raster input can be given greater or lesser influence if that information is known. For my analysis, I made all three of my habitat raster inputs nearly equal weight (two inputs were assigned a weight of 33, one was weighted 34 to equal 100 total).
  7. Finally, run the tool and assuming no errors, an output raster will appear in the Contents pane and in the map window.

Brief description of results you obtained:

The first three images show each habitat parameter weighted by suitability, with green indicating most suitable and red indicating least suitable.

Salinity —

Bathymetry —

Substrate —

The results of the final weighted overlay show that the oysters are most likely to be in the mid estuary where salinity, bathymetry, and substrate are appropriate.

 

Critique of the method – what was useful, what was not?:

The weighted overlay was a simple approach to combining all of the raster layers for habitat and creating something spatially meaningful for my research analysis. The areas indicated in green in the resulting map generally reinforce what was found in the literature and predicted by local shellfish biologists. While the weighted overlay tool did generate a useful visual, it is highly dependent on the quality of the raster inputs. In my analysis, the detailed resolution of the bathymetry layer was very helpful, but the substrate layer is a more generalized assessment of sediment types within Yaquina Bay. It doesn’t show the nuances of substrate availability that might be important for finding exactly where an opportunistic species like Olympia oysters might actually have settled. For example, in Coos Bay Olympia oysters have been found attached to shopping carts that have been dumped. The substrate raster is a generalized layer that uses standardized subcategories and does not pinpoint such small features.

Additionally, the salinity layer is an average of wet-season salinity, but it can change dramatically throughout the year. Some in situ measurements from Yaquina Bay this April showed that the surface salinity with the subcategory range of 16-23 psu were actually <10 psu. While it is more reasonable to generalize salinity for the purposes of this analysis, it is important to note that the oysters are exposed to a greater range over time.

This spatial information serves as a prediction of suitable oyster habitat. The next step is to compare this predicted suitability to actual field observations. I’ve recently completed my first round of field surveys and will be analyzing how closely the observations align with the prediction in Exercise #2.

What is the spatial pattern of the ground cover vegetation within 100 m, 500 m and 1 km of 20 wetland sites?

Exercise 1: What is the spatial pattern of the ground cover vegetation within 100 m, 500 m and 1 km of 20 wetland sites?

To determine these spatial patterns I first imported the latitudinal and longitudinal locations of my wetland sites as a point layer by adding a csv file with this data from the catalog to my map and choosing the Display X, Y Data option after right clicking on the imported file.
Display X, Y Data: X= longitude
Y= latitude

I then created 3 buffers around each point using the Buffer tool in spatial analyst. The buffers created had a 100m radius, a 500m radius, and a 1km radius.
Buffer: Input vector= wetland site points
Buffer= 100 (then 500, then 1000)
Unit= Meters

I used Clip in raster processing to clip my raster layer to the extent of my largest buffer layer so that the computer did not have to process so much extra raster data moving forward.
Clip: Input= raster layers
Clip to= 1kmBufferLayer

Then I created an NDVI raster layer from Landsat data obtained by the USGS Earth Explorer. I downloaded and then imported data from 1995 and 2019. The LIDAR data from 2019 was collected via the LandSat 8 satellite which collects, among other things, Near Infra-Red (NIR) and red wavelength reflectance. NIR is categorized as Band 5 and Red as Band 4. To create an NDVI layer from this data I had to rescale the Band 5 and Band 4 raster layers because the USGS self-correcting algorithm overcorrected for atmospheric disturbances. To do this I used to Raster Calculator in spatial management. This resulted in 2 new raster layers within the acceptable range (0-1000) for Band 5 and Band 4 of the 2019 LIDAR data. I did not have to rescale the 1995 data because there was no self-correcting algorithm used in that data.
Raster Calculator: Query= [(grid – Min value from grid) * (Max scale value – Min scale
value) / (Max value from grid – Min value from grid)] + Min scale value

I used the Raster Calculator again to create the final NDVI layer for 1995 and one for 2019 using the NIR and red wavelength data layers. The 1995 data was collected via LandSat 4 which categorizes NIR wavelengths as Band 4 and red wavelengths as Band 3.
Raster Calculator: Query= ((“NIR_Layer”-“red_Layer”)/(“NIR_Layer” + “red_Layer”))

2019 NDVI result below

I then used the Raster to Point tool to turn both the 1995 NDVI and 2019 NDVI data into a polygon layer.

Raster to Point: NDVI layers to point

Finally, then I used the Clip in spatial analyst tool to clip the polygon NDVI layers to the buffer layers I created earlier.
Clip: Clip “NDVI layers” to “Buffer layers”

Now I have the land cover type (in the form of a number signifying vegetation density) information for all of my sites within all of my buffer zones.

Ripley’s K analysis on two forested stands

Bryan Begay

  1. Question asked

Can a point pattern analysis help describe the relationship between the spatial pattern of trees in my area of interest with forest aesthetics? More specifically, how does Ripley’s K function describe forest aesthetics in different parts of the forest on the McDonald-Dunn Forest.

  1. Tool

The main tool that I used for the point pattern analysis was the Ripley’s K function.

  1. Steps taken for analysis

My workflow involved doing preprocessing of the LiDAR data, then creating a canopy height model to obtain an individual tree segmentation. The individual tree segmentation would then allow me to extract tree points with coordinates that could be usable points for the Ripley’s K Function.

 

LiDAR preprocessing

I started off with finding my harvest unit area of interest (Saddleback stand) and finding a nearby riparian stand that would be used to compare the Ripley’s K function outputs.   I create polygons to clip the LiDAR point clouds onto. I found the LiDAR files that were over the AOIs and used the ArcMap Create LAS Dataset (Data Management) to make workable files, then clipped the data sets to the polygons using the Extract LAS (3D analyst) tool. Fusion was used to merge the clipped LiDAR tiles to make one continuous data set for both AOIs. Then I normalized the point cloud with FUSION by using a 2008 DEM raster from DOGAMI, and the FUSION tools ASCII2DTM and Clipdata.

 

CHM, tree segmentation, and Ripley’s K

With the normalized point cloud, a canopy height model (CHM) was created in R-studio, and then an individual tree segmentation was made with an R package called lidR by using a watershed algorithm. The segmented trees were exported as a polygon shapefile that could be used in ArcMap. The Feature to Point tool (Data Management) was used to calculate the centroid of the polygons to identify individual trees as points. The points could then be used in RStudio spatstat package to be used in a Ripley’s K Function. The function was calculated for both Saddleback stand and a nearby riparian area.

  1. Results

The results show that the pattern for both the Saddleback stand and the riparian area were clustered. Both stands observed lines were plotted above the expected line for a random spatial pattern.  The lines were significantly different, being above the higher confidence envelope. The riparian stand has higher levels of clustering compared to Saddleback stand. The Saddleback stand showed a plotted clustering pattern as well, but not to the degree of the riparian stand.

 

  1. Critiques

Some critiques for my analysis would be to use a more robust individual tree segmentation algorithm analysis. For the sake of processing speed and creating delineated polygons with reduced noise, I used a resolution of 1 meter for my CHM. The 1 meter resolution for my CHM smoothed over the tree segmentation, possibly removing potential tree polygons but creating more defined segmented trees. The CHM lower resolution was used with a relatively simple watershed algorithm. Past algorithms I’ve used showed better results than watershed but required more detailed inputs. Another criticism I have is that using the feature to point does not necessarily give me the tree tops, but finds the centers of polygons that the tree segmentation identified as individual trees. Finding a more robust method for determining tree points would be more preferable.

Recession Analysis in Two Coastal Basins

Research Question

The question that I asked for this exercise was: What is the spatial variability of the recession timescale at different flow rates in rain-dominated, coastal basins?

Approach

Stream flow regimes are generally defined by five components: magnitude, frequency, duration, timing, and rate of change (Poff et al., 1997). For the purposes of this exercise, I am investigating the rate of change (or “flashiness”) component of flow regime in two river systems in the Oregon Coast Range: the Nehalem and Alsea River watersheds. I am analyzing recession behavior in these two systems to quantify a rate of change metric. I used the recession curve method, largely developed by Brutsaert and Nieber (1977), and later built upon by Krakauer et al. (2011) among many others. Recession curves describe the rate at which streamflow recedes in various streamflow conditions. In more general terms, recession curves provide an indication of watershed storage and groundwater behavior.

Methods

To complete the recession analysis in the Nehalem and Alsea watersheds, I followed the following steps:

  1. Downloaded streamflow data for the available period of record in 1-hour observation intervals using the dataRetrieval and EGRET (both developed by the USGS) packages in R.
  2. The data were in cubic-feet-per-second (cfs) units, which I converted to unit discharge (mm/hour) using respective basin areas.
  3. For recession analyses, only data points with insignificant precipitation are viable. Therefore, all time steps when precipitation was greater than 10% of total streamflow were removed. To do this, local precipitation data was necessary, however it is often hard to come by. As a result, the precipitation data sources for the Nehalem and the Alsea are different and the methods diverge slightly hereafter.
    1. Nehalem precipitation data was sourced from the USGS Vernonia site, which is upstream from the mainstem river gauging site where streamflow data for this analysis was sampled. For the purposed of this exercise, I assumed that rainfall (measured in inched at 30-minute intervals) was spatially consistent across the basin. For a more precise estimate of precipitation, multiple, spatially distributed rain gages would be needed.
    2. Alsea precipitation data in hourly timesteps was not readily available. There are a couple NOAA rain gages in the vicinity, but the data appeared to be spotty. As a result, I used daily precipitation data from PRISM to identify days in which total daily precipitation was greater than 10% of total daily streamflow. Next, I used the subset out the identified dates from the hourly streamflow data.
  4. Calculating rate of change: For this component of the analysis, I only wanted data from the receding hydrograph. I used the equation to estimate hourly -. Hourly streamflow (Q) for the corresponding hours was estimated as .
  5. (or  ) is the rate at which flow jumps values from one time-step to the next at a given flowrate. Because this is a recession analysis, I only wanted data that was on the receding slope and therefore subset the data to time-steps when  was negative.
  6. Lastly, I developed a simple linear regression model for vs Q for each watershed using the following equation:

 Results

Alsea

When log-transformed and plotted, the discharge and rate of change of discharge follow a linear relationship, however the relationship is different for data analyzed at different temporal resolutions (Figures 1 a-b). The recession results from the daily timestep are likely muted because recession processed occur on shorter timescales, on the order of minutes to hours.

Figures 1 a-b. On the left, recession curve results for the Alsea River on a daily time step, after removing days with high precipitation. On the right, recession curve results for the river on an hourly time step, after days with high precipitation were removed. Both a and b include only recession data.

 

Table 1. Table showing different coefficient values for different temporal resolutions of Alsea recession data.

Data a coefficient b coefficient
Alsea (daily) -3.029 1.639
Alsea (hourly) -5.2346 0.9138

Nehalem

Recession analysis on year-round, hourly Nehalem River data resulted in an a coefficient value of -4.696 and a b coefficient value of 1.049, which are similar values for the Alsea hourly results. However, the Nehalem recession data is showing two linear trends in the plotted data (Figure 2). The two trendlines remained after controlling for both diurnal flux and season (Figure 3).

Figure 2. Recession results from the Nehalem River data. The red line is the calculated linear regression model, however two lines may be more representative of the recession behavior, such as those estimated in blue.

Figure 3. Recession data separated into seasons and only including receding slopes, night time hours, insignificant precipitation. The two trendlines are apparently less distinguished in certain seasons of the year.

Critique of Method

            Recession analysis is a method that I will use in my research, and this was a helpful exercise in that it helped me understand the nuances of recession data analysis, including data acquisition and availability, and opportunities for improvement. Moving forward, this analysis would benefit from: more spatially accurate precipitation data, more investigation of the explanations for the observed patterns and trends, and more sophisticated statistical comparison between sites.

References:

PRISM Climate Group, Oregon State University, http://prism.oregonstate.edu, created 4 Feb 2004.

Krakauer, N. Y., & Temimi, M. (2011). Stream recession curves and storage variability in small watersheds. Hydrology and Earth System Sciences, 15(7), 2377–2389. https://doi.org/10.5194/hess-15-2377-2011

Sawaske, S. R., & Freyberg, D. L. (2014). An analysis of trends in baseflow recession and low-flows in rain-dominated coastal streams of the pacific coast. Journal of Hydrology, 519, 599–610. https://doi.org/10.1016/j.jhydrol.2014.07.046

Brutsaert, W., & Nieber, J. L. (1977). Regionalized drought flow hydrographs from a mature glaciated plateau. Water Resources Research, 13(3), 637–643. https://doi.org/10.1029/WR013i003p00637

Poff, N. L., Allan, J. D., Bain, M. B., Karr, J. R., Prestegaard, K. L., Richter, B. D., … Stromberg, J. C. (1997). The Natural Flow Regime. BioScience, 47(11), 769–784. https://doi.org/10.2307/1313099

Leaf spot patch analysis caused by Blackleg

Review
My research involves the classification of a leaf spots on turnip derived from the pathogen blackleg. I had hypothesized that spatial patterns of pixels based on image classification are related to manually classified spatial patterns of observed disease on turnip leaves because disease has a characteristic spectral signature of infection on the leaves. This post focuses on the analysis of the clusters based on image classification through segmentation as well as the manually classified clusters. These clusters of pixels are expected to be representative of the diseased patches on the leaves. Here we seek to obtain some patch statistics which will be compared for relationships and accuracy at a later time. A large portion of this process went into image processing before the analysis could be conducted. All of the image processing took place in ArcGIS Pro. The next step involved the patch analysis which was conducted in Fragstats.

Questions
Some questions I asked myself about element A & B of my hypothesis included:

Can diseased patches be successfully identified and isolated by computer image classification and through manual classification?

If yes; what is the area and perimeter of the diseased patches based on image classification and manually classified diseased patches? I was mostly looking to obtain and gain some experience in the patch analysis provided by Fragstats. Much more thorough analysis can be and will be completed in Fragstats when variable A & B are compared for accuracy assessment in exercise two.

Tools used and methodology
The image processing and classification of pixels was conducted in ArcGIS Pro. This analysis required extracting both manually classified diseased patches and computer classified patches. I began by uploading band 3, which captures light energy in the red wavelength at 670 – 675 nm.

1. For computer classification of the image I used Segmentation.

a) I went to the Imagery tab and the Image Classification group. Click the drop-down arrow and an option for Segmentation should appear. Be sure the layer you desire to perform the segmentation on was selected prior to selecting the Segmentation Classification Tools. In the pane you have options to adjust Spectral detail, Spatial detail and Minimum segment size in pixels. There are variable and depend on image resolution, level of accuracy you wish to achieve, etc. Because my resolution to high I chose for higher spectral detail and spatial detail. I adjusted the spectral detail from 15.50 to 17 and the spatial detail from 15 to 17 as well. For the minimum segment size in pixels I chose 10. As mentioned these values are variable and although the standard values worked well for segmentation, I previewed other values and achieved greater results by adjusting. After previewing other options and deciding what values work best, click the Run button in the bottom right corner of the pane.

b) Masking all the cells that weren’t diseased was the next step and crucial to this diseased region selection process. To do this, go to the Analysis tab and the Raster group to find the Raster Functions. Open the Raster Functions and it should appear in the pane to the right. Search Mask and select it. Open the segmented image in the raster box and three options will be available for Included Ranges. These three options are representative of the blue, green and red bands. Because the data was adjusted from 12-bit to 8-bit in the segmentation process, we should have a range of values between 1 and 255. Despite having three band options, all the bands hold the same value because we are only working with the red band. Click on the segments you wish to include in the working window to check their values. Include a minimum and maximum which should be the same for all 1, 2 and 3 that includes the segments you want. The range I had included was 100 to 160, which is the RGB.Pixel Values that appear when clicking a pixel. Once you have these entered, click create new layer in the bottom of the right pane.

c) Next, I used the Clip function by navigating to the Imagery tab, the Analysis group and selecting Raster Functions. In the window pane to the left a search bar can be found along the top where you can enter Clip. Although there are other methods and locations the clip function can be found, I experienced different results depending on how I navigated to clipping, as well different clipping options in the window pane. In the Parameters section for Clip Properties, select the drop-down arrow for the raster you previously segmented. Leave Clipping Type and Clipping Geometry/Raster as the default. Adjust the active map view by zooming in or out to the region you want preserved after the clip. I use as small of area as I possibly can when clipping, while keeping all required segmented regions. Click the Capture Current Map Extent button found just to the right of the extent options in the pane (Green square map). To clip, click Create new layer at the bottom of the pane and a new map layer with the clipped region should appear in the map contents.

d) From here, go to the Analysis tab and Geoprocessing group where you can find the Tools. Click the Tools to open the options in the left pane. Here we want to use the Raster to Polygon (Conversion Tools) which can be found by searching at the top of the tools window pane. Use the most recent segmented-clipped raster for the Input raster, leave the Field blank and select where you would like the Output Polygon features to save. I left the remaining option as the default and clicked the Run arrow in the bottom right corner. Although we need the final product to be in raster format, this step is important for creating separate polygons which are grouped together as one unit.

e) As mentioned we must get the polygons back into a raster. Before we use the Polygon to Raster tool, polygons which overlap must be merged. This is one of the two reasons why we converted the raster to polygons. When the segmentation was conducted in (step d), diseased regions were not all categorized in the same bin. This resulted in regions which were maintained after masking but are different shades grey. Even one diseased patch may have two to three tones to it, resulting in segments for a patch. Now that everything is a polygon, we can merge these polygons that should be one polygon. Click the Edit tab and select Modify in the Features group. In the Modify Features, scroll down to the Construct group and click Merge. In Existing Features click Change the Selection and while holding shift, select the polygons which belong in one group. They will be highlighted in blue and appear in the pane if properly selected and if so, click Merge. Navigate to the Map tab and Selection group and click Clear. Use the same merge steps to merge any other polygons which belong to the same diseased lesion but were classified separately in raster segmentation.

f) Finally, the polygons can be converted back to a raster for the last step of image processing. Click the Analysis tab and find the Geoprocessing group to Tools once again. Search Polygon to Raster (Conversion Tools) in the Tools pane and select it. Use the polygons layer for Input Features and the window pane options should adjust to accommodate your raster. The only adjustment I made was to the Cellsize which I changed to 1 from 0.16 to maintain the cell size in my original raster. Select Run in the bottom right corner of the pane for the final layer. Each of the polygons should now be converted to a rasterized polygon with a unique ObjectID, Value and Count which can be found by going to the Attribute table for that layer or clicking on a group of pixels. This allows for each diseased patch to be aggregated and analyzed as such in Fragstats.

g) Go to the final raster layer, right click and go to data and export data. What I wanted is a TIFF file for analysis in Fragstats.

2. To compare the accuracy of the segmentation as a classification method for diseased pixels, manual classification was used as a ground truth technique.

a) For manual classification of the original red band, the Training Sample Manager was used. This allows for manual classification which can be used for supervised machine learning techniques of classification. Here, I simply used it to select diseased regions manually, but have an end goal of using the support vector machine learning model.

b) To get to the Training Sample Manager, go to the Imagery tab and Image Classification group and select the Classification Tools where you can click Training Sample Manager. In the Image Classification window pane go to create new scheme. Right click on New Scheme and Add New Class and name it diseased pixels and supply a value which is arbitrary and a description if desired. Click ok and select the diseased pixels scheme and then the Freehand tool in the pane. Draw an outline around the diseased regions in the image. Save these polygons and the classification scheme. Go to Add data in the Layer group on the Map tab to upload the training sample polygons.

c) Once the polygons are uploaded, use the Polygon to Raster protocol which can be found in step 1, part (f) followed by part (g).

Fragstats was used for the analysis of the polygons. For exercise 1 the intent in Fragstats was to simply obtain information about the diseased patches that were extracted from the red band image. This included the pixel number, area, perimeter, perimeter-area ratio and shape index. Many more options were available for patch analyses and will be considered for exercise 2.

1. Open up Fragstats for the patch analysis.

a) To import your first image click Add layer and go down to GeoTIFF grid and select it. In the Dataset name, search for your tiff file you created in ArcGIS Pro and select it and click ok. You have to do this for each of the datasets. Then go to the Analysis parameters and click Patch metrics in the Sampling strategy. For General options hit Browse to find a location for the output to save and click the Automatically save results checkbox.

b) Click on the red box called Patch metrics and in the Area – Edge tab select the Patch Area and Patch Perimeter. Click on the Shape tab and click Perimeter – Area ratio and Shape Index.

c) In the upper left corner, you can hit Run and it will check everything you have uploaded and the analysis you have selected. You must then hit Proceed after it checked the model consistency and it runs the analyses.

d) After this hit the Results options for viewing the output.

Results
The patches aligned pretty well through visualization in ArcGIS Pro when comparing the two layers. The comparison between patches will be done in exercise 2. I did notice that the segmentation process missed many of the smaller diseased patches. Looking at Image 2, you can see that they were segmented from the surrounding regions but were lumped into a bin with that caught areas that appeared lighter in the image because of a reflection. Different thresholds could be used in the segmentation process in order to include those smaller patches. This could maybe be done if they followed a parameter involving shape. This may be considered for future segmentation. A concern is that you set the threshold too low and it includes many regions that aren’t diseased but include all the diseased regions also. This would result in many false positives for disease. The alternative is to set a high threshold value for diseased regions which would result in false negatives. The idea is to find a good balance between the two. Currently the segmentation is strictly based on the segments value which is attributed to the reflectance in the red band. As mentioned another parameter worth considering is shape. Since the diseased regions tend to be leaf spots resulting in a circular area typically, adjustments could be made to include more circular patches that don’t extend past a certain number of pixels. In table 1 we see that number of pixels for patches ranges from 8 to 50. The patch analysis provides some insights into possible spatial factors that explain these segmented regions and how the classification process could be done more accurately.

Table 1. Showing the 11 different patches that were manually selected and analyzed in Fragstats. The color indicates the corresponding patch that was detected through segmentation shown in Table 2.

Table 2. The 4 different patches that were identified through segmentation and analyzed in Fragstats. The color indicates the corresponding patch that was detected through segmentation shown in Table 1.

 

Critique of the method
The MicaSense red-edge camera has 5 bands which can be very helpful for applying different vegetative indices and compositing bands in order to help bring out the variation in spectral signatures between diseased and un-diseased tissue. Although the pixel size is just under 1 mm, which appears to be adequate for identifying lesions on the leaf, the bands do not have perfect overlap. This is due to the design of the camera which has five different lenses for the five different bands that are all separated by one to two inches. This could be corrected for if the extent for each of the five bands was manually adjusted for a near perfect overlap. Until this is resolved, the red band seems to show the most variation in pixel value for diseased patches in comparison to the other four bands and was used for this analysis.

Additionally, the amount of processing that is done in part 1 step (a) to get the segmented raster patches is has many steps. Methods for speeding up this process need to be considered for future analysis especially when conducting this analysis on the 500 leaves.

As mentioned before, Fragstats has many more statistically capabilities that were not applied in this analysis. Getting more statistics on the manually and segmented patches would be helpful for determining the level of accuracy as well as other parameters worth considering.

Appendix

Image 1. The red band tif file uploaded into ArcGIS Pro for classification. The white patches are what we would expect the diseased regions to look like.

Image 2. The result of the segmentation performed on Image 1 described in step 1 part (a)

Image 3. The patches determined based on segmentation in part 1 from Image 2. Separate patches were created by using the Raster to Polygon tool, followed by the Polygon to Raster tool. Two layers were turned on here for the intent of showing the overlap between the raster patches and the original image.

Image 4. The patches determined based on the manual classification described in part 2 from Image 1. Training sample manager followed by polygon to raster was used to get these patches. Two layers were turned on here for the intent of showing the overlap between the raster patches and the original image.

Spatial Patterns of Salmonella Rates in Oregon

  1. Question Asked

At this stage I asked several questions regarding the spatial distribution of population characteristics in all counties in Oregon in 2014: What are the county level spatial patterns of reported age-adjusted Salmonella rates within Oregon in 2014? County level spatial patterns of proportions of females? Median Age? Proportion of infants/young children aged 0-4 years?

To answer these questions I used several different datasets. The first dataset used is a collection of all reported Salmonella cases in Oregon from 2008-2017 which includes information like sex, age group, county in which the case was reported, and onset of illness. The information in this dataset was deidentified by Oregon Health Authority. The second dataset used was a collection of Oregon population estimates over the same time period. This dataset includes sex and age group specific county level population information. I also obtained county level median ages from AmericanFactFinder. The last dataset used is a shapefile from the Oregon Spatial Data Library containing polygon information of all Oregon counties.

  1. Names of analytical tools/approaches used

I used a direct age adjustment (using the 2014 statewide population as the standard population) to obtain county level age-adjusted Salmonella rates. After calculating county level summary data e.g. proportion of females, proportion of children aged 0-4, median age, and age-adjusted Salmonella rates, I merged this information with a spatial dataframe containing polygonal data of every county in Oregon. After doing this I did both local (between 0-150 km) and global (statewide) spatial autocorrelation to get a Moran’s I statistic for each of the population variables listed above. I produced choropleth maps of each of the variables for Oregon as well. Finally, I produced a heatmap for county-level age-adjusted Salmonella rates using a Getis-Ord Gi* local statistic to evaluate statistically significant clustering of high/low rates of reported Salmonella cases.

  1. Description of the analytical process

After extensive reformatting, I was able to organize cases of Salmonella by age group and by county for the year 2014. After this I formatted 2014 county level population estimates in the same way. I then divided the Salmonella case dataframe by the population estimate dataframe to get rates by the different age groups. To get county age-adjusted rates I created a “standard population”, in this case I used Oregon’s statewide population broken down into the same age groups as above. I then multiplied the each of the county’s age-specific rates by the standard population’s matching age groups to create a dataframe of hypothetical cases. This dataframe represents the number of cases we would expect in each of the counties if they had the same population and age distribution as Oregon as a whole. I summed the expected Salmonella cases by county and divided this number by the 2014 statewide population. This yielded age-adjusted reported Salmonella rates by county.

Given that the population data contained county level populations broken down by age group and by sex I was able to calculate proportions of county populations which were female, and which were young children aged 0-4 years by dividing those respective group populations by the total county population.

After this I performed local and global spatial autocorrelation with Moran’s I using the county level median age, proportion of children, proportion of females, and age adjusted Salmonella rates which were associated with centroid points for each county. The global Moran’s I was calculated using the entire extent of the state and the local Moran’s I was calculated by limiting analysis to locations within 150 km of the centroid. Both global and local Moran’s I statistics were calculated using the Monte-Carlo method with 599 simulations.

Finally, I completed a Hot Spot Analysis using Getis-Ord Gi* to assess for any statistically significant hot or cold spots in Oregon. This was only done for the age-adjusted Salmonella rates. This was completed using the same county centroid points as above. I completed this analysis with a local weights matrix using Queen Adjacency for neighbor connectivity. The weighting scheme was set to where all neighbor weights when added together equaled 1.

  1. Brief description of results you obtained

Choropleth Maps of Oregon:

From the median age map, we can see that there are some clusters of older counties in the northeastern portion of the state and along west coast. Overall, the western portion of Oregon is younger than the eastern portion of the state.

From the proportion of children map there are a few clusters of counties in the northern portion of the state with high proportions of children compared to the rest of the state. Overall, the counties surrounding the Portland metro area have higher proportions of children compared to the rest of the state.

From the proportion of females map, we can see that the counties with the highest proportion of females are clustered in the western portion of the state.

Finally, from the age-adjusted county Salmonella rates map we can see that the highest rates of Salmonella occur mostly in the western portion of the state with a few counties in the northeast having high rates as well. Overall, the counties surrounding Multnomah county have the highest rates of Salmonella.

The global Moran’s I statistics:

  • County proportions of females: 0.053 with a p-value of 0.15. This suggests insignificant amounts of slight clustering.
  • County median age: 0.175 with a p-value of 0.02. This provides evidence of some significant mild clustering.
  • County proportions of children: 0.117 with a p-value of 0.05. This provides evidence of significant mild clustering
  • County age-adjusted Salmonella rates: -0.007 with a p-value of 0.32. This suggests insignificant amounts of higher dispersal than would be expected.

Local Moran’s I Statistics:

  • County proportions of females: 0.152 with a p-value of 0.02. This suggests significant amounts of mild clustering.
  • County median age: 0.110 with a p-value of 0.07. This provides evidence of some insignificant mild clustering.
  • County proportions of children: 0.052 with a p-value of 0.1617. This provides evidence of insignificant slight clustering
  • County age-adjusted Salmonella rates: -0.032 with a p-value of 0.5083. This suggests insignificant amounts of higher dispersal than would be expected.

Getis-Ord Gi*:

  • The heatmap shows a significant hotspot (with 95% confidence) in Clackamas county with another hotspot (with 90% confidence) in Hood River County. Three cold spots (with 90% confidence) are seen in Malheur, Crook, and Morrow counties.

  1. Critique of Methods

The choropleth maps were very useful at showing areas with high/values however this method was not able to detect counties with significantly different values compared their neighbors. Overall, it was useful as an exploratory tool. The global and local Moran’s I calculations were able to detect if high/low values were closely clustered or more dispersed than what is expected. However, I am unsure if this method was completely appropriate given the coarseness of this county level data. At a local scale, only the proportion of women showed a significant amount of clustering, and globally median age and proportion of children showed some amount of significant clustering. Given that most of the Moran’s I statistics were not associated with significant values, I don’t believe this analytical method highlighted a particularly meaningful spatial pattern in my data. The heatmap provided evidence of some significant hot and cold spots in Oregon, however this was based on immediate neighbor weights and perhaps global weights would be more appropriate. Overall, this tool was very useful in detecting significantly high/low Salmonella rates.

Exercise 1: Preparing for Point Pattern Analysis

Exercise 1

The Question in Context

In order to answer my question: are the dolphin sighting data points clustered along the transect surveys or do they have an equal distribution pattern? I need to use point pattern analysis. I am trying visualize where in space dolphins were sighted along the coast of California, specifically from my San Diego sighting area. In this exercise, the variable of interest is dolphin sightings. These are x,y coordinates (point data) indicating the presence of common bottlenose dolphins along a transect. However, these transect data were not recorded and I needed to recreate these lines to my best abilities. This process is more challenging than anticipated, but will prove useful in the short-term view of this class and project and long-term in management ramifications.

The Tools

As part of this exercise, I used ArcMap 10.6, GoogleEarth, qGIS, and Excel. Although I was only intending on importing my Excel data, saved as a .csv file into ArcMap, that was not working, so other tools were necessary. The final goal of this exercise was to complete point-pattern analyses comparing distance along recreated transects to sightings. From there, the sightings would be broken down by year, season, or environmental factor (El Niño versus La Niña years) to look for distributing patterns, specifically if the points were ever clustered or equally distributed at different points in time.

Steps/Outputs/Review of Methods and Analysis

My first step was to clean up my sightings data enough that it could be exported as a .csv and imported as x-y data into ArcMap. However, ArcMap, no matter the transformation equation, seemed to understand the projected or geographic coordinate systems. After many attempts, where my data ended up along the east coast of Africa or in the Gulf of Mexico, I tried a work around; I imported the .csv file into qGIS with the help of a classmate, and then exported that file as a shape file. Then, I was able to import that shape file into ArcMap and select the correct geographic and projected coordinate systems. The points finally appeared off the coast of California.

I then found a shape file of North America with a more accurate coastline, to add to the base map. This step will be important later when I add in track lines, and how the distributions of points along these track lines are related to bathymetry. The bathymetric lines will need to be rasterized and later interpolated.

The next step was the track line recreation. I chose to focus on the San Diego study site. This site has the most data and the most consistently and standardly collected data. The surveys always left the same port of Mission Bay, San Diego, CA traveled north at 5-10km/hr to a specific beach (landmark), then turned around. It is noted on sighting data whether the track line was surveyed on both directions (South to North and North to South), or unidirectional (South to North). Because some data were collected prior to the invention of a GPS and the commercial availability, I have to recreate these track lines. I started trying to use ArcMap to draw the lines but had difficulty. Luckily, after many attempts, it was suggested that I use Google Earth. Here I found a tool to create a survey line where I can mark the edges along the coastline at an approximate distance from shore, and then export that file. It took a while to realize that the file needed to be exported as a .kml and not a .kmz.

Once exported as a .kml, I was able to convert the .kml file to a layer file and then to a shape file in ArcMap. The next step in this is somehow getting all points within one kilometer of the track line (my spatial scale for this part of the project) to associate with that track line. One idea was snapping the points to the line. However, this did not work. I am still stuck here: the major step before I can have my point data with an association to the line and then begin a point pattern analysis in ArcMap and/or R Studio.

Results

Although I do not currently have results of this exercise, fully. I can say for certain, that it has not been without trying, nor am I stopping. I have been brainstorming and milking resources from classmates and teaching assistants about how to associate the sighting data points with the track line to then do this cluster analysis. Hopefully, based on this can be exported to R studio where I can see distributions along the transect. I may be able to do a density-based analysis which would show if different sections along the transect, which I would need to designate and potentially rasterize first, have different densities of points. I would expect the sections to differ seasonally.

Critiques

Although I add in my opinions on usefulness and ease above, I do believe this will be very helpful in analyzing distribution patterns. Right now, it is largely unknown if there are differences in distribution patterns for this population because they move rapidly and at great distances. But, by investigating data from only the San Diego site, I can determine if there are differences in distributions along the transects temporally and spatially. In addition, the total counts of sightings in each location per unit effort will be useful to see the influx to that entire survey area over time.


Contact information: this post was written by Alexa Kownacki, Wildlife Science Ph.D. Student at Oregon State University. Twitter: @lexaKownacki

Point Pattern Analysis of Tree Distribution by Height in the HJ Andrews Forest

1. Given that HJ Andrew Experimental Forest is a 16,000-acre forest with manipulations spanning back to its establishment as a Long-Term Ecological Research (LTER) site in 1948, it is a highly spatially heterogeneous ecosystem. Forest harvest began in the 1950s and resulted in a mosaic of young plantation forest (~30 percent of total forest area) and old growth (~40 percent of forest) (http://andrewsforest.oregonstate.edu/). My objective is to quantify the spatial pattern of trees across the forest and eventually relate that to quantifiable landscape features.

Motivating Questions: How does the spatial pattern of trees vary across the HJ Andrews Forest? Specifically, I’m exploring the relationship between tree height and tree spacing. One specific question of interest is: How does the mean distance between trees in the same height class differ from the mean distance between a single height class of tree and all other trees? This question attempts to address the clustering vs. dispersion of trees by height.

This is an analysis of the spatial distribution of one variable, tree height, so a consideration of the internal processes that may influence the spatial distribution of this variable is necessary.

  • Microclimate caused by the clustering or dispersion of trees could be either an attraction or repulsion process. Microclimate influences relative humidity and exposure to wind, along with many other factors, so clusters of trees would tend to have different microclimate features than more dispersed trees.
  • Population and community dynamics will influence the spatial distribution of trees. The speed at which a colonizer can take over a space and competition between different colonizers influence the distribution. Aboveground and belowground tree growth adn spacing of trees will be influenced by these factors.
  • Source and sink processes in a forest may result from topographical features, like valleys and hillslopes. I expect valleys to be a source (capable of producing a surplus of organisms) sink they will tend to be in areas near streams, so not water limited and in areas that serve as catchments for nutrients, so not nutrient limited.
  • Spatial distribution of tree height certainly is different according to the scale. The spatial pattern looks different at a single tree scale compared with a 50-m scale, compared with a 5-km scale. Some of these differences are revealed by Figures 3, 4 and 5, below.

2. My approach is to use k-nearest neighbor to examine distance between a given tree and proximal trees. I created ten height classes by using kmeans to find ten cluster centers in the tree height data, then used K nearest neighbor to examine the distance between each cluster center and the 30 closest trees.

3. For this analysis, I used a LiDAR dataset of vegetation heights downloaded from the HJ Andrews online data portal (http://andlter.forestry.oregonstate.edu/data/abstract.aspx?dbcode=GI011). The selected data is from the third entry, titled, “Vegetation Height digital elevation model (DEM) from 2011 LiDAR, Upper Blue River Watershed (Oct 27th, 2011 – Nov 1st, 2011). A description of this dataset can be found here: http://andlter.forestry.oregonstate.edu/data/spatialdata/gi01103.htm.

I performed the majority of the analyses in R and used QGIS for data visualization. I used the ‘st_read’ function from the ‘sf’ package in R to read in the stem map shapefile (stem_map.shp). The stem map shapefile includes crown radii (m) and tree heights (m) as well as the point locations of trees.

Because the stem map shapefile essentially provides a census of trees within the HJA Forest, and that is (1) way too much data to deal with at one time, and (2) not useful to perform statistical analyses on since we can just look at the data mapped out and visually discern where trees are located, I decided to use kmeans to group the trees into 10 height classes, resulting in the data space being partitioned into Veronoi cells.

I used the ‘nngeo’ package to examine k-nearest neighbors relations between and within height clusters. I examined the relationship between a single tree (point) and its nearest neighbor of the same height class. I also examined the relationship between a single tree of one height class and its nearest neighbor of any height class to start to elucidate to what extent trees of similar heights cluster or are dispersed spatially.

I performed T-tests on each tree height class to test if there was a significant difference between (1) the mean distance between a tree and the next closest tree within the same height class and (2) the mean distance between a tree and the next closest tree of any height class. I calculated the mean, standard deviation and kurtosis of each height class distance (both within and between height classes).

4. Results

The table below shows the center of each of the ten height class groupings, mean distances between and within tree height classes, standard deviation and kurtosis of those means, and p-values. Results of all t-tests were significant (p<0.01), meaning that there is strong evidence that the within group mean distances are not equal to zero, so there are differences in mean distances between trees of the same height class (within group) and mean distance to the closest trees of any height class. In other words, the distance between a large tree and its nearest neighbor (of any height class) is significantly different than the distance between a large tree and its nearest neighbor within the large tree height class. The same is true of trees within and outside of the small height class, as well as each of the other ten height classes.

Table 1. Results from k-nearest neighbor analysis and t-tests

Tree Class Class Center (Height (m)) Mean Distance to 30 closest trees (m) St Dev Mean Distance Kurtosis Mean Distance Mean Distance to Closest Tree (m) St Dev Mean Mean Distance to Closest Tree Within Group (m) St Dev Group Mean P-value
1 11.8 12.5 4.4 6.4 2.1 2 4 6.3 <0.01
2 15.1 11.7 3.9 3 2.4 1.7 4.9 6.6 <0.01
3 19.9 12.6 3.4 4 3.6 1.3 6.8 6.7 <0.01
4 24.4 13.4 3.1 5.6 4 1.4 6.9 6.5 <0.01
5 29.7 14.6 2.9 3.9 4.6 1.4 8 6.9 <0.01
6 35.5 16.7 3 2.8 5.3 1.6 9.3 7.2 <0.01
7 42 18.6 2.6 4.5 6.1 1.7 10.1 7 <0.01
8 50 19.7 2.5 6.6 6.9 1.8 11.4 7.3 <0.01
9 59 20.5 2.6 3.1 7.5 1.8 13.5 8.7 <0.01
10 70 21.2 2.7 0.7 8.1 1.9 15 11 <0.01


Fig 1. Distribution of all tree heights (m) in HJ Andrews Forest.

 

 

 

Fig 2. Mean distance (m) and standard deviation (m) between trees of each of ten height classes and the next closest 30 trees within that height class. Generally, as trees get larger, mean distance between them is larger.

Fig 3. Distribution of the tallest tree class (70 m tree class) across the HJ Andrews Forest.

 

Fig 4. A representative distribution of all tree height classes on either side of a road in the HJ Andrews Forest, showing the extent of clustering and the extent of dispersion of height classes. Height classes are in ascending order from smallest class (~12m tall; Class 1) to tallest class (70m tall; Class 10).

 


Fig 5. A closer look in the same area at Fig. 4, where small trees are clustered near the road and clustered tightly together, while larger trees are more dispersed.

 

Fig 6. Mean distance between trees of the same height class and other trees of the same height class (blue) and mean distance between trees of one height class and any other tree (red). The overlapping red confidence interval with the blue points suggests that the average distance between small trees is not significantly different than distance between small trees and any trees. The general upward trend suggests that as trees get taller, the distance between them increases and variance slightly decreases.

5. Critique of the method:

The results make sense, but do not provide much more information about the actual distribution of trees (clustering vs. dispersion) than simple maps of point data, so the next step might be to examine tree heights within different management regimes. The current analysis tells me that trees are somewhat clustered by height, and that the mean distance between a tree of one height class to a tree of the same height class is, in most cases, different from the mean distance of a tree of one height class to a tree of any other height class. I’ve examined a map of different management regimes within the HJ Andrews Forest and there are clear areas of old growth, harvested areas, clearly defined plots, etc., so I would expect some of these areas to show tree clustering by height class. The patterns I found using this analysis were not as clear as I was expecting. Using kmeans and nearest neighbor analysis is a great way to start to examine the spatial relationships between and among data, but with such a large and highly varied dataset there can be shortcomings, especially when it comes to drawing any concrete conclusions.

References:

HJ Andrews Online Data Repository: http://andlter.forestry.oregonstate.edu/data/catalog/datacatalog.aspx

Johnson, S.; Lienkaemper, G. 2016. Stream network from 1997 survey and 2008 LiDAR flight, Andrews Experimental Forest. Long-Term Ecological Research. Forest Science Data Bank, Corvallis, OR. [Database]. Available: http://andlter.forestry.oregonstate.edu/data/abstract.aspx?dbcode=HF013 (10 April 2019) DOI:http://dx.doi.org/10.6073/pasta/66d98881d4eb6bb5dedcbdb60dbebafa.

Spies, T. 2016. LiDAR data (August 2008) for the Andrews Experimental Forest and Willamette National Forest study areas. Long-Term Ecological Research. Forest Science Data Bank, Corvallis, OR. [Database]. Available: http://andlter.forestry.oregonstate.edu/data/abstract.aspx?dbcode=GI010 (10 April 2019) DOI:http://dx.doi.org/10.6073/pasta/c47128d6c63dff39ee48604ecc6fabfc.

Spies, T. 2016. LiDAR data (October 2011) for the Upper Blue River Watershed, Willamette National Forest. Long-Term Ecological Research. Forest Science Data Bank, Corvallis, OR. [Database]. Available: http://andlter.forestry.oregonstate.edu/data/abstract.aspx?dbcode=GI011 (10 April 2019) DOI:http://dx.doi.org/10.6073/pasta/8e4f57bafaaad5677977dee51bb3077c.

Spies, T. 2014. Forest metrics derived from the 2008 Lidar point clouds, includes canopy closure, percentile height, and stem mapping for the Andrews Experimental Forest.. Long-Term Ecological Research. Forest Science Data Bank, Corvallis, OR. [Database]. Available: http://andlter.forestry.oregonstate.edu/data/abstract.aspx?dbcode=TV081 (10 April 2019) DOI:http://dx.doi.org/10. 6073/pasta/875e10383e8c8aee3c9a49e0155eef1d.

Exercise 1: Principal component analysis for rural index weighting & rural indicator measure contributions for Texas counties

Questions Being Asked:

How much do different indicator measures of rurality in Texas counties each contribute to a basic index of rurality both statistically and spatially? How much does the weighted basic index differ from a simple calculated mean for rurality?

Tools and Approaches Used

The major statistical method I used to answer these questions was principal component analysis (PCA) via the prcomp function in R and subsequent rurality indicator variable weighting for counties in Texas. The raw indicator data used in this analysis are as follows: Texas county median household income (2010 Census Data), Texas county population density (2010 Census data), and rasterized discrete data of land use for the state of Texas (2011 National Land Cover Database).

Description of Analysis Steps

1) Data Wrangling: Both income and population density data were natively in county-level polygon form, so no conversion was required in ArcGIS. Land use required extensive conversion in order to produce the indicator variable that would both be at the appropriate spatial definition and measure the correct aspect of land use. The raw land use discrete data was first reclassified from many categories of land, such as “deciduous forest,” “barren land,” and “scrub/shrub,” into binary “developed” and “undeveloped” classification groups for the entire state of Texas. Then, the resulting binary raster was converted to unsimplified polygons and the “tabulate intersection” ArcGIS tool was utilized in order to intersect Texas county boundaries with the produced unsimplified polygons. This tool allowed for calculation of percent of area of each Texas county that is considered developed land by the USGS.

2) Data Scaling: After the 3 variables were in analyzable form, each were scaled on a 0 to 100-point scale for visual comparison, PCA analysis, and computation of index weights.

3) Creation of Maps of Scaled Variables for Comparison: ArcGIS was utilized to create maps of each of the scaled variables to visually compare the 3 factors to one another.

4) PCA and Weight Computation: The PCA was completed in R using the prcomp function and variable weights were extracted from the procedure by determining the proportion of variance each variable contributed to. This specific step helps answer how much each indicator variable contributes to rurality in Texas.

5) Simple Means Map, Weighted Index Map, and Statistical Comparison: The scaled variables were used to calculate both an unweighted mean rurality score and weighted mean rurality score for all Texas counties. Maps of each of these measures were produced for visual comparison. A student’s t-test was then completed to compare unweighted and weighted rurality scores for counties.

Results

Maps of Scaled Variables of Texas Counties for Comparison

Rough but similar spatial patterns can be seen in all three maps. Counties with large cities and generally more urbanized counties have higher median income, population density, and percent of land area developed (Green=high, yellow=medium, red=low). There are more extreme values in the population density and percent of land area developed maps than median income.

PCA and Weighting Computation

The biplot of the PCA procedure helps visualize how the samples of the 3 variables relate to one another and how much each variable contributes to the first principal component. Important factors to notice in this plot are the direction of the arrows and clustering of the points. Typically, the more the variable arrow points toward the principal component 1 (PC1) axis, the more it contributes to the proportion of variance. The population density and percent land area developed variables happen to overlap one another in this plot, as they have highly similar contributions to the variance in PC1.

This table shows that population density contributes to 44.4% of the variance in PC1, income contributes to 11.11%, and percent land area developed contributes 44.5%. This PCA procedure indicates that when weighting rurality scores by county in Texas using these 3 variables, percent land area developed should have the highest weight, population density should have the second highest weight, and income should have the lowest. The weighting of variables in the weighted county rurality mean score follows these proportions.

Simple Means Map, Weighted Index Map, and Statistical Comparison

Visually comparing the two maps of rurality scores, there is is some overlap, but quite stark differences. County scores overall have become more rural (lower scores) based on simple visual comparison. Statistical analysis will allow a more nuanced view of the actual differences between the two.

The computed t-test indicates that the weighted index, on average, produces county rurality scores that are significantly more rural (lower scores on the 0-100 scale) than the unweighted index (p<0.001). The overall unweighted rurality score mean for all Texas counties is ~17, while the overall weighted rurality score mean is ~7.5. This indicates that the basic weighted index may better capture the multidimensionality of rurality for Texas counties than unweighted mean scores.

Critique

Based on my understanding, PCA is quite useful for creation of weighted indexes. Unfortunately, the variables I utilized for this analysis have somewhat varying patterns of clustering and possible outliers, which can be seen in the above biplot. These possible outliers could have an effect on the weighting of variables in the index. Due to the small sample size of counties and in order to produce a complete picture of all counties in Texas, I did not want to remove any counties from the analysis. PCA as it relates to index weighting also requires that there is either an a priori understanding or statistical proof of a correlation between variables. Based on the directions of the arrows in the biplot, there is likely correlation between the variables, but more considerations may be needed to confirm this. Also, the scaling of the median income variable is slightly off because the maximum value is over 100. This will need to be corrected in a later analysis.

My future analysis will include more health-related variables in the PCA procedure to determine how health variables contribute to rurality in comparison to the basic rural indicator variables analyzed in this exercise.

Ex 1: Mapping the stain: Using spatial autocorrelation to look at clustering of infection probabilities for black stain root disease

My questions:

I am using a simulation model to analyze spatial patterns of black stain root disease of Douglas-fir at the individual tree, stand, and landscape scales. For exercise 1, I focused on the spatial pattern of probability of infection, asking:

  • What is the spatial pattern of probability of infection for black stain root disease in the forest landscape?
  • How does this spatial pattern differ between landscapes where stands are clustered by management class and landscapes where management classes are randomly distributed?

    Fig 1. Left: Raster of the clustered landscape, where stands are spatially grouped by each of the three forest management classes. Each management class has a different tree density, making the different classes clearly visible as three wedges in the landscape. Right: Raster of the landscape where management classes are randomly assigned to stands with no predetermined spatial clustering. The color of each cell represents the value for infection probability of that cell. White cells in both landscapes are non-tree areas with NA values.

Tool or approach that you used: Spatial autocorrelation analysis, Moran’s I, correlogram (R)

My model calculates probability of infection for each tree based on a variety of tree characteristics, including proximity to infected trees, so I expected to see spatial autocorrelation (when a variable is related to itself in space) with the clustering of high and low values of probability of infection. Because some management practices (i.e., high planting density, clear-cut harvest, thinning, shorter rotation length) have been shown to promote the spread of infection, there is reason to hypothesize that more intensive management strategies – and their spatial patterns in the landscape – may affect the spread of black stain at multiple scales.

I am interested in hotspot analysis to later analyze how the spatial pattern of infection hotspots map against different forest management approaches and forest ownerships. However, as a first step, I needed to show that there is some clustering in infection probabilities (spatial autocorrelation) in my data. I used the “Moran” function in the “raster” package (Hijmans 2019) in R to calculate the global Moran’s I statistic. The Moran’s I statistic ranges from -1 (perfect dispersion, e.g., a checkerboard) to +1 (perfect clustering), with a value of 0 indicating perfect randomness.

Moran’s I = -1

Moran’s I = 0

Moran’s I = 1

 

 

 

 

 

 

 

 

I calculated this statistic at multiple lag distances, h, to generate a graph of the values of the Moran’s I statistic across various values of h. You can think of the lag distance of the size of the window of neighbors being considered for each cell in a raster grid. The graph produced by plotting the calculated value of Moran’s I across various lag values is called a “correlogram.”

What did I actually do? A brief description of steps I followed to complete the analysis

1. Imported my raster files, corrected the spatial scale, and re-projected the rasters to fall somewhere over western Oregon.

I am playing with hypothetical landscapes (with the characteristics of real-world landscapes), so the spatial scale (extent, resolution) is relevant but the geographic placement is somewhat arbitrary. I looked at two landscapes: one where management classes are clustered (“clustered” landscape), and one where management classes are randomly distributed (“random”). For each landscape, I used two rasters: probability of infection (continuous values from 0 to 1) and non-tree/tree (binary, 0s and 1s).

2. Masked non-tree cells

Since not all cells in my raster grid contain trees, I set all non-tree cells to NA for my analysis in order to avoid comparing the probability of infection between trees and non-trees. I used the tree rasters to create a mask.
c.tree[ c.tree < 1 ] <- NA # Set all non-tree cells in the tree raster to NA
c.pi.tree <- mask(c.pi, c.tree) # Combine the prob inf with tree, leaving all others NA
# Repeat with randomly distributed management landscape
r.tree[ r.tree < 1 ] <- NA # Set all non-tree cells in the tree raster to NA
r.pi.tree <- mask(r.pi, r.tree) # Combine the prob inf with tree, leaving all others NA

Fig 2. Filled and hollow weights matrices.

3. Calculated Global Moran’s I for multiple values of lag distance.

For each lag distance, I created a weights matrix so the Moran function in the raster package would know how to weight each neighbor pixel at a given distance. Then, I let it run, calculating Moran’s I for each lag to create the data points for a correlogram.

I produced two correlograms: one where all cells within a given distance (lag) were given a weight of 1 and another “hollow” weights matrix when only cells at a given distance were given a weight of 1 (see example below).

4. Plotted the global Moran’s I for each landscape and compare.

 

 

 

 

 

 

What did I find? Brief description of results I obtained.

The correlograms show that similar values become less clustered at greater distances, approaching a random distribution by about 50 cell distances. In other words, cells are more similar to the cells around them than they are to more-distant cells. The many peaks and troughs in the correlogram are present because there are gaps between trees because of their regular spacing in plantation management.

In general, the highest values of Moran’s I were similar between the landscape with clustered management landscape and the landscape with randomly distributed management classes. However, the rate of decrease in the value of Moran’s I with increasing lag distance was higher for the random landscape than the clustered landscape. In other words, similar infection probabilities had larger clusters when forest management classes were clustered. For the clustered landscape, there was actually spatial autocorrelation at lag distances of 100 to 150, likely because of the clusters of higher infection probability in the “old growth” management cluster.

Correlogram for the clustered and random landscape showing Moran’s I as a function of lag distance. “Filled” weights matrix.

Correlogram for the clustered and random landscape showing Moran’s I as a function of lag distance. “Hollow” weights matrix.

 

 

 

 

 

 

 

 

 

 

 

 

 

Critique of the method – what was useful, what was not?

My biggest issue initially was finding a package to perform a hotspot analysis on raster data in R. I found some packages with detailed tutorials (e.g., hotspotr), but some had not been updated recently enough to work in the latest version of R. I could have done this analysis in ArcMap, but I am trying to use open-source software and free applications and improve my programming abilities in R.

The Moran function I eventually used in the raster package worked quickly and effectively, but it does not provide statistics (e.g., p-values) to interpret the significance of the Moran’s I values produced. I also had to make the correlogram by hand with the raster package. Other packages do include additional statistics but are either more complex to use or designed for point data. There are also built-in correlogram functions in packages like spdep or ncf, but they were very slow, potentially taking hours on a 300 x 300 cell raster. That said, it may just be my inexperience that made a clear path difficult to find.

References

Glen, S. 2016. Moran’s I: Definition, Examples. https://www.statisticshowto.datasciencecentral.com/morans-i/.

Robert J. Hijmans (2019). raster: Geographic Data Analysis and Modeling. R package version 2.8-19. https://CRAN.R-project.org/package=raster

 

Exercise 1: Comparison of Interpolation Methods Using Fisheries Abundance Data

Question Asked

I would like to understand how the spatial variability of forage fish in the California Current Ecosystem is related to the spatial pattern of seascape classes (based on remotely sensed sea-surface variables)? In exercise 1, I will asking “what is the spatial pattern of forage fish in the California Current ecosystem?” In order to address this question, I will be testing the use of different types of interpolation on my point-data.

Approaches Used and Methods

To address these questions, the Kriging Interpolation and Inverse Distance Weighting Interpolation tools were employed in ArcMap. All processes were completed in ESRI ArcMap 10.6.1. Interpolation is described as a method of constructing new data using existing data as references. In spatial and temporal analyses, there are a range of different types of interpolation that can be used.

The original data, which includes a series of about 1300 trawls, the catch per unit effort (CPUE) per species per trawl, the latitude, longitude, and water column depth of each trawl, and the functional group of each species caught, were loaded into ArcMap using the “Display X,Y Point Data” function. The four functional groups used in this analysis are Forage, Young-of-the-Year Rockfish (YOYRock), YOYGround, and crustaceans. In the case of this analysis, all fishes that are part of the YOYRock functional group were included.

Representation of the raw YOYRock Trawl data in ArcMap

After the data were uploaded, they had to be converted to feature classes. For the purposes of this exercise, all trawl data from 2002 to 2015 was included as one feature class, though the way in which the data are organized make it easy to break the trawls down at a finer temporal resolution. The result was a Point Shapefile that was then binned into four abundance groups to make interpretation easier. The IDW and Kriging tools (from the “Spatial Analyst” Toolbox) were then employed. The base equation for both interpolations is virtually the same, and both are commonly used methods for continuous data in point form, but there are some major differences in the calculation of some of the stated variables:

Representation of the equation used to assign values to missing cells using the IDW and Kriging methods. Z(si) = the measured value at the ith location, λi = an unknown weight for the measured value at the ith location, s0 = the prediction location, and N = the number of measured values

Z(si) = the measured value at the ith location, λi = an unknown weight for the measured value at the ith location, s0 = the prediction location, and N = the number of measured values

1) IDW: IDW, or Inverse-distance weighting, is what’s known as a deterministic interpolation method, as it relies on surrounding values to populate the resulting surface. One of the major assumptions with using IDW is that it assumes that the variables being mapped decrease in influence linearly as you move away from a given value. In IDW, the weight given to the “measured value at the ith location” is solely calculated linearly, decreasing as you move farther away from a given value. In this case, the “power” value, which corresponds to the weight, was the default 2.

2) Kriging: Kriging is an interpolation method based on geostatistics including autocorrelation. The equation used to calculate the missing values is the same as for IDW, except that the weight variable is calculated using a mathematical function within a certain specified radius of the missing value. For this reason, one of the main assumptions when using Kriging is that the “distance or direction between sample points reflects a spatial correlation that can be used to explain variation in the surface.” (ESRI, ND).

Results

The resulting interpolations can be found below. I’ve included output that focuses on the significant region of Monterey Bay to provide context regarding the proximity of the trawls to one another and to show detail on the boundaries of the interpolations.

Full Interpolation using IDW

Full Interpolation using Kriging

Detail of IDW in Monterey Bay, CA

Detail of Kriging in Monterey Bay, CA

Critiques of Methods

Any analysis of the results will conclude that the Kriging Tool provided a much more robust interpretation of the same patterns that can be observed in the original data and in the IDW interpolation. Both interpolations displayed significant patterns near Monterey Bay, and the Kriging Interpolation also represented additional areas of higher abundance north of San Francisco. While I do believe that both interpolation methods provide a good place to begin analysis, I believe that several adjustments will have to be made in order to create a useable result.

My first mistake was using the entire time series – since the interpolations are distance-based and extremely sensitive to points within close proximity to one another, the clear clusters of point most likely influenced the interpolations. A next step for me will be breaking the data down to an annual resolution, as I feel that shapefiles with one point at each trawl location will provide better data for interpolation. Additionally, this will provide multiple maps, which will allow for a chance to observe how the modeled patterns of YOYRock abundance have changed over time.

Another next step will be exploring the fine adjustments available within each interpolation method. I now have a greater understanding of the mechanics which drive IDW, so I’m eager o rerun the analyses at different powers to see how each impacts the interpolation. Similarly, the radius used to decide which points are considered while calculating a Kriging Interpolation can be adjusted, so that will be done in the future as an experiment.

Finally, I ran out of time to explore the symbology of the results – I hypothesize that classification by a smaller number of classes would result in more robust interpolation maps, as the visualizations now show a vast majority of the space to fall into the lowest class. The interpolation data is relatively bimodal in nature, so an adjustment in the symbology tab would likely result in a a more accurate and precise representation of abundance.

Overall, I see interpolation as a valuable way to identify spatial patterns from point data. In the case of species abundance data, I believe that Kriging is the superior method, as it does not have the linear influence assumption that’s baked into IDW. Additionally, the geostatical methods used in Kriging generally allow for a more robust and precise interpolation regardless of the type of continuous data being used.

The next steps mentioned above will be taken before the presentation of Tutorial 1.

References

http://desktop.arcgis.com/en/arcmap/10.3/tools/3d-analyst-toolbox/how-idw-works.htm

http://desktop.arcgis.com/en/arcmap/10.3/tools/3d-analyst-toolbox/how-kriging-works.htm

Santora, Jarrod & Hazen, Elliott & Schroeder, Isaac & Bograd, Steven & Sakuma, KM & Field, JC. (2017). Impacts of ocean-climate variability on biodiversity of pelagic forage species in an upwelling ecosystem. Marine Ecology Progress Series. 580. 10.3354/meps12278.

 

Determining spatial autocorrelation of refugee settlements in Uganda

  1. Question that you asked

In analyzing the distribution of settlement boundaries that I obtained from OpenStreetMap, I wanted to know the general clustering and regionality of settlements in order to understand how other spatial statistics that I perform in my next step will behave based on the results from this explanatory step. The question I’m introducing for Exercise A centers around how similar or different nearby settlements are to each other: is there a regionality in settlement sizes? Some future questions that I’m considering are whether or not clustered settlements have higher detection in the World Settlement Footprint classification (which I’m using instead of GHSL because it’s a more localized classification). This would be determined using the percent of OSM settlement area that was detected by WSF.

  1. Name of the tool or approach that you used.

I decided to use spatial autocorrelation to answer this question, since the essence of my question centers around what the pattern of my data looks like, how clustered or not clustered are these settlements, and how that might affect my future analysis and considerations.

For this, I employed the Spatial Autocorrelation tool in ArcGIS Pro, which uses Global Moran I’s algorithm.

  1. Brief description of steps you followed to complete the analysis.

In order to identify the refugee settlements, I went through multiple rounds of querying with OpenStreetMap to extract the boundaries that are directly related to refugee camp boundaries. Hannah Friedrich and I have worked together on defining some of these boundaries, and she took some time to delineate boundaries for a separate study of hers. While I will incorporate these delineated ‘regions of interest,’ I will most likely not move forward with them in further analyses because it would present an issue with scaling up and manual interpretation. Below I list the three polygon data I’m analyzing here.

OSM Boundaries: This is a result of multiple query efforts within Overpass Turbo, a online server for downloading OpenStreetMap data. Various queries on attribute tags were performed in order to select relevant refugee boundary data.

Regions of Interest: This is a result from Hannah’s efforts of limiting areas within the OSM Boundaries that appear to be built up using high resolution imagery available on Google Satellite.

Selected OSM Boundaries: This is a selection from OSM Boundaries that merges polygons of the same settlement into multipolygons; this also a selection of boundary polygons that are less than 2000 hectares to account for some boundaries that are designated versus actually settled.

I then ingested these boundaries into Google Earth Engine, extracted the pixel areas for the WSF pixels within the three different boundary layers, and re-exported these.

I then imported these layers into ArcGIS Pro and performed the Spatial Autocorrelation tool and experimented with multiple parameters.

I ultimately chose the “row standardization” option given the possible bias in ‘sampling’ design, since this creation of this data is most often from the HOT Uganda team and resources might limit where they can travel and collect data.

  1. Brief description of results you obtained.

Ultimately, my results showed clustering within the OSM-identified settlements, which is to be expected. There are a variety of questions that arise from this that might confound my analysis that are still moderately troubling: the method of data recording, the spotlight effect, the presence of organizations able to record data, and the inherent clustering based on human behavior, or drivers such as conflict or environmental conditions. This initial analysis is really to understand the degree of clustering that I might expect in my future analyses – if there is high clustering, then I need to understand that clustering when I’m testing additional questions with my next exercise, like nearest road or nearest urban area. The images below demonstrates the result of an analysis settlement area in the “Selected OSM Boundaries,” “OSM Boundaries,” and “Regions of Interest.”

Moran I, Selection of OSM Boundaries

Moran I, Delineated Regions of Interest

Moran I, All OSM Boundaries

 

While OSM Boundaries and Selection from OSM Boundaries show high confidence of clustering, the “Regions of Interest” pattern shows a less confident result of clustering. This makes sense, given that both the OSM Selection and OSM Boundaries both have overlapping polygons and varying sizes.

Below is a map illustrating the distribution of refugee areas in northwest Uganda, where most of the settlements are concentrated.

  1. Critique of the method – what was useful, what was not?

Given my shortcomings with understanding statistical analyses, the interpretation of the results was most difficult for me. While my patterns appeared mostly clustered, the z-score and Moran’s Index showed changes when different parameters of comparison were chosen. Ultimately, I’m not sure if there would be a more effective spatial analysis to analyze aspects of these settlements that would prove more helpful in figuring out relevant information for moving forward with Exercise 2 and beyond.

Ex. 1: Spatial Autocorrelation in Stream Cross-Sectional Change In the Andrews Forest

Question that you asked

How much spatial autocorrelation exists in historic streambed changes in the Andrews Forest, both in the across-stream direction and the along-stream direction?

Name of the tool or approach that you used

I addressed each direction of the problems separately using. Since this resulted in one-dimensional, evenly spaced or faux-evenly spaced data, I just used the acf function in R to analyze the data.

Critique of the method – what was useful, what was not?

The methods provided an interesting look at patterns within the cross section data set. However, I think the one-dimensional approach fails to capture interactions between across- and along- channel changes as well as any temporal autocorrelation.

Description of steps you followed to complete the analysis and results you obtained.

For my project, I am looking at changes in the streambed along 4 stream reaches (black boxes) along two streams (blue lines) in the Andrews forest. The watershed of the larger stream, Lookout Creek is shown in black.

Researchers in the Andrews forest repeatedly surveyed fixed cross sections along the reaches from 1978 to 2011. In this exercise, I worked with a subset of the data from 1978 to 1998. The cartoon below does not contain real data but shows the arrangement of numbered reaches (blue lines) and their fixed endpoint stakes (blue dots) along a stream.

Part A – across-channel changes

Each individual cross section spans from bank to bank and show the topography and bathymetry across the width of the stream. Below are two consecutive years of cross-sectional profiles of the stream bed at Lower Lookout Creek, cross section # 3.

Profile from 1995.

Profile from 1996.

It looks like there was a lot of change between these two profiles. The active stream channel appears to have avulse from one side of the cross section to the other.

I compared the two profiles to see how much elevation was gained or lost.

Here grey line represents the surface of the stream bed in 1995, and the black line represents the surface of the stream bed in 1996. The area show in red represents material that was eroded away, and the green area represents material that was newly deposited.

We can simplify the plot above to show only the positive or negative elevation change across the channel. The figure below shows the same information as the figure above, but directly in terms of elevation change.

You can clearly see from this figure that this site at this year had one broad area of elevation gain or deposition, and one broad area of elevation loss or erosion. This isn’t precisely an attraction/repulsion phenomenon, but there can be positive feedbacks associated with erosion and deposition. For example, sediment deposited in a stream may locally obstruct or slow down the water passing by it and encourage deposition in adjacent areas.

Now, not every pair of years had this *much* elevation change, but also, not every pair of years shows such broad regions of positive or negative elevation change. Here is a figure of the elevation change at the same cross section between 1996 and 1997. You can see that elevation change is both smaller in magnitude and shows up in smaller patches.

I used spatial autocorrelation metrics to investigate the patch size. Note that autocorrelation at some scales is always expected when looking at topographic change. Even if you had extremely precise and accurate measurements, you would expect the stream bed change at one point to be strongly correlated with the change a few cm away both because the two points probably experience very similar forces and also because gravity tends to smooth out topographic irregularities in nature. But in most scenarios, you would probably not expect to find a correlation between the changes in a streambed with a change at a point 100 m away from the bank.

I picked two distances that seemed potentially interesting: 1 meter and 5 meters. Then I investigated autocorrelation at 1m and 5m lag distances using the acf function in R.

These lag distances are superimposed on the first change graph in the figure below.

I expected that all cross sections would have higher autocorrelation at the shorter lag distance than the longer one, but I also expected that a set of cross sections with very large patch sizes might show relatively higher autocorrelation at the larger lag distance compared to the small-patched cross sections. I need to spend more time looking into how the magnitude of the changes influences the autocorrelation function so that I can understand it better.

I repeated the process for every cross section at every year. Results are shown below.

The cross sections showed mostly positive autocorrelation at 1 m. Some years showed more autocorrelation across the board than others, and some individual cross sections seemed like they frequently had high autocorrelation values at multiple years.

Different patterns emerge at this spatial scale. At a 5 m lag, neutral or negative autocorrelation was much more common.

 

Part B – along-channel changes

I also wanted to look at autocorrelation between cross sections. Was the change at one cross section related to the change at cross sections immediately up- or down-stream?

Because I don’t have true geographic locations of the cross sections, I couldn’t do any fancy two dimensional work like interpolating the data into a river bathymetry model. Instead, I consolidated the cross-sectional change data into a single metric for each cross section.

Here is another look at the change plot I showed earlier. This is the same plot as before, but I’ve colored the positive values green to show deposition and the negative values red to show erosion.

I simplified this plot by removing the horizontal across-stream component and summing the red and green areas separately.

Then I had a choice: I could add the red and green areas together to show the total area of the cross section that changed in any way, or I could subtract the red area from the green area to show the net deposition in the cross section. Each of these metrics could be interesting for different reasons. For example, the total area of change might be relevant for ecological processes, but the net overal change could be relevant for tracking net sediment transport in these streams.

I started by adding the boxes together to calculate the total area of change.

In this way, I made a score for every cross section for every year pair. Hypothetical scores for the cartoon river are shown below.

Then I used the acf function again in R to determine how correlated adjacent cross section scores were with each other, compared to more distant cross section scores. This method is not fully spatial explicit, since I did not include any information about how far apart each cross section was from its neighbors. I just assumed that they were roughly equidistant within each reach. Indeed, the distance between cross sections can become hard to define when they are placed along curved parts of the stream.

I calculated these values with every year pair on each of the four reaches. I then repeated it with the subtraction method described above.

For the “addition” method, autocorrelation values were generally close to neutral with some reaches in some years tending towards positive values. Positive values could occur if changes in one reach directly influenced an adjacent reach (i.e. if upstream channel change influenced the hydraulics directly downstream) or if adjacent cross sections were influenced by similar other factors (i.e. two adjacent cross sections placed on a steep part of the reach might experience more similar changes to each other than two cross sections placed in a shallower part of the reach.)

The pattern is different for net change, where negative autocorrelation is more common. I think this is very interesting, because it implies that parts of the stream that had net deposition were more likely to be adjacent to parts of the stream that had net erosion. This could be an expression of a sink/source phenomenon where material scoured from one area is transported and deposited immediately downstream at the next cross section.

Exercise 1: Spatial Distribution of Juniper Saplings

1.Question asked and data used: 

  • What are the spatial patterns of juniper sapling establishment within a treated watershed 14 years after most juniper were removed?
  • Two data sets were used for this analysis:
    • Thirty-three belt transects (30 m in length, 3 m in width) were used to assess the density of juniper saplings within the watershed (data in form of excel spreadsheet listing latitude and longitude with number of juniper per transect)
    • An orthomosaic created from UAV imagery of a small area within the watershed (data in form of multispectral raster, with brightness values for red, green, blue, near-infrared, and re-edge wavelengths)
  • Watershed map (National Agricultural Imagery Program image from 2016 shown) with belt transects and UAV study area:

2. Approach and tools used: A number of techniques were initially applied to explore the data and to examine which approaches were more (or less) effective.

  • Belt transects:
    • Kriging
    • Inverse Distance Weighted (IDW) interpolation
    • Spatial autocorrelation (Global Moran’s I)
    • Hot-spot analysis
  • Classified raster (UAV orthomosaic):
    • Hot-spot analysis

3. Steps used in analysis:

  • Slope and aspect calculation:
    • The slope and aspect tools (Spatial Analyst Tools) were applied to 30 m DEM (from earthexplorer.usgs.gov). This information was noted for hot spots and cold spots but not used for further analysis during this exercise.
    • The extract multi values to points was used to extract the slope, aspect, and elevation value
  • Belt transects:
    • Projected data into NAD 1983 UTM Zone 10N
    • The location of each survey was symbolized by color based on the number of juniper found.
    • Kriging (Spatial Analyst Tools and Geostatistical Wizard) was used for initial exploratory analysis to assess general spatial distribution across the watershed
      • The number of juniper per transect used as the z-value field
      • Histogram and QQPlot used assess distribution of the number of juniper per transect for the geostatistical layer
      • For the raster layer, the predicted values were compared to the observed values by extracting the values at each survey location
    • IDW interpolation
      •  Juniper used as input feature
      • Maximum neighbors:15; Minimum neighbors:10
    • Spatial autocorrelation (Global Moran’s I) (Spatial Statistics Tools)
      • Calculated using the number of juniper per transect
      • HTML file created lists Moran’s index, z-score, and p-value
    • Hot Spot Analysis (Getis-Ord Gi) (Spatial Statistics Tools)
      • Conceptualization of spatial relationships: Fixed distance band
      • Distance method: Euclidean distance
      • Default was used for the distance band
  • Classified Raster (orthomosaic):
    • Supervised classification (support vector machine) was used to identify juniper within the UAV orthomosaic
    • General procedure for classification: create training samples-> assess using interactive training sample manager and scatterplots->create .ecd file->apply classifier->apply Majority Filter tool
    • A point file was created from pixel clusters identified as juniper within the image
    • Hot Spot analysis (Getis-Ord Gi)

      • Conducted to assess areas of concentration of juniper saplings within the sample area
      • Integrate tool used to aggregate juniper data (5 m used for analysis, but other distances were initially tested)
      • Hot Spot analysis tool using aggregated data created in previous step as input layer (other inputs were the same as those used for the belt transect hot spot analysis)

4. Overview of Results:

  • Belt Transects
    • Kriging. I used two different approaches to kriging. First, I used the kriging tool under spatial analyst tools and then used the geostatistical wizard to calculate ordinary prediction kriging. The resulting maps using these two approaches were different. In particular, the use of the geostatistical wizard created maps more similar to the IDW while the geostatistical wizard created a map with different contours.
      • Related statistics:
        • minimum:0.05
        • maximum: 6.97
        • mean: 2.83
        • standard deviation: 0.86
      • Spatial Analyst Kriging Map:
      •  
      • Using the export values function, the predicted values of this kriging method at each belt transect site were very similar (within 0.05) to the observed values.
      • Ordinary Prediction Kriging:
      • The QQ plot indicates that assumptions of normality may not be met:
      • The ordinary prediction kriging also tended to overestimate the juniper density based upong the distribution chart:
        • Related cross-validation/error statistics for the ordinary prediction kriging:
          • Mean -0.0583700520511708
            Root-Mean-Square 2.09329560839956
            Mean Standardized -0.0193126589526469
            Root-Mean-Square Standardized 0.986330022079785
            Average Standard Error 2.09938534113134
    • IDW
      • While general trends were similar to kriging, the size and shape of contours between the methods were different.
      • Mean: 3.08, range: 0.00 to 7.00, standard deviation: 1.16
      • IDW
    • Spatial autocorrelation (Global Moran’s I)
      • Moran’s I based on juniper per transect: -0.019, with a p-value of 0.92
      • Indicates that the pattern of juniper density is considered random and not significantly different than a random distribution
    • Hot Spot analysis (Getis-Ord Gi)
      • One hot spot found (90% confidence): northwestern aspect (300 degrees) with 8.8% slope
      • One cold spot found (95% confidence): north-northwestern aspect (347 degrees) with 11.5% slope
      • Remaining points were considered insignificant
      • Map of Hot Spot analysis:
  • Classified Raster
    • Hot Spot analysis (Getis-Ord Gi)
      • Within the area covered by the orthomosaic, four hot spots were found:
        • One area with 99% confidence: 11% slope, 325 degree aspect
        • Three areas with 95% confidence: 3% slope, 254 degree aspect; 11% slope, 167 degree aspect; and 8% slope,137 degree aspect
      • UAV Orthomosaic Hot Spot map:
      •  

5. Discussion/Critique of Methods.

Based on the maps produced using the IDW and kriging methods, some general trends in juniper spatial distribution exist within this watershed. For example, we see lower densities in the north-northeastern areas and greater densities of juniper in the southeastern and western areas. However, both the IDW and the kriging raster using the spatial analyst tool produced the characteristic “bulls-eye” pattern often associated with IDW. When compared to the ordinary prediction kriging maps, different interpretations of juniper density in the watershed might be made. Compared to the belt transects data, the kriging created using the spatial analyst tool was similar to the observed values while the ordinary prediction kriging tended to overestimate the distribution. However, more ground data is necessary to determine how accurate these prediction would be in other areas.  One of the biggest takeaways from this is to carefully consider the approach used and the nature of the data (for example, the use of ordinary versus simply kriging, etc.). From a user standpoint, I found the geostatistical wizard the most useful approach — particularly as it made inspecting the statistics and semivariogram very easy. However, in the future I would explore different methods of kriging within this tool.

The spatial autocorrelation and hot spot tools were useful, although the results did not provide much significant information regarding the spatial distribution of juniper in the cases examined here. For future analysis, particularly when examining the relationship between juniper density and other watershed characteristics (e.g., slope, aspect, etc.) these tools may become more important. In the case of the UAV orthomosaic, this provided a “test run” for what will be a larger dataset. The steps taken prior to analysis, particularly creating a point layer from the classified pixel clusters, will be time intensive and may require alternative approaches. At the limited scale of the UAV orthomosaics these hotspots did not provide much useful information, but if extrapolated over a larger area more patterns may be observed.

The distribution of juniper saplings in this case may be associated with a number of factors. Juniper seeds may be spread through birds or wind, resulting in the spread of juniper saplings throughout the watershed. At a much larger scale, if seed sources are less available, this distribution may be more localized. This pattern may also vary with the presence of mature juniper. As previously discussed some patterns emerged in this data (lower densities in the northern areas of the study area, for example) which may be associated with the spatial distribution of other characteristics, such as soil moisture. For example, sources and areas of higher juniper may include areas were we anticipate higher soil moisture such as flatter slopes with northerly aspects. These factors will be assessed further in exercise 2.

 

Spatial variation of economic losses resulting from a joint earthquake/tsunami event: An application to Seaside, OR

Question

What is the spatial variability of economic value of buildings as well as losses resulting from a joint earthquake/tsunami event? How does this spatial variability relate to independent earthquake and tsunami events, as well as the intensity of the hazard?

The purpose of this analysis was to consider the spatial variability of initial economic value, as well as economic losses resulting from a joint earthquake/tsunami event. The losses were deaggregated by hazard (earthquake only, tsunami only, joint earthquake/tsunami), as well as intensity of the event (100-year, 250-year, etc.).

Tool and approach

Two methods were implemented to evaluate the spatial variability economic value and losses: (1) interpolation via kernel density, and (2) a hotspot analysis using the Getis-Ord Gi* statistic. The economic losses were determined using a probabilistic earthquake/tsunami damage and loss model. This model implements Monte-Carlo methods to estimate the expected economic losses following an earthquake/tsunami event. For this application, 10,000 simulations were ran, from which the average loss at each building was computed for earthquake only, tsunami only, and a joint earthquake/tsunami event.

Description of steps

The average losses at each building resulting from the earthquake/tsunami loss model were added as an attribute to a GIS shapefile. Two methods to evaluate the spatial distribution were considered:

  1. Interpolation via kernel density: Spatial interpolation was performed using a kernel density estimate. A kernel with a size proportional to the value of the attribute under consideration (in this case economic value/loss) is placed at each data point. A map is then created by taking the sum of all kernels. The kernel radius and shape can vary to produce different results. In this analysis, a quartic kernel was utilized with a radius of 200 meters. The interpolation was performed using the built in interpolation feature in QGIS 3.
  2. Hotspot analysis using the Getis-Ord Gi* statistic: Hotspot analysis was performed using the Getis-Ord Gi* statistic. This statistic results in a p-value and z-score at each attribute, providing insight into whether the null-hypothesis can be rejected (in this case, spatial randomness). As such, features with a small p-value and a very large (or very small) z-score indicate that the null can be rejected (or that the data is not spatially random). Consequently, applied across an entire spatial dataset, the hotspot analysis identifies statistically significant clusters of high (or low) values. The hotspot analysis was performed using the available Hotspot Analysis plugin for QGIS 3.

Description of results

The results of the analysis are shown in Figure 1. The columns correspond to interpolation and hotspot analysis respectively. The first row shows the building values, whereas the second shows the economic losses resulting from a joint earthquake/tsunami event (2,500-year return period).

Areas of high economic value and losses can be easily observed from the interpolation analysis. Here, areas of red correspond to larger damages. Similarly, statistically significant clusters of large (and small) damages can be observed from the hotspot analysis. Again, red corresponds to a statistically significant hot spot (e.g. a cluster of large values and losses), whereas blue corresponds to a statistically significant cold spot (e.g. a cluster of small economic values and losses).

A large concentration of economic value is centrally located along the coast, and is due to the presence of resorts and condominium complexes. This area is observed from both the interpolation and hotspot analysis. Interestingly, more clusters are observed from the hotspot analysis as opposed to the interpolation. This could be explained by the scaling of the interpolation. In this case, the red regions correspond to a maximum value of $20M. If this value was reduced by half, more areas of high concentration would be observed.

The hotspot analysis provides insight into statistically significant clusters of high and low values, as opposed to single points of high values; however, when comparing interpolation and hotspot analysis, it should not be neglected that the results of the latter are visually more difficult to observe. This is due to the discrete nature of the Getis-Ord Gi* statistic (e.g. each point corresponds to a p-value and z-score, as opposed to the continuous surfaces of interpolation). This results in polygons that are shaded according to confidence levels.

Figure 1: Comparison of interpolation and hotspot analysis for both initial building value and economic losses

In addition to the initial value and economic losses resulting from the 2,500-year earthquake/tsunami event, interpolated maps were deaggregated based on hazard (earthquake only, tsunami only, combined) as well as intensity of the event (return years 100, 250, 500, 1,000, and 2,500). The results are shown in Figure 2, where each row corresponds to the hazard, and each column to the intensity of the event. Furthermore, the total economic losses to all buildings in Seaside were determined based on hazard and intensity (Figure 3).

Figure 2 shows that the economic losses are spatially consistent across Seaside for the 100-year event, and begin to exhibit spatial variability as the intensity increases. Losses begin to accumulate for the 250-year event near the center of Seaside, and it can be seen that the earthquake is the primary driving force. Similar trends are observed for the 500-year event. The 1000- and 2500-year events begin to see significant tsunami losses that are not as spatially concentrated as the earthquake losses, but are more evenly distributed along the coast. Figure 3 shows that the tsunami losses begin to dominate for the 1000-year event.

Figure 2: Earthquake and tsunami hazard deaggregated by hazard and intensity

Figure 3: Total earthquake and tsunami damages across Seaside, OR

Critique

Both the interpolation and hotspot analyses have limitations. As previously mentioned the hotspot analysis can be aesthetically challenging. Additionally, difficulties may arise in communicating the confidence levels to community planners and resource managers who may not have a statistical background.

Similarly, spatial interpolation via kernel density has its own limitations. As there are subjective options when performing the interpolation and viewing the results (e.g. radius, color scheme, and maximum values), the resulting maps could easily be deceiving. Figure 4 shows the same data but use of a different radius to define the kernel. It can be seen that the map on the right appears more severe than the map on the left. The practicality of a spatial interpolation map ultimately depends on the GIS analyst.

Figure 4: Comparison of interpolation resulting from different kernel radii.

Courtney’s Ex 1: Spatial patterns of ion concentrations in groundwater

1: Question being asked

How do ion and isotope concentrations vary spatially? And what trends in concentrations account for variability between my samples?

For detailed background information, I point you towards my problem statement for this class. In short, I’m hoping to use geochemical and isotope values for wells 32 well I sampled last summer in order to better understand how groundwater flows across faults in the Walla Walla Subbasin in northeastern Oregon. In the analysis for this post, I only used a subset of 18 wells for which I had the largest suite of measured parameters.

2: Approach used

For this exercise, I’m investigating spatial patterns of my ion and isotope concentrations. I could do this by individually examining distributions of all 19 of the parameters that I sampled for… but that would be graphically overwhelming and not terribly useful. It’s too difficult to draw conclusions by comparing 19 different maps of distributions.  Instead, I’m using a method called principal component analysis (PCA for short) to statistically evaluate my samples by the groups of parameters that account for large amounts of the variation that differentiates one sample from another.

The nuts and bolts of PCA involve sophisticated analysis that calculates eigenvectors, linear algebra, and n-dimensional properties of a data set, and I will let people more knowledgeable than me explain this if you’re curious about how exactly PCA works.

https://blog.bioturing.com/2018/06/14/principal-component-analysis-explained-simply/

R spits out two kinds of useful graphic output for PCA based on the concept of a “biplot”. The first is called a loading plot, where two principal components are graphed against each other, for example PC1 and PC2. The value for each parameter in PC1 and PC2 is used to give the point a cartesian coordinate which defines a vector in comparison to the origin. The second kind of output is called a scree plot, which applies the same concept to the sample points (called “individuals” in the code) instead of to the  parameters (called “variables” in the code). I used an R package called ggbiplot to combined the scree and loading plots in one figure.

https://blog.bioturing.com/2018/06/18/how-to-read-pca-biplots-and-scree-plots/

https://learnche.org/pid/latent-variable-modelling/principal-component-analysis/interpreting-score-plots-and-loading-plots

3: Methods

I used R to calculate the principal component vectors for my data and to create charts for them, and then symbolized my data in ArcGIS Pro to see if there was a spatial pattern to the information in the PCA output. The code in R is fairly manageable. I’ve included it below with my comments.

# Source: https://www.datacamp.com/community/tutorials/pca-analysis-r

# Preliminary set up stuff – add libraries, set working directory

library(devtools)

library(ggbiplot)

setwd(“~/_Oregon State University/Thesis/Summer Fieldwork/PCA”)

# Before importing your data, make sure that every entry in its table has a numeric entry in it.

# This process will spit out garbage if you have any blank cells or text.

# The number of rows in the data set needs to exceed the number of columns. I used an argument like [,c(1:11)] to enforce this in my data

#Process 1 of 2: read data set 1 into R, view it, compute PCA, display summaries and biplots

# it’s important to set the row names so we can use them as labels in ggbiplot

PCAdata.ions2 <- read.csv(“ions_allchem20190401_usgsdwnld.csv”, row.names = 1)

View(PCAdata.ions2)

Ions2.pca <- prcomp(PCAdata.ions2[,c(1:11)], center = TRUE, scale. = TRUE)

summary(Ions2.pca)

ggbiplot(Ions2.pca, choices=c(1,2), labels=rownames(PCAdata.ions2))

ggbiplot(Ions2.pca, choices=c(1,3), labels=rownames(PCAdata.ions2))

#read data set 2 into R, view it, compute PCA, display summaries and biplots

PCAdata.nonions <-read.csv(“NonIons_allchem20190401_usgsdwnld.csv”, row.names = 1)

View(PCAdata.nonions)

NonIons.pca <- prcomp(PCAdata.nonions[,c(1:8)],center = TRUE, scale. = TRUE)

summary(NonIons.pca)

ggbiplot(NonIons.pca, labels=rownames(PCAdata.nonions))

ggbiplot(NonIons.pca, choices=c(1,3), labels=rownames(PCAdata.nonions))

# if you want plots other than PC1 vs PC2 or PC1 vs PC3 just change the argument “choices = (1,2)”

# Process 2 of 2 add tools to get individual coordinates

# Source http://www.sthda.com/english/wiki/get-pca-extract-the-results-for-individuals-variables-in-principal-component-analysis-r-software-and-data-mining

# only install the package if you haven’t before. You will need to have RTools 3.5 already installed

# install.packages(“devtools”)

# library(“devtools”)

# install_github(“kassambara/factoextra”)

# now run things, it will spit out tab-delimited data. The “ind” command outputs PC information for the individuals, and the “var” command outputs the PC information for the variables.

library(factoextra)

ind <- get_pca_ind(Ions2.pca)

ind$coord

ind2 <- get_pca_ind(NonIons.pca)

ind2$coord

# copy that info into text files, import them into csv format, join in Arc!

var1 <- get_pca_var(Ions2.pca)

var1$coord

var2< get_pca_var(NonIons.pca)

var2$coord

Next, I symbolized my data in ArcGIS Pro. Because I only had 18 samples, it wasn’t too onerous to use selection in the attribute table to export data to another layer to symbolize the groups I saw in the principle component analysis biplot.

To examine how the principle components affected individual points, I used the second process in the above code to generate a tab-delimited chart of the principal component loadings for each sample. I copied this into a .txt file, imported it into Excel to create a CSV, and then joined that CSV to my shapefile of well locations. For the variables’ coordinates, I likewise created .txt files and imported the into Excel, but I symbolized them in Excel.

4: Results

In this section, I’ll show the graphs and maps that I created for the ion section of my data, and give an explanation of what I learned from this analysis.

For my ion data, the first two out of the 11 principle components explain about 80% of the variation in my data.

The first one, PCI, explains 55% of the variation in my data set and is shown on the X axis of the graph below. It is strongly influenced by concentrations of calcium, magnesium, chloride, alkalinity, sulfate, bromide, and sodium, which all have eigenvalues on the X axis greater than |1|. Comparing the scree plot to the loading plot, wells 56287, 3074, 4179, and 4167 are the primary individual drivers of this PC.

PC2 explains another 25% of the variance in my data, and is shown on the Y axis. Fluoride, silicon, and manganese are the primary variable drivers of variation between the individuals in PC2.

CvS PC1 PC2 ions biplot with groups

Based on the coordinate information for the variables, the variation explained by PC1 is due to increased levels of calcium, magnesium, potassium, sodium, alkalinity, bromide, chloride, and sulfate at four wells. This combination of ions all being similar could be explained by recent recharge influenced by agricultural chemicals, such as fertilizers. In an undisturbed basalt setting though time, calcium and magnesium tend to decrease as sodium and potassium increase. The fact the these four parameters vary together in PCI indicates that a process other than ion replacement is driving ion concentration. The variation in PC2 is predominantly explained by fluoride, dissolved silica, and manganese. Elevated concentrations of these three variables in comparison to the other variables indicates that the groundwater has spent more time in the basalt.In PC2, calcium and magnesium are inversely related to sodium, indicating that the cation concentrations are driven by the ion replacement process.

While the figure above locates wells based on their grouping on a biplot comparing two principle components, the following figures show the wells color-coded by their loadings in PC1 and in PC2. I’m hypothesizing that wells with smaller PC1 values are influenced by agricultural recharge, and wells with more positive values on PC2 are more influenced by the processes of ion dissolution as water travels through the basalt.

color-coded map of PC1 values for the 18 wells

PC1 is heavily driven by four particular wells, which show up here as being in colors of blue and blue-grey.

The distribution of PC2 values shows a pattern that I find really interesting. More negative – bluer – values indicate samples which were less influenced by the ion replacement and dissolution processes of groundwater traveling through the basalt. The values at high elevation are negative, which makes sense because they’re closer to the upland recharge zone, but so do some samples in the lowlands to the north which are downgradient of certain faults. This could indicate that those PC2 negative downgradient wells are cut off from the longer flowpaths of groundwater by faults that act as barriers to groundwater flow. The pinker – more positive – points indicate samples that were more influenced by the along-flowpath ion reactions.

Next, I present the PCA results for the non-ion parameters. Unlike the “ions” set whose title was somewhat self-explanatory, this PCA brought together different types of chemical and physical properties of water. These include pH, temperature, and specific conductivity which were measured in the field,  the analyzed values of hydrogen and oxygen isotopes as well as tritium, the elevation of the bottom of the well, and the cation ratio. The cation ratio is a standard calculated parameter for groundwater chemistry which is equal to ([Na] + [K]) / ([Ca] + [Mg]). As water spends more time in contact with the rock, its ion exchange with the rock leads to a smaller cation ratio. The first biplot graphs PC1 (~40% of variation) on the X  axis against PC2 (22% of variation) on the Y axis, and the second biplot graphs PC1 on the X axis versus PC3 (~18% of variation) on the Y axis.

The PC1 grouping explains variance among the samples by inversely correlating tritium, deuterium (d2H) and oxygen-18 (d18O) with pH and the cation ratio.  More positive PC1 values show that the individual has high d18O and d2H values, and low cation ratio and pH values. All d18O and d2H values are negative; values that are closer to 0 (“larger: for the purposes of this plot) are more enriched in the heavier isotope. PC2 is predominantly driven by an inverse correlation between well bottom elevation and temperature. PC3 inversely correlates tritium, specific conductance, and well bottom elevation with temperature, deuterium, and 18-oxygen.

Yellow/orange points on this map have more positive PC1 loadings. They have higher d18O and d2H values, and lower cation ratio and pH values. The lower cation ratio and pH values indicate that the sample is less chemically evolved based on the expected ion exchange reactions in the basalt. I found it interested that samples on that end of the spectrum were also more enriched in the heavier isotopes of hydrogen and oxygen.

Unfortunately I can’t seem to upload the maps for PC2 and PC3 for non-ions, I’m getting a message that all of my allowance for image uploads has been used up…

5: Critique

I’m pretty happy with what I put together using principal component analysis. In a perfect world I would have learned to do the spatial visualization elegantly in R instead of messing around with exporting data to Arc, and made my variable charts in R instead of exporting them to excel. However for the purposes of this project the less elegant method was quicker for me.

Exercise 1: Ventenata spatial clustering

Question Asked

I am interested in understanding the invasion potential of the recently introduced annual grass ventenata (Ventenata dubia) across eastern Oregon. Here I ask, what is the spatial pattern of the ventenata invasion across the Blue Mountains Ecoregion of eastern Oregon?

Tools and Approaches Used

To address this question, I (1) tested for spatial correlation at various distances using Moran’s I spatial autocorrelation coefficients plotted with a correlogram, and (2) performed hot-spot analysis (Getis-Ord Gi) to identify statistically significant clusters of areas with high and low ventenata cover.

Description of Analysis Steps

1a) Moran’s I: To compute Moran’s I spatial autocorrelation coefficient for all of my sample units, I used the “ape” package in R version 3.5.1. The first step to this analysis was to convert the ventenata data and associated coordinates into a distance matrix. Once the distance matrix was created, the Moran.I function computed the observed and expected spatial autocorrelation coefficients for the variable of interest (ventenata abundance). The function produces a test statistic that tests the null hypothesis of no correlation. See Gittleman and Kot (1990) for details on how the Moran.I function calculates Moran’s I statistics.

1b) Correlogram: I plotted a correlogram using Moran’s I coefficients with increasing distances (lags) to examine patterns of spatial autocorrelation in my data. I used the correlog function in the spdep package in R to plot a correlogram with lag intervals of 10,000m. The function has the option of randomly resampling the data at each increment to incorporate statistical significance. This randomization tests the null hypothesis of no autocorrelation. I ran the function with resamp = 100. Black points on the correlogram are indicative of Moran’s I values significantly larger or smaller than expected under the null hypothesis.

2) Hot Spot Analysis: I used the hot spot analysis (Getis-Ord Gi*) tool in Arc GIS to identify statistically significant clusters of areas with high and low ventenata cover across my study area. The tool produces z-scores and p-values that test the null hypothesis of a random distribution of high and low values rather than clusters of high or low values. High z-scores indicate clusters of high values and low z-scores indicate clusters of low values. Low p-values indicate that these clusters are more pronounced than would be expected by chance.

Results

1a) Moran’s I: The Moran’s I spatial autocorrelation coefficient estimate for all of the points across the entire sample area was 0.3 ± 0.05 (p < 0.3). This value is not particularly informative, as it only indicates that the data is positive spatially autocorrelated, but does not provide information to describe the spatial pattern. I chose to follow the Moran’s I up with a correlogram to uncover the spatial pattern driving the autocorrelation.

1b) Correlogram: The Moran’s I spatial correlogram shows a general trend of decreasing autocorrelation from 0 to about 70,000m where sudden jumps in Moran’s I values occur to up to ~0.3. Following this jump, the correlation decreases to -0.5 to -0.2 between 120,000 and 152,000m, then increases to ~0.3 at 170,000m, decreases to almost -1.0 just after 200,000m, and finally increases to almost 1 at 220,000m. The general trend appears to be decreasing from 0.2 to -0.9 at 220,000m with some high peaks interspersed. These high and low peaks indicate distinct ventenata patches distributed throughout the study area, suggesting a clustered spatial pattern of the ventenata invasion. The extreme high and low values at distances over 200,000 are likely a result of the few sample units being compared at these distances, thus these are not so informative of the overall spatial pattern.

2) Hot Spot Analysis: Hot spot analysis in ArcGIS depicted clusters ranging from high ventenata cover (large red circles) to low ventenata cover (small blue circles) across my study area (Fig. 2) using the calculated z-scores and p-values for each sample unit. The resulting map shows distinct clusters of high, low, and moderate ventenata cover distributed across seven sampled burn perimeters (displayed in light orange). The highest cover clusters are all located within the Ochoco and Aldrich Mountains in the center of the study region. The fires on the perimeters of the region exhibited clusters of low to no ventenata cover.

Critique of Methods Used

When run on all of the data across the entire region, Moran’s I did not produce a useful statistic, indicating only if the data was spatial autocorrelation without indicating a spatial pattern. However, when visualized with a correlogram at varying distances, the correlation coefficients suddenly told a story of spatial clustering. The results from the hot spot analysis reinforce the findings from the correlogram by clearing depicting clusters on a map of the study area. The hot spot analysis further explores these results by mapping the clusters of high and low ventenata cover on top of each of my sample units, providing a useful visualization of exactly where the clusters of high and low cover fall across the region.

References

Gittleman, J. L. and Kot, M. (1990) Adaptation: statistics and a null model for estimating phylogenetic effects. Systematic Zoology39, 227–241.

 

Exercise 1: What is the spatial pattern of western hemlock dwarf mistletoe at the Wolf Rock reference stand?

For Exercise 1, I wanted to analyze the spatial pattern of western hemlock dwarf mistletoe infections in live western hemlocks on my 2.2 ha reference stand (Wolf Rock). This was without considering any attributes of the western hemlock trees themselves. Simply, what was the spatial pattern of infection?

To answer this I used the “Average Nearest Neighbor” tool in the Spatial Statistics toolbox in ArcMap. This tool calculates a z-score and a p-value from that z distribution. This is a commonly used method in dwarf mistletoe literature for assessing the clustering of infection centers. Also, the equations for this tool assume that points are free to locate wherever in space and that there are no barriers to spread.

ArcMap makes running these analyses very simple so I created a selection of infected trees (red dots), created a new feature, and then ran the tool. The p-value from my test was 0.097 and my Nearest Neighbor Index was 0.970, indicating that the spatial pattern of the infections are somewhat clustered with an alpha of 0.10.

Average Nearest Neighbor is a good test for analyzing whether or not a set of coordinates are clustered. The degree of clustering of may be harder to interpret as a lower p-value may not necessarily mean points are more clustered. Also I was unable to see where my clusters are, and if my intuitions match the analysis (see map). One other important consideration is the study area. Changes in analysis area can drastically change the result of your clustering analysis (i.e. larger study areas may make data look more clustered). Lastly, there was no option for edge correction. This may have skewed some of the clustering results along the edge of my study site and 2.2 ha is pretty small to be subsampled without losing a lot of my data.

Prologue

After confirming that my infections were clustered, I wanted to see if the pattern I saw in my map, was actually on the ground. I wanted to know, where are infected trees clustered with infected trees and where are uninfected trees clustered with uninfected trees? Again, this was without considering any attributes of the western hemlock trees themselves.

I used the “Optimized Hot Spot Analysis” tool in the Mapping Clusters toolbox to analyze the incidence of infection data (0 = absence, and 1 = presence). The Optimized Hot Spot Analysis tool can automatically aggregate incidence data that are normally not appropriate for hot spot analysis. It also calculates several other metrics for me that made analysis easy. I could take these automatically calculated metrics and alter them in a regular hot spot analysis if needed.

This map displays clustering that matched up closely with my intuitions from Map 1. On the left, the blue values show a cluster of uninfected trees that are closely clustered with other uninfected trees. The larger swath on the right show a cluster of trees that are closely clustered with other infected trees. In the middle a mix of uninfected trees and infected trees are mixed without displaying any significant clustering. Lastly, small clusters in the top left and bottom left of infected trees were identified. These clusters may be edge of larger clusters outside my stand, or lightly infected trees that are starting a new infection center. These results will be extremely valuable in informing my steps for Exercise 2 because I can assess the conditions of both patches and determine differences between the two. I can also determine if distance to the refugia impact the clustering of infection because it appears the infected cluster is closer to the fire refugia.

The hot spot analysis was extremely useful for analyzing and displaying the information I needed about the clustering and was very useful for building off of the Average Nearest Neighbor analysis.

My data set also included a severity rating for dwarf mistletoe infected western hemlocks in my study site. I ran a similar hot spot analysis to above to determine if there were any similarities with how severity played out in the stand compared to solely incidence data. My data ranged from 0 – 5, 0 indicating uninfected trees and 5 indicating most heavily infected. These are classified data, not continuous but still appropriate for the optimized hot spot analysis. Western hemlock dwarf mistletoe forms infection centers, starting from a residual infected western hemlock that survived some disturbance. From there the infection spreads outwards. Another facet of infection centers is that the most heavily infected trees are almost always aggregated in the center of the infection center and infection severity decreases as you move towards the outside of the infection center. This is intuitive when you think about infected trees in terms of the time they’ve been exposed to a dwarf mistletoe seed rain: the trees in the center of the infection center likely have been exposed to infectious seed the longest. These trees can be rated using a severity rating system that essentially determines the proportion of tree crown infected. This is calculated in a way that gives a rating that is easily interpretable, in this case, 0-5.

This third map tells me about how severity is aggregated in the stand. I can see that the wide swath in the middle of the stand, associated with the fire refugia, has the largest aggregation of severely infected trees. This is what I expected in the stand because the trees in the fire refugia survived the fire and provide an infectious seed source for the post-fire regeneration. Also, on the edges of this high severity cluster, are lower severity values indicating the expected pattern of infection centers are playing out. The west side of the stand shows a large clustering of low severity ratings. We can see that the high density of uninfected trees, falls into our cold spot of low or no severity. Interestingly, the hot spot of trees found previously  in the southwest corner, is actually a cluster of low severity trees. This may be a new infection center forming or an exterior edge of another infection center outside the plot.  Lastly, the two pockets of low severity on the east side of the stand are more distinct when considering their severity.

This second application of hot spot analysis tells another story about my data and how dwarf mistletoe is patterned spatially. The non-significant swath in the center of my stand using the incidence data turns out to be a significant clustering of highly infected trees among other new observations.

 

Exercise 1: The Variogram

Question:

How does the variance between the depths to the first lava flow in my filed area vary with increasing distance?

The Tool: Variogram

I used a variogram to analyze the variance within my dataset. Variograms are discrete functions calculated using to measure the average correlation between pairs of measurements at various distance (Cameron and Hunter 2002). These distances are referred to as the binned distances (Anselin, 2016). In this study, binned distances determine the distances by which the depth to first lava flow is autocorrelated (Aneslin, 2016).

variog(geodata,coords = geodata$coords,data = geodata$data,max.dist=”number”) (R documentation)

The R code above need the geodata, an array of the data you are testing, the cords, or the coordinates those data correspond too.

Brief Methodology:

I selected data based on the quality of the descriptions, in the well log and assigned each well log I have interacted with a value from 0 to 3. Data with a score 0 represents either a dry well or a destroyed well. Data that has a score of 3 is well described and denotes either the depth to first water or the standing water level. I used data with a score of 3 for this analysis. I denoted the depth to the first lava flow in each of the well logs.

Figure 1: My field area, the well log locations are the blue circles, a rough delineation of my field area is in while.

First I normalized my data, giving a mean of 0 and a standard deviation of 1. Then I determined a max distribution, which determines the max bin size the variogram uses. The max distribution determines the lag increment, or the distances between which the increment is calculated (Anselin, 2016).

The data are projected, meaning that their horizontal distance is measured in meters. However, they are recorded in decimal degrees and located at the center of the township and range section that they are located in. Rather than convert my data from decimal degrees to meters, I recognized that there are around 111 km in 1 degree (at the equator), that there are 0.6214 miles in a kilometer and that there are 6 miles in a township. This helped me determine the max distribution for the variog function in R. I decided upon a max distributions of 0.09 and 0.5 degrees.
I then used R’s variog function on the normalized data with the max distribution of 0.09 and 0.5 degrees.

Succinct Results:

Figure 2: Variogram of the MLV-LP-BVM triangle with a max distribution of 0.09 degrees. Note the low semivariance with the lower bin size (.02 to 0.08 degrees).

Figure 3: Variogram of the MLV-LP-BVM triangle with a max distribution of 0.5 degrees. Note the low saw-toothed pattern of semivariance. From 0.01 to 0.08, semivariance is low, it spikes up, and lowers again around 0.2, spikes again at 0.3 degrees and lowers again at 0.4.

Critique:

I tested max bin sizes of 0.5 and 0.09 degrees to see how the variogram changes with an increasing bin size. Changing the max bin size, called the max distribution in R, changes the variogram. The smaller bin size, 0.09 degrees, limits the max bin size to a narrow range of values. In effect, the variogram only tests the covariance of data points that are separated by a maximum .09 or degrees. Increasing the bin size increases the distance around which the covariance of the data points are tested. Thus, mad distributions of 0.5 result in a spikey plot. Normally, one would expect the variogram to plateau at larger bin sizes, representing large variance with the data with larger distances, but figure 2 does not.

At its simplest, the variogram in figures 1 implies that data points that are correlated in space are more likely to have similar depths to the first lava flow. You can see this in the low variance in you find in locations that are close to each other, the smaller distances, and the higher variance in distances that are farther from each other.

Figure 2, with the max distribution of 0.5 degrees one could made an argument for a distribution of the locations of lava flows based on the locations of volcanic centers. Medicine Lake Volcano lies in the North of the study area and Lassen Peak lies in the south. The two spikes in variance with location might be linked to the distances between those two centers. In other words, distances that correspond to low values of semivariance (>1) correspond to either regions that lie on the same lava flow, or another near surface lava flow sourced from another volcanic center in the region (figure 3). Rather than finding the same lava flow at depth, we are seeing different lava flows at similar depths.

Figure 4: My field area with the rough delineations of two of the near surface lava flows in the region. 1 degree of longitude corresponds to roughly 111 km.

Lava flows are not attracted or repulsed to or from each other, but they do follow the laws of physics. Often, Volcanoes build topography, when lava erupts from them, the lava will flow from the high elevations of the volcano, to basins, using paleo-valleys as conduits for flow. Thus, if you know the paleo topography you can understand where a lava might have flowed, and where it might emplace. Predicting paleo-topography can be difficult in old volcanic regimes, but on the geologic timescales we are looking at, I can predict that the topography of the MLV-LP-BVM triangle has not changed much over the past 5 ma.

Lavas flows from high to low topography and from Volcanos. The two main volcanos in my field area are Lassen Peak in the South and Medicine Lake Volcano, in the North. Moreover, lava flows from high to low elevation. Lavas emplace in basin, if they sit long enough, the top soils form, the basin subsides, and another the cycle repeats.

My data does have different spatial patterns at different scales. If you look at regions that are within 0.1 of a degree of distance then you would expect to see a similar depth to the lava flow. If you move to 0.5 of a degree, you see a sees-saw effect, where the depth to the lava flow moves from having lava close to the surface of deeper down. This variation stems from the proximity of the data point to the sources and the sinks of the lava flow.

I would use this technique again, it helped me think about my problem.

Sources:
Cameron, K, and P Hunter. 2002. Using Spatial Models and Kriging Techniques to Optimize Long-Term Ground-Water Monitoring Networks: A Case Study. Environmetrics 13:629-59.

Anselin, Luc. “Variogram”. Copyright 2016. Powerpoint File.

R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Exercise 1: The Spatial Patterns of Natural Resource Governance Perceptions

Question

What are the spatial patterns of natural resource governance perceptions in the Puget Sound?

Tools and Approaches

  1. Moran’s I (with correlograms) and Semivariograms in R studio
  2. Kriging and IDW in ArcGIS Pro
  3. Hotspot Analysis in ArcGIS Pro

 

Analysis Steps

  1. To compute Moran’s I, I used the “ape” library in R which has a function called Moran.I(). This function takes the variable in question (governance perceptions), and a distance matrix to compute the observed and expected values of Moran’s I, as well as the standard deviation and a p-value. For this analysis, I also subset my data to examine spatial autocorrelation by demographics including area (urban, suburban, rural), political ideology, life satisfaction, income, and cluster (created by running a cluster analysis on the seven variables which comprise the governance index).  I created correlograms for the variables that were significant (urban, conservative, and liberal) using the “ncf” library and the correlog() function. These figures give a better picture of spatial autocorrelation at various distances.  To create semivariograms, I used the “gstat” and “lattice” libraries which contain a function called variogram. This function takes the variable of interest along with latitude and longitude locations. The object created can then be plotted. For this analysis I used the same subsets as in the Moran’s I analysis.
  2. To preform interpolation on my data, I loaded my point data into ArcGIS Pro. I then used the Spatial Analysis toolbox to preform Kriging  and IDW to compare the outputs of the two techniques. I used my indexed variable of governance perceptions. The values of the variable vary from 1 to 7. I then also uploaded a shapefile bounding the sample area, as well as a shapefile of shoreline, to delineate my study area better.
  3. To run a hotspot analysis I used my previously loaded point data inArcGIS Pro. I then used the Spatial Analysis toolbox to preform ‘hotspot analysis.’ I used my indexed variable of governance perceptions with values from 1 to 7. I used the shapefile of shoreline to delineate my study area better.

Results

  1. The Moran’s I calculation was insignificant for rural, suburban, cluster groups, life satisfaction, and income, suggesting no spatial autocorrelation of governance perceptions by these subsets.

 

The Moran’s I calculation was significant for urban:

Observed value: -0.014

P-value: 0.0002

 

The Moran’s I calculation was also significant for ideology:

Conservative

Observed value: -0.006

P-value: 0.002

Liberal

Observed value: -0.002

P-value: 0.05

 

This suggests that in these subsets there is spatial autocorrelation between individual governance perceptions.

The semivariograms for the subsets that are significantly spatially autocorrelated are presented below.

 

None of these plots suggest high degrees of spatial autocorrelation. The urban plot does so more than the ideology plots, but the y axis scale is still very small.

 

 

 

 

 

 

 

 

The plot (top Urban, bottom left Liberal, bottom right Conservative) help to confirm the findings from above. The Moran’s I fluctuates around zero without much variation. The large spike in variation that the graphs do show are only for non significant points. Significant points are filled in, where non-significant points are open circles.

2. Interpolation

The kriging (bottom left) with individual points and IDW (bottom right), do not look incredibly different in terms of general trends. The kirging with shoreline (top) gives possibly the most interesting visual of spatial patterns. In general, perceptions are better (more green) in the center, where there is greater shoreline. There are also two sections that appear much more negative. To examine these locations further, I preformed a hotspot analysis.

3.  Hotspot Analysis

This image confirms the two bright red spots from the interpolation to be “cold spots” or spots that the value of perception is significantly lower  than the average perception (neutral) at a 99% confidence. The orange dots are a 95% confidence. The green corridor appears to hold in the southern part of the Sound and is confirmed at a “hotspot” or a spot that the value of perception is significantly higher than the areas surrounding it at a 99% confidence level.

The three main areas of red or orange correspond to the cities of Shelton (bottom), Port Angeles (west), and Everett with a little of south Whidbey Island (east).

  1. Critique

I believe all methods are useful, but some are redundant. I think it would probably be sufficient to do only one of each type of method—spatial autocorrelation and interpolation—but it is interesting and more convincing to see the same type of analysis done in different ways. The p-values from the Moran’s I appear to agree with the shape of the curve’s in the semivariograms, where the smaller p-values have more defined shapes. The same goes for the interpolation methods, while they are interesting to see side-by-side, they essentially tell the same story. I think in this case, the hotspot analysis shows the most interesting interpretation of the data because it indicates areas of potential concern.

A stain on the record? Have forest management practices set up PNW landscapes for a black-stain-filled future?

Describe the research question that you are exploring.

I am looking at how forest management practices influence the spread of black stain root disease (BSRD), a fungal root disease that affects Douglas-fir in the Pacific Northwest. While older trees become infected, BSRD primarily causes mortality in younger trees (< 30-35 years old). Management practices (e.g., thinning, harvest) attract insects that carry the disease and are associated with increased BSRD incidence. As forest management practices in the Pacific Northwest change to favor shorter-rotations of Douglas-fir monocultures, the distribution of Douglas-fir age classes is shifting towards younger stands and the frequency of harvest disturbance is increasing across the landscape. Though limited, our present understanding of this disease system indicates that these management trends, as well as the resulting disturbance regime and forest landscape age structure, may be creating favorable conditions for BSRD spread.

In this course, I would like to use spatial analyses to answer the question of whether forest management and the conditions that it creates act as a driver of the spread of black stain root disease. Specifically:

  • How do spatial patterns of forest management practices and the forest stand and landscape conditions that they create relate to spatial patterns of BSRD infection probabilities at the stand and landscape scale?
  • How do spatial patterns of forest management practices relate to landscape connectivity with respect to BSRD by affecting the area of susceptible forest and creating dispersal corridors and/or barriers throughout the landscape?

Example landscape with stands of three different forest management regimes (shades of green) and trees infected by black stain root disease (red). Forgive the 90s-esque graphics… NetLogo, the program I am using to develop and run my model, is powerful but old-school.

Describe the dataset you will be analyzing, including the spatial and temporal resolution and extent.

I will be analyzing the raster outputs of a spatial model that I built in the agent-based modeling program NetLogo (Wilensky 1999). The rasters contain the states of forested landscapes (managed as individual stands) at a given time during the model run. Variables include tree age, presence/absence of trees, management regime, probability of infection, infection status (infected/not infected), and cause of infection (root transmission, vector transmission).

The forested landscapes I am looking at are about 3,000 to 4,000 ha, with each pixel representing a ~1.5 m x 1.5 m area that can occupied by one tree. I run each model for a 300-year time series with 1-year intervals, though raster outputs may be produced at 10-year intervals.

Hypotheses: predict the kinds of patterns you expect to see in your data, and the processes that produce or respond to these patterns.

I hypothesize that landscapes with higher proportions of intensively managed, short-rotation stands will have higher probabilities of BSRD infection at the stand and landscape scales. In landscapes with high proportion of short-rotation stands, there will be large areas of suitable habitat for the pathogen and its vectors, frequent harvest that attracts disease vectors, and greater levels of connectivity for the spread of disease. In landscapes with a large proportion of older forests managed for conservation, I hypothesize that these forests will act as barriers to the spread of BSRD. High connectivity could be evidenced by greater landscape-scale dispersion of infections, whereas low connectivity would lead to a high degree of clustering of infections in the landscape.

I also hypothesize that intensively managed, short-rotation stands will have the highest probabilities of infection, followed by intensively managed, medium-rotation stands, and finally old-growth stands. However, I hypothesize that each stand’s probability of infection will depend not only on its own management but also on the management of neighboring stands and the broader landscape. At some threshold proportion of intensive management in the landscape, I hypothesize that there will be a shift in the scale of the drivers of infection, such that landscape-scale management patterns overtake stand-scale management as a predictor of infection probability.

Approaches: Describe the kinds of analyses you ideally would like to undertake and learn about this term, using your data.

I would like to learn about landscape connectivity analyses and spatial statistics such as clustering/dispersion as well as spatiotemporal analyses to analyze the relationships between discrete disturbance events and disease spread. I would like to learn how to separate the effects of connectivity from the effect of the area of suitable pathogen habitat. I am most interested in using R or Python to analyze my data, and I would like to move away from ESRI programs because of my interest in open-source and free tools for science and the prohibitive cost of ESRI software licenses for independent researchers and organizations with limited financial means.

Expected outcome – What do you want to produce – Maps? Statistical relationships?

My primary interest is to evaluate statistical relationships between spatial patterns of management and disease measures, but I would also like to produce maps to demonstrate model inputs and outputs (i.e., figures for my thesis).

Significance – How is your spatial problem important to science? To resource managers?

From a scientific perspective, this research aims to contribute to the body of research examining relationships between spatial patterns and ecological processes and complex behaviors in ecological systems. This research will examine how the diversity of the landscape age structure and disturbance regimes affect the susceptibility of the landscape to disease, contributing to literature relating diversity and stability in ecological systems. In addition, “neighborhood” and “spillover” effects will be tested by analyzing stand-scale infection probability with respect to the infection probability of neighboring stands and more broadly in the landscape. Analysis of threshold responses to changes in stand- and landscape-scale management patterns and shifts in the scale of disease drivers will contribute to understanding of cross-scale system interactions and emergent properties in the field of complex systems science.

From an applied perspective, the goal of this research is to inform management practices and understand the potential threat of black stain root disease in the Pacific Northwest. This will be achieved by improving understanding of the drivers of BSRD spread at multuiple scales and highlighting priority areas for future research. This project is a first step towards identifying evidence-based, landscape-scale management strategies that could be taken to mitigate BSRD disease issues. In addition, the structure of this model provides a platform for looking at multi-scale interactions between forest management and spatial spread processes. Its use is not restricted to a specific region and could be adapted for other current and emerging disease issues.

Your level of preparation – How much experience do you have with: (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) image processing, (e) other relevant software

Over the past 5 years, I have worked on and off with all the programs/platforms listed. For some, I have been formally trained, but for others, I have been largely self-taught. However, lack of continuous use has eroded my skills to some degree.

a. I have frequently used ArcInfo for making maps, visualizing data, and processing and analyzing spatial data. However, I do not have a lot of experience with spatial statistics in ArcInfo.

b. Modelbuilder/Python: Last spring, I took GEOG 562 and learned to program in Python, developing a script that used arcpy to prepare and manipulate spatial data for my final project. I felt comfortable programming in Python at that time, but I have not used Python much since the course.

c. I have frequently used R to clean and prepare data, perform simple statistical analyses (ANOVA, linear regression), and create plots. I have taken several workshops on using R for spatial analysis, but I have used rarely used the R packages I learned about outside of those workshops.

d. I have used ENVI to correct, patch, and combine satellite images, and I have performed supervised classifications to create land cover maps. I have worked primarily with LANDSAT images. I have also used CLASlite (an image processing software designed for classifying tropical forest cover).

e. Covered in part d.

References

Wilensky, U. (1999). NetLogo. http://ccl.northwestern.edu/netlogo/. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.


Adam Bouché