
Final project: Washing out the (black) stain and ringing out the details

BACKGROUND

In order to explain my project, especially my hypotheses, some background information about this disease is necessary. Black stain root disease of Douglas-fir is caused by the fungus Leptographium wageneri. It infects the roots of its Douglas-fir host, growing in the xylem and cutting the tree off from water. It spreads between adjacent trees by growing through root contacts and grafts, and over longer distances via insects (vectors) that feed and breed in roots and stumps and carry fungal spores to new hosts.

Forest management practices influence the spread of disease because of their influence on (i) the distance between trees (determined by natural or planted tree densities); (ii) adjacency of susceptible species (as in single-species Douglas-fir plantations); (iii) road, thinning, and harvest disturbance, which create suitable habitat for insect vectors (stumps, dead trees) and stress remaining live trees, attracting insect vectors; and (iv) forest age distributions, because rotation lengths determine the age structure in managed forest landscapes and younger trees (<30-40 years old) are thought to be more susceptible to infection and mortality by the disease.

RESEARCH QUESTION

How do (B) spatial patterns of forest management practices relate to (A) spatial patterns of black stain root disease (BSRD) infection probabilities at the stand and landscape scale via (C) the spatial configuration and connectivity of susceptible stands to infection?

In order to address my research questions, I built a spatial model to simulate BSRD spread in forest landscapes using the agent-based modeling software NetLogo (Wilensky 1999). I used Exercises 1-3 to focus on the spatial patterns of forest management classes. Landscapes were equivalent in terms of the proportion of each management class and number of stands, varying only in spatial pattern of management classes. In the exercises, I evaluated the relationship between management and disease by simulating disease spread in landscapes with two distinct spatial patterns of management:

  • Clustered landscape: The landscape was evenly divided into three blocks, one for each management class. Each block was evenly divided into stands.
  • Random landscape: The landscape was evenly divided into stands, and forest management classes were randomly assigned to each stand.

MY DATA

I analyzed outputs of my spatial model. The raster files contain the states of cells in forest landscapes at a given time step during one model run. States tracked include management class, stand ID number, presence/absence of trees, tree age, probability of infection, and infection status (infected/not infected). Management class and stand ID did not change during the model run. I analyzed tree states from the last step of the model run and did not analyze change over time.

Extent: ~20 hectares (much smaller than my full model runs will be)

Spatial resolution: ~1.524 x 1.524 m cells (maximum 1 tree per cell)

Three contrasting and realistic forest management classes for the Pacific Northwest were present in the landscapes analyzed:

  • Intensive – Active management: 15-foot spacing, no thinning, harvest at 37 years.
  • Extensive – Active management: 10-foot spacing, one pre-commercial and two commercial thinnings, harvest at 80 years.
  • Set-aside/old-growth (OG) – No active management: Forest with Douglas-fir in Pacific Northwest old-growth densities and age distributions and uneven spacing with no thinning or harvest.

HYPOTHESES: PREDICTIONS OF PATTERNS AND PROCESSES I LOOKED FOR

Because forest management practices influence the spread of disease as described in the “Background” section above, I hypothesized that the spatial patterns of forest management practices would influence the spatial pattern of disease. Having changed my experimental design and learned about spatial statistics and analysis methods throughout the course, I hypothesize that…

  • The “clustered” landscape will have (i) higher absolute values of infection probabilities, (ii) higher spatial autocorrelation in infection probabilities, and (iii) larger infection centers (“hotspots” of infection probabilities) than the “random” landscape, because clustering of similarly managed forest stands creates continuous, connected areas of forest managed in ways that provide suitable vector and pathogen habitat and facilitate the spread of disease (higher planting densities, lower age, frequent thinning and harvest disturbance in the intensive and extensive management). I therefore predict that:
    • Intensive and extensive stands will have the highest infection probabilities with large infection centers (“hotspots”) that extend beyond stand boundaries.
      • Spatial autocorrelation will therefore be higher and exhibit a lower rate of decrease with increasing distance because there will be larger clusters of high and low infection probabilities when stands with similar management are clustered.
    • Set-aside (old-growth, OG) stands will have the lowest infection probabilities, with small infection centers that may or may not extend beyond stand boundaries.
      • Where old-growth stands are in contact with intensive or extensive stands, neighborhood effects (and edge effects) will increase infection probabilities in those OG stands.
  • In contrast, the “random” landscape will have (i) lower absolute values of infection probabilities, (ii) less spatial autocorrelation in infection probabilities, and (iii) smaller infection centers than the “clustered” landscape. This is because the random landscape will have less continuity and connectivity between similarly managed forest stands; stands with management that facilitates disease spread will be less connected, and stands with management that does not facilitate the spread of disease will also be less connected. I would predict that:
    • Intensive and extensive stands will still have the highest infection probabilities, but the spread of infection will be limited at the boundaries with low-susceptibility old-growth stands.
      • Because of the boundaries created by the spatial arrangement of low-susceptibility old-growth stands, clusters of similar infection probabilities will be smaller and values of spatial autocorrelation will be lower and decrease more rapidly with increasing lag distance. At the same time, old-growth stands may have higher infection probabilities in the random landscape than in the clustered landscape because they would be more likely to be in contact with high-susceptibility intensive and extensive stands.
    • I also hypothesize that each stand’s neighborhood and spatial position relative to stands of similar or different management will influence that stand’s infection probabilities because of the difference in spread rates between management classes and the level of connectivity to high- and low-susceptibility stands based on the spatial distribution of management classes.
      • Stands with a large proportion of high-susceptibility neighboring stands (e.g., extensive or intensive management) will have higher infection probabilities than similarly managed stands with a small proportion of high-susceptibility neighboring stands.
      • High infection probabilities will be concentrated in intensive and extensive stands that have high levels of connectivity within their management class networks because high connectivity will allow for the rapid spread of the disease to those stands. In other words, the more connected a stand is to high-susceptibility stands, the higher its probability of infection.

APPROACHES: ANALYSIS APPROACHES I USED

Ex. 1: Correlogram, Global Moran’s I statistic

In order to test whether similar infection probability values were spatially clustered, I used the raster package in R (Hijmans 2019) to calculate the global Moran’s I statistic at multiple lag distances for both of the landscape patterns. I then plotted global Moran’s I vs. distance to create a correlogram and compared my results between landscapes.
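
In rough terms, the calculation looked something like the sketch below (a minimal example rather than my exact script; the raster name pi.rast and the range of lag distances are hypothetical placeholders):

library(raster)
# pi.rast: RasterLayer of per-tree infection probabilities (non-tree cells set to NA)
lags <- 1:50                                      # lag distances in cell units
moran.i <- numeric(length(lags))
for (i in seq_along(lags)) {
  d <- lags[i]
  size <- 2 * d + 1                               # Moran() needs an odd-dimensioned weights matrix
  w <- matrix(1, nrow = size, ncol = size)        # every neighbor within lag d gets weight 1
  w[d + 1, d + 1] <- 0                            # the focal cell itself is excluded
  moran.i[i] <- Moran(pi.rast, w = w)             # global Moran's I at this lag
}
plot(lags, moran.i, type = "b", xlab = "Lag distance (cells)", ylab = "Global Moran's I")
abline(h = 0, lty = 2)                            # reference line: no spatial autocorrelation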

Ex. 2: Hotspot analyses (ArcMap), Neighborhood analyses (ArcMap)

First, I performed a non-spatial analysis comparing infection probabilities between (i) landscape patterns (ii) management classes, and (iii) management classes in each of the landscapes. Then, I used the Hotspot Analysis (Getis-Ord Gi*) tool in ArcMap to identify statistically significant hot- and cold-spots of high and low infection probabilities, respectively. I selected points within hot and cold spots and used the Multiple Ring Buffer tool in ArcMap to create distance rings, which I intersected with the management classes to perform a neighborhood analysis. This neighborhood analysis revealed how the proportion of each management class changed with increasing distance from hotspots in order to test whether the management “neighborhood” of trees influenced their probability of infection.
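
Although I did this step with the Multiple Ring Buffer and Intersect tools in ArcMap, the same neighborhood summary could be scripted. Below is a rough sketch with the sf package in R (hot.pt, mgmt, the 'class' column, and the ring radii are hypothetical names and values, and a projected, meter-based coordinate system is assumed):

library(sf)
# hot.pt: one hotspot point; mgmt: stand polygons with a management 'class' attribute
radii <- c(50, 100, 150, 200)                               # ring radii in meters
buffers <- lapply(radii, function(r) st_buffer(hot.pt, dist = r))
rings <- buffers
for (i in 2:length(buffers)) {
  rings[[i]] <- st_difference(buffers[[i]], buffers[[i - 1]])   # convert nested buffers into rings
}
for (i in seq_along(rings)) {
  x <- st_intersection(mgmt, rings[[i]])                    # clip management classes to ring i
  a <- tapply(as.numeric(st_area(x)), x$class, sum)
  print(round(a / sum(a), 2))                               # proportion of each class in ring i
}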

Ex. 3: Network and landscape connectivity analyses (Conefor)

I divided my landscape into three separate stand networks based on their management class. Then, I used the free landscape connectivity software Conefor (Saura and Torné 2009) to analyze the connectivity of each stand based on its position within and role in connecting the network using the Integrative Index of Connectivity (Saura and Rubio 2010). I then assessed the relationship between the connectivity of each stand and infection probabilities of trees within that stand using various summary statistics (e.g., mean, median) to test whether connectivity was related to infection probability.
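
The last step, relating stand connectivity to stand-level infection, could be summarized with something like the following R sketch (the raster, file, and column names are hypothetical; dIIC is one of the per-node importance values Conefor reports):

library(raster)
# pi.rast: infection-probability raster; stands: raster of stand ID numbers (both from the model output)
stand.pi <- as.data.frame(zonal(pi.rast, stands, fun = "mean"))   # mean infection probability per stand
names(stand.pi) <- c("stand_id", "mean_pi")
conefor <- read.table("node_importances.txt", header = TRUE)      # hypothetical Conefor output: Node, dIIC
stand.dat <- merge(stand.pi, conefor, by.x = "stand_id", by.y = "Node")
plot(stand.dat$dIIC, stand.dat$mean_pi,
     xlab = "Stand connectivity (dIIC)", ylab = "Mean infection probability")
cor.test(stand.dat$dIIC, stand.dat$mean_pi, method = "spearman")  # simple monotonic association test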

RESULTS: WHAT DID I PRODUCE?

As my model had not been parameterized by the beginning of this term, I analyzed “dummy” data, where infection spread probabilities were calculated as a decreasing linear function of distance from infected trees. However, the results I produced still provided insights as to the general functioning of the model and factors that will likely influence my results in the full, parameterized model.

I produced both maps and numerical/statistical relationships that describe the patterns of “A” (infection probabilities), the relationship between “A” and “B” (forest management classes), and how/whether “A” and “B” are related via “C” (landscape connectivity and stand networks).

In Exercise 1, I found evidence to support my hypothesis of spatial autocorrelation at small scales in both landscapes and higher autocorrelation and slower decay with distance in the clustered landscape than the random landscape. This was expected because the design of the model calculated probability of infection for each tree as a function of distance from infected trees.

In Exercises 2 and 3, I found little to no evidence to support the hypothesis that either connectivity or neighboring stand management had significant influence on infection probabilities. Because the model that produced the “dummy” data limited infection to ~35 meters from infected trees and harvest and thinning attraction had not been integrated into infection calculations, this result was not surprising. In my full model where spread via insect vectors could span >1,000 m, I expect to see a larger influence of connectivity and neighborhood on infection probabilities.

A critical component of model testing is exploring the “parameter space”, including a range of possible values for each parameter. This is especially important for agent-based models, where complex interactions between many individuals produce larger-scale patterns that may be emergent and not fully predictable from the simple sum of the parts. In my model, the disease parameters of interest are the factors influencing probability of infection (Fig. 1). In order to understand how the model reacts to changes in those parameters, I will perform a sensitivity analysis, systematically adjusting parameter values one by one and comparing the results of the model runs under each set of parameter values.

Fig.1. Two of the model parameters that will be systematically adjusted during sensitivity analysis. Tree susceptibility to infection as a function of age (left) and probability of root contact as a function of distance (right) will both likely influence model behavior and the relative levels of infection probability between the three management classes.
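
As a rough illustration of the one-at-a-time design I have in mind, the sketch below builds a table of parameter sets in R, holding all but one parameter at its baseline value (the parameter names, baseline values, and test ranges are hypothetical placeholders; each row would then be run repeatedly in NetLogo):

# Hypothetical baseline values and test ranges for two parameters like those in Fig. 1
baseline <- list(root_contact_dist = 3, age_susceptibility_midpoint = 30)
ranges <- list(root_contact_dist = c(1, 2, 3, 4, 5),
               age_susceptibility_midpoint = c(20, 25, 30, 35, 40))
# One-at-a-time design: vary each parameter while holding the others at baseline
design <- do.call(rbind, lapply(names(ranges), function(p) {
  d <- as.data.frame(baseline)
  d <- d[rep(1, length(ranges[[p]])), ]
  d[[p]] <- ranges[[p]]
  d$varied <- p
  d
}))
design   # each row is one parameter set to simulate (with replicate runs) in NetLogo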

This is especially relevant given that in Exercises 1 through 3, I found that the extensively managed plantations had the highest values of infection probability and most of the infection hotspots, likely because this management class has the highest initial density of trees. For the complete model, I hypothesize that the intensive plantations will have the highest infection probabilities because of the high frequency of insect-attracting harvest and short rotations that maintain the trees in an age class highly susceptible to infection. In the full model, the extensive plantations will have higher initial density than the intensive plantations but will undergo multiple thinnings, decreasing tree density but attracting vectors, and will be harvested at age 80, thus allowing trees to grow into a less susceptible age class. In this final model, thinning, harvest, and vector attraction will factor into the calculation of infection probabilities.

My analysis made it clear that even a 1.5-meter difference in spacing resulted in a statistically significant difference in disease transmission, with much higher disease spread in the denser forest. Because the model is highly sensitive to tree spacing, likely because the distance-related parameters of my model drop off in sigmoidal or exponential decay patterns, I hypothesize that changes in the values of parameters that influence the spatial spread of disease (i.e., insect dispersal distance, probability of root contact with distance) and the magnitude of vector attraction after harvest and thinning will determine whether the “extensive” or “intensive” forest management class ultimately has the highest levels of infection probability. In addition, the rate of decay of root contact and insect dispersal probabilities will determine whether management and infection within stands influence infection in neighboring stands, as well as the distance and strength of those neighborhood effects. I would like to test this by performing such analyses on the outputs from my sensitivity analyses.

SIGNIFICANCE: WHAT DID I LEARN FROM MY RESULTS? HOW ARE THESE RESULTS IMPORTANT TO SCIENCE? TO RESOURCE MANAGERS?

Ultimately, the significance of this research is to understand the potential threat of black stain root disease in the Pacific Northwest and to inform management practices by identifying evidence-based, landscape-scale management strategies that could mitigate BSRD issues. While the results of Exercises 1-3 were interesting, they were produced using a model that had not been fully parameterized and thus are not representative of the likely actual model outcomes. Therefore, I was not able to test my hypotheses. That said, this course allowed me to design and develop an analysis to test my hypotheses. The exercises I completed have also provided a deeper understanding of how my model works. Through this process, I have begun to generate additional testable hypotheses regarding model sensitivity to parameters and the relative spread rates of infection in each of the forest management classes. Another key takeaway is the importance of producing many runs with the same landscape configuration and parameter settings to account for stochastic processes in the model. If only one run is analyzed for each scenario, there is a chance that the results are not representative of the average behavior of the system or the full range of behaviors possible for that scenario. For example, with the random landscape configuration, one generated landscape can be highly connected and the next highly fragmented with respect to intensive plantations, and only a series of runs under the same conditions would provide reliable results for interpretation.

WHAT I LEARNED ABOUT… SOFTWARE

(a, b) Arc-Info, Modelbuilder and/or GIS programming in Python

This was my first opportunity to perform statistical analysis in ArcGIS, and I used multiple new tools, including the hotspot analysis and multiple ring buffer tools, as well as extensions. Though I did not use Python or Modelbuilder, I realized that doing so will be critical for automating my analyses given the large number of model runs I will be analyzing. While I learned how to program in Python using arcpy in GEOG 562, I used this course to choose the appropriate tools and analyses for my questions and hypotheses rather than automating procedures I may not use again. I would now like to implement my procedures for neighborhood analysis in Python in order to automate and increase the efficiency of my workflow.

(c) Spatial analysis in R

During this course, I learned most about spatial data manipulation in R, since I had limited experience using R with spatial data beforehand. I used R for spatial statistics, data cleaning and management, and conversion between vector and raster data. I also learned about the limitations of R (and my personal limitations) in terms of the challenge of learning how to use packages and their functions when documentation is variable in quality and a wide variety of user-generated packages are available with little reference as to their quality and reliability. For example, for Exercise 2, I had trouble finding an up-to-date and high-quality package for hotspot analysis in R, with raster data or otherwise. However, this method was straightforward in ArcMap once the data were converted from raster to points. For Exercise 1, the only Moran’s I calculation that I was able to perform with my raster data was the “moran” function in the raster package, which does not provide z- or p-values to evaluate the statistical significance of the calculated Moran’s I and requires you to generate your own weights matrices, which is a pain. Using the spdep or ncf packages for this analysis was incredibly slow (though I am not sure why), and the learning curve for spatstat was too steep for me to overcome by the Exercise 1 deadline (but I hope to return to this package in the future).

Reading, manipulating, and converting data: With some trial and error and research into the packages available for working with spatial data in R (especially raster, sp/spdep, and sf), I learned how to quickly and easily convert data between raster and shapefile formats, which was very useful in automating the cleaning and preparation for multiple datasets and creating the inputs for the analyses I want to perform.
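
For example, a typical round trip between raster and vector formats looked roughly like the sketch below (the file names are placeholders, not my actual data):

library(raster)
r <- raster("infection_probability.tif")                    # read a raster
pts <- rasterToPoints(r, spatial = TRUE)                    # raster cells -> points
polys <- rasterToPolygons(r)                                # raster cells -> polygons
shapefile(pts, "infection_points.shp", overwrite = TRUE)    # write the points to a shapefile
# And back again: burn a polygon attribute onto a grid matching the original raster
template <- raster(r)                                       # empty raster with r's extent, resolution, and CRS
r2 <- rasterize(polys, template, field = names(polys)[1])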

(d) Landscape connectivity analyses: I learned that there are a wide variety of metrics available through Fragstats (and the landscapemetrics and landscapetools packages in R); however, I was not able to perform my desired stand-scale analysis of connectivity because I could not determine whether it is possible to analyze contiguous stands with the same management class as separate patches (Fragstats considered all contiguous cells in the raster with the same class to be part of the same patch). Instead, I used Conefor, which has an ArcMap extension that allows you to generate node and connection files from a polygon shapefile, to calculate relatively few but robust and ecologically meaningful connectivity metrics for the stands in my landscape.

WHAT I LEARNED ABOUT… SPATIAL STATISTICS

Correlograms and Moran’s I: For this statistical method, I learned the importance of choosing meaningful lag distances based on the data being analyzed and the process being examined. For example, my correlogram contains a lot of “noise,” with many peaks and troughs due to the empty cells between trees, but I also captured data at the relevant distances. Failure to choose appropriate lag distances means that some autocorrelation could be missed, but analyzing large rasters at a high resolution of lag distances results in very slow processing. In addition, I wanted to compare local vs. global Moran’s I to determine whether infections were confined to certain portions of the landscape or spread throughout the entire landscape, but the function for local Moran’s I returned values far outside the -1 to 1 range of the global Moran’s I. As a result, I did not understand how to interpret or compare these values. In addition, global Moran’s I did not tell me where spatial autocorrelation was happening, but the fact that there was spatial autocorrelation led me to perform a…

Hotspot analysis (Getis-Ord Gi*): It became clear that a deep conceptual understanding of hypothesized spatial relationships and processes in the data, along with a clear hypothesis, is critical for hotspot analysis. I performed multiple analyses with different distance weightings to compare the results, and there was large variation in both the number of points included in hot and cold spots and the landscape area covered by those spots between the different weighting and distance methods. I ended up choosing inverse squared distance weighting based on my understanding of root transmission and vector dispersal probabilities and because this weighting method was the most conservative (it produced the smallest hotspots). The confidence level chosen also resulted in large variation in the size of hotspots. After confirming that there was spatial autocorrelation in infection probabilities, using this method helped me understand where in the landscape these patterns were occurring and thus how they related to management practices.

Neighborhood analysis: I did not find that this method provided much insight in my case, not because of the method itself but because of my data (it just confirmed the landscape pattern that I had designed, clustered vs. random) and my approach (one hotspot and one coldspot point non-randomly selected in each landscape). I also found this method to be tedious in ArcMap, though I would like to automate it, and I later learned about the zonal statistics tool, which can help make this analysis more efficient. In general, it is not clear what statistics I could have used to confirm whether results were significantly different between landscapes, but perhaps this is an issue caused by my own ignorance.

Network/landscape connectivity analyses: I learned that there are a wide variety of tools, programs, and metrics available for these types of analyses. I found the Integrative Index of Connectivity (implemented in Conefor) particularly interesting because of the way it categorizes habitat patches based on multiple attributes in addition to their spatial and topological positions in the landscape. The documentation for this metric is thorough, its ecological significance has been supported in peer-reviewed publications (Saura and Rubio 2010), and it is relatively easy to interpret. In contrast, I found the number of metrics available in Fragstats to be overwhelming especially during the data exploration phase.

REFERENCES

Hijmans, R. J. 2019. raster: Geographic Data Analysis and Modeling. R package version 2.8-19. https://CRAN.R-project.org/package=raster

Saura, S. & J. Torné. 2009. Conefor Sensinode 2.2: a software package for quantifying the importance of habitat patches for landscape connectivity. Environmental Modelling & Software 24: 135-139.

Saura, S. & L. Rubio. 2010. A common currency for the different ways in which patches and links can contribute to habitat availability and connectivity in the landscape. Ecography 33: 523-537.

Wilensky, U. 1999. NetLogo. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL. http://ccl.northwestern.edu/netlogo/

Final Project: San Diego Bottlenose Dolphin Sighting Distributions

The Research Question:

Originally, I asked the question: do common bottlenose dolphin sighting distances from shore change over time?

However, throughout the research and analysis process, I refined this question for a multitude of reasons. For example, I had planned on using all of my dolphin sightings from my six different survey locations along the California coastline. Because the bulk of my sightings are from the San Diego survey site, I chose this data set for completeness and feasibility. Additionally, this data set used the most standard survey methods. Rather than simply looking at distance from shore, which would be at a very fine scale seeing as all of my sightings are within two kilometers of shore, I chose to try to identify changes in latitude. Furthermore, I wanted to see if changes in latitude (if present) were somehow related to El Niño Southern Oscillation (ENSO) cycles and then to distances to lagoons. This data set also has the largest span of sightings by both year and month. When you see my hypotheses, you will notice that my original research question morphed into much more specific hypotheses.

Data Description:

My dolphin sighting data span 1981-2015 with a few absent years; sightings cover all months, though not in every year sampled. The same transects were performed in a small boat with approximately a two-kilometer sighting span (one kilometer surveyed 90 degrees to starboard and port of the bow). These data points therefore have a resolution of approximately two kilometers. Much of the other data has a coarser spatial resolution, which is why it was important to use such a robust data set. The ENSO data I used took a broad-brush approach to ENSO indices. Rather than using the exact ENSO index, which is at a fine scale, I used the NOAA database that splits month-years into positive, neutral, and negative indices (1, 0, and -1, respectively). These data were at a month-year temporal resolution, which I matched to the month and year of each sighting. Lagoon data were sourced from the mid-to-late 2000s; therefore, I treated lagoon distances as static.
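
In practice, attaching the monthly ENSO category to each sighting is a simple join on year and month. A minimal sketch in R (the file and column names are hypothetical):

sightings <- read.csv("san_diego_sightings.csv")            # columns assumed: year, month, lat, lon
enso <- read.csv("monthly_enso_categories.csv")             # columns assumed: year, month, enso_index (-1, 0, 1)
# Attach each sighting's ENSO category by matching on year and month
sightings <- merge(sightings, enso, by = c("year", "month"), all.x = TRUE)
table(sightings$enso_index, useNA = "ifany")                # how many sightings fall in each ENSO category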

Hypotheses:

H1: I predicted that bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) along the San Diego transect throughout the years 1981-2015 would exhibit clustered distribution patterns as a result of the patchy distributions of both the species’ preferred habitats and prey, as well as the social nature of this species.

H2: I predicted there would be higher densities of bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) at higher latitudes spanning 1981-2015 due to prey distributions shifting northward and fewer human activities in the northern sections of the transect. I predicted that during warm (positive) ENSO months, dolphin sightings in San Diego would be distributed more northerly, predominantly because prey aggregations have historically shifted northward into cooler waters as sea surface temperatures increase. I expect the spatial gradient to shift north and south in relation to the ENSO gradient (warm, neutral, or cold).

H3: I predicted that along the San Diego coastline, bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) would be clustered around the six major lagoons within about two kilometers, with no specific preference for any lagoon, because the murky, nutrient-rich waters in the estuarine environments are ideal for prey protection and known for their higher densities of schooling fishes.

Map with bottlenose dolphin sightings on the one-kilometer buffered transect line and the six major lagoons in San Diego.

Approaches:

I utilized multiple approaches with different software platforms including ArcMap, qGIS, GoogleEarth, and R Studio (with some Excel data cleaning).

  • Buffers in ArcMap
  • Calculations in an attribute table
  • ANOVA with Tukey HSD
  • Nearest Neighbor averages
  • Cluster analyses
  • Histograms and Bar plots

Results: 

I produced a few maps (with more to come) and found statistical relationships between sighting distribution patterns, ENSO indices and dolphin sighting latitudes, and distances to lagoons.

H1: I predicted that bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) along the San Diego transect throughout the years 1981-2015 would exhibit clustered distribution patterns as a result of the patchy distributions of both the species’ preferred habitats and prey, as well as the social nature of this species.

True: The results of the average nearest neighbor spatial analysis in ArcMap 10.6 produced a z-score of -127.16 with a p-value of < 0.000001, which translates into there being less than a 1% likelihood that this clustered pattern could be the result of random chance. Although I could not look directly at prey distributions because of data availability, it is well-known that schooling fishes exist in clustered distributions that could be related to these dolphin sightings also being clustered. In addition, bottlenose dolphins are highly social and although pods change in composition of individuals, the dolphins do usually transit, feed, and socialize in small groups. Also see Exercise 2 for other, relevant preliminary results, including a histogram of the distribution in differences of sighting latitudes.

Summary from the Average Nearest Neighbor calculation in ArcMap 10.6 displaying that bottlenose dolphin sightings in San Diego are highly clustered.

H2: I predicted there would be higher densities of bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) at higher latitudes spanning 1981-2015 due to prey distributions shifting northward and fewer human activities in the northern sections of the transect. With this, I predicted that during warm (positive) ENSO months, dolphin sightings in San Diego would be distributed more northerly, predominantly because prey aggregations have historically shifted northward into cooler waters as sea surface temperatures increase. I expect the spatial gradient to shift north and south in relation to the ENSO gradient (warm, neutral, or cold).

False: the sightings are more clumped towards the lower latitudes overall (p < 2e-16), possibly due to habitat preference. The sightings are closer to beaches with higher human densities and human-related activities near Mission Bay, CA. It should be noted that just north of the San Diego transect is the Camp Pendleton Marine Base, which conducts frequent military exercises and could deter animals.

I used an ANOVA analysis and found there was a significant difference in sighting latitude distributions between monthly ENSO indices. A Tukey HSD was performed to determine where the differences between treatment(s) were significant. All differences (neutral and negative, positive and negative, and positive and neutral ENSO indices) were significant with p < 0.005.
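
In R, the test looked something like the sketch below (continuing the hypothetical sightings data frame from above, with a latitude column lat and the ENSO category column enso_index; this is not my exact script):

sightings$enso_index <- factor(sightings$enso_index)        # treat ENSO category as a factor
fit <- aov(lat ~ enso_index, data = sightings)              # one-way ANOVA of sighting latitude by ENSO category
summary(fit)                                                # overall F-test
TukeyHSD(fit)                                               # pairwise comparisons between ENSO categories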

H3: I predicted that along the San Diego coastline, bottlenose dolphin sightings at the pod-scale (usually, one to ten individuals) would be clustered around the six major lagoons within about two kilometers, with no specific preference for any lagoon, because the murky, nutrient-rich waters in the estuarine environments are ideal for prey protection and known for their higher densities of schooling fishes. See my Exercise 3 results.

Using a histogram, I was able to visualize how sighting distances to the nearest lagoon differed by lagoon. That is, dolphin sightings nearest to Lagoon 6, the San Dieguito Lagoon, are always within 0.03 decimal degrees, whereas sightings nearest to Lagoon 5, Los Penasquitos Lagoon, are spread across a range of distances, with most sightings at greater distances.

Bar plot displaying the different distances from dolphin sighting location to the nearest lagoon in San Diego in decimal degrees. Note: Lagoon 4 is south of the study site and therefore was never the nearest lagoon.

After running an ANOVA in R Studio, I found that there was a significant difference between distance to the nearest lagoon in different ENSO index categories (p < 2.55e-9), with a Tukey HSD confirming that the difference in distance to the nearest lagoon was significant between neutral and negative months and between positive and neutral months. Therefore, I gather there must be something happening in neutral months that changes the distance to the nearest lagoon; potentially prey are more static or more dynamic in those months compared to the positive and negative months. Using a violin plot, it appears that Lagoon 5, Los Penasquitos Lagoon, has the widest span of sighting distances when it is the nearest lagoon in all ENSO index months. In neutral months, Lagoon 0, the Buena Vista Lagoon, has more than a single sighting (there were none in negative months and only one in positive months). The Buena Vista Lagoon is the most northerly lagoon, which may indicate that in neutral ENSO months, dolphin pods are more northerly in their distribution.
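
The nearest-lagoon distances and the violin plot can be produced with steps roughly like the following sf/ggplot2 sketch (the layer, file, and column names are hypothetical, and distances come out in the units of the layers' coordinate reference system):

library(sf)
library(ggplot2)
sight.pts <- st_read("sightings.shp")                       # sighting points (enso_index column assumed)
lagoons <- st_read("lagoons.shp")                           # the six lagoons, with a 'name' column assumed
d <- st_distance(sight.pts, lagoons)                        # distance matrix: sightings x lagoons
nearest <- apply(d, 1, which.min)                           # index of the nearest lagoon for each sighting
sight.pts$nearest_lagoon <- lagoons$name[nearest]
sight.pts$nearest_dist <- as.numeric(apply(d, 1, min))      # distance to that nearest lagoon
ggplot(sight.pts, aes(x = nearest_lagoon, y = nearest_dist, fill = factor(enso_index))) +
  geom_violin() +
  labs(x = "Nearest lagoon", y = "Distance to nearest lagoon", fill = "ENSO index")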

Takeaways to science and management: 

Bottlenose dolphins have a clustered distribution that seems to be related to monthly ENSO indices, with certain months showing more of a difference in distribution, and likely to their sociality at a larger scale. Neutral ENSO months seem to have a different characteristic that impacts sighting distribution locations along the San Diego coastline. More research needs to be done to determine what is different about neutral months and how this may impact this dolphin population. On a finer scale, the six lagoons in San Diego appear to have a spatial relationship with dolphin pod sighting distributions. These lagoons may provide critical habitat for bottlenose dolphins’ preferred prey species or preferred habitat for the dolphins themselves, either for cover or for hunting, and different lagoons may have different spans of impact at different distances, either by creating larger nutrient plumes or because of static geographic and geologic features. This could mean that specific areas should receive more protection or maintain existing protection. For example, the Batiquitos and San Dieguito Lagoons have some Marine Conservation Areas with No-Take Zones. It is interesting to see the relationship to different lagoons, which may provide nutrient outflows and protection for key bottlenose dolphin prey species. The city of San Diego and the state of California need ways to assess the coastlines and how protecting the marine, estuarine, and terrestrial environments near and encompassing the coastlines impacts the greater ecosystem. Other than the Marine Mammal Protection Act and small protected zones, there are no safeguards for these dolphins.

My Learning: about software (a) Arc-Info and b) R

  1. a) Arc-Info: buffer creation, creating graphs, and nearest neighbor analyses; how to deal with transects, data with mismatching information, and conflicting shapefiles.
  2. b) R: I didn’t know much beyond the basics in R. I learned how to conduct ANOVAs and how to interpret the results. Mainly, I learned how to visualize my results and use new packages.

My Learning: about statistics

Throughout this project I learned that spatial statistics requires clear hypothesis testing in order to step cleanly through a spatial process. Most specifically, I learned about spatial analyses in ArcMap and how I could utilize nearest neighbor calculations to assess distribution patterns. Furthermore, I now have a better understanding of spatial distribution patterns and how they are assessed, such as clustered versus random versus evenly dispersed distributions. For data analysis and cleaning, I also learned how to apply my novice understanding of ANOVAs and then display results relating to spatial relationships (distances) using histograms and other graphical displays in R Studio.

________________________________________________________________________

Contact information: this post was written by Alexa Kownacki, Wildlife Science Ph.D. Student at Oregon State University. Twitter: @lexaKownacki

Exercise 1: Preparing for Point Pattern Analysis

Exercise 1

The Question in Context

In order to answer my question (are the dolphin sighting data points clustered along the transect surveys, or do they have an even distribution pattern?), I need to use point pattern analysis. I am trying to visualize where in space dolphins were sighted along the coast of California, specifically from my San Diego sighting area. In this exercise, the variable of interest is dolphin sightings. These are x,y coordinates (point data) indicating the presence of common bottlenose dolphins along a transect. However, these transect data were not recorded, and I needed to recreate these lines to the best of my ability. This process has been more challenging than anticipated, but it will prove useful in the short term for this class and project, and in the long term for management ramifications.

The Tools

As part of this exercise, I used ArcMap 10.6, GoogleEarth, qGIS, and Excel. Although I intended only to import my Excel data, saved as a .csv file, into ArcMap, that did not work, so other tools were necessary. The final goal of this exercise was to complete point-pattern analyses comparing distance along recreated transects to sightings. From there, the sightings would be broken down by year, season, or environmental factor (El Niño versus La Niña years) to look for distribution patterns, specifically whether the points were ever clustered or evenly distributed at different points in time.

Steps/Outputs/Review of Methods and Analysis

My first step was to clean up my sightings data enough that it could be exported as a .csv and imported as x-y data into ArcMap. However, ArcMap, no matter the transformation equation, did not seem to recognize the projected or geographic coordinate systems. After many attempts, where my data ended up along the east coast of Africa or in the Gulf of Mexico, I tried a workaround: I imported the .csv file into qGIS with the help of a classmate, and then exported that file as a shape file. Then, I was able to import that shape file into ArcMap and select the correct geographic and projected coordinate systems. The points finally appeared off the coast of California.

I then found a shape file of North America with a more accurate coastline, to add to the base map. This step will be important later when I add in track lines, and how the distributions of points along these track lines are related to bathymetry. The bathymetric lines will need to be rasterized and later interpolated.

The next step was the track line recreation. I chose to focus on the San Diego study site. This site has the most data and the most consistently and standardly collected data. The surveys always left the same port in Mission Bay, San Diego, CA, traveled north at 5-10 km/hr to a specific beach (landmark), and then turned around. It is noted in the sighting data whether the track line was surveyed in both directions (South to North and North to South) or unidirectionally (South to North). Because some data were collected prior to the invention and commercial availability of GPS, I had to recreate these track lines. I started trying to use ArcMap to draw the lines but had difficulty. Luckily, after many attempts, it was suggested that I use Google Earth. Here I found a tool to create a survey line where I can mark the edges along the coastline at an approximate distance from shore, and then export that file. It took a while to realize that the file needed to be exported as a .kml and not a .kmz.

Once exported as a .kml, I was able to convert the .kml file to a layer file and then to a shape file in ArcMap. The next step is somehow getting all points within one kilometer of the track line (my spatial scale for this part of the project) to associate with that track line. One idea was snapping the points to the line. However, this did not work. I am still stuck here: this is the major step before I can associate my point data with the line and then begin a point pattern analysis in ArcMap and/or R Studio.
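
One route I may try instead of snapping is to do the association in R with the sf package: buffer the recreated track line by one kilometer, keep the sightings that fall inside that corridor, and record each point's distance to the line. A rough sketch (the file names and the choice of UTM zone are assumptions):

library(sf)
track <- st_read("san_diego_track.shp")                     # recreated track line
sightings <- st_read("sightings.shp")                       # sighting points
# Work in a projected CRS so distances are in meters (UTM zone 11N covers San Diego)
track <- st_transform(track, 32611)
sightings <- st_transform(sightings, 32611)
track.geom <- st_union(track)                               # merge any line segments into one geometry
corridor <- st_buffer(track.geom, dist = 1000)              # 1-km corridor around the track line
keep <- lengths(st_intersects(sightings, corridor)) > 0     # sightings falling inside the corridor
on.track <- sightings[keep, ]
on.track$dist_to_line <- as.numeric(st_distance(on.track, track.geom))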

Results

Although I do not yet have full results for this exercise, I can say for certain that it has not been for lack of trying, nor am I stopping. I have been brainstorming and drawing on resources from classmates and teaching assistants about how to associate the sighting data points with the track line so that I can then do this cluster analysis. Hopefully, the result can then be exported to R Studio, where I can see distributions along the transect. I may be able to do a density-based analysis, which would show whether different sections along the transect, which I would need to designate and potentially rasterize first, have different densities of points. I would expect the sections to differ seasonally.

Critiques

Although I add in my opinions on usefulness and ease above, I do believe this will be very helpful in analyzing distribution patterns. Right now, it is largely unknown if there are differences in distribution patterns for this population because they move rapidly and at great distances. But, by investigating data from only the San Diego site, I can determine if there are differences in distributions along the transects temporally and spatially. In addition, the total counts of sightings in each location per unit effort will be useful to see the influx to that entire survey area over time.


Contact information: this post was written by Alexa Kownacki, Wildlife Science Ph.D. Student at Oregon State University. Twitter: @lexaKownacki

Ex 1: Mapping the stain: Using spatial autocorrelation to look at clustering of infection probabilities for black stain root disease

My questions:

I am using a simulation model to analyze spatial patterns of black stain root disease of Douglas-fir at the individual tree, stand, and landscape scales. For exercise 1, I focused on the spatial pattern of probability of infection, asking:

  • What is the spatial pattern of probability of infection for black stain root disease in the forest landscape?
  • How does this spatial pattern differ between landscapes where stands are clustered by management class and landscapes where management classes are randomly distributed?

Fig 1. Left: Raster of the clustered landscape, where stands are spatially grouped by each of the three forest management classes. Each management class has a different tree density, making the different classes clearly visible as three wedges in the landscape. Right: Raster of the landscape where management classes are randomly assigned to stands with no predetermined spatial clustering. The color of each cell represents the value for infection probability of that cell. White cells in both landscapes are non-tree areas with NA values.

Tool or approach that I used: Spatial autocorrelation analysis, Moran’s I, correlogram (R)

My model calculates probability of infection for each tree based on a variety of tree characteristics, including proximity to infected trees, so I expected to see spatial autocorrelation (when a variable is related to itself in space) with the clustering of high and low values of probability of infection. Because some management practices (i.e., high planting density, clear-cut harvest, thinning, shorter rotation length) have been shown to promote the spread of infection, there is reason to hypothesize that more intensive management strategies – and their spatial patterns in the landscape – may affect the spread of black stain at multiple scales.

I am interested in hotspot analysis to later analyze how the spatial pattern of infection hotspots map against different forest management approaches and forest ownerships. However, as a first step, I needed to show that there is some clustering in infection probabilities (spatial autocorrelation) in my data. I used the “Moran” function in the “raster” package (Hijmans 2019) in R to calculate the global Moran’s I statistic. The Moran’s I statistic ranges from -1 (perfect dispersion, e.g., a checkerboard) to +1 (perfect clustering), with a value of 0 indicating perfect randomness.

Example patterns: Moran’s I = -1 (perfectly dispersed, e.g., a checkerboard), Moran’s I = 0 (random), and Moran’s I = 1 (perfectly clustered).

I calculated this statistic at multiple lag distances, h, to generate a graph of the values of the Moran’s I statistic across various values of h. You can think of the lag distance as the size of the window of neighbors being considered for each cell in a raster grid. The graph produced by plotting the calculated value of Moran’s I across various lag values is called a “correlogram.”

What did I actually do? A brief description of steps I followed to complete the analysis

1. Imported my raster files, corrected the spatial scale, and re-projected the rasters to fall somewhere over western Oregon.

I am playing with hypothetical landscapes (with the characteristics of real-world landscapes), so the spatial scale (extent, resolution) is relevant but the geographic placement is somewhat arbitrary. I looked at two landscapes: one where management classes are clustered (“clustered” landscape), and one where management classes are randomly distributed (“random”). For each landscape, I used two rasters: probability of infection (continuous values from 0 to 1) and non-tree/tree (binary, 0s and 1s).

2. Masked non-tree cells

Since not all cells in my raster grid contain trees, I set all non-tree cells to NA for my analysis in order to avoid comparing the probability of infection between trees and non-trees. I used the tree rasters to create a mask.
library(raster)                   # the mask() function comes from the raster package
# c.pi / c.tree and r.pi / r.tree are the infection-probability and tree rasters
# imported in step 1 for the clustered (c) and random (r) landscapes
c.tree[ c.tree < 1 ] <- NA        # Set all non-tree cells in the tree raster to NA
c.pi.tree <- mask(c.pi, c.tree)   # Keep infection probabilities only where trees exist; all other cells become NA
# Repeat for the landscape with randomly distributed management
r.tree[ r.tree < 1 ] <- NA        # Set all non-tree cells in the tree raster to NA
r.pi.tree <- mask(r.pi, r.tree)   # Keep infection probabilities only where trees exist; all other cells become NA

Fig 2. Filled and hollow weights matrices.

3. Calculated Global Moran’s I for multiple values of lag distance.

For each lag distance, I created a weights matrix so the Moran function in the raster package would know how to weight each neighbor pixel at a given distance. Then, I let it run, calculating Moran’s I for each lag to create the data points for a correlogram.

I produced two correlograms: one where all cells within a given distance (lag) were given a weight of 1, and another using a “hollow” weights matrix where only cells at a given distance were given a weight of 1 (see the example below).
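
For a single lag distance d (in cell units), the two kinds of weights matrices can be built roughly as follows (a sketch: the “hollow” ring is approximated here as the border of a square window):

d <- 5
size <- 2 * d + 1                                 # Moran() requires an odd-dimensioned matrix
# "Filled": every cell within lag d of the focal cell gets weight 1
w.filled <- matrix(1, nrow = size, ncol = size)
w.filled[d + 1, d + 1] <- 0                       # exclude the focal cell itself
# "Hollow": only cells on the outer ring at lag d get weight 1
w.hollow <- matrix(0, nrow = size, ncol = size)
w.hollow[1, ] <- 1; w.hollow[size, ] <- 1
w.hollow[, 1] <- 1; w.hollow[, size] <- 1
Moran(c.pi.tree, w = w.filled)                    # one point on the "filled" correlogram
Moran(c.pi.tree, w = w.hollow)                    # one point on the "hollow" correlogram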

4. Plotted the global Moran’s I for each landscape and compared the two.

What did I find? Brief description of results I obtained.

The correlograms show that similar values become less clustered at greater distances, approaching a random distribution by about 50 cell distances. In other words, cells are more similar to the cells around them than they are to more-distant cells. The many peaks and troughs in the correlogram are present because of the gaps between trees created by their regular spacing under plantation management.

In general, the highest values of Moran’s I were similar between the landscape with clustered management classes and the landscape with randomly distributed management classes. However, the rate of decrease in the value of Moran’s I with increasing lag distance was higher for the random landscape than for the clustered landscape. In other words, similar infection probabilities formed larger clusters when forest management classes were clustered. For the clustered landscape, there was actually spatial autocorrelation at lag distances of 100 to 150 cells, likely because of the clusters of higher infection probability in the “old growth” management cluster.

Correlogram for the clustered and random landscape showing Moran’s I as a function of lag distance. “Filled” weights matrix.

Correlogram for the clustered and random landscape showing Moran’s I as a function of lag distance. “Hollow” weights matrix.

Critique of the method – what was useful, what was not?

My biggest issue initially was finding a package to perform a hotspot analysis on raster data in R. I found some packages with detailed tutorials (e.g., hotspotr), but some had not been updated recently enough to work in the latest version of R. I could have done this analysis in ArcMap, but I am trying to use open-source software and free applications and improve my programming abilities in R.

The Moran function I eventually used in the raster package worked quickly and effectively, but it does not provide statistics (e.g., p-values) to interpret the significance of the Moran’s I values produced. I also had to make the correlogram by hand with the raster package. Other packages do include additional statistics but are either more complex to use or designed for point data. There are also built-in correlogram functions in packages like spdep or ncf, but they were very slow, potentially taking hours on a 300 x 300 cell raster. That said, it may just be my inexperience that made a clear path difficult to find.

References

Glen, S. 2016. Moran’s I: Definition, Examples. https://www.statisticshowto.datasciencecentral.com/morans-i/.

Hijmans, R. J. 2019. raster: Geographic Data Analysis and Modeling. R package version 2.8-19. https://CRAN.R-project.org/package=raster


The Geography of Exclusion

Description of the research question:

My research focuses on vulnerable populations, specifically refugees and internally displaced peoples. This is a small part of a larger project funded by NASA, "Mapping the Missing Millions," and is largely defined as the "geography of exclusion." I am hoping to understand why settlements have been excluded from global population datasets; we know that this happens often, but not the specific mechanisms by which these settlements end up missing from these datasets. Hence, my question recognizes that the classification methods used to produce population datasets are imperfect, and I am seeking to understand why they are imperfect. This means I will need to understand the spatial distribution of the settlements identified in both sets, analyze the intersections and exclusions between them, and understand why these exist. This might also mean figuring out how close an OpenStreetMap settlement is to an urban center or a road and figuring out whether these metrics affect the classification.

My research question is as follows:

How do the settlements identified by OpenStreetMap (OSM) compare to settlements identified in global population datasets via classification and what about these classification metrics fails to detect settlements known to OSM?

Description of the dataset:

The crux of my data is a comparison of UNHCR and OpenStreetMap (OSM) to a global population dataset, Global Human Settlement Layer (GHSL). OpenStreetMap is a global open source dataset and contains both point and polygon information. Through the UNHCR point data that identifies settlement locations, I have identified boundaries that are attributed as delineating refugee settlements. A potential disclaimer with OSM data is that it’s an open source dataset contributed to by volunteers. This means that attribution can be unclear or inconsistent, despite validation. I can also use other OSM data like roads and urban areas to expand my spatial analyses for a proximity assessment.

I will also make use of the rich Landsat and Sentinel data available for my spectral analysis. This will be at either 30-meter resolution (Landsat) or 10-meter resolution (Sentinel). The temporal extent depends on the satellite: Landsat 7 has collected imagery since 1999, Landsat 8 since 2013, and Sentinel-2 was launched in 2015.

For this class, I will focus my analysis on Uganda, given its high prevalence of refugee settlements and extensive OSM dataset with a strong Humanitarian OpenStreetMap Team presence.

Figure 1. Layoun Refugee Camp boundary (blue) in an urban false color composite of Landsat 8 imagery.

Figure 2. Global Human Settlement Layer overlay with Layoun Refugee Camp boundary. White indicates measured human settlement.

The images above are an example of a refugee settlement in Algeria. The area in blue in the NW corner is the settlement; the area in the SW is a nearby town. However, this settlement is not identified in the Global Human Settlement Layer, although this specific settlement has existed since at least 2001.

Hypotheses:

I expect that settlements not detected by GHSL will have a different and less distinct spectral signature than settlements detected by GHSL. By “distinct,” I am referring to how different the spectral signature in the settlement is to the spectral signature immediately around the settlement. By “different” spectral signature, I am referring to the concept that the classification in GHSL is looking for a specific type of spectral signature, and that this does not match the spectral signature found in the settlements indicated by OSM. I also expect that settlements not detected by GHSL will be further from known roads and high density urban areas than settlements detected by GHSL.

Approaches & Analyses:

With my OSM data, I can use these vector boundaries to analyze the spatial and spectral patterns of these settlements. I will analyze the size of these settlements, the spectral signature within these settlements, and their proximity to resources (roads, water, cities).
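
As a first pass, two of those settlement characteristics (size and road proximity) could be computed with something like the R/sf sketch below (the file names, column names, and choice of UTM zone are assumptions on my part, since this analysis is still in the planning stage):

library(sf)
settlements <- st_read("osm_refugee_settlements.shp")       # OSM settlement polygons
roads <- st_read("osm_roads.shp")                           # OSM road lines
# Project to a meter-based CRS (UTM zone 36N covers most of Uganda)
settlements <- st_transform(settlements, 32636)
roads <- st_transform(roads, 32636)
settlements$area_m2 <- as.numeric(st_area(settlements))     # settlement size
# Distance from each settlement to its nearest road
nearest <- st_nearest_feature(settlements, roads)
settlements$dist_to_road <- as.numeric(
  st_distance(settlements, roads[nearest, ], by_element = TRUE))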

With the global population dataset, I can identify pixel clusters that indicate settlements, and perform similar analysis to identify size, spectral signature, and proximity to resources.

While these analyses can help me identify the differences between these settlements, I still need to analyze the classification methods of GHSL to understand why these differences might be significant and how they may have resulted in different settlement detections.

Expected Outcome:

I will need to present the statistical relationships between the refugee settlements that are and are not detected in my target population dataset. Because I’m also seeking to understand why these settlements are excluded in the classification, I will need to connect the spatial relationships that I find with the classification methods that GHSL uses. This will be a more verbal description, but I plan to make maps to illustrate these spatial relationships and characteristics. These relationships and characteristics include settlement size, border complexity, proximity to roads, and spectral signature.

Significance:

This project addresses the exclusion of settlements and populations within various global datasets. This has greater relevance given that so much derived data relies on these datasets, whether for distributing aid and resources, analyzing displacement, or understanding human migration. By understanding what factors contribute to the inclusion or exclusion of settlements in these datasets, users can better understand the limitations of what it is possible to detect and where gaps in population detection are more likely to occur.

Level of preparation:

I have substantial experience with ArcInfo products. I’ve been using ArcGIS Pro for over a year now, and prior to that I spent 2 years working with ArcMap daily in a professional capacity, took three classes taught exclusively in the ArcDesktop interface, and employed ArcInfo for projects in multiple other classes. My image processing skills are also extensive, ranging from two classes using ENVI Classic and a GIS internship that included georeferencing satellite imagery to, most recently, a class and outside research using Google Earth Engine. My experience with R is limited to a summer research project in 2016. I have some basic GIS programming skills (very limited ArcPy use but recent and frequent ModelBuilder use) and will be learning more throughout this term as a participant in Robert Kennedy’s GIS Programming class.