Author Archives: ballasia

Final Project: Characterizing the exclusion of refugee settlements from global settlement datasets

Anna Ballasiotes

RESEARCH QUESTION

This research is a small part of a larger question: where are refugees? That question is itself the starting point for a larger problem: how will the global community provide resources to those who need help? Doing so requires knowing where these vulnerable populations are. Refugee and internally displaced person (IDP) populations are created by certain drivers, notably conflict (most often varying levels of violent conflict) and, more recently, climate events (including natural disasters). Beyond the drivers, there are different typologies of refugee populations: refugees within cities who blend into existing infrastructure, refugees in settlements and camps close to borders, refugees in settlements close to urban centers, and refugees in settlements and camps in areas where no population was previously established and land is therefore available. My focus lies largely in the last item of that list: places where populations had not been formally established but where camps and settlements are now expanding. It is a known problem that global settlement datasets do not adequately recognize these camps and settlements as “population.” My goal was therefore to characterize how these global settlement datasets fail to recognize these camps and settlements, and thus exclude these populations. This raises the question: to what extent do global settlement datasets fail to detect known refugee camps and settlements that form in areas of low population density?

I originally sought to answer the question, “What about the classification metrics of global settlement datasets fails to detect settlements known to OSM?” I wasn’t able to truly answer this question because I first needed to characterize the exclusion itself before I could understand why these global population datasets were unsuccessful. Characterizing where this exclusion occurred was difficult enough and took most of the class, especially as I explored what these refugee settlements looked like and considered how they may have formed.

DATA

I used three main data sources: refugee settlement boundaries from OpenStreetMap (OSM), the Facebook High Resolution Settlement Dataset, and the World Settlement Footprint (WSF). I used only data within Uganda, given the overlap of data availability and project partners with reliable data. The Facebook High Resolution Settlement Dataset is trained on 2018 imagery and is presented at a 30 meter resolution. World Settlement Footprint is trained on 2015 imagery and is also presented at 30 meter resolution, but with a minimum mapping unit mask that creates clusters of pixels, which makes the data appear to have a coarser resolution. The refugee settlement boundaries from OSM range in date from 2016 to 2019; however, this is only when the boundaries were added to the OpenStreetMap database and is not necessarily when camps or settlements were established. I had initially included the Global Human Settlement Layer (GHSL), which is also trained on 2015 imagery, but GHSL did not appear to detect any settlement within the OSM refugee camp boundaries, so I was not able to perform any meaningful analysis with it.

HYPOTHESES: predictions of patterns and processes.

I wanted to understand in what ways the global settlement datasets excluded the known refugee settlements. Specifically, I wanted to know what percentage of each refugee settlement’s area was captured, whether larger or smaller settlements had a higher percentage of captured area, and whether settlements with a higher percentage of captured area showed a faster green-up moving away from the boundary. I suspected that smaller settlements would have a higher percentage of area captured, mostly because I predicted that the boundaries of smaller settlements conform more closely to the actual built-up area, whereas larger settlements include more land cover types that are less likely to be detected as “built-up.” I also suspected that settlements with a higher detection percentage would have a more distinct natural boundary: that is, the green-up around the settlement would occur faster if a higher percentage was detected.

APPROACHES: analysis approaches used.

I initially measured the spatial autocorrelation of the refugee settlements using ArcGIS Pro, but this analysis did not tell me much that was useful. I already knew that my settlements were quite different from each other in size; any clustering would more likely have come from factors that are not obviously visible, like the way land was distributed or delegated for refugee settlements.

To characterize the percentage of refugee settlement area captured by each settlement dataset, I brought everything into Earth Engine and calculated the area of every pixel within the refugee boundaries, and then the area of the WSF and Facebook pixels within those boundaries, counting only pixels detected as settlement.
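A minimal sketch of the percent-area calculation described above, written in plain Python rather than Earth Engine. The function name `percent_area_detected`, the toy grids, and the 900 m² pixel area (one 30 m pixel) are illustrative assumptions, not the script used for this project.

```python
def percent_area_detected(inside_boundary, detected, pixel_area_m2):
    """Percent of the boundary's area that a dataset flags as settlement.

    All three arguments are 2D grids (lists of rows) of the same shape:
    inside_boundary and detected hold 0/1 flags, pixel_area_m2 holds areas.
    """
    boundary_area = 0.0
    detected_area = 0.0
    for in_row, det_row, area_row in zip(inside_boundary, detected, pixel_area_m2):
        for inside, det, area in zip(in_row, det_row, area_row):
            if inside:
                boundary_area += area
                if det:
                    # Only count detections that fall inside the boundary.
                    detected_area += area
    if boundary_area == 0:
        return 0.0
    return 100.0 * detected_area / boundary_area


# Hypothetical 3x4 grid: the boundary covers the two middle columns.
inside = [[0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0]]
# Hypothetical settlement detections; the one in the first column is outside.
wsf = [[0, 1, 0, 0],
       [0, 0, 0, 0],
       [1, 0, 1, 0]]
area = [[900.0] * 4 for _ in range(3)]  # assumed 30 m x 30 m pixels

pct = percent_area_detected(inside, wsf, area)
print(f"{pct:.1f}% of the boundary area is detected")
```

Summing per-pixel area instead of counting pixels keeps the calculation valid when pixel area varies across the raster.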

To refine the comparison of area detected within refugee settlements, I needed to look only at pixels that appear “built-up” within the refugee settlements, since the boundaries themselves contain quite a lot of green space and other land cover types. I therefore also performed a binary unsupervised classification based on the known locations of refugee settlements. Ultimately, I used a confusion matrix to compare this binary unsupervised classification to WSF and Facebook to refine the “percent area detected.”
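The comparison against the unsupervised classification can be sketched as a pixel-by-pixel tally. This is an illustrative example only: the grids are made up, the names `confusion_counts` and `detection_rate` are hypothetical, and treating the K-means layer as the reference is an assumption.

```python
def confusion_counts(reference, candidate):
    """Tally TP, FP, FN, TN between two binary grids of the same shape."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for ref_row, cand_row in zip(reference, candidate):
        for ref, cand in zip(ref_row, cand_row):
            if ref and cand:
                counts["TP"] += 1   # both layers say "built-up"
            elif cand:
                counts["FP"] += 1   # only the global dataset says "built-up"
            elif ref:
                counts["FN"] += 1   # only the reference says "built-up"
            else:
                counts["TN"] += 1   # neither layer says "built-up"
    return counts


def detection_rate(counts):
    """Share of reference built-up pixels that the global dataset also flags."""
    reference_positive = counts["TP"] + counts["FN"]
    return counts["TP"] / reference_positive if reference_positive else 0.0


# Hypothetical 2x3 grids: 1 = built-up, 0 = not built-up.
kmeans = [[1, 1, 0],
          [1, 0, 0]]
facebook = [[1, 0, 0],
            [0, 1, 0]]

counts = confusion_counts(kmeans, facebook)
rate = detection_rate(counts)
print(counts, f"detection rate = {rate:.2f}")
```

The detection rate here restricts the “percent area detected” to pixels the reference layer considers built-up, which is the refinement the paragraph describes.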

I also looked at the relationship between detection and greenness moving away from the boundary. To do this, I created a multi-ring buffer out to 1 kilometer at 100 meter increments and calculated the average maximum NDVI in each “donut buffer” for 2018 using Landsat 8 imagery. I then plotted these NDVI values against increasing distance from the refugee boundary to see the green-up.

RESULTS

I produced a map of the spatial confusion of overlap between WSF, Facebook, and a binary unsupervised classifier within my settlement boundaries. I have refined this map a bit, but there are still further statistical comparisons that could be done to better characterize the overlap of WSF, Facebook, and the K-Means classifier.

Ayilo 1 & 2 Refugee Settlements; Classifications shown

I also have all of the settlement boundaries with the percent of area captured by each global dataset within each boundary. This could serve as a sorting mechanism for further analysis to better answer my hypotheses, which I didn’t necessarily answer well, specifically with regard to the NDVI prediction. Within the exercises, I answered this question based on geography rather than percent detection: did settlements in the north versus the south show different patterns of vegetation green-up around the settlement?

SIGNIFICANCE: What did you learn from your results? How are these results important to science? To resource managers?

I learned that, across the board, refugee settlements ARE systematically excluded from global datasets but can easily be detected at the local scale, so this systematic bias in the datasets must be addressed. Those who model systems of migration or provide access to humanitarian resources need more accurate representations of settlements. Beyond this, I was really hoping to characterize the spatial patterns of the exclusion better than I was able to; I think that’s where my analysis fell short and where my next hypothesis would come from.

LEARNING: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other? What about statistics? including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA), and (e) or other techniques?

I didn’t dive much more deeply into these software packages than I have in the past; I think I just became more acutely aware of the statistical methods that exist in Arc toolboxes and became more comfortable manipulating data in R.

I think that the statistical learning was less specific or intense than I was hoping, in part because I was really challenged to think statistically and spatially about my research question. I wish I had spent more time understanding statistical methods rather than trying to mold my question into something that seemed to fit my elementary understanding of spatial statistics.

Exercise 3: Characterizing the spatial pattern of confusion between pixels classified by global settlement datasets and OSM refugee settlements.

Question that I asked:

In Exercise 3, I was asking how the different classification methods that I’m using align or do not align with each other — essentially, creating a confusion matrix to identify the pixels that are True Positives, False Positives, and False Negatives when comparing two of the classifications. That is — what is the spatial pattern of confusion between pixels classified by global settlement datasets and an unsupervised binary classification of pixels within an OSM boundary?

Approach that I used:

I used EarthEngine to simulate confusion among pixels — I did not necessarily create a matrix in order to do so, but considered the concept of a confusion matrix in order to calculate a number for each pixel.

Steps I followed to complete the analysis:

In order to do this, I needed to compare “Facebook” and “K-means” as well as “WSF” and “K-means.” In this case, the “K-means” classification was my “truth” classification, because I needed something to measure the other classifications against that was not just a summary of pixels inside of a vector but a more traditional binary unsupervised classification. Thus, I used the settlements to guide a clustering algorithm of K-Means to cluster similar pixels, and created clusters of “settlement-like” pixels. I needed to perform raster math in order to combine these datasets and represent False Positives, False Negatives, and True Positives. I added WSF to K-means, with each pixel labeled as “1” in each representing a positive record. I then added Facebook to K-means, again with each pixel labeled as 1, but then multiplied this by 4 for further additive properties to identify the overlaps in False Positives, False Negatives, and True Positives. Ultimately, this meant that the pixels that were False Negative in either or both comparisons had pixel values of 10, 11, and 14. The data that were False Positive in either or both comparisons had pixel values of 1, 4, and 5. A value of 15 represented where all three datasets registered as True Positive.
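One encoding that reproduces the pixel values listed above (1, 4, 5, 10, 11, 14, 15) is sketched below. It assumes the K-means layer is weighted 2 inside each pairwise sum and that the Facebook sum is multiplied by 4. This weighting is an assumption made for illustration; the raster math actually used may have differed, and the function names are hypothetical.

```python
from itertools import product


def combined_code(kmeans, wsf, facebook):
    """Combine three binary labels (0 or 1) into a single code per pixel.

    Assumed weighting (not confirmed by the write-up): K-means counts 2
    inside each pairwise sum, and the Facebook sum is multiplied by 4.
    """
    wsf_vs_kmeans = wsf + 2 * kmeans
    facebook_vs_kmeans = facebook + 2 * kmeans
    return wsf_vs_kmeans + 4 * facebook_vs_kmeans


def interpret(code):
    """Map a combined code to the categories described in the text."""
    if code == 15:
        return "true positive in both comparisons"
    if code in (10, 11, 14):
        return "false negative in at least one comparison"
    if code in (1, 4, 5):
        return "false positive in at least one comparison"
    if code == 0:
        return "no layer detects settlement"
    raise ValueError(f"unexpected code: {code}")


# Enumerate every combination of (K-means, WSF, Facebook) labels.
codes = {
    (k, w, f): combined_code(k, w, f)
    for k, w, f in product((0, 1), repeat=3)
}
for labels, code in sorted(codes.items(), key=lambda item: item[1]):
    print(labels, code, interpret(code))
```

Under this weighting every combination of the three labels maps to a distinct value, so a single output raster can be symbolized by category.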

Brief description of results I obtained:

I was interested in both False Positives and False Negatives, but for different reasons. In False Positives, I wanted to visualize where the different global settlement datasets DID detect settlements while the K-Means did not. False Negatives would show me where K-Means picked up settlement data but neither of the global datasets did. Ultimately, False Negatives were much more prevalent throughout the dataset, further illustrating the exclusion of refugee settlements from global datasets. I chose to display a map of a more interesting pattern that appeared when looking over a larger settlement that was near a large body of water: because the binary clustering grouped water with “settlement” type landscapes, this area is a significant False Negative in the data. If I were to calculate statistics, it would make sense to exclude the area of water in order to not promote bias in the data.

False Positives and False Negatives between Refugee Settlement Binary Clusters and Facebook Classification and World Settlement Footprint Classification

Critique of the method – what was useful, what was not?

This exercise probably caused the most trouble: methods that I thought made sense and would work presented more challenges and fewer patterns than I was hoping. There wasn’t really the spatial pattern that I was expecting, perhaps because the data are so noisy and cover discontinuous areas of the landscape. It was useful to dig into the overlapping data, but ultimately the actual spatial patterns were not very remarkable. Perhaps the numbers presented in a matrix would be a more useful representation of the data or patterns for this specific method. Aside from the spatial statistics, I ran into quite a few function issues in Earth Engine, ArcGIS Pro, and ArcGIS Desktop, all three of which I used to try to manipulate the data in the ways I had imagined. For my final project, I may need to revisit these methods or seek advice from others, because my techniques were not as effective as I’d hoped.

Exercise 2: Incremental pixel greenness while moving away from refugee settlement boundaries

The question I asked centered on how pixel greenness (NDVI) varied in buffered increments around settlements within the Bidi Bidi, Imvepi, and Rhino refugee camps. I wanted to compare the settlements in Bidi Bidi and Imvepi, which are larger, to those in Rhino, which tend to be smaller and more uniformly spaced and spread out. Given the more uniform and wider spacing in Rhino, I expected green-up to happen more quickly there than around the settlements in Bidi Bidi and Imvepi, which are closer together and vary in size and development pattern. This leads me to believe that there’s more cleared or available land and that settlement size is based on the political organization of camp blocks rather than natural boundaries that might already exist. One of the questions that I’m asking with these settlements and the geography of exclusion is essentially why different settlement areas and parts of these areas are included or excluded from global settlement datasets. One of the factors that contributes to this is a spectral and spatial distinction: how might the green space in and around a settlement change as you move away from that settlement? With this exercise, I wanted to compare the settlements in Rhino to the settlements in Bidi Bidi and Imvepi to see if the pixel greenness changed at a different rate or in a different pattern as one moved away from the settlement center.

Map of Regions of Interest & Buffers

I used several tools: the Multi-Ring Buffer tool in QGIS (since I was working with exported JSONs), Earth Engine to extract mean NDVI values from Landsat 8 satellite images at these settlement locations, and the smoothing function in ggplot in R to plot and statistically examine the way NDVI changed.

I first needed to buffer the regions of interest that I wanted to study, of which there were 44 in Bidi Bidi and Imvepi and 41 in Rhino. I performed this buffering in QGIS using the Multi-Ring Buffer plugin and made buffers at 100-meter increments from 0 to 1000 meters. Some of the buffers overlapped, but for the sake of simplicity in this assessment, I ignored this. Within Earth Engine, I pulled Landsat 8 images from 2018 that covered these settlements. After adding calculated NDVI and NDBI bands to the 2018 image collection, I performed a quality mosaic to reduce the image collection to one image. I based this quality mosaic on NDVI, meaning that the pixel chosen from the image collection would be the pixel with the highest NDVI. While this can pull pixels from different dates, it limits the effect of clouds and seasonality on the dataset by comparing just the most vegetated pixel that occurred in each location. If I were to redo this, I might choose a single-date image to capture phenological nuance. After reducing the image across all of the buffers (that is, calculating the mean NDVI within each buffer), I exported the GeoJSONs, brought them back into QGIS, ensured that there was a spatial selection component linking all of the buffers and regions of interest together, and brought the data into R to plot the NDVI change over distance for all regions of interest.
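The compositing and buffer-reduction steps above can be sketched in plain Python. This is a simplified illustration on tiny made-up grids, not the Earth Engine workflow itself; the function names and the ring labels (outer distance in meters) are hypothetical.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - red) / (NIR + red)."""
    total = nir + red
    return (nir - red) / total if total else 0.0


def greenest_pixel_composite(ndvi_stack):
    """Collapse a list of NDVI grids (one per date) to the per-pixel maximum."""
    rows = len(ndvi_stack[0])
    cols = len(ndvi_stack[0][0])
    return [
        [max(image[r][c] for image in ndvi_stack) for c in range(cols)]
        for r in range(rows)
    ]


def mean_by_ring(composite, ring_id):
    """Average the composite within each ring; None marks pixels outside all rings."""
    sums, counts = {}, {}
    for value_row, ring_row in zip(composite, ring_id):
        for value, ring in zip(value_row, ring_row):
            if ring is None:
                continue
            sums[ring] = sums.get(ring, 0.0) + value
            counts[ring] = counts.get(ring, 0) + 1
    return {ring: sums[ring] / counts[ring] for ring in sorted(sums)}


# Two hypothetical dates of NDVI over a 2x2 area.
date_1 = [[0.2, 0.5],
          [0.1, 0.7]]
date_2 = [[0.4, 0.3],
          [0.6, 0.2]]
# Hypothetical ring labels: top row lies in the 0-100 m ring, bottom row in 100-200 m.
rings = [[100, 100],
         [200, 200]]

composite = greenest_pixel_composite([date_1, date_2])
ring_means = mean_by_ring(composite, rings)
print(ring_means)
```

Because the composite takes the maximum across dates, neighboring pixels can come from different acquisition dates, which is the trade-off noted in the paragraph above.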

Bidi Bidi Settlement; Imvepi Settlement Buffers

Rhino Settlement Buffers

The greening in the buffers around settlements in Rhino versus Imvepi and Bidi Bidi did show different patterns, but not markedly different ones. It appears that the Rhino settlements had a faster rate of increase in greenness moving away from the settlements, especially in the first 500 meters, whereas Bidi Bidi and Imvepi showed a more gradual green-up, although there also seems to be a small shift at 500 meters. These results are somewhat expected, but the differences are not drastic. It would be interesting to see how the green-up changes if I increased my buffer extent or decreased my buffer increments.

I think that looking at NDVI in buffers was an interesting approach, but as I said above, my choice of pixel quality selection (highest NDVI) could bias what would otherwise be a neutral selection of data. Also, the buffers I chose were relatively arbitrary: I chose equal intervals, but this means that the mean NDVI is calculated over a larger area as each buffer gets farther from the center. I could also try testing with larger buffers (200 or 500 meter buffers) that extend beyond 1000 meters from the settlement edge. Further, some of my buffers overlapped and encroached on other boundaries, which means that the buffers sometimes contained pixels from other identified settlements. For this reason, I chose to present the data as a smoothed trend. I will likely need to fix some of these errors for the final project, because I do think that this is a typical and useful spatial analysis to perform on this type of data; some of the errors are relatively easy to fix, and fixing them would improve data integrity.

Determining spatial autocorrelation of refugee settlements in Uganda

Question that you asked

In analyzing the distribution of settlement boundaries that I obtained from OpenStreetMap, I wanted to know the general clustering and regionality of settlements in order to understand how the spatial statistics I perform in my next step will behave, based on the results of this exploratory step. The question I’m introducing for Exercise A centers around how similar or different nearby settlements are to each other: is there a regionality in settlement sizes? One future question that I’m considering is whether clustered settlements have higher detection in the World Settlement Footprint classification (which I’m using instead of GHSL because it’s a more localized classification). This would be determined using the percent of OSM settlement area that was detected by WSF.

Name of the tool or approach that you used.

I decided to use spatial autocorrelation to answer this question, since the essence of my question centers around what the pattern of my data looks like, how clustered or not clustered are these settlements, and how that might affect my future analysis and considerations.

For this, I employed the Spatial Autocorrelation tool in ArcGIS Pro, which computes the Global Moran’s I statistic.

Brief description of steps you followed to complete the analysis.

To identify the refugee settlements, I went through multiple rounds of querying OpenStreetMap to extract the boundaries that are directly related to refugee camps. Hannah Friedrich and I have worked together on defining some of these boundaries, and she took some time to delineate boundaries for a separate study of hers. While I will incorporate these delineated ‘regions of interest,’ I will most likely not move forward with them in further analyses because they would present issues with scaling up and manual interpretation. Below I list the three polygon datasets I’m analyzing here.

OSM Boundaries: This is the result of multiple query efforts within Overpass Turbo, an online tool for querying and downloading OpenStreetMap data. Various queries on attribute tags were performed in order to select relevant refugee boundary data.

Regions of Interest: This is the result of Hannah’s effort to limit the OSM Boundaries to areas that appear to be built up, using high-resolution imagery available on Google Satellite.

Selected OSM Boundaries: This is a selection from OSM Boundaries that merges polygons of the same settlement into multipolygons; it is also limited to boundary polygons smaller than 2000 hectares, to account for some boundaries that are designated rather than actually settled.

I then ingested these boundaries into Google Earth Engine, extracted the pixel areas for the WSF pixels within the three different boundary layers, and re-exported these.

I then imported these layers into ArcGIS Pro and ran the Spatial Autocorrelation tool, experimenting with multiple parameters.

I ultimately chose the “row standardization” option given the possible bias in the ‘sampling’ design: these data are most often created by the HOT Uganda team, and resources might limit where they can travel and collect data.
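For readers unfamiliar with the statistic, here is a minimal sketch of Global Moran’s I with row-standardized weights. It computes only the index, not the z-score or p-value that the ArcGIS Pro tool also reports, and the settlement areas and neighbor structure are made-up values for illustration.

```python
def row_standardize(weights):
    """Scale each row of a spatial weights matrix so that it sums to 1."""
    standardized = []
    for row in weights:
        total = sum(row)
        standardized.append([w / total if total else 0.0 for w in row])
    return standardized


def morans_i(values, weights):
    """Global Moran's I: (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i ** 2)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                      # deviations from the mean
    s0 = sum(sum(row) for row in weights)               # sum of all weights
    cross = sum(
        weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n)
    )
    variance_term = sum(zi * zi for zi in z)
    return (n / s0) * cross / variance_term


# Hypothetical settlement areas (hectares) for four polygons in a row.
areas = [10.0, 12.0, 50.0, 55.0]
# Hypothetical binary neighbors: each polygon touches the next one along.
neighbors = [[0, 1, 0, 0],
             [1, 0, 1, 0],
             [0, 1, 0, 1],
             [0, 0, 1, 0]]

w = row_standardize(neighbors)
moran = morans_i(areas, w)
print(f"Moran's I = {moran:.3f}")
```

A positive index means neighboring polygons tend to have similar areas; a negative index means neighbors tend to differ. Row standardization gives every polygon the same total influence regardless of how many neighbors it has, which is why it is suggested when the set of mapped features may reflect where mappers could work rather than where settlements actually are.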

Brief description of results you obtained.

Ultimately, my results showed clustering within the OSM-identified settlements, which is to be expected. Several issues arise from this that might confound my analysis and that I still find moderately troubling: the method of data recording, the spotlight effect, the presence of organizations able to record data, and the inherent clustering based on human behavior or on drivers such as conflict or environmental conditions. This initial analysis is really to understand the degree of clustering that I might expect in my future analyses. If there is high clustering, then I need to account for it when I’m testing additional questions in my next exercise, like distance to the nearest road or nearest urban area. The images below show the results of the analysis of settlement area for the “Selected OSM Boundaries,” “OSM Boundaries,” and “Regions of Interest.”

Moran I, Selection of OSM Boundaries

Moran I, Delineated Regions of Interest

Moran I, All OSM Boundaries

While OSM Boundaries and the Selection of OSM Boundaries show high confidence of clustering, the “Regions of Interest” pattern shows a less confident clustering result. This makes sense, given that the OSM Selection and OSM Boundaries both have overlapping polygons and varying sizes.

Below is a map illustrating the distribution of refugee areas in northwest Uganda, where most of the settlements are concentrated.

Critique of the method – what was useful, what was not?

Given my shortcomings with understanding statistical analyses, the interpretation of the results was most difficult for me. While my patterns appeared mostly clustered, the z-score and Moran’s Index showed changes when different parameters of comparison were chosen. Ultimately, I’m not sure if there would be a more effective spatial analysis to analyze aspects of these settlements that would prove more helpful in figuring out relevant information for moving forward with Exercise 2 and beyond.

The Geography of Exclusion

Description of the research question:

My research focuses on vulnerable populations, specifically refugees and internally displaced peoples. It is a small part of a larger NASA-funded project, “Mapping the Missing Millions,” and is largely defined as the “geography of exclusion.” I am hoping to understand why settlements have been excluded from global population datasets; we know that this happens often, but we do not know the specific mechanisms by which these settlements go missing from these datasets. Hence, my question recognizes that the classification methods used for population datasets are imperfect, and I’m seeking to understand why they are imperfect. This means I will need to understand the spatial distribution of the settlements identified in both sets, analyze the intersections and exclusions between them, and understand why these exist. This might also mean figuring out how close an OpenStreetMap settlement is to an urban center or a road and figuring out if these metrics affect the classification.

My research question is as follows:

How do the settlements identified by OpenStreetMap (OSM) compare to settlements identified in global population datasets via classification and what about these classification metrics fails to detect settlements known to OSM?

Description of the dataset:

The crux of my data is a comparison of UNHCR and OpenStreetMap (OSM) to a global population dataset, Global Human Settlement Layer (GHSL). OpenStreetMap is a global open source dataset and contains both point and polygon information. Through the UNHCR point data that identifies settlement locations, I have identified boundaries that are attributed as delineating refugee settlements. A potential disclaimer with OSM data is that it’s an open source dataset contributed to by volunteers. This means that attribution can be unclear or inconsistent, despite validation. I can also use other OSM data like roads and urban areas to expand my spatial analyses for a proximity assessment.

I will also make use of the rich Landsat and Sentinel data available for my spectral analysis. This will either be at 30 meter resolution (Landsat) or 10 meter resolution (Sentinel). The temporal extent depends on the satellite: Landsat 7 imagery is available from 1999 onward, Landsat 8 from 2013 to present, and Sentinel-2 was launched in 2015.

For this class, I will focus my analysis on Uganda, given its high prevalence of refugee settlements and extensive OSM dataset with a strong Humanitarian OpenStreetMap Team presence.

Figure 1. Layoun Refugee Camp boundary (blue) in an urban false color composite of Landsat 8 imagery.

Figure 2. Global Human Settlement Layer overlay with Layoun Refugee Camp boundary. White indicates measured human settlement.

The images above are an example of a refugee settlement in Algeria. The area in blue in the NW corner is the settlement; the area in the SW is a nearby town. However, this settlement is not identified in the Global Human Settlement Layer, although this specific settlement has existed since at least 2001.

Hypotheses:

I expect that settlements not detected by GHSL will have a different and less distinct spectral signature than settlements detected by GHSL. By “distinct,” I am referring to how different the spectral signature in the settlement is to the spectral signature immediately around the settlement. By “different” spectral signature, I am referring to the concept that the classification in GHSL is looking for a specific type of spectral signature, and that this does not match the spectral signature found in the settlements indicated by OSM. I also expect that settlements not detected by GHSL will be further from known roads and high density urban areas than settlements detected by GHSL.

Approaches & Analyses:

With my OSM data, I can use these vector boundaries to analyze the spatial and spectral patterns of these settlements. I will analyze the size of these settlements, the spectral signature in these settlements, and their proximity to resources (roads, water, cities).

With the global population dataset, I can identify pixel clusters that indicate settlements, and perform similar analysis to identify size, spectral signature, and proximity to resources.

While these analyses can help me identify the differences between these settlements, I also still need to analyze the classification methods of GHSL to understand why these differences might be significant and have resulted in different settlement detections.

Expected Outcome:

I will need to present the statistical relationships between the refugee settlements that are and are not detected in my target population dataset. Because I’m also seeking to understand why these settlements are excluded in the classification, I will need to connect the spatial relationships that I find with the classification methods that GHSL uses. This will be a more verbal description, but I plan to make maps to illustrate these spatial relationships and characteristics. These relationships and characteristics include settlement size, border complexity, proximity to roads, and spectral signature.

Significance:

This project addresses the exclusion of settlements and populations within various global datasets. This has greater relevance given that so much derived data relies on these datasets, whether for distributing aid and resources, analyzing displacement, or understanding human migration. By understanding what factors contribute to the inclusion or exclusion of settlements in these datasets, more users can understand the limitations of what is possible to detect and where gaps in population detection are more likely to occur.

Level of preparation:

I have substantial experience with ArcInfo products. I’ve been using ArcGIS Pro for over a year now, and prior to that I spent 2 years working with ArcMap daily in a professional capacity, took three classes that were taught exclusively in the ArcDesktop interface, and employed ArcInfo for projects in multiple other classes. My image processing experience is also extensive, including two classes using ENVI Classic, a GIS internship that included georeferencing satellite imagery, and most recently a class and outside research using Google Earth Engine. My experience with R is limited to a summer research project in 2016. I have some basic GIS programming skills (very limited ArcPy use but recent and frequent ModelBuilder use) and will be learning more throughout this term as a participant in Robert Kennedy’s GIS Programming class.

GEOG 566

Advanced spatial statistics and GIScience