Final Project: Characterizing the exclusion of refugee settlements from global settlement datasets

Anna Ballasiotes

RESEARCH QUESTION

This research is a small part of a larger question: where are refugees? That question is itself the starting point for a larger problem: how will the global community provide resources to those who need help? Answering it requires understanding where these vulnerable populations are. In considering where IDPs and refugees are, certain drivers create a ‘refugee population,’ notably conflict (most often varying levels of violent conflict) and, more recently, climate events (including natural disasters). Beyond these drivers, refugee populations fall into different typologies: refugees within cities who blend into existing infrastructure, refugees in settlements and camps close to borders, refugees in settlements close to urban centers, and refugees in settlements and camps in areas where no population had been established and land was therefore available. My focus lies largely in the last item of that list: areas where populations have not been formally established but where camps and settlements are now expanding. It is a known problem that global settlement datasets do not adequately recognize these camps and settlements as “population.” My goal was therefore to characterize how these global settlement datasets fail to recognize these camps and settlements, and thus exclude these populations. The guiding question is: to what extent do global settlement datasets fail to detect known refugee camps and settlements that form in areas of low population density?

I originally sought to answer the question, “what about the classification metrics of global settlement datasets causes them to miss settlements known to OSM?” I wasn’t able to truly answer this question, because I first needed to characterize the exclusion itself before I could understand why these global population datasets were unsuccessful. Characterizing where this exclusion occurred was difficult enough and took most of the class, especially as I explored what these refugee settlements looked like and considered how they may have formed.

DATA

I used three main data sources: refugee settlement boundaries from OpenStreetMap (OSM), the Facebook High Resolution Settlement Dataset, and the World Settlement Footprint (WSF). I used only data within Uganda, given the overlap of data availability and project partners with reliable data. The Facebook High Resolution Settlement Dataset is trained on 2018 imagery and is presented at a 30 meter resolution. World Settlement Footprint is trained on 2015 imagery and is also presented at 30 meter resolution, but with a minimum mapping unit mask that creates clusters of pixels, which makes the data appear to have a coarser resolution. The refugee settlement boundaries from OSM date from 2016 to 2019; however, these dates record only when the boundaries were added to the OpenStreetMap database, not necessarily when the camps or settlements were established. I had initially included the Global Human Settlement Layer (GHSL), which is also trained on 2015 imagery, but GHSL did not appear to detect any settlement within the OSM refugee camp boundaries, so I was not able to perform any meaningful analysis with it.

HYPOTHESES: predictions of patterns and processes.

I wanted to understand in what ways the global settlement datasets excluded the known refugee settlements. Specifically, I wanted to know what percentage of each refugee settlement was captured, whether larger or smaller settlements showed a higher percentage of captured area, and whether settlements with a higher percentage of captured area showed a faster green-up as one moved away from the boundary. I suspected that smaller settlements would have a higher percentage of area captured, mostly because I predicted that smaller settlement boundaries conformed more closely to the actual built-up settlement area, whereas the larger settlements included more land cover types that would be less likely to be detected as “built-up.” I also suspected that settlements with a higher detection percentage would have a more distinct natural boundary: the green-up around the settlement would occur faster if a higher percentage was detected.

APPROACHES: analysis approaches used.

I had initially measured the spatial autocorrelation of the refugee settlements using ArcGIS Pro, but this analysis did not tell me much that was useful. I already knew that my settlements were quite different from each other in size; any clustering would have come from other factors that are not obviously visible, like the way land was distributed or delegated for refugee settlements.

To characterize the percentage of refugee settlement area captured by each settlement dataset, I brought everything into Earth Engine and calculated the area of every pixel within the refugee boundaries, and then the area of the WSF and Facebook pixels within those boundaries, counting only the pixels detected as settlement.
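The percent-area calculation can be illustrated outside Earth Engine with a minimal sketch. It is not the Earth Engine workflow itself: the function name, the toy grids, and the fixed 30 m square pixel are all assumptions made for illustration, and real pixel areas vary with projection.

```python
# Hypothetical helper and toy data; none of these names come from the project.
PIXEL_AREA_M2 = 30 * 30  # assumes square 30 m pixels, ignoring projection distortion


def percent_area_detected(inside_boundary, detected, pixel_area_m2=PIXEL_AREA_M2):
    """Return (boundary_area_m2, detected_area_m2, percent) for two same-shaped 0/1 grids.

    inside_boundary: 1 where a pixel falls inside the refugee settlement boundary.
    detected:        1 where the global dataset flags the pixel as settlement.
    """
    boundary_pixels = 0
    detected_pixels = 0
    for boundary_row, detected_row in zip(inside_boundary, detected):
        for in_boundary, is_settlement in zip(boundary_row, detected_row):
            if in_boundary:
                boundary_pixels += 1
                # Only count detections that fall inside the boundary.
                if is_settlement:
                    detected_pixels += 1
    boundary_area = boundary_pixels * pixel_area_m2
    detected_area = detected_pixels * pixel_area_m2
    percent = 100.0 * detected_area / boundary_area if boundary_area else 0.0
    return boundary_area, detected_area, percent


# Toy 3 x 4 rasters: a settlement boundary and a WSF-like detection layer.
boundary = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
wsf_like = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

area, hit, pct = percent_area_detected(boundary, wsf_like)
print(f"{hit} of {area} m^2 detected ({pct:.1f}%)")
```

Running the same function once per settlement and per dataset would give the per-boundary percentages that the size comparison in the hypotheses relies on.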

To refine the comparison of area detected within refugee settlements, I also needed to look only at pixels that appear “built-up” within the refugee settlements, since the boundaries themselves contain quite a lot of green space and other land cover types. I therefore performed a binary unsupervised classification based on the known locations of refugee settlements. Ultimately, I used a confusion matrix to compare this binary unsupervised classification to WSF and Facebook to refine the “percent area detected.”
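A confusion matrix between two binary maps can be sketched as follows. This is an illustration under stated assumptions: the function names and the toy pixel labels are hypothetical, and the unsupervised classification is treated as the reference layer even though it is not ground truth.

```python
# Hypothetical helpers and toy labels; not the project's actual data or code.
def confusion_counts(reference, candidate):
    """Count agreement between two equal-length sequences of 0/1 pixel labels.

    reference: 1 where the local classification calls the pixel built-up.
    candidate: 1 where the global dataset calls the pixel settlement.
    """
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for ref, cand in zip(reference, candidate):
        if ref and cand:
            counts["tp"] += 1  # both layers say built-up
        elif cand:
            counts["fp"] += 1  # global dataset only
        elif ref:
            counts["fn"] += 1  # local classification only (missed by global dataset)
        else:
            counts["tn"] += 1  # neither layer says built-up
    return counts


def recall(counts):
    """Share of reference built-up pixels that the global dataset also detects."""
    denominator = counts["tp"] + counts["fn"]
    return counts["tp"] / denominator if denominator else 0.0


# Toy pixels sampled inside one settlement boundary.
local_built_up = [1, 1, 1, 0, 0, 1, 0, 1]
facebook_like = [1, 0, 1, 0, 1, 0, 0, 1]

matrix = confusion_counts(local_built_up, facebook_like)
print(matrix, f"recall={recall(matrix):.2f}")
```

Here recall plays the role of a refined “percent detected”: it measures detection only against pixels that look built-up, rather than against the whole boundary.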

I also looked for a relationship between detection and greenness moving away from the boundary. To do this, I created a multi-ring buffer out to 1 kilometer at 100 meter increments and calculated the average maximum NDVI in each “donut buffer” for 2018 using Landsat 8 imagery. I then plotted these NDVI values against increasing distance from the refugee boundary to see the green-up.
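The ring-by-ring averaging can be sketched in a few lines. This assumes each pixel outside the boundary has already been reduced to a distance from the boundary and a 2018 maximum NDVI value; the function name and the sample values are hypothetical and only illustrate the binning step, not the buffering or the Landsat 8 processing.

```python
# Hypothetical helper and made-up samples; only the 100 m / 1 km ring layout comes from the text.
def mean_ndvi_by_ring(samples, ring_width_m=100, max_distance_m=1000):
    """Average NDVI within concentric rings around a settlement boundary.

    samples: iterable of (distance_from_boundary_m, max_ndvi) pairs.
    Returns a list of (inner_m, outer_m, mean_ndvi_or_None), one entry per ring.
    """
    n_rings = max_distance_m // ring_width_m
    sums = [0.0] * n_rings
    counts = [0] * n_rings
    for distance_m, ndvi in samples:
        # Ignore pixels inside the boundary (negative) or beyond the outermost ring.
        if 0 <= distance_m < max_distance_m:
            ring = int(distance_m // ring_width_m)
            sums[ring] += ndvi
            counts[ring] += 1
    rings = []
    for i in range(n_rings):
        mean = sums[i] / counts[i] if counts[i] else None
        rings.append((i * ring_width_m, (i + 1) * ring_width_m, mean))
    return rings


# Toy samples: (distance in meters, maximum NDVI).
toy_samples = [(20, 0.30), (80, 0.34), (150, 0.45), (420, 0.60), (480, 0.62), (1200, 0.70)]

rings = mean_ndvi_by_ring(toy_samples)
for inner, outer, mean in rings:
    label = "no samples" if mean is None else f"{mean:.2f}"
    print(f"{inner:>4}-{outer:<4} m: {label}")
```

Plotting the ring means against the ring distances gives the green-up curve; a steeper rise near the boundary would correspond to the "more distinct natural boundary" in the hypothesis.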

RESULTS

I produced a map showing where WSF, Facebook, and a binary unsupervised classifier agree and disagree within my settlement boundaries. I have refined this map a bit, but there are still further statistical comparisons that could be done to better characterize the overlap of WSF, Facebook, and the K-Means classifier.

Ayilo 1 & 2 Refugee Settlements; Classifications shown

I also have all of the settlement boundaries with the percent of area captured by each global dataset within each boundary. This could serve as a sorting mechanism for further analysis to better answer my hypothesis, which I didn’t necessarily answer well, specifically with regard to the NDVI prediction. Within the exercises, I answered this question based on geography rather than percent detection: did settlements in the north vs. south show different patterns of vegetation green-up around the settlement?

SIGNIFICANCE What did you learn from your results? How are these results important to science? to resource managers?

I learned that across the board, refugee settlements ARE systematically excluded from global datasets but can easily be detected at the local scale, and thus the systematic bias of these datasets must be addressed and solved. Those who model systems of migration or provide access to humanitarian resources need more accurate representations of settlements. Beyond this, I was really hoping to characterize the spatial patterns of the exclusion better than I was able to; I think that’s where my analysis fell short and where my next hypothesis would come from.

LEARNING: what did you learn about software (a) Arc-Info, (b) Modelbuilder and/or GIS programming in Python, (c) R, (d) other? What about statistics? including (a) hotspot, (b) spatial autocorrelation (including correlogram, wavelet, Fourier transform/spectral analysis), (c) regression (OLS, GWR, regression trees, boosted regression trees), (d) multivariate methods (e.g., PCA),  and (e) or other techniques?

I didn’t truly dive more deeply into these software packages than I have in the past; I think I just became more acutely aware of the statistical methods that exist in Arc toolboxes and became more comfortable manipulating data in R.

I think that the statistical learning was less specific or intense than I was hoping, in part because I was really challenged just to think statistically and spatially-statistically about my research question. I wish I had spent more time understanding statistical methods rather than trying to mold my question into something that seemed to fit my elementary understanding of spatial statistics.
