Introduction
I am working with a dataset from a dynamic global vegetation model (DGVM) run across the Pacific Northwest (PNW) over the time period 1895-2100. This is a process-based model that includes a dynamic fire model. I am particularly interested in drivers of the fire regimes (the frequency and intensity of wildfire) produced by the model. The general question I am studying is “what, if any, is the relationship between the climatic and elevation inputs of the model and the fire regimes produced by the model?” A more specific pair of questions for this method is “Do fire regimes cluster locally?” and “How do fire regimes change over time?”
Method description
Overview
I divided the data into two time periods, 1901-2000 and 2001-2100, to capture the historical and projected climates and fire regimes over those periods. To define a fire regime, I chose the two fire-related output variables (carbon consumed by fire and fraction of model pixel burned) along with fire return interval (FRI) over the century time period. I used k-means cluster analysis in R to define four fire regimes.
K-means clustering divides a dataset into a specified number of data point clusters and calculates centroids and cluster membership such that the Euclidean distance between each cluster’s centroid and its members is minimized. The relative scales of the input variables should be taken into consideration in this type of analysis, as they affect the Euclidean distances calculated. The steps in the analysis are outlined below. The actual commands used for the analysis are given in Appendix A.
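Because k-means relies on Euclidean distance, a variable with a large numeric range can dominate the clustering. The following is a minimal sketch of this rescaling step in R, using randomly generated toy data rather than the project dataset:

```r
## Toy illustration: two variables on very different scales.
## Values are randomly generated, not from the DGVM output.
set.seed(1)
df = data.frame(
FRI = runif(50, 1, 100), # years between fires
PBurn = runif(50, 0, 1) # fraction of pixel burned
)
## Min-max normalize each column to [0, 1] so that neither
## variable dominates the Euclidean distance
normDF = as.data.frame(lapply(df, function(v) (v - min(v)) / (max(v) - min(v))))
## Cluster the rescaled data into four groups
cl = kmeans(normDF, centers = 4, algorithm = "Lloyd", iter.max = 500)
table(cl$cluster) # cluster sizes
```

Without the rescaling, the FRI column (range roughly 1 to 100) would contribute almost all of the distance between points, and the fraction-burned column (range 0 to 1) almost none.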
Outline of steps
For each of the two input files:
- Open the NetCDF file
- Reverse the order of the latitude values (change from highest to lowest to lowest to highest)
- Filter out the NA (no data) values
- Copy data (carbon consumed by fire, fraction pixel burned, FRI) into a combined data set
For the combined dataset:
- Normalize values for each variable to 0.0 to 1.0 using the minimum and maximum of each variable in the combined dataset.
- Generate the four clusters
- Output value distributions of the four clusters
- Divide the results into two output datasets, one for each of the two time periods
- Distribute the results into the non-NA data locations
For each of the output datasets:
- Create output NetCDF file
- Write data to NetCDF file
- Close NetCDF file
Results
Clusters
With my input data, the FRI values range from a minimum of 1 to a maximum of 100 years. The mean annual fraction of area burned has a theoretical range from 0 to 1, and the mean annual carbon consumed ranges from 0 to 266.5 g m-2. Performing the cluster analysis on the original input values resulted in 3 of the 4 clusters being driven primarily by the values of carbon consumed (Fig. 1a-c). Normalizing the values of each input variable using the variable’s minimum and maximum resulted in clusters driven by different variables (Fig. 2). For the remainder of the project I am using normalized variables.
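The min-max rescaling can be sketched as follows; the vectors below are illustrative values chosen from the ranges reported above, not the actual model output:

```r
## Min-max rescaling to [0, 1], as applied to each input variable
normalize = function(v) (v - min(v)) / (max(v) - min(v))
FRI = c(1, 25, 100) # years; reported range 1 to 100
Consumed = c(0, 50, 266.5) # g m-2; reported range 0 to 266.5
normalize(FRI) # approx. 0.000 0.242 1.000
normalize(Consumed) # approx. 0.000 0.188 1.000
```

After rescaling, a difference of 0.5 in any one variable contributes equally to the Euclidean distance, regardless of the variable's original units.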
Fig. 1: Z-score distributions of input variables within clusters, for clusters computed without normalizing the input data. Clusters 1, 2, and 3 (A-C) exhibit strong overlap in both fraction of area burned and time between fires, and strong shifts in carbon consumed between clusters. Cluster 4 (D) is the only cluster in which all three factors differ substantially from the other clusters.
Fig. 2: Distributions of normalized input variables within clusters. Each of the clusters was created by performing cluster analysis using normalized input data. All clusters have substantial shifts in the distributions of at least two factors compared to the other clusters.
Result maps
The clustering of fire regimes can easily be seen in maps of the fire regimes (Fig. 3). Areas without fire tend to be at higher elevations: in the Coast Range and Cascades of Oregon and Washington and the Rockies of western Montana and northwestern Idaho. High-frequency fires (low FRI) are common in the plains and plateaus of southern Idaho, south and central Oregon, and the Columbia Plateau in central Washington. The other two fire regimes, both of which have a medium FRI, are somewhat intermingled but are present in mid elevations in the Rockies, in southwest Oregon, and in the Willamette Valley and Puget Trough regions of Oregon and Washington, respectively.
Fig. 3: Fire regime maps for (A) the historical period (1901-2000) and (B) the future period (2001-2100). (C: carbon consumed; Med: medium; Frac: fraction of grid cell; No Fire: no fire occurred in the cell over the time period)
Fire regime changes from the historical (Fig. 3A) to the future (Fig. 3B) period include the appearance of the High Carbon Consumed, Medium Fraction Burned, Medium FRI fire regime around the edges of the plains and plateaus and into the Blue Mountains of northeastern Oregon, as well as the spread of both Medium FRI fire regimes into the Coast, Cascade, and Rocky Mountain ranges.
Method critique
I found this method to be very useful for my study. Fire and vegetation modeling results are commonly expressed using single variables over time or area and fail to take into consideration the relationships between multiple variables. This work indicates that breaking sets of multiple fire-related variables into clusters provides a way to better characterize regional fire regimes as well as their spatial changes through time.
I did find the data manipulation required for the method a bit tricky. Having to work around the na values within the data required a bit of labor. Converting matrix data to vector data for the methods was another minor inconvenience. All that said, however, I believe such manipulations will become second nature as my experience with R grows.
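The NA bookkeeping described above follows a common pattern: drop the NA cells before the analysis, then write results back into the original grid positions. A minimal sketch with made-up values:

```r
## Toy grid vector with missing (no-data) cells
grid = c(0.2, NA, 0.7, NA, 0.9)
valid = na.omit(grid) # values passed to the analysis
labels = rep(1, length(valid)) # stand-in for cluster assignments
## Restore the results to their original grid positions
out = rep(NA_real_, length(grid))
out[!is.na(grid)] = labels
out # 1 NA 1 NA 1
```

The key detail is that `!is.na(grid)` indexes the same positions that `na.omit()` removed, so the results line up with the original grid when reshaped back into a matrix.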
Appendix A: Annotated R commands used for the analysis of normalized input data.
## Cluster analysis on normalized data.
## The dataset is drawn from two files, one for the 20th
## century and one for the 21st
## Load the necessary libraries
library("DescTools") # for displaying data
library("ncdf4") # reading and writing NetCDF files
## Set the current working directory for data and results
setwd('/Users/timsheehan/Projects/MACA/FireClustering/Analysis')
## Open the NetCDF files for reading
histnc = nc_open('SummaryOver_1901-2000.nc')
futnc = nc_open('SummaryOver_2001-2100.nc')
## Get the lat and lon variables
lon = ncvar_get(histnc,'lon')
lat = ncvar_get(histnc,'lat')
## The lat dimension is highest to lowest in the input
## file, so its vector needs to be reversed.
## Reverse lat into ascending order (lower case r rev)
lat = rev(lat)
## Create the data frames for historical, future, and difference data
normHist = expand.grid(lon=lon, lat=lat)
normFut = expand.grid(lon=lon, lat=lat)
normDiff = expand.grid(lon=lon, lat=lat)
## Get the fields from the input files
## Note the lat dimension comes in from high to low, so
## it needs to be reversed so that R can display it (upper case R Rev)
## Consumed is the annual mean for carbon consumed
## PBurn is the annual mean of fraction of area burned
## FRI is the mean number of years between fires
normHist$Consumed = c(Rev(ncvar_get(histnc,'CONSUMED_mean'),margin = 2))
normHist$PBurn = c(Rev(ncvar_get(histnc,'PART_BURN_mean'),margin = 2))
normHist$FRI = c(Rev(ncvar_get(histnc,'PART_BURN_intvl'),margin = 2))
normFut$Consumed = c(Rev(ncvar_get(futnc,'CONSUMED_mean'),margin = 2))
normFut$PBurn = c(Rev(ncvar_get(futnc,'PART_BURN_mean'),margin = 2))
normFut$FRI = c(Rev(ncvar_get(futnc,'PART_BURN_intvl'),margin = 2))
## Normalize the values prior to doing the analysis
## Also get the z scores for distribution plotting later
## Note that NA values from the data must be omitted before
## taking the statistics
## Loop over the three variables used in the clusters
## for each one, filter out cells with no data using na.omit
## Normalize the data based on the min and max value of each
## variable.
for(nm in c('Consumed','PBurn','FRI')) {
## Temporary data for efficient processing
tmpVect = c(na.omit(normHist[[nm]]),na.omit(normFut[[nm]]))
tmpMin = min(tmpVect)
tmpMax = max(tmpVect)
tmpDiff = tmpMax - tmpMin
tmpMean = mean(tmpVect)
tmpSD = sd(tmpVect)
## Z scores (computed before rescaling, using the raw-data mean and SD)
normHist[[paste(nm,'_Z',sep='')]] = (normHist[[nm]] - tmpMean) / tmpSD
normFut[[paste(nm,'_Z',sep='')]] = (normFut[[nm]] - tmpMean) / tmpSD
## Normalize the data
normHist[[nm]] = (normHist[[nm]] - tmpMin) / tmpDiff
normFut[[nm]] = (normFut[[nm]] - tmpMin) / tmpDiff
}
## Create the data frame for clustering
## Again, we omit data with no value using na.omit
normWorkDF = data.frame('Consumed' = c(na.omit(normHist$Consumed),na.omit(normFut$Consumed)))
normWorkDF$PBurn = c(na.omit(normHist$PBurn),na.omit(normFut$PBurn))
normWorkDF$FRI = c(na.omit(normHist$FRI),na.omit(normFut$FRI))
## Perform the clustering analysis
## This is the point of all the work being done here
## The Lloyd algorithm runs efficiently on the large dataset
## (~100k points)
normCluster = kmeans(
data.frame(normWorkDF$Consumed,normWorkDF$PBurn,normWorkDF$FRI),
4,
algorithm="Lloyd",
iter.max=500
)
## Copy the clusters back to the work data frame
normWorkDF$Clust = normCluster$cluster
# Plot the cluster distributions
for(ndx in 1:4) {
fNm = paste(
'/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormClust_',
ndx,
'.png',
sep=''
)
png(fNm)
tmpDF = subset(normWorkDF,normWorkDF$Clust == ndx)
plot(
density(tmpDF$Consumed),
xlim=c(0,1),
ylim=c(0.0,6.0),
col='darkblue',
xlab='Normalized Value',
main=paste('Cluster',ndx,'Distribution')
)
lines(density(tmpDF$PBurn),col='green')
lines(density(tmpDF$FRI),col='red')
legend(
'topright',
legend=c('C Consumed','Fract. Area Burned','Time Betw. Fires'),
lwd=1,
col=c('darkblue','green','red')
)
dev.off()
}
## Add the cluster numbers into the original data frames
## Note that NA needs to be taken into account using
## [!is.na(hist$Consumed)] to get the indexes of those data
## items that are not NA
## Historical values are in first half of data frame, future in second
## Calculate indexes for clarity
histClustStartNdx = 1
histClustEndNdx = length(normCluster$cluster)/2
futClustStartNdx = length(normCluster$cluster)/2 + 1
futClustEndNdx = length(normCluster$cluster)
normHist$Clust = NA
normHist$Clust[!is.na(normHist$Consumed)] = normWorkDF$Clust[histClustStartNdx:histClustEndNdx]
normFut$Clust = NA
normFut$Clust[!is.na(normFut$Consumed)] = normWorkDF$Clust[futClustStartNdx:futClustEndNdx]
## Create matrices for results and display
## png() tells R what file to store the results to
png('/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormHistClustMap.png')
normHistClustArr = matrix(normHist$Clust,nrow=331,ncol=169)
image(lon,lat,normHistClustArr)
dev.off()
png('/Users/timsheehan/Projects/MACA/FireClustering/Analysis/Graphs/NormFutClustMap.png')
normFutClustArr = matrix(normFut$Clust,nrow=331,ncol=169)
image(lon,lat,normFutClustArr)
dev.off()
## Calculate Euclidean distance between cluster centers
normClustCtrEucDist = data.frame(cbind(rep(0,4),rep(0,4),rep(0,4),rep(0,4)))
for(ndx_1 in 1:4) {
for(ndx_2 in 1:4) {
normClustCtrEucDist[ndx_1,ndx_2] =
sqrt(
(normCluster$centers[ndx_1,1] - normCluster$centers[ndx_2,1]) ^ 2 +
(normCluster$centers[ndx_1,2] - normCluster$centers[ndx_2,2]) ^ 2 +
(normCluster$centers[ndx_1,3] - normCluster$centers[ndx_2,3]) ^ 2
)
}
}
## Create Fire Regime Euclidean distance map between historical and future
normDiff$HistClust = normHist$Clust
normDiff$FutClust = normFut$Clust
normDiff$EucDist = NA
normDiff$EucDist = mapply(
function(x,y) ifelse(is.na(y), NA, normClustCtrEucDist[x,y]),
normDiff$HistClust,
normDiff$FutClust
)
normDiffEucDistArr = matrix(normDiff$EucDist,nrow=331,ncol=169)
image(lon,lat,normDiffEucDistArr)
## Create NetCDF file and store results there. This yields a
## NetCDF file that can be used independently of R
x = ncdim_def('lon','degrees',lon)
## Reverse the order of latitude values
y = ncdim_def('lat','degrees',rev(lat))
histFCNCVar = ncvar_def('HistFireCluster','classification',list(x,y),-9999,prec='short')
futFCNCVar = ncvar_def('FutFireCluster','classification',list(x,y),-9999,prec='short')
eucDistNCVar = ncvar_def('EucDistance','NormalizedDistance',list(x,y),9.9692099683868690e+36,prec='float')
# Reverse the lat order of the values being output
OutHistClust = c(Rev(matrix(normDiff$HistClust,ncol=length(lat),nrow=length(lon)),margin=2))
OutFutClust = c(Rev(matrix(normDiff$FutClust,ncol=length(lat),nrow=length(lon)),margin=2))
OutEucDist = c(Rev(matrix(normDiff$EucDist,ncol=length(lat),nrow=length(lon)),margin=2))
## Output the variables and close the file
ncOut = nc_create('NormClusters.nc',list(histFCNCVar,futFCNCVar,eucDistNCVar))
ncvar_put(ncOut,histFCNCVar,OutHistClust)
ncvar_put(ncOut,futFCNCVar,OutFutClust)
ncvar_put(ncOut,eucDistNCVar,OutEucDist)
nc_close(ncOut)