Tag Archives: PCA

Courtney’s Final Project Post

Research Question

  • “How is the spatial pattern of ion and isotope concentrations in wells tapping the basalt aquifer related to the spatial pattern of mapped faults via the mechanism of groundwater flow as determined by hydraulic transmissivity of the geologic setting?”

Description of dataset that I examined

  • A: In my research I have analytical data for 31 wells, whose XY locations were determined by field confirmation of the Oregon Water Resource Department (OWRD) well log database. As groundwater is a 3D system, I have to consider Z values as well. The well depths and lithology information are also from the OWRD database. My analytical data provides a snapshot of water chemistry during the summer of 2018. I have only one temporal data point per well. At all 31 wells, I collected samples to be analyzed for pH, temperature, specific conductivity, oxygen isotopes 16 and 18, and hydrogen isotopes 1 and 2. At a subset of 18 of those wells I collected additional samples for tritium, carbon 14, and major ion analysis.
  • B: The shapefile of faults mapped at the surface was created by Madin and Geitgey of the USGS in their 2007 publication on the geology of the Umatilla Basin. There is some uncertainty in my analysis as far as extending this surface information into the subsurface. USGS studies have constrained proposed ranges of dip angles for the families of faults that I am studying, but not exact angles for any single mapped fault.
  • C: results of pumping interference tests involving 29 wells, 12 of which I had chemical data for. The data was collected by the OWRD in 2018 and 2019.

Hypotheses

  • Faults can act as conduits or barriers to groundwater flow, depending on how their transmissivity compares to the transmissivity of the host rock.
  • I hypothesize that clusters of similar chemical and isotopic properties of groundwater can indicate a shared aquifer unit/compartment, and that if faults separate clusters then the fault is influencing that difference in chemical/isotopic signatures. If the fault is between two clusters, I hypothesize that it is acting as a barrier. If it crosses through a cluster, I hypothesize that it acts as a conduit.
  • Where faults act as barriers, I hypothesize that parameter values will differ in groups on either side of a fault. Specifically, a barrier fault might cause older, warmer water to rise into upper aquifer layers, and the downstream well might show a signature of more local recharge.
  • Where faults act as conduits, I hypothesize that water chemistry and isotopes of samples from wells on either side of the fault would indicate a relatively direct flowpath from the upstream well to a downstream well. Over a short distance, this means that ion and isotope concentrations would not differ significantly in wells across the fault.
  • My hypotheses depend on a “barrier” fault damming groundwater flow up-gradient of the fault, and compartmentalizing local recharge on the down-gradient side. This hypothesis is only relevant if the fault is roughly perpendicular to the flow direction, and so disrupting transmissivity between a recharge zone and the wells. If a fault that separates two wells is parallel to the flow direction and there is no obstacle between the wells and upstream recharge areas, then the fault might indeed limit communication between the wells but they will have similar chemical signatures. Wells separate by this second kind of fault barrier would be better evaluated by a physical test of communication, such as a pumping interference test.

Analysis Approaches

  • Principal component analysis: used to simplify the multivariate data set (19 variables!) into variable relationships that could represent aquifer processes
  • Analysis of PCA results compared to distance from a flow path
    • Interpolation of well water levels classified by well stratigraphy to estimate a potentiometric surface and evaluate groundwater flow directions.
    • Raster calculations to compare flow direction to fault incidence angle
    • Measuring distance from each well to the nearest fault along the flow path
    • simple linear regression, comparing Non-ion PC1 score of a well with its distance from a fault.
  • Two-sided T-tests comparing distance between wells, presence of a fault, and pumping interference test communication between wells
  • ANOVA F-tests comparing chemical and isotopic variance within groups of wells that communicate with each other and between those groups.

Results

  • Principal component analysis – Patterns of variation between wells are statistically explained primarily by total ion concentration, a relationship between chemical evolution from ion exchange and decreasing stable isotope ratios, and the combination of well depth and water temperature. Moran’s I indicates that only Non-ion PC2 is spatially clustered, while the other PC models have a distribution not significantly different than random. The other PC models are useful to understand the groundwater system, but not specifically to analyze clustering correlated to faults.
  • Interpolation of water level, and comparison of fault incidence angle with flow direction, indicates faults that are and are not able to be tested by my hypotheses.
  • Analysis of PCA results compared to distance from a fault along flow path – some wells that are “within” a fault zone have very old signatures and others have very young signatures. This could be related to the angle of the dip of the fault and the accuracy of mapping compared to the depth of the well. I hypothesize that the wells that are in the fault zone but have high PC1 scores are on the up-gradient side of the fault where older water is upwelling along a barrier. Wells in fault zones with low PC1 scores could indicate wells open to downgradient areas of the fault, where vertical recharge through the fault damage zone is able to reach the well.
  • Returning to the conclusions I wrote in that blog post after I found improved stratigraphic data, I’m not sure if I can make conclusions other than those about the wells are that mapped as “inside” a fault. Several wells that are closer to faults are also open to shallower aquifer units, and so the effect of lower PC1 scores closer downgradient to faults might be confounded by lower PC1 scores caused by vertical inflow from the sedimentary aquifer and upper Saddle Mountain aquifer.
  • Two-sided T-tests comparing distance between wells, presence of a fault, and communication between wells show that the presence of a fault has a greater effect on communication than the distance between the wells.
  • ANOVA F-tests comparing chemical and isotopic variance within groups of wells that communicate with each other and between those groups – stable isotopes and specific conductivity both show more variation between well groups than within well groups.
  • Not covering in these blog posts, I also ran Moran’s I on my inputs to see which ones are clustered and so might be more related to horizontal grouping factors (such as faults) than vertical grouping parameters (such as stratigraphic unit). Of the PCA and individual variables, only d18O, d2H, and Non-ion PC2(combination of well depth and water temperature) were clustered. The other PCA models, temperature, pH, and specific conductivity were not significantly spatially clustered.

Significance –  Groundwater signatures are related to faults agree/disagree with past understandings of differences between wells in the region, and can inform well management. If a senior permit holder puts a call on the water right and asks for junior users to be regulated off, it would not help that senior user if on of those junior permit holders’ wells is not hydraulically connected to the senior users.

  • More wells would need to be sampled to be better able to disentangle the effects of faults from the effects of well stratigraphy.

My learning – I learned significantly more about how to code and troubleshoot in R. Additionally, I learned about the process of performing spatial clustering analysis in ArcGIS.

What did I learn about statistics?

  • PCA was completely new to me, and it’s a cool method for dealing with multivariate data once I dealt with the steep learning curve involved in setting it up and interpreting the results. It was useful getting more practice performing and interpreting t-tests and Anova F-tests. I had not used spatial clustering before, and learning how to apply it was interesting. It gave me a much more concrete tool to try to disentangle the patterns in my effectively 3D data on X,Y plane, as opposed to the Z direction.

Spatial and Temporal Patterns of Reported Salmonella Rates in Oregon

Question: How are values of reported yearly Salmonella rates related to predictors found in previous OLS and GWR analysis at different temporal and spatial lags? Also, to what extent are there regional groupings in Oregon found through Principal Components Analysis (PCA)?

The data used here is the same used in all prior blog posts.

Names of Analytical Tools/Approaches Used

Both temporal and spatial cross-correlation tools were used in my analysis to visualize how values of the predictors I identified in previous analyses varied at higher/lower rates of reported Salmonella. The time series analysis was limited to 2008-2017 which was the extent of my data. Temporal cross-correlation allowed me to visualize how the values of predictors varied at different time lags of Salmonella rates and the spatial cross-correlation allowed me to visualize the variation in the predictors at different spatial lags of Salmonella rates. Results for the temporal and spatial cross-correlation were visualized with ACF plots and cluster plots respectively. Finally, PCA was used to identify noticeable regional groupings as a function of the different variables in my dataset. These results were visualized on a biplot.

Description of the Analytical Process

All cross-correlation analysis was limited to the significant predictors identified in previous regression analysis. Specifically, county Salmonella rate as a function of: county % female, county % child poverty, and county median age.

  1. Temporal Autocorrelation: Data were summarized by year, creating a data frame of median values for Salmonella rates and the predictors for each of the 10 years in my dataset. ACF plots were created for each of the predictors as they varied over different time lags of Salmonella
  2. Spatial Autocorrelation: Data were summarized by county, creating a data frame of median values for Salmonella rates and the predictors for all 36 counties in Oregon. Spatial cross-correlation was carried out using a local spatial weights matrix to create clusters of local indicators of spatial association. The clusters for each of the predictors were visualized on maps of Oregon.
  3. PCA: Oregon was divided into three broad regions based on population distribution and geographical barriers: 1) the Portland metro area consisting of Multnomah, Clackamas, Washington, and Columbia counties, 2) West Oregon consisting of all other counties west of the Cascade Mountain range, and 3) East Oregon consisting of all counties east of the mountain range. Data was summarized by county and every county was associated with a specific region within Oregon. PCA was carried out on the first two principle components because approximately 70% of the variation in the data was explained by these two components. The results of the PCA were visualized in a biplot.

Brief Description of Results

Temporal Cross Correlation:  results show that county percent female was positively associated with Salmonella rates at time lags 1-4 and slightly negatively associated with disease rates at all other lags. Child poverty was also positively associated with Salmonella rates at lags 1-4 and were otherwise negatively associated with disease rates except at distantly negative time lags. Temporal cross correlation analysis of median age and Salmonella incidence rates yielded a similar pattern. It appears the temporal cross-correlation patterns across all three of these variables follow a somewhat sinusoidal curve.

Female:

Child Poverty:

Median Age:

Spatial Cross Correlation: results from a cross correlation analysis of county percent female and Salmonella rates show a large cluster of high county percent female and high rates of Salmonella clustered in the Northwest region of the state. Other clusters are fairly well dispersed around the state. From an analysis of child poverty and Salmonella rates there is a large cluster of high Salmonella rates and high child poverty in the Northeastern area of the state and a large cluster of high Salmonella rates but low child poverty in the Northwest. A similar pattern can be seen in the median age cross-correlation plot.

Female:

Child Poverty:

Median Age:

PCA: Principal Component Analysis showed that the counties comprising the Portland metro area were all characterized by relatively higher income compared to the rest of the state. The Eastern portion of the state can somewhat be characterized by higher median ages and higher proportions of elderly residents. The rest of the western portion of Oregon is not characterized by particularly high values of any of the other variables of interest.

Critique of Methods Used

These analyses support the findings of Exercise 2 where there was evidence to support the existence of a time trend in the rates of reported Salmonella in Oregon counties. Also, the results of the cross-correlation and principal component analysis support the findings in the GWR analysis where different predictors were positively/negatively associated with Salmonella rates depending on the county in which the data was measured. One main critique of the spatial cross-correlation analysis was that through the use of a local spatial weight matrix, only local indicators of spatial indicators of association were determined. This analysis did not include a global spatial weight matrix which could change the spatial associations seen in my results. Also, while PCA was useful in showing that different regions in Oregon were more strongly associated with certain predictors than others, there is considerable overlap between the regions. Thus, it is unknown if these results are significant.

Courtney’s Ex 1: Spatial patterns of ion concentrations in groundwater

1: Question being asked

How do ion and isotope concentrations vary spatially? And what trends in concentrations account for variability between my samples?

For detailed background information, I point you towards my problem statement for this class. In short, I’m hoping to use geochemical and isotope values for wells 32 well I sampled last summer in order to better understand how groundwater flows across faults in the Walla Walla Subbasin in northeastern Oregon. In the analysis for this post, I only used a subset of 18 wells for which I had the largest suite of measured parameters.

2: Approach used

For this exercise, I’m investigating spatial patterns of my ion and isotope concentrations. I could do this by individually examining distributions of all 19 of the parameters that I sampled for… but that would be graphically overwhelming and not terribly useful. It’s too difficult to draw conclusions by comparing 19 different maps of distributions.  Instead, I’m using a method called principal component analysis (PCA for short) to statistically evaluate my samples by the groups of parameters that account for large amounts of the variation that differentiates one sample from another.

The nuts and bolts of PCA involve sophisticated analysis that calculates eigenvectors, linear algebra, and n-dimensional properties of a data set, and I will let people more knowledgeable than me explain this if you’re curious about how exactly PCA works.

https://blog.bioturing.com/2018/06/14/principal-component-analysis-explained-simply/

R spits out two kinds of useful graphic output for PCA based on the concept of a “biplot”. The first is called a loading plot, where two principal components are graphed against each other, for example PC1 and PC2. The value for each parameter in PC1 and PC2 is used to give the point a cartesian coordinate which defines a vector in comparison to the origin. The second kind of output is called a scree plot, which applies the same concept to the sample points (called “individuals” in the code) instead of to the  parameters (called “variables” in the code). I used an R package called ggbiplot to combined the scree and loading plots in one figure.

https://blog.bioturing.com/2018/06/18/how-to-read-pca-biplots-and-scree-plots/

https://learnche.org/pid/latent-variable-modelling/principal-component-analysis/interpreting-score-plots-and-loading-plots

3: Methods

I used R to calculate the principal component vectors for my data and to create charts for them, and then symbolized my data in ArcGIS Pro to see if there was a spatial pattern to the information in the PCA output. The code in R is fairly manageable. I’ve included it below with my comments.

# Source: https://www.datacamp.com/community/tutorials/pca-analysis-r

# Preliminary set up stuff – add libraries, set working directory

library(devtools)

library(ggbiplot)

setwd(“~/_Oregon State University/Thesis/Summer Fieldwork/PCA”)

# Before importing your data, make sure that every entry in its table has a numeric entry in it.

# This process will spit out garbage if you have any blank cells or text.

# The number of rows in the data set needs to exceed the number of columns. I used an argument like [,c(1:11)] to enforce this in my data

#Process 1 of 2: read data set 1 into R, view it, compute PCA, display summaries and biplots

# it’s important to set the row names so we can use them as labels in ggbiplot

PCAdata.ions2 <- read.csv(“ions_allchem20190401_usgsdwnld.csv”, row.names = 1)

View(PCAdata.ions2)

Ions2.pca <- prcomp(PCAdata.ions2[,c(1:11)], center = TRUE, scale. = TRUE)

summary(Ions2.pca)

ggbiplot(Ions2.pca, choices=c(1,2), labels=rownames(PCAdata.ions2))

ggbiplot(Ions2.pca, choices=c(1,3), labels=rownames(PCAdata.ions2))

#read data set 2 into R, view it, compute PCA, display summaries and biplots

PCAdata.nonions <-read.csv(“NonIons_allchem20190401_usgsdwnld.csv”, row.names = 1)

View(PCAdata.nonions)

NonIons.pca <- prcomp(PCAdata.nonions[,c(1:8)],center = TRUE, scale. = TRUE)

summary(NonIons.pca)

ggbiplot(NonIons.pca, labels=rownames(PCAdata.nonions))

ggbiplot(NonIons.pca, choices=c(1,3), labels=rownames(PCAdata.nonions))

# if you want plots other than PC1 vs PC2 or PC1 vs PC3 just change the argument “choices = (1,2)”

# Process 2 of 2 add tools to get individual coordinates

# Source http://www.sthda.com/english/wiki/get-pca-extract-the-results-for-individuals-variables-in-principal-component-analysis-r-software-and-data-mining

# only install the package if you haven’t before. You will need to have RTools 3.5 already installed

# install.packages(“devtools”)

# library(“devtools”)

# install_github(“kassambara/factoextra”)

# now run things, it will spit out tab-delimited data. The “ind” command outputs PC information for the individuals, and the “var” command outputs the PC information for the variables.

library(factoextra)

ind <- get_pca_ind(Ions2.pca)

ind$coord

ind2 <- get_pca_ind(NonIons.pca)

ind2$coord

# copy that info into text files, import them into csv format, join in Arc!

var1 <- get_pca_var(Ions2.pca)

var1$coord

var2< get_pca_var(NonIons.pca)

var2$coord

Next, I symbolized my data in ArcGIS Pro. Because I only had 18 samples, it wasn’t too onerous to use selection in the attribute table to export data to another layer to symbolize the groups I saw in the principle component analysis biplot.

To examine how the principle components affected individual points, I used the second process in the above code to generate a tab-delimited chart of the principal component loadings for each sample. I copied this into a .txt file, imported it into Excel to create a CSV, and then joined that CSV to my shapefile of well locations. For the variables’ coordinates, I likewise created .txt files and imported the into Excel, but I symbolized them in Excel.

4: Results

In this section, I’ll show the graphs and maps that I created for the ion section of my data, and give an explanation of what I learned from this analysis.

For my ion data, the first two out of the 11 principle components explain about 80% of the variation in my data.

The first one, PCI, explains 55% of the variation in my data set and is shown on the X axis of the graph below. It is strongly influenced by concentrations of calcium, magnesium, chloride, alkalinity, sulfate, bromide, and sodium, which all have eigenvalues on the X axis greater than |1|. Comparing the scree plot to the loading plot, wells 56287, 3074, 4179, and 4167 are the primary individual drivers of this PC.

PC2 explains another 25% of the variance in my data, and is shown on the Y axis. Fluoride, silicon, and manganese are the primary variable drivers of variation between the individuals in PC2.

CvS PC1 PC2 ions biplot with groups

Based on the coordinate information for the variables, the variation explained by PC1 is due to increased levels of calcium, magnesium, potassium, sodium, alkalinity, bromide, chloride, and sulfate at four wells. This combination of ions all being similar could be explained by recent recharge influenced by agricultural chemicals, such as fertilizers. In an undisturbed basalt setting though time, calcium and magnesium tend to decrease as sodium and potassium increase. The fact the these four parameters vary together in PCI indicates that a process other than ion replacement is driving ion concentration. The variation in PC2 is predominantly explained by fluoride, dissolved silica, and manganese. Elevated concentrations of these three variables in comparison to the other variables indicates that the groundwater has spent more time in the basalt.In PC2, calcium and magnesium are inversely related to sodium, indicating that the cation concentrations are driven by the ion replacement process.

While the figure above locates wells based on their grouping on a biplot comparing two principle components, the following figures show the wells color-coded by their loadings in PC1 and in PC2. I’m hypothesizing that wells with smaller PC1 values are influenced by agricultural recharge, and wells with more positive values on PC2 are more influenced by the processes of ion dissolution as water travels through the basalt.

color-coded map of PC1 values for the 18 wells

PC1 is heavily driven by four particular wells, which show up here as being in colors of blue and blue-grey.

The distribution of PC2 values shows a pattern that I find really interesting. More negative – bluer – values indicate samples which were less influenced by the ion replacement and dissolution processes of groundwater traveling through the basalt. The values at high elevation are negative, which makes sense because they’re closer to the upland recharge zone, but so do some samples in the lowlands to the north which are downgradient of certain faults. This could indicate that those PC2 negative downgradient wells are cut off from the longer flowpaths of groundwater by faults that act as barriers to groundwater flow. The pinker – more positive – points indicate samples that were more influenced by the along-flowpath ion reactions.

Next, I present the PCA results for the non-ion parameters. Unlike the “ions” set whose title was somewhat self-explanatory, this PCA brought together different types of chemical and physical properties of water. These include pH, temperature, and specific conductivity which were measured in the field,  the analyzed values of hydrogen and oxygen isotopes as well as tritium, the elevation of the bottom of the well, and the cation ratio. The cation ratio is a standard calculated parameter for groundwater chemistry which is equal to ([Na] + [K]) / ([Ca] + [Mg]). As water spends more time in contact with the rock, its ion exchange with the rock leads to a smaller cation ratio. The first biplot graphs PC1 (~40% of variation) on the X  axis against PC2 (22% of variation) on the Y axis, and the second biplot graphs PC1 on the X axis versus PC3 (~18% of variation) on the Y axis.

The PC1 grouping explains variance among the samples by inversely correlating tritium, deuterium (d2H) and oxygen-18 (d18O) with pH and the cation ratio.  More positive PC1 values show that the individual has high d18O and d2H values, and low cation ratio and pH values. All d18O and d2H values are negative; values that are closer to 0 (“larger: for the purposes of this plot) are more enriched in the heavier isotope. PC2 is predominantly driven by an inverse correlation between well bottom elevation and temperature. PC3 inversely correlates tritium, specific conductance, and well bottom elevation with temperature, deuterium, and 18-oxygen.

Yellow/orange points on this map have more positive PC1 loadings. They have higher d18O and d2H values, and lower cation ratio and pH values. The lower cation ratio and pH values indicate that the sample is less chemically evolved based on the expected ion exchange reactions in the basalt. I found it interested that samples on that end of the spectrum were also more enriched in the heavier isotopes of hydrogen and oxygen.

Unfortunately I can’t seem to upload the maps for PC2 and PC3 for non-ions, I’m getting a message that all of my allowance for image uploads has been used up…

5: Critique

I’m pretty happy with what I put together using principal component analysis. In a perfect world I would have learned to do the spatial visualization elegantly in R instead of messing around with exporting data to Arc, and made my variable charts in R instead of exporting them to excel. However for the purposes of this project the less elegant method was quicker for me.