By Florence Sullivan, MSc student, GEMM Lab
Over the past few months I have been slowly (and I do mean SLOWLY – I don’t believe I’ve struggled this much with learning a new skill in a long, long time) learning how to work in “R”. For those unfamiliar with why a simple letter might cause me so much trouble, R is a programming language and free software environment suitable for statistical computing and graphing.
My goal lately has been to interpolate my whale tracklines (i.e. smooth out the gaps where we missed a whale’s surfacing by inserting artificial locations). In order to do this I needed to know (1) How long does a gap between fixes need to be to identify a missed surfacing? (2) How many artificial points should be used to fill a given gap?
The best way to answer these queries was to look at a distribution of all of the time steps between fixes. I started by importing my dataset: the latitude and longitude, date, time, and unique whale identifier for each point (over 5000 of them) we recorded last summer. I converted the locations into x & y coordinates, adjusted the date and time stamp into the proper format, and used the package adehabitatLT to calculate the difference in times between each fix. A package known as ggplot2 was useful for creating exploratory histograms, but my data was incredibly skewed (Fig. 1)! It appeared that the majority of our fixes happened less than a minute apart from each other. When you recall that gray whales typically take 3-4 short breaths at the surface between dives, this starts to make a lot of sense, but we had anticipated a bimodal distribution with two peaks: one for the quick surfacings, and one for the surfacings between 4-5 minute dives. Where was this second peak?
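As a rough illustration of the time-step calculation, here is a minimal base-R sketch. It is a stand-in, not the adehabitatLT workflow described above, and the data frame, its column names (`whale_id`, `datetime`), and all values are made up for the example.

```r
# Hypothetical example data: column names and values are invented for illustration
fixes <- data.frame(
  whale_id = c("A", "A", "A", "B", "B"),
  datetime = as.POSIXct(
    c("2015-08-01 10:00:00", "2015-08-01 10:00:15", "2015-08-01 10:04:30",
      "2015-08-01 11:00:00", "2015-08-01 11:00:20"),
    format = "%Y-%m-%d %H:%M:%S", tz = "UTC"
  )
)

# Sort so that each whale's fixes are in chronological order
fixes <- fixes[order(fixes$whale_id, fixes$datetime), ]

# Seconds since the previous fix of the same whale (NA for each whale's first fix)
fixes$dt_sec <- ave(
  as.numeric(fixes$datetime),
  fixes$whale_id,
  FUN = function(t) c(NA, diff(t))
)

print(fixes)
```

Grouping by whale matters here: without it, the first fix of one whale would be differenced against the last fix of another, creating a bogus interval.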
Sometimes, calculating the logarithm of one of your axes can help tease out more patterns in your data, particularly in a heavily skewed distribution like Fig. 1. When I took the natural log of the time interval data, our expected bimodal distribution pattern became evident (Fig. 2). And, when I back-calculate from the center of the two peaks, we see that the first peak occurs at less than 20 seconds (e^2.5 ≈ 12 secs), representing the short, shallow blow intervals, or interventilation dives, and that the second peak of dives spans ~2.5 minutes to ~5 minutes (e^4.9 ≈ 134 secs, e^5.7 ≈ 299 secs). Reassuringly, these dive intervals are in agreement with the findings of Stelle et al. (2008), who described the mean interval between blows as 15.4 ± 4.73 seconds, and overall dives ranging from 8 seconds to 11 minutes.
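To show the idea of logging a skewed variable and then back-calculating, here is a small sketch on simulated intervals. The numbers are invented (not the whale data), and base R's `hist()` is used instead of ggplot2 just to keep the example self-contained.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated intervals in seconds (not real whale data): many short blow
# intervals around 12 s plus fewer long dives around 200 s
intervals <- c(
  rlnorm(800, meanlog = log(12), sdlog = 0.4),
  rlnorm(200, meanlog = log(200), sdlog = 0.3)
)

# On the raw scale the short intervals dominate; on the natural-log scale
# the two groups separate into two peaks
log_intervals <- log(intervals)

# Bin the logged values; call plot(h) to draw the histogram
h <- hist(log_intervals, breaks = 40, plot = FALSE)

# Location of the tallest bin on the log axis
peak_log <- h$mids[which.max(h$counts)]

# Back-transform from the log axis to seconds with exp()
peak_sec <- exp(peak_log)
print(c(peak_log = peak_log, peak_sec = peak_sec))
```

Because `log()` in R is the natural log, `exp()` is the matching back-transformation; if you plot with a base-10 log instead, you would back-calculate with `10^x`.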
So, now that we know what the typical dive patterns in this dataset are, the trick was to write code that would look through each trackline and identify gaps of greater than 5 minutes. Then, the code calculates how many artificial points to create to fill the gap, and where to put them.
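The general logic of that gap-filling step might look something like the sketch below. This is not the code Rachael or I actually used; it is a minimal illustration that assumes straight-line (linear) interpolation between fixes, projected x/y coordinates, and a hypothetical rule that artificial points should sit no more than 60 seconds apart. The function name `fill_gaps` and its arguments are made up for the example.

```r
# Fill gaps longer than max_gap seconds in one whale's track by linear
# interpolation. All names and the spacing rule are illustrative assumptions.
fill_gaps <- function(time_sec, x, y, max_gap = 300, step = 60) {
  out <- data.frame(time_sec = time_sec, x = x, y = y, artificial = FALSE)
  gaps <- diff(time_sec)
  new_times <- numeric(0)
  for (i in which(gaps > max_gap)) {
    # Number of points that keeps the spacing within the gap at most `step`
    n_new <- ceiling(gaps[i] / step) - 1
    # Evenly spaced times between the two real fixes, endpoints dropped
    t_new <- seq(time_sec[i], time_sec[i + 1], length.out = n_new + 2)
    new_times <- c(new_times, t_new[-c(1, n_new + 2)])
  }
  if (length(new_times) > 0) {
    # Interpolate x and y separately against time
    filled <- data.frame(
      time_sec = new_times,
      x = approx(time_sec, x, xout = new_times)$y,
      y = approx(time_sec, y, xout = new_times)$y,
      artificial = TRUE
    )
    out <- rbind(out, filled)
    out <- out[order(out$time_sec), ]
    rownames(out) <- NULL
  }
  out
}

# Made-up track: fixes at 0, 20, 40 s, then a 400 s gap, then two more fixes
track <- fill_gaps(
  time_sec = c(0, 20, 40, 440, 460),
  x = c(0, 10, 20, 220, 230),
  y = c(0, 0, 0, 100, 100)
)
print(track)
```

Keeping an `artificial` flag on every row makes it easy to drop or highlight the inserted points later, so the real fixes are never confused with the interpolated ones.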
One of the most frustrating parts of this adventure for me has been understanding the syntax of the R language. I know what calculations or comparisons I want to make with my dataset, but translating my thoughts into syntax for the computer to understand has not been easy, especially with error messages such as:
Error in match.names(clabs, names(xi)) :
names do not match previous names
Solution: I had to go line by line and verify that every single variable name matched, and it turned out that a capital letter in the wrong place was throwing the error!
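For anyone who hits the same message, here is one common way it can arise, sketched with two made-up data frames: `rbind()` matches data frame columns by name, so a single mismatched capital letter is enough. The column values are invented, and this may not be exactly how the error appeared in my script.

```r
# Two made-up data frames whose column names differ only by a capital letter
a <- data.frame(Whale.ID = "W1", Site = "S1")
b <- data.frame(Whale.Id = "W2", Site = "S1")

# rbind() on data frames matches columns by name, so this fails
msg <- tryCatch(
  {
    rbind(a, b)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Listing the names that are not shared points straight at the culprit
mismatched <- setdiff(names(b), names(a))
print(mismatched)

# After fixing the name, the rows bind as expected
names(b)[names(b) == "Whale.Id"] <- "Whale.ID"
combined <- rbind(a, b)
print(combined)
```

Using `setdiff()` on the two sets of names is much faster than checking line by line.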
Error in as.POSIXct.default(time1) :
do not know how to convert ‘time1’ to class “POSIXct”
Solution: a weird case where the data was in the correct time format, but was not being recognized, so I had to re-import the dataset as a different file format.
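Re-importing is what worked for me, but a general check that can help with date-time trouble is to look at how R is actually storing the column and then convert it with an explicit format. The sketch below uses made-up `date` and `time` columns and assumes UTC; it is not the fix for my specific case.

```r
# Made-up date and time columns stored as text, as they often arrive from a CSV
raw <- data.frame(
  date = c("2015-08-01", "2015-08-01"),
  time = c("10:00:00", "10:04:30"),
  stringsAsFactors = FALSE
)

# Check how R is actually storing the columns before converting
str(raw)

# Combine date and time, then convert with an explicit format and time zone
raw$datetime <- as.POSIXct(
  paste(raw$date, raw$time),
  format = "%Y-%m-%d %H:%M:%S",
  tz = "UTC"
)

# Confirm the class and compute the interval between the two fixes in seconds
print(class(raw$datetime))
gap_sec <- as.numeric(difftime(raw$datetime[2], raw$datetime[1], units = "secs"))
print(gap_sec)
```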
Error in data.frame(Whale.ID = Whale.ID, Site = Site, Latitude = Latitude, : arguments imply differing number of rows: 0, 2, 1
Solution: HELP! Yet to be solved….
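While I have not solved my own case, the message itself says that the vectors handed to `data.frame()` have lengths 0, 2, and 1. A general way to narrow this down is to compare the lengths of the inputs, as in this sketch with made-up values (the empty vector here stands in for something like a filter that matched nothing).

```r
# Made-up inputs: one vector is empty, for example because a filter matched nothing
Whale.ID <- character(0)
Site <- c("S1", "S2")
Latitude <- 42.7

# Reproduce the error and capture its message
msg <- tryCatch(
  {
    data.frame(Whale.ID = Whale.ID, Site = Site, Latitude = Latitude)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Comparing the lengths shows which input is the odd one out
input_lengths <- lengths(list(Whale.ID = Whale.ID, Site = Site, Latitude = Latitude))
print(input_lengths)
```

A length of 0 usually means an earlier subsetting or matching step returned nothing, so that is the step to inspect first.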
Is it any wonder that when a friend asks how I am doing, my answer is “R is kicking my butt!”?
Science is a collaborative effort, where we build on the work of researchers who came before us. Rachael, a wonderful post-doc in the GEMM Lab, had already tackled this time-based interpolation problem earlier in the year working with albatross tracks. She graciously allowed me to build on her previous R code and tweak it for my own purposes. Two weeks ago, I was proud because I thought I had the code working – all that I needed to do was adjust the time interval we were looking for, and I could be off to the rest of my analysis! However, this weekend, the code has decided it doesn’t work with any interval except 6 minutes, and I am lost.
Many of the difficulties encountered when coding can be fixed by judicious use of Google, Stack Overflow, and the CRAN repository.
But sometimes, when you’ve been staring at the problem for hours, what you really need is a little praise for trying your best. So, if you are an R user, go download this package: praise, load the library, and type praise() into your console. You won’t regret it (See Fig. 4).
Thank you to Rachael who created the code in the first place, thanks to Solene who helped me troubleshoot, and thanks to Amanda for moral support. Go GEMM Lab!
Why do pirates have a hard time learning the alphabet? It’s not because they love aaaR so much, it’s because they get stuck at “c”!
Stelle, L. L., W. M. Megill, and M. R. Kinzel. 2008. Activity budget and diving behavior of gray whales (Eschrichtius robustus) in feeding grounds off coastal British Columbia. Marine Mammal Science 24:462-478.
Go Florence!
and love the praise() function, this is just what I need!
Now we know why the field is so much more fun. Heard you will be working in Port Orford this summer. Good luck with the project.
Thanks Era! Yes, we’re gearing up to continue the project down in Port Orford this summer with an extended focus on predator-prey interactions between mysid and gray whales. It should be fun, and give us new headaches in data analysis too – we’ll be sure to keep the blog updated with news!