Data Wrangling to Assess Data Availability: A Data Detective at Work

By Alexa Kownacki, Ph.D. Student, OSU Department of Fisheries and Wildlife, Geospatial Ecology of Marine Megafauna Lab

Data wrangling, in my own loose definition, is the necessary combination of data selection and data collection. Wrangling your data requires accessing and then assessing your data. Data collection is just what it sounds like: gathering all the data points necessary for your project. Data selection is the process of cleaning and trimming data for final analyses; it is a whole new can of worms that requires decision-making and critical thinking. During this process of data wrangling, I discovered there are two major avenues to obtain data: 1) you collect it yourself, which frequently requires an exorbitant amount of time in the field, in the lab, and/or behind a computer, or 2) other people have already collected it, and through collaboration you put it to good use (often a different use than its initial intent). The latter approach may leave you with so much data that you must decide which of it should be included to answer your hypotheses. This process of data wrangling is the hurdle I am facing at the moment. I feel like a data detective.

Data wrangling illustrated by members of the R-programming community. (Image source: R-bloggers.com)

My project focuses on assessing the health of the two ecotypes of bottlenose dolphins in the waters from Ensenada, Baja California, Mexico, to San Francisco, California, USA, between 1981 and 2015. During the government shutdown, much of my data was inaccessible, since it was in the possession of my collaborators at federal agencies. However, now that the shutdown is over, my data is flowing in, and my questions are piling up. I can now begin to look at where these animals have been sighted over the past decades, which ecotype has higher contaminant levels in its blubber, which animals have higher stress levels and whether these are related to geospatial location, where animals are more susceptible to human disturbance, whether sex plays a role in stress or contaminant load, which environmental variables influence stress and contaminant levels, and more!

Alexa, alongside collaborators, photographing transiting bottlenose dolphins along the coastline near Santa Barbara, CA in 2015 as part of the data collection process. (Image source: Nick Kellar).

Over the last two weeks, I was emailed three separate Excel spreadsheets representing three datasets that contain partially overlapping data. If Microsoft Access is foreign to you, I would compare this dilemma to a very confusing exam question of “matching the word with the definition”, except the words are in a different language from the definitions. If you have used Microsoft Access databases, you probably know the system of querying and matching data across different databases. Well, imagine trying to do this with Excel spreadsheets, because these databases are not linked. Now you can see why I need to take a data management course and start using platforms other than Excel to manage my data.

A visual interpretation of trying to combine datasets being like matching the English definition to the Spanish translation. (Image source: Enchanted Learning)

In the first dataset, there are 6,136 sightings of common bottlenose dolphins (Tursiops truncatus) documented in my study area. Some years have no sightings, some years have fewer than 100 sightings, and other years have over 500 sightings. In another dataset, there are 398 bottlenose dolphin biopsy samples collected between 1992 and 2016 in a genetics database that can provide the sex of each animal. The final dataset contains records of 774 bottlenose dolphin biopsy samples collected between 1993 and 2018 that could be tested for hormone and/or contaminant levels. Some of these samples have identification numbers that can be matched to the other dataset. Within these cross-referenced matches there are conflicting data on the amount of tissue remaining for analyses. Sorting these conflicts out will involve more digging on my end and additional communication with collaborators: data wrangling at its best. Circling back to what I mentioned at the beginning of this post, these data were collected by other people over decades, and the collection methods were not standardized for my project. I benefit from years of data collection by other scientists and I am grateful for all of their hard work. However, now my hard work begins.
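For anyone curious what this kind of cross-referencing can look like outside of Excel, here is a minimal sketch in R using dplyr joins. The file names and column names (SampleID, Sex, TissueRemaining) are hypothetical placeholders, not the actual fields in my spreadsheets.

library(readxl)
library(dplyr)

# Hypothetical spreadsheets: one with genetics results, one with hormone/contaminant samples
genetics <- read_excel("genetics_biopsies.xlsx")   # 398 samples: SampleID, Sex, ...
hormones <- read_excel("hormone_biopsies.xlsx")    # 774 samples: SampleID, TissueRemaining, ...

# Keep every hormone/contaminant record and attach sex wherever the sample IDs match
matched <- left_join(hormones, genetics, by = "SampleID", suffix = c("_horm", "_gen"))

# Flag cross-referenced samples where the two spreadsheets disagree on tissue remaining
conflicts <- filter(matched,
                    !is.na(TissueRemaining_horm), !is.na(TissueRemaining_gen),
                    TissueRemaining_horm != TissueRemaining_gen)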

The cutest part of data wrangling: finding adorable images of bottlenose dolphins, photographed during a coastal survey. (Image source: Alexa Kownacki).

There is also a large amount of data that I downloaded from federally maintained websites. For example, dolphin sighting data from research cruises are available for public access on the OBIS (Ocean Biogeographic Information System) SEAMAP website. It boasts 5,927,551 records from 1,096 data sets containing information on 711 species, with the help of 410 collaborators. The website is incredible: it allows you to search by different data criteria, download the data in a variety of formats, and explore the records on an interactive map. You can explore this at your leisure, but I want to point out the sheer amount of data. In my case, OBIS-SEAMAP is only one major platform holding many sources of data that were collected, not specifically for me or my project, but that I will put to use. When using data collected by other scientists, it is critical to give credit where credit is due. One of the benefits of using this website is that it provides information about how to properly credit the collaborators when downloading data. See below for an example:

Example citation for a dataset (Dataset ID: 1201):

Lockhart, G.G., DiGiovanni Jr., R.A., DePerte, A.M. 2014. Virginia and Maryland Sea Turtle Research and Conservation Initiative Aerial Survey Sightings, May 2011 through July 2013. Downloaded from OBIS-SEAMAP (http://seamap.env.duke.edu/dataset/1201) on xxxx-xx-xx.

Citation for OBIS-SEAMAP:

Halpin, P.N., A.J. Read, E. Fujioka, B.D. Best, B. Donnelly, L.J. Hazen, C. Kot, K. Urian, E. LaBrecque, A. Dimatteo, J. Cleary, C. Good, L.B. Crowder, and K.D. Hyrenbach. 2009. OBIS-SEAMAP: The world data center for marine mammal, sea bird, and sea turtle distributions. Oceanography 22(2):104-115

Another federally maintained data source that boasts more data than I can quantify is the well-known ERDDAP website. After a few Google searches, I finally discovered that the acronym stands for the Environmental Research Division’s Data Access Program. Essentially, this is the holy grail of environmental data for marine scientists. I have downloaded so much data from this website that Excel cannot open the csv files. Here is yet another reason why young scientists, like myself, need to transition out of using Excel and into data management systems that are built to handle large-scale datasets. The downloads include everything from daily sea surface temperatures at every one-degree line of latitude and longitude over my entire study site from 1981-2015, to Ekman transport levels taken every six hours at every degree of longitude across my study area. I will add some of these environmental variables to species distribution models to see which account for the largest amount of variability in my data. The next step in data selection begins with statistics: it is important to identify highly correlated environmental factors prior to modeling the data. Learn more about fitting cetacean data to models here.
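As an aside for anyone hitting the same Excel wall: a csv that is too large for a spreadsheet usually loads in seconds in R. Below is a minimal sketch (the file and column names are invented for illustration) that reads a large ERDDAP download with data.table and checks whether candidate predictors are highly correlated before any modeling.

library(data.table)

# fread handles csv files far too large for Excel to open
env <- fread("erddap_environmental_data.csv")

# Pairwise correlations among hypothetical candidate predictors
round(cor(env[, .(sst, chl_a, ekman_transport)], use = "pairwise.complete.obs"), 2)

# Common rule of thumb: reconsider any pair with |r| above ~0.7 before putting both in one model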

The ERDDAP website combined all of the average sea surface temperatures collected daily from 1981-2018 over my study site into a graphical display of monthly composites. (Image source: ERDDAP)

As you can imagine, this amount of data from many sources and collaborators is equal parts daunting and exhilarating. Before I even begin the process of determining the spatial and temporal spread of the dolphin sighting data, I have to identify which data points have sex identified from either hormone levels or genetics, which data points have contaminant levels already quantified, which samples still have tissue available for additional testing, and so on. Once I have cleaned up the datasets, I will import the data into R. Then I can visualize my data in plots, charts, and graphs; this will help me identify outliers and potential challenges with my data, and, hopefully, start to see answers to my focal questions. Only then can I dive into the deep and exciting waters of species distribution modeling and more advanced statistical analyses. This is data wrangling and I am the data detective.
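As a small example of that first visual check, a boxplot is often enough to make outliers jump out; here is a sketch with ggplot2, using made-up column names for a cleaned-up biopsy table.

library(ggplot2)

# Hypothetical cleaned dataset: one row per biopsy, with ecotype and a contaminant level
ggplot(biopsies, aes(x = ecotype, y = contaminant_ppm)) +
  geom_boxplot() +
  labs(x = "Ecotype", y = "Blubber contaminant level (ppm)")   # stray points = candidate outliers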

What people may think a ‘data detective’ looks like, when, in reality, it is a person sitting at a computer. (Image source: Elder Research)

In the spirit of the well-known phrase “With great power comes great responsibility”, I believe that with great data comes great responsibility, because data is power. It is up to me as the scientist to decide which data are most powerful for answering my questions.

Data is information. Information is knowledge. Knowledge is power. (Image source: thedatachick.com)

 

Photogrammetry Insights

By Leila Lemos, PhD Candidate, Fisheries and Wildlife Department, Oregon State University

After three years of fieldwork and analyzing a large dataset, it is time to finally start compiling the results, creating plots, and seeing what the trends are. The first dataset I am analyzing is the photogrammetry data (more on our photogrammetry method here), which so far has been full of unexpected results.

Our first big expectation was to find a noticeable intra-year variation. Gray whales spend the winter in the warm waters of Baja California, Mexico, a period during which they fast. In the spring, they make a long migration to higher latitudes. Only when they reach their summer feeding grounds, which extend from Northern California to the Bering and Chukchi seas off Alaska, do they start feeding and gaining enough calories to support their migration back to Mexico and the subsequent fasting period.

 

Northeastern gray whale migration route along the NE Pacific Ocean.
Source: https://journeynorth.org/tm/gwhale/annual/map.html

 

Thus, we expected to see whales arriving along the Oregon coast with a skinny body condition that would gradually improve over the months of the feeding season. Some exceptions are reasonable, such as a lactating mother or a debilitated individual. However, datasets are often more complex than we expect, and many variables can influence the results. Our photogrammetry dataset is no different!

In addition, I need to decide which plots best display the results and how to make them. For years now I’ve been hearing about the wonders of R, but I’ve been skeptical about learning a whole new programming/coding language “just to make plots”, as I first thought of it. I have always used statistical programs such as SPSS or Prism for my plots, and they were easy to work with. However, there is a lot more we can do in R than “just plots”. Also, just because something seems hard doesn’t mean you shouldn’t try it. We need to push ourselves out of our comfort zones sometimes. So, I decided to give R a try (and I am proud of myself that I did), and here are some of the results:

 

Plot 1: Body Area Index (BAI) vs Day of the Year (DOY)

 

In this plot, we wanted to assess the annual Body Area Index (BAI) trends that describe how skinny (low number) or fat (high number) a whale is. BAI is a simplified version of the BMI (Body Mass Index) used for humans. If you are interested in this method, which we developed at our lab in collaboration with the Aerial Information Systems Laboratory at OSU, you can read more about it in our publication.

The plots above are three versions of the same data displayed in different ways. The first plot, on the left, shows all the data points by year, with polynomial best-fit lines and confidence intervals (in gray). There are many overlapping observation points, so for the middle plot I tried to clean things up by reducing the size of the points and removing the gray confidence interval ribbon around the lines. In the last plot, on the right, I used a linear regression best-fit line instead of a polynomial one.
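For readers wondering how plots like these are built, here is roughly the ggplot2 recipe; BAI, DOY, and Year are placeholders for however the columns are actually named.

library(ggplot2)

# Points colored by year, with a polynomial best-fit line and confidence ribbon for each year
ggplot(bai_data, aes(x = DOY, y = BAI, color = factor(Year))) +
  geom_point(size = 1) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE)

# Middle panel: shrink the points and drop the ribbon with se = FALSE
# Right panel: use a straight line instead, geom_smooth(method = "lm", se = FALSE)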

We can see a general trend that the BAI was considerably higher in 2016 (red line), when compared to the following years, which makes us question the accuracy of the dataset for that year. In 2016, we also didn’t sample in the month of July, which is causing the 2016 polynomial line to show a sharp decrease in this month (DOY: ~200-230). But it is also interesting to note that the increasing slope of the linear regression line in all three years is very similar, indicating that the whales gained weight at about the same rate in all years.

 

Plot 2: Body Area Index (BAI) vs Body Condition Score (BCS)

 

In addition to the photogrammetry method of assessing whale body condition, we also applied a body condition scoring method to all the photos we have taken in the field (based on the method described by Bradford et al. 2012). Thus, with this second set of plots, we wanted to compare the two methods of assessing whale body condition in order to evaluate when the methods agree or disagree, and which method is best in which situation. Our hypothesis was that whales with a ‘fair’ body condition would have a lower BAI than whales with a ‘good’ body condition.

The plots above illustrate two versions of the same data, with the data in the left plot grouped by year and the data in the right plot grouped by month. In general, we see that no whales were observed with a poor body condition in the last months analyzed (August to October), with both methods agreeing on this point. Additionally, many whales still had a fair body condition in August and September, but fewer in October, indicating that most whales gained weight over the foraging season and were ready to start their southbound migration and another fasting period. This result is important information for monitoring and conservation.
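For the curious, this comparison can be plotted with a few lines of ggplot2 (placeholder column names again), grouping BAI by body condition score and faceting by year or by month.

library(ggplot2)

ggplot(whales, aes(x = factor(BCS), y = BAI)) +
  geom_boxplot() +
  facet_wrap(~ Year) +     # swap in facet_wrap(~ Month) for the monthly version
  labs(x = "Body condition score (BCS)", y = "Body Area Index (BAI)")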

However, the 2016 dataset is still a concern, since the whales appear to have a considerably higher body condition (BAI) compared to other years.

 

Plot 3: Temporal Body Area Index (BAI) for individual whales

 

In this last group of plots, we wanted to visualize BAI trends over the season (using day of year, DOY, on the x-axis) for individuals we measured more than once. Here we can see the temporal patterns for the whales “Bit”, “Clouds”, “Pearl”, “Scarback”, “Pointy”, and “White Hole”.

We expected to see an overall gradual increase in body condition (BAI) over the seasons, such as what we can observe for Pointy in 2018. However, some whales decreased their condition, such as Bit in 2018. Could this trend be accurate? Furthermore, what about BAI measurements that are different from the trend, such as Scarback in 2017, where the last observation point shows a lower BAI than past observation points? In addition, we still observe a high BAI in 2016 at this individual level, when compared to the other years.

My next step will be to check the whole dataset again and search for inconsistencies. Something is causing these 2016 values to be potentially wrong, and I need to find out what it is. The photogrammetry images I measured were generally of good quality and in focus, but other variables could be influencing the quality and accuracy of the measurements.

For instance, when measuring images, I often struggled with glare, water splash, water turbidity, ocean swell, and shadows, as you can see in the photos below. All of these variables made the borders of the whale’s body harder to see and identify, which may have led to erroneous measurements.

 

Examples of bad conditions for performing photogrammetry: (1) glare and water splash, (2) water turbidity, (3) ocean swell, and (4) a shadow cast over one side of the whale’s body.
Source: GEMM Lab. Taken under NMFS permit 16111 issued to John Calambokidis.

 

Thus, I will need to check all of these variables to identify the causes of bad measurements and “clean the dataset”. Only after this process will I be able to make these plots again and look at the trends (which will be easy since I already have my R code written!). Then I’ll move on to my next hypothesis: that the BAI of individual whales varies with demographics, including sex, age, and reproductive state.

To carry out robust science that produces results we can trust, we can’t simply collect data, perform a basic analysis, create plots, and believe everything we see. Data is often messy, especially when developing new methods like we have done here with drone-based photogrammetry and the BAI. So, I need to spend some important time checking my data for accuracy and examining confounding variables that might affect the dataset. Science can be challenging, whether we are interpreting data or learning a new programming language, but it is all worth it in the end when we produce results we know we can trust.

 

 

 

A Marine Mammal Odyssey, Eh!

By Leila Lemos, PhD student

Dawn Barlow, MS student

Florence Sullivan, MS

The Society for Marine Mammalogy’s Biennial Conference on the Biology of Marine Mammals happens every two years and this year the conference took place in Halifax, Nova Scotia, Canada.

Logo of the Society for Marine Mammalogy’s 22nd Biennial Conference on the Biology of Marine Mammals, 2017: A Marine Mammal Odyssey, eh!

The conference started with a welcome reception on Sunday, October 22nd, followed by a week of plenaries, oral presentations, speed talks and posters, and two more days with different workshops to attend.

This conference is an important event for us, as marine mammalogists. This is the moment where we get to share our projects (how exciting!), get important feedback, and hear about different studies that are being conducted around the world. It is also an opportunity to network and find opportunities for collaboration with other researchers, and of course to learn from our colleagues who are presenting their work.

The GEMM Lab attending the opening plenaries of the conference!

The first day of conference started with an excellent talk from Asha de Vos, from Sri Lanka, where she discussed the need for increased diversity (in all aspects including race, gender, nationality, etc.) in our field, and advocated for the end of “parachute scientists” who come into a foreign (to them) location, complete their research, and then leave without communicating results, or empowering the local community to care or act in response to local conservation issues.  She also talked about the difficulty that researchers in developing countries face accessing research that is hidden behind journal pay walls, and encouraged everyone to get creative with communication! This means using blogs and social media, talking to science communicators and others in order to get our stories out, and no longer hiding our results behind the ivory tower of academia.  Overall, it was an inspirational way to begin the week.

On Thursday morning we heard Julie van der Hoop, this year’s recipient of the F.G. Wood Memorial Scholarship Award, present her work on “Drag from fishing gear entangling right whales: a major extinction risk factor”. Julie observed a decrease in lipid reserves in entangled whales and asked whether entanglements are as costly as events such as migration, pregnancy, or lactation. Tags were also deployed on whales that had been disentangled from fishing gear, and the researchers observed an increase in swim speed and dive depth once the gear was removed.

Julie van der Hoop talks about the drag forces of fishing gear on North Atlantic right whales.

There were many other interesting talks over the course of the week. Some of the talks that inspired us were:

— Stephen Trumble’s talk “Earplugs reveal a century of stress in baleen whales and the impact of industrial whaling” presented a time series of cortisol profiles from different species of baleen whales, derived from their earplugs. The temporal data were compared to whaling records, and the two datasets were highly correlated. However, during a period of low whaling concurrent with World War II in the 1940s, high cortisol levels were potentially associated with an increase in noise from ship traffic.

— Jane Khudyakov (“Elephant seal blubber transcriptome and proteome responses to single and repeated stress”) and Cory Champagne (“Metabolomic response to acute and repeated stress in the northern elephant seal”) presented different aspects of the same project. Jane looked at the down/upregulation of genes (downregulation is when a cell decreases the quantity of a cellular component, such as RNA or protein, in response to an external stimulus; upregulation is the opposite, when the cell increases the quantity of a cellular component) to check for stress. She was able to confirm an upregulation of genes after repeated stressor exposure. Cory checked for influences on metabolism after administering ACTH (adrenocorticotropic hormone, a hormone that stimulates the adrenal cortex to release glucocorticoids, e.g., cortisol, a stress-related hormone) to elephant seals. By looking only at the stress-related hormone, he was not able to differentiate acute from chronic stress responses. However, he showed that many other metabolic processes varied according to the stress-exposure time, including a decrease in amino acids, mobilization of lipids, and upregulation of carbohydrates.

— Jouni Koskela (“Fishing restrictions is an essential protection method of the Saimaa ringed seal”) talked about the various conservation efforts being undertaken for the endangered Lake Saimaa ringed seal. Gill nets account for 90% of seal pup mortality, but if pups can reach 20 kg, only 14% of them will drown in these fishing net entanglements. Working with local industry and recreational interests, increased fishing restrictions have been enacted during the weaning season. In addition to other year-round restrictions, this has led to a small but noticeable upward trend in pup production and population growth! A conservation success story is always gratifying to hear, and we wish these collaborative efforts continued future success.

— Charmain Hamilton (“Impacts of sea-ice declines on a pinnacle Arctic predator-prey relationship: Habitat, behaviour, and spatial overlap between coastal polar bears and ringed seals”) gave a fascinating presentation looking at how changing ice regimes in the Arctic are affecting the spatial habitat use patterns of polar bears. As ice decreases in the summer months, the polar bears move more, resulting in less spatial overlap with ringed seal habitat, and so the bears have turned to targeting ground-nesting seabirds. This spatio-temporal mismatch of traditional predator and prey has drastic implications for Arctic food web dynamics.

— Nicholas Farmer’s presentation on a Population Consequences of Disturbance (PCoD) model for assessing theoretical impacts of seismic survey on sperm whale population health had some interesting parallels with new questions in our New Zealand blue whale project. By simulating whale movement through modeled three-dimensional sound fields, he found that the frequency of the disturbance (i.e., how many days in a row the seismic survey activity persisted) was very important in determining effects on the whales. If the seismic noise persists for many days in a row, the sperm whales may not be able to replenish their caloric reserves because of ongoing disturbance. As you can imagine, this pattern gets worse with more sequential days of disturbance.

— Jeremy Goldbogen used suction cup tags equipped with video cameras to peer into an unusual ecological niche: the boundary layer of large whales, where drag is minimized and remoras and small invertebrates compete and thrive. Who would have thought that at a marine mammal conference, a room full of people would be smiling and laughing at remoras sliding around the back of a blue whale, or barnacles filter feeding as they go for a ride with a humpback whale? Insights from animals that occupy this rare niche can inform improvements to current tag technologies.

The GEMM Lab was well represented this year with six different talks: four oral presentations and two speed talks! It is evident that all of our hard work and preparation, such as practicing our talks in front of our lab mates two weeks in advance, paid off. All of the talks were extremely well received by the audience, and a few generated intelligent questions and discussion afterwards – exactly as we hoped. It was certainly gratifying to see how packed the room was for Sharon’s announcement of our new method of standardizing photogrammetry from drones, and how long people stayed to talk to Dawn after her presentation about a unique population of New Zealand blue whales – it took us over an hour to be able to steal her away for food and the celebratory drinks she deserved!

GEMM Lab members giving their talks. From left to right, top to bottom: Amanda Holdman, Leila Lemos, Solène Derville, Dawn Barlow, Sharon Nieukirk, and Florence Sullivan.

 

GEMM Lab members at the closing celebration. From left to right: Florence Sullivan, Leila Lemos, Amanda Holdman, Solène Derville, and Dawn Barlow.
We are not always serious, we can get silly sometimes!

The weekend after the conference, many courageous researchers who wanted to stuff their brains with even more specialized knowledge participated in targeted workshops. Of the 32 different workshops offered, Leila chose “Measuring hormones in marine mammals: Current methods, alternative sample matrices, and future directions” in order to learn more about the new methods, hormones, and matrices being used by different research groups, and also to make connections with other endocrinology researchers. Solène participated in the workshop “Reproducible Research with R, Git, and GitHub” led by Robert Shick. She learned how to better organize her research workflow and looks forward to teaching us all how to be better collaborative coders and to ensure our analyses are reproducible by others and by our future selves!

On Sunday none of us from the GEMM Lab participated in workshops, so we were able to explore a little bit of the Bay of Fundy, an important area for many marine mammal species. Even though we didn’t spot any marine mammals, we enjoyed witnessing the enormous tidal exchange of the bay (the largest tides in the world), and the fall colors of the Annapolis Valley were stunning as well. Our little trip was fun and relaxing after a whole week of learning.

The beauty of the Bay of Fundy.
GEMM Lab at the Bay of Fundy; from left to right: Kelly Sullivan (Florence’s husband and a GEMM Lab fan), Florence Sullivan, Dawn Barlow, Solène Derville, and Leila Lemos.
We do love being part of the GEMM Lab!

It is amazing how refreshing it is to participate in a conference. So many ideas pop into our heads, along with an increasing desire to continue doing research and working for the conservation of marine mammals. Now it’s time to put all of our ideas and energy into practice back home! See you all in two years at the next conference in Barcelona!

Flying out of Halifax!

Finding the edge: Preliminary insights into blue whale habitat selection in New Zealand

By Dawn Barlow, MSc student, OSU Department of Fisheries and Wildlife, Geospatial Ecology of Marine Megafauna Lab

I was fortunate enough to spend the Austral summer in the field, and so while the winter rain poured down on Oregon I found myself on the water with the sun and wind on my face, looking for blue whales in New Zealand. This spring I switched gears and spent time taking courses to build my analytical toolbox. In a course on technical writing and communication, I was challenged to present my research using only pictures and words with no written text, and to succinctly summarize the importance of my research in an introduction to a technical paper. I attended weekly seminars to learn about the diverse array of marine science being conducted at Oregon State University and beyond. I also took a course entitled “Advanced Spatial Statistics and Geographic Information Science”. In this skill-building course, we were given the opportunity to work with our own data. Even though my primary objective was to expand the tools in my toolbox, I was excited to explore preliminary results and possible insight into blue whale habitat selection in my study area, the South Taranaki Bight region (STB) of New Zealand (Figure 1).

Figure 1. A map of New Zealand, with the South Taranaki Bight (STB) region delineated by the black box. Farewell Spit is denoted by a star, and Kahurangi point is denoted by an X.

Despite the recent documentation of a foraging ground in the STB, blue whale distribution remains poorly understood in New Zealand. The STB is New Zealand’s most industrially active marine region, and the site of active oil and gas extraction and exploration, busy shipping traffic, and proposed seabed mining. This potential space-use conflict between endangered whales and industry warrants further investigation into the spatial and temporal extent of blue whale habitat in the region. One of my research objectives is to investigate the relationship between blue whales and their environment, and ultimately to build a model that can predict blue whale presence based on physical and biological oceanographic features. For this spring term, the question I asked was:

Is the number of blue whales present in an area correlated with remotely-sensed sea surface temperature and chlorophyll-a concentration?

For the purposes of this exploration, I used data from our 2017 survey of the STB. This meant importing our ship’s track and our blue whale sighting locations into ArcGIS, so that the data went from looking like this:

… to this:

The next step was to get remote-sensed images of sea surface temperature (SST) and chlorophyll-a (chl-a) concentration. I downloaded monthly averages from the NASA Moderate Resolution Imaging Spectroradiometer (MODIS Aqua) website at 4 km2 resolution for February 2017, the month when our survey took place. Now, my images looked something more like this:

But, I can’t say anything reliable about the relationships between blue whales and their environment in the places we did not survey.  So next I extracted just the portions of my remote-sensed images where we conducted survey effort. Now my maps looked more like this one:

The above map shows SST along our ship’s track, and the locations where we found whales. Just looking at this plot, it seems like the blue whales were observed in both warmer and colder waters, not exclusively in one or the other. There is a productive plume of cold, upwelled water in the STB that is generated off of Kahurangi point and curves around Farewell Spit and into the bight (Figure 1). Most of the whales we saw appear to be near that plume. But how can I find the edges of this upwelled plume? Well, I can look at the amount of change in SST and chl-a across a spatial area. The places where warm and cold water meet can be found by assessing the amount of variability—the standard deviation—in the temperature of the water. In ArcGIS, I calculated the deviation in SST and chl-a concentration across the surrounding 20 km2 for each 4 km2 cell.
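I did this step in ArcGIS, but the same moving-window calculation can be done in R. Here is a minimal sketch with the raster package, assuming the SST layer is already saved as a 4 km2 grid; the 5 x 5 window is just one way to approximate a ~20 km neighborhood, and the file name is hypothetical.

library(raster)

sst <- raster("sst_feb2017_monthly.tif")   # hypothetical 4 km MODIS SST layer

# Standard deviation of SST within a 5 x 5 window of 4 km cells around each cell
sst_sd <- focal(sst, w = matrix(1, nrow = 5, ncol = 5), fun = sd, na.rm = TRUE)

# Repeat with the chl-a layer to get the deviation in chl-a concentration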

Now, how do I tie all of these qualitative visual assessments together to produce a quantitative result? With a statistical model! This next step gives me the opportunity to flex some other analytical muscles, and practice using another computational tool: R. I used a generalized additive model (GAM) to investigate the relationships between the number of blue whales observed in each 4 km2 cell our ship surveyed and the remote-sensed variables. The model can be written like this:

Number of blue whales ~ SST + chl-a + sd(SST) + sd(chl-a)

In other words, are SST, chl-a concentration, deviation in SST, and deviation in chl-a concentration correlated with the number of blue whales observed within each 4 km2 cell on my map?
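A minimal sketch of how a model like this can be fit in R with the mgcv package, assuming a data frame with one row per surveyed 4 km2 cell and treating the whale counts as Poisson-distributed (the exact model structure and error family I used may differ):

library(mgcv)

whale_gam <- gam(n_whales ~ s(sst) + s(chl_a) + s(sst_sd) + s(chl_a_sd),
                 family = poisson, data = cells)

summary(whale_gam)   # approximate significance of each smooth term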

This model found that the most important predictor was the deviation in SST. In other words, these New Zealand blue whales may be seeking the edges of the upwelling plume, homing in on places where warm and cold water meet. Thinking back on the time I spent in the field, we often saw feeding blue whales diving along lines of mixing water masses where the water column was filled with aggregations of krill, blue whale prey. Studies of marine mammals in other parts of the world have also found that eddies and oceanic fronts—edges between warm and cold water masses—are important habitat features where productivity is increased due to the mixing of water masses. The same may be true for these New Zealand blue whales.

These preliminary findings emphasize the benefit of having both presence and absence data. The analysis I have presented here is certainly strengthened by having environmental measurements for locations where we did not see whales. This is comforting, considering the feelings of impatience generated by days on the water spent like this with no whales to be seen:

Moving forward, I will include the blue whale sighting data from our 2014 and 2016 surveys as well. As I think about what would make this model more robust, it would be interesting to see if the patterns become clearer when I incorporate behavior into the model—if I look at whales that are foraging and traveling separately, are the results different? I hope to explore the importance of the upwelling plume in more detail—does the distance from the edge of the upwelling plume matter? And finally, I want to adjust the spatial and temporal scales of my analysis—do patterns shift or become clearer if I don’t use monthly averages, or if I change the grid cell sizes on my maps?

I feel more confident in my growing toolbox, and look forward to improving this model in the coming months! Stay tuned.

Grad School Headaches

By Florence Sullivan, MSc student GEMM lab

Over the past few months I have been slowly (and I do mean SLOWLY – I don’t believe I’ve struggled this much with learning a new skill in a long, long time) learning how to work in “R”.  For those unfamiliar with why a simple letter might cause me so much trouble, R is a programming language and free software environment suitable for statistical computing and graphing.

My goal lately has been to interpolate my whale tracklines (i.e. smooth out the gaps where we missed a whale’s surfacing by inserting artificial locations).  In order to do this I needed to know (1) How long does a gap between fixes need to be to identify a missed surfacing? (2) How many artificial points should be used to fill a given gap?

The best way to answer these queries was to look at the distribution of all of the time steps between fixes. I started by importing my dataset – the latitude and longitude, date, time, and unique whale identifier for each point (over 5,000 of them) we recorded last summer. I converted the locations into x and y coordinates, adjusted the date and time stamps into the proper format, and used the package adehabitatLT to calculate the time difference between each fix. The package ggplot2 was useful for creating exploratory histograms – but my data was incredibly skewed (Fig. 1)! It appeared that the majority of our fixes happened less than a minute apart from each other. When you recall that gray whales typically take 3-4 short breaths at the surface between dives, this starts to make a lot of sense, but we had anticipated a bimodal distribution with two peaks: one for the quick surfacings, and one for the surfacings after 4-5 minute dives. Where was this second peak?
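For anyone wanting to reproduce this step, its skeleton looks something like the sketch below; the object and column names are placeholders, not my actual variable names.

library(adehabitatLT)
library(ggplot2)

# Build a trajectory object from projected x/y coordinates, timestamps, and whale IDs
tracks <- as.ltraj(xy = fixes[, c("x", "y")],
                   date = fixes$datetime,      # POSIXct date-times
                   id   = fixes$WhaleID)

# Flatten to a data frame; the dt column holds the seconds between successive fixes
steps <- ld(tracks)

ggplot(steps, aes(x = dt)) + geom_histogram(bins = 60)        # Fig. 1
ggplot(steps, aes(x = log(dt))) + geom_histogram(bins = 60)   # Fig. 2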

Fig. 1.  Histogram of the difference in time (in seconds on x-axis) between whale fixes.

Sometimes, taking the logarithm of one of your axes can help tease out more patterns in your data – particularly in a heavily skewed distribution like Fig. 1. When I logged the time interval data, our expected bimodal distribution pattern became evident (Fig. 2). And, when I back-calculate from the centers of the two peaks, we see that the first peak occurs at less than 20 seconds (e^2.5 ≈ 12 secs), representing the short, shallow blow intervals, or interventilation dives, and that the second peak spans dives of ~2.5 minutes to ~5 minutes (e^4.9 ≈ 134 secs, e^5.7 ≈ 298 secs). Reassuringly, these dive intervals are in agreement with the findings of Stelle et al. (2008), who described the mean interval between blows as 15.4 ± 4.73 seconds, and overall dives ranging from 8 seconds to 11 minutes.

Fig. 2. Histogram of the log of time difference between whale fixes.

So, now that we know the typical dive patterns in this dataset, the trick was to write code that would look through each trackline and identify gaps of greater than 5 minutes. Then, the code calculates how many artificial points to create to fill each gap, and where to put them.
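Stripped down to its core logic, the gap-filling step looks something like the sketch below. Rachael’s actual code is more sophisticated; this version assumes a single whale’s track stored as a data frame with only datetime (POSIXct), x, and y columns, sorted by time, and uses simple linear interpolation.

# Fill gaps longer than max_gap seconds with evenly spaced artificial points
fill_gaps <- function(track, max_gap = 300) {
  out <- track[1, ]
  for (i in 2:nrow(track)) {
    gap <- as.numeric(difftime(track$datetime[i], track$datetime[i - 1], units = "secs"))
    if (gap > max_gap) {
      n_new <- floor(gap / max_gap)            # how many artificial points to create
      frac  <- seq_len(n_new) / (n_new + 1)    # spread them evenly across the gap
      new_pts <- data.frame(
        datetime = track$datetime[i - 1] + frac * gap,
        x = track$x[i - 1] + frac * (track$x[i] - track$x[i - 1]),
        y = track$y[i - 1] + frac * (track$y[i] - track$y[i - 1]))
      out <- rbind(out, new_pts)
    }
    out <- rbind(out, track[i, ])
  }
  out
}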

Fig. 3. A check in my code to make sure the artificial points are being plotted correctly. The blue points are the originals, and the red ones are new.

One of the most frustrating parts of this adventure for me has been understanding the syntax of the R language.  I know what calculations or comparisons I want to make with my dataset, but translating my thoughts into syntax for the computer to understand has not been easy.  With error messages such as:

Error in match.names(clabs, names(xi)) :

  names do not match previous names

Solution: I had to go line by line and verify that every single variable name matched, but it turned out a capital letter in the wrong place was throwing the error!

Error in as.POSIXct.default(time1) :

  do not know how to convert ‘time1’ to class “POSIXct”

Solution: a weird case where the data was in the correct time format, but not being recognized, so I had to re-import the dataset as a different file format.

Error in data.frame(Whale.ID = Whale.ID, Site = Site, Latitude = Latitude,  :   arguments imply differing number of rows: 0, 2, 1

Solution: HELP! Yet to be solved….

Is it any wonder that when a friend asks how I am doing, my answer is “R is kicking my butt!”?

Science is a collaborative effort, where we build on the work of researchers who came before us. Rachael, a wonderful post-doc in the GEMM Lab, had already tackled this time-based interpolation problem earlier in the year working with albatross tracks. She graciously allowed me to build on her previous R code and tweak it for my own purposes. Two weeks ago, I was proud because I thought I had the code working – all that I needed to do was adjust the time interval we were looking for, and I could be off to the rest of my analysis!  However, this weekend, the code has decided it doesn’t work with any interval except 6 minutes, and I am lost.

Many of the difficulties encountered when coding can be fixed by judicious use of Google, Stack Overflow, and the CRAN repository.

But sometimes, when you’ve been staring at the problem for hours, what you really need is a little praise for trying your best. So, if you are an R user, go download this package: praise, load the library, and type praise() into your console. You won’t regret it (See Fig. 4).
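For the record, the whole ritual is three lines:

install.packages("praise")
library(praise)
praise()   # prints a random compliment to your console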

Fig. 4. A little compliment goes a long way to solving a headache.

Thank you to Rachael who created the code in the first place, thanks to Solène who helped me troubleshoot, and thanks to Amanda for moral support. Go GEMM Lab!

Why do pirates have a hard time learning the alphabet?  It’s not because they love aaaR so much, it’s because they get stuck at “c”!

Stelle, L. L., W. M. Megill, and M. R. Kinzel. 2008. Activity budget and diving behavior of gray whales (Eschrichtius robustus) in feeding grounds off coastal British Columbia. Marine mammal science 24:462-478.

Exciting news for the GEMM Lab: SMM conference and a twitter feed!

By Amanda Holdman (M.S Student)

At the end of the week, the GEMM Lab will be piling into our fuel-efficient Subarus and heading south to San Francisco! The 21st Biennial Conference on the Biology of Marine Mammals, hosted by the Society for Marine Mammalogy, kicks off this weekend and the GEMM Lab is all prepped and ready!

Workshops start on Saturday prior to the conference, and I will be attending the Harbor Porpoise Workshop, where I get to collaborate with several other researchers from around the world who study my favorite cryptic species. After morning introductions, we will have a series of talks and a lunch break, and then head to the Golden Gate Bridge to see the recently returned San Francisco harbor porpoises. Sounds fun, right?! But that’s just day one. A whole week of scientific fun is to be had! So let’s begin with the Society’s mission:


‘To promote the global advancement of marine mammal science and contribute to its relevance and impact in education, conservation and management’ 

And the GEMM Lab is all set to do just that! The conference will bring together approximately 2,200 top marine mammal scientists and managers to investigate the theme of Marine Mammal Conservation in a Changing World. All GEMM Lab members will be presenting at this year’s conference, accompanied by other researchers from the Marine Mammal Institute, for a total of 34 researchers representing Oregon State University!

Here is our Lab line-up:

Our leader, Leigh, will be starting us off strong with a speed talk on Moving from documentation to protection of a blue whale foraging ground in an industrial area of New Zealand

Tuesday morning I will be presenting a poster on the Spatio-temporal patterns and ecological drivers of harbor porpoises off of the central Oregon coast

Solène follows directly after me on Tuesday to give an oral presentation on the Environmental correlates of nearshore habitat distribution by the critically endangered Maui dolphin.

Florence helps us reconvene Thursday morning with a poster presentation on her work, Assessment of vessel response to foraging gray whales along the Oregon coast to promote sustainable ecotourism. 

And finally, Courtney, the most recent Master of Science, and the first graduate of the GEMM Lab will give an oral presentation to round us out on Citizen Science: Benefits and limitations for marine mammal research and education

However, while I am full of excitement and anticipation for the conference, I regret to report that you will not be seeing a blog post from us next week. That’s because the GEMM Lab recently created a Twitter feed, and we will be “live tweeting” our conference experience with all of you! You can follow along with the conference by searching #Marman15 and follow our lab at @GemmLabOSU

Twitter is a great way to communicate our research, exchange ideas and network, and can be a great resource for scientific inspiration.

If you are new to Twitter, like the GEMM Lab, or are considering pursuing graduate school, take some time to explore the scientific world of tweeting and following. I did, and as it turns out there are tons of resources aimed at grad students helping other grad students.

For example:

Tweets by the Thesis Whisperer team (@thesiswhisperer) offer advice and useful tips on writing and other grad-related stuff. If you are having problems with statistics, there are lots of specialist communities, such as the R-related hashtag #rstats, or you could follow @Rbloggers and @statsforbios, to name a few.

As always, thanks for following along, and make sure to find us on Twitter so you can keep up with the GEMM Lab’s scientific endeavors.

 

 

On learning to Code…

By Amanda Holdman, MSc student, Dept. Fisheries and Wildlife, OSU

I’ve never sworn so much in my life. I stared at a computer screen for hours trying to fix a bug in my script. The cause of the error escaped me, pushing me into a cycle of tension, self-loathing, and keyboard smashing.

The cause of the error? A typo in the filename.

When I finally fixed the error in my filename and my code ran perfectly – my mood quickly changed. I felt invincible, like I had just won the World Cup. I did a quick victory dance in my kitchen, high-fived my roommate, and then sat down and moved on to the next task that needed to be conquered with code. Just like that, programming has quickly become a drug that makes me come back for more despite the initial pain I endure.

I had never opened computer programming software until my first year of graduate school. Before then, MATLAB was just the subject of a muttered complaint by my college engineering roommate. As a biology major, I blew it off as something (thank goodness!) I would never need to use. Needless to say, that set me up for a rude awakening just one year later.

The time has finally come for me to, *gulp*, learn how to code. I honestly think I went through all 5 stages of grief before I realized I was at the point where I could no longer put it off.

By now you are familiar with the GEMM Lab updating you with photos of our charismatic study species in our beautiful study areas. However, summer is over. My field work is complete, and I’m enrolled in my last course of my master’s career. So what does this mean? Winter. And with winter comes data analysis. So, instead of spending my days out on a boat in calm seas, watching humpbacks breach, or tagging along with Florence to watch gray whales forage along the Oregon coast, I’ve reached the point of my graduate career that we don’t often tell you about: Figuring out what story our data is telling us. This stage requires lots of coffee and patience.

However, in just two short weeks of learning how to code, I feel like I’ve climbed mountains. I tackle task after task, each allowing me to learn new things, revise old knowledge, and get just a little bit closer to my goals. One of the most striking things about learning how to code is that it teaches you how to problem solve. It forces you to think in a strategic and conceptual way, and to be honest, I think I like it.

For example, this week I mapped the percentage of my harbor porpoise detections over the tidal cycle. One of the most important factors explaining the distribution and behavior of coastal marine mammals is the tide. Tidal forces drive a number of primary and secondary oceanographic processes like changes in water depth, salinity, temperature, and the speed and direction of currents. It’s often difficult to unravel which part of the tidal process is most influential for a species, due to the several covariates related to the changing tides, how interrelated those covariates are, and the elusive nature of the species (like the cryptic harbor porpoise). However, while the analysis is preliminary, if we map the acoustic detections of harbor porpoise over the tidal cycle, we can already start to see some interesting trends between the number of porpoise detections and the phases of the tide. Check it out!

Preliminary results: percentage of harbor porpoise click detections across the tidal cycle at one of our reef sites.
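For anyone curious, a figure like this can be thrown together quickly in R once the detections are tallied. Here is a sketch with made-up column names, assuming one row per recording period with its tidal phase and click count.

library(dplyr)
library(ggplot2)

tide_summary <- detections %>%
  count(tide_phase, wt = n_clicks, name = "clicks") %>%   # total clicks per tidal phase
  mutate(pct = 100 * clicks / sum(clicks))

ggplot(tide_summary, aes(x = tide_phase, y = pct)) +
  geom_col() +
  labs(x = "Tidal phase", y = "Percent of harbor porpoise detections")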

Now, I won’t promise that I’ll be an excellent coder by the end of the winter, but I think I have a good chance of being able to mark the “proficient” box next to MATLAB and R on my first job application. Yet, whatever your reason for learning to code – whether you are an undergraduate hoping to get ahead for graduate school, a graduate student hoping to escape the inevitable (like me), or just someone who thinks getting code to work properly is a fun game – my advice to you is this:

Google first. If that fails, take mental breaks. Revisit the problem later. Think through all possible sources of error. Ask around for help. Then, when you finally fix the bug or get the code to work the way you would like it to, throw a mini-party. After it’s all over, take a deep breath and go again. Remember, you are not alone!

Happy coding this winter GEMM Lab readers – and I wish you lots of celebratory dancing!

Following Tracks: A Summer of Research in Quantitative Ecology

**GUEST POST** written by Irina Tolkova from the University of Washington.

R, a programming language and software for statistical analysis, gives me an error message.

I mull it over. Revise my code. Run it again.

Hey, look! Two error messages.

I’m Irina, and I’m working on summer research in quantitative ecology with Dr. Leigh Torres in the GEMM Lab. Ironically, as much as I’m interested in the environment and the life inhabiting it, my background is actually in applied math, and a bit in computer science.


(Also, my background is the sand dunes of Florence, OR, which are downright amazing.)

When I mention this in the context of marine research, I usually get a surprised look. But from firsthand experience, the mindsets and skills developed in those areas can actually be very useful for ecology. This is partly because both math and computer science develop a problem-solving approach that can apply to many interdisciplinary contexts, and partly because ecology itself is becoming increasingly influenced by technology.

Personally, I’m fascinated by the advancement in environmentally oriented sensors and trackers, and admire the inventors’ cleverness in the way they extract useful information. I’ve heard about projects with unmanned ocean gliders that fly through the water, taking conductivity, temperature, and depth measurements (the Seaglider project by APL at the University of Washington), which can be used for oceanographic mapping. Arrays of hydrophones along the coast detect and recognize marine mammals through bioacoustics (OSU Animal Bioacoustics Lab), allowing for analysis of their population distributions and potentially their movement. In the GEMM Lab, I learned about light, small GPS loggers that can be put on wildlife to learn about their movement, and even smaller, lighter tags that estimate an animal’s general position from the times of sunset and sunrise. Finally, scientists even made artificial nest mounds that hid a scale for recording the weight of breeding birds — looking at the data, I could see a distinctive sawtooth pattern, since the birds lost weight as they incubated the egg and gained weight after coming home from a foraging trip…

On the whole, I’m really hopeful for the ecological opportunities opened up by technology. But the information coming in from sensors can be both a blessing and a curse, because — unlike manually collected data — the sample sizes tend to be massive. For statistical analysis, this is great! For actually working with the data… more difficult. For my project, this trade-off shows as R and Excel crash over the hundreds of thousands of points in my dataset… what dataset, you might ask? Albatross GPS tracking data.

In 2011, 2012, and 2013, a group of scientists (including Dr. Leigh!) tagged grey-headed albatrosses at Campbell Island, New Zealand, with small GPS loggers. This was done in the summer months, when the birds were breeding, so the GPS tracks represent the birds’ flights as they incubated and raised their chicks. A cool fact about albatrosses: they only raise one chick at a time! As a result, the survival of the population is very dependent on chick survival, which means that the health of the albatrosses during the breeding season, and in part their ability to find food, is critical for the population’s sustainability. So, my research question is: what environmental variables determine where these albatrosses choose to forage?

The project naturally breaks up into two main parts.

  • How can we quantify this “foraging effort” over a trajectory?
  • What is the statistical relationship between this “foraging effort metric” and environmental variables?

Luckily, R is pretty good for both data manipulation and statistical analysis, and that’s what I’m working on now. I’ve just about finished part (1), and will be moving on to part (2) in the coming week. For a start, here are some color-coded plots showing two different ways of measuring the “foraging value” over one GPS track:

Two color-coded versions of one albatross GPS track, showing two different ways of measuring foraging value.
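To give a flavor of part (1), here is one common movement-based proxy (not necessarily the metric behind the plots above): speed and turning angle computed directly from the GPS fixes, with slow, tortuous movement taken as a rough signal of foraging.

# Hypothetical single-bird track: projected x/y in km, POSIXct timestamps, sorted by time
dx <- diff(track$x)
dy <- diff(track$y)
dt <- as.numeric(diff(track$datetime), units = "secs")

speed   <- sqrt(dx^2 + dy^2) / dt          # km per second for each step
heading <- atan2(dy, dx)
turn    <- diff(heading)
turn    <- atan2(sin(turn), cos(turn))     # wrap turning angles into (-pi, pi]

# One crude index: slow steps with sharp turns score high (possible area-restricted search)
foraging_index <- c(NA, abs(turn)) / (speed + 1e-9)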

Most of my time goes into writing code, and, of course, debugging. This might sound a bit dull, but the anticipation of new results, graphs, and questions is definitely worth it. Occasionally, that anticipation is met with a result or plot that I wasn’t quite expecting. For example, I was recently attempting to draw the predicted spatial distribution of an albatross population. I fixed some bugs. The code ran. A plot window opened up. And showed this:

An unexpected plot of circles: R’s abstract art.

I stared at my laptop for a moment, closed it, and got some hot tea from the lab’s electronic kettle, all the while wondering how R came up with this abstract art.

All in all, while I spend most of my time programming, my motivation comes from the wildlife I hope to work for. And as any other ecologist, I love being out there on the Oregon coast, with the sun, the rain, sand, waves, valleys and mountains, cliff swallows and grey whales, and the rest of our fantastic wild outdoors.

