Data Wrangling to Assess Data Availability: A Data Detective at Work

By Alexa Kownacki, Ph.D. Student, OSU Department of Fisheries and Wildlife, Geospatial Ecology of Marine Megafauna Lab

Data wrangling, in my own loose definition, is the necessary combination of both data selection and data collection. Wrangling your data requires accessing and then assessing your data. Data collection is just what it sounds like: gathering all data points necessary for your project. Data selection is the process of cleaning and trimming data for final analyses; it is a whole new can of worms that requires decision-making and critical thinking. During this process of data wrangling, I discovered there are two major avenues to obtain data: 1) you collect it, which frequently requires an exorbitant amount of time in the field, in the lab, and/or behind a computer, or 2) other people have already collected it, and through collaboration you put it to good use (often a different use than originally intended). The latter approach may result in the collection of so much data that you must decide which data should be included to answer your hypotheses. This process of data wrangling is the hurdle I am facing at this moment. I feel like I am a data detective.

Data wrangling illustrated by members of the R-programming community. (Image source: R-bloggers.com)

My project focuses on assessing the health conditions of the two ecotypes of bottlenose dolphins in the waters from Ensenada, Baja California, Mexico to San Francisco, California, USA between 1981 and 2015. During the government shutdown, much of my data was inaccessible because it was in the possession of my collaborators at federal agencies. However, now that the shutdown is over, my data is flowing in, and my questions are piling up. I can now begin to look at where these animals have been sighted over the past decades, which ecotypes have higher contaminant levels in their blubber, which animals have higher stress levels and whether these are related to geospatial location, where animals are more susceptible to human disturbance, whether sex plays a role in stress or contaminant load levels, which environmental variables influence stress levels and contaminant levels, and more!

Alexa, alongside collaborators, photographing transiting bottlenose dolphins along the coastline near Santa Barbara, CA in 2015 as part of the data collection process. (Image source: Nick Kellar).

Over the last two weeks, I was emailed three separate Excel spreadsheets representing three datasets that contain partially overlapping data. If Microsoft Access is foreign to you, I would compare this dilemma to a very confusing exam question of “matching the word with the definition”, except with the words being in different languages from the definitions. If you have used Microsoft Access databases, you probably know the system of querying and matching data in different databases. Well, imagine trying to do this with Excel spreadsheets because the databases are not linked. Now you can see why I need to take a data management course and start using platforms other than Excel to manage my data.

A visual interpretation of trying to combine datasets being like matching the English definition to the Spanish translation. (Image source: Enchanted Learning)

In the first dataset, there are 6,136 sightings of Common bottlenose dolphins (Tursiops truncatus) documented in my study area. Some years have no sightings, some years have fewer than 100 sightings, and other years have over 500 sightings. In another dataset, there are 398 bottlenose dolphin biopsy samples collected between the years of 1992-2016 in a genetics database that can provide the sex of the animal. The final dataset contains records of 774 bottlenose dolphin biopsy samples collected between 1993-2018 that could be tested for hormone and/or contaminant levels. Some of these samples have identification numbers that can be matched to the other dataset. Within these cross-reference matches there are conflicting data in terms of amount of tissue remaining for analyses. Sorting these conflicts out will involve more digging from my end and additional communication with collaborators: data wrangling at its best. Circling back to what I mentioned in the beginning of this post, this data was collected by other people over decades and the collection methods were not standardized for my project. I benefit from years of data collection by other scientists and I am grateful for all of their hard work. However, now my hard work begins.
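To make the cross-referencing step concrete, here is a minimal sketch in Python of matching two sample tables on a shared identification number and flagging records where the amount of remaining tissue disagrees. It is an illustration only, not the workflow used in this project: the column names (`sample_id`, `tissue_mg`), the IDs, and all values are invented for the example.

```python
# Hypothetical records: every ID, column name, and value here is invented.
genetics = [
    {"sample_id": "TT-001", "sex": "F", "tissue_mg": 12.0},
    {"sample_id": "TT-002", "sex": "M", "tissue_mg": 0.0},
    {"sample_id": "TT-003", "sex": "F", "tissue_mg": 5.5},
]
biopsies = [
    {"sample_id": "TT-002", "tissue_mg": 3.0},
    {"sample_id": "TT-003", "tissue_mg": 5.5},
    {"sample_id": "TT-004", "tissue_mg": 8.0},
]


def cross_reference(left, right, key="sample_id", field="tissue_mg"):
    """Match rows of `left` to rows of `right` on `key`.

    Returns three lists: IDs whose `field` values agree, tuples of
    (ID, left value, right value) where they conflict, and IDs from
    `left` that have no match in `right`.
    """
    right_by_id = {row[key]: row for row in right}
    matched, conflicts, unmatched = [], [], []
    for row in left:
        other = right_by_id.get(row[key])
        if other is None:
            unmatched.append(row[key])
        elif row[field] != other[field]:
            conflicts.append((row[key], row[field], other[field]))
        else:
            matched.append(row[key])
    return matched, conflicts, unmatched


matched, conflicts, unmatched = cross_reference(genetics, biopsies)
print("agree:", matched)
print("conflict:", conflicts)
print("only in genetics table:", unmatched)
```

The conflict list is the part that still needs a human: the code can show that two records disagree, but deciding which value is right takes a conversation with the people who hold the samples.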

The cutest part of data wrangling: finding adorable images of bottlenose dolphins, photographed during a coastal survey. (Image source: Alexa Kownacki).

There is also a large amount of data that I downloaded from federally-maintained websites. For example, dolphin sighting data from research cruises are available for public access from the OBIS (Ocean Biogeographic Information System) Sea Map website. It boasts 5,927,551 records from 1,096 data sets containing information on 711 species with the help of 410 collaborators. This website is incredible: it allows you to search through different data criteria, download the data in a variety of formats, and explore an interactive map of the data. You can explore this at your leisure, but I want to point out the sheer amount of data. In my case, the OBIS Sea Map website is only one major platform among many sources of data that were already collected, not specifically for me or my project, but that I will put to use. As a follow-up to using data collected by other scientists, it is critical to give credit where credit is due. One of the benefits of using this website is that it provides information about how to properly credit the collaborators when downloading data. See below for an example:

Example citation for a dataset (Dataset ID: 1201):

Lockhart, G.G., DiGiovanni Jr., R.A., DePerte, A.M. 2014. Virginia and Maryland Sea Turtle Research and Conservation Initiative Aerial Survey Sightings, May 2011 through July 2013. Downloaded from OBIS-SEAMAP (http://seamap.env.duke.edu/dataset/1201) on xxxx-xx-xx.

Citation for OBIS-SEAMAP:

Halpin, P.N., A.J. Read, E. Fujioka, B.D. Best, B. Donnelly, L.J. Hazen, C. Kot, K. Urian, E. LaBrecque, A. Dimatteo, J. Cleary, C. Good, L.B. Crowder, and K.D. Hyrenbach. 2009. OBIS-SEAMAP: The world data center for marine mammal, sea bird, and sea turtle distributions. Oceanography 22(2):104-115

Another federally-maintained data source that boasts more data than I can quantify is the well-known ERDDAP website. After a few Google searches, I finally discovered that the acronym stands for Environmental Research Division’s Data Access Program. Essentially, this is the holy grail of environmental data for marine scientists. I have downloaded so much data from this website that Excel cannot open the csv files. Here is yet another reason why young scientists, like myself, need to transition out of using Excel and into data management systems that are developed to handle large-scale datasets. My downloads include everything from daily sea surface temperatures collected on every one-degree line of latitude and longitude over my entire study site from 1981 to 2015, to Ekman transport levels taken every six hours on every longitudinal degree line over my study area. I will add some environmental variables in species distribution models to see which account for the largest amount of variability in my data. The next step in data selection begins with statistics. It is important to find out whether there are highly correlated environmental factors prior to modeling data. Learn more about fitting cetacean data to models here.
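As a rough sketch of what screening for highly correlated environmental factors can look like, the Python below computes the Pearson correlation for every pair of variables and flags pairs above a cutoff. The variable names, the values, and the 0.7 cutoff are all assumptions made for illustration; they are not taken from the project data, and the appropriate cutoff depends on the modeling approach.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(variables, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    The default threshold of 0.7 is an assumed, illustrative cutoff.
    """
    flagged = []
    for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Invented example values. sst_anomaly_c is sst_c shifted by a constant,
# so those two variables are (by construction) perfectly correlated.
env = {
    "sst_c": [12.1, 13.4, 15.0, 16.2, 17.8],
    "sst_anomaly_c": [-1.9, -0.6, 1.0, 2.2, 3.8],
    "ekman_transport": [40.0, 10.0, 55.0, 5.0, 30.0],
}

flagged = correlated_pairs(env)
for name_a, name_b, r in flagged:
    print(f"{name_a} vs {name_b}: r = {r:.2f}")
```

When a pair is flagged, one common choice is to keep only one of the two variables in the model, since including both makes it hard to say which one explains the variability.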

The ERDDAP website combined all of the average Sea Surface Temperatures collected daily from 1981-2018 over my study site into a graphical display of monthly composites. (Image Source: ERDDAP)

As you can imagine, this amount of data from many sources and collaborators is equal parts daunting and exhilarating. Before I even begin the process of determining the spatial and temporal spread of dolphin sightings data, I have to identify which data points have sex identified from either hormone levels or genetics, which data points have contaminant levels already quantified, which samples still have tissue available for additional testing, and so on. Once I have cleaned up the datasets, I will import the data into the R programming environment. Then I can visualize my data in plots, charts, and graphs; this will help me identify outliers and potential challenges with my data, and, hopefully, start to see answers to my focal questions. Only then can I dive into the deep and exciting waters of species distribution modeling and more advanced statistical analyses. This is data wrangling and I am the data detective.

What people may think a ‘data detective’ looks like, when, in reality, it is a person sitting at a computer. (Image source: Elder Research)

Like the well-known phrase, “With great power comes great responsibility”, I believe that with great data, comes great responsibility, because data is power. It is up to me as the scientist to decide which data is most powerful at answering my questions.

Data is information. Information is knowledge. Knowledge is power. (Image source: thedatachick.com)

“The joy of paper acceptance” or “The GEMM Lab’s recent scientific contributions”

Dr. Leigh Torres, Geospatial Ecology of Marine Megafauna Lab, Marine Mammal Institute, Oregon State University

The GEMM Lab is always active – running field projects, leading outreach events, giving seminars, hosting conferences, analyzing data, mentoring young scientists, oh, the list goes on! (Yes, I am a proud lab PI). And recently, we have had a flurry of scientific papers either published or accepted for publication that I want to highlight. These are all great pieces of work that demonstrate our quality work, poignant and applied science, and strong collaborations. For each paper listed below I provide a short explanation of the study and implications. (Those names underlined are GEMM Lab members, and I provided a weblink where available.)

Sullivan, F.A. & Torres, L.G. Assessment of vessel disturbance to gray whales to inform sustainable ecotourism. The Journal of Wildlife Management, doi:10.1002/jwmg.21462.

This project integrated research and outreach regarding gray whale behavioral response to vessels. We simultaneously tracked whales and vessels, and data analysis showed significant differences in gray whale activity budgets when vessels were nearby. Working with stakeholders, we translated these results into community-developed vessel operation guidelines and an informational brochure to help mitigate impacts on whales.

Hann, C., Stelle, L., Szabo, A. & Torres, L. (2018) Obstacles and Opportunities of Using a Mobile App for Marine Mammal Research. ISPRS International Journal of Geo-Information, 7, 169. http://www.mdpi.com/2220-9964/7/5/169

This study demonstrates the strengths (fast and cheap data collection) and weaknesses (spatially biased data) of marine mammal data collected using the mobile app Whale mAPP. We emphasize the need for increased citizen science participation to overcome obstacles, which will enable this data collection method to achieve its great potential.

Barlow, D.R., Torres, L.G., Hodge, K., Steel, D., Baker, C.S., Chandler, T.E., Bott, N., Constantine, R., Double, M.C., Gill, P.C., Glasgow, D., Hamner, R.M., Lilley, C., Ogle, M., Olson, P.A., Peters, C., Stockin, K.A., Tessaglia-Hymes, C.T. & Klinck, H. (in press) Documentation of a New Zealand blue whale population based on multiple lines of evidence. Endangered Species Research. https://doi.org/10.3354/esr00891.

This study used genetics, acoustics, and photo-id to document a new population of blue whales around New Zealand that is genetically isolated, has high year-round residence, and shows limited connectivity to other blue whale populations. This discovery has important implications for population management, especially in the South Taranaki Bight region of New Zealand where the whales forage among industrial activity.

Burnett, J.D., Lemos, L., Barlow, D.R., Wing, M.G., Chandler, T.E. & Torres, L.G. (in press) Estimating morphometric attributes of baleen whales with photogrammetry from small UAS: A case study with blue and gray whales. Marine Mammal Science.

Here we developed methods to measure whale body morphometrics using images captured via Unmanned Aerial Systems (UAS; ‘drones’). The paper presents three freely available analysis programs and a protocol to help the community standardize methods, assess and minimize error, and compare data between studies.

Holdman, A.K., Haxel, J.H., Klinck, H. & Torres, L.G. (in press) Acoustic monitoring reveals the times and tides of harbor porpoise distribution off central Oregon, USA. Marine Mammal Science.

Right off the Newport, Oregon harbor entrance we listened for harbor porpoises at two locations using hydrophones. We found that porpoise presence at the shallow rocky reef site corresponds with the ebb tidal phase, while harbor porpoise presence at the deeper site with sandy bottom was associated with night-time foraging. It appears that harbor porpoises change their spatial and temporal patterns of habitat use to increase their foraging efficiency.

Derville, S., Torres, L.G., Iovan, C. & Garrigue, C. (in press) Finding the right fit: Comparative cetacean distribution models using multiple data sources. Diversity and Distributions.

Species distribution models (SDM) are used widely to understand the drivers of cetacean distribution patterns and to predict their space-use patterns. Using humpback whale sighting datasets in New Caledonia, this study explores the performance of different SDM algorithms (GAM, BRT, MAXENT, GLM, SVM) and methods of modeling presence-only data. We highlight the importance of controlling for model overfitting and thorough model validation.

Bishop, A.M., Brown, C., Rehberg, M., Torres, L.G. & Horning, M. (in press) Juvenile Steller sea lion (Eumetopias jubatus) utilization distributions in the Gulf of Alaska. Movement Ecology.

This study examines the distribution patterns of juvenile Steller sea lions in the Gulf of Alaska to gain a better understanding of the habitat needs of this vulnerable demographic group within a threatened population. Utilization distributions were derived for 84 tagged sea lions, which showed sex, seasonal and spatial differences. This information will support the development of a species recovery plan.

This comic seemed appropriate here. Thanks for everyone’s hard work!

Coastal oceanography takes patience

Joe Haxel, Acoustician, Assistant Professor, CIMRS/OSU

Greetings GEMM Lab blog readers. My name is Joe Haxel and I’m a close collaborator with Leigh and other GEMM Lab members on the gray whale ecology, physiology and noise project off the Oregon coast. Leigh invited me for a guest blog appearance to share some of the acoustics work we’ve been up to, and as you’ve probably guessed by now, my specialty is in ocean acoustics. I’m a PI in NOAA’s Pacific Marine Environmental Laboratory’s Acoustics Program and OSU’s Cooperative Institute for Marine Resources Studies where I use underwater sound to study a variety of earth and ocean processes.

As a component of the gray whale noise project, during the field seasons of 2016 and 2017 we recorded some of the first measurements of ambient sound in the shallow coastal waters off Oregon between 7 and 20 meters depth. In the passive ocean acoustics world this is really shallow, and with that comes all kinds of instrument and logistical challenges, which is probably one of the main reasons there is little or no acoustic baseline information in this environment.

For instance, one of the significant challenges is rooted in the hydrodynamics surrounding mobile recording systems like the drifting hydrophone we used during the summer field season in 2016 (Fig. 1). Decoupling motion of the surface buoy (e.g., caused by swell and waves) from the submerged hydrophone sensor is critical, and here’s why. Hydrophones convert pressure fluctuations at the sensor/water interface to a calibrated voltage recorded by a logging system. Turbulence resulting from moving the sensor up and down in the water column with surface waves introduces non-acoustic pressure changes that severely contaminate the data for noise level measurements. Vertical and horizontal wave motions are constantly acting on the float, so we needed to engineer compliance between the surface float and the suspended hydrophone sensor to decouple these accelerations. To overcome this, we employed three concepts in our drifting hydrophone design: 1) a 10 cm diameter by 3 m long spar buoy that provided flotation for the system (spar buoys are less affected by wave motion accelerations than most other types of surface flotation, which have larger horizontal profiles and drag); 2) a dynamic shock cord that could stretch up to double its resting length to accommodate vertical motion of the spar buoy; and 3) a heave plate that significantly reduced any vertical motion of the hydrophone suspended below it. This was a very effective design, and although somewhat cumbersome in transport with the RHIB between deployment sites, the acoustic data we collected over 40 different drifts around Newport and Port Orford in 2016 was clean, high quality and devoid of system-induced contamination.

Figure 1. The drifting hydrophone system used for 40 different drifts recording ambient noise levels in 7-20 m depths in the Newport and Port Orford, OR coastal areas.

Spatial information from the project’s first-year acoustic recordings using the drifting hydrophone system helped us choose sites for the fixed hydrophone stations in 2017. Now that we had some basic information on the spatial variability of noise within the study areas, we could focus on the temporal objectives of characterizing the range of acoustic conditions experienced by gray whales over the course of the entire foraging season at these sites in Oregon. In 2017 we deployed “lander” style instrument frames, each equipped with a single, omni-directional hydrophone custom built by Haru Matsumoto at our NOAA/OSU Acoustics lab (Fig. 2). The four hydrophone stations were positioned near each of the ports (Yaquina Bay and Port Orford) and, in partnership with the Oregon Department of Fish and Wildlife Marine Reserves program, in the Otter Rock Marine Reserve and the Redfish Rocks Marine Reserve. The hydrophones were programmed on a 20% duty cycle, recording 12 minutes of every hour at 32 kHz sample rate, providing spectral information in the frequency band from 10 Hz up to 13 kHz.
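For readers who want to check the recording numbers, the short Python sketch below works through the arithmetic using only the figures quoted above (12 minutes per hour, 32 kHz sample rate, 13 kHz upper band edge). The variable names are made up for the example. The Nyquist limit of half the sample rate is a general signal-processing fact; why the usable band stops at 13 kHz rather than the full 16 kHz is not stated here, so the code only confirms that the quoted band fits under the limit.

```python
# Figures quoted in the text above.
RECORD_MINUTES_PER_HOUR = 12
SAMPLE_RATE_HZ = 32_000
UPPER_BAND_EDGE_HZ = 13_000

# Fraction of each hour spent recording.
duty_cycle = RECORD_MINUTES_PER_HOUR / 60

# Highest frequency a sampled signal can represent (half the sample rate).
nyquist_hz = SAMPLE_RATE_HZ / 2

# Recording time accumulated per day of deployment, in hours.
hours_recorded_per_day = 24 * duty_cycle

# Number of samples in one 12-minute recording.
samples_per_recording = SAMPLE_RATE_HZ * RECORD_MINUTES_PER_HOUR * 60

print(f"duty cycle: {duty_cycle:.0%}")
print(f"Nyquist frequency: {nyquist_hz / 1000:.0f} kHz")
print(f"band edge below Nyquist: {UPPER_BAND_EDGE_HZ < nyquist_hz}")
print(f"hours recorded per day: {hours_recorded_per_day:.1f}")
print(f"samples per recording: {samples_per_recording:,}")
```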

Figure 2. The hydrophone (black cylinder) on its lander frame ready for deployment.

Here’s where the story gets interesting. In my experience so far putting out gear off the Oregon coast, anything that has a surface expression and is left out for more than a couple of weeks is going to have issues. Due to funding constraints, I had to challenge that theory this year and deploy 2 of the units with a surface buoy. This is not typically what we do with our equipment since it usually stays out for up to 2 years at a time, is sensitive, and expensive. The 2 frames with a surface float were going to be deployed in Marine Reserves far enough from the traffic lanes of the ports and in areas with significantly less traffic and presumably no fishing pressure. The surface buoy consisted of an 18-inch-diameter hard plastic float connected to an anchor that was offset from the instrument frame by a 150-foot weighted groundline. The gear was deployed off Newport in June and Port Orford in July. What could go wrong?

After monthly buoy checks by the project team, including GPS positions and buoy cleanings, my hopes were pretty high that the surface buoy systems might actually make it through the season, with recoveries scheduled for mid-October. Had I gambled and won? Nope. The call came in September from Leigh that one of the whale watching outfits in Depoe Bay had recovered a free-floating buoy matching ours. Bummer. Alternative recovery plans were initiated, and this is where things began to get hairy. Fortunately, we had an ace in our back pocket. We have collaborators at the Oregon Coast Aquarium (OCA) who have a top-notch research diving team led by Jim Burke. In the last week of October, they performed a successful search dive on the missing unit near Gull Rock and attached a new set of floats directly to the instrument frame. The divers were in the water for a short 20 minutes thanks to the good series of marks recorded during the buoy checks throughout the summer (Fig. 3).

Figure 3. OCA divers, Jenna and Doug, heading out for a search dive to locate and mark the Gull Rock hydrophone lander.

We had surface marker floats on the frame, but there was a new problem. Video taken by Jenna and Doug from the OCA dive team revealed the landers were pretty sanded in from a couple of recent October storms (Fig. 4). Ugghhh!

Figure 4. Sanded in lander at Gull Rock. Notice the sand dollars and bull kelp wrapped on the frame.

Alternative recovery plan adjustment: we’re gonna need a diver-assisted recovery with 2 boats, one to bring a dive team to air jet the sand out away from the legs of the frame and another larger vessel with pulling power to recover the freed lander. Enter the R/V Pacific Surveyor and Capt. Al Pazar. Al, Jim and I came up with a new recovery plan and only needed a decent weather window of a few hours to get the job done. Piece of cake in November off the Oregon coast, right?

The weather finally cooperated in early December, in line with the OCA dive team and R/V Pacific Surveyor’s availability. The 2 vessels and crew headed up to Gull Rock for the first recovery operation of the day. At first we couldn’t locate the surface floats. Oh no. Had the rough fall/winter weather and high seas since late October been too much for the crab floats? As it turns out, we eventually found the floats about 200 m to the east but couldn’t initially see them in the glare and whitecapping conditions that morning. The lander frame had broken loose from its weakened anchor legs in the heavy weather (as it was designed to do through an aluminum/stainless steel galvanic reaction over time) and rolled or hopped eastward by about 200 m (Fig. 5). Oh dear!

Figure 5. A hydrophone lander after recovery. Notice all but 1 of the concrete anchor legs missing from the recovered lander and the amount of bio-fouling on the hydrophone (compared to Figure 2).

Thankfully, the hydrophone was well protected, and no air jetting was required. With OCA divers out of the water and clear, the Pacific Surveyor headed over to the floats and easily pulled the lander frame and hydrophone on board (Fig. 6). Yippee!

On to the next hydrophone station. This station was deployed ~800 m west of the south reef off of South Beach near the Yaquina Bay port entrance. It was deployed entirely subsurface and was outfitted with an acoustic release transponder that I could communicate with from the surface and command to release a pop-up messenger float and line for eventual recovery of the instrument frame. Once on station, communication with the release was established easily (a good start) and we began ranging and moving the OCA vessel Gracie Lynn into a position within about 2 water depths of the unit (~40 m). I gave the command to the transponder and the submerged release confirmed it was free of its anchor and heading for the surface, but it never made it. Uh oh. Turns out this lander had also broken free of its anchored legs and rolled/hopped 800 m eastward until it was pinned up against the boulder structure of the south reef. Amazingly, OCA divers Jenna and Doug located the messenger float ~5 m below the surface; the messenger line had been fouled by the rolling frame, so the float could not reach the surface. They dove down the messenger line and attached a new recovery line to the lander frame, and the Pacific Surveyor hauled up the frame and hydrophone intact (Fig. 6). Double recovery success!

Figure 6. R/V Pacific Surveyor recovering hydrophone landers off Gull Rock and South Beach.

The hydrophone data from both systems looks outstanding and analysis is underway. This recovery effort took a huge amount of patience and the coordination of 3 busy groups (NOAA/OSU, OCA, Capt. Al). Thanks to these incredible collaborations and some heroic diving from Jim Burke and his OCA dive team, we now have a unique and unprecedented shallow water passive acoustic data set from the energetic waters off the Oregon coast.

So that’s some of the story from the 2016 and 2017 field season acoustic point of view. I’ll save the less exciting, but equally successful instrument recoveries from Port Orford for another time.