Data Wrangling to Assess Data Availability: A Data Detective at Work

By Alexa Kownacki, Ph.D. Student, OSU Department of Fisheries and Wildlife, Geospatial Ecology of Marine Megafauna Lab

Data wrangling, in my own loose definition, is the necessary combination of both data selection and data collection. Wrangling your data requires accessing then assessing your data. Data collection is just what it sounds like: gathering all data points necessary for your project. Data selection is the process of cleaning and trimming data for final analyses; it is a whole new can of worms that requires decision-making and critical thinking. During this process of data wrangling, I discovered there are two major avenues to obtain data: 1) you collect it, which frequently requires an exorbitant amount of time in the field, in the lab, and/or behind a computer, or 2) other people have already collected it, and through collaboration you put it to good use (often a different use than its initial intent). The latter approach may result in the collection of so much data that you must decide which data should be included to test your hypotheses. This process of data wrangling is the hurdle I am facing at this moment. I feel like I am a data detective.

Data wrangling illustrated by members of the R-programming community. (Image source: R-bloggers.com)

My project focuses on assessing the health conditions of the two ecotypes of bottlenose dolphins in the waters from Ensenada, Baja California, Mexico to San Francisco, California, USA between 1981 and 2015. During the government shutdown, much of my data was inaccessible, seeing as it was in the possession of my collaborators at federal agencies. However, now that the shutdown is over, my data is flowing in, and my questions are piling up. I can now begin to look at where these animals have been sighted over the past decades, which ecotypes have higher contaminant levels in their blubber, which animals have higher stress levels and whether these are related to geospatial location, where animals are more susceptible to human disturbance, whether sex plays a role in stress or contaminant load levels, which environmental variables influence stress levels and contaminant levels, and more!

Alexa, alongside collaborators, photographing transiting bottlenose dolphins along the coastline near Santa Barbara, CA in 2015 as part of the data collection process. (Image source: Nick Kellar).

Over the last two weeks, I was emailed three separate Excel spreadsheets representing three datasets that contain partially overlapping data. If Microsoft Access is foreign to you, I would compare this dilemma to a very confusing exam question of “matching the word with the definition”, except with the words being in a different language from the definitions. If you have used Microsoft Access databases, you probably know the system of querying and matching data in different databases. Well, imagine trying to do this with Excel spreadsheets because the databases are not linked. Now you can see why I need to take a data management course and start using platforms other than Excel to manage my data.
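The matching problem described above is essentially a join on a shared identifier. The following is only a minimal sketch in Python, not the workflow actually used for this project: the column names (`id`, `sample_id`), the ID formats, and the records are all hypothetical, and it assumes the spreadsheets have been exported to rows of key-value pairs.

```python
import re

def normalize_id(raw):
    """Reduce an ID to uppercase letters and digits, e.g. ' tt-001 ' -> 'TT001'."""
    return re.sub(r"[^A-Z0-9]", "", raw.upper())

def join_on_id(left_rows, right_rows, left_key, right_key):
    """Join two lists of row dicts on a normalized ID column.

    Returns the joined rows and the left-hand IDs that found no partner.
    """
    right_by_id = {normalize_id(r[right_key]): r for r in right_rows}
    joined, unmatched = [], []
    for row in left_rows:
        match = right_by_id.get(normalize_id(row[left_key]))
        if match is None:
            unmatched.append(row[left_key])
        else:
            joined.append({**row, **match})
    return joined, unmatched

# Hypothetical rows standing in for two exported spreadsheets.
sheet_a = [
    {"id": "tt-001", "year": 1998},
    {"id": "TT 007", "year": 2004},
]
sheet_b = [
    {"sample_id": "TT001", "sex": "F"},
]

joined, unmatched = join_on_id(sheet_a, sheet_b, "id", "sample_id")
print(joined)     # rows present in both sheets, with columns combined
print(unmatched)  # IDs that need a closer look
```

Keeping the unmatched IDs, rather than silently dropping them, gives a list of records to follow up on with collaborators.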

A visual interpretation of trying to combine datasets being like matching the English definition to the Spanish translation. (Image source: Enchanted Learning)

In the first dataset, there are 6,136 sightings of common bottlenose dolphins (Tursiops truncatus) documented in my study area. Some years have no sightings, some years have fewer than 100 sightings, and other years have over 500 sightings. In another dataset, a genetics database that can provide the sex of the animal, there are 398 bottlenose dolphin biopsy samples collected between 1992 and 2016. The final dataset contains records of 774 bottlenose dolphin biopsy samples collected between 1993 and 2018 that could be tested for hormone and/or contaminant levels. Some of these samples have identification numbers that can be matched to the other dataset. Within these cross-reference matches there are conflicting data in terms of the amount of tissue remaining for analyses. Sorting these conflicts out will involve more digging on my end and additional communication with collaborators: data wrangling at its best. Circling back to what I mentioned at the beginning of this post, this data was collected by other people over decades and the collection methods were not standardized for my project. I benefit from years of data collection by other scientists and I am grateful for all of their hard work. However, now my hard work begins.
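Finding the cross-referenced samples whose records disagree can be sketched as follows. This is an illustration under stated assumptions, not the author's method: the field names (`sample_id`, `tissue_g`), the sample IDs, and the tissue amounts are invented for the example.

```python
def find_conflicts(genetics_rows, hormone_rows, key="sample_id", field="tissue_g"):
    """Return the IDs found in both datasets, plus those whose `field` values disagree."""
    genetics_by_id = {row[key]: row for row in genetics_rows}
    matched, conflicts = [], []
    for row in hormone_rows:
        other = genetics_by_id.get(row[key])
        if other is None:
            continue  # sample appears in only one dataset
        matched.append(row[key])
        if other[field] != row[field]:
            # Record both values so the discrepancy can be checked by hand.
            conflicts.append((row[key], other[field], row[field]))
    return matched, conflicts

# Hypothetical records; real data would come from the exported spreadsheets.
genetics = [
    {"sample_id": "TT-001", "sex": "F", "tissue_g": 0.5},
    {"sample_id": "TT-002", "sex": "M", "tissue_g": 0.2},
    {"sample_id": "TT-003", "sex": "F", "tissue_g": 0.0},
]
hormones = [
    {"sample_id": "TT-002", "tissue_g": 0.2},
    {"sample_id": "TT-003", "tissue_g": 0.3},
    {"sample_id": "TT-004", "tissue_g": 0.1},
]

matched, conflicts = find_conflicts(genetics, hormones)
print(matched)    # ['TT-002', 'TT-003']
print(conflicts)  # [('TT-003', 0.0, 0.3)]
```

The code only surfaces the disagreement; deciding which record is right still takes a conversation with whoever holds the tissue.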

The cutest part of data wrangling: finding adorable images of bottlenose dolphins, photographed during a coastal survey. (Image source: Alexa Kownacki).

There is also a large amount of data that I downloaded from federally-maintained websites. For example, dolphin sighting data from research cruises are available for public access from the OBIS (Ocean Biogeographic Information System) Sea Map website. It boasts 5,927,551 records from 1,096 data sets containing information on 711 species with the help of 410 collaborators. This website is incredible: it allows you to search through different data criteria, download the data in a variety of formats, and explore an interactive map of the data. You can explore this at your leisure, but I want to point out the sheer amount of data. In my case, the OBIS Sea Map website is only one major platform that contains many sources of data that have already been collected, not specifically for me or my project, but that I will put to use. As a follow-up to using data collected by other scientists, it is critical to give credit where credit is due. One of the benefits of using this website is that it provides information about how to properly credit the collaborators when downloading data. See below for an example:

Example citation for a dataset (Dataset ID: 1201):

Lockhart, G.G., DiGiovanni Jr., R.A., DePerte, A.M. 2014. Virginia and Maryland Sea Turtle Research and Conservation Initiative Aerial Survey Sightings, May 2011 through July 2013. Downloaded from OBIS-SEAMAP (http://seamap.env.duke.edu/dataset/1201) on xxxx-xx-xx.

Citation for OBIS-SEAMAP:

Halpin, P.N., A.J. Read, E. Fujioka, B.D. Best, B. Donnelly, L.J. Hazen, C. Kot, K. Urian, E. LaBrecque, A. Dimatteo, J. Cleary, C. Good, L.B. Crowder, and K.D. Hyrenbach. 2009. OBIS-SEAMAP: The world data center for marine mammal, sea bird, and sea turtle distributions. Oceanography 22(2):104-115

Another federally-maintained data source that boasts more data than I can quantify is the well-known ERDDAP website. After a few Google searches, I finally discovered that the acronym stands for Environmental Research Division’s Data Access Program. Essentially, this is the holy grail of environmental data for marine scientists. I have downloaded so much data from this website that Excel cannot open the csv files: everything from daily sea surface temperatures collected on every one-degree line of latitude and longitude from 1981-2015 over my entire study site to Ekman transport levels taken every six hours on every longitudinal degree line over my study area. Here is yet another reason why young scientists, like myself, need to transition out of using Excel and into data management systems that are developed to handle large-scale datasets. I will add some environmental variables to species distribution models to see which account for the largest amount of variability in my data. The next step in data selection begins with statistics. It is important to find out whether any environmental factors are highly correlated prior to modeling data. Learn more about fitting cetacean data to models here.
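Screening for highly correlated environmental factors can be sketched with a pairwise Pearson correlation. This is a minimal example, not the analysis used in this project: the variable names, the values, and the 0.7 cutoff are assumptions made for illustration, and the appropriate threshold depends on the model.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Assumes neither list is constant (a constant list has zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlated_pairs(variables, threshold=0.7):
    """List variable pairs whose absolute correlation meets the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 2)))
    return flagged

# Hypothetical values for three environmental variables at five locations.
env = {
    "sst_c": [14.0, 15.0, 16.0, 17.0, 18.0],
    "upwelling": [10.0, 8.0, 6.0, 4.0, 2.0],
    "chlorophyll": [1.0, 3.0, 2.0, 3.0, 1.0],
}

flagged = correlated_pairs(env)
print(flagged)  # [('sst_c', 'upwelling', -1.0)]
```

A flagged pair suggests keeping only one of the two variables in the model, or combining them, so that their effects are not counted twice.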

The ERDDAP website combined all of the average Sea Surface Temperatures collected daily from 1981-2018 over my study site into a graphical display of monthly composites. (Image Source: ERDDAP)
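A monthly composite like the one pictured can also be computed from a daily csv file that is too large for Excel, by reading it one row at a time instead of loading it all at once. The sketch below assumes a file with hypothetical `date` and `sst_c` columns and ISO-formatted dates; the layout of an actual ERDDAP download may differ.

```python
import csv
import io
from collections import defaultdict

def monthly_means(lines, date_col="date", value_col="sst_c"):
    """Stream csv rows and average the value column per YYYY-MM month."""
    totals = defaultdict(lambda: [0.0, 0])  # month -> [running sum, count]
    for row in csv.DictReader(lines):
        if row[value_col] in ("", "NaN"):
            continue  # skip missing observations
        month = row[date_col][:7]  # '1981-09-01' -> '1981-09'
        totals[month][0] += float(row[value_col])
        totals[month][1] += 1
    return {month: s / n for month, (s, n) in sorted(totals.items())}

# A tiny in-memory stand-in for a very large file.
sample = io.StringIO(
    "date,sst_c\n"
    "1981-09-01,18.0\n"
    "1981-09-02,19.0\n"
    "1981-10-01,17.0\n"
    "1981-10-02,NaN\n"
)

result = monthly_means(sample)
print(result)  # {'1981-09': 18.5, '1981-10': 17.0}
```

For a real file, passing an open file handle (for example `open(path, newline="")`) in place of `sample` keeps memory use small no matter how many rows the file has.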

As you can imagine, this amount of data from many sources and collaborators is equal parts daunting and exhilarating. Before I even begin the process of determining the spatial and temporal spread of dolphin sightings data, I have to identify which data points have sex identified from either hormone levels or genetics, which data points have contaminant levels already quantified, which samples still have tissue available for additional testing, and so on. Once I have cleaned up the datasets, I will import the data into the R programming environment. Then I can visualize my data in plots, charts, and graphs; this will help me identify outliers and potential challenges with my data, and, hopefully, start to see answers to my focal questions. Only then can I dive into the deep and exciting waters of species distribution modeling and more advanced statistical analyses. This is data wrangling and I am the data detective.
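The inventory step described above (which samples have sex identified, which have contaminants quantified, which still have tissue) can be sketched as a simple tally. The field names (`sex`, `contaminants_ng_g`, `tissue_g`) and the four records are hypothetical; they only illustrate the bookkeeping.

```python
def summarize_samples(samples):
    """Count how many samples meet each data-availability criterion."""
    counts = {
        "sex_known": 0,
        "contaminants_done": 0,
        "tissue_left": 0,
        "can_still_test": 0,  # tissue remains but contaminants not yet quantified
    }
    for s in samples:
        sex_known = s.get("sex") in ("F", "M")
        contaminants_done = s.get("contaminants_ng_g") is not None
        tissue_left = (s.get("tissue_g") or 0) > 0
        counts["sex_known"] += int(sex_known)
        counts["contaminants_done"] += int(contaminants_done)
        counts["tissue_left"] += int(tissue_left)
        counts["can_still_test"] += int(tissue_left and not contaminants_done)
    return counts

# Hypothetical biopsy records.
samples = [
    {"id": "A", "sex": "F", "contaminants_ng_g": 120.5, "tissue_g": 0.0},
    {"id": "B", "sex": None, "contaminants_ng_g": None, "tissue_g": 0.3},
    {"id": "C", "sex": "M", "contaminants_ng_g": None, "tissue_g": 0.1},
    {"id": "D", "sex": "M", "contaminants_ng_g": 80.0, "tissue_g": 0.2},
]

summary = summarize_samples(samples)
print(summary)
# {'sex_known': 3, 'contaminants_done': 2, 'tissue_left': 3, 'can_still_test': 2}
```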

What people may think a ‘data detective’ looks like, when, in reality, it is a person sitting at a computer. (Image source: Elder Research)

Like the well-known phrase, “With great power comes great responsibility”, I believe that with great data comes great responsibility, because data is power. It is up to me as the scientist to decide which data is most powerful at answering my questions.

Data is information. Information is knowledge. Knowledge is power. (Image source: thedatachick.com)

 

Science (or the lack thereof) in the Midst of a Government Shutdown

By Alexa Kownacki, Ph.D. Student, OSU Department of Fisheries and Wildlife, Geospatial Ecology of Marine Megafauna Lab

In what is the longest government shutdown in the history of the United States, many people are impacted. Speaking from a scientist’s point of view, I acknowledge the scientific community is one of many groups that are being majorly obstructed. Here at the GEMM Laboratory, all of us are feeling the frustrations of the federal government grinding to a halt in different ways. Our research spans great distances: from Dawn’s work on New Zealand blue whales that utilizes environmental data managed by our federal government, to new projects that cannot get federal permit approvals to start data collection, to many of Leigh’s projects on the Oregon coast of the USA that are funded by and collaborate with federal agencies. Even so, we all recognize that our science is affected by the shutdown. My research on common bottlenose dolphins is no exception: my academic funding is through the US Department of Defense; my collaborators are NOAA employees who contribute NOAA data; I use publicly-available, government-maintained data for additional variables; and I am part of a federally-funded public university. Ironically, my previous blog post about the intersection of science and politics seems to have become even more relevant in the past few weeks.

Many graduate students like me are feeling the crunch as federal agencies close their doors and operations. Most people have seen the headlines that allude to such funding-related issues. However, it’s important to understand what the funding in question is actually doing. Whether we see it or not, the daily operations of the United States federal government help science progress on a multitude of levels.

Federal research in the United States is critical. Most governmental branches support research; the most well-known agencies for doing so are the National Science Foundation (NSF), the US Department of Agriculture (USDA), the National Oceanic and Atmospheric Administration (NOAA), and the National Aeronautics and Space Administration (NASA). There are 137 executive agencies in the USA (cei.org). On a finer scale, NSF alone receives approximately 40,000 scientific proposals each year (nsf.gov).

If I play a word association game and I am given the word “science”, my response would be “data”. Data, even absence data, informs science. The largest aggregate of metadata with open resources lives in the centralized website, data.gov, which is maintained by the federal government. It is no longer accessible and instead directs you to a shutdown notice.

Here are a few more examples of science that has stopped in its tracks at lesser-known research entities operated by the federal government:

Currently, the National Weather Service (NWS) is unable to maintain or improve its advanced weather models. Therefore, along with those of us who include weather or climate aspects in our research, forecasters have less and less information on which to base their weather predictions. Prior to the shutdown, scientists were changing the data format of the Global Forecast System (GFS), the most advanced mathematical, computer-based weather modeling prediction system in the USA. Unfortunately, the GFS currently does not recognize much of the input data it is receiving. A model is only as good as its input data (as I am sure Dawn can tell you), and currently that means the GFS is very limited. Many NWS models are upgraded January-June to prepare for storm season later in the year. Therefore, there are long-term ramifications for the lack of weather research advancement in terms of global health and safety. (https://www.washingtonpost.com/weather/2019/01/07/national-weather-service-is-open-your-forecast-is-worse-because-shutdown/?noredirect=on&utm_term=.5d4c4c3c1f59)

An example of one output from the GFS model. (Source: weather.gov)

The Food and Drug Administration (FDA), a federal agency of the Department of Health and Human Services that is responsible for food safety, has reduced inspections. Because domestic meat and poultry are at the highest risk of contamination, their inspections continue, but by staff who are going without pay, according to the agency’s commissioner, Dr. Scott Gottlieb. Produce, dry foods, and other lower-risk consumables are being minimally inspected, if at all. Active research projects investigating food-borne illness that receive federal funding are at a standstill. Is your stomach doing flips yet? (https://www.nytimes.com/2019/01/09/health/shutdown-fda-food-inspections.html?rref=collection%2Ftimestopic%2FFood%20and%20Drug%20Administration&action=click&contentCollection=timestopics&region=stream&module=stream_unit&version=latest&contentPlacement=2&pgtype=collection)

An FDA field inspector examines imported gingko nuts–a process that is likely not happening during the shutdown. (Source: FDA.gov)

The National Park Service (NPS) recently made headlines with the post-shutdown acts of vandalism in the iconic Joshua Tree National Park. What you might not know is that the shutdown has also stopped a 40-year study that monitors how streams are recovering from acid rain. Scientists are barred from entering Shenandoah National Park, Virginia, and conducting sampling efforts in its remote streams. (http://www.sciencemag.org/news/2019/01/us-government-shutdown-starts-take-bite-out-science)

A map of the sampling sites that have been monitored since the 1980s for the Shenandoah Watershed Study and Virginia Trout Stream Sensitivity Study that cannot be accessed because of the shutdown. (Source: swas.evsc.virginia.edu)

NASA’s Stratospheric Observatory for Infrared Astronomy (SOFIA), better known as the “flying telescope”, has halted operations, and it will require over a week to bring back online once funding is restored. SOFIA usually soars into the stratosphere as a tool to study the solar system and collect data that ground-based telescopes cannot. (http://theconversation.com/science-gets-shut-down-right-along-with-the-federal-government-109690)

NASA’s Stratospheric Observatory for Infrared Astronomy (SOFIA) flies over the snowy Sierra Nevada mountains while the telescope gathers information. (Source: NASA/ Jim Ross).

It is important to remember that science happens outside of laboratories and field sites; it happens at meetings and conferences where great minds collaborate, brainstorm, and discover the best solutions to challenging questions. The shutdown has stopped most federal travel. The annual American Meteorological Society meeting and American Astronomical Society meeting were two of the scientific conferences in the USA that attract federal employees and took place during the shutdown. Conferences like these are crucial opportunities with lasting impacts on science. Think of all the impressive science that could have been sparked at those meetings. Instead, many sessions were cancelled, and most major agencies had zero representation (https://spacenews.com/ams-2019-overview/). Topics like lidar data applications, which are used in geospatial research such as some of the GEMM Laboratory’s projects, could not be discussed. The cascade effects of the shutdown show that science is interconnected and that without advancement, everyone’s research suffers.

It should be noted that early-career scientists are thought to be the most negatively impacted by this shutdown because of financial instability and job insecurity. The shutdown also casts a dark cloud on their futures in science: it is largely unknown whether they can support themselves, their families, and their research (https://eos.org/articles/federal-government-shutdown-stings-scientists-and-science). Graduate students, young professors, and new professionals are all feeling the pressure. Our lives are based on our research. When the funds that cover our basic research requirements and human needs do not come through as promised, we naturally become stressed.

An adult and a juvenile common bottlenose dolphin forage along the San Diego coastline in November 2018. (Source: Alexa Kownacki)

So, yes, funding, or the lack thereof, is hurting many of us. Federally-funded individuals are selling possessions to pay for rent, research projects are at a standstill, and people are at greater health and safety risks. But, also, science, with its hope of bettering the world, answering questions, and using higher thinking, is going backwards. Every day without progress puts us two days behind. At first glance, you may not think that my research on bottlenose dolphins is imperative to you or that the implications of the shutdown on this project are important. But, consider this: my study aims to quantify contaminants in common bottlenose dolphins that live in either nearshore or offshore waters. Furthermore, I study the short-term and long-term impacts of contaminants and other health markers on dolphin hormone levels. The nearshore common bottlenose dolphin stocks inhabit the highly-populated coastlines that many of us utilize for fishing and recreation. Dolphins are mammals that respond to stress and environmental hazards in ways similar to humans. So, those blubber hormone levels and contamination results might be more connected to your health and livelihood than they appear at first glance. The fact that I cannot download data from ERDDAP, reach my collaborators, or even access my data (that starts in the early 1980s) does impact you. Nearly everyone’s research is connected to each other’s at some level, and that, in turn, has lasting impacts on all people, scientists or not. As the shutdown persists, I continue to question how to work through these research hurdles. If anything, it has been a learning experience that I hope will end soon for many reasons, one being: for science.

Are bacteria important? What do we get by analyzing microbiomes?

By Leila Lemos, PhD candidate, Fisheries and Wildlife Department, OSU

As previously mentioned in one of Florence’s blog posts, the GEMM Lab holds monthly lab meetings, where we share updates about our research and discuss articles and advances in our field, among other activities.

In a past lab meeting we were asked to bring an article to discuss that had inspired us in the past to conduct research in the marine field or in our current position. I brought to the meeting a literature review regarding methodologies to overcome the challenges of studying conservation physiology in large whales [1]. This article discusses different non-invasive or minimally invasive matrices (e.g., feces, blow, skin/blubber) that can be gathered from whales, and what types of analyses could be carried out, as well as their pros and cons.

One of the possible analyses of fecal samples discussed in the article is characterization of the gut microflora (i.e., the bacterial gut community) via genetic analysis. Since my PhD project analyzes fecal samples to determine/quantify stress responses in gray whales, we have since discussed the possibility of integrating this extra parameter into our analysis.

But… what is the importance of analyzing the gut microflora of a whale? What is the relationship between microflora and stress responses? Should we really use our limited sample size, time and money to work on this extra analysis? To answer all of these questions, I began reading articles in the field to better understand its importance and what kind of research questions this analysis can answer.

The gut of a mammal comprises a natural habitat for a large and dynamic community of bacteria [2] that is first developed in early life. Colonization by facultative bacteria (i.e., aerobic bacteria) begins at birth [3], and later, anaerobic bacteria also colonize the gut. In humans, by 1 year of age, the microbiome should have a stable adult-like signature (Fig. 1).

Figure 01: Development of the microbiome in early life.
Source: [3]

The gut bacterial community is important for the physiology and pathology of its host and plays an important role in mammal digestion and health [2]. It is responsible for many metabolic activities, including:

  • fermentation of non-digestible dietary residue and endogenous mucus [2];
  • recovery of energy [2];
  • recovery of absorbable nutrients [2];
  • cellulose digestion [4];
  • vitamin K synthesis [4];
  • important trophic effects on intestinal epithelia (cell proliferation and differentiation) [2];
  • angiogenesis promotion [4];
  • enteric nerve function [4];
  • immune structure [2];
  • immune function [2];
  • protection of the colonized host against invasion by alien microbes (barrier effect) [2].

Despite all the benefits, the bacterial community might also be potentially harmful when changes in the community composition (i.e., dysbiosis) occur due to the use of antibiotics, illness, stress, aging, lifestyle, bad dietary habits [4], and prolonged food and water deprivation [5]. Thus, potential pathological disorders might emerge when the microbiome community changes, such as allergy, obesity, diabetes, autism, multisystem organ failure, gastrointestinal and prostate cancers, inflammatory bowel diseases (IBD), and cardiovascular diseases [2, 4].

Changes in gut bacterial composition may also alter the brain-gut axis and the central nervous system (CNS) signaling [3]. More specifically, the core pathway affected is the hypothalamic-pituitary-adrenal (HPA) axis, which is activated by physical/psychological stressors. According to a previous study [6], the microbial community in the gut is critical for the development of an appropriate stress response. In addition, the microbial colonization in early life should occur within a certain time window, otherwise an abnormal development of the HPA axis might happen.

However, not only can the gut microbiome affect the HPA axis, but the opposite can also occur [3]. Signaling molecules released by the axis can alter the gastrointestinal tract (GIT) environment (i.e., motility, secretion, and permeability) [7]. Stress responses, as well as diseases, may also alter the gut permeability, causing bacteria to cross the epithelial barrier (reducing the overall numbers of bacteria in the gut) and activating immune responses that also alter the composition of the bacterial community in the gut [8, 9].

Figure 02: Communication between the brain, gut and microbiome in a healthy and in a stressed or diseased (mucosal inflammation) mammal.
Source: [3]

Thus, when thinking about whales, monitoring of the gut microflora might allow us to detect changes caused by factors such as aging, illness, prolonged food deprivation, and stressful events [2, 5]. However, since these are two-way factors, it is important to find an association between bacterial composition alterations and stressful events, such as the presence of predators (e.g., killer whales), illness (e.g., bad body condition), prolonged food deprivation (e.g., low prey availability and high competition), noise (e.g., noisy vessel traffic, fisheries opening and seismic surveys), and stressful reproductive status (e.g., pregnancy and lactating period). Examination of possible shifts in the gut microflora may be able to detect and be linked to many of these events, and also forecast possible chronic events within the population. In addition, the bacterial community monitoring study could aid in validating the hormone data (i.e., cortisol) we have been working with.

Therefore, the main research questions that arise in this context that can aid in elucidating the stress physiology in gray whales are:

  1. What is the microflora community content in guts of gray whales along the Oregon coast?
  2. Is it possible to detect shifts in the gut microflora from our gray whale fecal samples over time?
  3. How do gut microflora and cortisol levels correlate?
  4. Am I able to correlate shifts in gut microflora with any of the stressful events listed above?

We can answer so many other questions by analyzing the microbiome of baleen whales. Microbiomes are mainly correlated with host diet [10], so the composition of a microbiome can be associated with specific diets and functional gut capacity, and consequently, be linked to other animal populations, which helps to decode evolutionary questions. Results of a previous study on baleen whale microbiomes [10] point out that whales harbor unique gut microbiomes that are actually similar to those of terrestrial herbivores. Baleen whales and terrestrial herbivores have a shared physical structure of the GIT itself (i.e., multichambered foregut) and a shared role for fermentative metabolisms. The multichambered foregut of baleen whales fosters the maintenance of a gut microbiome that is capable of extracting relatively unavailable nutrients from zooplankton (i.e., chitin, “sea cellulose”).

Figure 03: The similarities between whale and terrestrial herbivore gut microbiomes: sea and land ruminants.
Source: [11]

Thus, the importance of studying the gut microbiome of a baleen whale is clear. Monitoring of the bacterial community and possible shifts can help us elucidate many questions regarding diet, overall health, stress physiology and evolution. Thinking about my PhD project, it may also help in validating our cortisol level results. I am confident that a microbiome analysis would significantly enhance my studies on the health and ecology of gray whales.

 

References

  1. Hunt, K.E., et al., Overcoming the challenges of studying conservation physiology in large whales: a review of available methods. Conservation Physiology, 2013. 1: p. 1–24.
  2. Guarner, F. and J.-R. Malagelada, Gut flora in health and disease. The Lancet, 2003. 360: p. 512–519.
  3. Grenham, S., et al., Brain–gut–microbe communication in health and disease. Frontiers in Physiology, 2011. 2: p. 1–15.
  4. Zhang, Y., et al., Impacts of Gut Bacteria on Human Health and Diseases. International Journal of Molecular Sciences, 2015. 16: p. 7493–7519.
  5. Bailey, M.T., et al., Stressor exposure disrupts commensal microbial populations in the intestines and leads to increased colonization by Citrobacter rodentium. Infection and Immunity, 2010. 78: p. 1509–1519.
  6. Sudo, N., et al., Postnatal microbial colonization programs the hypothalamic-pituitary-adrenal system for stress response in mice. The Journal of Physiology, 2004. 558: p. 263–275.
  7. Rhee, S.H., C. Pothoulakis, and E.A. Mayer, Principles and clinical implications of the brain–gut–enteric microbiota axis. Nature Reviews Gastroenterology & Hepatology, 2009. 6: p. 306–314.
  8. Kiliaan, A.J., et al., Stress stimulates transepithelial macromolecular uptake in rat jejunum. American Journal of Physiology, 1998. 275: p. G1037–G1044.
  9. Dinan, T.G. and J.F. Cryan, Regulation of the stress response by the gut microbiota: Implications for psychoneuroendocrinology. Psychoneuroendocrinology, 2012. 37: p. 1369–1378.
  10. Sanders, J.G., et al., Baleen whales host a unique gut microbiome with similarities to both carnivores and herbivores. Nature Communications, 2015. 6(8285): p. 1–8.
  11. El Gamal, A. Of whales and cows: the baleen whale microbiome revealed. Oceanbites, 2016 [cited 2018 07/31]; Available from: https://oceanbites.org/of-whales-and-cows-the-baleen-whale-microbiome-revealed/.