Mapping data to Darwin Core using the LivingNorwayR package Part 1

{LivingNorwayR} (Chipperfield, Grainger, and Nilsen (2022)) is our newly developed R (R Core Team (2020)) package that allows the user to create a Darwin Core Standard compliant data archive (“a data package”) for their biodiversity data.

We assume that the reader has at least a basic understanding of the Darwin Core Standard. Our vignette “Handling Darwin Core Files With Living Norway: Example Using the TOV-E Dataset” gives a good overview.

Here we will run through an example of how to map biodiversity data to the Darwin Core terms using {LivingNorwayR}.

Load the packages we need

As {LivingNorwayR} is still in development (although now firmly in the testing stage), we need to install it from GitHub. You can do this using the following code.

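# Install any missing CRAN packages used in this tutorial (several are called via :: later on)
list.of.packages <- c("tidyverse", "devtools", "uuid", "palmerpenguins", "janitor", "kableExtra")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)
# Install {LivingNorwayR} from GitHub (the repository path LivingNorway/LivingNorwayR is assumed here)
devtools::install_github("LivingNorway/LivingNorwayR")
library(LivingNorwayR)
library(tidyverse, quietly = TRUE)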

The data

Let’s use a well-known and openly available dataset from R: the Palmer Penguins dataset (Horst, Hill, and Gorman (2020)).

This dataset consists of observations and measurements of three different species of penguin in Antarctica.

Figure: The three species of penguin. Artwork by Allison Horst.

We can have a quick look at the dataset.


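# Load the raw data (assumed to be palmerpenguins::penguins_raw) and clean the column names
# janitor::clean_names() is assumed; it produces the snake_case names shown in the table below
penguin_data <- palmerpenguins::penguins_raw
penguin_data <- penguin_data %>% janitor::clean_names()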

penguin_data %>% 
  head() %>% 
  kableExtra::kable() %>% 
  kableExtra::kable_styling("striped", full_width = FALSE) %>% 
  kableExtra::scroll_box(width = "800px", height = "200px")

| study_name | sample_number | species | region | island | stage | individual_id | clutch_completion | date_egg | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex | delta_15_n_o_oo | delta_13_c_o_oo | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 2007-11-11 | 39.1 | 18.7 | 181 | 3750 | MALE | NA | NA | Not enough blood for isotopes. |
| PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 2007-11-11 | 39.5 | 17.4 | 186 | 3800 | FEMALE | 8.94956 | -24.69454 | NA |
| PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 2007-11-16 | 40.3 | 18.0 | 195 | 3250 | FEMALE | 8.36821 | -25.33302 | NA |
| PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 2007-11-16 | NA | NA | NA | NA | NA | NA | NA | Adult not sampled. |
| PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 2007-11-16 | 36.7 | 19.3 | 193 | 3450 | FEMALE | 8.76651 | -25.32426 | NA |
| PAL0708 | 6 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A2 | Yes | 2007-11-16 | 39.3 | 20.6 | 190 | 3650 | MALE | 8.66496 | -25.29805 | NA |

Each row is a single individual penguin of one of the three species. There are measures of body size (bill and flipper lengths), sex (male or female), as well as information on egg laying date and stable isotope analysis from blood samples.

Figure: Bill length and depth measurements for each penguin. Artwork by Allison Horst.

Mapping to Darwin Core

Deciding on the Core

The first task is to decide how our data will be structured. We need to decide what class of file will be our core data file in the Darwin Core Archive. {LivingNorwayR} can give us a list of the potential core classes that could best describe our data.

We use Event as the core class. From the list of Event members we can select those that are most relevant for our dataset; we do not have to use them all. GBIF publishes required and recommended terms for Events. The required terms are eventID, eventDate, samplingProtocol, sampleSizeValue and sampleSizeUnit. The strongly recommended terms that make sense for us to include are parentEventID, countryCode, locationID, decimalLatitude, decimalLongitude, geodeticDatum and coordinateUncertaintyInMeters. We can also add type, datasetName, ownerInstitutionCode, country, year, month and day.
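
While building the Event table it can help to compare its column names against the required terms. The sketch below is a base R helper of our own (missing_terms() is a hypothetical name, not part of {LivingNorwayR}), applied to a toy table and using the Darwin Core spellings sampleSizeValue and sampleSizeUnit.

```r
# Required Event terms, using the Darwin Core spellings
required_event_terms <- c("eventID", "eventDate", "samplingProtocol",
                          "sampleSizeValue", "sampleSizeUnit")

# Hypothetical helper: return the required terms that are not yet columns of `df`
missing_terms <- function(df, required) {
  setdiff(required, names(df))
}

# Toy table that only has two of the required columns so far
toy_events <- data.frame(
  eventID = c("e1", "e2"),
  eventDate = as.Date(c("2007-11-11", "2007-11-16"))
)

still_missing <- missing_terms(toy_events, required_event_terms)
print(still_missing)
```

Running the same helper on the finished Event table should return an empty character vector.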

Parent Events

Each Event is part of a higher-level Event, referred to as its Parent Event. The Parent Event in our case is represented by the study_name column, which identifies a unique expedition carried out at a separate time. We can include this information in the Event table.

The Parent Events are three different expeditions in different years.

Each Parent Event needs a unique persistent identifier, parentEventID, which we can obtain from the {uuid} package (Urbanek and Ts’o (2021)).

penguin_data <- penguin_data %>%
  group_by(study_name) %>%
  # evaluated once per group, so all rows of a study share one parentEventID
  mutate(parentEventID = uuid::UUIDgenerate(use.time = FALSE))
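
To see why grouping matters here, the following sketch uses a small made-up table (the toy rows are ours, not the penguin data). Inside a grouped mutate() the UUID call is evaluated once per group, so each study should end up with exactly one identifier.

```r
library(dplyr)

# Toy data: four samples from three expeditions
toy <- tibble(study_name = c("PAL0708", "PAL0708", "PAL0809", "PAL0910"))

# Inside a grouped mutate() the call is evaluated once per group,
# so all rows of a study share one identifier
toy <- toy %>%
  group_by(study_name) %>%
  mutate(parentEventID = uuid::UUIDgenerate(use.time = FALSE)) %>%
  ungroup()

# Count how many distinct identifiers each study received
ids_per_study <- toy %>%
  distinct(study_name, parentEventID) %>%
  count(study_name, name = "n_ids")

print(ids_per_study)
```

Each study should show n_ids equal to 1, and the two PAL0708 rows should carry the same parentEventID.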

There are different date ranges for each parent event and these need to be added as an eventDate.

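# eventDate for the Parent Events: an interval from the first to the last egg date
parent_penguinEvent <- penguin_data %>%
  group_by(parentEventID) %>%
  summarise(min = min(date_egg), max = max(date_egg)) %>%
  mutate(eventDate = paste0(min, "/", max), eventID = parentEventID) %>%
  select(-min, -max)  # drop the helper columns so they do not end up in the Event table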
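
The eventDate for a Parent Event is written as a start/end interval. As a minimal sketch on made-up dates (the toy identifiers p1 and p2 are ours), this is what the pasted interval looks like, together with a simple shape check.

```r
library(dplyr)

# Toy data: two parent events with their egg dates
toy <- tibble(
  parentEventID = c("p1", "p1", "p2"),
  date_egg = as.Date(c("2007-11-11", "2007-11-16", "2008-11-06"))
)

# One row per parent event, with the date range pasted as "start/end"
intervals <- toy %>%
  group_by(parentEventID) %>%
  summarise(eventDate = paste0(min(date_egg), "/", max(date_egg)), .groups = "drop")

# Shape check only: YYYY-MM-DD/YYYY-MM-DD
interval_pattern <- "^[0-9]{4}-[0-9]{2}-[0-9]{2}/[0-9]{4}-[0-9]{2}-[0-9]{2}$"
looks_like_interval <- grepl(interval_pattern, intervals$eventDate)

print(intervals)
```

A parent event with a single sampling date, like p2 here, gets the same date on both sides of the slash.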

We can also add some wider-scale geographic information to the parent events, such as continent and islandGroup.

# Event continent and islandGroup for parentIDs
parent_penguinEvent <- parent_penguinEvent %>%
  mutate(continent = "Antarctica") %>%
  mutate(islandGroup = "Palmer Archipelago")

Now for the Events themselves. Let’s start with the type, datasetName and ownerInstitutionCode. The type is “Event”, the datasetName is “Palmer-penguins” and the ownerInstitutionCode for the Palmer Station Antarctica LTER is “PAL”.

penguin_data <- penguin_data %>%
  mutate(type = "Event", datasetName = "Palmer-penguins", ownerInstitutionCode = "PAL")

Each Event also needs a unique identifier (eventID) and we can use the same approach as above. This time each row is an Event, so we need to make sure the dataframe is ungrouped and that every row gets its own identifier.

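penguin_data <- penguin_data %>%
  ungroup() %>%
  # n = n() requests one UUID per row; a bare call returns a single UUID that is recycled to all rows
  mutate(eventID = uuid::UUIDgenerate(use.time = FALSE, n = n()))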
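
One thing worth checking at this step is that the identifiers really differ between rows. The sketch below, on a made-up four-row table, contrasts a single UUID recycled across a column with one UUID per row requested through the n argument of uuid::UUIDgenerate().

```r
# Toy table with four samples
toy <- data.frame(sample_number = 1:4)

# A bare call returns one UUID; assigning it to a column recycles it to every row
toy$shared_id <- uuid::UUIDgenerate(use.time = FALSE)

# Asking for n values gives one UUID per row
toy$eventID <- uuid::UUIDgenerate(use.time = FALSE, n = nrow(toy))

n_shared <- length(unique(toy$shared_id))
n_unique <- length(unique(toy$eventID))

print(c(shared = n_shared, per_row = n_unique))
```

A check such as anyDuplicated(penguin_data$eventID) == 0 is a cheap safeguard before building the Event core.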

Again we can include an eventDate for each event. For the penguins data we do not have a true date of the sampling event, but we do have an egg-laying date and we shall use this for illustrative purposes. We can also extract the day, month and year information at the same time.

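penguin_data <- penguin_data %>%
  mutate(eventDate = date_egg) %>%
  # lubridate helpers are one way to split the date; the original extraction code is not shown
  mutate(year = lubridate::year(eventDate), month = lubridate::month(eventDate), day = lubridate::day(eventDate))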

As each event in the penguins dataset is a sample in which an individual penguin was measured, we can set the sampleSizeValue to 1. The sampleSizeUnit can be “Adult penguin”.

penguin_data=penguin_data %>% 
  mutate(sampleSizeValue=1) %>% 
  mutate(sampleSizeUnit="Adult penguin")

We can get the samplingProtocol for each event by looking at the original data package (links can be found by typing ??palmerpenguins::penguins_raw into the R console). All three parent events have the same protocol.

penguin_data=penguin_data %>% 
  mutate(samplingProtocol= "Each season, study nests, where pairs of adults were present, were individually marked and chosen before the onset of egg-laying, and consistently monitored. When study nests were found at the one-egg stage, both adults were captured to obtain blood samples used for molecular sexing and stable isotope analyses, and measurements of structural size and body mass. At the time of capture, each adult penguin was quickly blood sampled (~1 ml) from the brachial vein using a sterile 3 ml syringe and heparinized infusion needle. Collected blood was stored in 1.5 ml micro-centrifuge tubes that were kept cool. In the field, a small amount of whole blood was smeared on clean filter paper stored in a 1.5 ml micro-centrifuge tube for molecular sexing. Measurements of culmen length and depth (using dial calipers ± 0.1 mm), right flipper (using a ruler ± 1 mm), and body mass (using 5 kg ± 25 g or 10 kg ± 50 g Pesola spring scales and a weigh bag) were obtained to quantify body size variation. After handling, individuals at study nests were further monitored to ensure the pair reached clutch completion, i.e., two eggs. Molecular analyses were conducted at Simon Fraser University following standard PCR protocols, and stable isotope analyses were conducted at the Stable Isotope Facility at the University of California, Davis using an elemental analyzer interfaced with an isotope ratio mass spectrometer")

countryCode is the standard two-letter code for a country (ISO 3166-1 alpha-2). Antarctica, defined as the territories south of 60°S, is given the code AQ.

penguin_data <- penguin_data %>%
  mutate(countryCode = "AQ")

The locationID can be a global unique identifier or an identifier specific to the data set. We have the region and island for the penguins data so we can use these to develop a data specific identifier.

penguin_data=penguin_data %>% 
  mutate(locationID=paste0(region, "_", island))

We are not provided with precise coordinates (decimalLatitude, decimalLongitude) for the samples in the penguin data. However, we can use the centroid of each island (latitude, longitude): Torgersen is at -64.77308, -64.07413; Biscoe is at -65.4333316, -65.499998; and Dream is at -64.7333333, -64.2333333. The geodeticDatum is WGS 84 (this is the default assumed by GBIF if you do not add this field explicitly).

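penguin_data <- penguin_data %>%
  mutate(decimalLatitude = case_when(island == "Torgersen" ~ -64.77308, island == "Biscoe" ~ -65.4333316,
                                     island == "Dream" ~ -64.7333333)) %>%
  mutate(decimalLongitude = case_when(island == "Torgersen" ~ -64.07413, island == "Biscoe" ~ -65.499998,
                                      island == "Dream" ~ -64.2333333)) %>%
  mutate(geodeticDatum = "WGS 84")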

As the coordinates are just the centroid of each island, we need to include some measure of uncertainty (coordinateUncertaintyInMeters). We can estimate the uncertainty from the size of the islands: Torgersen is around 400 m wide, Biscoe around 500 m and Dream also around 400 m.

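penguin_data <- penguin_data %>%
  mutate(coordinateUncertaintyInMeters = case_when(island == "Torgersen" ~ 400, island == "Biscoe" ~ 500, island == "Dream" ~ 400))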
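
An alternative way to attach per-island values is a small lookup table joined on island, which keeps the coordinates and uncertainties in one place. This is a sketch of that pattern on made-up sample rows, not necessarily how the tutorial code does it; the numbers are the ones quoted in the text.

```r
library(dplyr)

# Island centroids and rough widths, copied from the text
island_lookup <- tibble(
  island = c("Torgersen", "Biscoe", "Dream"),
  decimalLatitude = c(-64.77308, -65.4333316, -64.7333333),
  decimalLongitude = c(-64.07413, -65.499998, -64.2333333),
  coordinateUncertaintyInMeters = c(400, 500, 400)
)

# Toy samples
toy <- tibble(sample_number = 1:3, island = c("Dream", "Torgersen", "Dream"))

# Attach the location fields by island
toy_located <- toy %>%
  left_join(island_lookup, by = "island") %>%
  mutate(geodeticDatum = "WGS 84")

# Rows whose island was not found in the lookup would show up here
unmatched <- toy_located %>% filter(is.na(decimalLatitude))

print(toy_located)
```

An empty unmatched table confirms that every island name in the data has an entry in the lookup.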

Finally, we can create the Event core by selecting those elements that we have listed above.

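eventDF <- penguin_data %>%
  select(eventID, parentEventID, type, datasetName, ownerInstitutionCode, eventDate, year, month, day, samplingProtocol, sampleSizeValue, sampleSizeUnit, countryCode, locationID, decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters)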

Then we need to add the Parent Events to the Event dataframe.

eventDF=eventDF %>% 
  mutate(continent=NA) %>% 
  mutate(islandGroup=NA) %>% 
  mutate(eventDate=as.character(eventDate)) %>% bind_rows(parent_penguinEvent)
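
After combining the two sets of rows, two properties are worth verifying: every eventID is unique, and every parentEventID points at an eventID that exists in the table. The following base R sketch runs both checks on a made-up table laid out like the one above, where a parent row carries its own identifier in both columns.

```r
# Toy Event table: one parent row (p1) and two child rows
events <- data.frame(
  eventID = c("p1", "e1", "e2"),
  parentEventID = c("p1", "p1", "p1"),
  stringsAsFactors = FALSE
)

# Check 1: no eventID appears twice
ids_unique <- anyDuplicated(events$eventID) == 0

# Check 2: every parentEventID resolves to a row in the same table
orphans <- setdiff(na.omit(events$parentEventID), events$eventID)

print(list(ids_unique = ids_unique, orphans = orphans))
```

The same two expressions can be run on eventDF before it is passed to {LivingNorwayR}.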

The final stage is to initialise an event object with the {LivingNorwayR} package. This will be used later to build the Darwin Core compliant data package.

GBIF_Event=initializeGBIFEvent(eventDF, idColumnInfo = "eventID", nameAutoMap = TRUE)

The Occurrence extension

We can find the supported terms for the Occurrence extension by using the following function.

GBIF also has a list of required and recommended terms for Occurrence data. The required terms are occurrenceID, basisOfRecord, scientificName and eventDate (already included in the event core). The recommended terms include countryCode, taxonRank, kingdom, decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters, individualCount, organismQuantity and organismQuantityType, some of which are already included in the event core.

We will also add some more information including type, collectionCode, organismQuantity, organismQuantityType, phylum, class, order, family, genus, vernacularName and sex.

The type is “Occurrence”. The collectionCode can be “Palmer Station Antarctica LTER” and the occurrenceID should be a globally unique identifier.

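penguin_data <- penguin_data %>%
  mutate(type = "Occurrence",
         collectionCode = "Palmer Station Antarctica LTER") %>%
  # n = n() gives every row its own UUID rather than one recycled value
  mutate(occurrenceID = uuid::UUIDgenerate(use.time = FALSE, n = n()))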

The basisOfRecord records how the observation was made (e.g. PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, etc.)

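penguin_data <- penguin_data %>%
  mutate(basisOfRecord = "HumanObservation")  # value assumed here; the original chunk does not show which one was used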

organismQuantity and organismQuantityType together record that each occurrence is 1 individual penguin.

penguin_data=penguin_data %>% 
  mutate(organismQuantity= 1,
         organismQuantityType= "individual")

For the scientificName and vernacularName we can use the species name from the original data set.

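penguin_data <- penguin_data %>%
  # species reads "Vernacular name (Scientific name)"; sub() is vectorised, so each row keeps its own species
  mutate(scientificName = sub("^.*\\((.*)\\).*$", "\\1", species),
         vernacularName = trimws(sub("\\(.*$", "", species)))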
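
Because the species column packs both names into one string, it is worth testing the extraction on all three species before applying it to the whole table. The sketch below uses base R only; the Chinstrap and Gentoo strings are written as we expect them to appear in the raw data and should be treated as illustrative inputs. It also shows why indexing the regmatches() result with [[1]] is risky: that returns the matches for the first element only.

```r
# One example string per species (the last two are illustrative inputs)
species <- c("Adelie Penguin (Pygoscelis adeliae)",
             "Chinstrap penguin (Pygoscelis antarctica)",
             "Gentoo penguin (Pygoscelis papua)")

# Vectorised extraction: text inside the parentheses, and text before them
scientificName <- sub("^.*\\((.*)\\).*$", "\\1", species)
vernacularName <- trimws(sub("\\(.*$", "", species))

# Pitfall: [[1]] keeps only the match from the first string,
# which would then be recycled to every row inside mutate()
first_only <- regmatches(species, gregexpr("\\([^)]*\\)", species))[[1]]

print(data.frame(scientificName, vernacularName))
```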

taxonRank, kingdom, phylum, class, order, family and genus all relate to the scientific name of the species. The taxonRank will be species because we have identified the penguins to the species level.

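penguin_data <- penguin_data %>%
  mutate(taxonRank = "species",
         kingdom = "Animalia",
         phylum = "Chordata",
         class = "Aves",
         order = "Sphenisciformes",
         family = "Spheniscidae",
         genus = "Pygoscelis")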

Finally, as we did with the Event core, we can create the occurrence extension by selecting those elements that we have listed above and create a {LivingNorwayR} object.

occ_ext=penguin_data %>% 
  select(type,collectionCode,basisOfRecord,occurrenceID, organismQuantity, organismQuantityType, 
         eventID, eventDate, scientificName,kingdom, phylum, class, order, family, genus,vernacularName,taxonRank, sex)
GBIF_Occ=initializeGBIFOccurrence(occ_ext, idColumnInfo = "occurrenceID",nameAutoMap = TRUE )
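
The occurrence extension is linked to the Event core through eventID, so any occurrence whose eventID is absent from the core would be left unattached. A minimal way to look for such rows, sketched here on made-up tables with dplyr::anti_join(), is:

```r
library(dplyr)

# Toy core and extension; occurrence o3 points at an event that does not exist
event_core <- tibble(eventID = c("e1", "e2", "e3"))
occurrences <- tibble(
  occurrenceID = c("o1", "o2", "o3"),
  eventID = c("e1", "e2", "e9")
)

# Occurrences with no matching event in the core
unlinked <- occurrences %>%
  anti_join(event_core, by = "eventID")

print(unlinked)
```

For the tutorial data the equivalent call would compare occ_ext against eventDF and should return zero rows.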

The Measurement or Fact extension

Our final extension is the “Measurement or Fact extension”. We can look at the definition of this extension using the following code:

## - Measurement or Fact
## A measurement of or fact about an rdfs:Resource.
##  Defined in:
##  IRI:
##  Version IRI:
##  Type: Class
##  Date modified: 2018-09-06
##  Notes:
##      Resources can be thought of as identifiable records or instances of classes and may include, but need not be limited to dwc:Occurrence, dwc:Organism, dwc:MaterialSample, dwc:Event, dwc:Location, dwc:GeologicalContext, dwc:Identification, or dwc:Taxon.
##  Executive committee decisions:
##  Examples:
##      The weight of an organism in grams. The number of placental scars. Surface water temperature in Celsius.
##  Miscellaneous information:
##      Datasets/Dataset/Units/Unit/MeasurementsOrFacts or DataSets/DataSet/Units/Unit/Gathering/SiteMeasurementsOrFacts

GBIF has a list of properties for this extension. These include measurementID, measurementType, measurementValue, measurementAccuracy, measurementUnit, measurementDeterminedDate, measurementDeterminedBy, measurementMethod and measurementRemarks, none of which are required.

Let’s start with the measurementID. We should also include the occurrenceID and eventID so that we know which individual and in which sampling event the measurements were taken.

M_or_f <- penguin_data %>%
  mutate(measurementID = uuid::UUIDgenerate(use.time = FALSE, n = n()))  # n = n(): one UUID per row

There are a number of measurements that we can include in this extension.

M_or_f <- M_or_f %>%
  select(measurementID, occurrenceID, eventID,
         culmen_length_mm, culmen_depth_mm, delta_13_c_o_oo, delta_15_n_o_oo, body_mass_g, flipper_length_mm)

We need to pivot the data so that all the measurement types go into a single column called measurementType. All the measurements go into a column called measurementValue.

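M_or_f <- M_or_f %>%
  pivot_longer(cols = c(culmen_length_mm, culmen_depth_mm, delta_13_c_o_oo, delta_15_n_o_oo, body_mass_g, flipper_length_mm), names_to = "measurementType",
               values_to = "measurementValue"
  ) %>%
  # after pivoting there is one row per measurement, so regenerate measurementID to keep it unique per row
  mutate(measurementID = uuid::UUIDgenerate(use.time = FALSE, n = n()))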
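
The pivot can be seen more easily on a two-row toy table. The sketch below also shows two optional refinements that are our own suggestions rather than steps from the tutorial: dropping rows with missing values via the values_drop_na argument, and filling measurementUnit from a named lookup vector keyed by the original column names.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one row per penguin, one column per measurement
toy <- tibble(
  occurrenceID = c("o1", "o2"),
  body_mass_g = c(3750, NA),
  flipper_length_mm = c(181, 186)
)

# Units keyed by the original column names (assumed from the column suffixes)
units <- c(body_mass_g = "g", flipper_length_mm = "mm")

toy_long <- toy %>%
  pivot_longer(cols = c(body_mass_g, flipper_length_mm),
               names_to = "measurementType",
               values_to = "measurementValue",
               values_drop_na = TRUE) %>%        # skip measurements that were not taken
  mutate(measurementUnit = unname(units[measurementType]))

print(toy_long)
```

Each remaining row is one measurement of one penguin, which is the shape the Measurement or Fact extension expects.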

Finally, we create a measurement or fact object using {LivingNorwayR}.

GBIF_Measure=initializeGBIFMeasurementOrFact(M_or_f, idColumnInfo = "measurementID", nameAutoMap = TRUE)

Next steps

The next step is to write the metadata for the data. We can do this using the {LivingNorwayR} package, and in the next tutorial (Part 2) we will show you how to do this and how to bring it all together into a data package.


Chipperfield, Joseph, Matthew Grainger, and Erlend Nilsen. 2022. LivingNorwayR: Creates a Darwin Core Standard Compliant Data Archive (“a Data Package”) for Biodiversity Data.

Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Urbanek, Simon, and Theodore Ts’o. 2021. uuid: Tools for Generating and Handling of UUIDs.

