Data Diversity of the Week: Sex

SYTYCD would like to welcome guest blogger John Wieczorek (also known as Tuco).

Data aggregators such as VertNet and GBIF work with data publishers (such as museums and research stations) to share their data in common formats.  We’ve been talking about this for a long time, and the great news is that data publication has become much easier with the advent of ratified standards such as Darwin Core (DwC), and tools such as GBIF’s Integrated Publishing Toolkit (IPT).

For some data fields, Darwin Core recommends the use of controlled vocabularies, but it doesn’t always recommend which vocabulary to use, and it certainly does not prohibit data content that does not conform to a controlled vocabulary. This is a double-edged sword.  On the plus side, there is no unnecessary obstacle to data sharing – data publishers don’t have to convert their data to conform with a standardized list of values.  On the downside… well… that brings us to our new feature installment on So You Think You Can Digitize:  This Week in Data Diversity.

In the coming weeks, we’ll be serenading you with interesting examples of the diversity of data published to VertNet, starting with everyone’s favorite Darwin Core term: “sex”.  This term is intended to express whether an organism is “male” or “female”.

So how many different ways can one say “male” and “female”?

Well, from just 48 collections at 20 institutions covering about 2.7 million records that have been processed through the data cleaning workflows of VertNet so far, there have been … 189 distinct values in the sex field that mean “male”! Don’t worry, there are also 184 ways to say “female”!  Now that we’ve established some parity between the genders, you are free to worry, because there remain 331 variations that are either ambiguous or that actually mean either “undetermined” or “unknowable”. REALLY?  Yes, really. None of these counts includes variations in letter case, by the way.

Luckily, standard terms do appear in the real-world data:  “female”, “male”, “gynandromorph”, “hermaphrodite”. That’s great, but … then, there are the abbreviations: “F”, “M”, “U”, “F.”, “M.”, “U.”, “UNK”, etc. There are also language variations, such as “hembra”, “macho”, and “H”  … and the ones where the data publishers aren’t really sure – “M?”, “F?”, “F ? M”, “M [almost surely F]”. We mentioned these are all real examples, didn’t we?

Then there are the variations that also include information from other Darwin Core fields such as lifeStage: “subadult male”, “f (adult)”, etc.  There are variations where the evidence is included: “male by plumage”, “macho testes 15×9” (yes, it is a real example!).  But we aren’t done yet. There are variations that try to capture what was written on a label: “M* with [?] written in pencil af”, “M [perhaps a F]”, “M[illeg]”. Up to this point we can more or less figure out what these were about, but that’s not always the case. There are variations expressing, perhaps, some measure of over-exuberance, such as “M +” and “F ++”.  It gets even crazier: “Tadpoles”, “hg”, “Yankee Pond”, “gonads not found”, “F4”, “Fall Male”, “U.S.”, “O~”, “M(i.e. upside down ‘F’)”, “M mate of # 357”, “M=[F] C.H.R.”, “(?) six”, “[F pract. certainly]”, “Apr”, “<”, “263.5”.  And what list would be complete without … “downy”?

VertNet is working to reduce this …diversity… by managing folksonomies (real-world terms for a concept) produced by real verbatim content, and interpreting as well as possible against controlled vocabularies. These lookups are used in data migration processes that take original source data structures and contents and massage them into a Darwin Core format ready for publishing. The idea is to reduce this diversity to something manageable on the scale of VertNet and to facilitate data searching.
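To make the idea of a lookup concrete, here is a minimal sketch in Python. It is an illustration only, not VertNet’s actual migration code: the names SEX_LOOKUP and normalize_sex, the normalization steps (trim, collapse whitespace, lowercase), and the choice of “undetermined” as the target for “U” and “UNK” are all assumptions. The verbatim values are taken from the examples quoted above.

```python
import re

# Hypothetical lookup table: normalized verbatim value -> controlled term.
# Entries are examples quoted in this post; a real lookup would be far larger.
# Mapping "u"/"unk" to "undetermined" is an assumption made for illustration.
SEX_LOOKUP = {
    "male": "male",
    "m": "male",
    "m.": "male",
    "macho": "male",
    "subadult male": "male",
    "male by plumage": "male",
    "female": "female",
    "f": "female",
    "f.": "female",
    "hembra": "female",
    "f (adult)": "female",
    "u": "undetermined",
    "u.": "undetermined",
    "unk": "undetermined",
    "gynandromorph": "gynandromorph",
    "hermaphrodite": "hermaphrodite",
}


def normalize_sex(verbatim):
    """Return (controlled_term, verbatim).

    controlled_term is None when the value is not in the lookup, so that
    unrecognized values can be queued for human review instead of guessed.
    The verbatim value is always returned unchanged so nothing is lost.
    """
    key = re.sub(r"\s+", " ", verbatim.strip()).lower()
    return SEX_LOOKUP.get(key), verbatim


if __name__ == "__main__":
    for value in ["  MACHO ", "F.", "Subadult  Male", "downy"]:
        print(repr(value), "->", normalize_sex(value)[0])
```

Keeping the verbatim value next to the interpreted one is a deliberate choice in this sketch: values such as “downy” stay unresolved rather than being forced into a category, and remain available for full-text search.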

Below is a list of fields that Darwin Core recommends to be vocabulary-controlled and that VertNet is trying to track and manage. The job is non-trivial. The vocabularies are very difficult to curate because of the overabundance of content diversity, as shown in the Table below:

DwC Field | Distinct variants in the field | # of variants resolved before IPT publishing

We will delve into the mysteries of some of these Darwin Core fields (certainly “preparations” and “country”) in later posts and we’ll talk more about potential solutions to the problem of data diversity.  In the meantime, you can whet your appetite by perusing the lists of distinct values for many of these fields in our Darwin Core Vocabularies Github repository. We’ll keep this up to date with incoming terms as we encounter them. By now we hope we have convinced you that it is a huge problem.

Next time you want to find some male specimens of a particular bird, just remember to include “downy” as a variant in the “sex” field, unless of course you are searching on VertNet, where this problem has been taken care of for you. Don’t worry, you’ll still be able to find downy specimens on VertNet with a full-text search.


authors:  John Wieczorek, Rob Guralnick, David Bloom, Andrea Thomer and Paula Zermoglio

Posted in Uncategorized | 2 Comments

This week in digitization: The good, the buggy, and the curious

This will be old news to many, but regardless: two big projects related to specimen digitization and biodiversity informatics launched in the past couple weeks.   Quick impressions on both below, focusing on the good, the buggy and a few items of curiosity.  Both projects are great, but — how will they fit into the broader landscape of existing resources, and into what niches?

1) Notes From Nature — a new Zooniverse project for the transcription of natural history collection ledgers.  This has been a long time in the making (more details here) and as of this writing, the two available collections (Herbarium specimens from SERNEC and insects from CALBUG) are already 26% and 21% transcribed, respectively.

The Good: As always, clean and intuitive interfaces from the Zooniverse team make transcription fast and easy.  Data entry screens are customized to each type of collection (e.g. plant labels often contain more detailed locality descriptions than insect labels, whereas insect labels often contain data about what host organism they were found on).  Awesomely, all the code is available on Github in case other museums want to set up their own transcription engines locally.  There is also an intriguing teaser buried at the very bottom of the Notes from Nature “About” page: “Interested in publishing your collection? Contact us.”

The Buggy: Maaaan, I’ve transcribed around 40 labels and my total isn’t showing up under my user profile. This bothers me more than I care to admit, though it’s primarily out of worry that my transcriptions aren’t being saved.

The Curious:  It would be great to learn more about how these data get back to the collections databases, and how exactly that handoff happens.  What do the transcribed files look like?  How is accuracy checked?  Do the museums have plans to make these records publicly available, or harvestable by aggregators like GBIF?

2) The patriotically-named Biodiversity Information Serving Our Nation (AKA BISON) biodiversity data portal out of USGS —  I know less about this project, other than what I’ve learned at various conference talks — however I’ve heard it referred to as the “federal version of iDigBio.”

The Good: On first look, really nice integration of specimen occurrence data with USGS map layers, and as Hilmar Lapp pointed out, there’s an API, which is great.

The Buggy:  There are no identifiers on these specimens, not even their local catalog numbers.  Per Stinger Guala in the G+ thread linked above, the data is there; it’s just not yet visible (though it will be soon).  Perhaps there are reasons (a need for better formatting? a need for cleaner data? a need for more server space?) that they’re not making this data visible yet, but it struck me as a pretty glaring omission.  While I realize that many researchers don’t spend a lot of time looking at catalog numbers, I imagine that they’d be absolutely critical if one were integrating BISON data with that from other sources (say, something from another portal like GBIF). Also, how could any of these records ever be linked back to the source data or any other data out there?  Provenance = important, no?

The Curious: BISON is apparently the US node of GBIF, which I had assumed meant it would be providing GBIF with US data.  However, BISON appears to invert that model: it looks like a US-centric mirror of GBIF.  I hope that BISON becomes a platform through which US federally owned and managed biocollections can be made publicly discoverable, and would be interested to hear from BISON reps if there are any plans in place to do this.

Posted in citizen science, crowdsourcing | 9 Comments

What gets linked to global unique identifiers (GUIDs) in natural history collections digitization?

Co-written with David Bloom.

For as long as explorers have been collecting specimens and bringing them back to museums, collection managers and museum staff have been assigning unique (well, more or less unique, with a margin for human error) numbers to them. Collections management isn’t just about conservation of physical objects – it’s also about care and support of an information retrieval system.  And while new technologies come and go, the maintenance of this system of locally unique identifiers remains at the core of collections management.

Recently, that aforementioned information retrieval system has been growing increasingly dispersed, and the ways in which information about specimens is disseminated, accessed, and manipulated have been changing rapidly.  The ability to place digital representations of specimens and their data en masse onto the World Wide Web is fundamentally changing collections-based scholarship.  Consequently, locating the right giant clam specimen in the University of Colorado Museum of Natural History (CU Museum) invertebrate collection is a very different task than locating a digital image of the same specimen on the Internet.  Tracking and connecting all of the digital representations (e.g., images, metadata records digitized from labels, tissue samples derived from the specimen, sound and video files) derived from that single specimen is even more difficult.  While local identifiers suffice to connect data and specimens within the CU Museum, global identifiers are needed to maintain these connections when content is released into the wilds of the Internet.  The challenge before us is how to best set up a system of globally unique identifiers (GUIDs) that work at Internet-scale.

iDigBio, an NSF-funded project tasked with coordinating the collections community’s digitization efforts, just released a GUID guide for data providers.  The document clarified the importance of GUIDs and recommended that iDigBio data providers adopt universally unique identifiers (UUIDs) as GUIDs.  It went further, however.  In the document, a call-out box (on page 3) states that “It has been agreed by the iDigBio community that the identifier represents the digital record (database record) of the specimen not the specimen itself. Unlike the barcode that would be on the physical specimen, for instance, the GUID uniquely represents the digital record only.”

In response, Rod Page wrote an (as always, entertaining and illuminating) iPhylo blog post “iDigBio: You are putting identifiers on the wrong thing” in which he makes a strong case that a GUID must refer to the physical object.  “Surely,” writes Rod, “the key to integrating specimens with other biodiversity data is to have globally unique identifiers for the specimens.”

The disagreement above underscores our community’s need to be very clear about what GUIDs reference and how they resolve[1].  This seems simple, but it has been one of the most contentious issues within our community.  So who is right – should GUIDs point to digital records, or physical objects?

iDigBio has a clear mission to support the digitization of natural history specimens, and thus, deals exclusively with digital objects, which, as any database manager knows, need to have identifiers.  So, it does make sense that they would be concerned with identifiers for digital objects.  Those identifiers, however, absolutely must be as closely associated with the physical specimens as possible.  In particular, they need to be assigned and linked to the local identifiers stored in local databases managed by on-site collections staff.  If iDigBio is saying that only digitized objects that get passed to iDigBio from their data providers need GUIDs, not the original digital catalogs, we can’t agree with that.

On the other side, we’re not sure we agree with Rod either.  If Rod is suggesting that GUIDs replace, or serve as additions to the catalog numbers literally, physically attached to specimens or jars, we think that is simply impractical.  What is the incentive for putting a GUID on every single specimen in a collection, especially wet collections, from the point of view of a Collections Manager?  Does it help with loans?  Who is going to go into a collection and assign yet another number to all the objects in that collection (and how many institutions have the resources to make that happen)?

What we think is feasible and useful (and likely what Rod meant) is to assign GUIDs to digital specimen records stored in local museum databases and linked to the local identifiers.  When these data get published online, the associated GUID can be pushed downstream as well. Assigning GUIDs to the local, authoritative, electronic specimen records as they are digitized should be a mandatory step in the digitization process, and it is a step that iDigBio is uniquely poised to support. This is the only way that GUIDs will be consistently propagated downstream to other data aggregators like VertNet, GBIF, and whatever else comes along fifty years from now (and fifty years after funding runs out on some existing projects).  Again, it’s important to remember that natural history collections management has always entailed the management of identifiers; the adoption of global identifiers will only increase the need for local identifier management.
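As a rough illustration of what “assign a GUID to the local record and keep it linked to the local identifier” could look like, here is a minimal Python sketch. Everything in it is hypothetical: the function name assign_guids, the record keys (institution, collection, catalog_number), and the in-memory registry stand in for whatever a real collections database would use. The one real piece is Python’s standard uuid module, which mints random (version 4) UUIDs of the kind the iDigBio guide recommends as GUIDs.

```python
import uuid


def assign_guids(records, registry):
    """Attach a GUID to each local specimen record, reusing existing ones.

    records:  list of dicts with hypothetical keys 'institution',
              'collection' and 'catalog_number' (the local identifier).
    registry: dict mapping the local identifier to its GUID; in practice
              this would be persisted in the museum's own database.
    """
    for record in records:
        local_id = (
            record["institution"],
            record["collection"],
            record["catalog_number"],
        )
        # Mint a new UUID only the first time a local identifier is seen,
        # so republishing the same record never changes its GUID.
        if local_id not in registry:
            registry[local_id] = str(uuid.uuid4())
        record["guid"] = registry[local_id]
    return records


if __name__ == "__main__":
    registry = {}
    batch = [
        {"institution": "XYZ", "collection": "Inverts", "catalog_number": "1234"},
        {"institution": "XYZ", "collection": "Inverts", "catalog_number": "1235"},
    ]
    for rec in assign_guids(batch, registry):
        print(rec["catalog_number"], rec["guid"])
```

The point of the registry is that the GUID is minted once, locally, and then travels with the record to every aggregator; the local catalog number is never replaced, only linked.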

Now, we can imagine one case in which GUIDs could conceivably serve as the originating catalog number: during field collection.  As biologists generate more and more digital content in the field (such as images and DNA collected in the field), minting GUIDs at the moment of collection (or during the review of daily collection events) and assigning them to samples and specimens directly could be quite useful.  As these physical objects make their way into collections, we anticipate that collections folks will still assign local identifiers.  Both have their uses and are made stronger and more useful when linked.

In summary: we are less worried about what exactly a GUID will point to (digital record vs physical object) as long as the content referenced is valuable to the biodiversity collections and science community.  However, we are more worried that we’re not explicitly identifying what we’re assigning identifiers to, and not discussing who will manage these identifiers or how they will be managed and integrated.  Our focus should be on developing trusted and well-understood GUID services that provide content resolution for the long (50-100 years) term.

[1] To resolve a GUID, you dump it into a resolution service maintained by a naming authority that originally created that GUID, such as CrossRef or DataCite. That service then returns to you links and other information that point you to other content attached to the same identifier.  A great example of a resolvable GUID is a Digital Object Identifier (DOI).  In the case of journal articles, resolution of a DOI will usually direct you to the paper itself via hyperlink, but it could also be a web page with information about the resource, a sound file, or any other representation of the object associated with the GUID.
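For DOIs specifically, the usual entry point for resolution is the doi.org proxy: appending a DOI to https://doi.org/ gives a URL that redirects to whatever the registrant has associated with it. The small Python sketch below only builds that URL and does not contact any service; the function name doi_to_url is made up for this example.

```python
from urllib.parse import quote


def doi_to_url(doi):
    """Build the doi.org resolver URL for a DOI string.

    Accepts a bare DOI or one with a leading 'doi:' prefix. No network
    request is made; following the URL is left to a browser or HTTP client.
    """
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Percent-encode characters that are unsafe in a URL path, but keep
    # the slash that separates the DOI prefix from its suffix.
    return "https://doi.org/" + quote(doi, safe="/")


if __name__ == "__main__":
    print(doi_to_url("doi:10.1000/182"))
```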

Posted in Uncategorized | 34 Comments