Data Diversity of the Week: Sex

SYTYCD would like to welcome guest blogger John Wieczorek (also known as Tuco).

Data aggregators such as VertNet and GBIF work with data publishers (such as museums and research stations) to share their data in common formats.  We’ve been talking about this for a long time, and the great news is that data publication has become much easier with the advent of ratified standards such as Darwin Core (DwC), and tools such as GBIF’s Integrated Publishing Toolkit (IPT).

For some data fields, Darwin Core recommends the use of controlled vocabularies, but it doesn’t always recommend which vocabulary to use, and it certainly does not prohibit data content that does not conform to a controlled vocabulary. This is a double-edged sword.  On the plus side, there is no unnecessary obstacle to data sharing – data publishers don’t have to convert their data to conform with a standardized list of values.  On the downside… well… that brings us to our new feature installment on So You Think You Can Digitize:  This Week in Data Diversity.

In the coming weeks, we’ll be serenading you with interesting examples of the diversity of data published to VertNet, starting with everyone’s favorite Darwin Core term: “sex”.  This term is intended to express whether an organism is “male” or “female”.

So how many different ways can one say “male” and “female”?

Well, from just 48 collections at 20 institutions covering about 2.7 million records that have been processed through the data cleaning workflows of VertNet so far, there have been … 189 distinct values in the sex field that mean “male”! Don’t worry, there are also 184 ways to say “female”!  Now that we’ve established some parity between the genders, you are free to worry, because there remain 331 variations that are either ambiguous or that actually mean either “undetermined” or “unknowable”. REALLY?  Yes, really. And none of these counts treats differences in upper and lower case as separate values, by the way.
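For the curious, a tally like this can be sketched in a few lines of Python. This is only an illustration, not the VertNet workflow itself: the function names and the sample values are made up for the example, and we assume that "not counting case" also means collapsing stray whitespace.

```python
from collections import Counter


def normalize(value: str) -> str:
    """Collapse runs of whitespace and ignore case, so 'M', 'm' and ' M ' match."""
    return " ".join(value.split()).casefold()


def tally_distinct(values):
    """Count occurrences of each distinct, normalized, non-empty value."""
    return Counter(normalize(v) for v in values if v.strip())


# Hypothetical sample of verbatim "sex" values (not real VertNet records).
sample = ["M", "m", "Male", "F?", "f?", "hembra", "  "]
counts = tally_distinct(sample)
print(len(counts))  # number of distinct values once case is ignored
```

Anything smarter than this (deciding that “M” and “Male” mean the same thing, for instance) needs a lookup table rather than string cleanup.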

Luckily, standard terms do appear in the real-world data:  “female”, “male”, “gynandromorph”, “hermaphrodite”. That’s great, but … then, there are the abbreviations: “F”, “M”, “U”, “F.”, “M.”, “U.”, “UNK”, etc. There are also language variations, such as “hembra”, “macho”, and “H”  … and the ones where the data publishers aren’t really sure – “M?”, “F?”, “F ? M”, “M [almost surely F]”. We mentioned these are all real examples, didn’t we?

Then there are the variations that also include information from other Darwin Core fields such as lifeStage: “subadult male”, “f (adult)”, etc.  There are variations where the evidence is included: “male by plumage”, “macho testes 15×9” (yes, it is a real example!).  But we aren’t done yet. There are variations that try to capture what was written on a label: “M* with [?] written in pencil af”, “M [perhaps a F]”, “M[illeg]”. Up to this point we can more or less figure out what these were about, but that’s not always the case. There are variations expressing, perhaps, some measure of over-exuberance, such as “M +” and “F ++”.  It gets even crazier: “Tadpoles”, “hg”, “Yankee Pond”, “gonads not found”, “F4”, “Fall Male”, “U.S.”, “O~”, “M(i.e. upside down ‘F’)”, “M mate of # 357”, “M=[F] C.H.R.”, “(?) six”, “[F pract. certainly]”, “Apr”, “<”, “263.5”.  And what list would be complete without … “downy”?

VertNet is working to reduce this …diversity… by managing folksonomies (real-world terms for a concept) built from real verbatim content, and interpreting those terms as well as possible against controlled vocabularies. The resulting lookups are used in data migration processes that take original source data structures and contents and massage them into a Darwin Core format ready for publishing. The idea is to reduce this diversity to something manageable on the scale of VertNet and to facilitate data searching.
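To make the lookup idea concrete, here is a minimal sketch in Python. It is not VertNet's actual code or vocabulary: the table name `SEX_LOOKUP`, the function names, and the specific mappings are assumptions made up for illustration, using only a handful of the verbatim values quoted above.

```python
from typing import Optional

# Hypothetical lookup table: normalized verbatim value -> standard term.
# These entries are illustrative only, not VertNet's real mappings.
SEX_LOOKUP = {
    "male": "male",
    "m": "male",
    "m.": "male",
    "macho": "male",
    "female": "female",
    "f": "female",
    "f.": "female",
    "hembra": "female",
    "u": "undetermined",
    "u.": "undetermined",
    "unk": "undetermined",
}


def normalize_key(verbatim: str) -> str:
    """Collapse whitespace and ignore case so 'F', ' f ' and 'F ' share one key."""
    return " ".join(verbatim.split()).casefold()


def interpret_sex(verbatim: str) -> Optional[str]:
    """Return the standard term for a verbatim value, or None if unresolved."""
    return SEX_LOOKUP.get(normalize_key(verbatim))


for value in ["Hembra", " M. ", "UNK", "Yankee Pond"]:
    print(repr(value), "->", interpret_sex(value))
```

Returning `None` for anything not in the table (rather than guessing) keeps unresolved values such as “Yankee Pond” visible, so a human can decide what, if anything, they mean. The verbatim value itself is never thrown away in this sketch; it is only interpreted.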

Below is a list of fields that Darwin Core recommends to be vocabulary-controlled and that VertNet is trying to track and manage. The job is non-trivial. The vocabularies are very difficult to curate because of the overabundance of content diversity, as shown in the Table below:

DwC Field                 Distinctive variants in the field   # of variants resolved before IPT publishing
basisOfRecord             5                                   5
country                   4989                                4989
disposition               3307                                17
establishmentMeans        12                                  12
genus                     15884                               15884
geodeticDatum             251                                 251
georeferenceProtocol      177                                 177
identificationQualifier   140                                 140
language                  6                                   6
lifeStage                 13087                               429
preparations              10745                               2052
reproductiveCondition     13482                               361
sex                       710                                 710
taxonRank                 8                                   8
type                      5                                   5
typeStatus                1861                                481

We will delve into the mysteries of some of these Darwin Core fields (certainly “preparations” and “country”) in later posts and we’ll talk more about potential solutions to the problem of data diversity.  In the meantime, you can whet your appetite by perusing the lists of distinct values for many of these fields in our Darwin Core Vocabularies GitHub repository. We’ll keep this up to date with incoming terms as we encounter them. By now we hope we have convinced you that data diversity is a huge problem.

Next time you want to find some male specimens of a particular bird, just remember to include “downy” as a variant in the “sex” field, unless of course you are searching on VertNet, where this problem has been taken care of for you. Don’t worry, you’ll still be able to find downy specimens on VertNet with a full-text search.

___________________________________________________________________

authors:  John Wieczorek, Rob Guralnick, David Bloom, Andrea Thomer and Paula Zermoglio

About Rob

Three "B's" of importance: biodiversity, bikes and bunnies. I get to express these "B's" in neat ways --- I bike to a job at the University of Florida where I am an Associate Curator of Biodiversity Informatics. Along with caretaking collections, I also have a small zoo at home, filled with two disapproving bunnies.

2 Responses to Data Diversity of the Week: Sex

  1. Garry Jolley- Rogers says:

    This is important even in humans. there are at least three alternate sexes not including sexual orientation or transformation wrought by an operation. And is a serious matter, that is often not well managed, for medical databases.

  2. Dan Stoner says:

    I have referenced this blog post frequently over the last few weeks in talking about data quality. Great information!
