We’ve always been interested in citizen science approaches to the digitization of museum labels and ledgers, even before there were tools out there to do it. But now, projects such as Notes from Nature (NfN; notesfromnature.org) haven’t just built the tools — they’re giving us the opportunity to put those tools to the test, and in doing so, more deeply explore how museums can work with data generated not just by volunteers, but volunteers at a distance.
As NfN and other Zooniverse volunteers may already know, the specimen records in NfN aren’t transcribed just once — they’re actually transcribed at least four times each. These redundant transcriptions are collected to improve the accuracy of transcriptions overall, the idea being that the four-plus transcripts can be combined and compared against each other to create a final, canonical version of each specimen record. But how?
In this post, we want to present one of many potential approaches, some of which is also discussed in a related blog post over at NfN (http://blog.notesfromnature.org/2014/01/14/checking-notes-from-nature-data/). But here, we want to lead up to a discussion not just of how to create a canonical, consensus record, but also of what we mean when we say “canonical” in the first place.
But before we get to that, let’s look at some raw data, and briefly review the “how”. Below are two raw output transcription records from Notes from Nature (the Calbug project, to be exact), with some minor edits (columns that aren’t relevant here have been removed). The main columns to focus on are subject_id, which identifies the “subject” (in this case the image being transcribed), and, for the purposes of illustration, the locality string that was entered by the transcribers.
The first data curation phase involves some general normalizations: abbreviations and capitalizations are standardized with a simple script, and we perform some find-and-replacements to sort out blank spaces and other problem characters. The transcriptions and subject_ids are then formatted for use with something most biologists will be familiar with: sequence alignment software.
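To make that normalization step concrete, here is a minimal Python sketch. It is not the script we actually ran: the abbreviation table, the `normalize` and `to_fasta` names, and the `_1`, `_2` suffixes on the FASTA headers are all assumptions made for illustration. The only things it takes from the workflow are the general idea (standardize abbreviations and capitalization, clean up spacing and punctuation) and the use of underscores in place of spaces.

```python
import re

# Hypothetical abbreviation table; the real script's rules are not shown in this post.
ABBREVIATIONS = {
    "mi": "miles",
    "n": "North", "north": "North",
    "s": "South", "south": "South",
    "e": "East", "east": "East",
    "w": "West", "west": "West",
    "se": "Southeast", "southeast": "Southeast",
}

def normalize(locality):
    """Standardize abbreviations and capitalization, drop punctuation and
    extra spaces, and join the remaining words with underscores."""
    # Pull out runs of letters or digits, so "mi.N." becomes ["mi", "N"].
    tokens = re.findall(r"[A-Za-z]+|\d+", locality)
    words = [ABBREVIATIONS.get(token.lower(), token) for token in tokens]
    return "_".join(words)

def to_fasta(records):
    """Format (subject_id, locality) pairs as FASTA text.

    A running number is appended to each header (an assumption) so that
    repeated transcriptions of the same subject stay distinguishable.
    """
    lines = []
    for number, (subject_id, locality) in enumerate(records, start=1):
        lines.append(f">{subject_id}_{number}")
        lines.append(normalize(locality))
    return "\n".join(lines) + "\n"
```

Calling `normalize("Friona, 10 mi.N.")` with this table gives `Friona_10_miles_North`, which is the kind of string that then goes to the aligner.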
Here are the FASTA (http://en.wikipedia.org/wiki/FASTA_format) formatted outputs for the two specimens above:
Why FASTA format, and why sequence alignment software? Because the next step of the normalization process involves aligning each word of the locality string, in much the same way that sequence alignment lines up homologous characters. And just as sequences can have both mutations and “gaps” that need to be discovered through alignment, so too do transcripts. Aligned results (using MAFFT) are below:
[Note from Andrea: The use of sequence alignment tools on plain text is not unheard of! Linguists and textual critics have been applying phylogenetic tools and methods to things other than genes and traits for a while now -- more on this next post.]
Normalizing and aligning the outputs makes it much easier to see a growing consensus, but note that both these transcripts still have a few problems. In the first case, only one transcriber transcribed the record verbatim from the label (i.e. Friona, 10 mi.N.) while everyone else transposed it. And one transcriber also put an “of” in the transcription that isn’t there (i.e. 10 miles north of Friona). To account for variants that only occur once across the pool of transcripts, another script is used to remove every position in the alignment where exactly one transcript has a character and all the others have gaps. So the very first position, for example, is all gaps except for one “F” and thus is culled.
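The culling rule can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than the script we used: the function name is made up, the aligned transcripts are assumed to be equal-length strings, and `-` is assumed to be the gap character.

```python
def cull_singleton_columns(aligned, gap="-"):
    """Remove alignment columns in which exactly one row has a character
    and every other row has a gap."""
    kept_columns = []
    for column in zip(*aligned):  # walk the alignment position by position
        non_gap = sum(1 for ch in column if ch != gap)
        if non_gap != 1:
            kept_columns.append(column)
    if not kept_columns:
        return ["" for _ in aligned]
    # Transpose the kept columns back into one string per transcript.
    return ["".join(row) for row in zip(*kept_columns)]
```

A variant that two or more transcribers share survives this step, so it only filters out one-off insertions such as the stray “F” or a lone “of”.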
The final output alignments for both records look like this:
The penultimate step is to take a consensus of these aligned transcripts using Biopython’s dumb_consensus method: http://biopython.org/DIST/docs/api/Bio.Align.AlignInfo.SummaryInfo-class.html#dumb_consensus. The output keeps, at each position of the alignment, the character that at least 50% of the transcripts agree on, ignoring gaps. The final step is to reconvert underscores to spaces for readability, and print a final consensus transcript, in this case:
# subject_id consensus
4 516d7fa4ea3052046a000004 10 miles North Friona
6 516d7fa6ea3052046a00000a Rushing Xiver Provincial Park 15 miles Southeast
Note: The “X” in the second transcript is a case where there was a “tie”: 50% of the records said “B” and 50% said “R”, so the consensus step places an “X” at that position.
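For readers who want to see the consensus logic spelled out, here is a plain-Python sketch of the behavior described above. It is a stand-in written for illustration, not Biopython’s own code: the function name and defaults are assumptions, and skipping all-gap columns is a choice made here for simplicity.

```python
from collections import Counter

def consensus(aligned, threshold=0.5, ambiguous="X", gap="-"):
    """Build a consensus string from equal-length aligned transcripts.

    At each position, gaps are ignored; the most common character is kept
    if its share of the remaining characters meets the threshold, and the
    ambiguous character is used for ties or weak agreement.
    """
    out = []
    for column in zip(*aligned):
        counts = Counter(ch for ch in column if ch != gap)
        if not counts:
            continue  # nothing but gaps at this position (assumption: skip it)
        ranked = counts.most_common()
        top_char, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        share = top_count / sum(counts.values())
        out.append(ambiguous if tied or share < threshold else top_char)
    # Turn underscores back into spaces for readability.
    return "".join(out).replace("_", " ")
```

Given two transcripts reading `Rushing_River` and two reading `Rushing_Biver`, this returns `Rushing Xiver`, reproducing the tie described in the note.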
So how’d the consensus approach work? In one sense pretty darn well. The results are readable and pretty reflective of the label contents — maybe an improvement, even. In another sense, they don’t resemble the verbatim label contents very much at all (click on those image links above to view the verbatim labels).
So, this is where we return to our newfound struggle to define canonical: do we want labels that are verbatim transcriptions of the specimen labels? Or do we want easily “georeferenceable” contents that are accurate, normalized, and easily interpretable to future researchers? Or both?
The answer, of course, is both, and more. In the next post, we’ll discuss why we need both verbatim AND interpreted, easily georeferenceable transcriptions, and how studies of medieval manuscripts and natural history museum curation ain’t so different after all….