Re:Sourcing Primary Materials: Notes from A Workshop

Last week, we both attended a workshop on primary source material digitization organized by iDigBio, where we had some great, truly interdisciplinary conversations with folks like Rusty Russell over at the Field Book Project, Ben Brumfield at From The Page, Terry Catapano at Plazi, and many others.  We’re both eager to stay involved with this working group — the topic is a rich one that connects the museum and library communities — but I (Andrea) did want to get down a few thoughts about primary source (aka field book and similar documentation or realia) digitization in general, and some potential future directions, before they escape my memory:

1) At the workshop, we spent a lot of time doing things like describing different imaging workflows, exploring potential citizen science applications, and defining what exactly we mean when we say “source materials” (all important tasks), but we didn’t spend much time talking about formatting and making available all the already transcribed field books that are surely scattered on hard drives throughout natural history museums all over the world.  While the various CLIR “Hidden Collections” grants have helped support the cataloging and preservation of paper archives, I suspect there are a huge number of already semi-digitized (e.g. transcribed into Word/text/WordPerfect docs, or even typewritten files) field notes out there without any long-term preservation plan (particularly in smaller museums).  As fragile as paper archives are, digital archives can be even more delicate; given a dry room, paper will remain legible for centuries, whereas the information contained on a hard drive will degrade unless periodically migrated.  Already existing transcriptions are likely housed on older media, like floppy disks and Zip drives — making their fragility even more acute.

Why not just re-transcribe the notebooks once they’re scanned?  Because good transcriptions are massively time consuming to produce!  Thus, these already transcribed field books represent dozens to hundreds of person-hours of work per book, work that is in danger of being lost if not properly curated.  There are ways — in fact, an entire field of study dedicated to articulating the ways — to conduct this curation, but it needs to be identified as a priority in order for that to happen.  For now, we want to make sure we keep shedding a little light on some of this particular breed of “dark data,” and encourage everyone to keep thinking about how to make it a little less obscure.  Do many folks out there have digitized products but aren’t quite sure about next steps?  If you have any of those fragile digital copies out there, we want to hear about them!

2) Similarly, while we spent a fair bit of time talking about what it would take to get books newly transcribed, we didn’t spend much time talking about important details like how to best format, preserve and present these notebooks once typed, or how those various formats and presentations will support different kinds of use.  It’s not enough to just type field books into a text file and throw them on a server somewhere.  It’s not even enough to provide some metadata about the field notebooks.  And it may not even be enough to develop a wiki-style markup syntax from scratch for every single project out there.  We need to move toward a standardized way of “marking up” these transcriptions so that they’re more findable, machine readable and text mineable for future researchers, and to assure they don’t end up glued down in a particular database system.

Projects like the Smithsonian’s Transcription Project, Ben Brumfield’s From The Page software (currently being used by the MVZ for field note transcription), and our own Henderson Field Book Project make use of wiki-style syntax to annotate items in digitized text.  What happens next, though?  How do we search across these resources?  For the Henderson Project, our solution was ultimately to rip the records from the page (no pun intended, Ben) and place them into a format that ensures interoperability (e.g. Darwin Core Archives), while also linking back to the page where those records resided.  Ideally, however, one could find a different solution, one that builds on existing markup standards.
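To make the “wiki-style syntax to Darwin Core” idea a little more concrete, here is a toy sketch in Python: a purely hypothetical markup convention and a few lines that pull the annotations out into a flat record.  The syntax, field names, and example text below are all invented for illustration; they are not the Henderson project’s actual markup or pipeline.

```python
# Purely hypothetical wiki-style markup and a toy extractor; the Henderson project's
# real syntax and Darwin Core Archive pipeline are not shown here.
import re

page = "Shot one {{taxon|Sciurus niger}} at {{locality|10 miles North Friona}} on {{date|1931-05-02}}."

def extract_annotations(text):
    """Pull {{field|value}} annotations out of a transcribed page into a flat record."""
    return {field: value for field, value in re.findall(r"\{\{(\w+)\|([^}]*)\}\}", text)}

record = extract_annotations(page)
# record -> {'taxon': 'Sciurus niger', 'locality': '10 miles North Friona', 'date': '1931-05-02'}
# Rows like this can then be mapped onto Darwin Core terms (e.g. dwc:scientificName,
# dwc:locality, dwc:eventDate), packaged as a Darwin Core Archive, and linked back
# to the transcribed page they came from.
```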

At the workshop, we learned that Terry Catapano and Plazi have been doing just that — building on existing standards — by developing a much-needed (and not nearly talked about enough) taxonomic extension for both TEI (the Text Encoding Initiative) and JATS (the Journal Article Tag Suite).  Using these markup standards can make taxonomic information vastly more findable.  However… we’re not all using these standards in the same way, or necessarily understanding where there are overlaps, existing terms, etc. — which defeats the purpose of a standard.  Thus, there’s a need to develop what’s called an “application profile” of standards like TEI, JATS, or wiki markup for field notebooks and other primary source material.  In other words: there are already plenty of ways and languages with which to mark up text, but we still need to continue the conversation beyond this workshop and develop best practices for these standards’ use, thereby best supporting future use of source materials.

3) Finally: Part of this development of best practices needs to include identifying potential use cases, or user communities, that these source materials might be of interest to.  At the workshop, while a range of use cases were articulated — using field books to reconstruct historical ecologies, to clarify specimens’ provenance, to re-find localities — all of these cases could be said to rely on a fairly “close” reading of the text, in which a researcher reads each book line by line and word by word, and might do her own extraction of relevant content.

Supporting this kind of reading is certainly important — but I argue we should also push to support more computational, or distant, kinds of reading: if we truly want to capitalize on the power of digitization, we need to aim to support text and data mining.  “Text and data mining” includes some methods that the biodiversity informatics community is already familiar with: for instance, using natural language processing to identify and extract data like proper names, taxonomic names, dates, and geographic locations; publishing said data to the semantic web; and so on.  However, text mining also includes some methods that the biodiversity community isn’t as familiar with but may find helpful — methods like topic modeling, in which algorithms are used to model a document’s ‘aboutness’, and other machine learning methods.  Though these methods have been used by information scientists, digital humanists and social scientists on their large collections of digital text, they haven’t been used by us (we think?) — partially because it’s only been recently that we’ve had these large digital archives and libraries, and also partially because these methods have been trapped in their disciplinary silos.
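To make “topic modeling” slightly less abstract, here is a minimal, hypothetical sketch using scikit-learn’s LDA implementation on a few invented field-book pages.  This is just one of many possible toolkits, and the example data are made up; it is meant only to show the shape of the technique, not a recommended workflow.

```python
# Minimal, hypothetical sketch of topic modeling over transcribed field-book pages.
# The pages below are invented; scikit-learn is one of several possible toolkits.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

pages = [
    "collected three specimens of Peromyscus near the river crossing",
    "heavy rain all day, no trapping, repaired the north fence",
    "trapped two Peromyscus and one Sorex along the ridge line",
]

# Turn each page into a bag-of-words count vector
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(pages)

# Fit a two-topic LDA model (a rough, automated proxy for each page's "aboutness")
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top words associated with each inferred topic
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}: {', '.join(top)}")
```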

I wish I could better explain these techniques, and better argue for their applicability to biodiversity archives, but I can’t quite yet — partly because, as a student, I’m still learning them myself, and partly because some of this is just new ground.  But I do know this: supporting these kinds of computational approaches could greatly speed up the rate at which we’re able to extract useful scientific information from source materials, and would also likely attract an interdisciplinary crowd of information scientists, digital humanists and other researchers to the collections.

Clearly, there is much more to discuss!  Sounds like we need another workshop, no?


How is finding a consensus among citizen science transcriptions like aligning gene sequences AND textual analysis of medieval codices? Part 2

(cross-posted at SciStarter)

In our last post, we went through the mechanics of how to find consensus from a set of independently created transcriptions by citizen scientists — this involved a mash-up of bioinformatics tools for sequence alignment (repurposed for use with text strings) and natural language processing tools to find tokens and perform some word synonymizing.  In the end, the informatics blender did indeed churn out a consensus — but this attempt at automation led us to realize that there’s more than one kind of consensus.  In this post we want to explore that issue a bit more.

So, let’s return to our example text:

Some volunteers spelled out abbreviations (changing “SE” to “Southeast”) or corrected errors on the original label (changing “Biv” to “River”); but others did their best to transcribe each label verbatim – typos and all.

These differences in transcription style led us to ask — when we build “consensus,” what kind do we want? Do we want a verbatim transcription of each label (thus preserving a more accurate, historical record)?  Or do we want to take advantage of our volunteers’ clever human brains, and preserve the far more legible, more georeferenceable strings that they (and the text clean-up algorithms described in our last post) were able to produce? Which string is more ‘canonical’?

Others have asked these questions before us — in fact, after doing a bit of research (read: googling and reading Wikipedia), we realized we were essentially reinventing the wheel that is textual criticism, “the branch of literary criticism that is concerned with the identification and removal of transcription errors in the texts of manuscripts” (thanks, Wikipedia!).  Remember, before there were printing presses there were scribes: individuals tasked with transcribing sometimes messy, sometimes error-ridden texts by hand — sometimes introducing new errors in the process.  Scholars studying these older, hand-duplicated texts often must resolve discrepancies across different copies of a manuscript (or “witnesses”) in order to create either:

– a “critical edition” of the text, one which “most closely approximates the original”, or

– a “copy-text” edition, in which “the critic examines the base text and makes corrections (called emendations) in places where the base text appears wrong” (thanks again, Wikipedia).

Granted, the distinction between a “critical edition” and a “copy-text edition” may be a little unwieldy when applied to something like a specimen label as opposed to a manuscript.  And while biodiversity data standards developers have recognized the issue — Darwin Core, for example, has “verbatim” and “interpreted” fields (e.g. dwc:verbatimLatitude) — those existing terms don’t necessarily capture the complexity of multiple interpretations, done multiple times, by multiple people and algorithms, followed by a further interpretation to compute some final “copy text”.  Citizen science approaches place us right between existing standards-oriented thinking in biodiversity informatics and edition-oriented thinking in the humanities.  This middle spot is a challenging but fascinating one — and another confirmation of the clear, and increasing, interdisciplinarity of fields like biodiversity informatics and the digital humanities.
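As a small, purely hypothetical sketch of that gap, here is one way a single label record might be structured so that the verbatim string, each individual interpretation, and the computed “copy text” all live side by side.  The structure and field grouping are our invention (the example strings are borrowed from the Friona label discussed in Part 1), though dwc:verbatimLocality and dwc:locality are real Darwin Core terms.

```python
# Hypothetical record structure (our invention, not an existing standard) that keeps
# the verbatim label text, every individual interpretation, and the final consensus.
label_record = {
    "dwc:verbatimLocality": "Friona, 10 mi.N.",          # as written on the physical label
    "interpretations": [
        {"agent": "volunteer_1", "locality": "Friona, 10 mi.N."},
        {"agent": "volunteer_2", "locality": "10 miles north of Friona"},
        {"agent": "consensus_script", "locality": "10 miles North Friona"},
    ],
    "dwc:locality": "10 miles North Friona",             # the chosen "copy-text" edition
}
```

Even this toy structure hints at why a single verbatim/interpreted pair of fields starts to feel cramped once several people and algorithms have each had a pass at the same label.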

In prior posts, we’ve talked about finding links between the sciences and humanities — what better example of cross-discipline-pollination than this?  Before, we mentioned we’re not the first to meditate on the meaning of “consensus” — we’re also not the first to repurpose tools originally designed for phylogenetic analysis for use with general text; linguists and others in the field of phylomemetics (h/t to Nic Weber for the linked paper) have been doing the same for years.  While the sciences and humanities may still have very different research questions and epistemologies, our informatics tools have much in common.  Being aware of, if not making use of, one another’s conceptual frameworks may be a first step to sharing informatics tools, and building towards new, interesting collaborations.

Finally, back to our question about what we mean by “consensus”: we can now see that our volunteers and algorithms are currently better suited to creating “copy-text” editions, or interpreted versions of the specimen labels — which makes sense, given the many levels of human and machine interpretation that each label goes through.  Changes to the NfN transcription workflow would need to be made if museums want a “critical edition,” or verbatim version, of each label as well.  Whether this is necessary is up for debate, however — would the preserved image, on which the transcriptions were based, be enough for museum curators’ and collection managers’ purposes?  Could that be our most “canonical” representation of the label, to which we link later interpretations?  More (interdisciplinary) work and discussion is clearly necessary — but we hope this first attempt to link a few disparate fields and methods will help open the door for future exchange of ideas and methods.

References, and links of potential interest:

If you’re interested in learning more about DH tools relevant to this kind of work, check out Juxta, an open source software package designed to support collation and comparison of different “witnesses” (or texts).

For more on phylomemetics:

Howe, C. J., & Windram, H. F. (2011). Phylomemetics–evolutionary analysis beyond the gene. PLoS Biology, 9(5), e1001069. doi:10.1371/journal.pbio.1001069


How is finding a consensus among citizen science transcriptions like aligning gene sequences AND textual analysis of medieval codices? Part 1

We’ve always been interested in citizen science approaches to the digitization of museum labels and ledgers, even before there were tools out there to do it. But now, projects such as Notes from Nature (NfN; notesfromnature.org) haven’t just built the tools — they’re giving us the opportunity to put those tools to the test, and in doing so, more deeply explore how museums can work with data generated not just by volunteers, but volunteers at a distance.

As NfN and other Zooniverse volunteers may already know, the specimen records in NfN aren’t transcribed just once — they’re actually transcribed at least four times each. These redundant transcriptions are collected to improve the accuracy of transcriptions overall, the idea being that the four-plus transcripts can be combined and compared against each other to create a final, canonical version of each specimen record. But how?

In this post, we want to present one of many potential approaches — some of which is also discussed in a related blog post over at NfN (http://blog.notesfromnature.org/2014/01/14/checking-notes-from-nature-data/).  But here, we want to lead up to a discussion not just of how to create a canonical, consensus record — but also of what we mean when we say “canonical” in the first place.

But before we get to that, let’s look at some raw data, and briefly review the “how”.  Below are two raw output transcription records from Notes from Nature (the Calbug project, to be exact), with some minor edits (some columns have been removed based on relevance, etc.).  The main columns to focus on are subject_id, which identifies the “subject” (in this case the image being transcribed), and, for the purposes of illustration, the locality string that was entered by the transcribers.

[Table: raw Notes from Nature transcription records for the two subjects]

Here are the actual images from the Essig Entomology Museum database so you can compare the transcripts to what the labels look like:

Label 1
Label 2

The first data curation phase involves some general normalizations: abbreviations and capitalizations are standardized with a simple script, and we perform some find-and-replace operations to sort out blank spaces and other problem characters.  The transcriptions and subject_ids are then formatted for use with something most biologists will be familiar with: sequence alignment software.
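The actual scripts aren’t reproduced here, but a minimal sketch of that normalization and FASTA-formatting step might look something like the code below.  The abbreviation table, the underscores-for-spaces convention, and the file names are our assumptions for illustration, not the real Notes from Nature code.

```python
# Minimal, hypothetical sketch of the normalization + FASTA-formatting step.
# The abbreviation table and underscore convention are illustrative assumptions.
import re

ABBREVIATIONS = {"n": "North", "s": "South", "e": "East", "w": "West",
                 "mi": "miles", "se": "Southeast"}  # toy examples only

def normalize(transcript):
    """Expand a few abbreviations, drop problem characters, and underscore the spaces."""
    text = re.sub(r"[^A-Za-z0-9 ]", " ", transcript)          # strip punctuation and odd characters
    words = [ABBREVIATIONS.get(w.lower(), w) for w in text.split()]
    return "_".join(words)                                    # underscores keep words intact through alignment

def write_fasta(subject_id, transcripts, path):
    """Write one FASTA record per volunteer transcription of the same subject (image)."""
    with open(path, "w") as out:
        for i, t in enumerate(transcripts):
            out.write(f">{subject_id}_{i}\n{normalize(t)}\n")

write_fasta("516d7fa4ea3052046a000004",
            ["Friona, 10 mi.N.", "10 miles north of Friona", "10 miles North Friona"],
            "subject1.fasta")
# The resulting FASTA file can then be aligned with a tool such as MAFFT.
```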

Here are the FASTA (http://en.wikipedia.org/wiki/FASTA_format) formatted outputs for the two specimens above:

[Figure: the unaligned, FASTA-formatted transcripts for the two subjects]

Why FASTA format, and why sequence alignment software?  Because the next step of the normalization process involves aligning each word of the locality string — much in the same way that sequence alignment involves the alignment of homologous characters.  And just as sequences can have both mutations and “gaps” that need to be discovered through alignment, so too do transcripts.  Aligned results (using MAFFT) are below:

[Figure: the MAFFT-aligned transcripts]

[Note from Andrea:  The use of sequence alignment tools on plain text is not unheard of!  Linguists and textual critics have been applying phylogenetic tools and methods to things other than genes and traits for a while now — more on this next post.]

Normalizing and aligning the outputs makes it much easier to see a growing consensus, but note that both of these alignments still have a few problems.  In the first case, only one transcriber transcribed the record verbatim from the label (e.g. “Friona, 10 mi.N.”) while everyone else transposed it.  And one transcriber also put an “of” in the transcription that isn’t there (e.g. “10 miles north of Friona”).  To account for variants that only occur once across the pool of transcripts, another script is used to remove all positions from the alignment where there is only one letter and the rest are gaps.  So the very first position, for example, is all gaps except for one “F” and thus is culled.
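A rough sketch of that culling step, assuming the MAFFT output has been read back in with Biopython, is below; the file name is hypothetical, and the real script may well differ.

```python
# Rough sketch of the culling step: drop alignment columns supported by only one
# transcript (a single letter, all other rows gaps). File names are hypothetical.
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def cull_singleton_columns(alignment):
    # Keep only the columns where at least two transcripts have a (non-gap) character
    keep = [col for col in range(alignment.get_alignment_length())
            if sum(1 for c in alignment[:, col] if c != "-") > 1]
    records = [SeqRecord(Seq("".join(str(rec.seq)[i] for i in keep)),
                         id=rec.id, description="")
               for rec in alignment]
    return MultipleSeqAlignment(records)

alignment = AlignIO.read("subject1_aligned.fasta", "fasta")
culled = cull_singleton_columns(alignment)
```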

The final output alignments for both records look like this:

[Figure: the culled alignments]

The penultimate step is to take a consensus of these aligned transcripts using Biopython’s dumb_consensus method (http://biopython.org/DIST/docs/api/Bio.Align.AlignInfo.SummaryInfo-class.html#dumb_consensus).  The output is basically all the places where the alignment agrees at least 50% of the time, minus all gaps.  The final step is to reconvert underscores to spaces for readability, and print a final consensus transcript — in this case:

#   subject_id                    consensus
4   516d7fa4ea3052046a000004      10 miles North Friona
6   516d7fa6ea3052046a00000a      Rushing Xiver Provincial Park 15 miles Southeast

Note: The “X” in the second transcript is a case where there was a “tie” — 50% of the records said “B” and 50% said “R”, so the script places an “X” in those cases.
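For the curious, here is a minimal sketch of that consensus-plus-cleanup step in Biopython; the file name is hypothetical, and the details of the real script may differ.

```python
# Sketch of the final consensus step. SummaryInfo.dumb_consensus emits the ambiguous
# character ("X") whenever no single letter clears the threshold in a column,
# which is where the ties described above come from.
from Bio import AlignIO
from Bio.Align.AlignInfo import SummaryInfo

alignment = AlignIO.read("subject1_culled.fasta", "fasta")   # hypothetical file name
consensus = SummaryInfo(alignment).dumb_consensus(threshold=0.5, ambiguous="X")

# Reconvert underscores to spaces (and drop any stray gap characters) for readability
readable = str(consensus).replace("-", "").replace("_", " ").strip()
print(readable)   # e.g. "10 miles North Friona"
```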

So how’d the consensus approach work?  In one sense pretty darn well. The results are readable and pretty reflective of the label contents — maybe an improvement, even.  In another sense, they don’t resemble the verbatim label contents very much at all (click on those image links above to view the verbatim labels).

So, this is where we return to our newfound struggle to define “canonical”: do we want labels that are verbatim transcriptions of the specimen labels?  Or do we want easily “georeferenceable” contents that are accurate, normalized, and easily interpretable to future researchers?  Or both?

The answer, of course, is both, and more.  Next post, we’ll discuss why we need both verbatim AND interpreted, easily georeferenceable transcriptions, and how studies of medieval manuscripts and natural history museum curation ain’t so different after all….
