How is finding a consensus among citizen science transcriptions like aligning gene sequences AND textual analysis of medieval codices? Part 2

(cross-posted at SciStarter)

In our last post, we went through the mechanics of finding consensus among a set of transcriptions created independently by citizen scientists. This involved a mash-up of bioinformatics tools for sequence alignment (repurposed for use with text strings) and natural language processing tools to find tokens and perform some word synonymizing.  In the end, the informatics blender did indeed churn out a consensus, but this attempt at automation led us to realize that there’s more than one kind of consensus.  In this post we want to explore that issue a bit more.

So, let’s return to our example text:

Some volunteers spelled out abbreviations (changing “SE” to “Southeast”) or corrected errors on the original label (changing “Biv” to “River”); but others did their best to transcribe each label verbatim – typos and all.

These differences in transcription style led us to ask — when we build “consensus,” what kind do we want? Do we want a verbatim transcription of each label (thus preserving a more accurate, historical record)?  Or do we want to take advantage of our volunteers’ clever human brains, and preserve the far more legible, more georeferenceable strings that they (and the text clean-up algorithms described in our last post) were able to produce? Which string is more ‘canonical’?
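To make the two options concrete, here is a minimal sketch in Python. It assumes the transcriptions have already been tokenized and aligned column by column (the hard part, covered in our last post). The label tokens are made up for illustration, and the tiny substitution table is a hypothetical stand-in for the clean-up step, built from the two examples above. It is an illustration of the idea, not the code we actually run.

```python
from collections import Counter

# Hypothetical, already-aligned transcriptions of one label (invented for
# illustration; the alignment step itself is not shown here).
TRANSCRIPTIONS = [
    ["5", "mi.", "SE", "of", "Biv"],
    ["5", "mi.", "Southeast", "of", "River"],
    ["5", "mi.", "Southeast", "of", "River"],
    ["5", "mi.", "SE", "of", "Biv"],
    ["5", "mi.", "SE", "of", "River"],
]

# Hypothetical clean-up table: expand abbreviations, fix known label typos.
SUBSTITUTIONS = {"SE": "Southeast", "Biv": "River"}


def normalize(token):
    """Map a token to its interpreted form, if the table has one."""
    return SUBSTITUTIONS.get(token, token)


def consensus(rows, interpret=False):
    """Majority vote per aligned column; ties go to the first token seen."""
    result = []
    for column in zip(*rows):
        tokens = [normalize(t) for t in column] if interpret else list(column)
        result.append(Counter(tokens).most_common(1)[0][0])
    return " ".join(result)


print(consensus(TRANSCRIPTIONS))                  # vote on the raw tokens
print(consensus(TRANSCRIPTIONS, interpret=True))  # vote after clean-up
```

With these made-up inputs, the vote on raw tokens yields “5 mi. SE of River”, a hybrid that is neither fully verbatim nor fully interpreted, while the vote after clean-up yields “5 mi. Southeast of River”. Mixing transcription styles in a single vote can therefore produce a string that no single volunteer actually typed.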

Others have asked these questions before us. In fact, after doing a bit of research (read: googling and reading Wikipedia), we realized we were essentially reinventing the wheel that is textual criticism, “the branch of literary criticism that is concerned with the identification and removal of transcription errors in the texts of manuscripts” (thanks, Wikipedia!).  Remember, before there were printing presses there were scribes: individuals tasked with transcribing sometimes messy, sometimes error-ridden texts by hand, sometimes introducing new errors in the process.  Scholars studying these older, hand-duplicated texts often must resolve discrepancies across different copies of a manuscript (or “witnesses”) in order to create either:

– a “critical edition” of the text, one which “most closely approximates the original”, or

– a “copy-text” edition, in which “the critic examines the base text and makes corrections (called emendations) in places where the base text appears wrong” (thanks again, Wikipedia).
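Either kind of edition starts with collation: lining up the witnesses and listing where they disagree. As a rough sketch of that step, the snippet below compares two made-up witnesses word by word using Python’s standard `difflib` module. The function name `collate` and the sample strings are our own, chosen for illustration; dedicated collation tools do considerably more than this.

```python
from difflib import SequenceMatcher


def collate(witness_a, witness_b):
    """Return (reading_a, reading_b) pairs where two witnesses disagree."""
    a, b = witness_a.split(), witness_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record the differing span from each witness side by side.
            variants.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return variants


# Two invented witnesses of the same label.
for reading_a, reading_b in collate("5 mi. SE of Biv", "5 mi. Southeast of River"):
    print(f"{reading_a!r} vs. {reading_b!r}")
```

An editor (or an algorithm) then decides, variant by variant, which reading to keep, and that decision is where the two kinds of edition part ways.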

Granted, the distinction between a “critical edition” and a “copy-text edition” may be a little unwieldy when applied to something like a specimen label as opposed to a manuscript.  And while the developers of existing biodiversity data standards have recognized the issue (Darwin Core, for example, has “verbatim” and “interpreted” fields, e.g. dwc:verbatimLatitude), those existing terms don’t necessarily capture the complexity of multiple interpretations, done multiple times, by multiple people and algorithms, followed by a further interpretation to compute some final “copy text”.  Citizen science approaches place us right between existing standards-oriented thinking in biodiversity informatics and edition-oriented thinking in the humanities.  This middle spot is a challenging but fascinating one, and another confirmation of the clear, and increasing, interdisciplinarity of fields like biodiversity informatics and the digital humanities.
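One way to picture what a single verbatim/interpreted pair leaves out is a record that keeps every interpretation along with who or what produced it and what it was derived from. The sketch below is purely illustrative: it is not a Darwin Core structure, and every class name, field name, and sample value in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Interpretation:
    text: str
    agent: str                          # person or algorithm that produced it
    kind: str                           # "verbatim" or "interpreted"
    derived_from: Optional[int] = None  # index of the interpretation it builds on


@dataclass
class LabelRecord:
    image_uri: str  # the label image that every transcription starts from
    interpretations: List[Interpretation] = field(default_factory=list)

    def add(self, text, agent, kind, derived_from=None):
        """Append an interpretation and return its index for later linking."""
        self.interpretations.append(Interpretation(text, agent, kind, derived_from))
        return len(self.interpretations) - 1

    def by_kind(self, kind):
        """Return the text of every interpretation of the given kind."""
        return [i.text for i in self.interpretations if i.kind == kind]


# Hypothetical usage: one label, three layers of interpretation.
record = LabelRecord(image_uri="label-0001.jpg")
first = record.add("5 mi. SE of Biv", "volunteer A", "verbatim")
second = record.add("5 mi. Southeast of River", "volunteer B", "interpreted")
record.add("5 mi. Southeast of River", "consensus script", "interpreted",
           derived_from=second)
print(record.by_kind("verbatim"))
print(record.by_kind("interpreted"))
```

Keeping the chain of derivations explicit means a final “copy text” can always be traced back through the people and algorithms that shaped it.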

In prior posts, we’ve talked about finding links between the sciences and humanities — what better example of cross-discipline-pollination than this?  Before, we mentioned we’re not the first to meditate on the meaning of “consensus” — we’re also not the first to repurpose tools originally designed for phylogenetic analysis for use with general text; linguists and others in the field of phylomemetics (h/t to Nic Weber for the linked paper) have been doing the same for years.  While the sciences and humanities may still have very different research questions and epistemologies, our informatics tools have much in common.  Being aware of, if not making use of, one another’s conceptual frameworks may be a first step to sharing informatics tools, and building towards new, interesting collaborations.

Finally, back to our question about what we mean by “consensus”: we can now see that our volunteers and algorithms are currently better suited to creating “copy-text” editions, or interpreted versions of the specimen labels, which makes sense given the many levels of human and machine interpretation that each label goes through.  Changes to the Notes from Nature (NfN) transcription workflow would need to be made if museums want a “critical edition,” or verbatim version of each label, as well.  Whether this is necessary is up for debate, however: would the preserved image on which the transcriptions were based be enough for museum curators’ and collection managers’ purposes? Could that be our most “canonical” representation of the label, to which we link later interpretations? More (interdisciplinary) work and discussion is clearly necessary, but we hope this first attempt to link a few disparate fields and methods will help open the door for future exchange of ideas and methods.

References, and links of potential interest:

If you’re interested in learning more about DH tools relevant to this kind of work, check out Juxta, an open source software package designed to support collation and comparison of different “witnesses” (or texts).

For more on phylomemetics:

Howe, C. J., & Windram, H. F. (2011). Phylomemetics–evolutionary analysis beyond the gene. PLoS biology, 9(5), e1001069. doi:10.1371/journal.pbio.1001069


About Andrea

Andrea is a Ph.D. student in Library and Information Science at the University of Illinois at Urbana-Champaign, and is supported by the Center for Informatics Research in Science and Scholarship.

3 Responses to How is finding a consensus among citizen science transcriptions like aligning gene sequences AND textual analysis of medieval codices? Part 2

  1. William Ulate says:

    Thanks Andrea, I was expecting this second part since Rob’s first post! VERY interesting suggestions and great references to look into! For the copy-text version of the blog, I’m emending DH with Digital Humanities and h/t with hat tip 😉

    • And to address your question, I imagine the scan of the specimen label would suffice as a canonical form (the purest of them all), but it doesn’t address the requirements for further usage that a critical-edition text version could allow, nor does it incorporate the high added value that a human-generated copy-text edition could provide. For me, an element not mentioned here is how much confidence the user places in the interpretation that a transcriber could include in the process when producing a copy-text version. (In your label example, they might argue that someone will change “mi.” to “minutes”.) But my guess is that this is what the consensus will take care of. Anyway, the best solution I’ve found to address this is allowing the identification of the intended purpose when the version was produced and making that potential filter available. This allows both the critical edition (verbatim) and the copy-text (any correction) versions to co-exist as long as they can be filtered out when needed, just like specimens collected and observations made are plotted on the same map!

  2. Snakeweight says:

    Reblogged this on Snakeweight and commented:
    Great observation from the Notes from Nature team: ‘Citizen science approaches place us right between existing standards-oriented thinking in biodiversity informatics and edition-oriented thinking in the humanities.’
