(cross-posted at SciStarter)
In our last post, we went through the mechanics of how to find consensus from a set of independently created transcriptions by citizen scientists. This involved a mash-up of bioinformatics tools for sequence alignment (repurposed for use with text strings) and natural language processing tools to find tokens and perform some word synonymizing. In the end, the informatics blender did indeed churn out a consensus, but this attempt at automation led us to realize that there’s more than one kind of consensus. In this post we want to explore that issue a bit more.
So, let’s return to our example text:
Some volunteers spelled out abbreviations (changing “SE” to “Southeast”) or corrected errors on the original label (changing “Biv” to “River”); but others did their best to transcribe each label verbatim – typos and all.
These differences in transcription style led us to ask — when we build “consensus,” what kind do we want? Do we want a verbatim transcription of each label (thus preserving a more accurate, historical record)? Or do we want to take advantage of our volunteers’ clever human brains, and preserve the far more legible, more georeferenceable strings that they (and the text clean-up algorithms described in our last post) were able to produce? Which string is more ‘canonical’?
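To make the two kinds of consensus concrete, here is a minimal sketch in Python. It is not the alignment pipeline from our last post: it assumes the transcriptions already line up token for token, the sample strings are invented for illustration, and the `EXPANSIONS` table is a hypothetical stand-in built only from the two examples above.

```python
from collections import Counter

# Hypothetical lookup table, built only from the two examples in the text.
EXPANSIONS = {"SE": "Southeast", "Biv": "River"}

def majority(tokens):
    """Return the most common token in one column of aligned transcriptions."""
    return Counter(tokens).most_common(1)[0][0]

def consensus(transcriptions, normalize=False):
    """Token-wise majority vote; optionally expand/correct tokens first."""
    rows = [t.split() for t in transcriptions]
    if normalize:
        rows = [[EXPANSIONS.get(tok, tok) for tok in row] for row in rows]
    # Assumes every transcription has the same number of tokens (no alignment step).
    return " ".join(majority(col) for col in zip(*rows))

# Invented sample data: two verbatim transcribers, one interpreting transcriber.
transcriptions = [
    "5 mi SE of Red Biv",
    "5 mi SE of Red Biv",
    "5 mi Southeast of Red River",
]

verbatim = consensus(transcriptions)                     # "5 mi SE of Red Biv"
interpreted = consensus(transcriptions, normalize=True)  # "5 mi Southeast of Red River"
```

The same three inputs give two different "consensus" strings depending on whether interpretation happens before the vote, which is exactly the choice we are wrestling with.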
Others have asked these questions before us. In fact, after doing a bit of research (read: googling and reading Wikipedia), we realized we were essentially reinventing the wheel that is textual criticism, “the branch of literary criticism that is concerned with the identification and removal of transcription errors in the texts of manuscripts” (thanks, Wikipedia!). Remember, before there were printing presses there were scribes: individuals tasked with transcribing sometimes messy, sometimes error-ridden texts by hand, sometimes introducing new errors in the process. Scholars studying these older, hand-duplicated texts often must resolve discrepancies across different copies of a manuscript (or “witnesses”) in order to create either:
– a “critical edition” of the text, one which “most closely approximates the original”, or
– a “copy-text” edition, in which “the critic examines the base text and makes corrections (called emendations) in places where the base text appears wrong” (thanks again, Wikipedia).
Granted, the distinction between a “critical edition” and a “copy-text edition” may be a little unwieldy when applied to something like a specimen label as opposed to a manuscript. And while existing biodiversity data standards developers have recognized the issue (Darwin Core, for example, has “verbatim” and “interpreted” fields, e.g. dwc:verbatimLatitude), those existing terms don’t necessarily capture the complexity of multiple interpretations, done multiple times, by multiple people and algorithms, followed by a further interpretation to compute some final “copy text”. Citizen science approaches place us right between existing standards-oriented thinking in biodiversity informatics and edition-oriented thinking in the humanities. This middle spot is a challenging but fascinating one, and another confirmation of the clear, and increasing, interdisciplinarity of fields like biodiversity informatics and the digital humanities.
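As a rough illustration of what a single verbatim/interpreted pair leaves out, here is one way such a layered record might be sketched in Python. The class and field names (`LabelField`, `Interpretation`, `agent`, `copy_text`) are our own hypothetical inventions, not Darwin Core terms, and the sample values are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Interpretation:
    value: str  # the interpreted string
    agent: str  # who or what produced it (a volunteer, an algorithm, ...)

@dataclass
class LabelField:
    verbatim: str  # text exactly as it appears on the label
    interpretations: List[Interpretation] = field(default_factory=list)
    copy_text: Optional[str] = None  # final reconciled reading, if one exists

# Invented example: one label field, read by two volunteers and one script.
locality = LabelField(verbatim="SE of Biv")
locality.interpretations.append(Interpretation("Southeast of River", agent="volunteer-1"))
locality.interpretations.append(Interpretation("SE of River", agent="volunteer-2"))
locality.interpretations.append(Interpretation("Southeast of River", agent="cleanup-script"))
locality.copy_text = "Southeast of River"  # the further interpretation on top
```

Keeping every interpretation alongside its source, rather than overwriting one "interpreted" value, is what would let a later reader see how the final copy text was reached.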
In prior posts, we’ve talked about finding links between the sciences and humanities — what better example of cross-discipline-pollination than this? Before, we mentioned we’re not the first to meditate on the meaning of “consensus” — we’re also not the first to repurpose tools originally designed for phylogenetic analysis for use with general text; linguists and others in the field of phylomemetics (h/t to Nic Weber for the linked paper) have been doing the same for years. While the sciences and humanities may still have very different research questions and epistemologies, our informatics tools have much in common. Being aware of, if not making use of, one another’s conceptual frameworks may be a first step to sharing informatics tools, and building towards new, interesting collaborations.
Finally, back to our question about what we mean by “consensus”: we can now see that our volunteers and algorithms are currently better suited to creating “copy-text” editions, or interpreted versions of the specimen labels, which makes sense given the many levels of human and machine interpretation that each label goes through. Changes to the NfN transcription workflow would need to be made if museums want a “critical edition,” or verbatim version of each label as well. Whether this is necessary is up for debate, however: would the preserved image, on which transcriptions were based, be enough for museum curators’ and collection managers’ purposes? Could that be our most “canonical” representation of the label, to which we link later interpretations? More (interdisciplinary) work and discussion is clearly necessary, but we hope this first attempt to link a few disparate fields and methods will help open the door for future exchange of ideas and methods.
References, and links of potential interest:
If you’re interested in learning more about DH tools relevant to this kind of work, check out Juxta, an open source software package designed to support collation and comparison of different “witnesses” (or texts).
For more on phylomemetics:
Howe, C. J., & Windram, H. F. (2011). Phylomemetics–evolutionary analysis beyond the gene. PLoS Biology, 9(5), e1001069. doi:10.1371/journal.pbio.1001069