Last week, we both attended a workshop on Primary Source Material digitization organized by iDigBio, where we had some great, truly interdisciplinary conversations with folks like Rusty Russell over at the Field Book Project, Ben Brumfield at From The Page, Terry Catapano at Plazi and many others. We’re both eager to stay involved with this working group — the topic is a rich one that connects the museum and library communities — but I (Andrea) did want to get down a few thoughts about primary source (aka field book and similar documentation or realia) digitization in general, and some potential future directions, before they escape my memory:
1) At the workshop, we spent a lot of time doing things like describing different imaging workflows, potential citizen science applications, and defining what exactly we mean when we say “source materials” (all important tasks), but we didn’t spend much time talking about formatting and making available all the already transcribed field books that are surely scattered on hard drives throughout natural history museums all over the world. While the various CLIR “Hidden Collections” grants have helped support the cataloging and preservation of paper archives, I suspect there are a huge number of already semi-digitized (e.g. transcribed into Word/text/WordPerfect docs, or even typewritten files) field notes out there without any long-term preservation plan (particularly in smaller museums). As fragile as paper archives are, digital archives can be even more delicate; given a dry room, paper will remain legible for centuries, whereas the information contained on a hard drive will degrade unless periodically migrated. Existing transcriptions are likely housed on older media, like floppy disks and Zip drives — making their fragility even more acute.
Why not just re-transcribe the notebooks once they’re scanned? Because good transcriptions are massively time-consuming to produce! Thus, these already transcribed field books represent dozens to hundreds of person-hours of work per book that are in danger of being lost if not properly curated. There are ways — in fact, an entire field of study dedicated to articulating the ways — to conduct this curation, but it needs to be identified as a priority in order for that to happen. For now, we want to make sure we keep shedding a little light on this particular breed of “dark data,” and encourage everyone to keep thinking about how to make it a little less obscure. Do many folks out there have digitized products but aren’t quite sure about next steps? If you have any of those fragile digital copies out there, we want to hear about them!
2) Similarly, while we spent a fair bit of time talking about what it would take to get books newly transcribed, we didn’t spend much time talking about important details like how to best format, preserve and present these notebooks once typed, or how those various formats and presentations will support different kinds of use. It’s not enough to just type field books into a text file and throw them on a server somewhere. It’s not even enough to provide some metadata about the field notebooks. And it may not even be enough to develop a wiki-style markup syntax from scratch for every single project out there. We need to move toward a standardized way of “marking up” these transcriptions so that they’re more findable, machine-readable and text-mineable for future researchers, and to ensure they don’t end up glued down in a particular database system.
Projects like the Smithsonian’s Transcription Project, Ben Brumfield’s From The Page software (currently being used by the MVZ for field note transcription), and our own Henderson Field Book Project make use of wiki-style syntax to annotate items in digitized text. What happens next, though? How do we search across these resources? For the Henderson Project, our solution was to ultimately rip the records from the page (no pun intended, Ben), and place them into a format that ensures interoperability (e.g. Darwin Core Archives), while also linking back to the page where those records resided. Ideally, however, one could find a different solution, one that builds on existing markup standards.
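To make that “rip the records from the page” step concrete, here’s a minimal sketch of flattening wiki-style annotations into Darwin Core terms. The `{{term|value}}` syntax, the tag names, and the tag-to-term mapping are all invented for illustration — this is not the Henderson Project’s actual pipeline, just the general shape of the idea:

```python
import csv
import io
import re

# A transcribed field book page with hypothetical {{tag|value}} annotations.
PAGE = (
    "April 3. Collected two specimens of {{taxon|Peromyscus maniculatus}} "
    "near {{locality|Pine Creek, 3 mi N of camp}} on {{date|1932-04-03}}."
)

ANNOTATION = re.compile(r"\{\{(\w+)\|([^}]+)\}\}")

# Map our invented wiki tags to (a small subset of) Darwin Core terms.
DWC_MAP = {"taxon": "scientificName", "locality": "locality", "date": "eventDate"}

def page_to_dwc(text):
    """Collect the annotations on one page into a single Darwin Core row."""
    row = {}
    for tag, value in ANNOTATION.findall(text):
        if tag in DWC_MAP:
            row[DWC_MAP[tag]] = value.strip()
    return row

# Write the row out as CSV, the occurrence-file format inside a Darwin Core Archive.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["scientificName", "locality", "eventDate"])
writer.writeheader()
writer.writerow(page_to_dwc(PAGE))
print(out.getvalue())
```

A real pipeline would, of course, also carry a stable identifier for each row that links back to the scanned page image — that back-link is what keeps the extracted record honest.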
At the workshop, we learned that Terry Catapano and Plazi have been doing just that — building on existing standards — by developing a much needed (and not nearly talked about enough) taxonomic extension for both TEI (the Text Encoding Initiative) and JATS (the Journal Article Tag Suite). Using these markup standards can make taxonomic information vastly more findable. However… we’re not all using these standards in the same way, or necessarily understanding where there are overlaps, existing terms, etc. — defeating the purpose of a standard. Thus, there’s a need to develop what’s called an “application profile” of standards like TEI, JATS, or wiki markup for field notebooks and other primary source material. In other words: there are already plenty of ways and languages with which to mark up text, but we still need to continue the conversation beyond this workshop, and develop best practices for these standards’ use, thereby best supporting future use of source materials.
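To give a flavor of what standards-based markup buys you, here’s a toy sketch of a field note entry encoded with TEI-flavored elements. The element and attribute choices below are illustrative only — the real TEI Guidelines and Plazi’s taxonomic extension define their own vocabularies — but the point stands: once everyone tags taxa and places the same way, a single query works across every project’s transcriptions. That shared set of choices is exactly what an application profile would pin down:

```python
import xml.etree.ElementTree as ET

# Build a small TEI-style fragment. The names here ("div", "name",
# "placeName") exist in TEI, but how we combine them is our own sketch,
# not a sanctioned profile.
entry = ET.Element("div", {"type": "fieldNoteEntry", "when": "1932-04-03"})
p = ET.SubElement(entry, "p")
p.text = "Collected two specimens of "
taxon = ET.SubElement(p, "name", {"type": "taxon", "key": "Peromyscus maniculatus"})
taxon.text = "P. maniculatus"
taxon.tail = " near "
place = ET.SubElement(p, "placeName")
place.text = "Pine Creek, 3 mi N of camp"
place.tail = "."

print(ET.tostring(entry, encoding="unicode"))

# Because the taxon is tagged, finding every taxon mention is one line --
# no regex guessing against free text:
taxa = [el.get("key") for el in entry.iter("name") if el.get("type") == "taxon"]
print(taxa)
```

Note the `key` attribute: it lets the transcription show the abbreviated name the collector actually wrote (“P. maniculatus”) while still carrying the full, machine-matchable name alongside it.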
3) Finally: Part of this development of best practices needs to include identifying the potential use cases for, and user communities interested in, these source materials. At the workshop, while a range of use cases were articulated — using field books to reconstruct historical ecologies, to clarify specimens’ provenance, to re-find localities — all of these cases could be said to rely on a fairly “close” reading of the text, in which a researcher reads each book line by line and word by word, and might do her own extraction of relevant content.
Supporting this kind of reading is certainly important — but I argue we should also push to support more computational, or distant, kinds of reading: if we truly want to capitalize on the power of digitization we need to aim to support text and data mining. “Text and data mining” includes some methods that the biodiversity informatics community is already familiar with: for instance, using natural language processing to identify and extract data like proper names, taxonomic names, dates, and geographic locations; publishing said data to the semantic web; and so on. However, text mining also includes some methods that the biodiversity community isn’t as familiar with but may find helpful — methods like topic modeling, in which algorithms are used to model a document’s ‘aboutness’, and other machine learning methods. Though these methods have been used by information scientists, digital humanists and social scientists on their large collections of digital text, they haven’t been used by us (we think?) — partially because it’s only been recently that we’ve had these large digital archives and libraries, and also partially because these methods have been trapped in their disciplinary silos.
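To show the very simplest end of this spectrum, here’s a toy sketch of rule-based extraction: pulling candidate dates and Latin binomials out of plain transcribed notes with regular expressions. The sample notes and patterns are invented; real NLP pipelines go much further (checking candidates against name lists, disambiguating places, and so on), and topic modeling is a different, statistical beast entirely — but even this crude pass hints at what machine-readable transcriptions make possible:

```python
import re

# Two invented lines of transcribed field notes.
NOTES = [
    "1932-04-03: Peromyscus maniculatus trapped along the north creek ridge.",
    "1932-04-05: Sorex vagrans taken in wet meadow; creek running high.",
]

# Candidate Latin binomials: a capitalized genus followed by a lowercase
# epithet. A crude heuristic -- it will produce false positives on ordinary
# prose, which is exactly why real pipelines validate against name lists.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")
DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")

def extract(note):
    """Return the candidate dates and taxa found in one note."""
    return {"dates": DATE.findall(note), "taxa": BINOMIAL.findall(note)}

for note in NOTES:
    print(extract(note))
```

Run over thousands of pages, even output this rough becomes a searchable index of who collected what, where, and when — the raw material for the historical-ecology use cases discussed above.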
I wish I could better explain these techniques, and better argue for their applicability to biodiversity archives, but I can’t quite yet — both because, as a student, I’m still learning them myself, and because some of this is just new ground. But I do know this: supporting these kinds of computational approaches could greatly speed up the rate at which we’re able to extract useful scientific information from source materials, and would also likely attract an interdisciplinary crowd of information scientists, digital humanists and other researchers to the collections.
Clearly, there is much more to discuss! Sounds like we need another workshop, no?