Re:Sourcing Primary Materials: Notes from A Workshop

Last week, we both attended a workshop on primary source material digitization organized by iDigBio, where we had some great, truly interdisciplinary conversations with folks like Rusty Russell at the Field Book Project, Ben Brumfield at From The Page, Terry Catapano at Plazi, and many others. We’re both eager to stay involved with this working group — the topic is a rich one that connects the museum and library communities — but I (Andrea) did want to get down a few thoughts about primary source (aka field books and similar documentation or realia) digitization in general, and some potential future directions, before they escape my memory:

1) At the workshop, we spent a lot of time doing things like describing different imaging workflows and potential citizen science applications, and defining what exactly we mean when we say “source materials” (all important tasks), but we didn’t spend much time talking about formatting and making available all the already transcribed field books that are surely scattered on hard drives throughout natural history museums all over the world. While the various CLIR “Hidden Collections” grants have helped support the cataloging and preservation of paper archives, I suspect there are a huge number of already semi-digitized (e.g. transcribed into Word/text/WordPerfect docs, or even typewritten files) field notes out there without any long-term preservation plan (particularly in smaller museums). As fragile as paper archives are, digital archives can be even more delicate: given a dry room, paper will remain legible for centuries, whereas the information contained on a hard drive will degrade unless periodically migrated. Existing transcriptions are likely housed on older media, like floppy disks and zip drives — making their fragility even more acute.

Why not just re-transcribe the notebooks once they’re scanned? Because good transcriptions are massively time consuming to produce! These already transcribed field books thus represent dozens to hundreds of person-hours of work per book — work that is in danger of being lost if not properly curated. There are ways — in fact, there’s an entire field of study dedicated to articulating the ways — to conduct this curation, but it needs to be identified as a priority in order for that to happen. For now, we want to make sure we keep shedding a little light on this particular breed of “dark data,” and encourage everyone to keep thinking about how to make it a little less obscure. Do many folks out there have digitized products but aren’t quite sure about next steps? If you have any of those fragile digital copies out there, we want to hear about them!

2) Similarly, while we spent a fair bit of time talking about what it would take to get books newly transcribed, we didn’t spend much time talking about important details like how to best format, preserve and present these notebooks once typed, or how those various formats and presentations will support different kinds of use. It’s not enough to just type field books into a text file and throw them on a server somewhere. It’s not even enough to provide some metadata about the field notebooks. And it may not even be enough to develop a wiki-style markup syntax from scratch for every single project out there. We need to move toward a standardized way of “marking up” these transcriptions so that they’re more findable, machine-readable and text-mineable for future researchers, and to ensure they don’t end up glued down in a particular database system.

Projects like the Smithsonian’s Transcription Project, Ben Brumfield’s From The Page software (currently being used by the MVZ for field note transcription), and our own Henderson Field Book Project make use of wiki-style syntax to annotate items in digitized text. What happens next, though? How do we search across these resources? For the Henderson Project, our solution was ultimately to rip the records from the page (no pun intended, Ben) and place them into a format that ensures interoperability (e.g. Darwin Core Archives), while also linking back to the page where those records resided. Ideally, however, one could find a different solution, one that builds on existing markup standards.

At the workshop, we learned that Terry Catapano and Plazi have been doing just that — building on existing standards — by developing a much-needed (and not nearly talked about enough) taxonomic extension for both TEI (the Text Encoding Initiative) and JATS (the Journal Article Tag Suite). Using these markup standards can make taxonomic information vastly more findable. However… we’re not all using these standards in the same way, or necessarily understanding where there are overlaps, existing terms, etc. — defeating the purpose of a standard. Thus, there’s a need to develop what’s called an “application profile” of standards like TEI, JATS, or wiki markup for field notebooks and other primary source material. In other words: there are already plenty of ways and languages with which to mark up text, but we still need to continue the conversation beyond this workshop and develop best practices for these standards’ use, thereby best supporting future use of source materials.

3) Finally: Part of this development of best practices needs to include identifying potential use cases, or user communities, that these source materials might be of interest to.  At the workshop, while a range of use cases were articulated — using field books to reconstruct historical ecologies, to clarify specimens’ provenance, to re-find localities — all of these cases could be said to rely on a fairly “close” reading of the text, in which a researcher reads each book line by line and word by word, and might do her own extraction of relevant content.

Supporting this kind of reading is certainly important — but I argue we should also push to support more computational, or distant, kinds of reading: if we truly want to capitalize on the power of digitization, we need to aim to support text and data mining. “Text and data mining” includes some methods that the biodiversity informatics community is already familiar with: for instance, using natural language processing to identify and extract data like proper names, taxonomic names, dates, and geographic locations; publishing said data to the semantic web; and so on. However, text mining also includes some methods that the biodiversity community isn’t as familiar with but which may be helpful — methods like topic modeling, in which algorithms are used to model a document’s ‘aboutness’, and other machine learning methods. Though these methods have been used by information scientists, digital humanists and social scientists on their large collections of digital text, they haven’t been used by us (we think?) — partially because it’s only recently that we’ve had these large digital archives and libraries, and partially because these methods have been trapped in their disciplinary silos.
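To make the idea of topic modeling a bit more concrete, here is a minimal sketch (not something from the workshop, and not code we have actually run on field notes) of fitting a topic model to a folder of transcribed pages with scikit-learn; the directory name, topic count, and other parameters are illustrative assumptions.

```python
# A minimal topic-modeling sketch, assuming transcribed field-note pages live
# in a folder of plain-text files; all paths and parameters are illustrative.
from pathlib import Path

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

pages = [p.read_text(encoding="utf-8") for p in sorted(Path("transcriptions").glob("*.txt"))]

# Turn each page into a bag of words, ignoring stopwords and very rare terms.
vectorizer = CountVectorizer(stop_words="english", min_df=2)
counts = vectorizer.fit_transform(pages)

# Fit a 10-topic LDA model; each topic is a weighted list of words that tend
# to co-occur, a rough proxy for what a page is "about".
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(counts)

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-10:][::-1]]
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")
```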

I wish I could better explain these techniques, and better argue for their applicability to biodiversity archives, but I can’t quite yet — partly because, as a student, I’m still learning them myself, and partly because some of this is just new ground. But I do know this: supporting these kinds of computational approaches could greatly speed up the rate at which we’re able to extract useful scientific information from source materials, and would also likely attract an interdisciplinary crowd of information scientists, digital humanists and other researchers to the collections.

Clearly, there is much more to discuss!  Sounds like we need another workshop, no?


How is finding a consensus among citizen science transcriptions like aligning gene sequences AND textual analysis of medieval codices? Part 2

(cross-posted at SciStarter)

In our last post, we went through the mechanics of how to find consensus among a set of independently created transcriptions by citizen scientists — this involved a mash-up of bioinformatics tools for sequence alignment (repurposed for use with text strings) and natural language processing tools to find tokens and perform some word synonymizing. In the end, the informatics blender did indeed churn out a consensus — but this attempt at automation led us to realize that there’s more than one kind of consensus. In this post we want to explore that issue a bit more.

So, let’s return to our example text:

Some volunteers spelled out abbreviations (changing “SE” to “Southeast”) or corrected errors on the original label (changing “Biv” to “River”); but others did their best to transcribe each label verbatim – typos and all.

These differences in transcription style led us to ask — when we build “consensus,” what kind do we want? Do we want a verbatim transcription of each label (thus preserving a more accurate, historical record)?  Or do we want to take advantage of our volunteers’ clever human brains, and preserve the far more legible, more georeferenceable strings that they (and the text clean-up algorithms described in our last post) were able to produce? Which string is more ‘canonical’?

Others have asked these questions before us — in fact, after doing a bit of research (read: googling and reading wikipedia), we realized we were essentially reinventing the wheel that is textual criticism, “the branch of literary criticism that is concerned with the identification and removal of transcription errors in the texts of manuscripts” (thanks, wikipedia!). Remember, before there were printing presses there were scribes: individuals tasked with transcribing sometimes messy, sometimes error-ridden texts by hand — sometimes introducing new errors in the process. Scholars studying these older, hand-duplicated texts often must resolve discrepancies across different copies of a manuscript (or “witnesses”) in order to create either:

- a “critical edition” of the text, one which “most closely approximates the original”, or

- a “copy-text” edition, in which “the critic examines the base text and makes corrections (called emendations) in places where the base text appears wrong” (thanks again, wikipedia).

Granted, the distinction between a “critical edition” and a “copy-text edition” may be a little unwieldy when applied to something like a specimen label as opposed to a manuscript. And while biodiversity data standards developers have recognized the issue — Darwin Core, for example, has “verbatim” and “interpreted” fields (e.g. dwc:verbatimLatitude) — those existing terms don’t necessarily capture the complexity of multiple interpretations, done multiple times, by multiple people and algorithms, and then a further interpretation to compute some final “copy text”. Citizen science approaches place us right between existing standards-oriented thinking in biodiversity informatics and edition-oriented thinking in the humanities. This middle spot is a challenging but fascinating one — and another confirmation of the clear, and increasing, interdisciplinarity of fields like biodiversity informatics and the digital humanities.
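As a purely hypothetical illustration of that gap (this is not an existing schema, and the volunteer names are placeholders), a record that kept the verbatim string, the interpreted “copy-text,” and the individual transcriptions behind it might look something like the sketch below, reusing the Friona example from Part 1 of this series.

```python
# A hypothetical record structure (not an existing standard or our actual data
# model) showing verbatim and interpreted values alongside the individual
# transcriptions they were computed from. Field names borrow from Darwin Core
# where possible; "transcriptions" and "consensusMethod" are invented here.
label_record = {
    "dwc:verbatimLocality": "Friona, 10 mi.N.",           # as written on the label
    "dwc:locality": "10 miles North Friona",               # interpreted "copy-text"
    "transcriptions": [                                     # the individual "witnesses"
        {"volunteer": "anon-1", "text": "Friona, 10 mi.N."},
        {"volunteer": "anon-2", "text": "10 miles north of Friona"},
        {"volunteer": "anon-3", "text": "10 miles North Friona"},
    ],
    "consensusMethod": "MAFFT alignment + 50% majority-rule consensus",
}
print(label_record["dwc:locality"])
```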

In prior posts, we’ve talked about finding links between the sciences and humanities — what better example of cross-discipline-pollination than this?  Before, we mentioned we’re not the first to meditate on the meaning of “consensus” — we’re also not the first to repurpose tools originally designed for phylogenetic analysis for use with general text; linguists and others in the field of phylomemetics (h/t to Nic Weber for the linked paper) have been doing the same for years.  While the sciences and humanities may still have very different research questions and epistemologies, our informatics tools have much in common.  Being aware of, if not making use of, one another’s conceptual frameworks may be a first step to sharing informatics tools, and building towards new, interesting collaborations.

Finally, back to our question about what we mean by “consensus”: we can now see that our volunteers and algorithms are currently better suited to creating “copy-text” editions, or interpreted versions of the specimen labels — which makes sense, given the many levels of human and machine interpretation that each label goes through. Changes to the NfN transcription workflow would need to be made if museums want a “critical edition,” or verbatim version of each label as well. Whether this is necessary is up for debate, however — would the preserved image, on which transcriptions were based, be enough for museum curators’ and collection managers’ purposes? Could that be our most “canonical” representation of the label, to which we link later interpretations? More (interdisciplinary) work and discussion is clearly necessary — but we hope this first attempt to link a few disparate fields and methods will help open the door for future exchange of ideas and methods.

References, and links of potential interest:

If you’re interested in learning more about DH tools relevant to this kind of work, check out Juxta, an open source software package designed to support collation and comparison of different “witnesses” (or texts).

For more on phylomemetics:

Howe, C. J., & Windram, H. F. (2011). Phylomemetics–evolutionary analysis beyond the gene. PLoS Biology, 9(5), e1001069. doi:10.1371/journal.pbio.1001069


How is finding a consensus among citizen science transcriptions like aligning gene sequences AND textual analysis of medieval codices? Part 1

We’ve always been interested in citizen science approaches to the digitization of museum labels and ledgers, even before there were tools out there to do it. But now, projects such as Notes from Nature (NfN; notesfromnature.org) haven’t just built the tools — they’re giving us the opportunity to put those tools to the test, and in doing so, more deeply explore how museums can work with data generated not just by volunteers, but volunteers at a distance.

As NfN and other Zooniverse volunteers may already know, the specimen records in NfN aren’t transcribed just once — they’re actually transcribed at least four times each. These redundant transcriptions are collected to improve the accuracy of transcriptions overall, the idea being that the four-plus transcripts can be combined and compared against each other to create a final, canonical version of each specimen record. But how?

In this post, we want to present one of many potential approaches — some of which is also discussed in a related blog post over at NfN (http://blog.notesfromnature.org/2014/01/14/checking-notes-from-nature-data/). But here, we want to lead up to a discussion not just of how to create a canonical, consensus record, but also of what we mean when we say “canonical” in the first place.

But before we get to that, let’s look at some raw data and briefly review the “how”. Below are two raw output transcription records from Notes from Nature (the Calbug project, to be exact), with some minor edits (some columns have been removed based on relevance, etc.). The main columns to focus on are subject_id, which defines the “subject” (in this case, the image being transcribed), and, for the purposes of illustration, the locality string that was entered by the transcribers.

[Table: raw transcription output from Notes from Nature for the two example subjects]

Here are the actual images from the Essig Entomology Museum database so you can compare the transcripts to what the labels look like:

Label 1
Label 2

The first data curation phase involves some general normalizations: abbreviations and capitalizations are standardized with a simple script, and we perform some find-and-replacements to sort out blank spaces and other problem characters. The transcriptions and subject_ids are then formatted for use with something most biologists will be familiar with: sequence alignment software.
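A minimal sketch of that normalization step might look like the following; the real NfN/Calbug scripts may differ, and the abbreviation table is just a tiny illustrative sample.

```python
# A sketch of the normalization step: standardize case and common abbreviations,
# strip problem characters, and join words with underscores so the strings are
# easy to handle downstream. The abbreviation table is illustrative only.
import re

ABBREVIATIONS = {
    "mi.": "miles",
    "n.": "north",
    "se": "southeast",
    "prov.": "provincial",
}

def normalize(locality: str) -> str:
    text = locality.lower()
    text = re.sub(r"[^\w.\s]", " ", text)                  # drop stray punctuation
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return "_".join(words)

print(normalize("10 mi. N. Friona"))    # -> "10_miles_north_friona"
```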

Here are the FASTA (http://en.wikipedia.org/wiki/FASTA_format) formatted outputs for the two specimens above:

[Figure: the unaligned, FASTA-formatted transcripts for the two specimens]

Why FASTA format, and why sequence alignment software? Because the next step of the normalization process involves aligning each word of the locality string — much in the same way that sequence alignment involves the alignment of homologous characters. And just as sequences can have both mutations and “gaps” that need to be discovered through alignment, so too do transcripts. Aligned results (using MAFFT) are below:

[Figure: the MAFFT-aligned transcripts]

[Note from Andrea: The use of sequence alignment tools on plain text is not unheard of! Linguists and textual critics have been applying phylogenetic tools and methods to things other than genes and traits for a while now — more on this next post.]
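For the curious, here is a rough sketch of that FASTA-and-MAFFT step; it assumes MAFFT is installed and that its text mode (--text) accepts the plain-ASCII, underscore-joined strings from the normalization sketch above, and the record IDs (subject_id plus a _t1/_t2 suffix) are illustrative.

```python
# Write each normalized transcript as a FASTA record, then align them with
# MAFFT's text mode. MAFFT must be installed and on the PATH.
import subprocess

transcripts = {
    "516d7fa4ea3052046a000004_t1": "friona_10_miles_north",
    "516d7fa4ea3052046a000004_t2": "10_miles_north_of_friona",
    "516d7fa4ea3052046a000004_t3": "10_miles_north_friona",
}

with open("transcripts.fasta", "w") as handle:
    for record_id, text in transcripts.items():
        handle.write(f">{record_id}\n{text}\n")

with open("aligned.fasta", "w") as aligned:
    subprocess.run(["mafft", "--text", "transcripts.fasta"], stdout=aligned, check=True)
```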

Normalizing and aligning the outputs makes it much easier to see a growing consensus, but note that both these transcripts still have a few problems.  In the first case, only one transcriber transcribed the record verbatim from the label (e.g. Friona, 10 mi.N.) while everyone else transposed it.  And one transcriber also put an “of” in the transcription that isn’t there (e.g. 10 miles north of Friona).  To account for variants that only occur once across the pool of transcripts, another script is used to remove all positions from the alignment where there is only one letter and the rest are gaps.  So the very first position, for example, is all gaps except for one “F” and thus is culled.
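Here is a sketch of that culling step written against Biopython; the real script may be organized differently, but the logic is the same: drop every alignment column where only a single transcript has a character and the rest are gaps.

```python
# Remove "singleton" columns from the alignment: any position where only one
# sequence has a non-gap character gets culled before taking the consensus.
from Bio import AlignIO

alignment = AlignIO.read("aligned.fasta", "fasta")
n_cols = alignment.get_alignment_length()

keep = [
    col for col in range(n_cols)
    if sum(record.seq[col] != "-" for record in alignment) > 1
]

with open("culled.fasta", "w") as out:
    for record in alignment:
        culled_seq = "".join(record.seq[col] for col in keep)
        out.write(f">{record.id}\n{culled_seq}\n")
```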

The final output alignments for both records look like this:

[Figure: the final, culled alignments for both records]

The penultimate step is to take a consensus of these aligned transcripts using this: http://biopython.org/DIST/docs/api/Bio.Align.AlignInfo.SummaryInfo-class.html#dumb_consensus. The output is basically all the places where the alignment agrees at least 50% of the time minus all gaps. The final step is to reconvert underscores to spaces for readability, and print a final consensus transcript — in this case:

| # | subject_id | consensus |
|---|---|---|
| 4 | 516d7fa4ea3052046a000004 | 10 miles North Friona |
| 6 | 516d7fa6ea3052046a00000a | Rushing Xiver Provincial Park 15 miles Southeast |

Note: The “X” in the second transcript marks a case where there was a “tie” — 50% of the records said “B” and 50% said “R”, so the script places an “X” in those cases.
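In code, the consensus step might look like the sketch below, using the Biopython method linked above; the 50% threshold and the “X” for ties mirror what is described in this post, though the actual script may differ in its details.

```python
# Take a majority-rule consensus of the culled alignment, using "X" for
# ambiguous (tied) positions, then strip gaps and restore spaces for readability.
from Bio import AlignIO
from Bio.Align import AlignInfo

alignment = AlignIO.read("culled.fasta", "fasta")
summary = AlignInfo.SummaryInfo(alignment)

consensus = summary.dumb_consensus(threshold=0.5, ambiguous="X")
readable = str(consensus).replace("-", "").replace("_", " ")
print(readable)    # expected to resemble "10 miles North Friona" above
```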

So how’d the consensus approach work?  In one sense pretty darn well. The results are readable and pretty reflective of the label contents — maybe an improvement, even.  In another sense, they don’t resemble the verbatim label contents very much at all (click on those image links above to view the verbatim labels).

So, this is where we return to our newfound struggle to define canonical: do we want labels that are verbatim transcriptions of the specimen labels? Or do we want easily “georeferenceable” contents that are accurate, normalized, and easily interpretable to future researchers? Or both?

The answer, of course, is both, and more — next post, we’ll discuss why we need both verbatim AND interpreted, easily georeferenceable transcriptions, and how studies of medieval manuscripts and natural history museum curation ain’t so different after all…


Data Diversity of the Week: Sex

SYTYCD would like to welcome guest blogger, John Wieczorek (also known as Tuco)

Data aggregators such as VertNet and GBIF work with data publishers (such as museums and research stations) to share their data in common formats.  We’ve been talking about this for a long time, and the great news is that data publication has become much easier with the advent of ratified standards such as Darwin Core (DwC), and tools such as GBIF’s Integrated Publishing Toolkit (IPT).

For some data fields, Darwin Core recommends the use of controlled vocabularies, but it doesn’t always recommend which vocabulary to use, and it certainly does not prohibit data content that does not conform to a controlled vocabulary. This is a double-edged sword. On the plus side, there is no unnecessary obstacle to data sharing – data publishers don’t have to convert their data to conform with a standardized list of values. On the downside… well… that brings us to our new feature installment on So You Think You Can Digitize: This Week in Data Diversity.

In the coming weeks, we’ll be serenading you with interesting examples of the diversity of data published to VertNet, starting with everyone’s favorite Darwin Core term: “sex”.  This term is intended to express whether an organism is “male” or “female”.

So how many different ways can one say “male” and “female”?

Well, from just 48 collections at 20 institutions, covering about 2.7 million records that have been processed through the data cleaning workflows of VertNet so far, there have been… 189 distinct values in the sex field that mean “male”! Don’t worry, there are also 184 ways to say “female”! Now that we’ve established some parity between the genders, you are free to worry, because there remain 331 variations that are either ambiguous or that actually mean either “undetermined” or “unknowable”. REALLY? Yes, really. None of this is counting case-sensitivity, by the way.

Luckily, standard terms do appear in the real-world data:  “female”, “male”, “gynandromorph”, “hermaphrodite”. That’s great, but … then, there are the abbreviations: “F”, “M”, “U”, “F.”, “M.”, “U.”, “UNK”, etc. There are also language variations, such as “hembra”, “macho”, and “H”  … and the ones where the data publishers aren’t really sure – “M?”, “F?”, “F ? M”, “M [almost surely F]”. We mentioned these are all real examples, didn’t we?

Then there are the variations that also include information from other Darwin Core fields such as lifeStage: “subadult male”, “f (adult)”, etc.  There are variations where the evidence is included: “male by plumage”, “macho testes 15×9” (yes, it is a real example!).  But we aren’t done yet. There are variations that try to capture what was written on a label: “M* with [?] written in pencil af”, “M [perhaps a F]”, “M[illeg]”. Up to this point we can more or less figure out what these were about, but that’s not always the case. There are variations expressing, perhaps, some measure of over-exuberance, such as “M +” and “F ++”.  It gets even crazier: “Tadpoles”, “hg”, “Yankee Pond”, “gonads not found”, “F4”, “Fall Male”, “U.S.”, “O~”, “M(i.e. upside down ‘F’)”, “M mate of # 357”, “M=[F] C.H.R.”, “(?) six”, “[F pract. certainly]”, “Apr”, “<”, “263.5”.  And what list would not be complete without …”downy”?

VertNet is working to reduce this… diversity… by managing folksonomies (real-world terms for a concept) drawn from real verbatim content and interpreting them as well as possible against controlled vocabularies. These lookups are used in data migration processes that take original source data structures and contents and massage them into a Darwin Core format ready for publishing. The idea is to reduce this diversity to something manageable on the scale of VertNet and to facilitate data searching.
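As a toy illustration of what one of those lookups might look like (the real VertNet vocabularies contain hundreds of variants and live in the Github repository linked later in this post), consider the following sketch; the mappings shown are a tiny, illustrative subset.

```python
# A toy folksonomy-to-controlled-vocabulary lookup for the "sex" field.
# The handful of mappings below are illustrative only.
SEX_LOOKUP = {
    "m": "male", "m.": "male", "macho": "male", "male by plumage": "male",
    "f": "female", "f.": "female", "hembra": "female", "h": "female",
    "f (adult)": "female",
    "u": "undetermined", "unk": "undetermined", "gonads not found": "undetermined",
}

def interpret_sex(verbatim: str) -> str:
    """Map a verbatim value to a controlled term, defaulting to 'undetermined'."""
    return SEX_LOOKUP.get(verbatim.strip().lower(), "undetermined")

print(interpret_sex("hembra"))    # -> "female"
print(interpret_sex("M?"))        # -> "undetermined" (until a curator adds a mapping)
print(interpret_sex("downy"))     # -> "undetermined"
```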

Below is a list of fields that Darwin Core recommends to be vocabulary-controlled and that VertNet is trying to track and manage. The job is non-trivial. The vocabularies are very difficult to curate because of the overabundance of content diversity, as shown in the Table below:

| DwC Field | Distinctive variants in the field | # of variants resolved before IPT publishing |
|---|---|---|
| basisOfRecord | 5 | 5 |
| country | 4989 | 4989 |
| disposition | 3307 | 17 |
| establishmentMeans | 12 | 12 |
| genus | 15884 | 15884 |
| geodeticDatum | 251 | 251 |
| georeferenceProtocol | 177 | 177 |
| identificationQualifier | 140 | 140 |
| language | 6 | 6 |
| lifeStage | 13087 | 429 |
| preparations | 10745 | 2052 |
| reproductiveCondition | 13482 | 361 |
| sex | 710 | 710 |
| taxonRank | 8 | 8 |
| type | 5 | 5 |
| typeStatus | 1861 | 481 |

We will delve into the mysteries of some of these Darwin Core fields (certainly “preparations” and “country”) in later posts, and we’ll talk more about potential solutions to the problem of data diversity. In the meantime, you can whet your appetite by perusing the lists of distinct values for many of these fields in our Darwin Core Vocabularies Github repository. We’ll keep this up to date with incoming terms as we encounter them. By now we hope we have convinced you that it is a huge problem.

Next time you want to find some male specimens of a particular bird, just remember to include “downy” as a variant in the “sex” field, unless of course you are searching on VertNet, where this problem has been taken care of for you. Don’t worry, you’ll still be able to find downy specimens on VertNet with a full-text search.

___________________________________________________________________

authors:  John Wieczorek, Rob Guralnick, David Bloom, Andrea Thomer and Paula Zermoglio


This week in digitization: The good, the buggy, and the curious

This will be old news to many, but regardless: two big projects related to specimen digitization and biodiversity informatics launched in the past couple weeks.   Quick impressions on both below, focusing on the good, the buggy and a few items of curiosity.  Both projects are great, but — how will they fit into the broader landscape of existing resources, and into what niches?

1) Notes From Nature — a new Zooniverse project for the transcription of natural history collection ledgers.  This has been a long time in the making (more details here) and as of this writing, the two available collections (Herbarium specimens from SERNEC and insects from CALBUG) are already 26% and 21% transcribed, respectively.

The Good: As always, clean and intuitive interfaces from the Zooniverse team make transcription fast and easy. Data entry screens are customized to each type of collection (e.g. plant labels often contain more detailed locality descriptions than insect labels, whereas insect labels often contain data about the host organism the insect was found on). Awesomely, all the code is available on Github (https://github.com/zooniverse/notesFromNature) in case other museums want to set up their own transcription engines locally. There is also an intriguing teaser buried at the very bottom of the Notes from Nature “About” page: “Interested in publishing your collection? Contact us.”

The Buggy: Maaaan, I’ve transcribed around 40 labels and my total isn’t showing up under my user profile. This bothers me more than I care to admit, though it’s primarily out of worry that my transcriptions aren’t being saved.

The Curious:  It would be great to learn more about how these data get back to the collections databases, and how exactly that handoff happens.  What do the transcribed files look like?  How is accuracy checked?  Do the museums have plans to make these records publicly available, or harvestable by aggregators like GBIF?

2) The patriotically named Biodiversity Information Serving Our Nation (AKA BISON) biodiversity data portal out of USGS — I know less about this project beyond what I’ve learned at various conference talks; however, I’ve heard it referred to as the “federal version of iDigBio.”

The Good: On first look, really nice integration of specimen occurrence data with USGS map layers, and as Hilmar Lapp pointed out, there’s an API, which is great.

The Buggy: There are no identifiers on these specimens — not even their local catalog numbers. Per Stinger Guala in the G+ thread linked above, the data is there — it’s just not yet visible (though it will be soon). Perhaps there are reasons (a need for better formatting? a need for cleaner data? a need for more server space?) that they’re not making this data visible yet — but it struck me as a pretty glaring omission. While I realize that many researchers don’t spend a lot of time looking at catalog numbers, I imagine that they’d be absolutely critical if one were integrating BISON data with data from other sources (say, another portal like GBIF). Also, how could any of these records ever be linked back to the source data or any other data out there? Provenance = important, no?

The Curious: BISON is apparently the US node of GBIF — which I had assumed meant they would be providing GBIF with US data — however, BISON appears to invert that model and is instead a US-centric mirror of GBIF. I hope that BISON becomes a platform through which US, federally owned and managed biocollections can be made publicly discoverable, and I would be interested to hear from BISON reps whether there are any plans in place to do this.


What gets linked to globally unique identifiers (GUIDs) in natural history collections digitization?

Co-written with David Bloom.

For as long as explorers have been collecting specimens and bringing them back to museums, collection managers and museum staff have been assigning unique (well, more or less unique, with a margin for human error) numbers to them. Collections management isn’t just about conservation of physical objects – it’s also about care and support of an information retrieval system.  And while new technologies come and go, the maintenance of this system of locally unique identifiers remains at the core of collections management.

Recently, that aforementioned information retrieval system has been growing increasingly dispersed, and the ways in which information about specimens is disseminated, accessed, and manipulated have been changing rapidly.  The ability to place digital representations of specimens and their data en masse onto the World Wide Web is fundamentally changing collections-based scholarship.  Consequently, locating the right giant clam specimen in the University of Colorado Museum of Natural History (CU Museum) invertebrate collection is a very different task than locating a digital image of the same specimen on the Internet.  Tracking and connecting all of the digital representations (e.g., images, metadata records digitized from labels, tissue samples derived from the specimen, sound and video files) derived from that single specimen is even more difficult.  While local identifiers suffice to connect data and specimens within the CU Museum, global identifiers are needed to maintain these connections when content is released into the wilds of the Internet.  The challenge before us is how to best set up a system of globally unique identifiers (GUIDs) that work at Internet-scale.

iDigBio, an NSF-funded project tasked with coordinating the collections community’s digitization efforts, just released a GUID guide for data providers.  The document clarified the importance of GUIDs and recommended that iDigBio data providers adopt universally unique identifiers (UUIDs) as GUIDs.  It went further, however.  In the document, a call-out box (on page 3) states that “It has been agreed by the iDigBio community that the identifier represents the digital record (database record) of the specimen not the specimen itself. Unlike the barcode that would be on the physical specimen, for instance, the GUID uniquely represents the digital record only.”

In response, Rod Page wrote an (as always, entertaining and illuminating) iPhylo blog post, “iDigBio: You are putting identifiers on the wrong thing”, in which he makes a strong case that a GUID must refer to the physical object. “Surely,” writes Rod, “the key to integrating specimens with other biodiversity data is to have globally unique identifiers for the specimens.”

The disagreement above underscores our community’s need to be very clear about what GUIDs reference and how they resolve[1].  This seems simple, but it has been one of the most contentious issues within our community.  So who is right – should GUIDs point to digital records, or physical objects?

iDigBio has a clear mission to support the digitization of natural history specimens, and thus, deals exclusively with digital objects, which, as any database manager knows, need to have identifiers.  So, it does make sense that they would be concerned with identifiers for digital objects.  Those identifiers, however, absolutely must be as closely associated with the physical specimens as possible.  In particular, they need to be assigned and linked to the local identifiers stored in local databases managed by on-site collections staff.  If iDigBio is saying that only digitized objects that get passed to iDigBio from their data providers need GUIDs, not the original digital catalogs, we can’t agree with that.

On the other side, we’re not sure we agree with Rod either. If Rod is suggesting that GUIDs replace, or serve as additions to, the catalog numbers literally, physically attached to specimens or jars, we think that is simply impractical. What is the incentive for putting a GUID on every single specimen in a collection, especially wet collections, from the point of view of a collections manager? Does it help with loans? Who is going to go into a collection and assign yet another number to all the objects in that collection (and how many institutions have the resources to make that happen)?

What we think is feasible and useful (and likely what Rod meant) is to assign GUIDs to digital specimen records stored in local museum databases and linked to the local identifiers. When these data get published online, the associated GUID can be pushed downstream as well. Assigning GUIDs to the local, authoritative, electronic specimen records as they are digitized should be a mandatory step in the digitization process — a process that iDigBio is uniquely poised to support. This is the only way that GUIDs will be consistently propagated downstream to other data aggregators like VertNet, GBIF, and whatever else comes along fifty years from now (and fifty years after funding runs out on some existing projects). Again, it’s important to remember that natural history collections management has always entailed the management of identifiers; the adoption of global identifiers will only increase the need for local identifier management.
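As a rough sketch of what this could look like at the database level (this is not iDigBio’s or VertNet’s actual tooling, and the record values are invented), minting a UUID and storing it alongside the local catalog number takes only a few lines:

```python
# Mint a UUID-style GUID for a digital specimen record and keep it linked to
# the local catalog number. The record below is invented; dwc:occurrenceID is
# one common place to carry such an identifier, but local practice varies.
import uuid

def mint_guid(record: dict) -> dict:
    guided = dict(record)
    guided.setdefault("occurrenceID", str(uuid.uuid4()))   # mint once, never re-mint
    return guided

local_record = {
    "institutionCode": "CU Museum",
    "catalogNumber": "12345",
    "scientificName": "Tridacna gigas",   # a giant clam, chosen for illustration
}
print(mint_guid(local_record))
```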

Now, we can imagine one case in which GUIDs could conceivably serve as the originating catalog number: during field collection. As biologists generate more and more digital content in the field (such as images and DNA samples), minting GUIDs at the moment of collection (or during the review of daily collection events) and assigning them to samples and specimens directly could be quite useful. As these physical objects make their way into collections, we anticipate that collections folks will still assign local identifiers. Both have their uses, and are made stronger and more useful when linked.

In summary: we are less worried about what exactly a GUID will point to (digital record vs. physical object), as long as the content referenced is valuable to the biodiversity collections and science community. We are more worried that we’re not explicitly identifying what we’re assigning identifiers to, and not discussing who will manage and integrate these identifiers, and how. Our focus should be on developing trusted and well-understood GUID services that provide content resolution for the long (50-100 year) term.


[1] To resolve a GUID, you dump it into a resolution service maintained by the naming authority that originally created that GUID, such as CrossRef or DataCite. That service then returns links and other information that point you to other content attached to the same identifier. A great example of a resolvable GUID is a Digital Object Identifier (DOI). In the case of journal articles, resolution of a DOI will usually direct you to the paper itself via hyperlink, but it could also return a web page with information about the resource, a sound file, or any other representation of the object associated with the GUID.
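For instance, here is a small sketch of both behaviors, using the DOI of a PLoS Biology article as an example: following the DOI redirect lands on the article, while asking the naming authority for metadata via content negotiation returns a structured description of the same object. This assumes the Python requests library and network access.

```python
# Resolve a DOI two ways: follow the redirect to the article's landing page,
# and ask the naming authority for machine-readable metadata instead.
import requests

doi = "10.1371/journal.pbio.1001069"   # a PLoS Biology article DOI, used as an example

landing = requests.get(f"https://doi.org/{doi}", timeout=30)
print(landing.url)                      # the article's landing page

metadata = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
print(metadata.json()["title"])         # bibliographic metadata for the same GUID
```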


Post-Henderson Post

So You Think You Can Digitize had a bit of an unplanned hiatus; turns out that maintaining a blog while its authors take something like 15+ trips, attend to work/school responsibilities, and write gobs of papers is a bit trickier than anticipated. One of those papers was born directly out of this blog: a paper submitted to Zookeys about our work on the Henderson Field Notes Project. Ironically, we spent so much time on the Henderson manuscript that we were forced to spend less time uploading Henderson’s notebooks for additional annotation. That should be remedied soon! With help from some still-anonymous friends, we’ve made a lot of progress, including the annotation and extraction of over a thousand species observations made by Henderson between 1905 and 1909, which we then packaged as a Darwin Core Archive for publication along with the paper — but there are still 9 more notebooks to go.

We’ve written a lot about field notebooks, especially older notebooks penned by pioneers describing the Old West, but we do worry that this reverence for the past has the unintended consequence of diminishing the importance of field and lab data recorded in the exact same way in the present day. Digitizing old things, showing off the labels written in India ink and cursive: the romance of this might reinforce unfortunate stereotypes about the “dustiness” and antiquity of museum collections. In the here and now, field notes are still handwritten, and specimens are still collected and catalogued, often at higher rates than ever before. What do we do with the analog present?

This very issue came up in a post on Andy Farke’s blog, The Open Source Paleontologist.  Dr. Farke is a paleontologist who recently published a paper describing a new species not found in the field, but rather, in Yale’s paleontology collection.  Kudos to him, because he decided that publishing his findings in PLoS ONE was not enough; he also published the lab notes on which his paper was based.  In his words,

“There isn’t really anything earthshaking in there… but in any case now other folks can use them. The sketches of real bone vs. reconstruction should be particularly useful.”

Earthshaking or no, we agree that there’s a huge need to openly archive this sort of documentation, both to support the reproducibility and replicability of scientific results, and to better describe the soon-to-be-historical use of the specimens in question.

Because repositories like Dryad aren’t intended to accept field or lab notes, Dr. Farke turned to Figshare as a place to deposit them. While Figshare is great for its ease of use and flexibility in licensing options, it’s nevertheless not ideal for text; PDFs are not easily navigable, nor is there any support for transcription and annotation of notes. This made us wonder: could Wikisource be a place to deposit these notes?

After Gaurav conferred with the Wiki-community, we found that Wikisource could indeed be used to provision more recent field or lab notes in addition to historical documents, provided they meet certain criteria (and the Wikipedians don’t eventually find this to be a violation of existing policies).  Dr. Farke’s notes on the Torosaurus are now available here, and in need of transcription too.

Why put lab notes on Wikisource?  Why not just leave them on Figshare?  Well:

  • Just as we were able to annotate the Henderson field notes for taxa, it’s easy to imagine notes like Dr. Farke’s being annotated with specimen catalog numbers, and even linked to other records describing the specimen in question.
  • Lots of Copies Keeps Stuff Safe — If either of these sites goes belly up, then a copy of the notes would still be available, with the same CC-BY license that Dr. Farke requires.
  • Publishing notebooks to a platform like Wikisource bridges gaps between formal and informal publication, not to mention institutional archiving and self-archiving (which more often than not is simply left-bottom-desk-drawer-archiving).  Yes, though anyone can edit or post anything to the various Wikimedia sites, there are nevertheless quality and notability requirements that must be met for an article to be considered Wiki-worthy; e.g. Dr. Farke’s notes qualify because he (or his papers) meets that notability requirement, but generally speaking, Wikisource is not a place to “just put notes.”
  • Something worth noting and maybe exploring further: Deposition of notebooks post-publication is not quite the same thing as maintaining an Open Notebook, although clearly related.  Wikisource/Wikimedia aren’t intended to be means of making science transparent, and may balk at that level of repurposing of the platform — only time (and continued experimentation…) will tell.

We are curious to see what others think of the idea of using Wikisource as a repository not just for historical notes, but more recent notes as well (and we also wonder if there are any eager paleo people out there looking to help transcribe Dr. Farke’s notes).

Post-Henderson, and post-Wikisource: we do want to turn our attention back to digitization, natural history collections, and what is going on right now. A lot has changed since we started this blog back in March 2011; a number of projects that were merely in the planning stages have not only been funded, but have actually started digitizing collections. Some of that work putting project plans into practice has been happening right on Rob’s doorstep. A few weeks ago, the Herbarium at the University of Colorado had a visit from the New York Botanical Gardens’ traveling digitization set-up gurus Melissa Tulig and Kim Watson, who were here to set up an imaging station for use in the Tri-Trophic Thematic Collections Network — we hope to talk about that in the next post.

And hey — we’ll also be at SPNHC again this year, presenting on the Henderson project in the Archives and Special Collections session, so if you’re in New Haven this June, and wanna talk digitization, Wikisource, or anything else, do say hello!
