Co-written with David Bloom.
For as long as explorers have been collecting specimens and bringing them back to museums, collection managers and museum staff have been assigning unique (well, more or less unique, with a margin for human error) numbers to them. Collections management isn’t just about conservation of physical objects – it’s also about care and support of an information retrieval system. And while new technologies come and go, the maintenance of this system of locally unique identifiers remains at the core of collections management.
Recently, that aforementioned information retrieval system has been growing increasingly dispersed, and the ways in which information about specimens is disseminated, accessed, and manipulated have been changing rapidly. The ability to place digital representations of specimens and their data en masse onto the World Wide Web is fundamentally changing collections-based scholarship. Consequently, locating the right giant clam specimen in the University of Colorado Museum of Natural History (CU Museum) invertebrate collection is a very different task than locating a digital image of the same specimen on the Internet. Tracking and connecting all of the digital representations (e.g., images, metadata records digitized from labels, tissue samples derived from the specimen, sound and video files) derived from that single specimen is even more difficult. While local identifiers suffice to connect data and specimens within the CU Museum, global identifiers are needed to maintain these connections when content is released into the wilds of the Internet. The challenge before us is how best to set up a system of globally unique identifiers (GUIDs) that works at Internet scale.
iDigBio, an NSF-funded project tasked with coordinating the collections community’s digitization efforts, just released a GUID guide for data providers. The document clarified the importance of GUIDs and recommended that iDigBio data providers adopt universally unique identifiers (UUIDs) as GUIDs. It went further, however. In the document, a call-out box (on page 3) states that “It has been agreed by the iDigBio community that the identifier represents the digital record (database record) of the specimen not the specimen itself. Unlike the barcode that would be on the physical specimen, for instance, the GUID uniquely represents the digital record only.”
In response, Rod Page wrote an (as always, entertaining and illuminating) iPhylo blog post, “iDigBio: You are putting identifiers on the wrong thing,” in which he makes a strong case that a GUID must refer to the physical object. “Surely,” writes Rod, “the key to integrating specimens with other biodiversity data is to have globally unique identifiers for the specimens.”
The disagreement above underscores our community’s need to be very clear about what GUIDs reference and how they resolve. This seems simple, but it has been one of the most contentious issues within our community. So who is right – should GUIDs point to digital records, or physical objects?
iDigBio has a clear mission to support the digitization of natural history specimens, and thus, deals exclusively with digital objects, which, as any database manager knows, need to have identifiers. So, it does make sense that they would be concerned with identifiers for digital objects. Those identifiers, however, absolutely must be as closely associated with the physical specimens as possible. In particular, they need to be assigned and linked to the local identifiers stored in local databases managed by on-site collections staff. If iDigBio is saying that only digitized objects that get passed to iDigBio from their data providers need GUIDs, not the original digital catalogs, we can’t agree with that.
On the other side, we’re not sure we agree with Rod either. If Rod is suggesting that GUIDs replace, or be added to, the catalog numbers that are physically attached to specimens or jars, we think that is simply impractical. From the point of view of a Collections Manager, what is the incentive for putting a GUID on every single specimen in a collection, especially a wet collection? Does it help with loans? Who is going to go into a collection and assign yet another number to all the objects in that collection (and how many institutions have the resources to make that happen)?
What we think is feasible and useful (and likely what Rod meant) is to assign GUIDs to digital specimen records stored in local museum databases and linked to the local identifiers. When these data get published online, the associated GUID can be pushed downstream as well. Assigning GUIDs to the local, authoritative, electronic specimen records as they are digitized should be a mandatory step in the digitization process, and it is a process that iDigBio is uniquely poised to support. This is the only way that GUIDs will be consistently propagated downstream to other data aggregators like VertNet, GBIF, and whatever else comes along fifty years from now (and fifty years after funding runs out on some existing projects). Again, it’s important to remember that natural history collections management has always entailed the management of identifiers; the adoption of global identifiers will only increase the need for local identifier management.
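To make the idea concrete, here is a minimal sketch in Python of what "assign a GUID to the local record, then push it downstream" might look like. It is only an illustration under stated assumptions: the `SpecimenRecord` class, the `to_published` helper, the catalog number, and the institution code are all hypothetical, and we borrow Darwin Core term names (`occurrenceID`, `catalogNumber`, and so on) for the published fields simply because they are familiar, not because any aggregator requires exactly this shape.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class SpecimenRecord:
    """A local, authoritative database record (hypothetical structure)."""
    catalog_number: str   # local identifier assigned by collections staff
    scientific_name: str
    # The GUID is minted once, when the record is created, and stored
    # alongside the local identifier rather than replacing it.
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))


def to_published(record: SpecimenRecord, institution_code: str) -> dict:
    """Flatten a local record for an aggregator, carrying the GUID downstream."""
    return {
        "occurrenceID": record.guid,
        "institutionCode": institution_code,
        "catalogNumber": record.catalog_number,
        "scientificName": record.scientific_name,
    }


# Hypothetical example values, for illustration only.
clam = SpecimenRecord(catalog_number="12345", scientific_name="Tridacna gigas")
published = to_published(clam, institution_code="UCM")
```

The point of the design is that the GUID and the catalog number travel together: the local number still does its job inside the collection, and the GUID is what stays stable once the record leaves the building.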
Now, we can imagine one case in which GUIDs could conceivably serve as the originating catalog number: during field collection. As biologists generate more and more digital content in the field (such as images and DNA collected in the field), minting GUIDs at the moment of collection (or during the review of daily collection events) and assigning them to samples and specimens directly could be quite useful. As these physical objects make their way into collections, we anticipate that collections folks will still assign local identifiers. Both have their uses and are made stronger and more useful when linked.
In summary: we are less worried about what exactly a GUID will point to (digital record vs. physical object), as long as the content referenced is valuable to the biodiversity collections and science community. We are more worried that we’re not explicitly identifying what we’re assigning identifiers to, and not discussing who will manage and integrate these identifiers, and how. Our focus should be on developing trusted and well-understood GUID services that provide content resolution for the long term (50-100 years).
To resolve a GUID, you dump it into a resolution service maintained by the naming authority that originally created that GUID, such as CrossRef or DataCite. That service then returns to you links and other information that point you to other content attached to the same identifier. A great example of a resolvable GUID is a Digital Object Identifier (DOI). In the case of journal articles, resolution of a DOI will usually direct you to the paper itself via hyperlink, but it could also lead to a web page with information about the resource, a sound file, or any other representation of the object associated with the GUID.
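For readers who think better in code, the resolution step can be sketched roughly as follows. This is a toy, in-memory stand-in for a real resolution service, not how CrossRef or DataCite actually implement theirs; the `ToyResolver` class, its methods, the example GUID, and the example.org URLs are all hypothetical. The one piece drawn from the real world is that a DOI can be turned into a resolvable link by appending it to `https://doi.org/`.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Build the link that asks the doi.org resolver to redirect to the object."""
    return DOI_RESOLVER + quote(doi.strip(), safe="/")


class ToyResolver:
    """In-memory stand-in for a resolution service (illustration only)."""

    def __init__(self):
        self._registry = {}

    def register(self, guid: str, kind: str, url: str) -> None:
        # One GUID can have many representations attached to it.
        self._registry.setdefault(guid, []).append({"kind": kind, "url": url})

    def resolve(self, guid: str) -> list:
        # Raises KeyError if the GUID was never registered.
        return list(self._registry[guid])


# Hypothetical identifier and locations, for illustration only.
resolver = ToyResolver()
guid = "3f2c9a10-0000-4000-8000-000000000001"
resolver.register(guid, "image", "https://example.org/images/clam.jpg")
resolver.register(guid, "record", "https://example.org/records/12345")

links = resolver.resolve(guid)               # both representations of the specimen
article = doi_to_url("10.1234/example")      # hypothetical DOI
```

Note what the sketch leaves out, because it is the hard part: someone has to keep that registry alive, and its links current, for decades.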