What gets linked to global unique identifiers (GUIDs) in natural history collections digitization?

Co-written with David Bloom.

For as long as explorers have been collecting specimens and bringing them back to museums, collection managers and museum staff have been assigning unique (well, more or less unique, with a margin for human error) numbers to them. Collections management isn’t just about conservation of physical objects – it’s also about care and support of an information retrieval system.  And while new technologies come and go, the maintenance of this system of locally unique identifiers remains at the core of collections management.

Recently, that aforementioned information retrieval system has been growing increasingly dispersed, and the ways in which information about specimens is disseminated, accessed, and manipulated have been changing rapidly.  The ability to place digital representations of specimens and their data en masse onto the World Wide Web is fundamentally changing collections-based scholarship.  Consequently, locating the right giant clam specimen in the University of Colorado Museum of Natural History (CU Museum) invertebrate collection is a very different task than locating a digital image of the same specimen on the Internet.  Tracking and connecting all of the digital representations (e.g., images, metadata records digitized from labels, tissue samples derived from the specimen, sound and video files) derived from that single specimen is even more difficult.  While local identifiers suffice to connect data and specimens within the CU Museum, global identifiers are needed to maintain these connections when content is released into the wilds of the Internet.  The challenge before us is how to best set up a system of globally unique identifiers (GUIDs) that work at Internet-scale.

iDigBio, an NSF-funded project tasked with coordinating the collections community’s digitization efforts, just released a GUID guide for data providers.  The document clarified the importance of GUIDs and recommended that iDigBio data providers adopt universally unique identifiers (UUIDs) as GUIDs.  It went further, however.  In the document, a call-out box (on page 3) states that “It has been agreed by the iDigBio community that the identifier represents the digital record (database record) of the specimen not the specimen itself. Unlike the barcode that would be on the physical specimen, for instance, the GUID uniquely represents the digital record only.”

In response, Rod Page wrote an (as always, entertaining and illuminating) iPhylo blog post “iDigBio: You are putting identifiers on the wrong thing” in which he makes a strong case that a GUID must refer to the physical object.  “Surely,” writes Rod, “the key to integrating specimens with other biodiversity data is to have globally unique identifiers for the specimens.”

The disagreement above underscores our community’s need to be very clear about what GUIDs reference and how they resolve[1].  This seems simple, but it has been one of the most contentious issues within our community.  So who is right – should GUIDs point to digital records, or physical objects?

iDigBio has a clear mission to support the digitization of natural history specimens, and thus, deals exclusively with digital objects, which, as any database manager knows, need to have identifiers.  So, it does make sense that they would be concerned with identifiers for digital objects.  Those identifiers, however, absolutely must be as closely associated with the physical specimens as possible.  In particular, they need to be assigned and linked to the local identifiers stored in local databases managed by on-site collections staff.  If iDigBio is saying that only digitized objects that get passed to iDigBio from their data providers need GUIDs, not the original digital catalogs, we can’t agree with that.

On the other side, we’re not sure we agree with Rod either.  If Rod is suggesting that GUIDs replace, or serve as additions to the catalog numbers literally, physically attached to specimens or jars, we think that is simply impractical.  What is the incentive for putting a GUID on every single specimen in a collection, especially wet collections, from the point of view of a Collections Manager?  Does it help with loans?  Who is going to go into a collection and assign yet another number to all the objects in that collection (and how many institutions have the resources to make that happen)?

What we think is feasible and useful (and likely what Rod meant) is to assign GUIDs to digital specimen records stored in local museum databases and linked to the local identifiers.  When these data get published online, the associated GUID can be pushed downstream as well.  Assigning GUIDs to the local, authoritative, electronic specimen records as they are digitized should be a mandatory step in the digitization process, and it is a step that iDigBio is uniquely poised to support.  This is the only way that GUIDs will be consistently propagated downstream to other data aggregators like VertNet, GBIF, and whatever else comes along fifty years from now (and fifty years after funding runs out on some existing projects).  Again, natural history collections management has always entailed the management of identifiers; the adoption of global identifiers will only increase the need for local identifier management.
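
A minimal sketch of that step, in Python, assuming the local database can be read as a list of records: each record keeps its local catalog number and gains a UUID that is minted once and never overwritten. The field names (`institution`, `collection`, `catalog_number`, `guid`) and the sample values are hypothetical, not taken from any real collection database.

```python
import uuid


def assign_guids(records):
    """Mint a UUID for every record that lacks one; existing GUIDs are left alone."""
    for record in records:
        if not record.get("guid"):
            record["guid"] = str(uuid.uuid4())
    return records


# Hypothetical rows from a local, authoritative specimen database.
local_records = [
    {"institution": "CU Museum", "collection": "Invertebrates", "catalog_number": "0001"},
    {"institution": "CU Museum", "collection": "Invertebrates", "catalog_number": "0002"},
]

assign_guids(local_records)

# Index from GUID back to the local identifier, so that a record published
# downstream can always be traced to the catalog entry it came from.
guid_index = {
    r["guid"]: (r["institution"], r["collection"], r["catalog_number"])
    for r in local_records
}
```

Running `assign_guids` a second time changes nothing, which is the property that matters: the GUID travels downstream with the record, and the local catalog number stays where collections staff expect it.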

Now, we can imagine one case in which GUIDs could conceivably serve as the originating catalog number: during field collection.  As biologists generate more and more digital content in the field (such as images and DNA collected in the field), minting GUIDs at the moment of collection (or during the review of daily collection events) and assigning them to samples and specimens directly could be quite useful.  As these physical objects make their way into collections, we anticipate that collections folks will still assign local identifiers.  Both have their uses and are made stronger and more useful when linked.

In summary: we are less worried about what exactly a GUID will point to (digital record vs physical object) as long as the content referenced is valuable to the biodiversity collections and science community.  However, we are more worried that we’re not explicitly identifying what we’re assigning identifiers to, and not discussing who will manage these identifiers and how they will be integrated.  Our focus should be on developing trusted and well-understood GUID services that provide content resolution for the long term (50-100 years).


[1] To resolve a GUID, you dump it into a resolution service maintained by a naming authority that originally created that GUID, such as CrossRef or DataCite.  That service then returns to you links and other information that point you to other content attached to the same identifier.  A great example of a resolvable GUID is a Digital Object Identifier (DOI).  In the case of journal articles, resolution of a DOI will usually direct you to the paper itself via hyperlink, but it could also be a web page with information about the resource, a sound file, or any other representation of the object associated with the GUID.
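
As a rough illustration of the first half of that process, the sketch below turns a DOI, in any of the common ways it gets written, into the URL that asks the doi.org resolver where the content currently lives. The function name `doi_to_url` is made up for this example, and the code only builds the URL; it does not contact the resolver.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"

# Prefixes people commonly paste in front of the bare DOI.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi):
    """Return the resolver URL for a DOI given as a bare string or a URL."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10."):
        raise ValueError("not a DOI: %r" % doi)
    # Keep the slash between prefix and suffix; escape anything unusual.
    return DOI_RESOLVER + quote(doi, safe="/")
```

Requesting the returned URL with any HTTP client would then be redirected to wherever the naming authority says the content is today.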

About Rob

Three "B's" of importance: biodiversity, bikes and bunnies. I get to express these "B's" in neat ways --- I bike to a job at the University of Colorado where I have a split appointment as Curator of Zoology and Associate Professor. Along with caretaking collections, I also have a small zoo at home, filled with two disapproving bunnies.

34 Responses to What gets linked to global unique identifiers (GUIDs) in natural history collections digitization?

  1. rdmpage says:

    So my question is: why bother putting global identifiers on specimens (or, indeed, anything)? From my perspective it is so we can refer to those specimens and make statements such as “this is an image of this specimen”, “I assigned this specimen to this species”, “this DNA sequence comes from these specimens”, “we have this specimen in our collection”. If the specimen identifier refers to a digital record and not a specimen then it is not easy to make those statements. We can, but we have to do things like “this DNA sequence comes from the specimen that has the digital record with this identifier”. This would be like saying

    The paper with the identifier http://dx.doi.org/10.1371/journal.pbio.1000309 cites the paper with the identifier http://dx.doi.org/10.1371/journal.pbio.0040381. Doable, but not as simple as

    http://dx.doi.org/10.1371/journal.pbio.1000309 -> cites -> http://dx.doi.org/10.1371/journal.pbio.0040381

    Any attribute or relationship we assign to a specimen will have to go through this convoluted step of saying “this is an attribute of the thing that has this identifier”. In the RDF world the specimens are bNodes (see http://www.w3.org/2005/rules/wg/wiki/bNode_Semantics.html ). bNodes don’t have names (identifiers). Surely this is the exact problem we want to solve?

    What I’m looking for is the equivalent for specimens of the DOI we have for papers. The paper http://dx.doi.org/10.1371/journal.pone.0042499 cites the specimen USNM 509539. The image http://dx.doi.org/10.1371/journal.pone.0042499.g003 depicts specimen MCZ 161803. The paper and image exist as first class digital citizens. “USNM 509539” and “MCZ 161803” aren’t resolvable, nor are they unique. All the things we want in terms of citation and provenance hang on having identifiers for the actual specimens.

    Tackling this problem isn’t easy, in that we have a long legacy of using codes like “USNM 509539” which are only locally unique (i.e., typically within a taxon-based collection within an institution). So we will need services to map these codes to identifiers, in the same way that we have tools to map bibliographic citations to identifiers such as DOIs.

    My long-term concern is that as these large-scale digitisation efforts ramp up people will start to ask “OK, now these specimens are digitised, what can I learn about them? How many times have they been cited, what is the value of the collection in terms of citations, vouchers for sequences, etc.”. It would be a pity if the answer is “we don’t know”.
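
A minimal sketch of the kind of mapping service described in this comment, assuming a registry keyed on institution code and catalog number. Because the legacy codes are only locally unique, each code maps to a list of candidate identifiers instead of a single one. The registry contents and the `https://example.org/...` identifiers are hypothetical placeholders, not real specimen identifiers.

```python
import re

# Hypothetical registry: one legacy code can match records in more than one
# collection within the same institution, hence a list per key.
REGISTRY = {
    ("USNM", "509539"): [
        "https://example.org/specimen/usnm/collection-a/509539",
        "https://example.org/specimen/usnm/collection-b/509539",
    ],
    ("MCZ", "161803"): [
        "https://example.org/specimen/mcz/collection-a/161803",
    ],
}

# Accepts variants such as "USNM 509539", "usnm-0509539" or "MCZ:161803".
CODE_PATTERN = re.compile(r"^\s*([A-Za-z]+)[\s:\-]*0*(\d+)\s*$")


def lookup(code):
    """Return every candidate identifier for a legacy museum code."""
    match = CODE_PATTERN.match(code)
    if match is None:
        return []
    key = (match.group(1).upper(), match.group(2))
    return REGISTRY.get(key, [])
```

A result with more than one entry is the ambiguity the comment describes: the citing paper, not the code alone, has to supply the context that picks the right specimen.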

    • John Deck says:

      Per the statement “… assign GUIDs to digital specimen records stored in local museum databases and linked to the local identifiers”, the term “linked” needs clarification. Linking identifiers in this case is impossible in RDF mainly because the localIDs are not constructed as URIs so can only be treated as literals, limiting our ability to link & construct graphs in the usual sense. Also, the local IDs are not globally unique so even if we do start to join them up in RDF to GUIDs using instance level relationship predicates we’ll get in trouble w/ duplicates.

      Looking at the years of trouble on the internet about the very subject of “digital vs physical” identifiers makes me more convinced than ever we need to brand an identifier scheme as purely for standing for physical material. These IDs would be known to explicitly stand for physical material even though they are not directly affixed to the specimen. The last couple of BiSciCol blogs have been building the case for this (http://biscicol.blogspot.com/2012/12/biscicol-in-four-pictures.html and http://biscicol.blogspot.com/2012/10/making-it-ez-to-guid.html).

      So, the “link” we’re talking about is implicit in the type of digital GUID used (not constructed in RDF!), so people can lookup and link their localIDs to a specific digital GUID known to stand in relation to physical objects. Currently, BiSciCol is working on this very concept with the California Digital Library, by adopting their EZID scheme for this purpose. The trick is in the scaling, getting up to billions of identifiers as a community service, but at least we’re getting better at defining the problem!

      • Rod Page says:

        John, I’m not sure what you mean by “we need to brand an identifier scheme as purely for standing for physical material”. I know that the semantic web community has gone (and continues to go) into meltdown over information versus non-information resources, but it’s not clear to me that we need a new, or a branded identifier scheme as such. We have DOIs that identify resources that may exist digitally (e.g., a PDF) or physically (e.g., a printed book). There’s nothing in the DOI itself that says the thing it identifies is digital.

        I guess I don’t particularly care what technology is adopted, so long as the identifiers are persistent, resolvable, and used. I think the solution of minting ARKs by default, and “upgrading” them to DOIs if cited makes a lot of sense in terms of scalability.

        I look forward to identifiers being minted, and being used. Unfortunately anyone wanting to link to specimen data right now has to rely on fragile URLs that break on a whim (GBIF, I’m looking at you). I’m toying with building a service that manages the link between a museum specimen code and the (possibly multiple) GBIF URL(s), so that even in the absence of GUIDs I can have a way of linking to specimens that is robust.

  2. Andy Bentley says:

    All products (images, DNA sequences, published materials etc.) emanate and are produced from the original specimens. In essence they are simply preparations of the original object. The only way you are going to have a resolvable GUID that can be attributed to a specimen and all its products is to have that GUID associated with the specimen in the original database where that record is housed and originated from. The other part of the problem is to ensure that this GUID “follows” the specimen and its products wherever they may go (GBIF, Vertnet, Fishnet, Fishbase, BOLD, Genbank, publication etc. etc.) and is cited along with any use of the specimens or its products. It needs to be as readily used as catalog numbers are today.

    • Rod Page says:

      +1 for “It needs to be as readily used as catalog numbers are today.” From the perspective of someone who wants to integrate specimen data I am interested in both the future going forward (ideally, specimen GUIDs routinely cited in publications and databases such as GenBank and GBIF) and the legacy literature and data (valuable information on specimens referred to by their museum codes). This is essentially the same task facing anyone building citation networks for the scientific literature. Going forward, citing a publication in a manuscript should be as trivial as pasting in a DOI for a paper; looking backwards, there is the joyous task of reconciling all manner of citation strings.

  3. rogerhyam says:

    The more this debate goes on the more naive my approach becomes. If you want to say, just for example, that one resource is made from another resource just use a dc:source (“A related resource from which the described resource is derived.”) assertion. Use dc:type and dc:format to describe what the thing is. Use HTTP URIs as the identifiers. Ownership and trust is based on the DNS system.
    There is no issue here. No problem to solve. Nothing to talk about.

    Shameless plug for our paper on linking to herbarium specimens.

    Hyam, R.D., Drinkwater, R.E. & Harris, D.J. Stable citations for herbarium specimens on the internet: an illustration from a taxonomic revision of Duboscia (Malvaceae) Phytotaxa 73: 17–30 (2012). http://www.mapress.com/phytotaxa/content/2012/f/pt00073p030.pdf

    • Rod Page says:

      Well, there are a couple of things to talk about ;)

      I think the RBGE is in the nice position of having a degree of institutional stability, and having a consistent way of referring to specimens that is locally unique and hence easy to translate into globally unique identifiers (using the “http://data.rbge.org.uk/herb/” prefix), which makes them easily discoverable. It also seems to have a sensible approach to managing its domain name (e.g., the ease with which you could set up a subdomain). It’s reasonable to expect that these URLs will perform as you expect.

      But this need not always be so, and many institutions will not find themselves in this position. Many collections do not have locally unique identifiers, or if they are locally unique, it is only because they have inserted additional namespaces that mean the identifiers are no longer those that have been cited in the literature (and hence the digital identifier is not easily discoverable). I suspect that in some (if not most) institutions, there is little understanding of the desirability of maintaining stable URLs. I have seen entire museum research publication archives disappear on the whim of a web site redesign.

      In the paper you mention “The only thing we depend on is continued legal rights to the rbge.org.uk domain name.” Just to play Devil’s advocate, I wonder what will be the fate of “.uk” URLs for Scottish institutions if, say, Scotland becomes independent? What if it becomes a republic and decides that “Royal” is no longer a suitable prefix for Scottish institutions? What happens if the RBGE is subsumed into another collection, will that collection be able (or even willing) to maintain URLs in another Internet domain?

      I guess we could go round and round on this. I think it’s great what RBGE are doing, it’s an elegant solution to RBGE’s problem “How do we get our specimens online?”. The more specimens are on the web as individually addressable items of data the better. If we added discovery services, usage tracking (citation), and caching of metadata to handle service outages (and support search), then we could get the kind of system I’d like to see.
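
A minimal sketch of the prefix approach mentioned earlier in this comment: when a local identifier is already unique within the institution, a globally unique HTTP URI is just the institution's prefix plus that identifier. The prefix is the one quoted above; the function names and the barcode `E00000001` are hypothetical and only illustrate the shape of the mapping.

```python
RBGE_PREFIX = "http://data.rbge.org.uk/herb/"  # prefix quoted in the comment above


def specimen_uri(barcode, prefix=RBGE_PREFIX):
    """Build a globally unique URI from a locally unique barcode."""
    barcode = barcode.strip().upper()
    if not barcode.isalnum():
        raise ValueError("unexpected barcode: %r" % barcode)
    return prefix + barcode


def barcode_from_uri(uri, prefix=RBGE_PREFIX):
    """Recover the local barcode from a URI minted with the same prefix."""
    if not uri.startswith(prefix):
        raise ValueError("URI does not use the expected prefix: %r" % uri)
    return uri[len(prefix):]


# Hypothetical barcode, used only to show the round trip.
example_uri = specimen_uri(" e00000001 ")
```

The round trip works only because the local identifier is unique to begin with, which is exactly the condition this comment says many collections do not meet.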

      • Regarding obsoleted URLs: I assume that they would maintain legacy permanence in the same way that herbarium abbreviations do, even when collections are merged, e.g. UC-JEPS and RSA-POM. The name doesn’t have to be an accurate description of the current status of the institution, as long as it’s uniquely identifiable, right?

      • rogerhyam says:

        Most of what you say only applies to small collections. The vast majority of specimens are in a few large collections (http://www.hyam.net/blog/archives/1235) that are perfectly capable of doing what we did within their current resources – but they don’t and that *is* worth talking about.

        And yes independence is a question but not having access to the domain even after independence is extremely unlikely. I’m sure we could find an English resident to act as registrant or have an exception made if .UK names were ever restricted. Scotland and England would have to go to war to stop it. That hasn’t happened since the 14th Century so perhaps we are due one. In the last century we were at war with Germany twice so we better not rely on institutions based there. The century before that we were at war with both the French and the Americans so we can’t trust them either. Don’t even mention the Dutch.

        To me major global conflicts and subsequent loss of domain names seem far less likely to occur than that we simply fall out with some third party who is supposed to do the GUID resolution. I may be wrong.

        The institution could cheerfully change its name and domain provided we keep the old one for existing identifiers. RBGE stands for The Rabbie Burns Garden Edinburgh anyhow comrade Rod :)

      • John Deck says:

        There are two other issues with using Http URIs to identify specimens. First, what happens when RBGE decides to move some of its collections to another institution in a more stable country, say Switzerland (or perhaps decides to consolidate collections to save costs to pay for the National Haggis fund)? At the very least, redirecting that URI becomes a pain in the long run. Second, using Http: really muddies the waters re: looking at these identifiers as representing either physical or digital objects. When I see “Http:” I always think of a web resource, not a physical resource. My first comment on this thread was pointing out the advantages of a new scheme, or authority, that confers explicit meaning to specimens as physical objects just so we’re all clear it’s not digital (which both urn:uuid: and http: tend to convey).

        Per your comment about using “…dc:source ‘A related resource from which the described resource is derived.'” — if we want to make some sense of relationships in a graph, we’ll need more expressivity than that, essentially needing to cover transitive & symmetric properties of the joiners (For example… there is a big difference between the nature of the relationship of a tissue sub-sample to a specimen than say, to an Agent who made an identification of the specimen. Same goes for the relationship to different identifiers pointing to the same object.) More at: http://biscicol.org/terms/index.html
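
A minimal sketch of why the kind of relationship matters, using plain Python dictionaries in place of an RDF store. A "derived from" relation can be followed transitively (a sequence traces back through its tissue to the specimen), a "same as" relation between identifiers reads in both directions, and an "identified by" relation to an agent should not be followed at all when tracing physical provenance. All relation names and item identifiers here are hypothetical and are not taken from the BiSciCol terms.

```python
# Transitive: each item points at the thing it was physically derived from.
DERIVED_FROM = {
    "tissue-1": "specimen-1",
    "dna-extract-1": "tissue-1",
    "sequence-1": "dna-extract-1",
    "image-1": "specimen-1",
}

# Not transitive: an agent made a determination; nothing was derived.
IDENTIFIED_BY = {"specimen-1": "agent-1"}

# Symmetric: two identifiers that point at the same object.
SAME_AS = [("specimen-1", "legacy-code-1")]


def provenance(item):
    """Follow DERIVED_FROM links back to the original object."""
    chain = []
    seen = {item}
    while item in DERIVED_FROM:
        item = DERIVED_FROM[item]
        if item in seen:
            raise ValueError("cycle in derivation chain at %r" % item)
        seen.add(item)
        chain.append(item)
    return chain


def aliases(item):
    """Return identifiers declared equivalent to item, in either direction."""
    return {b for a, b in SAME_AS if a == item} | {a for a, b in SAME_AS if b == item}
```

With a single generic "source" relation, the agent and the legacy code would be indistinguishable from the tissue sample when walking the graph, which is the loss of expressivity this comment is pointing at.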

      • rogerhyam says:

        The thing I love about science is that we can have fun coming up with 100 reasons why a proposed hypothesis is wrong. In contrast engineering is about doing a cost/benefit analysis and choosing a solution that looks like the most optimal one today then building the thing. In science you don’t have to build it just write it up.

        Whatever you suggest as a relationship between two entities someone else could always say “that isn’t expressive enough” or “someone may want to …”. Each one may be valid and each one could be a new publication and each one puts enough doubt in the mind of a curator that they don’t put their stuff on line for fear of being ‘wrong’.

        Because our specimens have nice HTTP URIs you can make whatever assertions you like about them and be as expressive as you want. If we had waited to sort out a more detailed ontology before we put them on line you wouldn’t be able to say anything about them at all in any ontology.

        What I am preaching is an agile approach to building a network of resources. Do something really really simple and make it work. Then step back and look at the new world we have made before designing the next bit.

        “What happens if….” will prevent anyone ever doing anything.

        And yes – httpRange-14 is a spiritual answer to many questions in life! I love it.

      • Rob says:

        Yeah my problem is when a poorly constructed set of assertions leads to nonsense inferences when reasoning over those assertions.

      • Rod Page says:

        (This comment is in response to John’s below, but the WordPress comment system won’t let me respond directly to that comment, sigh).

        Isn’t the “http:// means it’s digital” argument dead now? In the linked data world http://dbpedia.org/resource/Paris is the city in France, not a digital record about that city. I realise that people can get muddled about this (and let’s not mention httpRange-14), but our experience with alternative schemes has been something of a disaster (e.g., LSIDs).

        As a concrete example, the Rosetta Stone in the British Museum, very much a physical object, has the identifier http://collection.britishmuseum.org/id/object/YCA62958. If it’s good enough for the BM and the Rosetta Stone, why is it not appropriate for natural history specimens?

  4. John Deck says:

    People will use HTTP: for identifiers and we can figure out what they mean, but this doesn’t mean that there aren’t other techniques to use that are also valid and more robust.

    LSIDs may have failed for other reasons, and the ongoing support role that Crossref/Datacite plays seems key for DOIs. From John Kunze’s comments on another blog/thread: “There is no magic bullet for persistence, which is a sweaty, onerous service undertaking.”

    • Rod Page says:

      Persistence is indeed hard work, but we can make some choices that make the task easier. For example, DOIs lack branding, which makes it easier for a publisher to accept that if they purchase another publisher they must honour their DOIs. The layer of redirection that comes with DOIs (or, indeed, PURLs) frees both consumer and publisher from some of their biggest concerns. The consumer is worried that the identifier won’t persist, the publisher may balk at the commitment to maintaining a given URL. A system like DOIs brings with it the need for management of a centralised service, but this has advantages such as the ability to track down and fix broken identifiers, as well as build tools on top of the centralised aggregation of data (e.g., http://search.crossref.org/ ).

      I have enormous sympathy with Roger’s “just do it” approach, especially as RBGE have created nice, clean URLs for specimens. If everyone were to do this we’d be making progress. So I don’t want to be the person who says “you don’t want to do it like that”.

      I think the museum community will repeat many of the experiences of the academic publishing world. Initially people used URLs for articles (or just in time linking using OpenURL), and some publishers have fairly clean URLs, e.g. http://www.sciencedirect.com/science/article/pii/S1055790312004976 (for the paper http://dx.doi.org/10.1016/j.ympev.2012.12.019 ). But the fluid nature of publishing web site design (where new systems resulted in new URLs), changes in ownership of publishers, and movement of journals between publishers led to URLs changing, hence the birth of CrossRef and DOIs.

      I suspect we will see technically savvy institutions such as RBGE put their collections online with a commitment to maintaining the URLs, and others will attempt to do the same with varying levels of success. There will be failures, cases where specimens will be cited using URLs and the URLs will break. We will possibly just accept this, or we will think about ways of mitigating this problem. And, eventually we will likely come up with something like CrossRef. I still think we have issues of persistence and discovery that would be best tackled by a centralised aggregation service. GBIF would be the obvious candidate for this role, though I’ve seen little evidence it’s one they want. The pity is we already have some 383,000,000 specimens and observations online, each with a unique URL, but without a commitment to their stability (or decent mechanisms to ensure their stability) these URLs have little value as identifiers.

      • Rob says:

        +1 to this whole post. I see a HUGE advantage to a centralized and respected authority/aggregator that helps the community manage GUIDs for the exact reasons Rod has outlined above. As GBIF moves to a dataset publishing model, I wouldn’t be at all surprised to see them focus on dataset level GUIDs (DOIs, basically) managed by DataCite. What is less clear to me is the record or occurrence level IDs needed for linking specimens with other stuff (e.g. literature and downstream “preparations”).

        One other huge advantage to a centralized and respected service is that while GUIDs themselves might be “static” in a sense, there is a lot of innovation and change happening with the quality and type of services built around GUIDs that can help both publishers and consumers. Having expertise in this arena and leveraging that expertise and existing services means that we aren’t constantly homebrewing solutions.

      • John Deck says:

        ++ re: ‘Roger’s “just do it”’ approach. Agreed that’s moved us forward fast.

  5. rogerhyam says:

    I was just about to go to bed when I remembered my blog post from back in April 2009 where I said “I believe this is my final word on persistence of GUIDs” http://www.hyam.net/blog/archives/346

    • Rod Page says:

      Well, you will keep feeding this particular troll ;)

    • Rod Page says:

      Oh, and the “The Rabbie Burns Garden Edinburgh” was genius…

    • John Deck says:

      A fun read that post… However, I didn’t see any recognition that the indirection DOIs offer (and EZID by extension) is the key to offering some persistent format for web resolution. That is, URLs can be updated on the backend not as a way to mutate the data but as a way to offer some stability in web resolution. Key metadata stays with the DOI (e.g. what is this resource).
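
A minimal sketch of the indirection being described, assuming a resolver that stores, for each identifier, a current target URL plus a small metadata record. The identifier never changes; only the target it redirects to can be updated. The class, its method names, the identifier `id:example-0001`, and the URLs are all hypothetical and do not reflect the actual DOI or EZID APIs.

```python
class Resolver:
    """Toy identifier resolver: stable identifiers, updatable targets."""

    def __init__(self):
        self._targets = {}
        self._metadata = {}

    def register(self, identifier, target_url, metadata):
        if identifier in self._targets:
            raise ValueError("identifier already registered: %r" % identifier)
        self._targets[identifier] = target_url
        self._metadata[identifier] = dict(metadata)

    def update_target(self, identifier, new_url):
        """Point an existing identifier at a new location; metadata is untouched."""
        if identifier not in self._targets:
            raise KeyError(identifier)
        self._targets[identifier] = new_url

    def resolve(self, identifier):
        return self._targets[identifier]

    def describe(self, identifier):
        return dict(self._metadata[identifier])


resolver = Resolver()
resolver.register(
    "id:example-0001",
    "http://old.example.org/specimens/1",
    {"what": "PhysicalObject", "who": "Example Museum"},
)

# The museum redesigns its web site; citations of the identifier keep working.
resolver.update_target("id:example-0001", "http://new.example.org/objects/1")
```

Anyone who cited `id:example-0001` is unaffected by the move, at the cost of someone having to run and maintain the resolver.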

      • rogerhyam says:

        You see no recognition because I don’t recognise it because it doesn’t exist.

      • Rod Page says:

        Well, this is fun :)

        At the risk of this discussion going on forever, I’m going to try and articulate why Roger and I disagree, but can both be “right”. Bear with me.

        As I sketched in http://iphylo.blogspot.co.uk/2013/01/megascience-platforms-for-biodiversity.html I view biodiversity data as forming a graph with nodes and edges (“links”), in other words:

        ❍――❍

        If you are a node (❍ e.g., a museum or herbarium) your goal is to get your collection online. Roger’s “just do it” with HTTP URIs will achieve this. You want the URIs out there, you want them resolvable, and you want people to use them. If a URI breaks, that’s maybe embarrassing, but you do your best to avoid it happening.

        If you are an integrator then you are interested in the links (――), which means you construct pairs of nodes like this (❍,❍). Now, if either one of the URIs changes your link breaks, and that’s bad. So you are very sensitive to the persistence of URIs across the whole of biodiversity informatics, and are painfully aware that these can and do break (no matter what the underlying technology). Whole chains of inference can collapse if URIs change. All the effort expended on discovering the link is wasted (hence my anguish when GBIF casually deletes vast numbers of URLs http://iphylo.blogspot.co.uk/2012/07/dear-gbif-please-stop-changing.html ).

        There is a strong incentive to get data online, so nodes (❍) will work to make this happen. In other words, there is a “constituency” (bunch of people/organisations who want to make it happen because it is in their interests for this to happen).

        There is not, however, a “constituency” for integrating data using links. Sure, we have GBIF, but GBIF integrates by geography and tags (taxonomic names), and is less interested in individual specimens than in the patterns that emerge from the aggregation of that specimen data. In short, GBIF can get by fine without links (as evidenced by the fact that it changes its own occurrence URLs at will, and that most of the specimens it harvests don’t have their own URLs).

        In the science publishing industry I think there is a clear constituency for links. In order to get institutions to buy subscriptions, publishers need to convince them of the value of their catalogue: “these journals are highly cited in the field in which your people work” is a good argument (hence the focus on impact factor). To get content they need to convince authors that they should send them their manuscripts “publish in our journal and you will be read and cited” (impact factor again). The currency of value is the citation (aggregated over individual journals), so you need a robust system to maintain that, and to drive the citation traffic (equivalent to “Google juice”) around the network. Hence the publishing industry created CrossRef.

        In the absence of a clear economic, social, or scientific incentive (i.e., a constituency) for links, the argument for infrastructure to maintain identifiers seems over the top and gets in the way of “just do it”. The benefits of the extra steps (e.g., indirection, centralised caching and discovery) only become compelling if you think the links matter (or, more precisely, matter enough to be your primary focus).

        Perhaps a way to frame our disagreement is that Roger argues that HTTP URIs for each specimen are a prerequisite for getting things online, and that if we do it carefully and sensibly the links will naturally emerge. I am less optimistic that the links will emerge without identifiers being managed, and worry that some of the expected benefits of digitisation (such as being able to track the provenance and use of data from specimens) will not be properly realised without some degree of curation of those links.
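
A minimal sketch of the integrator's view described in this comment: links are pairs of URIs, and a link is only as good as both of its ends. The function name and the `example.org` URIs are hypothetical; the point is only that one node changing its URI silently breaks every pair that mentions it.

```python
def broken_links(links, resolvable):
    """Return the links for which at least one end no longer resolves."""
    return [(a, b) for a, b in links if a not in resolvable or b not in resolvable]


# Links an integrator has painstakingly discovered (pairs of nodes).
links = [
    ("http://example.org/sequence/1", "http://example.org/specimen/1"),
    ("http://example.org/image/1", "http://example.org/specimen/1"),
    ("http://example.org/image/2", "http://example.org/specimen/2"),
]

# URIs that currently resolve, before and after one node renames a specimen URI.
before = {
    "http://example.org/sequence/1",
    "http://example.org/image/1",
    "http://example.org/image/2",
    "http://example.org/specimen/1",
    "http://example.org/specimen/2",
}
after = (before - {"http://example.org/specimen/1"}) | {"http://example.org/occurrence/1"}

lost = broken_links(links, after)
```

One renamed specimen URI costs two of the three links here, and the node that made the change has no way of knowing it.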

      • rogerhyam says:

        I think we both have the frustration that GBIF could be our CrossRef.

      • Rob says:

        Why expect GBIF to take on a _completely_ different mission? GBIF publishes data, much as journals publish articles. DOIs arose from international trade associations within the publication industry, not from an individual publisher. For the collections community, I can see a similar trajectory where some organization that doesn’t publish data directly helps manage and curate GUIDs.

        John Deck mentioned the BiSciCol project (http://biscicol.blogspot.com/; note: we are both involved in the BiSciCol project), whose goal is to help prototype services and systems that may eventually grow into a sustainable system for managing GUIDs. We hope strategic partnerships and organizational thinking among the collections community might lead to something similar to CrossRef or DataCite. I don’t think GBIF is the immediate or even long-term answer, but perhaps GBIF in collaboration with other partners and CDL/EZID might move us much closer?

      • rogerhyam says:

        From their website:
        “GBIF’s mission is to make the world’s biodiversity data freely and openly available via the Internet.”
        “One of GBIF’s main purposes is to establish a global decentralised network of interoperable databases that contain primary biodiversity data.”
        —-
        GBIF don’t publish our data; we do. GBIF indexes and aggregates it, but it is also available from us or from other people. Last time I was talking to the guys at GBIF they were keen on building service registries.
        In this slide from one of John’s presentations about your project, he describes GBIF as an aggregator, not a publisher.
        http://image.slidesharecdn.com/3bitriplifiertalk-120821195805-phpapp02/95/slide-16-728.jpg
        —-
        Providing GUID services doesn’t seem that different from building a global network of interoperable databases – certainly not a “completely different mission”.
        —-
        GBIF have a long term continuity plan based on support from the governments of multiple countries.
        —-
        I think my proposition is a very reasonable one.

      • Andie says:

        “I am less optimistic that the links will emerge without identifiers being managed, and worry that some of the expected benefits of digitisation (such as being able to track the provenance and use of data from specimens) will not be properly realised without some degree of curation of those links.”

        +1 on this Rod, and I think that’s the exact point we were trying to get at (successfully or not) with this post in the first place. While the technical/infrastructural details are important, a clear plan for curation and management — and a clear understanding who is responsible for what — re identifiers is crucial as well, and that’s something that’s underdiscussed.

        With the point above in mind — Roger, I would be more convinced of the feasibility of GBIF-as-identifier-provider if we had someone from GBIF participating in this thread, or more actively discussing the issue. From what I’ve read, GBIF has historically been pretty conservative in the amount of responsibility for collections data they assume. Whether they’re publishers or aggregators — they certainly don’t seem to want to be managers or curators.

      • Rob says:

        We should leave it to GBIF folks to weigh in here, just as Andrea has said. I used the word “publisher” for a very good reason: I think GBIF has moved to that model with the Integrated Publishing Toolkit. GBIF publishes Darwin Core Archives that it (or others) can harvest as an aggregator. Consider it semantics, but to my mind it has a big impact on how things move forward because DwC-As should at a minimum get a dataset-level GUID when published.

        As for “GBIF have a long term continuity plan based on support from the governments of multiple countries” — I am less optimistic that GBIF has the resources you think. This makes it all the more critical for GBIF to NOT re-invent the wheel when it comes to GUID services.

        We really can go round and round on this! I am not at all interested in arguing semantics or getting bogged back down in arguments of yesteryear. I am interested in pragmatic solutions among those WILLING TO DEVELOP AND TEST THOSE SOLUTIONS IN THE CRUCIBLE OF OUR COMMUNITY.

        It’s going to be fun, interesting times, but it’s going to require us to consider new models and look broadly at the landscape around us.

      • rogerhyam says:

        Sorry. I’m obviously upsetting people so will shut up.

      • Rob says:

        Oh, one other thing. Roger, you said “Providing GUID services doesn’t seem that different from building a global network of interoperable databases – certainly not a ‘completely different mission’”. There we do fundamentally disagree.

      • Rob says:

        Oh no no! Not upset or distressed. Just the opposite. Interested, engaged and curious to sort these issues in a way that leads to positive outcomes. No snark intended and you certainly deserved a longer explanation (I was late for a talk!) about why it’s not quite so simple, to my mind, to say that “Providing GUID services doesn’t seem that different from building a global network of interoperable databases”. If anything, accept an apology and let’s continue the conversation!

  6. This is a wonderful discussion. I have my own 2 cents of course. First, on the first order of business, I believe physical objects in natural history collections need unique identifiers if we are going to refer to them in publications, including, for example, the semantic web. If we do not, then we are in the odd situation of not being able to make direct claims about the objects without ambiguity. For example, in a DNA database we would like to say “Object XYZ has DNA sequence ATTGCCC…” We do not want to say “the digital surrogate of Object XYZ has DNA sequence ATTGCCC…” In addition, unlike with DOIs, we do not want to refer to a class of objects, such as all books in a particular print run, but to one unique item in a collection (usually). We might sometimes want to generalize and say “All of species ABC has DNA sequence ATTGCCC…” but that is a very different statement. Likewise, if we have an image of the very same specimen, we want to say, “Object XYZ has Image 1239XY.” In that statement the identifiers are typed: “OBJECT XYZ” and “IMAGE 1239XY”. In the semantic web at least, the typing need not be part of the name and likely should not be. We really need to make two statements: XYZ isa ObjectIdentifier and 1239XY isa Image. We could handle this with namespaces as well, as long as we agree on this.
    We really should not worry much in our discussion about whether the statements being made are true or false in the real world. At least for the semantic web, anyone can say anything about anything. This means we can disagree. That is good. We can choose to trust particular sources and have our computers perform inferences over those sources and ignore other sources. So, “Authority A” can say, “Object XYZ has DNA sequence ATTGCCC…” and “Authority B” can say, “Object XYZ has DNA sequence CCCGTTA…”. Authority C can say, “Authority C trusts Authority A.”

    So, iDigBio, we need unique identifiers for the objects you are digitizing! Some institutions are ready to mint their own and feed them into the digitization process, but in most cases today global unique identifiers will not exist, so at the time of digitization one must be created and preserved, preferably in the home institution but not necessarily so. The home institution must, however, be able to retrieve the one unique physical object if anyone produces the identifier for that object… because they want to run a fancy new nondestructive DNA sequencing technique on it, for example.

  7. I just read this thread and can’t believe it happened. Is this really 2013? The issue of whether http URIs can be used to identify non-information resources has been settled for something like 8 years. If anybody is confused about what an http URI represents, dereference it and look at its rdf:type. Because it’s an http URI you at least have the possibility to do that.
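
The typed statements and source-trust idea from comment 6, and the rdf:type check suggested in comment 7, can be sketched with plain in-memory triples. This is only a minimal illustration under stated assumptions: the `ex:` identifiers, the class and predicate names, and the helper functions are all hypothetical, and no real URIs are dereferenced.

```python
from collections import namedtuple

# A statement is a (subject, predicate, object) triple plus the source asserting it.
Statement = namedtuple("Statement", ["subject", "predicate", "obj", "source"])

RDF_TYPE = "rdf:type"

# Every identifier, class, and predicate below is invented for illustration.
statements = [
    # Typing lives in separate statements, not in the identifier itself.
    Statement("ex:XYZ", RDF_TYPE, "ex:PhysicalSpecimen", "AuthorityA"),
    Statement("ex:1239XY", RDF_TYPE, "ex:Image", "AuthorityA"),
    Statement("ex:XYZ", "ex:hasImage", "ex:1239XY", "AuthorityA"),
    # Two sources disagree about the same physical object.
    Statement("ex:XYZ", "ex:hasSequence", "ATTGCCC", "AuthorityA"),
    Statement("ex:XYZ", "ex:hasSequence", "CCCGTTA", "AuthorityB"),
]


def types_of(identifier, stmts):
    """Return the set of rdf:type values asserted for an identifier."""
    return {s.obj for s in stmts
            if s.subject == identifier and s.predicate == RDF_TYPE}


def trusted(stmts, trusted_sources):
    """Keep only the statements made by sources we choose to trust."""
    return [s for s in stmts if s.source in trusted_sources]


def values(identifier, predicate, stmts):
    """Return the sorted objects asserted for (identifier, predicate)."""
    return sorted(s.obj for s in stmts
                  if s.subject == identifier and s.predicate == predicate)


# "Authority C trusts Authority A": C reasons only over A's statements.
view_c = trusted(statements, {"AuthorityA"})

print(types_of("ex:XYZ", statements))                    # what kind of thing is ex:XYZ?
print(values("ex:XYZ", "ex:hasSequence", statements))    # everything anyone has said
print(values("ex:XYZ", "ex:hasSequence", view_c))        # only what C's trusted source says
```

In this sketch the identifier `ex:XYZ` carries no type information in its name. Whether it denotes the physical specimen or a digital record is answered by looking up its type, and conflicting claims are resolved by choosing which sources to trust.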
