This week in digitization: The good, the buggy, and the curious

This will be old news to many, but regardless: two big projects related to specimen digitization and biodiversity informatics launched in the past couple of weeks. Quick impressions on both below, focusing on the good, the buggy, and a few items of curiosity. Both projects are great, but how will they fit into the broader landscape of existing resources, and into what niches?

1) Notes From Nature: a new Zooniverse project for the transcription of natural history collection ledgers. This has been a long time in the making (more details here) and, as of this writing, the two available collections (herbarium specimens from SERNEC and insects from CALBUG) are already 26% and 21% transcribed, respectively.

The Good: As always, clean and intuitive interfaces from the Zooniverse team make transcription fast and easy. Data entry screens are customized to each type of collection (e.g. plant labels often contain more detailed locality descriptions than insect labels, whereas insect labels often contain data about the host organism the insect was found on). Awesomely, all the code is available on GitHub in case other museums want to set up their own transcription engines locally. There is also an intriguing teaser buried at the very bottom of the Notes from Nature “About” page: “Interested in publishing your collection? Contact us.”

The Buggy: Maaaan, I’ve transcribed around 40 labels and my total isn’t showing up under my user profile. This bothers me more than I care to admit, though it’s primarily out of worry that my transcriptions aren’t being saved.

The Curious:  It would be great to learn more about how these data get back to the collections databases, and how exactly that handoff happens.  What do the transcribed files look like?  How is accuracy checked?  Do the museums have plans to make these records publicly available, or harvestable by aggregators like GBIF?

2) The patriotically named Biodiversity Information Serving Our Nation (AKA BISON) biodiversity data portal out of USGS. I know little about this project beyond what I’ve learned at various conference talks; however, I’ve heard it referred to as the “federal version of iDigBio.”

The Good: On first look, really nice integration of specimen occurrence data with USGS map layers, and as Hilmar Lapp pointed out, there’s an API, which is great.

The Buggy: There are no identifiers on these specimens, not even their local catalog numbers. Per Stinger Guala in the G+ thread linked above, the data is there; it’s just not yet visible (though it will be soon). Perhaps there are reasons (a need for better formatting? a need for cleaner data? a need for more server space?) that they’re not making this data visible yet, but it struck me as a pretty glaring omission. While I realize that many researchers don’t spend a lot of time looking at catalog numbers, I imagine that they’d be absolutely critical if one were integrating BISON data with data from other sources (say, another portal like GBIF). Also, how could any of these records ever be linked back to the source data or any other data out there? Provenance = important, no?
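
To make the catalog-number point concrete, here is a minimal Python sketch of how records from two portals might be matched. It is an illustration only, not how BISON or GBIF actually link records. It assumes both exports carry the Darwin Core fields `institutionCode` and `catalogNumber`; the helper names `record_key` and `link_records` and the sample records are made up for the example.

```python
from collections import defaultdict


def record_key(record):
    """Build a composite key from institution code and catalog number.

    Returns None when either field is missing, which is the situation
    described above: with no identifier, the record cannot be linked.
    """
    inst = (record.get("institutionCode") or "").strip().upper()
    cat = (record.get("catalogNumber") or "").strip().upper()
    if not inst or not cat:
        return None
    return (inst, cat)


def link_records(portal_a, portal_b):
    """Match records in portal_a to records in portal_b by composite key."""
    # Index the second portal's records by key; a key may map to several
    # records, since catalog numbers are not guaranteed to be unique.
    index = defaultdict(list)
    for rec in portal_b:
        key = record_key(rec)
        if key is not None:
            index[key].append(rec)

    linked, unlinked = [], []
    for rec in portal_a:
        key = record_key(rec)
        matches = index.get(key, []) if key is not None else []
        if matches:
            linked.append((rec, matches))
        else:
            unlinked.append(rec)
    return linked, unlinked


# Invented sample data: one record with identifiers, one without.
portal_a = [
    {"institutionCode": "XYZ", "catalogNumber": "12345", "scientificName": "Bison bison"},
    {"scientificName": "Bison bison"},
]
portal_b = [
    {"institutionCode": "xyz", "catalogNumber": " 12345 ", "locality": "Wyoming"},
]

linked, unlinked = link_records(portal_a, portal_b)
print(len(linked), "linked;", len(unlinked), "unlinked")
```

The record without identifiers ends up in `unlinked` no matter how good the rest of its data is, which is the provenance problem in miniature.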

The Curious: BISON is apparently the US node of GBIF, which I had assumed meant it would be providing GBIF with US data. However, BISON appears to invert that model and serve as a US-centric mirror of GBIF. I hope that BISON becomes a platform through which US federally owned and managed biocollections can be made publicly discoverable, and I would be interested to hear from BISON reps whether there are any plans in place to do this.


About Andrea

Andrea is a Ph.D. student in Library and Information Science at the University of Illinois at Urbana-Champaign, and is supported by the Center for Informatics Research in Science and Scholarship.

9 Responses to This week in digitization: The good, the buggy, and the curious

  1. It would also be great for the data transcribers on Notes from Nature to know that their hard work is being put to good use; e.g., you could list active science and conservation projects that are likely to use the data. Those specimens could be used for Red List threat assessments, for example.

  2. Andrea, thanks for the thoughts on BISON. Yes, catalog numbers will be there in the near future. We’re also working with providers to come up with a more reliable solution for record citation, because catalog numbers are not guaranteed unique and many providers do not use them, but that will take a while given how the network functions. On the role as the US GBIF node: one of the functions of the Node is to make GBIF data more available and relevant to the country that it represents. Hence the map layers, visualizations, provider stats, etc. You’ll also notice that there are 1,766,275 points provided from BISON to GBIF. Those are new resources to GBIF and there are many times that number in the queue already. Please feel free to contact me for more information at any time.

    • Andrea says:

      Hey Stinger, thanks much for replying, and clarifying more on what it means to be a ‘node’. One question – is there a quick way to see what data is new to BISON, as opposed to harvested from GBIF?

      • If you go to the Search page (the Search Tab), go over to the right on the bar above the map, and hit “Refine Your Search”, you can select BISON in the provider list there. That gives you what BISON is sending out to GBIF, and all of those would be new to the network.

  3. I really appreciate this blog. As more and more nodes, projects, and funding schemes for digitization of collections emerge, I am overwhelmed and wondering if there isn’t simply too much duplication of effort. An anonymous collection manager expressed this frustration back in 2010 on the Community NSF ADBC Blog (scroll down to the comment). Has anyone created a flowchart of how all of these providers work together?

    To me, this looks like waste in the form of duplication of effort. Competition in the marketplace is generally a good thing, but this isn’t really a marketplace. None of these groups is going to sell their solution to the highest bidder. This is supposed to be a community effort to make biodiversity information that is currently hidden in closed data sets available to those who need it for research. Instead of so many essentially equal solutions, I would prefer to see a cooperative effort to perfect the project that is farthest along and ensure that those with data to contribute are aware of the project and have the ability to participate in it. There are still natural history institutions that are unaware of the existing search portals or if they are aware, they don’t have the staff, equipment, or both to get involved. The whole process is currently convoluted and confusing. Many of us need an “Easy” button, because we don’t have the time to wade through all of the technology and options. Some of us are lucky to have the time to care for the specimens that need to be digitized.

    In response to the “Why are we here?” post on the Community NSF ADBC Blog, Mary Barkworth points out the importance of public support.

    “It is important that there be public support of the initiative. This will make it easier to persuade legislators to support it and the development of related resources. This is not part of the initiative (except in the required “broader impacts”) but, unless we demonstrate the value of the resulting information to the public (ranging from JQ to private companies and educators), we shall have difficulty obtaining the ongoing support needed for maintenance of the national digital museum and the collections on which it relies.”

    Which brings me to my endless list of questions. What happened to the Community NSF ADBC Blog? Some excellent conversations were started there in 2010 and they just end without explanation. Who is the point person or agency in this effort? If I select one of the portals today, what guarantee is there that it will be there tomorrow? Are efforts being made to contact smaller institutions and sell the benefits of participation in digitization and search portals to them? Once sold, is there assistance available to ensure that they have access to equipment and personnel needed to complete the process? Who is reaching out to the public or helping institutions reach out? In some ways, I think this might be the place to start, because we need the public to understand the importance of what are, more often than not, their collections.

    • Andrea says:

      Hi, I agree on a lot of your points, but with a few caveats. While I completely, completely understand the need and desire for an “Easy Button” (particularly for busy, underfunded, understaffed, underappreciated collections managers, and particularly in smaller institutions), the unfortunate reality is that there currently is no such button, and there won’t be one for a while, and there won’t be one without a lot more work and experimentation, and, honestly, probably failures (but hopefully informative failures!). I too have mourned the lack of a flow chart tying all these resources together (in an ASIST poster I pulled together last year I took a stab at creating one, though I am certain it’s flawed; this might be something I could revisit and post here). However, I do think there are advantages to having multiple/redundant aggregators and systems. For instance, in that comment from the anonymous collections manager, she/he cries, “Down with the laissez-faire!” and points out that we’re not competing in a marketplace. Well… yes and no. I think “competing” to see who can build the best, most sustainable, and most accessible infrastructure has helped, and will continue to help, push technology and data publishing protocols/best practices forward. Multiple data publishers = multiple spaces to experiment with different things = multiple chances to get it right or prove something wrong.

      As far as the ADBC blog/funding stream goes, I believe that is what iDigBio eventually came out of (Rob would be better able to speak to this). I imagine the blog went away after the HUB was funded, but perhaps some of those conversations are continuing over at iDigBio? Or if not, perhaps you could get them started? 🙂

      • Thank you for the response. I guess what I am lamenting is that the “marketplace” isn’t truly free. There are a limited number of participants making decisions in somewhat isolated silos and some parts of the market are sitting in the dark or the shadows.

        Thank you for the links; I will read these as my time permits. I have participated in an iDigBio workshop, which led me to many of the questions that I posted here, some of which I posed directly to people at iDigBio. Regardless, the easy button is going to be a necessity because right now, just keeping up with current events in the world of digitization is a full-time job that most collections don’t have funding to fill. This was apparent in a recent iDigBio listserv conversation on biodiversity informatics managers.

    • Rob says:

      I very much understand this frustration about duplication of effort and all the different resources out there. I doubt I can answer all the questions above but let me tackle a couple. Before I do, I took a look at your excellent blog and really resonate with the ideas there about collaboration, about serendipitous meetings. I have a hope that maybe those things can be facilitated by technology, but I also know the immense value I place in meeting in person. Looking forward to reading more of your work.

      As for “what happened” since 2010 – the answer is… a lot. The community blog was set up to collect ideas before a national digitization center – a HUB – was selected. The HUB is called iDigBio and it’s done a lot to get the community talking about this challenge of digitizing natural history collections. It also is developing tools to help get those digitized data mobilized. It’s a tough job and I think it’s important we have the coordinating body doing the work.

      There are lots of different models to make something like digitization and mobilization happen. And I think Andrea’s point in the post is that there are some niches that are open and should be filled. BISON is focused on US-centric data. Other portals such as GBIF are more global. Other resources combine GBIF data and other sources of species distribution data and try to create more integrated models (e.g. Map of Life). And Notes from Nature fills a niche to help transcribe records so that they can one day be digitized. There is not One Application To Rule Them All. There are multiple applications that do need to “talk to one another”. In the best of all worlds, maybe the whole Internet is an open, linked database where it’s so much easier to discover content. I hope that becomes possible but for the here and now, I am good with a lot of different ideas coming forward and proliferating. Viva diversity. Viva biodiversity!

