A confluence of drawers

Confluence. n.  a.  the flowing together of two or more streams  b. the place of meeting of two streams  c. the combined stream formed by conjunction [Merriam-Webster online]

Drawer. n.   a sliding box or receptacle opened by pulling out and closed by pushing in [Merriam-Webster online]

Over the past few years, at many collections digitization workshops, one’s head (or at least my head) can get turned around by neat idea this or amazing technology that. It can get a little theoretical, or perhaps speculative-science-fiction-y, fast. But it raises the question: what are people doing in their collections, right now? What I have learned is that when it comes to pragmatic choices about space, money and efficiency, there is a lot of reason to be excited and to see, yes, confluences.

I hadn’t really realized how much digitization solutions are beginning to converge until I saw Vince Smith give a (great!) presentation at iEvoBio 2011 on digitizing collections at the Natural History Museum, London (NHML). I don’t want to over-paraphrase his talk, and the slides are excellent (from an earlier version of the talk: http://www.slideshare.net/vsmithuk/scalingup-collections-digitisation), but the gist of it was that at current rates, digitization would take a LONG time: thousands of years. So the folks at NHML are working with a company called SmartDrive. SmartDrive builds motorized cameras that move along a track above an object (such as a collections drawer), taking photos. Vince has been working with them to develop a system to photograph collections drawers at high resolution. More on the company’s hardware, software and approach is here: http://www.smartdrive.co.uk/satscancollections.html (note: not a pitch, just really interesting, with great images of collections drawers!).

The important thing is that with this technology, high resolution, stitched-together images can be generated relatively quickly, scaling down the time it takes to image all the collections drawers from thousands of years to less than ten.  This still leaves “snakes in jars” (see our previous S2I2 post) but we’ll come back to those at some point soon.  What is intriguing is that rather than conflict, we are experiencing _confluence_ in an area where there has been a lot of wailing and gnashing of teeth about how we’ll likely end up with a billion (YES, a BILLION) different solutions.

So what about this “confluence”? While in Australia hanging out with good friend Paul Flemons (note: currently fu manchu-less), he showed me a similar setup at the Australia National Museum. Again, the idea is to image collections drawers, this time using very high resolution cameras (100 megapixels). Similar approaches using Gigapan (http://www.gigapan.org/) are being pioneered by Andy Deans at North Carolina State University (see their excellent and “insectlent” blog here). And Paul Tinerella at the University of Minnesota (who is almost 100% likely to be farmer-goatee-ed at this moment) is using a similar solution to first scan many slides of mounted insects en masse, and then automate the disassembly of these scans into single images of a specimen. The specifics of how the cameras move over the drawers or a set of slides may be different, but the general idea is the same:

Capture a drawer or slide collection quickly → disassemble the IMAGE into pieces → capture labels → move data further downstream → etc.

Confluence.  This is good.
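For the code-minded, here is a minimal sketch of the "disassemble the IMAGE into pieces" step from the workflow above. It assumes (and this is purely my simplification for illustration, not how any of the systems mentioned here actually work) that a drawer image can be cut along a regular grid, for example one cell per unit tray. The function name `tile_boxes` and the grid parameters are hypothetical; a real system would more likely detect specimen or tray boundaries in the image itself.

```python
from typing import List, Tuple

# A crop box in pixel coordinates: (left, upper, right, lower).
Box = Tuple[int, int, int, int]


def tile_boxes(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split a width x height image into a rows x cols grid of crop boxes.

    Assumption: the drawer is laid out as a regular grid. Integer division
    keeps every pixel in exactly one box, even when the image size is not
    an exact multiple of the grid size.
    """
    if rows <= 0 or cols <= 0:
        raise ValueError("rows and cols must be positive")
    boxes: List[Box] = []
    for r in range(rows):
        upper = r * height // rows
        lower = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            boxes.append((left, upper, right, lower))
    return boxes


if __name__ == "__main__":
    # Hypothetical drawer image of 100 x 60 pixels holding a 2 x 2 grid of trays.
    for box in tile_boxes(100, 60, rows=2, cols=2):
        print(box)
```

Each box uses the same (left, upper, right, lower) ordering that image libraries such as Pillow expect for cropping, so the pieces could then be cut out and handed to the label-capture step one at a time.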

So what does all this mean? Well, there are still challenges, especially for insects, where the specimen often occludes the label when viewed from the top. But assuming cameras can move all around specimens to generate photos, the answer is that there may be a fast method to capture LOTS of high resolution data in drawers. Since Andie is spending part of her summer looking at thousands of little clams stuffed in such drawers, and Rob has even worked on similar clams in the collection he curates, and since there are hundreds of other collections folks doing the same thing, this is a big step forward.

What challenges remain? Tons. How are we going to unlock data from a 500MB image of a drawer and use those data most effectively? Argument: Data need to be machine readable and properly documented to maximize their use and re-use. Period. Images are not so good for that! If biocollections data have further utility in new kinds of science, it likely relates to having the what, where, when (taxonomy, location, date) information readily available as simple text that is interoperable with other sources of environmental data. What is _excellent_ (so good I am waving my hands in the air with enthusiasm) is that many people are beginning to talk about similar solutions to this challenge of converting the image of a label to text. That is the subject of our next blog posting, but to presage it, we’ll just say two words: crowd sourcing. And two other words: Old Weather. See if you can connect the dots!

About Rob

Three "B's" of importance: biodiversity, bikes and bunnies. I get to express these "B's" in neat ways --- I bike to a job at the University of Florida where I am an Associate Curator of Biodiversity Informatics. Along with caretaking collections, I also have a small zoo at home, filled with two disapproving bunnies.

10 Responses to A confluence of drawers

  1. We can already convert images of labels to text. That’s not a challenge, although we need to re-examine how our tools scale (in fact, they don’t scale very well, but they do work). We have proved the concept time and time again. The challenge is more in NLP and parsing once we have the OCR/NHR output. It was a given that cameras would get more and more sophisticated in a very short time, but if the goal is to capture text, then you don’t even need to go that high. OCR software works really well starting at 250-300 dpi. Of course we do want excellent images regardless of OCR. Rotating cameras are great and the fact is, technology is moving faster than we can catch up with. So, the issue here is that soon we will be able to solve the problem of how to capture all we need inside a drawer, in terms of getting great images and label data, even from tiny bits of paper. Then what? Tell our friend Brian Heidorn to get moving!

    • Rob says:

      Really glad to get the comment, Nico! I don’t think OCR is going to be the answer, though. Maybe. Certainly it’s part of the puzzle. But the solution I think will be much more cost effective, rapid and scalable is… well, it’s the HINT at the end of the blog posting. I can’t give it away yet.

  2. Well, I can’t wait to see your magic then. Whether you are using crowd sourcing in the cloud or on the ground, I think OCR is still valuable to get the data out in the mix. Post-partum data curation (as I call it) is another issue. Somehow I think crowd sourcing is only a temporary band-aid.

  3. Cyndy Parr says:

    Need a credible, citable reference on this stuff FAST! You know why.

    And BTW I get the Old Weather ref because I tweeted it a while back 🙂

  4. Pingback: #ievobio Day 2 – Carl Boettiger

  5. Rob says:

    2011 iEvoBio presentation by Vince Smith now on slide-share: http://www.slideshare.net/vsmithuk/no-specimen-software-left-behind

  6. Paul Flemons says:

    Hi all,
    Just to let you know we now have a crowdsourcing transcription site up through the Atlas of Living Australia – for transcribing insect labels – cicadas in the first instance – but this will be expanded out if the project proves to be successful.
    Check it out at volunteer.ala.org.au and distribute as you see fit – the more transcribers the better – if it proves successful it will be expanded so get transcribing – get all your students transcribing – get your kids – get your cats – get your rabbits – we need them all!!!.
    Paul Flemons

  7. Tenerife69 says:

    Muy buen post, me ha gustado, gracias. Good Post. Thank you.
