Quick recap: Last time around, we talked about a confluence of drawers. At every digitization meeting we’ve attended, there is a pall of gloom whenever discussion turns to digitizing things like entomological collections: drawers dense with delicate specimens that ever-so-inconveniently obscure their tags. There is so MUCH to do! It’s so difficult to do! How do we make progress? So we have a simple thesis: progress in imaging techniques and efficiency is no longer the issue. Lots of people are exploring not-so-complicated imaging approaches that take very high-resolution photos of entire drawers (be they of insects or clams or chipmunk skulls or Triassic fossils, etc.). Doing so will likely yield around a thousand-fold increase in efficiency over current methods, leaving museums with hundreds of thousands of images of many tens of millions of objects. We assert that converting the information on labels from images to machine-readable text is essential – and we don’t think many folks would disagree with us. But how?
At the end of the last post we left you with two sets of words: “crowdsourcing” and “Old Weather.” And many of you likely already figured out the denouement to our cliffhanger and are crying, “UGH, crowdsourcing, really?” But wait! Listen: There are two legitimate possibilities for converting the words on labels from images to good, old-fashioned, copy-and-paste-able text. One is Optical Character Recognition (OCR), and techies all over the world love and hate this approach. Love it because it’s so COOL – a magic machine that “reads” the shape of letters from an image and in doing so transforms pixels into movable type. The “hate” part: YOU SHOULD SEE SOME PEOPLE’S HANDWRITING! The other approach: good old-fashioned human transcription. So, Humans versus Machines, right? Just the kind of battle we love!
Which approach works best is probably situation-dependent, but we are going to be a little provocative (as opposed to necessarily right) and argue that in most cases – especially those involving the digitization of handwritten labels or notes – crowdsourced human transcription beats OCR. So, what’s the problem with OCR? What we have noticed in many, many talks and presentations is that OCR is rarely as simple to set up, run, or do quality control on as it initially seems. Error rates remain high, requiring time and effort from well-paid professionals to tune the system and check the output. The result is that we wind up spending a lot of time working around machines instead of with them. Crowdsourced human transcription, on the other hand, plays to both humans’ and computers’ strengths: computers swiftly transport and store data, and humans use their discerning eyes to tell the computers what data to store.
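To make that hidden cost concrete, here is a minimal sketch of the “just run OCR on it” route, assuming the open-source Tesseract engine and its pytesseract wrapper are installed; the image file and the review step are hypothetical, not any particular museum’s pipeline.

```python
# Minimal sketch of the "just point OCR at it" route, assuming Tesseract
# and the pytesseract wrapper are installed. The image path is hypothetical.
from PIL import Image
import pytesseract

label = Image.open("drawer_042_label_17.png")  # a cropped label from a drawer image

# Tesseract handles clean, printed labels reasonably well...
text = pytesseract.image_to_string(label)
print(text)

# ...but a handwritten locality label often comes back as noise, so a
# well-paid professional still has to check (and usually retype) every
# field before it reaches the database -- the workaround cost described above.
```

Nothing in those few lines is hard; the expense is all in the checking and fixing that follow.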
More than just being a way to get labels into databases, keeping humans in the digitization loop – in a function other than error checking – has a lot of great side effects that OCR simply can’t match. Chris Lintott, one of the original Zooniverse PIs (more on Zooniverse in just a second), gave a great talk at IDCC 2010 in Chicago last winter. A quick summary follows, but if you can spare half an hour we’d suggest giving his talk a listen. Dr. Lintott explains:
1) Human transcription increases the chances of serendipitous discovery – would a machine be able to call attention to Darwin’s marginalia?
2) Crowdsourcing can inspire impromptu collaboration amongst strangers – message boards give volunteers a place to do their own coordinating
3) Crowdsourcing necessarily means that you are staying engaged with a crowd – a group of unpaid strangers who care enough about your science to do it for free. In early polls of Galaxy Zoo volunteers, over 50% said they contributed because they just liked helping scientists.
This latter point is huge. While the phrase “staying engaged with a crowd” sounds frighteningly close to PR talk, in this case it just means you are keeping people excited about science! It means you are showing that your collections have scientific AND social worth. And it means that we are sharing our awesome, enviable vocations with folks who have the interest but maybe not the luck to work in museums themselves.
So, for those who didn’t already jump to the (correct) conclusion, the other two words we left you with at the end of the last post – “Old Weather” – name an excellent example of productively fun human-computer interaction that links experts, volunteers, and data (and is brought to you by Zooniverse and Vizzuality). Old Weather’s mission statement is simple: users log on to “Help scientists recover worldwide weather observations made by Royal Navy ships around the time of World War I. These transcriptions will contribute to climate model projections and improve a database of weather extremes. Historians will use your work to track past ship movements and the stories of the people on board.”
The team at Vizzuality, a company that really excels at producing web-based applications that are deep, beautiful, and functional, has turned the transcription of early 20th-century ships’ logs… into a video game. The interface is brilliantly designed; it is very easy to get started; reward systems are in place to keep people engaged. Hundreds of thousands of people are now individually enjoying a game built around digitization. In less than a year, 555,905 pages of logs have been digitized. Ships’ log pages are a lot more detailed and heterogeneous than specimen labels, so maybe getting tens or hundreds of millions of labels captured is a possibility using smartly deployed crowdsourcing?
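We won’t pretend to describe Old Weather’s internal workflow here, but one common way crowdsourced transcription handles quality control is redundancy: show the same page or label to several volunteers and accept a field only when enough of them agree, routing the rest to an expert. A toy sketch of that idea (the function name, threshold, and example values are ours, purely for illustration):

```python
# Toy sketch of a common crowdsourcing quality-control idea (not a
# description of Old Weather's actual pipeline): collect several volunteer
# transcriptions of the same label field and keep the majority reading.
from collections import Counter

def consensus(transcriptions, min_agreement=0.6):
    """Return the most common reading if enough volunteers agree,
    otherwise flag the field for expert review."""
    counts = Counter(t.strip() for t in transcriptions)
    best, votes = counts.most_common(1)[0]
    if votes / len(transcriptions) >= min_agreement:
        return best, "accepted"
    return None, "needs expert review"

# Hypothetical example: three volunteers read the same locality label.
print(consensus(["Gainesville, FL", "Gainesville, FL", "Gainsville FL"]))
# -> ('Gainesville, FL', 'accepted')
```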
For many, crowdsourcing is a scary proposition; it requires a fundamental rewiring of how we deal with our data – one that forces us to be more open and inclusive, and to think beyond our physical labs, collection drawers, and perhaps even institutional identities. And furthermore – we fully realize that it’s a solution not without drawbacks. Two that seem obvious to us: 1) crowdsourcing requires organization and institutional support, and we are painfully aware that not every museum has these; and 2) it raises ownership issues and concerns about control over process and products. We are sure there are more, so please tell us what flaws you see.
Even given drawbacks, this sort of broad outreach is truly necessary if we want to meet grand challenges like the 100% digitization of natural history collections – and if we want to continue proving our collections’ worth in an age of budget cuts, recession, and the folding of previously untouchable symbols of American research and ingenuity. Crowdsourcing collections digitization gives us an opportunity to fulfill the fundamental promise of digitization: that it will improve access, use, and integration of biocollections! Excited? We are too, especially given that Zooniverse is actively seeking new denizens – er, projects.
To summarize: Skip OCR. Bring images to the crowd and make it fun. Better yet, bring images to the untapped resources attached to University museums and collections: swaths of Farmville-addicted undergraduates in lower division biology classes. Integrate these projects with life science classes AND with Facebook – there is real potential here!
Ok, crowd, tell us what you think! We’ll buy, well, not a space shuttle, but maybe something natural history-esque for the 25th (!!) commenter (comments MUST be relevant to the topic at hand!). Prizes will also be awarded for the funniest relevant and the deepest relevant comments.