Old Weather’s Crowd and the Challenge of Digitization

Quick recap: Last time around, we talked about a confluence of drawers. At every digitization meeting we’ve attended, there is a pall of gloom whenever discussion turns to digitizing things like entomological collections: drawers dense with delicate specimens that ever-so-inconveniently obscure their labels. There is so MUCH to do! It’s so difficult to do! How do we make progress? So we have a simple thesis: progress in imaging techniques and efficiency is no longer the issue. Lots of people are exploring not-so-complicated imaging approaches that take very high-resolution photos of entire drawers (be they insects, clams, chipmunk skulls, Triassic fossils, etc.). Doing so will likely yield around a thousand-fold increase in efficiency over current methods, leaving museums with hundreds of thousands of images of many tens of millions of objects. We assert that converting the information on labels from images to machine-readable text is essential – and we don’t think many folks would disagree with us. But how?

At the end of the last post we left you with two sets of words: “crowdsourcing” and “Old Weather.” Many of you likely already figured out the denouement to our cliffhanger and are crying, “UGH, crowdsourcing, really?” But wait! Listen: there are two legitimate possibilities for converting the words on labels from images into good, old-fashioned, copy-and-paste-able text. One is Optical Character Recognition (OCR), and techies all over the world love and hate this approach. Love it because it’s so COOL – a magic machine that “reads” the shape of letters from an image and in doing so transforms pixels into movable type. The “hate” part: YOU SHOULD SEE SOME PEOPLE’S HANDWRITING! The other approach: good old-fashioned human transcription. So, Humans versus Machines, right? Just the kind of battle we love!

Which approach works best is probably situation-dependent, but we are going to be a little provocative (as opposed to necessarily right) and argue that in most cases – especially those involving the digitization of handwritten labels or notes – crowdsourced human transcription beats OCR. So, what’s the problem with OCR? What we have noticed in many, many talks and presentations is that OCR is rarely as simple to set up, run, or quality-control as it initially seems. Error rates remain high, requiring time and effort from well-paid professionals to tune the system and check its output. The result is that we wind up spending a lot of time working around machines instead of with them. Crowdsourced human transcription, on the other hand, plays to humans’ and computers’ respective strengths: computers swiftly transport and store data, and humans use their discerning eyes to tell the computers what data to store.
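
To make this concrete, here is what the “magic machine” looks like in practice: a minimal sketch using the open-source Tesseract engine through its Python wrapper, pytesseract (the label image filename is hypothetical).

```python
# A minimal OCR sketch, assuming Tesseract and pytesseract are installed;
# "drawer_label_0042.png" is a hypothetical label image, not a real file.
from PIL import Image
import pytesseract

label = Image.open("drawer_label_0042.png")
print(pytesseract.image_to_string(label))
```

Three lines of real work, and crisply typed labels often come out fine; the trouble starts with faded ink, looping script, and labels half-hidden under a pinned beetle, which is exactly where the tuning and quality control land on those well-paid professionals.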

More than just being a way to get labels into databases, keeping humans in the digitization loop – in a function other than error checking – has a lot of great side effects that OCR simply can’t match. Chris Lintott, one of the original Zooniverse PIs (more on Zooniverse in just a second), gave a great talk at IDCC 2010 in Chicago last winter. A quick summary follows, but if you can spare half an hour we’d suggest giving his talk a listen. Dr. Lintott explains:
1) Human transcription increases the chances of serendipitous discovery – would a machine be able to call attention to Darwin’s marginalia?
2) Crowdsourcing can inspire impromptu collaboration amongst strangers – message boards give volunteers a place to do their own coordinating
3) Crowdsourcing necessarily means that you are staying engaged with a crowd – a group of unpaid strangers who care enough about your science to do it for free.  In early polls of Galaxy Zoo volunteers, over 50% said they contributed because they just liked helping scientists.

This last point is huge. While the phrase “staying engaged with a crowd” sounds frighteningly close to PR talk, in this case it just means you are keeping people excited about science! It means you are showing that your collections have scientific AND social worth. And it means that we are sharing our awesome, enviable vocations with folks who have the interest but maybe not the luck to work in museums themselves.

So, for those who didn’t already jump to the (correct) conclusion, the other two words at the end of the last post – “Old Weather” – name an excellent example of productively fun human-computer interaction that links experts, volunteers, and data (and is brought to you by Zooniverse and Vizzuality). Old Weather’s mission statement is simple: users log on to “help scientists recover worldwide weather observations made by Royal Navy ships around the time of World War I. These transcriptions will contribute to climate model projections and improve a database of weather extremes. Historians will use your work to track past ship movements and the stories of the people on board.”

The team at Vizzuality, a company that really excels at producing web-based applications that are deep, beautiful, and functional, has turned the transcription of early 20th-century ships’ logs… into a video game. The interface is brilliantly designed; it is very easy to get started; reward systems are in place to keep people engaged. Hundreds of thousands of people are now enjoying a game built around digitization. In less than a year, 555,905 pages of logs have been digitized. Ships’ log pages are a lot more detailed and heterogeneous than specimen labels, so maybe capturing tens or hundreds of millions of labels is a possibility using smartly deployed crowdsourcing?

For many, crowdsourcing is a scary proposition; it requires a fundamental rewiring of how we deal with our data – one that forces us to be more open and inclusive, and to think beyond our physical labs, collection drawers, and perhaps even institutional identities. And furthermore – we fully realize that it’s a solution not without drawbacks. Two that seem obvious to us: 1) crowdsourcing requires organization and institutional support, and we are painfully aware that not every museum has these; and 2) it raises ownership issues and concerns about control over process and products. We are sure there are more, so please tell us what flaws you see.

Even given these drawbacks, this sort of broad outreach is truly necessary if we want to meet grand challenges like the 100% digitization of natural history collections – and if we want to continue proving our collections’ worth in an age of budget cuts, recession, and the folding of previously untouchable symbols of American research and ingenuity. Crowdsourcing collections digitization gives us an opportunity to fulfill the fundamental promise of digitization: that it will improve access, use, and integration of biocollections! Excited? We are too, especially given that Zooniverse is actively seeking new denizens – er, projects.

To summarize: Skip OCR. Bring images to the crowd and make it fun. Better yet, bring images to the untapped resources attached to university museums and collections: swaths of Farmville-addicted undergraduates in lower-division biology classes. Integrate these projects with life science classes AND with Facebook – there is real potential here!

Ok, crowd, tell us what you think! We’ll buy, well, not a space shuttle, but maybe something natural history-esque for the 50th, er, 25th (!!) commenter (comments MUST be relevant to the topic at hand!). Prizes will also be awarded for the funniest-relevant and deepest-relevant comments.


About Rob

Three "B's" of importance: biodiversity, bikes and bunnies. I get to express these "B's" in neat ways --- I bike to a job at the University of Florida where I am an Associate Curator of Biodiversity Informatics. Along with caretaking collections, I also have a small zoo at home, filled with two disapproving bunnies.

33 Responses to Old Weather’s Crowd and the Challenge of Digitization

  1. Judith says:

    I think the first point that would concern me, in an age of bean-counters and performance measures, would be turn-around time. My manager wants a count of how many new records I have completed this month. Do I tell him I posted 100 images to Zooniverse, you’ll get ’em when they’re done?

    • Andie says:

      Hey Judith! Well, I think that’s what we mean when we say this sort of project would require institutional/administrative support, to a certain extent. But I also think there would be ways of working around the lag time between submission of images and transcription — either reporting records as “done” only after they’ve been created, or working with the Zooniverse folks to improve the turnaround time for projects with deadlines. Does that sound like a workable solution?

      • Judith says:

        Danged if I know, I’m an old crone in this business who got the job because my handwriting was better than the other applicant’s. I could thus be trusted to write with India ink in the catalogue ledgers. You’re gonna have to give me a really good demonstration and a little while to wrap my head around the permutations!

        Then come up with a nice blanket solution for little crustaceans in little vials of alcohol…

  2. Michael Wilson says:

    Video game citizen science: so awesome, words do not describe it (Splarooom! is the best I can come up with to describe the feeling of playful synergy involved). Have you guys considered working with Jane McGonigal on this? http://blog.avantgame.com/ is her blog about creating games that actually attempt to improve the world.

  3. Most of the successful examples of crowdsourcing museum/science-based work that I’ve seen involve tapping pre-existing communities of either professional or avocational workers; this is partly an issue of expertise, which you want, but it also implies prior investment in the field by the crowd. Using crowdsourcing as a genuine outreach tool would require building communities, a much more challenging task and one which may require more hands-on engagement by museum staff, at least in the early stages. Unless, of course, you manage to invent the taxonomic equivalent of WoW as a hook….

  4. Melissa Barton says:

    I could have sworn there’s a natural history museum in the UK that has a crowdsourced label transcription project set up…but I can’t remember which one, and my googlefu is failing me.

    However, if it works for genealogists, I don’t see why it might not work for natural history museums; it would just be relying on natural history nerds rather than family history nerds, instead of trying to reach people who need a game structure (although that’s not necessarily a bad idea). 🙂

    As someone who learned how to recognize TDA Cockerell’s handwriting from six feet away, I love the marginalia on collections labels (especially older ones); it’s not like the process itself isn’t interesting to some people.

    (This discusses a successful genealogy transcription project, as well as many other transcription projects, museum and otherwise.)

  5. Sally Shelton says:

    The USGS Patuxent crew is running what appears to be a very successful crowdsourcing program for digitizing years’ worth of bird observations mailed in from all over the country on postcards during the first half of the 20th century. It’s an astonishing bit of citizen science at its best, and the crowdsourcing is also impressive. Because there are two viewers on each scanned postcard, errors can be quickly caught. I’d like to see data on the efficacy of this vs. OCR. OCR errors on critical holographic databases, such as census records, are legion.

    So I would use crowdsourcing if, as Chris says, the crowd was largely professional or avocational in our field, and if it could be set up the same way as the bird project.

  6. Andy Bentley says:

    I think human transcription has a lot more promise than OCR ever will (having seen some of that gnarly handwriting first hand) and think that the integration of projects like Apiary (where numerous data entry people are used to “validate” a particular entry) would allay a large amount of the initial fears that anal-retentive collections people have about this sort of thing. There are indeed numerous examples out there of projects that have used human transcription to great effect. The major issue is getting these “humans” interested in doing something like this. I agree with other posters that turning this into a game has great potential. Something like Farmville, where people only get to add animals to their farm/zoo/aquarium once they have transcribed the data associated with that specimen from a collection somewhere. If you have a particular affinity for a certain type of animal/plant/organism, you can stock your farm/zoo/aquarium with as many of that thing as you like after you have transcribed that many label records…
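
    The “validate” step Andy mentions can be as simple as majority voting across independent keyings; a minimal sketch (the transcriptions and threshold below are illustrative, not Apiary’s actual method):

    ```python
    # A minimal sketch of multi-keying validation: accept a transcription
    # only when enough independent volunteers agree on it.
    from collections import Counter

    def consensus(keyings, required=2):
        """Return the majority text if at least `required` volunteers
        agree; otherwise None, flagging the record for expert review."""
        text, votes = Counter(k.strip() for k in keyings).most_common(1)[0]
        return text if votes >= required else None

    print(consensus([
        "Vanessa atalanta, Gainesville FL, 1912",
        "Vanessa atalanta, Gainesville FL, 1912",
        "Vanessa atalanta, Gainesvile Fl, 1912",  # lone typo is outvoted
    ]))
    ```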

  7. Rob says:

    After much thought, we have reduced the number of comments needed for PRIZES to 25! 25! Including this comment, it’s 24! So keep ’em coming!

  8. The University of Iowa (UI) Libraries is currently crowdsourcing transcription of its holdings of U.S. Civil War letters and diaries with great success. Check it out at: http://digital.lib.uiowa.edu/cwd/transcripts.html. Libraries are much further ahead with digitizing I think, so it’s definitely worth seeking their collaboration. I am hoping to set up field notebook crowdsourcing and paleontology specimen image access through the UI Libraries. For label transcription, I would happily involve local fossil enthusiasts who are familiar with the field locations, stratigraphy, and taxonomy.

  9. In addition to Old Weather and the USGS Bird Phenology Program (itself an amazing example for motivating and engaging volunteers), I just became aware of a specimen label transcription project at the Atlas of Living Australia (see the Australian Museum Cicada Expedition site or the introductory video).

    Balboa Park Online Collaborative has been using my own software to transcribe herpetology field notes, and I hope that we’ll roll out some tools for date-and-place tagging of species observations within narratives for a different project in a few months.

    There is a lot of work being done on volunteer motivation for these projects, and not all research points to gamification as the solution to engagement issues. Other factors–connection to the mission, sense of doing real work, collaboration with fellow volunteers, and the immersive nature of (some) transcription work–may be more motivating, and at least a few of them may be in direct conflict with game features.

    • Rob says:

      These are GREAT comments, Ben, especially on the issue of how to get the most people involved and discerning their motivations. The “make it fun” model might work where minimal effort is put into connecting with the community doing the work. With more directed engagement, and with the development of smaller, local, or virtual communities, maybe these other factors start to rise in importance. I don’t know the answers here, and perhaps the important thing is knowing which questions to ask in this rapidly emerging area.

      • Well, to a large extent I believe that the right questions to ask stem from the kind of data you wish to digitize and the kinds of things you want to do with it.

        Alexandra Eveleigh has been researching volunteer motivation and is the real expert here, but in my opinion you really need to look carefully at the manuscript materials themselves:

        In the case of the Julia Brumfield Diaries, the manuscript is a narrative. Transcription is a form of deep reading, and transcribing something like a diary can be an incredibly immersive experience, particularly if the volunteer can follow their own natural–usually chronological–workflow. Gamification practices like randomizing the images to be displayed (in order to get good double-blind coverage) disrupt this flow, whereas message boards or annotation tools can foster a community of volunteers checking each other’s work and researching each other’s problems. The structure of a diary also allows volunteers to keep track of their progress easily, without the need for extrinsic scorecards and leaderboards.

        On the other end of the narrative axis you find single-word OCR correction like DigitalKoot or ReCAPTCHA. In these cases there is no narrative flow to be preserved, no intrinsic motivation for a volunteer to move on to the next word, and indeed no way for a volunteer to tell whether they’ve met their own target for a transcription session. Extensive, explicit gamification (including graphical lemmings) is the solution DigitalKoot has adopted, while ReCAPTCHA has inserted itself as a hurdle in front of many third-party webforms. Tasks are randomized with no harm to volunteer motivation in that case.

        Both OldWeather and the North American Bird Phenology Program are located somewhere in between these extremes. Ships’ logs are not terribly immersive — it’s hard to identify with a midshipman making observations, especially when that midshipman changes with each watch. However, they have a chronological structure which the Zooniverse/Vizzuality folks have managed to enhance through their mapping tools. A volunteer can follow the course of the ship over time, immersing themselves in the ship itself as a sort of protagonist in a narrative. Again, this could be destroyed through randomization; however, it’s possible that a sufficiently compelling game could yield the same quantity of transcription for the institution (although not for the volunteers!).

        The NABPP, on the other hand, is digitizing records that track a single ornithologist’s observations of a single species over the course of a single season. This does lend itself to randomization, but that doesn’t mean that gamification is the right strategy. In fact, the observation cards are quite immersive–I say this as someone who’s not a bird-watcher–and a volunteer can put themselves in the position of the observer very easily. I found the experience akin to a sort of time travel — peeking in on a single year, observing a single bird, in a single place far away from my home. While there are leaderboards and such, the USGS folks do a great job engaging with that experience: each of their newsletters contains a profile of an observer (sketching a picture of the sort of people who filled out the manuscript index cards) and also a profile of a volunteer (connecting volunteers with each other in a non-competitive way). I doubt that a purely game-like implementation would achieve the success that either of these programs has seen.

        There are other axes in addition to immersiveness: family diaries have a limited (but highly motivated!) pool of interested people to draw volunteers from, whereas Civil War diaries may suffer from an abundance of volunteers. Similarly, citizen science projects that connect their mission to the study of hot-button issues like climate change may command greater motivation than those with no purpose that can be easily articulated to the public. No project can succeed if it’s perceived to waste volunteers’ time, so the tasks must be appropriate (thus having users correct OCR is more likely to succeed than having users transcribe easily-OCRed text from scratch). And of course the nature of the data itself–whether structured (e.g. census records), free-form (e.g. correspondence), or a mix (e.g. field notes)–will also inform design decisions regarding engagement.

  10. Nico Cellinese says:

    Rob, I like your post and I feel your excitement. However, I see crowdsourcing as one of the solutions rather than ‘THE’ solution, so I disagree with how you portray it as being the opportunity ‘to fulfill’. It is indeed an opportunity that, integrated with other approaches, will certainly provide an ‘enormous’ contribution. OCR cannot be discounted, and it does work, even if not in all cases. With handwriting you can’t use OCR and have to rely on alternative software like NHR, which is also getting more sophisticated. Typing directly from images will not avoid data curation issues, so the same way you need to check OCR outputs, you will need to check crowdsourcing outputs. Additionally, crowds are renowned for being inconsistent. People get tired after the initial excitement, and you will get a constant flow of people coming in and moving out. Nothing about this is bad, just issues that need to be considered. I like the idea of gaming, although that would require an infrastructure we don’t have in place yet and, as you state, institutional support. Therefore, I would favor capitalizing on every approach we have available that potentially works rather than putting all our eggs into one basket. I would not skip OCR as you advocate, but just add crowdsourcing as another player in the field. The more the better.

    • Rob says:

      Hey Nico, good that you found the post provocative (as opposed to right)! Since we lack data on the costs and benefits of OCR versus crowdsourced transcription in our domain, it’s hard to really say that one is always better than the other, or, more importantly, when to deploy OCR and/or crowdsourcing. They also aren’t mutually exclusive, and Andrea has passed along (by way of one of her colleagues) a great example of crowdsourcing to correct OCR’d documents (http://www.dlib.org/dlib/march10/holley/03holley.html)! We can talk about monetary costs all we want, but what has me excited about crowdsourcing is the stuff Andie added to the post — the intangible benefits of having a more participatory process. Even if OCR were to win the day on costs when they were all tallied up (and I mean ALL of them), I wonder if I might still be aligned with crowdsourcing unless there was a huge disparity in costs. My guess is these disparities are NOT huge, or they favor crowdsourcing, especially at scale. Anyway, I think we might both have avenues to explore here, so it’ll be fun to compare notes one day.

    • “People get tired after the initial excitement and you will get a constant flow of people coming in and moving out.”

      I’ve got to disagree here. I think that most projects I’m aware of have found that a solid core of volunteers stick around and continue contributing after any initial waves of publicity have passed. You’re 95% correct in that the vast majority of people dabble a bit and move on. However, it’s the tiny fraction who make the work a part of every day who you want to encourage, or at least avoid driving off.

      • Nico Cellinese says:

        Great! Then I can’t wait to see how that tiny fraction scales across large digitization efforts.

      • Well, in addition to the often-quoted statistic about the huge proportion of edits to Wikipedia that are made by a tiny fraction of users, there’s also the experience of the Bird Phenology program: this chart shows that half of the 400K+ observation cards were transcribed by the top 4 or 5 volunteers. I think that a power law distribution is going to be pretty typical of any project, and have even seen it in my own tiny efforts.
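
        The arithmetic behind that pattern is easy to play with; a toy sketch with invented contribution counts:

        ```python
        # A toy illustration (all counts invented) of a power-law-ish
        # distribution: a handful of volunteers do most of the work.
        counts = sorted([52000, 48000, 41000, 35000, 30000] + [40] * 3000,
                        reverse=True)
        total, top5 = sum(counts), sum(counts[:5])
        print(f"top 5 of {len(counts)} volunteers: {top5 / total:.0%} of {total} cards")
        ```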

  11. Nico Cellinese says:

    My point is that we can do both and do not have to choose one over the other. Skipping OCR would not be a good idea because in some use cases it may be better than crowdsourcing and vice versa. I’m not even talking about cost. If they were equally expensive I would still implement both approaches in parallel. Get a chunk of data out to the crowd and, while that gets done, get the machines to work too. I guess I’m a multitasker.

    • Rob says:

      I like the model where crowdsourcing is (in most cases) in the digitization workflow someplace but not the only solution for initial capture. Good cases are being made for initial OCR where it is best deployed, with crowdsourcing to help with error checking and perhaps data enhancement. For example, I love the idea of georeferencing (assigning latitude, longitude, and error) as part of the crowdsourcing step. I do still have concerns about how well OCR will perform on drawers, even with typewritten labels that might be partially obscured, not flat, and deeply shadowed by specimens. Much easier for “flat” scannable docs such as herbarium sheets or printed pages, perhaps.
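
      For that georeferencing step, the record a volunteer produces could be as small as this; a minimal sketch (field names are illustrative, loosely echoing Darwin Core terms):

      ```python
      # A minimal sketch of a crowdsourced georeference: verbatim label
      # text plus volunteer-assigned coordinates and an error radius.
      # Field names are illustrative, loosely echoing Darwin Core.
      from dataclasses import dataclass

      @dataclass
      class Georeference:
          verbatim_locality: str   # locality string as transcribed
          latitude: float          # decimal degrees
          longitude: float         # decimal degrees
          uncertainty_m: float     # error radius in meters

      print(Georeference("3 mi NE Gainesville, Fla.", 29.705, -82.280, 5000.0))
      ```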

      • Nico Cellinese says:

        Indeed! And that’s why I think it depends on the use case. OCR won’t work that well on labels pinned under insects but may work just great on printed ledgers or herbarium sheets. Handwriting is an issue for both NHR and crowds. For example, old herbarium labels are hard for non-experts to decipher, not only because of the often illegible handwriting but also because of old foreign place names that are hard to interpret. I just think that every use case needs to be evaluated and dealt with using the most appropriate approach, which may indeed consist of both OCR and crowdsourcing, and maybe something else we don’t know about yet.

  12. Hi all,

    It is great to see all this interest in this idea. First of all, I work at Vizzuality, which works with Zooniverse on Old Weather.

    The idea is not brand new, as has been commented. There are several projects doing this already for different kinds of collections, for example Herbaria@Home (http://herbariaunited.org/atHome/).

    Now, the great thing about GalaxyZoo and Zooniverse is the capability to bring in a very big community and keep them motivated to do the work. To do so, basically you need to take care of interfaces (think of gamification but also user engagement) and give a lot of love to the community (forums, showing them the science behind it, etc.).

    Regarding the quality of the data, we are finding that the results of crowdsourcing actually beat the results of “experts” doing the work. People get tired, but if you manage to get tens of different people to digitize the same thing, the balanced result is better. Motivation is also a great factor: in Old Weather, for 97% of the data items digitized, all 3 people have written exactly the same thing, so people really take care to write what is there.

    The Citizen Science Alliance (CSA, http://www.citizensciencealliance.org/) is the organization behind OldWeather, PlanetHunters, and a lot of those GalaxyZoo projects. They just opened a call for proposals for citizen science projects, and I know some about biological collection label transcription have been submitted.

    The question is… can we as a community coordinate enough to use citizen science as a service to digitize our collections faster? Can we agree on an infrastructure where money from collections is spent on digitizing collections and the transcription is handled by projects like this? Of course there will be other layers of data cleaning after the transcriptions are done (like taxonomy reconciliation, location reconciliation, etc.), and of course other methods like OCR might help.
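
    Even a crude reconciliation pass can catch many volunteer typos; a minimal sketch of fuzzy-matching transcribed names against an accepted list (the names below are just examples):

    ```python
    # A minimal sketch of taxonomy reconciliation: fuzzy-match volunteer
    # transcriptions against an accepted-names list. Names are examples.
    import difflib

    accepted = ["Vanessa atalanta", "Vanessa cardui", "Danaus plexippus"]

    def reconcile(name, cutoff=0.8):
        hits = difflib.get_close_matches(name, accepted, n=1, cutoff=cutoff)
        return hits[0] if hits else None  # None = route to a human expert

    print(reconcile("Vanesa atalanta"))  # -> "Vanessa atalanta"
    ```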

    I will be talking, hopefully, at TDWG about this, and we are looking for partners to start a project like this. We need:

    - Some funding (CSA already has some).
    - Institutional partners that can bring data.
    - Scientists looking to get data quickly and do some science with it.
    - People who want to help with the community.

  13. Pingback: The unintended benefits of crowdsourcing OCR text correction

  14. Lise Summers says:

    Sometimes, it is not OCR vs crowdsourcing, but OCR and crowdsourcing. The National Library of Australia’s microfilm newspapers project looked at OCR of digitised newspapers, realised the errors and problems, opened the correction up to the crowd, and away they went. http://trove.nla.gov.au/newspaper

  15. Amanda Neill says:

    I, too, have put down OCR as returning generally disappointing results for anything other than clearly-typed modern labels. But as any label-deciphering system gains users, especially if it is shared as a community open-source development project, we should be able to rapidly build the training into the product and serve up a better-trained OCR engine with every product update. In fact, every downloader should have to return a small set of cleaned OCR results for training to “buy” continued access.
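
    That retraining loop could be wired up roughly like this; a sketch only, with a placeholder where a real OCR fine-tuning job would go (nothing here is an actual Tesseract API):

    ```python
    # A hedged sketch of the "corrections buy access" loop: user-corrected
    # OCR output is banked and periodically used to retrain the engine.
    corrections = []  # (image_path, corrected_text) pairs returned by users

    def retrain_engine(pairs):
        # Placeholder: a real system would launch an OCR training run here.
        print(f"retraining on {len(pairs)} user-corrected labels")

    def bank_correction(image_path, corrected_text, batch_size=1000):
        corrections.append((image_path, corrected_text))
        if len(corrections) >= batch_size:  # arbitrary release cadence
            retrain_engine(list(corrections))
            corrections.clear()
    ```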

    But, mostly I just want the prize for being number 25 🙂

    • Rob says:

      WOO! OK, SO AWARDS: Amanda, you have indeed made the 25th comment, and we are also giving awards to Ben Brumfield for deepest-relevant and Judith Price for funniest-relevant. WE ARE AWARDING ACTUAL PRIZES, which will be explained in separate emails very shortly. Please do let us know if you like the awards, and THANKS TO ALL THE COMMENTERS for the excellent discussions. Keep them coming! The awards are just gravy.

  16. Amanda Neill says:

    Thanks for the giant trophy for being number 25!!! How did you get it here so fast?

  17. Hello –

    I am working on the Old Weather project and I’d like to share with you what we (the crowd) have discovered. I actually work in document management – from transcribing documents for litigation to transcribing comments from the public on Federal regulatory actions – so I have some professional experience as well –

    1) The interface is very well designed. One does not need any experience with document transcription to work on the project.
    2) QA is handled in a very elegant way – 3 people transcribe each log page’s entries. At the beginning of the project, it was 5 people, but the agreement rate was so high, the number was lowered to 3.
    3) There is no way a computer could read these pages – the handwriting varies from very easy-to-read print to almost incomprehensible script. Also, some of the pages have faded ink, stains, etc., that a computer just could not work through.
    4) The crowd works very efficiently – I think the designers of the project are a little surprised by how quickly we are working through the log pages.
    5) This has become the most important part of the project to most of us – you find you quickly develop an attachment to the ships you work on – it is almost as if you are on them. We have discovered a shared humanity with these long gone sailors – they go to the movies, attend lectures on evolution, witness eclipses, give concerts, lose things overboard, and on, and on. We even have a thread in the forum for listing deaths on board and we take a moment to remember these men.
    6) Lastly, within the project, we have developed a very close-knit community – many of my fellow transcribers have become friends, even though I doubt we will ever meet face to face. We have explored things mentioned in the logs with further research, we tell jokes to each other, talk about the music we like, and in general, have made this a part of our everyday lives.

    This is truly one of the most fascinating things I have ever done –

    yours –

    Kathy Wendolkowski

    • Andie says:

      Hey Kathy,

      A little belatedly, thanks so much for your comment! Do you mind if we quote a bit from it for our next post?

      Andie

      • Kathy says:

        Yes, feel free to use my comment – I’m flattered that you wish to do so –

        yours-

        Kathy Wendolkowski
