An Ode to Founders and a Field Notes Challenge: Part 1

Junius Henderson was the founder and first curator of the University of Colorado (CU) Museum of Natural History where Rob works.  Because Rob is the Curator of Invertebrate Zoology, and his training is in malacology (not “bad ecology” or “evil” but the study of molluscs such as squids, clams, snails), he has always been pleased that he can trace a direct taxonomic line back to Henderson, who was first and foremost one of the great descriptive malacologists working in western North America.  One hundred years later, brick-and-mortar testaments to CU’s Founders remain throughout campus: the Henderson building, where the CU Museum of Natural History exhibits are housed, and the Ramaley building (named after compatriot Francis Ramaley), home to the majority of the ecology and evolutionary biology department.

Junius Henderson kept copious field notes describing his many collecting trips; these were compiled into eleven volumes, and are archived in the museum.  The notes start in 1905 with this entry:

“Boulder, Colorado. July 28, 1905. Saw Say Phoebe and Siskins, Robin, Flicker.”

Another very early entry reads,

“Expenses Florissant trip, 2 tickets to Denver Dr. Ramaley and I —-$2.00. Saw a Kingbird and Robin on way to depot…  Went to City Park and heard band and saw moving pictures including ‘Stage Robbery’ which, to say the least, was not an elevating spectacle, nor helpful to venturesome boys, apt to be carried away with the wildness of such a life.” [emphasis added for the benefit of any venturesome readers]

Twenty-two years and ten notebooks later, here is one of the last entries:

“Virginia Dale, Colo., Wednesday, June 15, 1927. Cloudy, foggy, rainy, cold morning, with a strong northwest wind. Started at 8 a.m. At edge of Laramie basin, speedometer 9728; Laramie 9747; Rock River 9787, at noon for lunch: Medicine Bow 9804; Ft. Steele 9848, about (speedometer slipped off just before reaching there); Rawlins 9864. Roads mostly gravelled and good; but in some places clay, and soft and slippery. Cleared about middle of afternoon and warmer this evening in camp at Rawlins.”

Fast forward another century (give or take): shortly after Rob’s arrival at CU in 2000, the now-retired Curator of Paleontology, Peter Robinson, mentioned he had personally transcribed ALL ELEVEN VOLUMES and saved each notebook as a separate Word document.  This is a best-case scenario for transcription in many ways; Peter is an expert with deep experience in natural history and paleontology, so his transcriptions of esoteric species names and locations are likely as accurate as they could possibly be.  While there are no scans of Henderson’s notes (yet), Peter did add some annotations (always using double parentheses) such as, “((at some later date Henderson wrote an emphatic ‘NO.’ at this place in the notebook))” to let readers know where they should refer back to the original notes.  One disappointment, then, is how often Peter had to add the annotation “((Drawing in field book))” for drawings that one cannot (yet) view.

Rob has made use of these notes in his research at CU; in 2003, he headed out on a summer-long collecting expedition as part of a State of Colorado survey of molluscs and crayfish in Western Colorado. Henderson’s field notes provided invaluable context and information about past collecting trips.  Henderson’s notes aren’t just part of the scientific record, however; they’re also a vivid image of the American West in a moment of swift change, as his modes of transportation transition from stagecoach to trains to automobiles, and his travels take him along new routes and through new towns and cities.  In our last post we talked about how we can best do work at the intersection of the sciences and the humanities; rich corpora of field notes like Henderson’s are exactly the media that tie these seemingly disparate disciplines together.

So why are we telling you all this?  Because we think that:

a) Henderson’s meticulousness and Peter Robinson’s hard work provide a remarkable resource that should be publicly available, and;
b) we’ve talked a lot about how to digitize, what to digitize, why to digitize, but we haven’t done quite as much work discussing what to do once you’ve digitized.  In other words, say you’ve transcribed 1000 pages of field notes.  Now what?

So over the last week we’ve been working on just this question of “Now What” using  Henderson’s field notes, with the following goals and caveats for this project:
1) We want to make the notes publicly accessible, easily discoverable, and preferably bundled with appropriate descriptive, structural and preservation metadata;
2) We want to do so using the least restrictive licensing available (and we appreciate the support and encouragement of CU Museum Director Patrick Kociolek and Peter Robinson to do so);
3) We want to make use of some of the automated data extraction tools we’ve stumbled across over the last couple of months to do things like link names of taxa, places, people and dates to other sources of biodiversity knowledge;
4) We want to produce at least one Nifty Thing as a result of this project — like a map on Google Earth showing Henderson’s travels;
5) We don’t want to spend more than five hours each on this.  This is because we’re both super busy, and we also like the idea of figuring out what substantial products can be produced on a budget of no money and close-to-no time.
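
As a minimal sketch of what goal 3 might look like in practice, assuming nothing fancier than Python's standard library, the snippet below pulls the date and the place/odometer pairs out of the June 15, 1927 entry quoted above. The function names and regular expressions are illustrative assumptions of ours, not one of the extraction tools mentioned in the list, and real notes would need far more forgiving patterns. Note that the departure reading at the edge of the Laramie basin is deliberately not captured, because no capitalized place name sits directly in front of the number.

```python
import re
from datetime import datetime

# Text copied from the June 15, 1927 entry quoted earlier in this post.
ENTRY = (
    "Virginia Dale, Colo., Wednesday, June 15, 1927. Cloudy, foggy, rainy, "
    "cold morning, with a strong northwest wind. Started at 8 a.m. At edge of "
    "Laramie basin, speedometer 9728; Laramie 9747; Rock River 9787, at noon "
    "for lunch: Medicine Bow 9804; Ft. Steele 9848, about (speedometer slipped "
    "off just before reaching there); Rawlins 9864."
)

# Assumed patterns: "Month D, YYYY" dates, and capitalized place names
# followed directly by a four-digit odometer reading.
DATE_RE = re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")
READING_RE = re.compile(r"([A-Z][A-Za-z.]*(?: [A-Z][A-Za-z.]*)*) (\d{4})\b")


def parse_entry(text):
    """Return (date, [(place, odometer_reading), ...]) for one entry."""
    found = DATE_RE.search(text)
    date = datetime.strptime(found.group(0), "%B %d, %Y").date() if found else None
    readings = [(m.group(1), int(m.group(2))) for m in READING_RE.finditer(text)]
    return date, readings


def leg_miles(readings):
    """Miles between consecutive readings, labelled by the arrival place."""
    return [
        (place, miles - previous)
        for (_, previous), (place, miles) in zip(readings, readings[1:])
    ]


entry_date, readings = parse_entry(ENTRY)
legs = leg_miles(readings)
print(entry_date)
print(readings)
print(legs)
```

The place names that come out of a pass like this are the sort of thing that could then be handed to a gazetteer or mapped. Henderson flagged the Ft. Steele reading as approximate ("about"), and the sketch ignores that caveat, which is a good reminder that a human still needs to read the notes.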

Rob’s student Gaurav Vaidya has also been working on this project with us, focusing on possible Wikipedia-oriented solutions, and we’re all nearing/exceeding the end of our respective five-hour allotments (even when excluding time spent looking up movies from the 1900s and pictures of Say’s Phoebe).   In the interim, here (in text and Word formats) is the first notebook of Henderson’s for your perusal and to get you thinking and doing. In posts that follow, we will report some of our next steps with the full corpus along with releasing the other notebooks.  More soon!


Where do the digital humanities and eScience intersect? — Crosspost with VertNet

This special post was co-written with David Bloom, VertNet Coordinator, and crossposted (with some minor mods) at the VertNet Blog.

First and foremost, digitization of natural history collections and tools to make these digitized records available, such as VertNet, support global biodiversity research.  We suspect that the majority of use of digitized records will be to generate products such as species distribution models and change assessments, and to answer questions about what is in any given museum collection.  However, in the broader context of academic endeavor, these data could also serve as a unique link between the digital sciences and the digital humanities.  Work in the digital humanities includes everything from crowdsourcing manuscript transcription to humanistic fabrication to data mining — work that is not so dissimilar in method, description, or data type from that in the digital sciences.

Biological collections aren’t the only organizations engaged in massive digitization efforts; libraries and archives have been digitizing and making their materials discoverable and interoperable for decades as well.  As a result of these efforts, an unprecedented number of research materials from a wide range of domains are now available for free on the Web.  What VertNet does for biodiversity data, the University of Illinois’ Digital Collections and Content project does for cultural heritage records and the National Library of Australia’s Trove does for newspapers, articles, and music.  HathiTrust makes more than 9 million books available — and the list goes on.  Digitization allows these materials to be recombined and analyzed quickly and (relatively) easily in new ways.

Our question is a simple one:  Where do the digital humanities and e-science overlap and interconnect?  One method of digital investigation that caught our attention is the mapping of novels and other historic texts; researchers take prose text and mine it for mappable units.  Erin Sells and her students, for instance, have used this method to create dynamic maps of Virginia Woolf’s Mrs. Dalloway, which incorporate “pictures, sounds, videos, and the text itself into the map.”  Similarly, in the Google Ancient Places project, researchers mine archaeological and historical texts to create databases of georeferenced ancient locales which can then be mapped.  Though these researchers are working with novels, they’re producing data in formats similar to those used for species occurrence records in databases such as VertNet.

This made us think: what sorts of questions could we ask of a data set composed of all kinds of georeferences — not just species occurrence records, but locations from history or works of fiction as well?  If students of the humanities can create maps with such texture using similarly organized data sets, could they build on this richness by including analysis of the natural world as it existed at the time described in the novel?  Perhaps searching on the VertNet portal (or GBIF or ALA) could provide a detailed list of vertebrate species and, with a little more work, the associated ranges of these species.  Suddenly, the map of Mrs. Dalloway’s world, and the atmosphere of Clarissa’s party, can be enriched not only with human influence and creation, but by the natural environment, too.  Conversely, data from diaries or other digitized sources could be mined for data about distributions of now-extinct species.  Could these data be used as observations and published as records along with those from natural history collections?

We hope that VertNet will support interdisciplinary research in the sciences and the humanities by providing new avenues for deeper readings, and new ways to reconstruct real and imagined worlds.  Where are the specimens that Lewis and Clark found on their expeditions and how do those link up with their journals (online already!!)?  What about whale species described by Melville?   How accurate are James Fenimore Cooper’s depictions of the animals Hawkeye and Cora encountered as they traveled through the Great Lakes?  What does this accuracy or inaccuracy tell you about Cooper as an author?  What about Thoreau’s notebooks of life at Walden Pond, and how have this iconic landscape and its animals and plants changed since his stay?

We also hope that other folks have more ideas about what new combinations of data and domains of inquiry are possible now that so many different sources of knowledge have been digitized.  How can eScience support and enrich the digital humanities and vice-versa? What happens when images of specimens* mix with drawings from the literature? Point-radius georeferences, for example, are easy enough to pull together from different sources — what further visualizations could be created with the combination of journals, books, and catalog ledgers?  What further ways can we use data and smarts to bridge gaps between the sciences and the humanities?
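
To make the point-radius idea a little more concrete, here is a rough sketch, in Python, of how georeferences from very different sources could share one shape and be compared. Everything in it is an assumption for illustration: the `PointRadius` class, the field names, and the three example records (with invented coordinates and radii) are made up and do not come from VertNet or any real catalog. Two records are treated as possibly describing the same place when the great-circle distance between their points is no larger than the sum of their uncertainty radii.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


@dataclass(frozen=True)
class PointRadius:
    """A point plus an uncertainty radius (hypothetical field names)."""
    source: str      # e.g. "catalog ledger", "field notes", "novel"
    label: str
    lat: float       # decimal degrees
    lon: float       # decimal degrees
    radius_m: float  # uncertainty radius in metres


def haversine_m(a, b):
    """Great-circle distance in metres between the centres of two records."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlam = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def may_overlap(a, b):
    """True if the two uncertainty circles touch or intersect."""
    return haversine_m(a, b) <= a.radius_m + b.radius_m


# Invented records, for illustration only.
ledger = PointRadius("catalog ledger", "made-up snail record", 40.01, -105.27, 5000)
diary = PointRadius("field notes", "made-up diary mention", 40.03, -105.25, 2000)
novel = PointRadius("novel", "made-up fictional scene", 41.79, -107.24, 10000)

for other in (diary, novel):
    print(other.source, round(haversine_m(ledger, other)), may_overlap(ledger, other))
```

A shared structure like this is the easy part; deciding how large the radius should be for a place named in a diary or a novel is where the interesting humanities-meets-science arguments would start.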

SYTYCD is offering the inaugural Thinky People’s Digitization Challenge (THIPDIC).   This first THIPDIC will go to the person or people who provide our favorite comment showing how digital science and the digital humanities intersect.  Any cool examples?  Any deeper thoughts about how this happens?  Any cute pictures of animals reading books?  Winners will be celebrated the world over and will be eligible for a (modest) prize, offered by Rob (don’t worry, it’ll be something interesting and of actual value).  You may now talk amongst yourselves.

* gigapan snakes in jar!


Zombies versus Unicorns at TDWG (or, a recap of citizen science talks)

So You Think You Can Digitize was in the Big Easy for TDWG 2011 last week. Summarizing the whole meeting is best left for friends Nico and Gaurav, who have longer attention spans than us. Nor should you miss the Unicorn Magic from friends at VertNet. Instead, we’ll focus our efforts on a set of talks in the citizen science session.

Batting first: Enlisting the Use of Educated Volunteers at a Distance: Or, Why Crowdsourcing and Citizen Science Will NOT Create Nightmare Zombies That Will Destroy Us All.

Presented by us! Slides are here. This talk developed organically out of the last few SYTYCD posts, but also gave us an opportunity to push a bit further on some trickier concepts we’ve been cogitating on for the last few months.  Particularly:

1)  We presented some neat (and we think relevant) education literature that shows that knowledge may be constructed more quickly through peer discussion in the classroom. We argued that volunteers communicating and using existing resources to vet records is analogous to students talking to their neighbors in the classroom. What do you think? Discuss!

2)  We also argued that the creation of these large crowdsourcing interfaces and applications (e.g. Old Weather, Atlas of Living Australia) necessarily forces “articulation work” — that is, the work of explaining what one group of people wants done by another group of people (e.g. curators by web developers, collections managers by volunteers).  A fundamental concern of citizen science is about how to best connect the people collecting or annotating data back to the scientists who use them.  Using web applications to facilitate this connection forces both the citizen scientists and the experts to understand the data and encode that understanding into those apps.  For a standards group like TDWG, this act of encoding is particularly important to consider and understand; we need to remember that standards aren’t just ways of passively creating databases with consistent field names, but are means of facilitating communication and a shared sense of mission between people as well.

Notes: We might still have some work articulating articulation work.  Also, our best intentions to collect data on how easily people can use existing web resources to more effectively digitize foundered on the rocks of too little time to get through some, uh, minor logistics issues (in particular, IRB Human Subject approvals – facepalm).  However, we still hope to do this in the future.

Batting in the 2-Spot:  Crowd sourcing record transcription to unlock historical species data from natural history collections.

Andrew Hill, Vizzuality wunderkind and semi-erstwhile PhD student at CU Boulder with Rob, discussed Vizzuality’s rapid development of citizen science projects like “Old Weather” and a new one for NASA called NEEMO. Andrew showed that citizen scientists work together in the spirit of both cooperation and competition by relating how he and company owner Javi de la Torre kept vying for the top scoring spot in NEEMO — only to be blown away by a NASA employee who was also working/playing. It is an interesting line, at least from our perspective, where elements of competition and collaboration can both be optimized in developing citizen science applications. We here at SYTYCD have tended to focus on cooperation and narrative — not on game-ification and competition — but maybe there is a middle ground that yields the best of both worlds, and maybe the broadest appeal. Perhaps competition works better for some demographics and cooperation for others. Andrew also announced that Vizzuality is likely going to be involved, in some capacity, in developing a citizen science project for natural history transcription. We love this plan and can’t wait to hear more.

Batting Third:  Crowd-sourcing: perpetual valuable resource or a passing shower of dubious worth?

Paul Flemons, who Rob thinks looks just a teensy bit like Samuel Vimes (famous fictional cop), presented his work with the ALA’s “Australian Museum Cicada Expedition” while deftly weaving in musings about the long-term value of crowdsourcing as a digitization tool.  One thing we particularly liked seeing was a frequency plot showing the “long tail” of transcription efforts.  That is, most volunteers who drop by the site will only transcribe one or two records; however, there are a few extraordinarily dedicated folks who will transcribe much larger numbers — hundreds or thousands of records.  Why?  Well this gets to incentives — really, all the talks in the session ultimately touched on this essential topic.  Is it possible to build a citizen science tool that shifts that long tail to be shorter and stouter so that more people are willing to transcribe more records?   Paul ended his talk saying he wasn’t entirely sure about the future of crowdsourced transcription for natural history collections — he is still not sure that we have the critical mass of volunteers needed to transcribe EVERYTHING, or that the links between the volunteer work and science are always fully exposed.
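
For anyone who wants to look for a long tail in their own project, here is a minimal sketch, in Python, of the kind of summary that sits behind such a frequency plot. It is not Paul's analysis: the function name, the "casual" cutoff of two records, and the log (one volunteer id per transcribed record) are all invented for illustration.

```python
from collections import Counter


def long_tail_summary(events, casual_max=2):
    """Summarize how unevenly transcription work is spread across volunteers.

    `events` holds one volunteer id per transcribed record; `casual_max` is
    an assumed cutoff for a "drop-by" volunteer.
    """
    if not events:
        raise ValueError("need at least one transcription event")
    counts = sorted(Counter(events).values(), reverse=True)
    total = sum(counts)
    casual = [c for c in counts if c <= casual_max]
    return {
        "volunteers": len(counts),
        "records": total,
        "casual_share_of_volunteers": len(casual) / len(counts),
        "casual_share_of_records": sum(casual) / total,
        "top_volunteer_share_of_records": counts[0] / total,
    }


# Made-up log: one very dedicated volunteer and nine drop-by volunteers.
log = ["a"] * 90 + ["b"] * 2 + ["c", "d", "e", "f", "g", "h", "i", "j"]
summary = long_tail_summary(log)
for key, value in summary.items():
    print(key, value)
```

Tracking numbers like these over time would be one simple way to tell whether a change to an interface actually makes the tail "shorter and stouter".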

After seeing all the talks and the excellent demonstrations by Beth Mantle, Katja Schulz, and Tony Kirchgessner, we are more optimistic than Paul. One reason for optimism: we overheard comments like “Wow!  this session was amazingly well attended” and, to paraphrase, “this might actually work.” So, yeah, TDWG was indeed great, even if one of us who isn’t Rob did get suckered into Co-Chairing the Citizen Science Interest Group.  And yes, we do indeed still think we can digitize.

Speaking of digitization, we have been following the crowd-sourcing thread for a long time now, and our next posts may swing back around to other topics of interest in the broader realm of natural history digitization.  With the ramping up of Thematic Collections Networks and the iDigBio HUB, the hard work of digitizing and the even harder work of innovating are just getting started….


Crowdsourcing, Deep Reading, and Narrative: Part 3

Ok, so it’s no Berlin Trilogy, but the reason we wanted to break this up into three posts was so we could take the time to really tease out some subtler points about the value of citizen science approaches for natural history digitization.  The main points we want to make are:

  1. Citizen Science projects connect like-minded people and by so doing, create work communities that enhance both volunteer experience AND data quality;
  2. These tools not only connect people but also allow them to become part of the narrative of discovery, and by so doing enhance the quantity and quality of the data produced;
  3. Contextual knowledge that helps explain what is being transcribed, and why, enriches the experience for citizen scientists and leads to improved data quality.

More below.

POINT THE FIRST:  There are smart amateurs out there, dammit (and we’ve relied on them for years).  It might be a “duh”, but citizen science is a FANTASTIC way to connect a task with people who are well suited to perform it.  This isn’t just about efficiency — it is about finding people who deeply ENJOY the work, who want to be involved, and who want to discover other like-minded folks and talk about the experience.  Natural history work has relied on volunteers for decades, if not centuries — in field work, in cataloguing, and, yes, in in-house transcription projects.  Online citizen science is a natural evolution from in-museum volunteerism.

We think translating the volunteer experience from a museum collection to an online community could have striking benefits both for volunteers and museum collections — benefits that aren’t possible in a local environment.  First benefit: access to an expanded pool of up-to-date knowledge that can help with difficult tasks — say, a hard-to-read label or an unfamiliar locality (we’ll explain this more in Point 3).  Second benefit: rapid access to innovations and ideas from the “hivemind.”  With enough fairly basic infrastructure (read: message boards) volunteers can (and will) talk amongst themselves to share the newest, fastest ways of solving problems and tackling tasks — be they transcription or, say, the identification of a whole new class of interstellar object.

POINT 2:  The narrative necessary to keep volunteers engaged in a project is also necessary to create good, usable data and to start the process of scientific inquiry.   Part of the charm of Old Weather is the sense of being on the boats, tracking where they went, and learning about the people on them.  But this sense of being on a journey is more than window dressing — it provides volunteers with valuable context that allows them to make better decisions while transcribing and correcting data.  The combination of mapping functions, access to other ships’ logs, and access to other people working on the same project provides volunteers with the context necessary to quickly “enter” the narrative and begin working with the data.

The benefit of engaging with history, of “entering” the narrative, is, ironically, that it allows data to be released from the very history in which they are bound (i.e. catalog ledgers) and then used in new contexts.   We argue that this process of collectively unlocking and then re-assembling biodiversity data into new contexts also increases their fitness for use; data are only considered usable after they’ve been checked (or referenced, to borrow from Latour*) against other data.  It is this process that ultimately leads to scientific discoveries.

We want to emphasize here the “science” part of citizen science, and its ability to build new, collective knowledge.  Can we show transcribers the fruit of others’ digitization efforts to build this collective knowledge?  How might this be accomplished for natural history ledger/specimen transcription?  One obvious idea is a collaborative map showing new records as they are digitized by all participants.  Such a map would show gaps in what we know about biodiversity and how those are being filled by efforts of transcribers.  Such a map could also link to other scientific projects that utilize the “what/where/when” data being transcribed in order to document species distributions.   For example, Rob routinely uses these data for just that purpose — documenting species distributions and how they are changing in response to rapidly accelerating environmental change.  What do you think!?

POINT 3:   We are uniquely fortunate in our discipline to have excellent reference materials in electronic format.  For example, The Encyclopedia of Life (EOL) is a remarkable resource that continues to fulfill the promise of its name.  Linking contextualizing resources such as the EOL directly into the transcription workflow enhances user experience and data quality.   It could help volunteers decipher hard-to-read taxon names (e.g. does that label say Pinus torreyanus or Pinus torreyana?  Bacillus or Bacillus?).  Furthermore, EOL’s value extends past facilitating data quality improvements because it has a built-in reward system: citizen scientists can create personal collections of the species they have transcribed, along with information about those taxa (e.g. where they are located, what they sound like).  These collections can be a sharable legacy of their work while also providing a link back to natural history and the excitement of discovery.

We have two closing thoughts: in our original post we couched our assertion that crowdsourcing trumps machine-only transcription techniques in the caveat that we were being “a little provocative as opposed to right.”  Well, after reading comments, researching, and cogitating we are increasingly convinced that we might just be flat out right.  Crowdsourcing science works.  Period.

Second closing thought:  All of our blog postings start as labyrinthine Google Docs, and there is always a long list of ideas, partial paragraphs, etc. that don’t get incorporated into the posting at hand.  Well, we’ve taken many of the (better) spare pieces and parts — PLUS ANY COMMENTS WE GET SO PLEASE COMMENT — and molded ideas (with attribution to you, dear readers) into a talk at the annual TDWG (Oct 16-21, New Orleans) meeting.  Talk title: “Enlisting the Use of Educated Volunteers at a Distance: Or, Why Crowdsourcing and Citizen Science Will NOT Create Nightmare Zombies That Will Destroy Us All”.  Yeah!  Hope to see you there!

*Recommended reading for this week (and thanks to the CIRSS Tuesday Reading Group for it):

Latour, B. (1999). Circulating reference: Sampling the soil in the Amazon forest. Pandora’s Hope. Harvard University Press. Retrieved from: http://www.citeulike.org/group/3621/article/636580


Crowdsourcing, Deep Reading, and Narrative: Part 2

Deep-reading, narrative and the idea of the community-participation-in-solitude as part of “transcribing the past” were key themes that emerged last post.  This idea of incorporating shared narratives into digitization work — or rather, taking advantage of existing narrative — is perhaps less foreign to the humanities than it is to the sciences.  Our evidence for this?  The litany of (seemingly successful!) crowdsourced transcription projects involving diaries, histories, newspapers, and even menus.

Originally, we planned to present an annotated linkography/bibliography just within this post, but then decided it would be better kept as an updated-as-often-as-possible stand-alone page.  So if you click over to our Digitization Bibliography, you’ll see a nice condensed list of everything we’ve come across in the last few weeks.  As we learn of new projects, we will add them to this document, and maybe announce new additions on Twitter or something (@robgural and @an_dre_a_ respectively).

We are painfully aware that this list is far from complete, and we are painfully aware that many of you may know more about current work than we do.  With that in mind, we’d like to do our own bit of crowdsourcing and ask/beg you to leave a comment pointing us to anything we’ve missed.  So: what glaringly obvious project have we left off?  Have you just started a project that you want us to know about?  Is there an already existing list of digitization projects that we should’ve found in the first place (hi, Wikipedia)?  LET US KNOW!!!  We are particularly keen to hear about any field note digitization projects we don’t know about.

Many thanks to everyone who has already commented and emailed with links!  Next post: what does it all mean?


Crowdsourcing, Deep Reading, and Narrative: Part 1

Last post we wound up with a lot of feedback, and happily, much of it went far beyond statements declaring fealty to either crowdsourcing or OCR, and instead consisted of amazing discourse about the best ways to make use of crowd and computers alike.  We’re very appreciative — and humbled!  We knew so little going in, and got gentle and great feedback saying that, DUH, there is much happening out there.   We also got some insightful comments from folks simply sharing their own experiences converting various kinds of scans into movable type — both as volunteers and project coordinators.

We get the strong sense that our community is excited and ready to move forward.  But one thing we might be able to do is to “gear up” more collectively.  Not piecemeal, with 1000 different projects all working in isolation, but maybe in ways where the transcription experience, the data produced, and the science generated from those data are all enhanced.  Furthermore, we are now fully convinced, based on the excellent feedback we received, that there are real, concrete social and educational benefits to crowdsourcing that have heretofore been under-explored and perhaps unappreciated — again, built around thinking more collectively.  We want to now dig a little deeper, using the comments we received as groundwork for further inquiry.

So, we have a trilogy of blog posts planned: Post 1 will discuss some of this feedback in depth. Post 2 will LIST some of the new (to us) projects that we were forwarded via email and in the comments — particularly some projects that may be outside the normal bounds of natural history or museum work.  And Post 3 will summarize all of this material and any new comments into something concrete about practice.  If we do our job well, there will be lots of fodder from the community to develop into next steps, e.g. collaborations, proposals, angry letters to the editor.   If not, well… you can always go back to watching beluga whales getting serenaded by a mariachi band.   Onwards to comments.

As in the last post, we do hope (nay, live!) for your comments and feedback.  Throughout this post we will be asking more questions than offering answers; if you have sharable thoughts, please, don’t hesitate to post them!  This whole community thing only works when we talk to each other.

From Chris Norris:

“Most of the successful examples of crowdsourcing museum/science-based work that I’ve seen involve tapping pre-existing communities of either professional or avocational workers… Unless, of course, you manage to invent the taxonomic equivalent of WoW as a hook.”

Chris brings up two excellent points here: 1) for many disciplines, there are already crowds of skilled workers, and 2) narrative and rewards (even virtual) are addictive.  While a taxonomic MMORPG is likely ill-advised for our purposes (though “Spore” did see modest success), there are narratives to natural history work that could work as powerful motivators in digitization.

We hope Chris does not mind, but we’d like to expand his first point a bit: we argue that we cannot forget these pre-existing communities that may already be eager to help us — communities on which we already rely and have relied for decades, in the form of lab volunteers, field volunteers, and docents.

Andy Bentley similarly picked up on this gamification/incentivization/reward thread:

“The major issue is getting these “humans” interested in doing something like this. I agree with other posters that turning this into a game has great potential. Something like Farmville where people only get to add animals to their farm/zoo/aquarium once they have transcribed the data associated with that specimen from a collection somewhere. If you have a particular affinity to a certain type of animal/plant/organism you can stock your farm/zoo/aquarium with as many of that thing as you like after you have transcribed that many label records…”

What do you, dear readers, think of this sort of scenario?  Is this enough of a narrative hook? Label digitization would require considerably more work per reward — careful work — than something like Farmville.  Would too much gamification lead to a decrease in real digitization, as people attempt to cheat (or “game”) the system for easier rewards?  What real-life rewards do people get out of transcription endeavors?

This leads us to some comments from Ben Brumfield, our winner for “Most Insightful” comment last post (all comments were insightful, but Ben wrote us an incredibly thoughtful novella of a reply), which touch on what might be the rub here:

“There is a lot of work being done on volunteer motivation for these projects, and not all research points to gamification as the solution to engagement issues. Other motivating factors – connection to the mission, sense of doing real work, collaboration with fellow volunteers, and the immersive nature of (some) transcription work–may be more motivating, and at least a few of these factors may be in direct conflict with game features.”

“In the case of the Julia Brumfield Diaries, the manuscript is a narrative. Transcription is a form of deep reading, and transcribing something like a diary can be an incredibly immersive experience, particularly if the volunteer can follow their own natural – usually chronological – workflow. Gamification practices … disrupt this flow, whereas message-boards or annotation tools can foster a community of volunteers checking each others’ work and researching each others’ problems.”

“Both Old Weather and the North American Bird Phenology Program are located somewhere in between these extremes. Ships’ logs are not terribly immersive — it’s hard to identify with a midshipman making observations, especially when that midshipman changes with each watch. However, they have a chronological structure which the Zooniverse/Vizzuality folks have managed to enhance through their mapping tools.”

We are impressed with the idea of “deep reading” – “the slow and meditative possession of a book” (Birkerts 1994) (what a lovely phrase!) – as applied to scientific as opposed to literary texts. Much of what Ben describes has to do with finding narrative momentum within the otherwise tedious process of transcription.  Diaries have their own built-in, chronological momentum, and ships’ logs have a sort of spatial momentum, which, as Ben points out, Old Weather has capitalized on via mapping tools.  As transcribers read ships’ logs, they are cartographically carried across the ocean, and thus encouraged to imagine a life in a different time, on the high seas, as opposed to a life in the here and now, mayhaps in a cubicle.

Is there something about the real re-imagined, about being both a part of making a history and apart from it?  And is there something about re-imagining the formerly real collaboratively, as opposed to alone?  What narrative momentum can be found for transcribing natural history collection labels?  Can we tie this to exploration and the natural wonder of discovery?

From Javi de la Torre:

“The question is… [can] we as a community coordinate enough to use citizen science as a service to digitize faster our collections? Can we agree on an infrastructure where money from collection is spent on digitalizing collections and the transcriptions is kept by projects like this?”

Javi makes a great point and we want to dissect it a bit; we think there are multiple ways to think about community, coordination and rewards.  Different projects clearly need to make a coordinated effort to develop best practices in transcription.  But we also need to think about this as a challenge to bring people together across pages and images and projects in order to discuss their activities, share their knowledge, and feel a sense of community.  Such community building may be especially important because, despite the name, crowdsourcing is — in many other ways — a solitary task performed via keyboard and computer screen.  Individual, localized projects need to be linked in the same way that individuals are linked by working on a project.

We are inclined to argue that there will be necessary heterogeneity and diversity in digitization initiatives; the breadth, depth and scope of currently existing projects show that there’s more than one way to digitize a notebook, and that small, local efforts can be just as effective as large national ones (we’ll go further into examples of these in our next post).  But we do think we can stand to be more coordinated in a few areas, particularly in our use and implementation of metadata standards and, maybe more importantly, in our community’s continued support of, and conversations about, different digitization initiatives.  We argue that the value of the experience and the quality of the data improve when community develops around these projects.

We’ll try to unpack these thoughts in more detail in an upcoming post, but for now want to close with a comment from Kathy Wendolkowski, a volunteer on “Old Weather” who captures this idea of community-in-solitary perfectly:

“Within the project, we have developed a very close knit community – many of my fellow transcribers have become friends – even though I doubt we will ever meet face to face. We have explored things mentioned in the logs with further research, we tell jokes to each other, talk out the music we like, and in general, have made this a part of our everyday lives.”

Next post, Part 2: Digitization efforts from the other side of the aisle, as it were – genealogy, digital humanities, and more.


Old Weather’s Crowd and the Challenge of Digitization

Quick recap:  Last time around, we talked about a confluence of drawers.  At every digitization meeting we’ve attended, there is a pall of gloom whenever discussion turns to digitizing things like entomological collections: drawers dense with delicate specimens that ever-so-inconveniently obscure their tags.  There is so MUCH to do!  It’s so difficult to do!  How do we make progress?  So we have a simple thesis: progress in imaging techniques and efficiency is no longer the issue.   Lots of people are exploring not-so-complicated imaging approaches that take very high-resolution photos of entire drawers (be they of insects or clams or chipmunk skulls or Triassic fossils, etc). Doing so will likely yield around a thousand-fold increase in efficiency over current methods, leaving museums with hundreds of thousands of images of many tens of millions of objects. We assert that converting the information on labels from images to machine-readable text is essential – and we don’t think many folks would disagree with us. But how?

At the end of last post we left you with two sets of words: “crowdsourcing” and “Old Weather.”  And many of you likely already figured out the denouement to our cliffhanger and are crying, “UGH, crowdsourcing, really?”  But wait!  Listen:  There are two legitimate possibilities for converting the words on labels from images to good, old-fashioned, copy-and-pasteable text. One is Optical Character Recognition (OCR), and techies all over the world love and hate this approach.  Love it because it’s so COOL – a magic machine that “reads” the shape of letters from an image and in doing so transforms pixels into movable type. The “hate” part: YOU SHOULD SEE SOME PEOPLE’S HANDWRITING!   The other approach: good old-fashioned human transcription.  So, Humans versus Machines, right?  Just the kind of battle we love!

Which approach works best is probably situation-dependent, but we are going to be a little provocative (as opposed to necessarily right) and argue that in most cases – especially those involving the digitization of handwritten labels or notes – crowdsourced human transcription beats OCR.  So, what’s the problem with OCR?  What we have noticed in many, many talks and presentations is that OCR is rarely as simple to set up, run or do quality control on as it initially seems.  Error rates remain high, requiring time and effort by well-paid professionals to tune the system and do quality control.  The result is that we wind up spending a lot of time working around machines instead of with them.  Crowdsourced human transcription, on the other hand, plays to humans’ and computers’ strengths; computers swiftly transport and store data, and humans use their discerning eyes to tell the computers what data to store.

More than just being a way to get labels into databases, keeping humans in the digitization loop – in a function other than error checking – has a lot of great side effects that OCR simply can’t match.  Chris Lintott, one of the original Zooniverse PIs (more on Zooniverse in just a second), gave a great talk at IDCC 2010 in Chicago last winter.  A quick summary follows, but if you can spare half an hour we’d suggest giving his talk a listen.  Dr. Lintott explains:
1) Human transcription increases the chances of serendipitous discovery – would a machine be able to call attention to Darwin’s marginalia?
2) Crowdsourcing can inspire impromptu collaboration amongst strangers – message boards give volunteers a place to do their own coordinating
3) Crowdsourcing necessarily means that you are staying engaged with a crowd – a group of unpaid strangers who care enough about your science to do it for free.  In early polls of Galaxy Zoo volunteers, over 50% said they contributed because they just liked helping scientists.

This last point is huge.   While the phrase “staying engaged with a crowd” sounds a bit frighteningly close to PR talk, in this case it just means you are keeping people excited about science!  It means you are showing that your collections have scientific AND social worth.  And it means that we are sharing our awesome, enviable vocations with folks who have the interest but maybe not the luck to work in museums themselves.

So, for those who didn’t already jump to the (correct) conclusion, our other two words at the end of last post – “Old Weather” – point to an excellent example of productively fun human-computer interaction that links experts, volunteers, and data (and is brought to you by Zooniverse and Vizzuality).  “Old Weather’s” mission statement is simple: users log on to “Help scientists recover worldwide weather observations made by Royal Navy ships around the time of World War I. These transcriptions will contribute to climate model projections and improve a database of weather extremes. Historians will use your work to track past ship movements and the stories of the people on board.”

The team at Vizzuality, a company that really excels at producing web-based applications that are deep, beautiful and functional, has made transcription of early 20th century ships’ logs… into a video game.  The interface is brilliantly designed; it is very easy to get started; reward systems are in place to keep people engaged. Hundreds of thousands of people are now individually enjoying a game built around digitization.  In less than a year, 555,905 pages of logs have been digitized.    Ships’ log pages are a lot more detailed and heterogeneous than specimen labels, so maybe getting tens or hundreds of millions of labels captured is a possibility using smartly deployed crowdsourcing?

For many, crowdsourcing is a scary proposition; it requires a fundamental rewiring of how we deal with our data – one that forces us to be more open and inclusive, and to think beyond our physical labs, collection drawers and perhaps even institutional identities.  And furthermore, we fully realize that it’s a solution not without drawbacks.  Two that seem obvious to us: 1) crowdsourcing requires organization and institutional support, and we are painfully aware that not every museum has these, and 2) it raises ownership issues and concerns about control over process and products.   We are sure there are more, so please tell us what flaws you see.

Even given drawbacks, this sort of broad outreach is truly necessary if we want to meet grand challenges like the 100% digitization of natural history collections – and if we want to continue proving our collections’ worth in an age of budget cuts, recession, and the folding of previously untouchable symbols of American research and ingenuity.  Crowdsourcing collections digitization gives us an opportunity to fulfill the fundamental promise of digitization: that it will improve access, use, and integration of biocollections!   Excited?  We are too, especially given that Zooniverse is actively seeking new denizens – er, projects.

To summarize:  Skip OCR. Bring images to the crowd and make it fun. Better yet, bring images to the untapped resources attached to university museums and collections: swaths of Farmville-addicted undergraduates in lower-division biology classes.  Integrate these projects with life science classes AND with Facebook – there is real potential here!

Ok, crowd, tell us what you think!   We’ll buy, well, not a space shuttle, but maybe something natural history-esque for the 25th (!!) commenter (comments MUST be relevant to the topic at hand!).  Prizes also awarded for the funniest-relevant and deepest-relevant.
