Crowdsourcing, Deep Reading, and Narrative: Part 3

Ok, so it’s no Berlin Trilogy, but the reason we wanted to break this up into three posts was so we could take the time to really tease out some subtler points about the value of citizen science approaches for natural history digitization.  The main points we want to make are:

  1. Citizen Science projects connect like-minded people and by so doing, create work communities that enhance both volunteer experience AND data quality;
  2. These tools not only connect people but also allow them to become part of the narrative of discovery, and by so doing enhance quantity and quality of the data produced;
  3. Contextual knowledge that helps explain what is being transcribed, and why, enriches the experience for citizen scientists and leads to improved data quality.

More below.

POINT THE FIRST:  There are smart amateurs out there, dammit (and we’ve relied on them for years).  It might be a “duh”, but citizen science is a FANTASTIC way to connect a task with people who are well suited to perform it.  This isn’t just about efficiency — it is about finding people who deeply ENJOY the work, who want to be involved, and who want to discover other like-minded folks and talk about the experience.  Natural history work has relied on volunteers for decades, if not centuries — in field work, in cataloguing, and, yes, in in-house transcription projects.  Online citizen science is a natural evolution from in-museum volunteerism.

We think translating the volunteer experience from a museum collection to an online community could have striking benefits both for volunteers and museum collections — benefits that aren’t possible in a local environment.  First benefit: access to an expanded pool of up-to-date knowledge that can help with difficult tasks — say, a hard-to-read label or an unfamiliar locality (we’ll explain this more in Point 3).  Second benefit: rapid access to innovations and ideas from the “hivemind.”  With enough fairly basic infrastructure (read: message boards) volunteers can (and will) talk amongst themselves to share the newest, fastest ways of solving problems and tackling tasks — be they transcription or, say, the identification of a whole new class of interstellar object.

POINT 2:  The narrative necessary to keep volunteers engaged in a project is also necessary to create good, usable data and to start the process of scientific inquiry.   Part of the charm of Old Weather is the sense of being on the boats, tracking where they went, and learning about the people on them.  But this sense of being on a journey is more than window dressing — it provides volunteers with valuable context that allows them to make better decisions while transcribing and correcting data.  The combination of mapping functions, access to other ships’ logs, and access to other people working on the same project provides volunteers with the context necessary to quickly “enter” the narrative and begin working with the data.

The benefit of engaging with history, of “entering” the narrative, is, ironically, that it allows us to release data from the very history in which they are bound (i.e. catalog ledgers) so that they can then be used in new contexts.   We argue that this process of collectively unlocking and then re-assembling biodiversity data into new contexts also increases their fitness for use; data are only considered usable after they’ve been checked (or referenced, to borrow from Latour*) against other data.  It is this process that ultimately leads to scientific discoveries.

We want to emphasize here the “science” part of citizen science, and its ability to build new, collective knowledge.  Can we show transcribers the fruit of others’ digitization efforts to build this collective knowledge?  How might this be accomplished for natural history ledger/specimen transcription?  One obvious idea is a collaborative map showing new records as they are digitized by all participants.  Such a map would show gaps in what we know about biodiversity and how those are being filled by efforts of transcribers.  Such a map could also link to other scientific projects that utilize the “what/where/when” data being transcribed in order to document species distributions.   For example, Rob routinely uses these data for just that purpose — documenting species distributions and how they are changing in response to rapidly accelerating environmental change.  What do you think!?

POINT 3:   We are uniquely fortunate in our discipline to have excellent reference materials in electronic format.  For example, The Encyclopedia of Life (EOL) is a remarkable resource that continues to fulfill the promise of its name.  Linking contextualizing resources such as the EOL directly into the transcription workflow enhances user experience and data quality.   It could help volunteers decipher hard-to-read taxon names (e.g. does that label say Pinus torreyanus or Pinus torreyana?  Bacillus or Bacillus?).  Furthermore, EOL’s value extends past facilitating data quality improvements because it has a built-in reward system: citizen scientists can create personal collections of the species they have transcribed, along with information about those taxa (e.g. where they are located, what they sound like).  These collections can be a sharable legacy of their work while also providing a link back to natural history and the excitement of discovery.

We have two closing thoughts: in our original post we couched our assertion that crowdsourcing trumps machine-only transcription techniques in the caveat that we were being “a little provocative as opposed to right.”  Well, after reading comments, researching, and cogitating we are increasingly convinced that we might just be flat out right.  Crowdsourcing science works.  Period.

Second closing thought:  All of our blog postings start as labyrinthine Google Docs, and there is always a long list of ideas, partial paragraphs, etc. that don’t get incorporated into the posting at hand.  Well, we’ve taken many of the (better) spare pieces and parts — PLUS ANY COMMENTS WE GET SO PLEASE COMMENT — and molded ideas (with attribution to you, dear readers) into a talk at the annual TDWG (Oct 16-21, New Orleans) meeting.  Talk title: “Enlisting the Use of Educated Volunteers at a Distance: Or, Why Crowdsourcing and Citizen Science Will NOT Create Nightmare Zombies That Will Destroy Us All”.  Yeah!  Hope to see you there!

*Recommended reading for this week (and thanks to the CIRSS Tuesday Reading Group for it):

Latour, B. (1999). Circulating reference: Sampling the soil in the Amazon forest. Pandora’s Hope. Harvard University Press. Retrieved from: http://www.citeulike.org/group/3621/article/636580

Posted in crowdsourcing | 2 Comments

Crowdsourcing, Deep Reading, and Narrative: Part 2

Deep-reading, narrative and the idea of the community-participation-in-solitude as part of “transcribing the past” were key themes that emerged last post.  This idea of incorporating shared narratives into digitization work — or rather, taking advantage of existing narrative — is perhaps less foreign to the humanities than it is to the sciences.  Our evidence for this?  The litany of (seemingly successful!) crowdsourced transcription projects involving diaries, histories, newspapers, and even menus.

Originally, we planned to present an annotated linkography/bibliography just within this post, but then decided it would be better kept as an updated-as-often-as-possible stand-alone page.  So if you click over to our Digitization Bibliography, you’ll see a nice condensed list of everything we’ve come across in the last few weeks.  As we learn of new projects, we will add them to this document, and maybe announce new additions on Twitter or something (@robgural and @an_dre_a_ respectively).

We are painfully aware that this list is far from complete, and we are painfully aware that many of you may know more about current work than we do.  With that in mind, we’d like to do our own bit of crowdsourcing and ask/beg you to leave a comment pointing us to anything we’ve missed.  So: what glaringly obvious project have we left off?  Have you just started a project that you want us to know about?  Is there an already existing list of digitization projects that we should’ve found in the first place (hi, Wikipedia)?  LET US KNOW!!!  We are particularly keen to hear about any field note digitization projects we don’t know about.

Many thanks to everyone who has already commented and emailed with links!  Next post: what does it all mean?

Posted in crowdsourcing | 2 Comments

Crowdsourcing, Deep Reading, and Narrative: Part 1

Last post we wound up with a lot of feedback, and happily, much of it went far beyond statements declaring fealty to either crowdsourcing or OCR and instead consisted of amazing discourse about the best ways to make use of crowd and computers alike.  We’re very appreciative — and humbled!  We knew so little going in, and got gentle and great feedback saying that, DUH, there is much happening out there.   We also got some insightful comments from folks simply sharing their own experiences converting various kinds of scans into movable type — both as volunteers and project coordinators.

We get the strong sense that our community is excited and ready to move forward.  But one thing we might be able to do is to “gear up” more collectively.  Not piecemeal, with 1000 different projects all working in isolation, but maybe in ways where the transcription experience, the data produced, and the science generated from those data are all enhanced.  Furthermore, we are now fully convinced, based on the excellent feedback we received, that there are real, concrete social and educational benefits to crowdsourcing that have heretofore been under-explored and perhaps unappreciated — again, built around thinking more collectively.  We want to now dig a little deeper, using the comments we received as groundwork for further inquiry.

So, we have a trilogy of blog posts planned: Post 1 will discuss some of this feedback in depth. Post 2 will LIST some of the new (to us) projects that we were forwarded via email and in the comments — particularly some projects that may be outside the normal bounds of natural history or museum work.  And Post 3 will summarize all of this material and any new comments into something concrete about practice.  If we do our job well, there will be lots of fodder from the community to develop into next steps, e.g. collaborations, proposals, angry letters to the editor.   If not, well… you can always go back to watching beluga whales getting serenaded by a mariachi band.   Onwards to comments.

As in the last post, we do hope (nay, live!) for your comments and feedback.  Throughout this post we will be asking more questions than offering answers; if you have sharable thoughts, please, don’t hesitate to post them!  This whole community thing only works when we talk to each other.

From Chris Norris:

“Most of the successful examples of crowdsourcing museum/science-based work that I’ve seen involve tapping pre-existing communities of either professional or avocational workers… Unless, of course, you manage to invent the taxonomic equivalent of WoW as a hook.”

Chris brings up two excellent points here: 1) for many disciplines, there are already crowds of skilled workers, and 2) Narrative and rewards (even virtual) are addictive.  While a taxonomic MMORPG is likely ill advised for our purposes (though “Spore” did see modest success) there are narratives to natural history work that could work as powerful motivators in digitization.
We hope Chris does not mind, but we’d like to expand his first point a bit: we argue that we cannot forget these pre-existing communities that may already be eager to help us — communities on which we already rely and have relied for decades, in the form of lab volunteers, field volunteers, and docents.

Andy Bentley similarly picked up on this gamification/incentivization/reward thread:

“The major issue is getting these “humans” interested in doing something like this. I agree with other posters that turning this into a game has great potential. Something like Farmville where people only get to add animals to their farm/zoo/aquarium once they have transcribed the data associated with that specimen from a collection somewhere. If you have a particular affinity to a certain type of animal/plant/organism you can stock your farm/zoo/aquarium with as many of that thing as you like after you have transcribed that many label records…”

What do you, dear readers, think of this sort of scenario?  Is this enough of a narrative hook? Label digitization would require considerably more work per reward — careful work — than something like Farmville.  Would too much gamification lead to a decrease in real digitization, as people attempt to cheat (or “game”) the system for easier rewards?  What real-life rewards do people get out of transcription endeavors?

This leads us to some comments from Ben Brumfield, our winner for “Most Insightful” comment last post (all comments were insightful, but Ben wrote us an incredibly thoughtful novella of a reply), which touch on what might be the rub here:

“There is a lot of work being done on volunteer motivation for these projects, and not all research points to gamification as the solution to engagement issues. Other motivating factors – connection to the mission, sense of doing real work, collaboration with fellow volunteers, and the immersive nature of (some) transcription work–may be more motivating, and at least a few of these factors may be in direct conflict with game features.”

“In the case of the Julia Brumfield Diaries, the manuscript is a narrative. Transcription is a form of deep reading, and transcribing something like a diary can be an incredibly immersive experience, particularly if the volunteer can follow their own natural – usually chronological – workflow. Gamification practices … disrupt this flow, whereas message-boards or annotation tools can foster a community of volunteers checking each others’ work and researching each others’ problems.”

“Both Old Weather and the North American Bird Phenology Program are located somewhere in between these extremes. Ships’ logs are not terribly immersive — it’s hard to identify with a midshipman making observations, especially when that midshipman changes with each watch. However, they have a chronological structure which the Zooniverse/Vizzuality folks have managed to enhance through their mapping tools.”

We are impressed with the idea of “deep reading” — “the slow and meditative possession of a book” (Birkerts 1994) (what a lovely phrase!) – as applied to science as opposed to literary texts. Much of what Ben describes has to do with finding narrative momentum within the otherwise tedious process of transcription.  Diaries have their own built-in, chronological momentum, and ships’ logs have a sort of spatial momentum, which, as Ben points out, Old Weather has capitalized on via mapping tools.  As transcribers read ships’ logs, they are cartographically carried across the ocean, and thus encouraged to imagine a life in a different time, on the high seas, as opposed to a life in the here and now, mayhaps in a cubicle.

Is there something about the real-re-imagined, of being both a part of making, and apart from, a history?  And is there something about re-imagining the formerly real collaboratively, as opposed to alone?  What narrative momentum can be found for transcribing natural history collections labels?  Can we tie this to exploration and the natural wonder of discovery?

From Javi de la Torre:

“The question is… [can] we as a community coordinate enough to use citizen science as a service to digitize faster our collections? Can we agree on an infrastructure where money from collection is spent on digitalizing collections and the transcriptions is kept by projects like this?”

Javi makes a great point and we want to dissect it a bit; we think there are multiple ways to think about community, coordination and rewards.  Different projects clearly need to make a coordinated effort to develop best practices in transcription.  But we also need to think about this as a challenge to bring people together across pages and images and projects in order to discuss their activities, share their knowledge, and feel a sense of community.  Such community building may be especially important because, despite the name, crowdsourcing is — in many other ways — a solitary task performed via keyboard and computer screen.  Individual, localized projects need to be linked in the same way that individuals are linked by working on a project.

We are inclined to argue that there will be necessary heterogeneity and diversity in digitization initiatives; the breadth, depth and scope of currently existing projects shows that there’s more than one way to digitize a notebook, and that small, local efforts can be just as effective as large national ones — we’ll go further into examples of these in our next post.  But we do think we can definitely stand to be more coordinated in a few areas, particularly in our use and implementation of metadata standards, and maybe more importantly, in our community’s continued support of, and conversations about, different digitization initiatives.  We argue that the value of the experience and the quality of the data improve when community develops around these projects.

We’ll try to unpack these thoughts in more detail in an upcoming post, but for now want to close with a comment from Kathy Wendolkowski, a volunteer on “Old Weather” who captures this idea of community-in-solitary perfectly:

“Within the project, we have developed a very close knit community – many of my fellow transcribers have become friends – even though I doubt we will ever meet face to face. We have explored things mentioned in the logs with further research, we tell jokes to each other, talk out the music we like, and in general, have made this a part of our everyday lives.”

Next post, Part 2: Digitization efforts from the other side of the aisle, as it were – genealogy, digital humanities, and more.

Posted in crowdsourcing | 1 Comment

Old Weather’s Crowd and the Challenge of Digitization

Quick recap:  Last time around, we talked about a confluence of drawers.  At every digitization meeting we’ve attended, there is a pall of gloom whenever discussion turns to digitizing things like entomological collections: drawers dense with delicate specimens that ever-so-inconveniently obscure their tags.  There is so MUCH to do!  It’s so difficult to do!  How do we make progress?  So we have a simple thesis: progress in imaging techniques and efficiency is no longer the issue.   Lots of people are exploring not-so-complicated imaging approaches that take very high-resolution photos of entire drawers (be they of insects or clams or chipmunk skulls or Triassic fossils, etc). Doing so will likely yield around a thousand-fold increase in efficiency over current methods, leaving museums with hundreds of thousands of images of many tens of millions of objects. We assert that converting the information on labels from images to machine-readable text is essential – and we don’t think many folks would disagree with us. But how?

At the end of last post we left you with two sets of words: “crowdsourcing” and “Old Weather.”  And many of you likely already figured out the denouement to our cliffhanger and are crying, “UGH, crowdsourcing, really?”  But wait!  Listen:  There are two legitimate possibilities for converting the words on labels from images to good, old-fashioned, copy and paste-able text. One is Optical Character Recognition (OCR), and techies all over the world love and hate this approach.  Love it because it’s so COOL – a magic machine that “reads” the shape of letters from an image and in doing so transforms pixels into movable type. The “hate” part: YOU SHOULD SEE SOME PEOPLE’S HANDWRITING!   The other approach: good old-fashioned human transcription.  So, Humans versus Machines, right?  Just the kind of battle we love!

Which approach works best is probably situation-dependent, but we are going to be a little provocative (as opposed to necessarily right) and argue that in most cases – especially those involving the digitization of handwritten labels or notes – crowdsourced human transcription beats OCR.  So, what’s the problem with OCR?  What we have noticed in many, many talks and presentations is that OCR is rarely as simple to set up, run or do quality control on as it initially seems.  Error rates remain high, requiring time and effort by well-paid professionals to tune the system and do quality control.  The result is that we wind up spending a lot of time working around machines instead of with them.  Crowdsourced human transcription, on the other hand, plays to humans’ and computers’ strengths; computers swiftly transport and store data, and humans use their discerning eyes to tell the computers what data to store.
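
To make that division of labor a bit more concrete, here is a minimal sketch of one way a computer might reconcile several volunteers’ readings of the same label: a simple majority vote.  This is our own illustration, not a description of how Old Weather or any other project actually works; the function names (`normalize`, `consensus`) and the 50% agreement threshold are assumptions made up for the example.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial differences do not split the vote."""
    return " ".join(text.split()).casefold()


def consensus(transcriptions, min_agreement=0.5):
    """Return (winning_text, agreement), or (None, agreement) when no reading wins.

    `agreement` is the share of volunteers whose normalized entry matches the
    most common reading. The 0.5 threshold is an arbitrary, hypothetical choice.
    """
    if not transcriptions:
        return None, 0.0
    votes = Counter(normalize(t) for t in transcriptions)
    winner, count = votes.most_common(1)[0]
    agreement = count / len(transcriptions)
    if agreement <= min_agreement:
        # No clear majority: leave this label for a human expert to review.
        return None, agreement
    # Hand back the first volunteer's original spelling of the winning reading.
    original = next(t for t in transcriptions if normalize(t) == winner)
    return original, agreement


# Three hypothetical volunteers reading the same hard-to-read label.
entries = ["Pinus torreyana", "pinus  torreyana", "Pinus torreyanus"]
print(consensus(entries))
```

The humans do the reading and the computer only counts, so labels with no clear majority still go back to a person rather than being guessed at.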

More than just being a way to get labels into databases, keeping humans in the digitization loop – in a function other than error checking – has a lot of great side effects that OCR simply can’t match.  Chris Lintott, one of the original Zooniverse PIs (more on Zooniverse in just a second), gave a great talk at IDCC 2010 in Chicago last winter.  A quick summary follows, but if you can spare half an hour we’d suggest giving his talk a listen.  Dr. Lintott explains:
1) Human transcription increases the chances of serendipitous discovery – would a machine be able to call attention to Darwin’s marginalia?
2) Crowdsourcing can inspire impromptu collaboration amongst strangers – message boards give volunteers a place to do their own coordinating.
3) Crowdsourcing necessarily means that you are staying engaged with a crowd – a group of unpaid strangers that care enough about your science to do it for free.  In early polls of Galaxy Zoo volunteers, over 50% said they contributed because they just liked helping scientists.

This last point is huge.   While the phrase “staying engaged with a crowd” sounds a bit frighteningly close to PR talk, in this case it just means you are keeping people excited about science!  It means you are showing that your collections have scientific AND social worth.  And it means that we are sharing our awesome, enviable vocations with folks that have the interest but maybe not the luck to work in museums themselves.

So, for those who didn’t already jump to the (correct) conclusion, our other two words at the end of last post – “Old Weather” – is an excellent example of productively fun human-computer interaction that links experts, volunteers, and data (and is brought to you by Zooniverse and Vizzuality).  “Old Weather’s” mission statement is simple: users log on to, “Help scientists recover worldwide weather observations made by Royal Navy ships around the time of World War I. These transcriptions will contribute to climate model projections and improve a database of weather extremes. Historians will use your work to track past ship movements and the stories of the people on board.”

The team at Vizzuality, a company that really excels at producing web-based applications that are deep, beautiful and functional, has made transcription of early 20th century ships’ logs… into a video game.  The interface is brilliantly designed; it is very easy to get started; reward systems are in place to keep people engaged. Hundreds of thousands of people are now individually enjoying a game built around digitization.  In less than a year, 555,905 pages of logs have been digitized.    Ship’s log pages are a lot more detailed and heterogeneous than specimen labels, so maybe getting tens or hundreds of millions of labels captured is a possibility using smartly deployed crowdsourcing?

For many, crowdsourcing is a scary proposition; it requires a fundamental rewiring of how we deal with our data – one that forces us to be more open and inclusive, and to think beyond our physical labs, collection drawers and perhaps even institutional identities.  And furthermore – we fully realize that it’s a solution not without drawbacks.  Two that seem obvious to us include: 1) crowdsourcing requires organization and institutional support, and we are painfully aware that not every museum has these, and 2) ownership issues and concerns about control over process and products.   We are sure there are more, so please tell us what flaws you see.

Even given drawbacks, this sort of broad outreach is truly necessary if we want to meet grand challenges like the 100% digitization of natural history collections – and if we want to continue proving our collections’ worth in an age of budget cuts, recession, and the folding of previously untouchable symbols of American research and ingenuity.  Crowdsourcing collections digitization gives us an opportunity to fulfill the fundamental promise of digitization: that it will improve access, use, and integration of biocollections!   Excited?  We are too, especially given  that Zooniverse is actively seeking new denizens – er, projects.

To summarize:  Skip OCR. Bring images to the crowd and make it fun. Better yet, bring images to the untapped resources attached to University museums and collections: swaths of Farmville-addicted undergraduates in lower division biology classes.  Integrate these projects with life science classes AND with Facebook – there is real potential here!

Ok, crowd, tell us what you think!   We’ll buy, well, not a space shuttle, but maybe something natural history-esque for the 50th, er, 25th (!!) commenter (comments MUST be relevant to the topic at hand!).  Prizes also awarded for the funniest-relevant and deepest-relevant.

Posted in Uncategorized | 33 Comments

A confluence of drawers

Confluence. n.  a.  the flowing together of two or more streams  b. the place of meeting of two streams  c. the combined stream formed by conjunction [Merriam-Webster online]

Drawer. n.   a sliding box or receptacle opened by pulling out and closed by pushing in [Merriam-Webster online]

Over the past years, at many collections digitization workshops, one’s head (or at least my head) can get turned around about neat idea this, or amazing technology that.  It can get a little theoretical or perhaps speculative-science-fiction-y fast.   But it raises the question: what are people doing in their collections, right now?  What I have learned is that when it comes to pragmatic choices and space/money/efficiency, there is a lot of reason to be excited and to see, yes, confluences.

I hadn’t really realized how much digitization solutions are beginning to converge until I saw Vince Smith give a (great!) presentation at iEvoBio 2011 on digitizing collections at the Natural History Museum London (NHML).    I don’t want to over-paraphrase his talk, and the slides are excellent (from an earlier version of the talk: http://www.slideshare.net/vsmithuk/scalingup-collections-digitisation), but the gist of it was that at current rates, digitization would take a LONG time: thousands of years.  So the folks at NHML are working with a company called SmartDrive.  SmartDrive builds motorized cameras that move along a track above an object (such as a collections drawer), taking photos.  Vince has been working with them to develop a system to photograph collections drawers at high resolution (more on the company’s hardware, software and approach here:  http://www.smartdrive.co.uk/satscancollections.html) (note: not a pitch, just really interesting and great images of collections drawers!).

The important thing is that with this technology, high resolution, stitched-together images can be generated relatively quickly, scaling down the time it takes to image all the collections drawers from thousands of years to less than ten.  This still leaves “snakes in jars” (see our previous S2I2 post) but we’ll come back to those at some point soon.  What is intriguing is that rather than conflict, we are experiencing _confluence_ in an area where there has been a lot of wailing and gnashing of teeth about how we’ll likely end up with a billion (YES, a BILLION) different solutions.

So what about this “confluence?”   While in Australia hanging out with good friend Paul Flemons (note: currently fu manchu-less), he showed me a similar set up at the Australia National Museum. Again, the idea is to image collections drawers, this time using very high resolution cameras (100 MegaPixels).  Similar approaches using Gigapan (http://www.gigapan.org/) are being pioneered by Andy Deans at North Carolina State University (see their excellent and “insectlent” blog here).  And Paul Tinerella at the University of Minnesota (who is almost 100% likely to be farmer-goatee-ed at this moment) is using a similar solution to first scan many slides of mounted insects en masse, and then automate the disassembly of these slides into single images of a specimen.  The specifics of how the cameras move over the drawers or a set of slides may be different, but the general idea is the same:

Capture a drawer or slide collection quickly --> disassemble the IMAGE into pieces --> Capture labels --> Move data further downstream --> etc.

Confluence.  This is good.
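
For the curious, the “disassemble the IMAGE into pieces” step can be sketched in a few lines.  What follows is a toy illustration only, assuming specimens sit on a perfectly regular grid (real drawers rarely cooperate); it is not how SmartDrive, Gigapan, or any of the groups above actually do it.  The function name `tile_boxes` and the image dimensions are made up for the example.

```python
def tile_boxes(width, height, cols, rows):
    """Yield (left, upper, right, lower) pixel boxes that cover an image in a grid.

    Assumes a regular grid of cells, which is a simplification for illustration.
    Integer arithmetic keeps the boxes gap-free even when the image size is not
    an exact multiple of the grid.
    """
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            upper = r * height // rows
            lower = (r + 1) * height // rows
            yield (left, upper, right, lower)


# A hypothetical 20000 x 12000 pixel drawer image cut into a 5 x 3 grid.
boxes = list(tile_boxes(20000, 12000, cols=5, rows=3))
print(len(boxes), boxes[0], boxes[-1])
```

Each box uses the (left, upper, right, lower) order that the Pillow imaging library’s `Image.crop` accepts, so the pieces could be cut out and passed downstream one specimen (or one label) at a time.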

So what does all this mean?  Well, there are still challenges, especially for insects, where the specimen often occludes the label from top view.  But assuming cameras can move all around specimens to generate photos, the answer is that there may be a fast method to capture LOTS of high resolution data in drawers.  Since Andie is spending part of her summer looking at thousands of little clams stuffed in such drawers, and Rob has even worked on similar clams in the collection he curates, and since there are hundreds of other collections folks doing the same thing, this is a big step forward.

What challenges remain?  Tons.  How are we going to unlock data from a 500MB image of a drawer and use those data most effectively?  Argument:  Data need to be machine readable and properly documented to maximize their use and re-use.  Period.  Images are not so good for that!  If biocollections data have further utility in new kinds of science, it likely relates to having the what, where, when (taxonomy, location, date) information readily available as simple text that is interoperable with other sources of environmental data. What is _excellent_ (so good I am waving my hands in the air with enthusiasm) is that many people are beginning to talk about similar solutions to this challenge of converting the image of a label to text.  That is the subject of our next blog posting, but to presage it, we’ll just say two words:  crowd sourcing.  And two other words:  Old Weather.  See if you can connect the dots!

Posted in collections management, digitization definition, SPNHC | 10 Comments

What does “digitization” mean to you?

In our last post, Rusty Russell at the Smithsonian (and PI on the really neat Field Book Project) left a great question in the comments — what does digitization mean to you?  This is an important question!  What does digitization mean at all??  “Digitization” is one of those unfortunate words that sounds a wee bit dated, as if inspired by technology from Tron*, or taken from this sentence 30 years ago: “better connect that floppy drive and modem to your terminal to help with digitization.”  Furthermore, in terms of etymology, it doesn’t seem to mean what it sounds like: hey, we are not engaged in the process of creating integers or using our fingers.  Nevertheless, it’s the term we use.  So what does it mean?

Fundamentally, to us, “digitization” means getting collections data into digital formats that promote easy discovery, access, and use.  More abstractly (Rob’s in a poetic mood), it means taking the static, hard-edged ink of typewriter ribbon or ball-point pen and converting it into fluid, digital formats – electrons that move easily through air and wires, to be converted back to us as reflected light.   And, because the bar at which we are drinking (lemonade!) is open to library and information scientists and biodiversity types, the focus here is on natural history data sources.

Practically, digitization in our domain can involve a number of activities:

  • Getting catalogs off paper and into databases or spreadsheets.
  • Imaging specimen labels for import into databases (note: interesting two-step here.  Is an image of a label enough? Or is digitization also converting this into machine-readable text?  Is text enough?  Or ought it to be compliant with schemas such as Darwin Core?).
  • Imaging of actual specimens to include with or link to database records (e.g. everything from basic photography to more advanced imaging techniques, like what the folks at Cultural Heritage Imaging do).
  • Scanning field notes, linking field notes back to collections (Andie is particularly interested in this right now).
  • Data clean up or curation – maybe most notably, as is done with georeferencing (Rob’s note: is this really part of digitization?  It isn’t converting analog to digital but enhancing the digital.) (Andie’s rebuttal: it’s definitely part of data curation, and since on-going curation is a necessary part of digitization, it is therefore maybe part of digitization?).
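
To make the "is text enough?" question in the label bullet a little more concrete, here is a minimal sketch of what one transcribed label might look like once it is keyed to Darwin Core term names. The term names (catalogNumber, scientificName, eventDate and so on) are real Darwin Core terms, but everything else is an assumption for illustration: the label values are invented, the REQUIRED list is just our "what, where, when" shorthand rather than any Darwin Core rule, and the helper names missing_terms and to_csv are hypothetical.

```python
import csv
import io

# Hypothetical transcription of a single specimen label.
# Keys are Darwin Core term names; every value is invented for illustration.
label = {
    "catalogNumber": "12345",
    "scientificName": "Peromyscus maniculatus",
    "eventDate": "1911-06-14",
    "country": "United States",
    "stateProvince": "California",
    "locality": "2 mi N of Example Creek",
    "recordedBy": "A. Collector",
}

# Assumption: the minimum "what, where, when" we want from every label.
# This is not a requirement of Darwin Core itself.
REQUIRED = ("scientificName", "eventDate", "locality")


def missing_terms(record, required=REQUIRED):
    """Return the required terms that are absent or blank in a record."""
    return [term for term in required if not record.get(term, "").strip()]


def to_csv(records):
    """Serialize records to CSV text with a sorted union of their terms as the header."""
    fields = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    print(missing_terms(label))
    print(to_csv([label]))
```

The point of the sketch is only that plain text keyed to shared term names can be checked and exchanged by a machine in a way that an image of the same label cannot.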

Finally, it comes with a few caveats:

  • Digital natural history collections cannot replace a physical collection, but physical collections need their data to remain meaningful (see last post)
  • Digitization is an on-going process; paper records aren’t going away anytime soon (see… pretty much any kind of field work.  Laptops are lovely but often unwieldy, and often not water/dust/tar/drop proof.  But paper, to an extent, is!).  There will always be a sometimes frustrating, but totally necessary, interchange between paper and electronic records.  Or are we wrong and paper is already obsolete and we just don’t know it?  You tell us!

So, that’s a very quick version of what Digitization Means to Us.  What does it mean to you?  What have we left out of these lists, and this definition?  What kinds or levels of digitization should be required or expected of a collection? Could we agree that digitizing labels from specimens is a minimum digitization component for natural history collections?**  Comment away….
_________________________________________________________________________

*Andie may have seen Tron: Legacy a little too recently, and may also have liked it a little more than is respectable.  Rob unfortunately kept falling asleep on airplanes when Tron: Legacy was showing, but that doesn’t mean he won’t like it!

**  Rob was taken by a quote from Ilerbaig, J. (2010). Specimens as Records: Scientific Practice and Recordkeeping in Natural History Research. American Archivist, 73(2), 463-482.  The paper discusses Joseph Grinnell’s very meticulous manner of creating natural history records and posits labels as being of central importance:  “All of these elements formed an integrated system, ‘a complex information storage and retrieval network’ at the center of which was a labeling tag attached to each specimen, linking it to specific places in the other elements of the system.”  Thanks to Tiffany Chao for the reference.

Posted in collections management, digitization definition | 6 Comments

SPNHC 2011 and necessary heterogeneity

Believe it or not, Rob and Andie still Think We Can Digitize, but unfortunately had to take a break from blogging due to Rob’s travels taking him antipodal, Mapping Life in Australia, and Andie’s writing of end of term papers and not sleeping.  But all that is done — only to be replaced with new travels and continued lack of sleep, as Andie gears up for a trip Westward and Rob is somewhere in Kenya.  Last week we did manage to once again co-locate, in San Francisco, for the annual SPNHC conference, where we renewed our vows to blog about digitization like no one has blogged before!  Yeah!   So here goes:

This was Andie’s first SPNHC, and despite her prior experience working with a diverse, data-rich and, well, messy, museum collection and paleontological excavation, she was still impressed with the broad spectrum of Collections Folks represented by the SPNHC community: conservators, paleontologists, biodiversity informaticians, imaging specialists, archivists, biologists, population geneticists and more.  However, we were both left with the feeling that not only is our community diverse, but it has a lot on its collective plate — and dealing with priorities remains challenging and perhaps occasionally frustrating.  How does one digitize a collection while one is simultaneously wondering whether or not there is Secret Deadly Arsenic lurking in taxidermied pelts?  How does one have _time_ to digitize a collection when one must also catalog and re-vial a massive number of marine invertebrates, many of which need to be double-vialed because they’re so delicate that even thin paper tags can damage their wee little arthropody limbs*?  And finally — maybe most frustratingly — how does one adequately staff a museum that can only afford two full-time collections staff — but in actuality, really needs at least four specialists in wildly different domains (e.g. conservation, collections management, database administration, systematics, imaging, etc)?  The care of digital objects requires a very, very different skillset from the care of the physical; few folks are well-versed or even necessarily interested in both.

This heterogeneity of skillsets and even interest: this is SPNHC’s — and natural history collections’ — greatest strength… and weakness.  We are this extraordinarily diverse group of humans, tasks, objects and data; and there is ever-present pressure for us to split off into smaller, more manageable, more homogeneous groups.   But here’s the thing: this heterogeneous mass — that _is_ our community; natural history collections aren’t just objects and they aren’t just data — they’re both.   Worse yet, both objects and data exist in the even broader context of knowledge out there.  So natural history collections staffers are stuck accepting and ultimately embracing dualities or pluralities.  Frustrating!  Requires too much school!  But, alas: must be done, somehow.

This doesn’t necessarily mean that fossil preparators need to learn best practices in database architecture, but it does mean that fossil preparators need to be aware that database architecture is A Thing and is just as important to a collection’s sustainability as the appropriate application of acryloid.  And conversely, it also means that metadata specialists and database designers need to remember that technical specs MUST be easily interpreted and implemented by the non-technical people who have no choice but to struggle to learn to implement them!  It also means that fossil preparators have the right to ignore their eager but oftentimes overzealous techie friends and enjoy doing their task well.

In summary: SPNHC served as a reminder that data-minded/digitizing folks need to stay ever connected to the physical objects they are digitizing, and physical object folks need to remember that natural history specimens are next to useless without their well-curated, accessible data, or easily queried collections catalogues.  Increasingly, accessibility of data means that it must be digitized and published in ways that let those data be discovered and used most effectively by the greatest number of people.  We think this means that NH collections folks need to stop thinking of digitization of their collections as a task above and beyond normal everyday curation, and start thinking of it as something that must be done along with writing out catalogue ledgers and maintaining stringent pest control.  Keynote speaker Craig Moritz spoke of a vision for a 21st century Museum.  We think that Museum is going to be an awesome, if perhaps just a bit busily schizophrenic, one.  It’s a place we’d both like to work, because we think we can digitize.

One last note: Andie was particularly heartened by a strong showing of smaller, not necessarily grant funded digitization projects in the Friday AM poster session (though Andie is maybe biased because she spent that Friday AM standing next to her small, not grant funded digitization project).  We think this shows evidence of digitization activities being increasingly incorporated into everyday collections management.  Readers — what did you think?  Were you at SPNHC?  Tell us your stories!

* per LACMNH’s Emma Freeman’s talk

Posted in collections management, data deluge, SPNHC | 7 Comments