The Value of Digitization for Libraries and Humanities Scholarship

John Unsworth

Dean, Graduate School of Library and Information Science
University of Illinois, Urbana-Champaign

The British Library, May 13, 2008

[Image from the Blake Archive]

Digitization and Libraries

"Digitization" implies the production of a digital surrogate for a physical object. Obviously, we don't speak of "digitizing" something that's already digital. And in the context of our discussion today, it is the digitized, not the born-digital, artifact that is most important, because the most common kind of digital artifact in library collections today is a digital surrogate for a physical artifact. For that reason, too, the most important questions about the value of digital artifacts, at the moment, are questions having to do with the artifact as surrogate. Chief among those questions are:

Many of these questions are treated in some detail in Paul Conway's contribution to The Handbook for Digital Projects: A Management Tool for Preservation and Access, where he writes:

"The Preservation Purposes of the Digital Product [include efforts to]. . . . Protect Originals. . . . Represent Originals. . . . [and] Transcend Originals. . . . In a very small but increasing number of applications, digital imaging holds the promise of generating a product that can be used for purposes that are impossible to achieve with the original sources. This category includes imaging that uses special lighting to draw out details obscured by age, use, and environmental damage; imaging that makes use of specialized photographic intermediates; or imaging of such high resolution that the study of artifactual characteristics is possible."

While Conway makes clear the promise of the digital surrogate, the risk posed by these surrogates is presented by Angelika Menne-Haritz and Nils Brübach, in "The Intrinsic Value of Archive and Library Material":

"the loss of testimony is endangered, not only through . . . physical degeneration . . . but also through the unconscious destruction of evidence as to the context and circumstances of their origin, which can occur during their conversion and must therefore be prevented by a previous analysis of . . . intrinsic value."

The problem to which Menne-Haritz and Brübach refer is not unique to digital surrogates, by any means: bad editions in printed form pose the same threat, and indeed the early history of printing is, in part, a history of the loss or destruction of manuscript materials "replaced" by printed versions—the sources for which are now both undocumented and unrecoverable. In any case, these German archivists present the most reductive view of the value of digital surrogates, saying

"The loss of evidential value and permanent accessibility inherent in digital forms and textual conversion [by OCR] exclude them as a preservation medium. They can only be employed in addition to preservation on film in order to increase the ease of use," ("The necessity of criteria for conversion procedures" in "The Intrinsic Value of Archive and Library Material")
and, at another point, flatly stating that:
"digital imaging is not suitable for permanent storage." ("Imaging" in "The Intrinsic Value of Archive and Library Material.")

A preservation program based entirely on film, with digital surrogates used only for distribution of photographic images, may not be practical in all cases, though—and it is at this point that we must confront the differing missions of libraries and archives. Archives may well decide that issues of evidential value rule out "digital forms and textual conversion," whereas libraries might reasonably feel, in certain cases, that their mission of preserving and providing access to (fungible) information is adequately served by providing digital surrogates.

In fact, it is probably impossible to give a single answer to the question "What is the value of a digital surrogate?" since the answer depends, to a large extent, on the nature of the original and the conditions of its use. Therefore, as a means of determining the value and appropriate use of digital surrogates for library holdings, it may be useful to divide the original materials into those that are rare and those that are not, and to divide them further into those that are frequently used and those that are infrequently used. There would be, then, four possible cases:

1. Materials that are not rare and that are frequently used:

In this case, we can assume that preservation of the original is not a particularly high priority (since the original is not rare); nevertheless, digital surrogates for such an object might be worth producing and providing, for several reasons:

The first two are obvious and uncontroversial benefits. The third is potentially problematic, even if the object in question is not rare, because it is not obvious that digital surrogates provide all the functionality, all the information, or all the aesthetic value of originals. Therefore, while it may be sensible to recommend that digital surrogates be used to reduce the cost and increase the availability of library holdings that circulate frequently, the decision to deaccession a physical object in library collections and replace it with a digital surrogate should be based on a careful assessment of the way in which the original object (or objects of its kind) are used by library patrons. It is not necessary that the digital surrogate possess all the qualities and perform all the functions of the (not rare) original, but it is necessary that the digital surrogate answer to the identifiable needs and expectations of those who frequently used the original.

2. Materials that are not rare and that are infrequently used:

My guess is that this is the category into which the NCSE serials fit. For example, The English Woman's Journal, though not available in the library at my home university, is available in 57 libraries worldwide (according to WorldCat) and in two libraries in the state of Illinois. Many libraries now store infrequently used books (and other materials) in long-term storage facilities. Those materials are retrievable and available to library patrons, including by interlibrary loan, but only after a wait of two or three days. With such materials, digital surrogates might:

Again, the first two are clear and uncontroversial benefits, and the third comes with the caveat, as in 1., that the digital surrogate should answer to the identifiable needs and expectations of those who (in)frequently used the original. At some point, of course, especially with infrequently used materials that are not rare, libraries might reasonably be expected to evolve a calculus that balances functionality with actual use, in order to help decide when digital surrogates that provide most of the functionality of originals are acceptable.

There is one other point that needs to be raised, especially here, where we are discussing the component of library collections that has the least "market value." Libraries, as an institutional and cultural community, need to consider whether these infrequently used and commonly held materials are, in fact, being preserved in a concerted and deliberate way in their original form by any one (or more than one) library. If they are not, the sources for digital surrogates that are common today could easily become rare, or non-existent, tomorrow. This is the substance of Nicholson Baker's objection to libraries discarding their newspaper holdings. If there are fifty libraries that hold the same issues of the same newspapers in original form, at great expense and with limited use, then it is difficult to make the case that all of them should pay to house, shelve, reshelve, and preserve the originals, but if forty-nine of those libraries, over time, have replaced their physical holdings with digital surrogates, one certainly hopes that the fiftieth library would be aware that its physical holdings were now rare, and therefore subject to considerations outlined in cases 3 and 4, below.

3. Materials that are rare and are frequently used:

In this case, the principal (and very obvious) benefits of digital surrogates are:

Few would argue that truly rare materials should be replaced by digital surrogates: digital technology, and techniques of digitization, are so new, and are still developing so rapidly, that we can't have any confidence we've devised the best method for extracting and digitally representing information from any analog source (whether it is a printed page, an audio tape, or a film strip). Nonetheless, digital surrogates could, in many cases, stand in for rare and frequently used materials, and could thereby aid in the preservation of originals.

4. Materials that are rare and are infrequently used:

On the face of it, these materials seem the least likely to be represented with digital surrogates, if only because digitizing is expensive. On the other hand, if the cost of housing a rare but infrequently used object rises high enough, then digitizing and deaccessioning that object may become an attractive possibility. Here again, as in 2, above, one hopes that libraries, as a community, are aware of the lastness, the actual or potential rarity, of even those materials used infrequently today. Tomorrow, those may very well be the most valuable of artifacts, perhaps for users, or uses, that one could not predict today.

Having considered these four alternate conditions, let us revisit the questions with which we opened this discussion of digital surrogates, and try now to provide some answers to those questions:

When can a digital surrogate stand in for its source?

When it answers to the needs of users.

When can a digital surrogate replace its source?

If the source is not rare.

When might a digital surrogate be superior to its source?

In cases where remote or simultaneous access to the object is required, or when software provides tools that allow something more or different than physical examination. When the record of the digital surrogate finds its way into indexes and search engines that would never find the physical original.

What is the cost of producing and maintaining digital surrogates?

The cost of producing digital surrogates depends, among other things, on the uniformity, disposability, and legibility of the original. The cost of maintenance depends on frequency of use and the idiosyncrasy of format, but beyond that it depends on technological, social, and institutional factors that are difficult or impossible to predict—which is an important reason for being cautious when one chooses to replace a physical object (the maintenance costs for which are known) with a digital surrogate (the maintenance costs for which are, to some extent, unknown).

What risks do digital surrogates pose?

The principal risk posed by digital surrogates is the risk of disposing of an imperfectly represented original because one believes the digital surrogate to be a perfect substitute for it. Digital surrogates also pose the risk of providing a partial view (of an object) that seems to be complete, and the risk of decontextualization—the possibility that the digital surrogate will become detached from some context that is important to understanding what it is, and will be received and understood in the absence of that context.

Digitization and Humanities Scholarship

Up to this point, we've been looking at the cost-benefit analysis of digitization from a library perspective. I would like to consider it also from the perspective of the humanities scholar. Certainly, libraries collect and preserve (and winnow, and deaccession) materials in order to serve some other purpose: in the case of academic libraries, scholarship is one such purpose, and nowhere in scholarship are libraries more important than in the humanities, where for centuries the library has been the laboratory. But the costs and the benefits of what goes on in that laboratory may look different to the person who runs the lab than they do to the person who uses it.

The most obvious benefit of digitization, for the humanities, is access to primary source materials. The aggregation of these resources, in digital form, constitutes a new kind of resource for humanities scholarship and teaching. For example, a web site that offers you six of Blake's differently illuminated printings of "The Marriage of Heaven and Hell," the originals of which reside in four different institutions, provides an opportunity for comparative scholarship and teaching that isn't available from the synthetic Erdman edition of Blake, or even from the Blake Trust/Princeton University Press Illuminated Books of William Blake (volume 3: The Early Illuminated Books).

This sort of benefit from improved access to (digitized) primary resources is the one that was first obvious, and that is now most widely understood among scholars. This benefit will obviously accrue to users of the Nineteenth-Century Serials Edition, and so I'd like to talk for a couple of minutes about the new opportunities for traditional scholarship that are generally created by the conversion of primary resources to digital form.

At present, any scholarship involves the use of some digital tools—for example, a library catalogue, Google, and so on. Furthermore, certain resources you might find through such tools, in the course of your research, might themselves be available as full-text electronic resources. There are now many primary resources in digital form on the web—some commercial and licensed, some non-commercial and free. In fact, although individual objects of our attention might be categorized as digital or analog, scholarship itself is now a continuum, in which all activity falls somewhere between those two points, and almost nothing is completely non-digital, or non-analog.

So, what new opportunities for scholarship are presented by the existence of digital primary resources? Our habits of research in the humanities, and particularly in literary study, can be affected—sometimes renovated, sometimes mooted—by several kinds of novelty:

Digital primary resources are already quite interesting in the first way—the digitization of cultural heritage materials has made available many rare materials, and many underutilized materials as well. Many in humanities departments who have embarked on digital research projects in the past ten years have faced the problem of having to create their own digital primary resources first, in order to enable scholarship, but that situation is really changing now—there are now some very substantial collections of primary materials that were, in their predigital form, difficult to find, difficult to get to, or difficult to use. All of this opens new possibilities for archival research projects, especially for graduate students, who generally lack travel budgets.

New perspectives on familiar materials are also available, as a result of the creation of digital primary resources. As an example here, I return to The William Blake Archive, which presents full-color images, newly transcribed texts, and editorial description and commentary, on all of Blake's illuminated books, with non-illuminated materials (manuscript materials, individual plates and paintings, commercial engravings, etc.) now coming on line. The Blake Archive makes it practical to teach Blake as a visual artist, by the simple fact of the economics of image reproduction on the web, and this is a fundamental change from the way I was taught Blake, through Erdman's text-only synthetic edition (which is also, by the way, available on the site).

There's another, deeper sense in which digitization makes available new perspectives on familiar materials, and this is, to begin with at least, a deeper impact on a smaller number of people, because it has to do with scholarly involvement in the process of digitization itself. Here as well the NCSE project is an excellent example. With respect to the humanities, objects of study can be images, texts, sounds, maps, performances, concepts, three-dimensional objects. When we make a digital surrogate for any one of these, we always believe that our aim is to represent it as accurately, as faithfully as possible, with the least possible interference, or noise, in the process—but when, as scholars, we deal with these digital surrogates, or produce our own, we learn that there's no such thing as an innocent act of representation: every representation is an interpretation.

Simple questions, like "is this poem a separate work, or is it part of a larger set of poems?" can be unavoidable—in markup, for example—and they can also raise issues that are critical to understanding the work in question. However we decide the question, we are both informed and constrained by our own decisions, when subsequent and related issues arise. Likewise, with images, when we digitize, we choose file-type, compression, color-correction, and other settings based on what we consider valuable and significant in the image—and when our chosen strategy is applied across a large body of images, or when others come to our digital surrogate with purposes we hadn't shared or predicted, we are bound to confront the fact that our surrogate has been shaped by the perspective from which it was produced. In this sense, the real value of digitization for humanities scholarship is that it externalizes what we think we know about the materials we work with, and in so doing, it shows us where we have overlooked, or misunderstood, or misrepresented significant features of those materials.
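
To make the stakes of such a decision concrete, here is a minimal sketch, with invented element names and an invented counting rule (this is not actual TEI or NCSE markup), of how two defensible encodings of the same pair of sonnets yield different answers to the question "how many works are here?":

```python
import xml.etree.ElementTree as ET

# Two defensible encodings of the same two sonnets: the markup records
# an interpretive decision about where one "work" ends. Element names
# are illustrative inventions, not actual TEI or NCSE markup.
as_one_work = """<group title="Sonnets to Delia">
  <poem n="1"><line>Unto the boundless ocean of thy beauty</line></poem>
  <poem n="2"><line>Go, wailing verse, the infants of my love</line></poem>
</group>"""

as_separate_works = """<collection>
  <poem title="Sonnet 1"><line>Unto the boundless ocean of thy beauty</line></poem>
  <poem title="Sonnet 2"><line>Go, wailing verse, the infants of my love</line></poem>
</collection>"""

def count_works(xml_text):
    """How many 'works' does this document contain? The answer is
    dictated by the markup decision, not by the source itself."""
    root = ET.fromstring(xml_text)
    if root.tag == "group":           # a sequence encoded as one composite work
        return 1
    return len(root.findall("poem"))  # each poem encoded as a work in itself

print(count_works(as_one_work))        # 1
print(count_works(as_separate_works))  # 2
```

Any later query that counts, indexes, or links "poems" inherits whichever answer the encoder chose, which is exactly the sense in which the decision both informs and constrains subsequent work.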

No better example of this could be found, I think, than the documentation on the "work in progress" page of the NCSE project, which lays out the "working periodical template," "periodical snapshots," "datamap," "concept map," "segmentation policies," and "metadata schema" that have evolved during the course of the project team's effort to analyze and represent its six 19th-century serials. I'll quote just briefly from the description of the datamap here, in order to explain what I mean by this deeper impact of digitization. The datamap is a map of "data fields" in which the content of the NCSE primary materials will be represented, and it maps the relationships between those fields. Once an initial sketch of the map was prepared, it was tested against the primary sources in "a page turning exercise in which the team assimilated new data fields occurring in the source materials into the map and also reconfigured the map as appropriate." The team that went through this exercise noted that "this work required interpretation at every stage, our abstract conceptualisation of the source materials becoming increasingly concretely represented in the map as it was developed." Even so, the data don't always obey the map:

The creation of the map has flagged up some potential challenges in the way in which our data might be rendered. As is evident from the map, there are instances where relationships between its fields skip levels (e.g. department items), and some items 'float' and can exist at almost any level (e.g. price). The dilemma facing ncse is thus whether to enforce an artificial framework upon the sources (top-down) or to attempt to adapt the framework to the sources (bottom-up).
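
The top-down/bottom-up dilemma can be sketched in miniature. In the toy model below, the level and field names are illustrative inventions, not the actual ncse datamap; the point is only that a rigid schema makes a "floating" field like price show up as a violation until the framework is adapted to the sources:

```python
# A toy version of the datamap dilemma: a rigid top-down schema fixes
# where each field may occur, but the sources don't always comply.
# Level and field names are illustrative, not the actual ncse map.
TOP_DOWN = {
    "edition":    {"title", "date", "price"},
    "department": {"title"},
    "item":       {"title", "author"},
}

# Observed while "page turning" through the sources: an item that
# carries its own price, i.e. a field 'floating' below its level.
observations = [
    ("edition", "price"),
    ("item", "price"),
    ("department", "title"),
]

def violations(schema, observed):
    """Fields the sources use at levels the schema forbids."""
    return [(level, field) for level, field in observed
            if field not in schema[level]]

before = violations(TOP_DOWN, observations)  # top-down: enforce the framework
TOP_DOWN["item"].add("price")                # bottom-up: adapt it to the sources
after = violations(TOP_DOWN, observations)

print(before)  # [('item', 'price')]
print(after)   # []
```

Either resolution is a scholarly claim about the materials: enforcing the framework asserts that the stray price is an anomaly; relaxing it asserts that price genuinely belongs at more than one level.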

For me, this is very reminiscent of the exercise of developing the original SGML Document Type Definition for the Rossetti Archive, in which we went through an iterative process of modeling the components of Rossetti's paintings and poetry, an exercise that forced an explicit discussion of the nature of these materials, the relations between their parts, and the rules that could be deduced to govern the markup that would represent them. I guarantee that, in both of these cases, unless digitization of the materials had been involved, and unless the scholar-expert had been party to that digitization, these discussions would never have taken place, and this explicit specification of the scholar's understanding of the materials would never have emerged.

If the first value of digitization for humanities scholarship is that it makes rare materials more available and more useful to many people who use the digitized representation, a second and deeper value of digitization for humanities scholarship is that it externalizes interpretation, re-presents it to us in the form of the surrogate, and forces us, as humanities scholars, to confront and evaluate our beliefs and understanding concerning the object of digitization, as well as our perspectives and purposes with respect to it. Of course, it can only have this effect if the scholar is actually involved in the process of digitization, at some level: otherwise, what would be self-criticism and self-understanding becomes simply the criticism of the shortcomings of a non-specialist. On the other hand, as NCSE also demonstrates, a successful effort of this sort requires more than just subject-area specialists: it requires the collaboration of librarians, programmers, markup experts, and others. To take just one of the possible conjunctions here, scholars can learn a great deal from the expertise of librarians in cataloging and classification, in information organization, in preservation and access. By the same token, librarians can learn a great deal about the peculiar and idiosyncratic characteristics of individual works, or authors, or movements, or literatures, by working with specialists who know—or think they know—all the features and fine points of that material. Similar complementarities are at work in successful collaborations with programmers, markup experts, publishers, and others.

Next up, and less obvious to scholars at this point, is the opportunity to apply new and computational methods to large digital collections like those being produced by Google Books and the Open Content Alliance. These collections, which aggregate materials on a scale never previously encountered, call out for new methods of discovery and analysis. The methods in question are new to literary studies, to be certain, but not so new to computational linguistics, nor to people who do automated learning in other contexts. Text mining is not really our topic today, but I will just close by making a couple of observations that might be considered as advice for the future development of NCSE.

In my experience, over the last three years of aggregating library- and scholar-produced collections of digital primary resources for the purpose of text-mining, texts that are prepared with the notion that they will always be used in the same way, for browsing and searching, in the same environment for which they were originally prepared, have a tendency to leave certain kinds of information implicit: it's implicit elsewhere in the system, and not explicit anywhere in the text itself. Once you start to aggregate these resources and combine them in a new context and for a new purpose, you find out, in practical terms, what it means to say that their creators really only envisioned them being processed in their original context. For example, the texts don't carry within themselves a public URL, or any form of public identifier that would allow me to return a user to the public version of that text. They often don't have a proper Doctype declaration that would identify the DTD or schema according to which they are marked up, and if they do, it usually doesn't point to a publicly accessible version of that DTD or schema. Things like entity references may be unresolvable, given only the text and not the system in which it is usually processed. The list goes on: in short, it's as though the data has suddenly found itself in Paddington Station in its pajamas: it is not properly dressed for its new environment. It's not enough, either, to say that TEI provides interoperability, or even to say that your text-preparation practices are fully documented, though both of those things help.
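
A minimal sketch of the kind of portability audit this implies; the checks and the sample fragment are my own illustrations, not an actual NCSE or MONK tool:

```python
import re

def portability_report(xml_text):
    """Flag information that an aggregated text ought to carry within
    itself, but often leaves implicit in its home system. The checks
    are illustrative, not an exhaustive audit."""
    problems = []
    if "<!DOCTYPE" not in xml_text:
        problems.append("no Doctype declaration: DTD/schema unidentifiable")
    if not re.search(r"https?://|urn:", xml_text):
        problems.append("no public URL or identifier for the source text")
    # Entity references beyond the five XML built-ins are unresolvable
    # without the DTD that the home system silently supplied.
    builtins = {"amp", "lt", "gt", "quot", "apos"}
    for entity in sorted(set(re.findall(r"&(\w+);", xml_text))):
        if entity not in builtins:
            problems.append("unresolvable entity reference: &%s;" % entity)
    return problems

# A fragment that works fine in its home system, but arrives at
# Paddington Station in its pajamas:
sample = "<text><p>L&rsquor;Album &amp; other serials</p></text>"
for problem in portability_report(sample):
    print(problem)
```

Running such checks before aggregation makes the implicit dependencies visible while the home system, and the people who built it, are still at hand to resolve them.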

In our text-mining projects, NORA, WordHoard, and MONK, the team I work with has spent considerable time developing automated strategies for converting source materials from various collection-specific formats into a common analytic format we're calling TEI-A (A for analytic). At first, frankly, I hoped that this wouldn't be necessary, but it is: text-mining requires a very granular level of access to the text (words, or even lemmata, with normalized spelling), and it requires a consistent method for chunking text into statistically comparable fragments (like chapters or paragraphs). If you aren't able to support either of these requirements, then your results will be noisy, unreliable, and in the end not very useful, no matter how sophisticated your tools. However, if you can support such requirements, there are all kinds of interesting patterns to explore in large collections—from changes in vocabulary over time, to grammatical patterns by gender, to the rise and fall of various trends in literature of different sorts, to the mapping of relationships among characters, or among authors, or among concepts.
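
The two requirements just named can be illustrated in a few lines. The chunking rule, the spelling table, and the sample text below are all invented for the example; this is not the actual TEI-A pipeline:

```python
import re
from collections import Counter

def chunk_paragraphs(text):
    """Split a text into statistically comparable fragments
    (paragraphs, here delimited by blank lines)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# A toy normalization table; a real pipeline would use a much
# larger resource for variant and historical spellings.
SPELLING = {"connexion": "connection", "shew": "show", "to-day": "today"}

def normalized_counts(chunk):
    """Tokenize and normalize spelling before counting, so that
    variant forms are comparable across a collection."""
    words = re.findall(r"[a-z\-]+", chunk.lower())
    return Counter(SPELLING.get(w, w) for w in words)

text = """In this connexion we shew the reader our purpose.

To-day the connection is plain; we show it again."""

chunks = chunk_paragraphs(text)
totals = sum((normalized_counts(c) for c in chunks), Counter())
print(len(chunks))           # 2 comparable fragments
print(totals["connection"])  # variant spellings merged: 2
print(totals["show"])        # 2
```

Without the normalization step, "connexion" and "connection" would be counted as unrelated words, and any claim about vocabulary change over time would be contaminated by mere orthographic drift.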

NCSE is an excellent example of a humanities digitization project that has made a very useful and interesting set of materials available, in a very high-quality digitized form, for research and teaching, and inasmuch as scholars and students can satisfy their needs by searching and browsing these six serials, all's well. But inevitably, both researchers and students will come up with reasons to want to aggregate the resources you are providing with other materials, at some point—perhaps not for text-mining, but for some other purpose (a study of writing for and about women, in serial form, in the 1850s, for example). When that request for aggregation comes, it may be simply to search and browse this collection along with others, or it may be a request for the underlying xml, but in either case, decisions that have been carefully considered and arrived at, with respect to the chunking, naming, rendering, and conceptualization of this collection will be raised again, by the broader context. And that's OK, I think, although right now, the prospect of having to reopen any of these questions is probably a horrifying thought to the NCSE team. For scholarly communities, the benefit of doing this will be the same as the individual scholar derives from engaging with digitization in the first place: it will make explicit the extent and the limits of shared understanding, shared ontology, and shared purpose within that community, with respect to the objects of its attention.

In closing, I want to return to the library perspective for a moment. The aggregation that I'm talking about here has the potential to provide a longer-term preservation benefit, as well as the immediate occasion for self-understanding in scholarly communities: if library collections are taken out of their native technological and intellectual context in order to be used in new ways, especially in ways that go beyond what was envisioned by their creators, then their weaknesses, idiosyncrasies, oversights, and lacks will be exposed in a way they never would be in their home environment. This is key, actually, to the long-term survivability and usefulness of these collections.


Some of the foregoing was originally drafted by the author for The Evidence in Hand: Report of the Task Force on the Artifact in Library Collections, published in November 2001 by the Council on Library and Information Resources.

Menne-Haritz, Angelika, and Nils Brübach. "The Intrinsic Value of Archive and Library Material." Digitale Texte der Archivschule Marburg Nr. 5.

Sitts, Maxine K., ed. Handbook for Digital Projects: A Management Tool for Preservation and Access. First Edition. Andover, Massachusetts: Northeast Document Conservation Center, 2000.
