Second-Generation Digital Resources in the Humanities

John Unsworth

DRH 2000 – September 10, 2000 – Sheffield, UK

 

Thank you very much for inviting me to provide the keynote address for this year’s Digital Resources in the Humanities conference—it’s a conference I have enjoyed attending in years past, and I believe it is unique in focusing on the intersection of scholars, publishers, and libraries involved in the creation of digital resources in the humanities.  It also provides an excellent venue for the subject I would like to discuss today—second-generation digital resources in the humanities—because at the heart of this subject is the need to orchestrate some experimentation with digital resources that circulate across all three of these communities.  And although my subject-matter originates in one of those three, namely scholars who produce new, born-digital scholarship based on digital primary resources, my point today is that all of us need a better understanding of the features and characteristics that these creations must have in order to satisfy the needs of scholars, to be published by publishers, and/or to be collected and maintained by libraries.

 

First, though, allow me a few pre-emptive rejoinders before I begin: I know that the kind of things I’m calling “second-generation digital resources” have existed for some time, and I also know that some of the characteristics that distinguish them can be found in what I’ll be calling first-generation digital resources, but I would nevertheless maintain that

1)      Most scholar-produced digital resources now in existence have finessed the second-generation problem by producing their own digital primary resources rather than basing their work on digital primary resources produced by libraries or publishers;

2)      Although we have Digital Object Identifiers and Dublin Core and some other standards, we have very little, if any, experience with the actual circulation of digital resources through the various stages in the life-cycle of scholarly information—and of that little experience none, as far as I know, has been acquired in coordinated experimentation involving all parties to that cycle; and

3)      The presence of the scholar as author or editor of a digital resource raises new problems for publishers and libraries, just as the absence of publishers or libraries may raise new problems for the author.

 

So, what are “second-generation digital resources in the humanities?”  They are, I would say, “originally digital scholarly (or other) creations that call digital primary resources (produced and maintained by others) into play.”  That is to say, they are:

1)      Born digital rather than digitized;

2)      Complex, potentially multi-author, potentially very large collections of multimedia, including structured data, possibly in SGML/XML or databases—not just primary records, but also commentary, annotation, editorial apparatus, and other “secondary” materials;

3)      Produced on the basis of, in response to, and/or using digital primary resources.

For that matter, what are digital “primary” resources?  I would say they are “either digital surrogates for physical artifacts or born-digital ‘evidence’ for a secondary resource.” Obviously, then, once second-generation digital resources exist, they may find themselves playing the role of digital primary resources in someone else’s scholarship. 

 

So, who produces these second-generation resources?  Scholars and artists are the people I have in mind, for the most part.  Publishers and libraries (the major producers of digital primary resources to date) generally don’t produce resources of the second-generation sort—libraries don’t, because they’re not in the business of generating, but rather of collecting, original scholarly output; publishers don’t, because they are in business—which is to say that the only digital resources they are likely to call into play are ones that they already own and control (for example, cross-referencing within one publisher’s ejournals).  Scholars are generally more promiscuous in their range of reference.

 

Who needs to worry about the next generation of digital library activity, in which scholars compose new, originally digital materials that make use of the substantial collections of digital primary resources now becoming available?

1)      Publishers do, because they need to capture this emergent type of scholarly publishing or risk losing their relevance to the scholarly community, and

2)      Libraries need to worry about it too, because they need to collect these new publications, or risk losing a generation of scholarly output, but

3)      Authors need to worry most of all, because if the problems I’m about to enumerate are not effectively addressed, they will have to continue building their own libraries—their own collections of digital primary resources—and acting as their own publishers, as most scholars now producing born-digital scholarly research have generally been doing. 

 

Now that we’re all worried, what are we worrying about?  What new problems do second-generation digital resources raise? 

 

First, by definition these are multilateral publications: they are not conceived, designed, and produced under the sole control of one entity, as most digital primary resources have been, so far.  This characteristic of the second-generation digital resource means that many different perspectives will be in play, most notably the perspectives of authors, whose commitment to the subject-matter may, often will, drive them to demand that it receive special, idiosyncratic, and expensive treatment.  Moreover, if there are several authors involved, it is entirely possible that their perspectives may not agree with one another, and a certain amount of time and effort will be devoted to reconciling these differing editorial or authorial perspectives—one with another, and each with the perspectives and requirements of library and/or publisher.  Furthermore, since many people or institutions will have had a hand in creating these multilateral resources, the question of legal, economic, and intellectual rights and responsibilities with respect to a resource will be quite complex. 

 

Second new problem: these second-generation resources tend to be dynamic, in several important senses:

1)      They often continue to develop over time, perhaps by adding new material to existing structures, in existing forms, but perhaps also by changing those basic structures and adding new forms.

2)      They exist in a dynamic medium, which itself changes protocols, standards, data formats, etc. more or less constantly.  Change-proof SGML must now become change-proof XML, and so on.

3)      They are, in a sense, disbound—their parts can move and change separately from one another, so they need to be managed individually, unlike, say, the pages in a book. 

 

What’s the institutional impact of all of this? 

 

For publishers, it means first and foremost that they’ll have to start thinking more like libraries, or working in a new way with libraries, because if authors indulge in promiscuous reference—in which a publication of Johns Hopkins points to (or worse, incorporates) a primary resource published by Oxford—then authors will expect the object of reference to stay put, and readers will expect to be able to follow the link to it or find it as an element of its secondary context.  And it also means, for publishers, that it becomes difficult to find economies based on scale or on re-use, since authors want everything designed from scratch and the medium seems to require everything to be redesigned every three years or so.  Mostly, though, it means that, for publishers, authors are an even bigger nuisance than they used to be—more involved and more demanding, with a role less compartmentalized than it used to be.

 

For libraries, second-generation digital resources also have a number of institutional implications.

 

First, as with publishers, libraries find that they are locked into their systems of dissemination by virtue of outside reference to items in them: if you change the shelving location of some books and their call numbers stay the same, patrons (after stumbling around the stacks for a while) will find those books; if you relocate a few gigabytes of digital primary resources and you have not in any way abstracted names from locations, then the scholar whose publication incorporates those resources will find they have dropped out of her publication, and will protest.  In other words, new kinds of cataloguing problems arise—and not only the problems of uniform, stable, and abstract resource names, but also problems having to do with the description of new data types, the management of multipart objects, and the potentially difficult and nuanced conditions of use that come with these objects.
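To make the idea of abstracting names from locations concrete, here is a minimal, hypothetical sketch; the class, identifiers, and URLs are my own illustration rather than any particular library's system. Scholarship cites a stable, opaque name, and a resolver maintained by the library maps that name to whatever location currently holds the bytes, so that relocating the bytes does not break the citing publication.

```python
# A minimal, hypothetical sketch of "abstracting names from locations": publications cite
# stable names, and a library-maintained resolver maps each name to its current location.
# The identifiers and URLs below are invented for illustration only.

class NameResolver:
    def __init__(self):
        self._locations = {}  # persistent name -> current URL

    def register(self, name: str, url: str) -> None:
        self._locations[name] = url

    def relocate(self, name: str, new_url: str) -> None:
        # The library moves the bytes; the name cited in scholarship never changes.
        self._locations[name] = new_url

    def resolve(self, name: str) -> str:
        return self._locations[name]

resolver = NameResolver()
resolver.register("lib:salisbury-0042", "http://old-server.example.edu/images/0042.jpg")

# A scholar's publication refers to "lib:salisbury-0042", not to the URL.
resolver.relocate("lib:salisbury-0042", "http://repository.example.edu/objects/0042")
print(resolver.resolve("lib:salisbury-0042"))  # still resolvable after the move
```

The hard parts, of course, are institutional rather than technical: agreeing on who assigns the names, who runs the resolver, and for how long.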

 

Second, for libraries, wholly new collections-development problems arise, and new criteria need to be developed for selecting born-digital scholarly publications.  Along these same lines, new criteria need to be developed to select for preservation; obviously, new methods of preservation will be required as well. 

 

Third, libraries will find themselves dealing with the author as patron, or the patron as author.  Especially in cases where the library acts as publisher of a second-generation digital resource, libraries will find themselves grappling with the same demands that authors otherwise refer to publishers.  In other words, perhaps for the first time in library history, the author is a nuisance.

 

But I think authors will have to be forgiven for being a nuisance, because theirs is the greatest suffering of all.  First, assuming that the possibilities of the new medium tempt them into engaging it, they find themselves launching ten-year projects in a medium that seems to change every ten minutes; meanwhile, the academic institutions that evaluate their work seem impervious to change.  Next, they find that in order to record what they know in a manner insulated from the rapid change in mechanisms of presentation, they are required to express themselves in a formal language, and with a degree of disambiguation, that is unfamiliar, unwieldy, and often unwelcome (because the death of ambiguity is also the death of nuance, metaphor, and poetry).  On the other hand, the medium seems to elicit scholarly hubris: an infinite amount of material can be stored here, sorted, analyzed and retrieved instantly, provided only that decades of grinding labor have made it all possible, that all the rights have been cleared, and that no fatal mistakes were made, years ago, when the project was designed.  Finally, they suffer unshielded exposure to the mechanisms of production and distribution.  As book-authors, they never needed to know how books were bound or why they were one size and not another: publishers worried about those things.  Now—especially when the author self-publishes or works with a library to produce a resource—the author has to engage directly in the standard practices that make it possible for information to be transported, stored, retrieved, and preserved: this is what it really means for authors to become “content providers.”   And even if the author has a publisher, few things about the relationship with that publisher will be clear in advance, and many of the traditional publisher-functions may become author-functions, as the new relationship takes shape.

 

So, what do we need to do in order to prepare the ground for second-generation digital resources in the humanities? 

 

We need to do some end-to-end projects that involve authors, publishers, and libraries in a coordinated (and documented) joint effort.  Too many of the digital resources that now exist, first- or second-generation, are produced with input from only one point on the life-cycle of scholarly information—produced by a library, or a publisher, or a “free-range” scholar, working without the benefit of other perspectives.  We need more collective experience, based on actual content, and more mutual understanding, based on ongoing conversations, if we’re going to produce resources, and tools, that satisfy the requirements of contributing parties at each point in that cycle. 

 

Failing that, we need to do more bilateral projects that involve, say, libraries and scholars or scholars and publishers.   I deliberately downplay the library/publisher dyad, because there are already some examples of that kind of collaboration, and because the digital resources now produced by publishers and sold to libraries tend to be uniform and not very dynamic—collected bodies of literature (the English Poetry Database) or perhaps digital surrogates for print journals, and these resources are also (often) not managed directly by the library.  Library-publisher collaborations are important, obviously—there are a number of issues that need to be worked out at that intersection—but even though scholarly use is the purpose of library- and publisher-created collections, scholars are generally not involved in the creation of these collections.  Meanwhile, scholar-produced resources that have no affiliation with library or publisher greatly outnumber those that do, and few if any have affiliations with both.

 

What’s at issue here is basic scholarly infrastructure—the real “information superhighway” system that needs to be built in order for research and teaching in the humanities to survive the advent of the Web.  We haven’t built it yet, and it needs to be built.  We haven’t really even designed it: yes, we’ve designed some of the standards we’ll need to build it, and we’ve developed some of the methods we’ll need to operate it, but we are very far from being able to send something from one end of it to the other. 

 

Up to this point, I’ve posed the problems in abstract and general terms, in order to make it clear that these are global rather than local issues.  Now I’d like to make it clear that this problem of second-generation digital resources is not just an abstraction, but really does exist, by telling you a true story—the picaresque tale of how I and some others knocked around in the social landscape of information until we came to recognize the absence of infrastructure as a problem, and then how we tried, and failed, and tried, and succeeded, in our attempts to find a patron in the quest for more chivalry and better bridges.  Sort of a Don Quixote with grant proposals.  The comparison with Quixote is apt because, even though the story has a happy ending, its topic emerges from a string of idealistic failures—failure to negotiate publishing contracts, failed grant proposals, and failed attempts to bring representatives of all the players in the creation of digital resources in the humanities to one table, or to one project.

 

So, our story begins with a visit to the Washington, DC offices of the publisher Chadwyck-Healey, to discuss the possibility of their publishing an edited collection of the writings of Emily Dickinson and her sister Susan, a project then (and still) underway at IATH, directed by Martha Nell Smith, of the University of Maryland.  That deal eventually fell through, though not before months of negotiations had taken place with a rapidly changing series of acquisitions editors.  The next set of negotiations also fell through, with a different putative publisher where the editorial staff was stable, but the publisher as a whole was being “vertically integrated,” a corporate euphemism for “assimilated by the Borg”.  But at any rate, on the way home from that DC trip, Daniel Pitti and I discussed the problem of single-perspective resource creation, and the lack of any life-cycle approach to the creation, publication, collection, maintenance, and use of digital resources.  Even with the best of intentions, we didn’t think we’d be able to predict all the processes and purposes that IATH resources would need to answer to, once they left the author/incubator stage and progressed into publishing and/or libraries.  Would the Rossetti Archive, with all of its thousands of images, have the kind of rights-management information that a publisher might want?  Would the Whitman project, which draws on materials held in UVa Special Collections and digitized and transcribed in the Etext Center, be able to incorporate those materials rather than replicating them?  Could it feed back into the library’s descriptions of, or information about, those objects?  Would any of our projects have a hook on which a library could hang consistent versioning or archival refresh-date information?

 

That conversation eventually led to an NSF proposal, in the second round of the Digital Libraries Initiative (DLI2).  This proposal took a number of people a good deal of time to prepare, and so it was more than a little disappointing when it failed, in spite of having what must have been the longest coherent acronym-producing title in that competition (for those of you who collect TLAs—three-letter acronyms—that would be an APT—acronym-producing title).   We called it SPECTRA (“Supporting Persistent Electronic Communities’ Thematic Research Archives”).  SPECTRA sounded really good: not only was it a seven-letter APT, it also conveyed the idea of a spectrum of participants, and it had a kind of clandestine undertone—SPECTRA sounds like one of those evil spy organizations from a sixties TV show, doesn’t it?  This seemed certain to appeal to Big Science in the post-cold-war era.  Indeed, it was a proposal with something for everyone: it was to involve

eminent humanities scholars, computer scientists, research incubators, libraries, and publishers.  Its goals [were] to investigate, demonstrate, and document the standards and practices that will be necessary if next-generation digital libraries are going to:

·        support the intellectual communities that have traditionally formed around thematically coherent bodies of scholarship, and

·        seize new opportunities for expanding the horizons of humanities research, creating distributed collections, managing digital rights and permissions, and preserving digital archives.

SPECTRA claimed it was appropriate to a second round of Digital Libraries investigation because

·        it [would focus] on the difficult work of integration and coordination across institutional and practical boundaries, without which digital libraries will be less, not more, useful than their paper predecessors, and

·        it [would bring] together leading representatives of the different parties to scholarly production, dissemination, and preservation, and [leverage] existing relationships among those partners to ensure a highly motivated cooperative effort.

 

Incidentally, part of this proposal touched on a subject I have said little about, namely the scarcity of tools for analysis of digital resources in the humanities, and the scarcity of resources available in forms that suit those tools that do exist.  This problem—I still maintain—is part of the larger one I’ve spent most of my time on today, but until we have worked out the interoperability of digital resources across collections, or adopted the mechanisms and methods for stable reference to digital objects, or figured out how authors can say what they need to say about digital resources without driving publishers to the poorhouse and librarians to the loony bin, it’s going to be difficult to demonstrate the grand possibilities for analytical tools adapted to large, distributed, authored, digital resources. 

 

So I’ll skip the NSF-JISC proposal, also failed, and the NEH Research and Demonstration proposal, also failed, both of which broke that aspect of SPECTRA out into a separate proposal, with new partners putting in new hours on new drafts with new APTs.  Instead, I’ll just say that even as we composed the initial, $4 million NSF proposal, with its imperial charts and tables, its Byzantine budget, and its stirring statement on Human Subjects, we did consider the possibility that the odds of any proposal being funded in this competition were quite long, and we considered other possible sources of funding for this idea, to which we had by now become somewhat attached.  The first such source that came to mind was the Andrew W. Mellon Foundation, because this foundation had been so active in helping libraries to produce first-generation digital resources in the humanities, and because it had shown itself to be interested in the economics of digital-resource creation, something that publishers also have to think about.  And it was the Mellon Foundation that eventually funded a part of the research we’d originally proposed to NSF—the part involving scholars, research incubators (IATH), and the library.  In fact, before we got around to contacting them, William Bowen contacted us, and asked to come see the Institute and find out about what we were doing.  In retrospect, this seems like a scene of divine intervention, with Walter Matthau playing the part of God.  In any case, out of that intervention, and with the help of Don Waters, we evolved a proposal to the Foundation that concentrated on only part of the problem we had outlined as SPECTRA, reducing SPECTRA to more manageable dimensions, not only by bracketing some of the publisher problems, but also by framing it in a local context, with scholars, incubator, and library all in one place, and all already working with one another.  We don’t have any illusions that this local and limited approach to the problem is going to produce solutions that will instantly snap into place in other local contexts, but it is a place to start.

 

And we have started, with three projects selected from among the many “free-range” scholarly publications, or second-generation digital resources, at the Institute, to be analyzed and ingested into a new digital library system in partnership with the Digital Library Research and Development Group, headed by Thornton Staples, formerly of IATH.  Both IATH and the Library have several staff working on the project in a technical group, and we have organized other, larger groups—one to produce policy and best-practices guidelines based on our experience, and another to review both the technical implementation and the policy recommendations on behalf of the faculty and administration. 

 

The digital library architecture that we’re working with is called FEDORA (a six-letter acronym: perhaps success lies in multiples of three?), which stands for “Flexible Extensible Digital Object Repository Architecture” and comes originally from research done by Carl Lagoze and others at Cornell.  If any of you saw Thorny Staples’ presentation at the ALLC/ACH conference in Glasgow, you will have seen some of the particulars of that architecture: if you didn’t, you can have a look at his article, co-authored with Ross Wayland, in the July/August 2000 issue of D-Lib Magazine.  We think FEDORA holds great promise.
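To give a rough sense of what a repository of this general kind manages, here is a deliberately simplified, hypothetical sketch; it is not FEDORA's actual interface, and the names are invented. Each resource is a digital object with a persistent identifier that aggregates named datastreams (an image, its transcription, its descriptive metadata), so that the repository, rather than a particular delivery application such as Dynaweb, is responsible for keeping the parts of a disbound publication together.

```python
# A deliberately simplified, hypothetical illustration of the digital-object-repository idea.
# This is not FEDORA's actual API; identifiers and datastream names are invented.

from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class DigitalObject:
    pid: str                                            # persistent identifier for the object
    datastreams: Dict[str, Tuple[str, str]] = field(default_factory=dict)  # name -> (media type, location)

    def add_datastream(self, name: str, media_type: str, location: str) -> None:
        self.datastreams[name] = (media_type, location)

repository: Dict[str, DigitalObject] = {}

obj = DigitalObject(pid="demo:salisbury-0042")
obj.add_datastream("IMAGE", "image/jpeg", "masters/0042.jpg")
obj.add_datastream("TEXT", "application/xml", "tei/0042.xml")      # e.g. a TEI transcription
obj.add_datastream("DESC", "application/xml", "dc/0042.xml")       # e.g. a Dublin Core description
repository[obj.pid] = obj

# A delivery application asks the repository for parts by name instead of
# hard-coding server-specific paths.
media_type, location = repository["demo:salisbury-0042"].datastreams["IMAGE"]
```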

 

The projects we’ve chosen as our test subjects in this first phase of the research are Marion Roberts’ Salisbury Cathedral project, Jerome McGann’s Rossetti Archive, and the Pompeii Forum project, directed by John Dobbins with assistance from Kirk Martini and others.   These three were chosen because each presents a different aspect, either in kind or in scale, of the initial problem of collecting the born-digital scholar-produced resource. 

 

Salisbury is essentially a collection of digital images, with some metadata and some related textual resources.  It is marked up in the Encoded Archival Description DTD, and perhaps its most complex feature is that its metadata places the individual photographs in spatial context, in effect mapping them onto a footprint of the building.  Well, there is another complexity, actually: with the exception of some HTML for the related textual materials, the whole thing is currently delivered through Dynaweb, an SGML-aware web server that compiles its text into a binary for indexing and uses proprietary style-sheets, since it significantly predates XSL.  So, though Dynaweb is standards-based in the sense that it is designed to work with SGML data, it is just the sort of monolithic, impenetrable, and legally problematic kind of wrapper that will make it difficult to integrate publications like Marion’s, at the item or the collection level, into library catalogues and indexes.  On the other hand, Marion’s project has the virtue of being relatively small and relatively uniform—which is why it seemed a good place to start.  We’re experimenting with several different “collections-development” strategies with Salisbury, then, in order to get a sense of how they compare to one another.  First, we’ll take a very low-end approach (let’s call it Whacking Salisbury), and see what we can extract from the Dynaweb server over the Web, as though we had no direct access to the data; as we extract pieces, we’ll try to automate the capture of their relationships to one another, and express those relationships in a way that allows the library’s systems to reproduce the structure on demand.  In other words, in this approach, we are rebinding the Salisbury project’s HTML output in a uniform library binding.  We’ll also experiment with a higher-end approach, one that assumes we have direct access to all the underlying data.  In this approach, we’ll probably ingest the EAD data, perhaps converting it in the process to a generic DTD designed by the library, and we’ll attempt an automated, partly re-usable conversion of HTML to TEI-Lite and of the Dynaweb stylesheets to XSL stylesheets.  These two passes should give us an idea of what can and can’t be automated, what problems attend the unattended collection of digital resources, and whether we can wrap arbitrary materials in library-developed bindings.
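As an illustration of the low-end approach, here is a minimal sketch of the kind of capture utility involved; the starting URL, limits, and function names are hypothetical, and a real capture would need politeness delays, better error handling, and project-specific rules about what counts as part of the resource. The idea is simply to fetch pages as an outside reader would, record the link relationships among them, and keep that structure in a form the library can later use to rebind the pieces.

```python
# A minimal, hypothetical sketch of the low-end "whacking" approach: crawl the delivered
# HTML from outside, record which pages link to which, and keep that structure so the
# captured pieces can be rebound later. The starting URL is invented for illustration.

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags in one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def capture(start_url, max_pages=200):
    site = urlparse(start_url).netloc
    to_visit, seen, structure = [start_url], set(), {}
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip pages that cannot be fetched
        collector = LinkCollector()
        collector.feed(html)
        children = [urljoin(url, href) for href in collector.links]
        # keep only links within the same site, i.e. inside the resource being captured
        children = [c for c in children if urlparse(c).netloc == site]
        structure[url] = children  # parent-to-children relationships, recoverable later
        to_visit.extend(children)
    return structure

# structure = capture("http://www.example.edu/salisbury/")  # hypothetical starting point
```

Serialized as XML or database rows, the captured relationships become the "uniform library binding" referred to above; the higher-end approach works from the underlying EAD, HTML, and stylesheets directly, and so needs less of this kind of inference.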

 

The Rossetti Archive will benefit from some of this work, because it is also currently delivered through Dynaweb (in fact, if you’re familiar with the project but haven’t looked at it lately, you should: there’s recently been an extensive addition to the materials available online, following another important failure, this one with a publisher—but that’s another story).   The problem of the Rossetti Archive is a problem of scale: there are thousands of SGML files and thousands of images.  We recently got a new Sun server at the Institute, but on the old one—a SparcServer 1000—it routinely took two or three days to parse the Rossetti Archive.  On the other hand, the Rossetti materials are extremely uniform, albeit idiosyncratic as to DTD.  Here, the objects of the experiment are two: to see what problems large-scale projects raise for library systems, and to see what problems are raised for those systems by actively developing projects.  With respect to this second question, the Rossetti Archive will be migrating its materials into Astoria, an SGML-based document management system, and we will develop mechanisms for exporting “editions” of the archive out of that system and directly into library systems, in order to understand better the dynamic-object problem.  I also expect that, with respect to our first-year focus on cataloguing issues, the scale of the Rossetti Archive will turn up some interesting patterns, exceptions, and caveats.

 

The Pompeii Forum project has many interesting features: it includes a number of data types that go beyond text and image, such as QTVR, CAD models, audio, and maps.  Many of these are data types that are new to the library, at least from a cataloguing point of view.  Also, of the three projects named here, it has the least explicit structure: it is essentially, and pretty thoroughly, a Web-based project in which the relationships among its parts are implied by links rather than specified in any abstract way.  It is also a project that, like Marion’s, focuses on a particular location and structure, and like Salisbury it has (in addition to its more exotic elements) many photographs of that location and structure.  Some of those photographs are even catalogued in a database, which provides another opportunity to test the structural inference (or more mundanely, web-whacking with metadata) utility that we’re developing and testing with Marion’s project. 

 

We have been careful, all along, to say that the results we expect from these experiments are not shrink-wrapped solutions to the problems of second-generation digital resources, but a better understanding of the problems they present.  Subsequent stages of the project are likely to look at the ways in which library systems can feed elements into digital scholarly publications, how scholars and librarians can collaborate in the description, production, and maintenance of digital resources, and how library systems can support some of the more complex “behaviors” these second-generation resources may include (for example, customized analytical tools delivered along with the publication).   As we plod through these real-life problems, we’ll be consulting with some of you, trying to find out what you have learned from your experiences, trying to produce a set of policies that libraries can realistically use with faculty or other “content-providers” in addressing the real costs, limits, and conditions of collecting and preserving digital resources, and trying to produce a set of examples that will help all of us understand, in absolutely concrete terms, the breadth and significance of the infrastructure issues we all face.

 

Now the exhortatio: it’s the beginning of a new millennium, and it’s also the beginning of a new phase in the world of digital resources in the humanities.  I don’t have any doubt that the problems I have identified here will be—in fact, already are—something we all need to worry about.  We have opportunities to collaborate, and we can find funding to support such collaborations.  In spite of our early defeats with the NSF and others, I am convinced that these organizations can be made to see that creating the infrastructure to support the entire life-cycle of digital scholarly information is something they must help us do now—and further, that this infrastructure must be designed with all parties at the drafting table, not least because designing it is as much a social as it is a technical problem.

 

At Virginia, we have started on a small corner of the problem, having been reduced after a number of attempts to working within the boundaries of one institution and with only part of the spectrum of interests involved.   As I hope I’ve made clear, though, there is a great deal more work to be done at this three-way intersection of interests—the very intersection this conference represents—and it is neither too late nor too soon to agree that we should do it. 

 

And finally, some unreconstructed Quixotism.  In research areas that require funding, we tend to let funding bodies set the terms of our research agendas—but what if we set our own agenda?  What if we made a coordinated petition to these and other organizations to include this research agenda as one of their own?  What if we committed our own resources to it as well, by deciding to attend to these questions of infrastructure for second-generation digital resources up front, even in projects not primarily focused on those questions, with each resource-producing party making some attempt to consult, however informally, with representatives of the other categories?   And what if we established mechanisms, in each of these three communities, for determining whether or not a given digital resource meets the needs of that community, perhaps in the form of guidelines that would help to produce digital resources that were scholar-friendly, library-friendly, publisher-friendly? 

 

After I leave Sheffield, I’m going to New York City for an annual meeting of the Modern Language Association’s Committee on Scholarly Editions, a group that has, for some decades, reviewed and consulted on the methodology, scholarship, and apparatus of scholarly editions in print.  Over the past several years, that committee has been drafting guidelines for electronic scholarly editions.  This past November, it presented a draft of those guidelines for review by editors of existing electronic scholarly editions and an editor of the standard—TEI—most often used in those editions.  As a result, at our meeting later this week, we’ll be discussing a radical restructuring of the Committee’s guidelines, intended to take into account some broad commonalities and some significant differences in editorial traditions across disciplines, as well as the enduring goals and rapidly changing particulars of editing in an electronic environment.  I think it would be a great boon to this committee if it heard from libraries and publishers as well as from editors, and if the perspectives, best practices, and basic requirements of those communities were incorporated into this committee’s reviewing and consulting process.

 

Surely, the same is true at other points in the spectrum: there are similar bodies, with similar reviewing and consulting functions, among publishers and libraries, and they must also be grappling with the need to establish standards, guidelines, and best practices for themselves: wouldn’t those efforts be more worthwhile if they were pursued in conjunction with one another?   Guidelines established in the absence of real cases are liable to be unrealistic, so perhaps we could select some of our best examples, from each community, and consider how they meet, or fail to meet, the needs of other communities, as well as considering how the guidelines in those communities apply, or fail to apply, to these examples?

 

I’ll leave you, then, with those suggestions: that we should agree to petition for funding to support research in this area, and that we should voluntarily seek one another out in the absence of funding.  We’ve been a generation of do-it-yourselfers; in the second generation, I think we will have to give up the notion that every library can be a publisher, every publisher a library, and every scholar both.