"Cyberinfrastructure and Open Standards, Methods, and Communities"
John Merritt Unsworth
Annual Convention of the Modern Language Association
Chicago, IL, December 27, 2007
First, a brief definition, for those of you who haven't been reading white papers on cyberinfrastructure for the last three or four years. "Cyberinfrastructure" is a combination of technical and social systems that support research in a particular domain; as was noted in the Introduction to the American Council of Learned Societies' report on cyberinfrastructure for the humanities and social sciences, cyberinfrastructure is a digital extension of "the infrastructure of scholarship" that "was built over centuries" and "includes diverse collections of primary sources in libraries, archives, and museums; the bibliographies, searching aids, citation systems, and concordances that make that information retrievable; the standards that are embodied in cataloging and classification systems; the journals and university presses that distribute the information; and the editors, librarians, archivists, and curators who link the operation of this structure to the scholars who use it."
In discussing the topics in my title (cyberinfrastructure and open standards, methods, and communities), I'm going to take as my example one standards community that I know fairly well, and that is probably known to many of you also: the Text Encoding Initiative, or TEI. The TEI has been around for a long time now -- a little more than 20 years -- so there's a good deal of history to consider, and it is a history that exemplifies the challenges of cyberinfrastructure, its problems and possibilities, and the fundamental importance of community in meeting those challenges, solving those problems, and realizing those possibilities. By the way, the TEI web site (www.tei-c.org) is an excellent source of information on the history of the TEI, beginning with the "history" page, under "About" on the main menu (or at http://www.tei-c.org/About/history.xml).
One of the first things that needs to be said when discussing TEI as a standard is that TEI orthodoxy has always made a distinction between "guidelines" and "standards": in the strict sense, a standard has to be blessed by some internationally recognized standards organization like ISO (the International Organization for Standardization). One part of TEI is actually an ISO standard, namely Feature Structures, ISO 24610-1:2006 (which "provides a format for the representation, storage and exchange of feature structures in natural language applications concerned with the annotation, production or analysis of linguistic data"). SGML, the grammar in which the TEI Guidelines were originally expressed, is also an ISO standard (ISO 8879:1986); XML, technically a subset of SGML, is not itself an ISO standard but is formally a W3C "recommendation" (or "specification"). The distinction between a "standard," on the one hand, and a "guideline," "recommendation," or "specification," on the other, is actually an important one, since bodies like ISO impose significant organizational overhead in the interest of achieving international consensus, and (partly because of that emphasis on consensus) they tend to regard stability of the standard as more important than other considerations (flexibility, currency, customization, etc.). Also, in the strict sense, a "standard" presupposes that there is a body to enforce adherence, and that adherence to at least some part of the standard is not voluntary. In one of the TEI's earliest documents, the minutes of the first meeting of its advisory board in 1989, Nancy Ide is reported as saying:
The intention is to produce "guidelines," not a "standard": compliance will be wholly voluntary, although we hope that wide acceptance will make the TEI recommendations a de facto standard, and relevant national and international standards will be taken into account during development.
(TEI Document abm01, at http://www.tei-c.org.uk/Vault/AB/abm01.gml)
For a community that promotes best practices, there is an important space between the chaos of idiosyncratic practice and standardization in the ISO sense. In a text-encoding community, those best practices need to be embodied both in human-readable guidelines and in a formal, machine-readable grammar (for example, SGML or XML tagging) that can be parsed and validated against some statement (e.g., a DTD or schema) that stipulates how that grammar is to be applied in a set of instances. And the most significant product of that community's activity is not so much a hard-and-fast 'standard' from which no one will deviate, but rather a process that brings many minds to bear on articulating the nature of the community's object of interest. In the case of the TEI, the most important, most impressive outcome of 20 years of activity is an extremely detailed ontology of text, considered from literary and linguistic perspectives. There are, in fact, few disciplinary communities that have done as much as the TEI has done to produce both the formal expression and the human-readable documentation of the ontology of their subject matter. Along the way, the TEI has also produced an infrastructure for the production and maintenance of its guidelines, a history of its own organizational development, and mechanisms for participation, each of which is, in its own way, a key element of cyberinfrastructure, as it gets embodied in and carried out by communities.
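To make that pairing of human-readable guidance and machine-checkable grammar concrete, here is a small, invented illustration (the element names are genuine TEI, but the fragment and the accompanying declarations are simplified for this example and are not quoted from the Guidelines themselves):

    <div type="poem">
      <head>Song</head>
      <lg type="stanza">
        <l>How sweet I roam'd from field to field,</l>
        <l>And tasted all the summer's pride,</l>
      </lg>
    </div>

    <!-- A DTD fragment of the sort against which such markup can be
         validated, stipulating (for instance) that a line group must
         contain at least one line; the real TEI declarations are
         considerably richer than this: -->
    <!ELEMENT lg (head?, l+)>
    <!ELEMENT l  (#PCDATA)>

The prose of the guidelines explains what a "line group" or a "line" is and when to use each; the formal declarations make it possible for a parser to confirm, mechanically, that a given document has used them as stipulated.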
Historically speaking, the TEI has as its point of origin a meeting at Vassar College in 1987, convened by the Association for Computers and the Humanities (the ACH), funded by the National Endowment for the Humanities, and attended by "thirty experts in the field of electronic texts, representing professional societies, research centers, and text and data archives" (http://ota.ahds.ac.uk/documents/creating/chap5.html). According to one account,
Those attending the conference agreed that there was a pressing need for a common text encoding scheme that researchers could use when creating electronic texts, to replace the existing system in which every text provider and every software developer had to invent and support their own scheme (since existing schemes were typically ad hoc constructs with support for the particular interests of their creators, but not built for general use). At a similar conference ten years earlier, one participant pointed out, everyone had agreed that a common encoding scheme was desirable, and predicted chaos if one [were] not developed. At the Poughkeepsie meeting, no one predicted chaos: everyone agreed that chaos [had] already arrived.
The document that emerged from this meeting is referred to as "The Poughkeepsie Principles" and it proposes the development of a common encoding scheme, and specifies two functions for that scheme: "to recommend a format for interchange of texts, and to recommend principles and practices for the encoding of new texts" (TEI Document abm01). According to the first 'official release' of the TEI Guidelines (also called P3), "The Guidelines formulated in this document are intended for use in interchange between individuals and research groups using different programs and computer systems over a broad range of applications. Since they contain an inventory of the features most often found useful for text processing, the Guidelines also provide help to those creating texts in electronic form. They can also be used for the local storage of text which is to be processed with multiple software packages requiring different input formats" (P3, Introduction). So, sharing of resources among researchers, an essential characteristic of cyberinfrastructure, was also a defining goal of the TEI, from the outset. In the Poughkeepsie Principles, a target audience was also identified:
Existing archives have large investments in their existing schemes and will have no motive for converting the storage format of their holdings. But they are keenly interested in reducing the number of other formats from which and into which they must translate their texts, by helping develop and support a single common format for interchange. (TEI Document abm01)
In a practical sense, the TEI did function as an interchange format for libraries that were riding out the rapid and relatively undisciplined development of HTML in the early days of the web, with the result that, over time, TEI became the format in which most library-based electronic text collections were actually stored—at which point, though the target audience remained the same, the logic of participation shifted to emphasize the interest that archives should have in seeing the standard maintained and developed in ways that would protect the investment represented in those text collections.
The other audience envisioned in the Poughkeepsie Principles was "Scholars working to encode new texts -- many of them novices in computing with no investment at all in any existing scheme," and in this case the logic of participation was that such scholars would "benefit from having some guidance about what textual features to encode and how to encode them." And indeed, many scholars have used the TEI to encode their texts, but the many examples of scholars who have ignored or departed from the practices recommended by the TEI do suggest that scholars are perhaps less interested in interchange and sharing than they are in accurately, if idiosyncratically, expressing their particular views of the material with which they are working.
Finally, at this first meeting in Poughkeepsie, it was agreed "that the syntax of the new encoding scheme should conform to the Standard Generalized Markup Language (SGML) unless it proves necessary to deviate from SGML to handle requirements of the research community" (TEI Document abm01). There were limitations to what one could express in SGML that were recognized from the outset (for example, not all documents can be reduced to neatly nesting hierarchies of logical elements), but SGML was the best available standard in which to express the TEI's recommendations, and the limitations of SGML were more likely to be felt by scholars than by librarians.
The tension between the requirements of the library community and those of the scholarly community is a theme in the history of the TEI, perhaps best recognized in the emergence and widespread adoption of TEI Lite, "a specific customization of the TEI tagset, designed to meet '90% of the needs of 90% of the TEI user community,'" according to the TEI Guidelines, which go on to say that, "due to its simplicity and the fact that it can be learned with relative ease, TEI Lite has been widely adopted, particularly by beginners and by big institutional projects that rely on large teams of encoders to markup their documents." In fact, it is not only that a simpler tag set provides fewer opportunities for variations in encoding practice that makes it useful for large institutional projects—it is also important that encoding across a large collection should limit itself to tagging reliably recurring features, so as to simplify management of the collection. This "lowest common denominator" approach is fundamentally different from the scholar's interest in distinguishing, specifying, describing, and interpreting the features of what is likely in most cases to be a significantly smaller and more homogeneous collection of materials from a single author or era. As long as markup has needed to be embedded in the encoded document, the opportunity for interchange across this particular boundary of purpose has been limited: a library text could be further encoded into its scholarly version, but reversing that process (say, to export corrections to the text back to the library version) is not likely to happen, and having the library text underlie the scholarly text without being absorbed into it simply wasn't possible, because it would require technical and social mechanisms that haven't existed. However, since the most recent release of the TEI Guidelines has incorporated the idea of standoff markup (see section 16.9), at least the technical side of the problem has been addressed.
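To give a rough sense of what standoff markup makes possible, here is an invented two-layer sketch, much simplified from what the Guidelines actually specify (the elements are genuine TEI, but the layering and the annotation are made up for the example): the base text assigns identifiers to its units, and a scholarly layer, kept in a separate document, points at those identifiers without rewriting the text itself.

    <!-- In the library's base text, words carry identifiers: -->
    <l xml:id="l1">
      <w xml:id="w1">A</w> <w xml:id="w2">slumber</w> <w xml:id="w3">did</w>
      <w xml:id="w4">my</w> <w xml:id="w5">spirit</w> <w xml:id="w6">seal</w>
    </l>

    <!-- In a separate file, a scholar's annotation points into that text: -->
    <spanGrp type="editorial">
      <span from="#w2" to="#w2">an editorial comment on this word</span>
    </spanGrp>

Corrections or annotations made in one layer can then, at least in principle, be propagated to or consulted from the other, which is precisely the kind of exchange across the library/scholar boundary that embedded markup has made so difficult.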
In my title, there is a modifier attached to "standard," namely "open." An "open standard" is a non-proprietary standard: it is freely available, and it is not produced by a commercial entity as part of a software system that is closed to outside developers. While there is a great deal of ideology accreting around words like "open," on a dispassionate examination things are rarely either wholly open or wholly closed. There are many widely used standards that are proprietary (GIF, PDF) and there are non-proprietary standards that are effectively closed to some who would like to participate (for example, those of the World Wide Web Consortium, in which membership costs $10,000/year). In the case of the TEI, inclusiveness as well as openness have been goals from the earliest days, but these goals have not always been perfectly realized. During the first ten years, up until the end of the 1990s, participation in discussion was open, through subscription to TEI listservs, but participation in governance was by invitation from those who were already involved. Partly, this is the heritage of a grant-funded effort: as Susan Hockey noted in the mid-1990s,
Funding of approximately $1,000,000 has been provided, over the six years of the TEI's development work, by the U.S. National Endowment for the Humanities, Directorate General XIII of the Commission of the European Union (as it is now called), and the Andrew W. Mellon Foundation. The TEI has also received substantial indirect support from the host institutions of participants in the project. (Susan Hockey, TEIJ16, March 21, 1996)
Funders such as these require principal participants to be identified in advance, which means those participants need to be invited to be part of the application for funding, which means they cannot easily self-identify. At the outset, the organizational structure of the TEI had three parts: an advisory board, a steering committee, and editors. The advisory board consisted of fifteen individuals representing scholarly societies and like organizations, including the Linguistic Society of America, the American Historical Association, the International Federation of Library Associations and Institutions, the Association for Documentary Editing, the Dictionary Society of North America, the Association for Computing Machinery's Special Interest Group for Information Retrieval, the Modern Language Association of America, the Association for Computing Machinery's Electronic Publishing Special Interest Group, the American Society for Information Science, the American Anthropological Association, the Association Internationale Bible et Informatique, the Canadian Linguistic Association, the American Philological Association, and the Association for History and Computing. This group actually had a rather limited role, disseminating information about the TEI to their members and voting up or down on various carefully prescribed topics. In the early years, the advisory board met only twice: in 1989, to approve the goals of the project to produce a set of guidelines, and in 1993, to approve the results. Most of the organizational heavy lifting was done by the steering committee, with two representatives from each of the three sponsoring organizations, which were the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics. This group met about every three months during the early years. The work of composing the guidelines themselves was overseen by two editors, appointed by the steering committee (for life, more or less), and a lot of the detail work was done by working groups, appointed by the steering committee, usually with a focus on particular types of materials. Fifteen of these working groups (with a total of nearly 200 members from universities around the world) contributed to P3, the first official release of the guidelines. These 200 were accrued through a kind of snowball process, in which original participants knew other people who knew other people. So even though their number was fairly large, and even though to them the TEI must have seemed like an open community, if you weren't part of this social network it wasn't clear how to become part of it, and governance and editorial positions were pretty well sewn up in the original proposals for funding.
At about the time of the TEI's 10th anniversary, though, these grants were completed and sources of funding were drying up. At the same time, TEI continued to be in an active oscillation with standards and technologies developed in the wider world. For example, HTML was developed two years after that 1987 meeting at Vassar; P2, which might be regarded as a beta release of P3, came out in the same year that Mosaic, the first graphical web browser, appeared; the first version of P4 appeared in 1999, a year after XML 1.0 was released. Clearly, with all that was going on, it wasn't going to be possible to simply declare TEI a finished task and walk away—but at the same time, it was increasingly difficult to get funding agencies to support standards development, which may have been in part because they wondered whether something as massive and domain-specific as TEI would simply be overtaken by lighter-weight, more general-purpose schemas like XHTML, driven by commercial funding and business opportunities.
At this juncture, the TEI made a very important decision: rather than simply looking within its current community boundaries for a solution, it issued a public request for proposals, for a business plan that would allow TEI to continue its work. The result of that RFP was exactly one proposal, but it was a proposal that wouldn't have been submitted unless a public call had been made. The proposal, from the University of Virginia and the University of Bergen, was to incorporate the TEI as a non-profit membership organization, with an executive board elected by members or representing host institutions, and a technical council elected by members with a chair appointed by the board. The technical council is essentially similar to the earlier steering committee, and the executive board is more of a decision-making body than its predecessor, the advisory board. The big difference in shifting TEI to a membership organization was to make the route to participation more obvious and more open: you participate by becoming a member, and members elect most of the key posts in the organization.
As the TEI's own history of itself says,
The goal of establishing the TEI Consortium was to maintain a permanent home for the TEI as a democratically constituted, academically and economically independent, self-sustaining, non-profit organization. In addition, the TEI Consortium was intended to foster a broad-based user community with sustained involvement in the future development and widespread use of the TEI Guidelines. In both of these goals the creation of the Consortium has proven a positive step. (http://www.tei-c.org/About/history.xml)
TEI was incorporated as a 501(c)(3) in 2000, and membership in the TEI is now by organization (individuals can be 'subscribers' but not voting members). The financial barriers to participation are addressed, in this new structure, by keying membership fees to the size of the group to be represented, and then adjusting that fee according to the category of world economy in which the organization operates, so that membership fees range from $100/year to $5000/year (host institutions, of which there are four, about to be five, commit to $10,000/year). At present, the TEI lists 81 member organizations from 19 countries; many of these are libraries, but a good number are scholarly projects or humanities computing centers as well.
Still, what has been constant across various organizational models and funding sources is that this is a community of interest committing volunteer labor (with the traditional exception of the editors) to producing and maintaining guidelines for its own use, and making those freely available to others as well. The first official release of the guidelines, P3, was available in searchable form at the University of Michigan and the University of Virginia; P4 was available as HTML, XML, and PDF from the TEI web site; the current P5 release can be had in all those forms, plus its ODD source, from SourceForge as well as from the TEI web site, and it is licensed under the GPL, the GNU General Public License, with a note that says "Copying and redistribution is permitted and encouraged." The result of this twenty-year effort is the largest and most detailed ontology in the humanities, incredibly flexible and extensible in its application, and highly innovative from a technical point of view. TEI has contributed significantly to the development of XML (one of the editors of the XML specification is Michael Sperberg-McQueen, former editor of TEI), informing things like XML's notion of linking, for example. Its other innovations include TEI's implementation of namespaces (so that you can embed TEI documents in documents marked up in other XML schemas, like METS or EAD); cutting-edge internationalization (with translation of every element description into five languages -- French, Spanish, German, Chinese, and Japanese -- plus a web-based interface for submitting translations of examples and other materials, and an infrastructure to support translation into many more languages, due for release right about now); Roma, a schema-generator tool that TEI developed, which allows you to generate customized schemas, with their documentation, for the purposes of a particular project (and which has now been adapted to allow production of schemas and documentation in multiple languages); ODD, the literate programming language and infrastructure that has been in development since the earliest days of TEI, which now itself uses a TEI schema and, in P5, is for the first time documented within the TEI Guidelines; and many other innovations besides. Moreover, the text that describes these innovations and explains their use is surprisingly readable, much more approachable than people who haven't read it expect it to be, and at times witty, learned, and illuminating.
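To give just one concrete sense of what these innovations look like in practice, here is roughly what embedding a TEI document inside a METS package involves (the wrapping is reduced to a bare outline, and the attribute values are merely illustrative): each vocabulary declares its own namespace, so the two sets of element names cannot collide.

    <mets xmlns="http://www.loc.gov/METS/">
      <dmdSec ID="dmd1">
        <mdWrap MDTYPE="OTHER" OTHERMDTYPE="TEI">
          <xmlData>
            <TEI xmlns="http://www.tei-c.org/ns/1.0">
              <teiHeader><!-- ... --></teiHeader>
              <text><!-- ... --></text>
            </TEI>
          </xmlData>
        </mdWrap>
      </dmdSec>
    </mets>

The namespace declarations keep the two vocabularies distinct, so that software processing the METS wrapper and software processing the TEI content need not trip over one another's element names.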
If cyberinfrastructure is a mixture of social and technical systems that allow sharing of digital resources, and the construction of tools for research, then the TEI represents some of the most important humanities cyberinfrastructure that has been created to date. It is exemplary from a technical standpoint and, increasingly, as a financially sustainable community-based activity. It has lasted, already, for a human generation, and through many generations of hardware and software (think of desktop computing circa 1987, the era of the Intel 386 and the first Apple Macintosh). Still, there are problems, and they are the problems that one finds with any real-world cyberinfrastructure. The reports of the TEI Task Force on SGML to XML Migration (NEH-funded, 2002-2003) give some idea of the issues: in large encoding projects that employ multiple encoders, or even in small ones that employ a few encoders over a long period of time, it proves very difficult to enforce consistent application of standards, guidelines, and house styles. These inconsistencies tend to emerge most clearly at moments of transition—when the collection is being migrated to a new generation of markup, for example, or when it has been taken out of the environment for which it was originally created and processed in a new way—for example, as in my own research projects, when texts are moved out of the library or scholarly projects in which they were created, where they were processed mainly by search engines and software that rendered them for browsing, and moved into an aggregation with other texts, from other contexts, to be used for text-mining and analysis. We could find earlier historical analogs for many of these cracks in cyberinfrastructure, for that matter—in the copying of medieval manuscripts, for example, or in the migration from manuscript to print.
A second typical issue in the application of cyberinfrastructure in the real world, beyond accidental departure from the shared norm, is the deliberate insistence on doing things differently. Here I will use myself as an example, and quote from a paper that Daniel Pitti and I wrote in 1998, called "After the Fall: Structured Data at IATH" (the Institute for Advanced Technology in the Humanities). We wrote that
IATH was founded in 1992, and from the beginning, SGML was used in its projects. Also from the beginning, TEI was at least considered, though frequently not used, in developing those projects. There are several factors that led to rejecting TEI in favor of locally developed DTDs at IATH. First, in the early days of the Institute, people (technical staff and fellows alike) generally did not understand either the depth or the breadth of TEI. We found it difficult to decipher and to apply, and we (along with everyone else) lacked tools to help in the effort. Second, there were several people working at IATH who understood SGML well enough to write DTDs, but not well enough to understand and employ TEI's extension methods. Third, while the community ethic of TEI was understood and appreciated, IATH was predisposed by its mission to handcraft solutions that closely reflected the intellectual interests and objectives of the scholars. These three factors, combined, led to the development of many idiosyncratic IATH DTDs. In the last year we have had the opportunity to examine SGML use at IATH, and to evaluate the various DTDs that have been employed over the years. With better understanding of both SGML and TEI, we have come to the conclusion that in many instances, it would have been wiser to use TEI than to develop a DTD.
In fact, we didn't collectively spend a lot of time trying to "decipher" TEI, and whatever tools were lacking for its application were also lacking for the application of the SGML DTDs that we developed, of course. The notion that one could be expert enough to write a DTD but not expert enough to understand the TEI extension mechanism also doesn't really stand up to scrutiny. The main reason that we didn't use TEI was that we hadn't invented it—in effect, this is the corollary to earlier observations about the openness of the TEI community in the first ten years: if you were part of it, it was open, and if you contributed to developing TEI, you probably felt invested in its use—but if you were outside the community, you might well ignore its work, even though you used the same ISO-standard grammar to express your idiosyncratic ideas about text.
This is the nub of the problem in developing shared research infrastructure—figuring out how to enfranchise when you can't enforce, developing community that defines itself by inclusion rather than by exclusion, and understanding how to provide consistency, for those who value that, without preventing creativity, for those who need or simply want it. Here again, I find that some of the earliest documents of the TEI foreshadow these issues. Take, for example, the question of whether markup is simply descriptive (which is more or less the library point of view, and makes pragmatic consistency palatable) or inevitably interpretive (which is more or less the scholarly point of view, and makes a virtue of idiosyncrasy). In the minutes of the first meeting of the Advisory Board, in 1989, we read that
Because any set of textual features implies some theory in which those features play a role, tag sets inherently involve some theoretical position. In cases where several theories have been advanced within a discipline (each, perhaps, positing different sets of significant features), a range of approaches are possible, ranging from a tag set built around a single theory to a tag set reflecting a polytheoretical consensus among the competing theories of a field. (TEI Document abm01)
Or this, from the most recent release of the TEI:
In these Guidelines, no hard and fast distinction is drawn between 'objective' and 'subjective' information or between 'representation' and 'interpretation'. These distinctions, though widely made and often useful in narrow, well-defined contexts, are perhaps best interpreted as distinctions between issues on which there is a scholarly consensus and issues where no such consensus exists. Such consensus has been, and no doubt will be, subject to change. The TEI Guidelines do not make suggestions or restrictions as to which of these features should be encoded. The use of the terms descriptive and interpretive about different types of encoding in the Guidelines is not intended to support any particular view on these theoretical issues. Historically, it reflects a purely practical division of responsibility amongst the original working committees.
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/AB.html
My point is that the TEI, as cyberinfrastructure—as an open standard developed by a disciplinary community—inevitably embodies the shifting boundary between consensus and divergence of opinion: at its best, such an infrastructure will make it easy to be consistent in expressing consensus, will provide mechanisms for articulating differences of opinion using a shared ontology, and will make it clear when an altogether different kind of infrastructure is required. Again taking myself as an example, and again quoting from my 1998 paper with Daniel Pitti, there were really only a handful of cases at IATH where departure from (rather than use or modification of) TEI was justified, and those had to do with "The dilemma of overlapping intellectual and physical hierarchies"--for example, poems that run across page boundaries. In many cases, the decision made in TEI, "to resolve the dilemma by emphasizing the representation of the intellectual textual objects, with the structure of the physical object being treated as a secondary phenomenon" is perfectly acceptable; in a few cases, like Blake's illuminated books, this resolution of the dilemma interferes with one's ability to attach much information to the physical units (like pages), because it privileges instead the logical units (like poems) from which the text is constructed. Still, even in this case, had we really been concerned about the usability of the marked-up texts outside of the environment for which they were being produced, we would have made an effort to construct a conversion (or extension) path to TEI as an interchange format, so that the texts of the Blake Archive could be mingled with other collections for other purposes than our own. But—and this is typical of such departures from shared infrastructure—we thought our own use was the limit case, we thought our resources were too constrained to produce a generalized version of the resource, and our intellectual energy was focused on elaborating our vision of the materials, rather than on reconciling that vision with a more general-purpose scholarly ontology of text developed elsewhere.
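The compromise the Guidelines describe can be pictured with a small invented fragment (the page number and its placement are made up for the purpose of the example): the poem's logical structure is what the element hierarchy records, while the page boundary that falls inside it is reduced to an empty milestone element, to which relatively little further information can conveniently be attached.

    <div type="poem">
      <head>The Tyger</head>
      <lg type="stanza">
        <l>Tyger Tyger, burning bright,</l>
        <l>In the forests of the night;</l>
        <pb n="43"/>  <!-- the physical page turns here, mid-stanza -->
        <l>What immortal hand or eye,</l>
        <l>Could frame thy fearful symmetry?</l>
      </lg>
    </div>

For most purposes this is a perfectly serviceable arrangement; for an edition like the Blake Archive, where the page (or plate) is itself a primary object of attention, it inverts the emphasis the editors need.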
All of which brings us, in a reap-what-you-sow sort of narrative, to my current research experience, in two projects funded by one of the same foundations that provided initial support to the TEI, the Andrew W. Mellon Foundation. These consecutive two-year projects are named "nora" and "monk" respectively, and I will spare you the unpacking of those acronyms, and just say that both projects are large, multi-institutional collaborations aimed at developing tools for identifying and exploring how the minutiae of textual objects correspond to our gestalt sense of literary documents, in large humanities digital libraries. Our test collections come from a broad range of libraries, publishers, and scholarly projects; all are SGML or XML, and most are some form of TEI. One would think, therefore, that we would be comfortably in the Interchange Zone, where all of these texts play nicely with one another, but in fact, most of the time in each of these projects has gone into developing a data structure that will encompass the necessary differences across these collections (from drama to verse to prose; from early to modern orthography; from British to American grammar and idiom and dialect) and developing ingest routines that will corral all of these texts into a lowest-common-denominator schema (a sort of TEI Lite for the purposes of machine analysis, which we call TEI Analytic), so that we can manage them collectively. This needs to be done with a minimum of manual intervention, because ultimately we need librarians to be willing and able to deploy these tools alongside collections, rather than having us gather collections and bring them to the tools—for the simple reason that we do not want to be in the business of administering the library's intellectual property agreements. Between simple error and motivated idiosyncrasy, the ideal of "interchange" seems quite remote, and aside from these issues there are significant problems that arise from deracinating resources from their original context (knowledge formerly implied in now-missing system elements, functional parts gone missing, etc.). What this experience has taught me is that if we really want to make serious analytical use of our digital libraries and digital scholarly editions, interchange is an extremely important goal, and achieving it will not be easy. Moreover, it is only a first step—beyond the ability to mingle texts for new purposes, we will need to be able to trace their provenance as evidence, and to trace their progress through an analytic process, and we will need to be able to refer others back to their canonical public (or proprietary) location. We'll need to be able to publish results without republishing all of the texts from which those results were drawn, and at the same time we'll need to give skeptical peers the ability to see where in those texts our evidence was found. These are some of the challenges for the next generation of librarians, scholars, and editors, and I hope we will approach them as serious infrastructural and intellectual issues that concern us all, whatever our immediate role or purpose with respect to the text might be. Such an approach might emphasize the importance of becoming literate in TEI, of becoming a member of the TEI community, of understanding best practices as the precondition of interchange, and of understanding interchange as the precondition of analytical usefulness, as well as of preservation and migration.
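To return for a moment to those ingest routines, here is one very small illustration of the kind of normalization they perform. This is a sketch for illustration only, not code from either project, and it assumes namespaced, P5-style input; the actual TEI Analytic schema and ingest tools are considerably more involved. It is an XSLT identity transform that copies everything through unchanged, except that page breaks encoded in one source collection as a generic milestone element are rewritten in a single common form.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">

      <!-- Default rule: copy every node and attribute through unchanged. -->
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>

      <!-- One normalization among many: page breaks tagged as milestones
           become <pb>, keeping the page number if there is one. -->
      <xsl:template match="tei:milestone[@unit='page']">
        <pb xmlns="http://www.tei-c.org/ns/1.0">
          <xsl:copy-of select="@n"/>
        </pb>
      </xsl:template>

    </xsl:stylesheet>

Multiply that by every divergent practice across a dozen source collections, and by every feature (quotations, verse lines, speakers, front matter) that an analytic process actually needs to see consistently, and the scale of the problem becomes apparent.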
Here again, looking forward, it seems appropriate to conclude with a quotation from an early TEI document—in this case, the first TEI newsletter, reporting on that first meeting of the advisory board, in which
Susan Hockey began with a discussion of the TEI in the context of humanities research. Humanities research deals with a great variety of source materials: printed books (prose, poetry, drama), historical documents (charters, correspondence, political papers), papyri, inscriptions, clay tablets, coins. The kinds of analyses performed vary similarly: examples include stylistic comparisons, authorship studies, lexical work, collocations, tracing literary themes, study of different characters in a play, critical editions, syntax and morphology, variant spellings in older texts, metrical analysis, and sound patterns in poetry. The basic tools for analysis include concordances, text retrieval programs, and databases. These must deal with some important problems characteristic of scholarly texts: character set representation for display, printout and analysis, parallel texts in different languages and alphabets, logical structure of the text (including standard referencing systems which are often complex and not hierarchical), footnotes, critical apparatus, editorial comments, marginalia, lacunae (gaps in original), unclear readings, deleted text, and multiple editions. (TEI Document abm01)
Every bit of this is as important now as it was then, and with respect to the preconditions for machine-aided analysis of literary e-text collections, we are only just barely at a point where meeting such preconditions seems possible, and we are at that point only because a number of individuals and institutions with a shared interest in representing literary texts in computable form have worked for a generation to reduce unmotivated differences and develop strategies for dealing with motivated ones. And that is cyberinfrastructure in a nutshell.