After the Fall—Structured Data at IATH


Daniel Pitti and John Unsworth

Presented at the 1998 ALLC/ACH Conference in Debrecen, Hungary.

The Institute for Advanced Technology in the Humanities helps scholars from a wide variety of humanities disciplines to apply the latest technology to their research. The resources used in IATH's projects include published and unpublished texts and manuscripts, but also pictorial materials, plans and designs, and geographic and statistical data. While all of this information is structured in one manner or another, we will focus in this paper on textual material and, in particular, on the use of Standard Generalized Markup Language (SGML) for representing text in machine-readable form. Further, we will review our use of the Text Encoding Initiative (TEI) Document Type Definition (DTD), and compare that to our use of locally developed DTDs.

In keeping with the theological metaphor in the title of today's paper, we intend to do two things: first confess, and then witness.

"The Fall" in our title has a number of different meanings in the present context. The first concerns the theological notion of the "Fall," the First Choice—the choice to eat the apple—after which mortality sets in, and a world of perfection (with all options always open) becomes a world of toil, in which each subsequent choice carries consequences and rules out other choices. With respect to our topic (structured data), what this implies is that in a world such as ours, perfect communication and perfect representation are impossible: we can never fully capture the irreducible phenomenon that is language, with or without the aid of technology. While this may go without saying, we feel that it is important to say, because if perfect representation were possible, then it would make sense to argue over which representation most closely approached perfection, whereas if perfect representation is not possible, then it makes sense instead to argue over what partial view one wants to represent, and why.

Even if we did not recognize this fundamental truth, the technology with which we work would provide ample opportunity to discover it. While our ability to represent texts, and to represent our understanding of them, has benefitted greatly from the development of Standard Generalized Markup Language, SGML still imposes severe limits on representation. Most of these limits were not apparent to us in our early work with the standard, but after five years of both exhilarating successes and abysmal failures, it is clear that, as expressive as it is, SGML still does not do everything scholars would like it to do. But having said that, and before continuing with this critique, we should also say of SGML what Richard Rorty says of bourgeois capitalist democracy: it's not the best system we can imagine, but it is the best system we have.

Before turning our attention to SGML's most serious limitation, we would like to spend a few moments recounting the history of the use of SGML in the Institute for Advanced Technology in the Humanities. IATH was founded in 1992, and from the beginning, SGML was used in its projects. Also from the beginning, TEI was at least considered, though frequently not used, in developing those projects.

There are several factors that led to rejecting TEI in favor of locally developed DTDs at IATH. First, in the early days of the Institute, people (technical staff and fellows alike) generally did not understand either the depth or the breadth of TEI. We found it difficult to decipher and to apply, and we (along with everyone else) lacked tools to help in the effort. Second, there were several people working at IATH who understood SGML well enough to write DTDs, but not well enough to understand and employ TEI's extension methods. Third, while the community ethic of TEI was understood and appreciated, IATH was predisposed by its mission to handcraft solutions that closely reflected the intellectual interests and objectives of the scholars. These three factors, combined, led to the development of many idiosyncratic IATH DTDs.

In the last year we have had the opportunity to examine SGML use at IATH, and to evaluate the various DTDs that have been employed over the years. With better understanding of both SGML and TEI, we have come to the conclusion that in many instances, it would have been wiser to use TEI than to develop a DTD. One case in particular is noteworthy in this regard: the Rossetti Archive, which employs four DTDs: the Rossetti Archive Work (RAW), the Rossetti Archive Document (RAD), the Rossetti Archive Picture (RAP), and the Rossetti Archive Commentary (RAC). This fourth DTD is, in fact, standard TEI with some minor modifications. Upon reflection, Jerome McGann, the editor of the Rossetti Archive, has come to the conclusion that his major objections to TEI were really misplaced. The problems were not essentially TEI problems, but SGML problems. Much of what he wants to represent simply defies representation under any implementation of SGML. For example, take the Rossetti Archive Work: a "work" in the context of the Rossetti Archive is the author's pure ideation, not instantiated in any material form—the "idea" of the Blessed Damozel, for instance, and not any text of that poem, nor any instance of that pictorial work. Pure ideation does not have divs, has no body, has no text to encode—pure header, perhaps? But it has no source to describe, no bibliographic features, no provenance. Nonetheless, it does have qualities, characteristics, attributes, and forms of reference that one might want to encode, and it may (and most often does) stand in some structured relationship to textual and pictorial instances—in fact, it generally structures the relation of those instances to one another—though more likely than not, those relationships are multiple, overlapping, and concurrent, rather than straightforwardly hierarchical.

For an immediately obvious instance of the limit case for SGML in another project, see figure 1: this is from the dissertation of Matthew Kirschenbaum, whose work is concerned with the aesthetics of text in new media. Though structure and relationships and even some limited hierarchy can be seen here, it is not at all clear that SGML would be up to the task of encoding a text such as Matt's. Nevertheless—and perhaps this is an example of the fortunate fall—it is a paradoxical axiom that failure can sometimes be more illuminating than success, so it might be interesting to make the attempt. In the case of the Rossetti Archive, for example, the structure called a "work" would not have emerged if it were not for the constraints imposed by encoding textual and pictorial instances in SGML, and it could be that some similar benefit would appear if we were to attempt the SGML encoding of three-dimensional, "performative" text.

But even much simpler instances of multiple concurrent or overlapping hierarchies are problematic in SGML, and these simpler instances are what present us and others with the most difficult choices, because while SGML is clearly appropriate to the purposes of these projects, this limitation with respect to multiple and overlapping hierarchy makes intellectually compatible goals into mutually exclusive options. Allen Renear and his colleagues, Elli Mylonas, Steve DeRose, and David Durand, have addressed overlapping hierarchies in various short articles that both examine the problem and suggest changes that will help SGML partially, though not fully, overcome it.

Since these authors and others have thoroughly discussed the issue, we will only describe it briefly here, using as an example the most frequent context in which we find it a problem, namely, the simultaneous description of both the hierarchy of the intellectual/textual object and the hierarchy of the physical object carrying that text. Although SGML theoretically permits such activities, there is no significant SGML software support for more than one hierarchy at a time. To use a well-known instance of the problem (one with some less well-known and quite intricate solutions), we can describe a particular poem and its individual lines and line groups, but we cannot describe at the same time the hierarchical structure of the pages on which that poem is printed. Say that a particular stanza of a poem begins with the last three lines at the bottom of one page, and ends with the first three lines on the following page: faced with this entirely commonplace situation, we can either represent the stanza as a stanza, or the page as a page, but not both at the same time. The last three lines on one page and the first three lines on the next page together would constitute an object in the hierarchical representation of the poem, while each set of three lines would constitute objects within the hierarchical representation of the pages, and if we were to represent both, the structures would necessarily overlap.
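The stanza-and-page example can be made concrete with a small sketch. The Python below is only an illustration, not part of any SGML toolchain: it assumes a hypothetical twelve-line poem printed six lines to a page, models each structural unit as a span of line numbers, and tests whether two spans could sit together in a single tree (that is, whether one contains the other or they are disjoint). The names Span and can_share_one_tree are invented for this illustration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A structural unit covering a run of line numbers (inclusive)."""
    name: str
    start: int
    end: int


def can_share_one_tree(a: Span, b: Span) -> bool:
    """Two spans fit in one hierarchy only if they nest or do not touch."""
    disjoint = a.end < b.start or b.end < a.start
    a_inside_b = b.start <= a.start and a.end <= b.end
    b_inside_a = a.start <= b.start and b.end <= a.end
    return disjoint or a_inside_b or b_inside_a


# Hypothetical layout: lines 1-6 on the first page, lines 7-12 on the second.
poem = Span("poem", 1, 12)
page1 = Span("page 1", 1, 6)
page2 = Span("page 2", 7, 12)
# The stanza takes the last three lines of page 1 and the first three of page 2.
stanza = Span("stanza", 4, 9)

for other in (poem, page1, page2):
    print(stanza.name, "vs", other.name, "->", can_share_one_tree(stanza, other))
```

The stanza nests inside the poem, and the two pages are disjoint from one another, but the stanza neither contains nor is contained by either page. That is the overlap that a single SGML hierarchy cannot express.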

The dilemma of overlapping intellectual and physical hierarchies was noted early in the development of TEI, and evidently the TEI developers chose to resolve the dilemma by emphasizing the representation of the intellectual textual objects, with the structure of the physical object being treated as a secondary phenomenon. In keeping with our earlier suggestion (that we argue about the choices we make, rather than about whether to make choices), we would say that this is a perfectly legitimate choice to make if your interest lies with the intellectual object rather than the physical one. In practical terms, TEI is optimized for representing the intellectual structure of a text, and it does this without entirely sacrificing the ability to represent the physical structure. The representation of the physical structure is accomplished using empty elements, elements that have no content and that are widely if not ubiquitously available within the emphasized hierarchy. The <milestone> and <pb> or page break elements are included as exceptions in the <text> element, making it possible for them to exist anywhere in the hierarchy of the text, which is to say, they do not participate meaningfully in the hierarchy itself, but "float" around in it.
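A brief sketch may help show what "floating" milestones buy and what they cost. It assumes a hypothetical six-line stanza encoded with the TEI elements <lg>, <l>, and <pb>, and it writes the fragment in XML syntax (with a closing slash on the empty element) purely so that Python's standard library can parse it; in SGML the empty <pb> would carry no slash. The function lines_by_page is an invented helper, not part of any TEI software.

```python
import xml.etree.ElementTree as ET

# Hypothetical stanza: the page break falls between its third and fourth lines.
FRAGMENT = """
<lg type="stanza">
  <l>line one</l>
  <l>line two</l>
  <l>line three</l>
  <pb n="2"/>
  <l>line four</l>
  <l>line five</l>
  <l>line six</l>
</lg>
"""


def lines_by_page(root, first_page="1"):
    """Rebuild the page grouping by walking the tree in document order."""
    pages = {}
    current = first_page
    for element in root.iter():
        if element.tag == "pb":
            # A milestone only marks a boundary; it contains nothing.
            current = element.get("n", current)
        elif element.tag == "l":
            pages.setdefault(current, []).append(element.text)
    return pages


stanza = ET.fromstring(FRAGMENT)
print(lines_by_page(stanza))
```

The stanza survives as a real element with its six lines as children, while the page exists only as something recomputed from boundary markers. The physical structure is recoverable, but it is not itself an object in the encoding.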

The TEI's emphasis on the intellectual structure of a text is also evident in the <teiheader>, that portion of a TEI document used to document the electronic text and its source. The explicit assumption of TEI is that an instance will represent first and foremost the intellectual textual object, and secondarily (if at all) the artifact bearing the text. The <filedesc> element in the header, for example, is used to document the electronic text. The element used to document the source of the text, which frequently though not exclusively is a physical object, is the <sourcedesc>, which is a sub-element of the <filedesc>, and also of <biblfull>.

While many scholars at IATH wish to represent the intellectual features of a text as their primary emphasis, many others, and perhaps the majority, wish to describe and represent the artifact that bears the text, and the image of that artifact, in the foreground, and to treat the intellectual objects as secondary. In fact, most would really like to do both at the same time, but forced to choose, as they are, most choose the physical over the intellectual. For this group of scholars, the choice to use TEI is highly problematic.

The first problem they encounter is that the source description is in the <teiheader>, and further, it is treated as a secondary, supporting element of documentation. The first request they make is to ask that the <sourcedesc> be made available directly and immediately within the <text> element or sub-elements. While it is available as a sub-element of the <biblfull> element, they would like to have it directly and immediately available elsewhere, and perhaps everywhere, as in many cases it represents their major interest.

The second problem encountered may be characterized as a lack of elements devoted to specifically representing the physical artifact. For example, if one wants to represent a sequence of pages, there is not an element devoted to representing both a page and the text carried by it. There is in fact a <page> element, but it is defined within a DTD that is designed to use the CONCUR feature of SGML, which is poorly supported in existing SGML software.

Several projects at IATH have been concerned with representing and describing artifactual sources. Michael Satlow's Inscriptions of Ancient Israel project is devoted to representing various inscriptions, and inscription fragments, geographically distributed throughout Palestine, and identifying them with various periods. In essence, he is not attempting to represent a text, but instead deals with many little texts and fragments of text, found typically on stone. These stones bear very little text, and what text there is tends to be forced by its medium into linebreaks and other textual behaviors that bear little resemblance to what we do with text in print (though it does bear a resemblance, some Dickinson scholars would say, to what Emily did with handwriting on paper). There is also, in these inscriptions, the problem of erasure, defacement, breakage, and so forth—all of which impinge on the text in ways that naturally provoke scholars to emendation, with emendation overlapping line breaks (and perhaps overlapping stone boundaries as well). Moreover, if one follows standard TEI practice in dealing with these materials, the descriptive information overwhelms the text, making for some very top-heavy instances and vastly complicating the problem of rendering.

But lest we make the same mistake here that we did originally in dealing with the Rossetti materials, we should point out that some of the difficulties we have faced in trying to treat Michael's materials with SGML are not difficulties with TEI, but difficulties with SGML—for example, though most of these materials cannot be dated to a particular year, most can be dated to a range of years, yet if you have encoded something as being from the date range 200-100 BC it is a difficult thing to make any SGML search engine find that object when searching for things in the date range 500-50 BC, let alone in the range 150-50 BC. For those purposes, clearly, a database would be a better choice. But that's another paper...
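The date-range difficulty is easy to state once the ranges are treated as numbers rather than as strings of markup. The sketch below is an illustration under stated assumptions, not a description of any search engine we used: it represents years BC as negative integers, and the names YearRange, overlaps, and within are invented for the purpose.

```python
from typing import NamedTuple


class YearRange(NamedTuple):
    """A closed range of years; years BC are stored as negative integers."""
    earliest: int
    latest: int


def bc(earliest: int, latest: int) -> YearRange:
    """Build a range from two BC years, e.g. bc(200, 100) for 200-100 BC."""
    return YearRange(-earliest, -latest)


def overlaps(a: YearRange, b: YearRange) -> bool:
    """True if the two ranges share at least one year."""
    return a.earliest <= b.latest and b.earliest <= a.latest


def within(inner: YearRange, outer: YearRange) -> bool:
    """True if the whole of inner falls inside outer."""
    return outer.earliest <= inner.earliest and inner.latest <= outer.latest


inscription = bc(200, 100)

print(within(inscription, bc(500, 50)))    # whole range lies inside the query
print(overlaps(inscription, bc(150, 50)))  # ranges share the years 150-100 BC
print(within(inscription, bc(150, 50)))    # but the inscription may be older
```

Both comparisons are trivial arithmetic in a database or a program, which is the point of the remark above: the difficulty lies in asking a text-oriented SGML search engine to do numeric interval comparison on encoded values.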

A more recent project, and one which represents our most complete attempt to develop an appropriate encoding scheme for artifacts, is the Blake Archive Project. Three Blake scholars—Morris Eaves, Bob Essick, and Joe Viscomi—are meticulously gathering from all over the world digital representations of all of William Blake's illuminated books, and describing and representing each in great detail. While they are interested in the text of these works, they are interested in it first and foremost as it is presented on each plate. If you look at figure 2, you will see that the centrality of the physical object is not only an editorial principle, but a design principle as well: illustration information (which is header information) and transcription (which is what would be central if the textual content were privileged) are both present here, but subordinated as links to dependent pages. And if you look at figure 3, you will see the practical consequence of choosing to subordinate the transcription to the artifact—it appears in an ancillary window, and has the same presentational status as the enlargement of the plate image.

In order to represent Blake's work according to the editorial and design principles of Eaves, Essick, and Viscomi, we have developed the Blake Archival Description (BAD) DTD, and we have supplied, in figure 4, a fragment of a BAD instance. In it, you will find many specific elements devoted to describing the physical object (<physdesc>), illustrations found on plates (<illusdesc>), and the text on them (<phystext>). If you look for a moment at this example, you will see what we mean when we say that some of our fellows are intensely interested in describing the physical artifact: the Blake Archive markup treats each plate as a collection of one or more illustrations, and each illustration as a collection of one or more components, and each component as a collection of one or more characteristics. Each of these—illustration, component, and characteristic—must be elements, not attributes, in order to achieve the functionality that these scholars want out of the archive—for example, in order to search across all of the illuminated books for illustrations containing both figures who are nude and climbing and vegetation that is arboreal and arching.
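To suggest why element-level structure matters for this kind of search, here is a minimal sketch. It does not reproduce the BAD DTD: the element names <book>, <plate>, <illustration>, <component>, and <characteristic>, the type attribute, and the sample data are all hypothetical stand-ins, written in XML syntax so that Python's standard library can parse them. The helper functions are likewise invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real BAD DTD element names are not reproduced here.
SAMPLE = """
<book>
  <plate n="1">
    <illustration>
      <component type="figure">
        <characteristic>nude</characteristic>
        <characteristic>climbing</characteristic>
      </component>
      <component type="vegetation">
        <characteristic>arboreal</characteristic>
        <characteristic>arching</characteristic>
      </component>
    </illustration>
  </plate>
  <plate n="2">
    <illustration>
      <component type="figure">
        <characteristic>nude</characteristic>
      </component>
      <component type="vegetation">
        <characteristic>arboreal</characteristic>
        <characteristic>arching</characteristic>
      </component>
    </illustration>
  </plate>
</book>
"""


def has_component(illustration, kind, wanted):
    """True if one component of this kind carries every wanted characteristic."""
    for component in illustration.findall("component"):
        if component.get("type") != kind:
            continue
        found = {c.text for c in component.findall("characteristic")}
        if wanted <= found:
            return True
    return False


def matching_plates(root):
    """Plate numbers with an illustration satisfying both conditions."""
    hits = []
    for plate in root.findall("plate"):
        for illustration in plate.findall("illustration"):
            if (has_component(illustration, "figure", {"nude", "climbing"})
                    and has_component(illustration, "vegetation",
                                      {"arboreal", "arching"})):
                hits.append(plate.get("n"))
                break
    return hits


print(matching_plates(ET.fromstring(SAMPLE)))
```

Because each characteristic belongs to a particular component, and each component to a particular illustration, the query can require that "nude" and "climbing" describe the same figure rather than merely appearing somewhere on the same plate. In this sample only the first plate qualifies.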

It is beyond the scope of this paper to answer the question why one would want to do this, but in fact people do want to do it, and designing a DTD for this purpose has allowed us to do other, complementary things that depend indirectly on the SGML structure, such as using IATH's Inote software to present the person who searches for nude climbers and arching, arboreal vegetation with an image of the particular sector of that plate in which these elements occur. To return to the idea of choice, its inevitability and its consequences, it needs to be said that making the choice to privilege the artifact has made other things more difficult—as, for example, when we come to a long book with subdivisions in the intellectual order—chapters or constituent poems—and we find that we must nonetheless render this work as a series of 100 plates, rather than as a series of a dozen poems.

One of the goals of TEI is to be able to represent any text—and, by and large, it is possible to represent any text using TEI: the motivation behind this is, we assume, not hubris but rather a desire to create a uniform structure that will facilitate the sharing of texts across collections. And even when one encounters texts that resist representation in TEI, it is frequently possible to use standard extension methods to suppress unnecessary elements, to add missing elements, and to modify and extend attributes. But such extensions and changes to the default DTD have drawbacks, the most important of which is that eventually these changes undermine our ability to share. At some point, a modified and extended TEI-based DTD will bear little resemblance to the TEI proper, whereupon we have lost a major benefit of TEI, and in trying to fit TEI to this purpose, we have created something far more complex, indirect, and opaque than we would have done had we simply set out to write a new DTD from scratch, designed for the immediate task at hand. This was the case with the Blake Archive Project, and we chose to simply develop a new DTD rather than using the TEI with extensions. In fact, though it verges on heresy to say so, presentation or rendering can be an important consideration in such a choice: were we to have used TEI plus extensions in the Blake Archive, it would have been difficult, if not impossible, to achieve the desired screen rendition, at least when using a fully SGML-aware rendering engine such as Dynaweb. It could have been done, no doubt, using TEI and a perl-based rendering solution, since in that case the logic of the SGML can be more easily ignored in rendering, but it seems to us that there should be some logical parallel between encoding and presentation, and if there is not, then either the encoding or the presentation needs to be reconsidered.

The emphasis in TEI on the intellectual structure of texts, and its subordination of the text-bearing artifact, does sometimes present an obstacle—especially given that many of the scholars working with IATH are primarily concerned with physical objects. And although this perspective is derived from our experience at IATH, that experience has involved many humanistic disciplines and institutions, including archives and libraries, where there is also a pronounced emphasis on physical artifacts, both published and manuscript, both ancient and modern.

Nonetheless, we are in general steering IATH away from its past tendency to rush into developing a new DTD for every project. Instead, we would like to use TEI whenever possible, and have the development of new DTDs be the exception rather than the rule, and we believe that TEI, with some revision, could meet the needs of institutions and scholars who privilege the artifact. With respect to revision, then, we have two recommendations, the goal of which is to make the description and representation of the physical artifact an option equal to representation of the intellectual object under TEI, and to make it possible under TEI to accommodate either the intellectual hierarchy or the physical one—and even, serially, both. These recommendations are preliminary, and we have as yet made no attempt to fully analyze and describe in detail the changes they would require, but we would welcome the opportunity to work on the problem with the TEI editors.

Recommendation 1: While retaining the current <sourcedesc> as subordinate to the <filedesc> for intellectual object encodings, TEI should also make it available directly in <text> and in various sub-elements of text.

Recommendation 2: TEI should include a new base set devoted to hierarchical representation of text-bearing objects. A good place to begin developing this base set would be with the elements and attributes declared in the teipl2 DTD: "Concurrent Document Type for Page and Line." While this DTD represents a solid place to begin, we further recommend that the EBIND DTD developed at Berkeley and the BAD DTD developed at IATH be analyzed for additional encoding strategies. While the BAD DTD is admittedly "Blake-centric," we believe that some, and perhaps many, of its features can be generalized.

If TEI is formally extended in this manner, the need to develop idiosyncratic DTDs at IATH, and elsewhere, would decrease substantially—and that would be a benefit to all of us, given that sharing works best in the presence of widely understood rules, and sharing codified and highly structured understanding is, and has always been, a cardinal motivation for scholarship.