Extensible Markup Language (XML) and its Applications in Scholarship & Libraries

John Unsworth

In "Adding Value To Digital Texts: A General Overview," a session of the "Slavic Digital Text Workshop: Strategies for Humanists & Social Scientists," University of Illinois, Champaign-Urbana, July 6, 2005.

Introduction and full disclosure:

I'm a current board member of the Text Encoding Initiative Consortium, and a past chair of the TEI Board (which manages the business of the Consortium) and of the TEI Council (which oversees the Guidelines, and all related technical decisions concerning the maintenance, revision, and implementation of the Guidelines). At the same time, until about 2000, I was an outsider in the world of TEI, and to some significant extent an apostate: I ran a humanities research computing organization that regularly departed from the TEI Guidelines, rolled its own DTDs, and so on. In what follows, I speak from all that experience, both inside and outside the fold.

Miranda suggested several questions to which I might respond in this talk, and they are mostly questions at the novice level. I've taken those assignments seriously: on the other hand, you may well have questions that I don't answer in what follows, so I will stop at the end of prepared remarks on each point, and then at the end of my session, for questions and discussion. Meanwhile, if my treatment of any of the following points is too elementary to be interesting to you, I recommend that you concentrate on elementary issues or topics that would matter to the novice, but that I have overlooked, rather than trying to think up really hard questions that might stump me. I'm just a dean, so that would be too easy.

What is XML and its relation to SGML?

If you think of SGML and XML as grammars, XML is a simpler grammar than SGML; if you think in those terms then XML is to HTML as a grammar is to a genre or a type of utterance. Here's a better answer, from A Gentle Introduction to XML in the TEI Guidelines (P4):

The encoding scheme defined by these Guidelines may be formulated either as an application of the ISO Standard Generalized Markup Language (SGML) or of the more recently developed W3C Extensible Markup Language (XML). Both SGML and XML are widely used for the definition of device-independent, system-independent methods of storing and processing texts in electronic form, XML being in fact a simplification or derivation of SGML. . . .

The nature of the simplification, should it be of interest, is that XML does not allow some of the things that are allowed in SGML, for example:

· Tag minimization

· Concurrent structures (non-nesting parallel hierarchies)

· Subdoc

· Inclusion or exclusion in element type declaration

(http://www.tei-c.org.uk/P4X/SG.html)

…and a host of other things, but these are some of the things that are most significant for end-users, I think. The whole list (as it was when XML made its debut) is at http://www.w3.org/TR/NOTE-sgml-xml-971215 (James Clark is the author).

It's worth noting, in addition to the simplifications, that XML does allow some things that SGML didn't allow—most importantly, I think, it allows a document not to declare its DTD (or schema) as long as it is well formed, and it allows (in fact, requires) the use of Unicode, even in element names.
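Both of those points can be seen in a minimal sketch using Python's standard xml.etree.ElementTree module. The fragment below has no DTD or schema declaration and uses Cyrillic element names, yet the parser accepts it because it is well-formed. The element names (стихотворение, строка), the attribute, and the helper is_well_formed are invented for this illustration; they come from no published encoding scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: no DOCTYPE, no schema, Cyrillic element names.
SAMPLE = (
    "<стихотворение автор='Пушкин'>"
    "<строка>Я помню чудное мгновенье:</строка>"
    "<строка>Передо мной явилась ты,</строка>"
    "</стихотворение>"
)


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Parsing succeeds without any DTD because the tags nest and close properly.
root = ET.fromstring(SAMPLE)
lines = [child.text for child in root]
```

Here is_well_formed(SAMPLE) is true, while a fragment with an unclosed tag, such as "<строка>unclosed", is rejected by the parser.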

The net effect of the restrictions is to make XML much simpler to write software for, and that has resulted in lots more XML-oriented software than was ever available for SGML, even though SGML has been around for more than three times as long as XML. The net effect of the things permitted in XML but not permitted in SGML is to make it easier to produce and exchange documents in XML, and also somewhat easier to write software to present those documents for browsing. And all of the above has made XML attractive for lots of "middleware" uses that we don't think of as documents at all, but that are really more in the nature of exchanging little atoms of information, records from a database or instructions from one piece of software to another.

Going back to the Gentle Introduction, though, let's talk about what we use XML for, in the library and digital humanities communities:

XML is an extensible markup language used for the description of marked-up electronic text. More exactly, XML is a metalanguage, that is, a means of formally describing a language, in this case, a markup language. Historically, the word "markup" has been used to describe annotation or other marks within a text intended to instruct a compositor or typist how a particular passage should be printed or laid out. Examples include wavy underlining to indicate boldface, special symbols for passages to be omitted or printed in a particular font and so forth. As the formatting and printing of texts was automated, the term was extended to cover all sorts of special codes inserted into electronic texts to govern formatting, printing, or other processing. . . .

By markup language we mean a set of markup conventions used together for encoding texts. A markup language must specify what markup is allowed, what markup is required, how markup is to be distinguished from text, and what the markup means. XML provides the means for doing the first three; documentation such as these Guidelines is required for the last.
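The first two of those requirements (what markup is allowed, what markup is required) can be pictured with a deliberately tiny check in Python. This is a hand-rolled illustration, not a DTD or schema validator; the RULES table, the check function, and the element names poem, title, and line are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: for each element, which children are allowed
# and which of those must appear at least once.
RULES = {
    "poem": {"allowed": {"title", "line"}, "required": {"line"}},
    "title": {"allowed": set(), "required": set()},
    "line": {"allowed": set(), "required": set()},
}


def check(element: ET.Element) -> list:
    """Collect rule violations for an element and its descendants."""
    rule = RULES.get(element.tag)
    if rule is None:
        # Unknown element: report it and do not descend further.
        return [f"<{element.tag}> is not a known element"]
    problems = []
    child_tags = [child.tag for child in element]
    for tag in child_tags:
        if tag not in rule["allowed"]:
            problems.append(f"<{tag}> is not allowed inside <{element.tag}>")
    for tag in sorted(rule["required"]):
        if tag not in child_tags:
            problems.append(f"<{element.tag}> requires a <{tag}>")
    for child in element:
        # Only descend into children that were allowed here.
        if child.tag in rule["allowed"]:
            problems.extend(check(child))
    return problems


good = ET.fromstring("<poem><title>Untitled</title><line>First line</line></poem>")
bad = ET.fromstring("<poem><title>Untitled</title><stanza/></poem>")
```

Both documents are well-formed, but only the first satisfies the rules; check(bad) reports the stray stanza and the missing line. A real project would state such constraints in a DTD or schema and let a validating parser enforce them, and what the markup means would still have to be documented in prose.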

And finally, from the Gentle Introduction, "What's special about XML?"

Three characteristics of XML seem to us to make it unlike other markup languages:

* its emphasis on descriptive rather than procedural markup;

* its document type concept;

* its independence of any one hardware or software system.

[….]

The markup language with which XML is most frequently compared, however, is HTML, the language in which web pages had always been written until XML began to replace it. Compared with HTML, XML has some other important characteristics:

* XML is extensible: it does not contain a fixed set of tags

* XML documents must be well-formed according to a defined syntax, and may be formally validated

* XML focuses on the meaning of data, not its presentation
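The contrast between meaning and presentation, and the freedom to invent tags, can be sketched in Python. The element names (passage, title, foreign), the two style tables, and the render function are invented for this illustration; the only point is that one descriptively marked-up source can drive more than one presentation, and can also be queried by what its parts are.

```python
import xml.etree.ElementTree as ET

# Descriptive markup: the tags say what a phrase is, not how to print it.
SOURCE = (
    "<passage>She read <title>War and Peace</title> "
    "and muttered <foreign>c'est la vie</foreign>.</passage>"
)


def render(element: ET.Element, styles: dict) -> str:
    """Flatten an element to text, wrapping each child per the style table."""
    out = element.text or ""
    for child in element:
        before, after = styles.get(child.tag, ("", ""))
        out += before + render(child, styles) + after
        # Text that follows the child's closing tag belongs to the parent.
        out += child.tail or ""
    return out


# Two hypothetical presentations of the same source.
PRINT_STYLE = {"title": ("_", "_"), "foreign": ("_", "_")}
PLAIN_STYLE = {"title": ('"', '"'), "foreign": ("", "")}

passage = ET.fromstring(SOURCE)
as_print = render(passage, PRINT_STYLE)
as_plain = render(passage, PLAIN_STYLE)

# Because titles are marked as titles, they can be retrieved as such.
titles = [t.text for t in passage.iter("title")]
```

Had the source said only "italic" for both phrases, the two presentations could not treat titles and foreign phrases differently, and the list of titles could not be recovered at all.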

So, what's the bottom line, for you? XML is easier to use than SGML, and for 99% of users—even scholarly or library users—there's no sacrifice. But even if there were, it wouldn't matter: you'd still make it, because the evolutionary process has embraced XML and stranded SGML. If you had a really, really important reason for preferring SGML as the meta-language in which you expressed your ideas, you'd still have to transform it (or dumb it down) to XML for delivery, or sacrifice delivery as a goal.

What, briefly, is the TEI (John Walsh will cover it in more detail) and what kind of customization does it allow?

TEI is a "means of representing those features of a text which need to be identified explicitly in order to facilitate processing of the text by computer programs." It is, however, a "means" that's aimed specifically at literary and linguistic texts, and at scholarly and library uses of those texts. It's cleverly expressed in a way that's independent of the particular grammar (SGML, XML) that's fashionable today, and I would argue that it's the most well thought-out community guidelines document in existence (it's not a standard—it's a set of guidelines about how to apply standards in the pursuit of the goals of a community). There is no other set of guidelines that represents such length (in time—beginning in the late 1980s) and breadth (in disciplines and participants), when it comes to documenting the best practices of a community that's trying to achieve hardware- and software-independence for the machine-readable representations of its core content.

To what extent do scholars need to know about it, and when/why should they consider using the TEI-Lite DTD rather than their own?

Scholars should understand the extensions mechanism of the TEI well and deeply before they roll their own DTD, because unless they understand extensions, they cannot responsibly judge whether a non-TEI DTD is merited (in spite of the issues that such innovations raise on the level of interoperability). The scholars most likely to have legitimate reasons for departing from the TEI are those who have an in-depth focus on the physical artifact as an information-bearing state of the text. For those people, the TEI's tendency to abstract text from carrier medium may actually present a problem. In most other cases, in my experience, the objections to TEI are ill-informed, and the scholarly project would be better served by interoperability than by uniqueness.

Why should scholars take the time to learn about XML applications (not only as end-users, but as content-providers)?

For editorial reasons, basically. The application of XML markup is an editorial act, and even if you're directing a large project with flunkies who will actually lay down the angle brackets, or spec'ing an even larger keyboarding project in which offshore labor will do the markup, you need to understand what's inside the angle brackets in order to be a responsible participant, at an intellectual level, if you have any editorial responsibility whatsoever.

Beyond that, if you're interested in information design, interface design, layout, search and retrieval, or any of the other functional aspects of the information resource you're involved in creating, you need to understand XML and (in this case, especially) the applications that work with XML in order to know what's possible, what's not, what's been done and could be had for free, what's innovative and will cost money to accomplish but will be worth it, etc.

It's worth adding that you don't have to know how to deploy these tools, or design DTDs, in order to be able to make informed decisions about what's an intelligent use of your limited resources, when it comes to these projects—you just have to be genuinely and intellectually engaged, and you have to understand that the technical dimension and the intellectual dimension converge in information systems design.

What are the emerging changes in scholarly communication that make it important for practicing scholars, especially in the humanities, to join collaborative research teams, and regard technical as well as bibliographic knowledge as part of their arsenal of research tools?

Some of the things I've just been talking about make it necessary to collaborate: it's very difficult to teach yourself everything you need to know about information systems design, for example, if you are an English professor. I'm not even sure that with time and effort you can reach the point at which collaboration would be unnecessary. Information systems design is a discipline; so is the design of ontologies, which get expressed as DTDs or Schemas; so is the analysis of workflow and communication within scholarly publishing; so is cataloging and classification and the understanding of how a particular piece of content fits into that larger library (and preservation) context.

Basically, we need collaboration because scholarly information has a life-cycle, and we generally only occupy one moment in that cycle, so we can't look after the qualities and dimensions of the data that will be important to others in the cycle unless we engage them in the process in some way—and if we're going to engage them, it's always best and cheapest and most effective to engage them early in the design and production of the resource: most of the cost of creating these things is people-money, and if we can avoid re-formatting, re-describing, re-implementing, then we are much better off in the long run. We can avoid these things, but to do so, we have to be aware of the issues, and we have to be willing to work with others from early on in the process of scholarship.

What is the impact on libraries of adherence to these kinds of metadata standards (institutional repositories, federated searching, etc.)?

At present, it's actually hard to say, I think. In one sense, there's Google and that gets people 99.9% of what they want from the internet, in quick and dirty search results. On the other hand, there's a whole lot of stuff in libraries and universities that's being prepared to a more exacting standard of description. This is probably appropriate, and my guess is that it mirrors what's happening in business. As with editing itself, and direct markup, I guess that the importance of adherence to metadata standards for humanities scholars is the access and reusability they achieve.

Why, most of all, do librarians need to know more about these approaches--not only from a reference standpoint, but also from a development standpoint (librarians as collaborators/educators in the research process...)?

Ontology, ontology, ontology. Cataloging and classification. Usability. Interface design. Indexing services. Your own institutional position as (library) faculty members. Tremendous new opportunities to recast your relationship to the rest of campus, not only in humanities and social sciences, but everywhere.

"First, do no harm" (words that do not actually appear in the Hippocratic Oath): a principle to guide markup?

How much to mark up? When to stop? What's uncontroversial to denote? What's necessary to denote?