Tool-Time, or 'Haven't We Been Here Already?'

Ten Years in Humanities Computing

by John Unsworth

Delivered as part of "Transforming Disciplines: The Humanities and Computer Science," Saturday, January 18, 2003. Washington, DC.

Just a little over ten years ago, in April of 1992, Susan Hockey (then director of CETH, the Center for Electronic Texts in the Humanities, now Professor in, and Director of, the School of Library and Information Studies) gave a talk at Rutgers University entitled "Computing in the Humanities: Are We There Yet?" The answer--a foregone conclusion--was clearly "No," but ten years later, I think it's time to ask "Haven't We Been Here Already?" You know--that problem statement over there looks pretty familiar, and I could swear I've seen that to-do list before. Could it be that we're going around in circles? This sense of déjà vu is particularly keen when it comes to tool-building for humanities computing.

Before I go further, let me say that I was part of many of the discussions I'll describe, and the others named in this talk are people I admire and consider friends. My point is not to blame others--or myself--for progress not made over the last ten years. Progress has been made, and humanities computing has contributed to it in very important ways. But often the effort to make progress in the area of tool-building for humanities computing has not produced much, and if we can understand why so little has come of some of these past efforts, we might be able to take a more effective approach to a problem that still exists, and in some ways is more pressing than ever.

What's the problem? It is this: We need (we still need) to demonstrate the usefulness of all the stuff we have digitized over the last decade and more--and usefulness not just in the form of increased access, but specifically, in what we can do with the stuff once we get it: what new questions we could ask, what old ones we could answer. We need to do this for two audiences: first, for colleagues in humanities departments who, while they admit that they are glad not to have to walk to the library to consult the library catalogue, can't really see that the digital library--assembled, inevitably, at the cost of other activities, services, and purchases--is really worth all that much. Second, we need to demonstrate this for the more general public, especially as it, and its values, gets represented in legislative priorities and state and federal funding. We have a problem when the general public thinks that the enormously oversold information age was all just snake-oil, and there's really no benefit here that's in proportion to the risk. We need to be able to show the first group that there's more to digital libraries than searching and browsing, and we need to show the second group that the problems humanities scholars care about are specific cases of more general problems, and solutions to the special cases will address the general cases as well. All of this is important in part because of the national and international recession that's underway: budget cutbacks are not kind to innovation, and innovation is what's sorely needed right now.

But this isn't the first time that's been true in humanities computing, with respect to tool-building. In June of 1993, just about ten years ago, Nancy Ide announced on the Humanist email discussion group something called "The Text Software Initiative" (TSI). The birth announcement of this initiative began as follows:

The widespread availability of large amounts of electronic text and linguistic data in recent years has dramatically increased the need for generally available, flexible text software. Commercial software for text analysis and manipulation covers only a fraction of research needs, and it is often expensive and hard to adapt or extend to fit a particular research problem. Software developed by individual researchers and labs is often experimental and hard to get, hard to install, under-documented, and sometimes unreliable. Above all, most of this software is incompatible. As a result, it is not at all uncommon for researchers to develop tailor-made systems that replicate much of the functionality of other systems and in turn create programs that cannot be re-used by others, and so on in an endless software waste cycle. The reusability of data is a much-discussed topic these days; similarly, we need "software reusability", to avoid the re-inventing of the wheel characteristic of much language-analytic research in the past three decades. (http://lists.village.virginia.edu/lists_archive/Humanist/v07/0031.html)

That was 1993. Ten years ago. Everything Nancy says in that paragraph could be said today of humanities computing applications, though actually, in the particular area of language-analytic research, the situation is somewhat better now, largely through the efforts of Nancy and others in the field (see EAGLES and GATE, for example). As far as I know, though, nothing ever came directly out of TSI. Doubtless other projects emerged from the relationships and ambitions embodied in TSI, but TSI itself seems, in retrospect, a statement of good intentions with respect to solving a clearly identifiable problem, without organizational or financial reality. It did hook up with the Free Software Foundation and GNU in that same year, and according to an announcement in the GNU Bulletin of June 1993, it conceived of itself as "an international effort"--but these two traces--in Humanist and in the GNU Bulletin--are apparently all that's left of it, at least on the Web.

A few years later, Susan Hockey and CETH hosted a meeting, in May of 1996, at Princeton University's Carnegie Center, on Text Analysis Software for the Humanities. This meeting was actually framed, in part, in response to a perceived failure of TSI to invite community participation. Susan's summary of the event is available in the Humanist Archives (http://lists.village.virginia.edu/lists_archive/Humanist/v10/0054.html), and Michael Sperberg-McQueen's more detailed trip report (which mentions TSI and contrasts this event with it) is also still available, elsewhere on the Web (http://www.uic.edu/~cmsmcq/trips/ceth9505.html). I was at this meeting, and I remember it well.

The first paragraph of Susan's report to Humanist, and a good deal of Sperberg-McQueen's trip report, bear quoting and review in the present context, both for what has changed, and for what hasn't. Susan's opening paragraph is a good example of what hasn't changed: "For some time," she writes,

those of us active in humanities computing have felt the need for better and/or more widely accessible text analysis software tools for the humanities. There have been informal discussions about this at a number of meetings, but so far no substantial long-term plan has emerged to clarify exactly what those needs are and to identify what could be done to ensure that humanities scholars have readily available text analysis tools to serve their computing needs into the next century.

Except that we're in the next century now, this same paragraph could be written today, word for word, and it would still be as true as it was in 1996--a mere six and a half years ago. But 1996 was donkey's years ago, in internet time: Sperberg-McQueen's 1996 emailed trip report was sent from Bitnet. 1996 was also the year in which Java and VRML were both emerging technologies (one caught on, one didn't). And one of the differences between 1996 and the present is that, because of Java (and even because of VRML), we can now imagine lots of things we'd like to be able to do in addition to text analysis, with our humanities software. Visualizations, mapping, multimedia annotation, and so on.

In 1996, in his trip report, Sperberg-McQueen writes:

From the first uses of machine-readable text for humanistic research in the late 1940s, to today, I count three generations. The first generation included special-purpose, ad hoc programs written for particular projects to apply to particular texts. The second generation can be counted from early efforts to create reusable libraries of text-processing routines, some of which eventually turned into efforts to create general-purpose, reusable programs for use with many texts. Naturally, these were batch programs. The Oxford Concordance Program (OCP) and some modules of the Tustep system are good examples. In the third generation, general-purpose programs became interactive: Arras, the Tustep shell, Word Cruncher, Tact, and other newer programs are all third-generation programs in this sense.

We are, in these terms, in the fourth generation of tools, now, and things actually have changed since 1996. We interact with our interactive general-purpose programs over the network, and they begin to be able to interact with each other. The interoperability of software with software, data with data, each with the other, takes on a new dimension. And yet, in some respects, what Sperberg-McQueen says next is still pretty true of where we are now. He says:

What these programs do is useful: produce concordances, allow interactive searching, annotate texts with linguistic or other information, and a multitude of other tasks. Why aren't people happier about the current generation of software?

I think there are several reasons:

  1. For many potential users, existing software still seems very hard to learn. Not everyone thinks so, but those who find current software easy are in a decided minority. User interfaces vary so much between packages that learning a new package typically means learning an entire new user interface: there is very little transfer of training.

This much is still true. Interestingly, what comes next is less true, and what comes after that even less so, today:

  2. Current programs don't interoperate well, or at all. CLAWS or other morphological analysers can tag your English text with part-of-speech information, but if you've tagged your text with Cocoa or SGML markup, you'll have to strip it out beforehand, to avoid confusing CLAWS. If you want to keep the tagging, you'll have to fold it back into the text after CLAWS gets through with it, as the British National Corpus did. Other programs, of course, will be confused in turn by the part-of-speech tagging, so you'll have to strip that out, too, before applying other text analysis tools, and then fold it back in again afterwards.

It's worth pausing for a moment to consider that these particular problems either don't exist, or needn't exist any more. Current programs can interoperate much more successfully now that they increasingly support well-formed XML at the level of data, and they can interoperate at other levels through things like SOAP (Simple Object Access Protocol) and WSDL (Web Services Description Language), and even HTTP.
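
To make that concrete, here is a minimal sketch--mine, not any project's actual code, in Python, with a hypothetical service URL and response format--of what that kind of interoperation looks like: one tool hands a well-formed XML fragment to another over plain HTTP and gets well-formed XML back, so the markup never has to be stripped out and folded back in afterwards.

    # A minimal sketch of data-level (well-formed XML) and protocol-level
    # (plain HTTP) interoperation. The service URL and the shape of its
    # response are hypothetical.
    import urllib.request
    import xml.etree.ElementTree as ET

    TAGGER_URL = "http://example.org/pos-tagger"  # hypothetical endpoint

    def tag_fragment(xml_fragment: str) -> ET.Element:
        """POST a well-formed XML fragment to a (hypothetical) tagging
        service and parse the XML it sends back."""
        request = urllib.request.Request(
            TAGGER_URL,
            data=xml_fragment.encode("utf-8"),
            headers={"Content-Type": "application/xml"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            return ET.fromstring(response.read())

    if __name__ == "__main__":
        fragment = "<p>Call me <persName>Ishmael</persName>.</p>"
        tagged = tag_fragment(fragment)
        # Because both sides speak XML, the original markup (<persName>,
        # for instance) survives the round trip intact.
        print(ET.tostring(tagged, encoding="unicode"))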

  3. Current programs are often closed systems, which cannot easily be extended to deal with problems or analyses not originally foreseen. You either do what the authors expected you to do, or you are out of luck. The more effort a program has put into its user interface, the more likely it is to resist extension (this need not be so, but it is commonly so); the more open a system is, i.e. the more care its developers have put into allowing the user freedom to undertake new tasks, the worse the reputation of its user interface is likely to be....

This one is still true, and probably will always be true (if we agree that choosing a new skin for your MP3 software doesn't count as "extension" of an interface).

  4. Almost all current text analysis tools rely on what now seems a hopelessly inadequate model of text structure. Text, for these programs, is almost invariably nothing but a linear sequence of words or letters. The most sophisticated of them may envisage it as a linear sequence with tags showing the values of different variables at different points in the sequence (as in Cocoa-style tagging), or as an alternating sequence of text and processing instructions (what is sometimes called the one-damn-thing-after-another model of text structure). None of the current generation of text-analysis tools support the significantly richer structural model of SGML, though the experience of the TEI has persuaded many people that that structural model represents a radical breakthrough in our ability to model text successfully in machine-readable form.
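
The contrast that last point describes is easy to see in miniature. Here is a toy illustration (mine, in Python, using a made-up passage rather than anything from the tools named above): the same text read first as a flat sequence of words--the one-damn-thing-after-another model--and then as the nested, queryable structure that well-formed XML exposes.

    # The same passage under two models of text structure.
    import xml.etree.ElementTree as ET

    passage = """<sp who="Hamlet">
      <l>To be, or not to be, that is the question:</l>
      <l>Whether 'tis nobler in the mind to suffer</l>
    </sp>"""

    # Linear model: markup discarded, the text reduced to a word sequence.
    words = " ".join(ET.fromstring(passage).itertext()).split()
    print(len(words), "words in a flat sequence")

    # Structural model: the hierarchy (a speech containing lines, with an
    # attribute naming the speaker) can be queried directly.
    speech = ET.fromstring(passage)
    print("speaker:", speech.get("who"))
    for line in speech.findall("l"):
        print("line:", " ".join(line.text.split()))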

This, interestingly, is quite different, thanks in part to the success of TEI in this community, but in a much larger part, frankly, to the success and general acceptance of XML. And the difference that XML makes is profound--as you realize immediately when you read Sperberg-McQueen's account of the discussion of software architecture at this 1996 meeting:

The key features ... for [the next generation of text analysis software] are:
  1. modularity (the system should be a "collection of relatively independent programs, each of which offers a well-defined subset of basic operations for processing textual data")
  2. professionality (the modules should support serious research work; this requires more than supporting office automation)
  3. integration (the modules should, together, handle all stages of a project: data capture, analysis and processing, and presentation or output)
  4. portability (programs and data should be system-independent, to buffer users from rapid changes in hardware and to allow inter-site collaboration)

Modularity, portability, integration: it's what we all still want--except when we actually have to get something done, in which case our tendency is to build something to the purpose, and the hell with re-use. In fact, we often don't get close to building the application when we take the re-use route, because we have to begin by writing the specification. For Calvin and Hobbes fans, it's a little like Calvinball: the only game you actually play is the game of making up the rules. Later in his report, in recounting the breakout discussions, Sperberg-McQueen tells a story of Calvinball:

The architectural group began, plausibly enough, by deciding to decide what it might mean to specify an architecture for the kind of system we had been talking about: what needs to be specified, and at what level of detail? This is, surely, a necessary first step. It would be nice to be able to report that after it, we had taken another one, but after we had reached something resembling agreement, it was time for lunch. Our dedication to the cause fought with our desire to eat; struggled; wavered; lost. We went to lunch.

It would be tempting to say that we've been out to lunch ever since, but that's not quite true. On the architectural front, in particular, a lot of problems that this group was thinking they might have to solve for themselves have now been solved for them, by the larger community of the W3C, at least at the level of standards and protocols. Michael Sperberg-McQueen was an important part of that, as one of the authors of the XML standard. Good work has also been done at the level of APIs, in projects like The Open Knowledge Initiative and, in a more modest way, in IATH's own Granby. In retrospect, it is remarkable to think that a group of humanities computing types ever sat around and thought, in effect, that they might need to invent Java, XML and all its constituent standards, and by the way, the semantic web.

Again, as far as I know, nothing ever came of this meeting. Like TSI, it was a good-faith effort at getting people to rally around a software-development problem, and do it on a volunteer basis, or at least voluntarily coordinate the things they were doing on other bases. In the end, though, nothing ever really happened--the spirit was willing, but the flesh was weak.

A few years later, in 1998, again on Humanist, Tom Horton announced ELTA, the Encoded Text Analysis Initiative: this initiative was a stab at developing the fourth-generation tools for humanities computing, and it came along at a time, just a couple of years later, when we could all accept the notion that SGML was, and XML soon would be, the grammars we would all use. Elta described itself as

a collaborative effort to encourage and support the development of software tools for the analysis, retrieval and manipulation of electronic texts. Our focus (at least initially) is on tools to support the needs of the humanities computing community, but we hope our results are useful for anyone interested in computer processing of texts marked up with SGML and XML.

We have organized Elta in response to continued interest and need for such software, most recently expressed at the birds-of-a-feather session at ALLC/ACH'98 in Debrecen. At this time Elta provides Web resources and an email list to support those interested in the Initiative's goals for promoting software development. (http://lists.village.virginia.edu/lists_archive/Humanist/v12/0242.html)

It's hard to overstate the sadder-but-wiser twang I feel when I read, out loud, the words "At this time" (so hopeful, that) or "Elta provides Web resources" (about what Elta would like to be, mostly, and some pointers to things that might like to be it) "and an email list" (which never did much business). I was at the meeting in Debrecen, Hungary, at which ELTA was established, and I took part in the planning of it, and I can tell you what its problems were, and what problems it shared with earlier efforts. First, it had no funding--though it hoped to find some. Second, it had no organizational structure. Third, because of the first problem, it could only get off the ground with volunteer labor from people who already had full-time jobs--but the Elta initiative offered no incentive, no motivation, no account of why that labor should be volunteered, other than that we would all like to have the free use of the tools we imagined might emerge from it.

It won't come as a surprise, then, that nothing ever came of ELTA either, except maybe other reformulations of the call to action . . . for example, TAPoR, the Text Analysis Portal for Research (http://huco.ualberta.ca/Tapor/):

TAPoR will build a unique human and computing infrastructure for text analysis across the country by establishing six regional centers to form one national text analysis research portal. This portal will be a gateway to tools for sophisticated analysis and retrieval, along with representative texts for experimentation. The local centers will include text research laboratories with best-of-breed software and full-text servers that are coordinated into a vertical portal for the study of electronic texts. Each center will be integrated into its local research culture and, thus, some variation will exist from center to center.

The difference is that TAPoR has funding: $6.7M, Canadian. And yet, if you read the TAPoR text carefully, it doesn't say it will develop tools--it says it will be a gateway to them, and to testbeds of texts. That qualification of the TAPoR mission is due to the restrictions on the funding from the Canada Foundation for Innovation: software development would fall into a category it doesn't fund, though it will fund the process of collecting software, documenting it, and providing access to it.

So, here we are only ten years after TSI, only six years after CETH's symposium, only four years after ELTA, and practically on top of TAPoR, and I am about to tell you that it's time to think about developing some tools for text analysis (and other forms of analysis) in humanities computing. And it is, for the same reason as it was: we have lots of this stuff around now, and we need to do something with it. It's very much like the Shoah problem with the 52,000 interviews with survivors.

So, let's assume that we learn from the past and that we benefit from the standards environment of the present. What principles might guide this effort, and might help to build a modular, extensible toolkit that actually does things humanists want to do--that actually helps us ask new questions, and answer old ones? I took a first stab at answering that question in a piece I wrote a couple of years ago, on Scholarly Primitives. The argument of that piece is pretty straightforward: lots of different scholarly operations in lots of different disciplines can be decomposed into a relatively small number of primitives--fundamental operations that can't be decomposed into further constituent elements. A partial list of those operations would include discovering, annotating, comparing, referring, sampling, illustrating, and representing.

But--and this is the all-important but--you want to enable all these things in a standards-based software architecture that is open, modular, extensible, cross-platform, etc. And--this is the all-important and--you want to be able to get important work done with these tools. Looking at the Shoah database and interface and tools yesterday, I was thinking of the scholarly primitives--and there they all are: discovery, selection, annotation, comparison, and on and on. But how much of that code is reusable? How much can be applied to other data? How much was reused from other projects? None, I'm sure. And the same can be said of almost all the examples I just gave. Each piece of software, in order to achieve a goal, has gone after a particular one or two of these primitives, and wrapped them up tight in a self-contained package.
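
What would the alternative look like? Here is a sketch only--in Python, with invented names, and no claim to be the right design--of the general shape: each scholarly primitive as a small module behind a shared interface, so that an implementation written for one project could be registered and reused by another instead of staying locked inside a single application.

    # Illustrative only: invented names, toy logic. The point is the shape:
    # each primitive is a small, reusable module behind a common contract.
    from abc import ABC, abstractmethod
    from typing import Iterable

    class Primitive(ABC):
        """Minimal shared contract for a scholarly-primitive module."""

        name: str  # e.g. "annotate", "compare"

        @abstractmethod
        def run(self, documents: Iterable[str], **options) -> object:
            """Apply the operation to one or more documents."""

    class Compare(Primitive):
        """A toy 'compare' primitive: words unique to each document."""
        name = "compare"

        def run(self, documents: Iterable[str], **options) -> dict:
            vocabularies = [set(d.split()) for d in documents]
            shared = set.intersection(*vocabularies) if vocabularies else set()
            return {i: sorted(v - shared) for i, v in enumerate(vocabularies)}

    # A portal or workbench could discover registered primitives and chain
    # them, rather than re-implementing each one inside a closed package.
    registry = {p.name: p for p in [Compare()]}
    print(registry["compare"].run(["call me ishmael", "call me ahab"]))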

So what if we were actually going to do this--actually build reusable, modular tools for the computer analysis of humanities data--with funding, with organizational structure, with an account of the motivation for participation? How would you go about it? Well, as it happens, I have an answer to that question:

  1. First, you'd want consensus about what those primitives are--and about what problems need to be solved in what order--and you'd want it across a reasonable-sized group of researchers working with computational tools in humanities research.
  2. Second, you'd want that architectural specification: which standards, which APIs, etc. At this point, though, you could choose many of those things off the shelf, if you didn't make the mistake of insisting that you had to reinvent them.
  3. Third, you'd want scale--you'd want enough people working on a problem so that you could actually get a working tool in a reasonable period of time. Like while the researcher who wants to use that tool is still alive. How many people is that? Oh, probably more than ten, fewer than fifty. Shouldn't cost more than a few million a year.
  4. Fourth, you'd want management--no, really, you would want management. You need to make sure these people actually are working together, actually are working on the same problems, working toward a common goal. In fact, if the group is big enough, you want more than one level of management--but you'd like not to have to pay for that.
  5. And finally, you'd want these tools developing and being tested in conjunction with each other, and in real research applications, with real researchers.

So, what's the scheme that makes this work? Who is motivated to make it work? Well, foundations, agencies, and libraries that have made substantial investments in creating digital libraries are motivated to contribute funding, because they need to prove that the investment in digitizing--and especially in creating highly structured, high-quality digital collections--has been worth it. Increased access alone isn't enough to prove that, I think. We need to be able to do more, in our digital libraries, than search and browse. Humanities research computing centers (several of which are represented here) might be persuaded to identify a common research program across their projects, and train and manage the staff to accomplish that program while working with humanities researchers, if the salaries of those staff were donated. Heck, they might even come up with the funding for travel and training. Staff might be motivated to accomplish the agreed-upon goals if they knew their paycheck depended on it. Researchers might be motivated to use the tools if they had been consulted, at the outset, as to the purposes for which they would be built.

I can hear the counterproposals: some begin with the words "Just collaborate with CS" and others with the words "Just use open source development." And the third: "Just learn to program, you sissy." It's true, on the first point, that you might be able to find some areas of overlapping research interests between humanities researchers and those in computer science, but in that case, the computer science faculty would be in the same role, in the same position, as the humanities faculty--they would be identifying research goals and testing tools. I doubt that the building of these tools itself is going to count as a research project for anyone. I could be wrong about that, but better to start from that assumption than to start from the assumption that CS faculty, or their students, will be eager to build, as their research, the tools that we want to do our research. On the second point--open source development--the project should have a migration path to Sourceforge, but the difference between things that thrive and things that wither on Sourceforge has a lot to do with how far along they are when they get there, and how clear it is what people are supposed to help build, and why. And as to the third objection, well, it will be a long time before many of my colleagues feel it is important to learn how to program in order to build their own tools--some do, and more will, but most don't, and many never will. And a lot of them don't mind being called sissies, either.

1996, the year of the CETH symposium on text analysis tools, was also the year in which Ross Callon proposed, in Internet RFC 1925, "The Twelve Networking Truths." Truth number 11 states:

"Every old idea will be proposed again with a different name and a different presentation, regardless of whether it works." (http://www.faqs.org/rfcs/rfc1925.html)

So: are we back where we were in 1996, looking at an old idea in a different presentation, even though it didn't work the first time? We ought to have a very clear answer to that question before we throw a few million dollars at the problem. I think we are in a significantly different world with respect to standards and architecture--but we still have to do the requirements analysis, and we still have to agree on the tools we'll build. We still have the social issues--how to structure a successful collaboration; how to engage end-users in design; how to sustain such a project over the long run, through inevitable changes in institutional priorities, computing environments, and personnel. On the other hand, we have quite a bit more research infrastructure in this area than we did ten years ago, or even five. We also are a good deal closer to general acceptance, in the disciplines of the humanities, that it is necessary to deal with information technology. And finally, we have a lot more in the way of digital primary resources than we did in 1993 or even 1996. Those resources are, in and of themselves, the most compelling argument that we need to be able to do more with them. And the relative poverty of the humanities is the best argument for making sure that we try to build durable, general purpose, reusable software--because we can't afford purpose-built, disposable code.

Building these tools will answer or moot many of the questions we've been discussing earlier in this conference, and will shift the burden of proof, in effect, from the new modes of scholarship to the traditional ones: if we build tools that do allow us to ask new questions and answer old ones, then it will be clear why we have built our digital libraries, and in the disciplines, we will worry about what hasn't changed in scholarly methodology, and not about what has.