“All the King's Horses and All the King's Men

Couldn't Do Text-Mining Across the Big Ten.”


Symposium on What to Do With a Million Books

http://dhcs.uchicago.edu/

University of Chicago

Sunday, November 5th, 2006


John Unsworth, Dean & Professor,

Graduate School of Library and Information Science

University of Illinois, Urbana-Champaign


Alternate title: One fool can ask more questions than a thousand wise men can answer.


Like a number of people at this conference, I'm interested, at the moment, in data-mining—that is, “the practice of automatically searching large stores of data for patterns . . . [using] computational techniques from statistics and pattern recognition” (http://en.wikipedia.org/wiki/Data-mining). In particular, I'm interested in text-mining in humanities digital libraries, which now contain quite large collections of literary texts—novels, poems, and plays—in English. Large, in this case, means single-digit terabytes at most, which I realize is not large in comparison with trillions of web pages, but it is increasingly complete at least as a core sample of the history of printed literary text in English, at least up to the early 20th century. If copyright weren't an issue, we'd have much larger collections, of course. These literary texts are machine-readable and for the most part they have been marked up in XML by librarians, publishers, and scholarly projects, but they also require additional pre-processing before they can be used for text-mining (for example, tokenization, part-of-speech tagging, etc.).


I'd like to talk about some of the things about our current digital library environment that make it difficult or impossible to do text-mining across collections, some of the problems that need to be solved in order to do such work, as well as why it might be desirable to solve them and who else might be working on similar problems. All of that, in turn, comes under the general heading and buzzword of cyberinfrastructure, and part of my purpose here is to demonstrate that the humanities need and can use cyberinfrastructure, and to suggest which features of that cyberinfrastructure are likely to be required by other communities, and which are likely to be uniquely necessary to the humanities.


At various points in what follows, I will make reference to several projects.

Much of what I have to say today comes out of the experience of working on the Nora project, out of developing a collaboration with the WordHoard project, and out of thinking about what the two have in common, where they differ, and what should be done that neither of them does, so allow me to describe these two projects briefly (borrowing from Martin Mueller for the description of WordHoard).


Let's begin with similarities: Nora and WordHoard share the basic assumption that the scholarly use of digital texts must progress beyond treating them as book-surrogates and move towards the exploration of the potential that emerges when you put many texts in a single environment that allows a variety of analytical routines to be executed across some or all of them.


The WordHoard project applies to literary texts the insights and techniques of corpus linguistics, namely the empirical and computer-assisted study of large bodies of written texts or transcribed speech. In WordHoard, such texts are annotated or tagged according to morphological, lexical, prosodic, and narratological criteria. In its current release, WordHoard contains the entire canon of Early Greek epic in the original and in translation, as well as all of Chaucer and Shakespeare, and Spenser's Faerie Queene. WordHoard offers an integrated environment for close reading and scholarly analysis of a limited number of texts. The user interface focuses on what Martin calls the basic philological activity of “going from the word here to the words there,” and it leverages the power of the computer to support this activity more effectively than print tools, such as a concordance, could do. But WordHoard’s statistical module also includes a number of routines that we have heard named earlier today, and that come from the domain of data-mining from which Nora starts.


For my purposes today, it is significant that the application and the texts are fairly tightly integrated in WordHoard, in the sense that the texts have been tagged in ways that facilitate the functionality of the application. The admirable documentation that the WordHoard project publishes describes a very detailed data model, and the admirable user-interface that the WordHoard software presents is leveraged off that data model.


The goal of the Nora project has been to produce software for discovering, visualizing, and exploring significant patterns across collections of full-text humanities resources in the wild, as it were—that is, in existing digital libraries. Like WordHoard, Nora applies some of the tools, techniques, and insights of corpus linguistics to its collections, and like WordHoard, Nora deals with literary texts, though from a later era—British and American literature of the 18th and 19th centuries. Unlike WordHoard, Nora's end-user application is deliberately loosely coupled to the data, and assumes that we cannot add tagging to the underlying documents. Partly as a result, what we have to show at this point is a lot less impressive to look at, and less satisfying to work with, but it is probably more generalizable across arbitrary text collections. Current Nora applications focus on text categorization, text-mining, and visualizations of patterns in collections.


If you were to look under the hood of the two projects, you would find that both have procedures for

  1. Ingesting arbitrary texts that meet some rules (e.g. well-formed XML)

  2. Tokenizing the texts, assigning to each word a unique location, and applying part-of-speech tagging and other techniques familiar from corpus linguistics

  3. Converting the tokenized and preprocessed texts into a datastore that includes various count objects to simplify and speed up subsequent operations
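
As a rough sketch of steps 2 and 3 (not either project's actual code; NLTK stands in here for whatever tokenizer and part-of-speech tagger a project actually uses, and the document identifier is invented), the whole chain might look like this:

```python
import nltk  # assumes nltk is installed, with its 'punkt' and
             # 'averaged_perceptron_tagger' data packages downloaded

def preprocess(doc_id, text):
    """Tokenize, give each token a unique location, tag parts of
    speech, and build the count objects a mining datastore keeps."""
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    located = [(doc_id, i, tok, pos) for i, (tok, pos) in enumerate(tagged)]
    counts = {}
    for _, _, tok, _ in located:
        counts[tok.lower()] = counts.get(tok.lower(), 0) + 1
    return located, counts

located, counts = preprocess("dickinson-254", "Hope is the thing with feathers.")
```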


In Nora, the data store provides the basis for a chain of operations triggered via web services and run in D2K before returning results to the end-user applications. In the WordHoard environment, the user interface talks to the data store through an object/relational persistence and query service called Hibernate. In both Nora and WordHoard, however, the datastore is separable from the processes it feeds and could in principle feed quite different processes via quite different intermediate layers.


Nora and WordHoard differ marginally in their basic ways of tokenizing and preprocessing data. They differ with regard to lemmatization and named-entity extraction, for example, but these are essentially unmotivated differences, and we expect it will be easy to reach agreement on a set of shared pre-processing procedures. As I mentioned before, WordHoard has also done some tagging within the source document, identifying prosodic and narratological phenomena, but the assumption going forward is that while the tools should be able to take advantage of specialized tagging, the tools cannot insert or assume it.


Nora and WordHoard have both employed relational database systems to maintain their data stores, but Nora is also exploring using Lucene. Both projects make use of the XML tags in the texts, something that sets them apart from most text-mining done in the scientific community. On a technical level, both projects distribute a Java Web Start application, though Nora is experimenting with OpenLaszlo as a framework for building even lighter-weight front-ends.


Because these two projects have very similar underlying requirements for their texts, and very similar basic techniques for analyzing those texts, it makes sense to combine them. Because they have developed in complementary ways and explored alternative strategies for accomplishing similar goals, it seems likely that they will strengthen one another—but it will also be a challenge to combine them.


One major challenge will be to construct a datastore that will be sufficiently robust, fast, and flexible to support work with data sets larger than what either project has encompassed to date. Both Nora and WordHoard have created or collected data to work on, and neither works with collections that exist aside from the tool, so another challenge will be to develop this next generation of tools to work in the real world, alongside collections that scholars are already using. In doing that, we need to bear in mind that while some special processing or representation of those collections may be necessary to enable text-mining, such processing must be a push-button process from the point of view of those who maintain the collections, not something that requires lots of human intervention. And there must be no fiddling with the output of that process, either—no manual munging of derivative datasets that later may need to be replaced.


At the same time, we can see from the rapid rise and surprising power of social software (wikis, blogs, social bookmarking, folksonomies, etc.) that in some cases it might make sense to assume that what Bill Tozier, this morning, called “crowdsourcing” might be useful: users could contribute to improving and enriching (perhaps, as in DPP, proofreading) texts in digital libraries. The current release of WordHoard, for instance, lets users construct customized ‘word sets’ and store these for private or public use. Of course, there are many issues to address here, beginning with the library’s need to ensure the integrity of its collections—and the fact that many of these collections will actually be licensed from and served up by publishers, who are also going to be concerned about the integrity of their product. Still, it seems at least imaginable that user contributions might be layered on top of the original texts, and that an environment for the online analysis of texts would be far richer if it were an online community as well, where analysis could be shared, where intermediate artifacts from one person’s research process could be made available to others rather than being recreated by them, and so on.


Currently, we have to aggregate texts in order to do text-mining on them—we have to have possession and control of the data in order to pre-process it, and we need to have it in one pile in order to get meaningful answers to statistical questions. However, this means that either our access to texts or our access to real users is limited: either we have to create our own collections (e.g., the No Name Shakespeare) or, in order to get permission to aggregate other people's texts, we have to promise not to republish them, and to use them only for testing and developing new tools. That's why the next step is to build and test the tools within the confines of discrete existing collections, whereupon at least we'll be able to work with the users for whom each collection is licensed—but in fact we realize that real users need to gather material from across collections in order to create a coherent set of material: one collection will not have all the texts that a scholar in any given area will want to use. And that's why, ultimately, we need to be able to do text-mining across collections, Humpty-Dumpty notwithstanding.


So, with these modest goals and requirements in mind, let's consider the obstacles that stand in the way of attaining them. Some of these are obstacles in a scenario where we aggregate resources, some are obstacles in a scenario where we distribute the tools in order to observe collection boundaries, and some are obstacles to text-mining across distributed collections. I'll consider them in that order, at the lowest level where they occur, in this hierarchy.


I. Challenges for text-mining with aggregated collections


Deracinated Resources:

Texts that are prepared with the notion that they will always be used in the same way, for browsing and searching, in the same environment for which they were originally prepared, have a tendency to leave certain kinds of information implicit—it's implicit elsewhere in the system, and not explicit anywhere in the text itself. Once you start to aggregate these resources and combine them in a new context and for a new purpose, you find out, in practical terms, what it means to say that their creators really only envisioned them being processed in their original context—for example, the texts don't carry within themselves a public URL, or any form of public identifier that would allow me to return a user to the public version of that text. They often don't have a proper DOCTYPE declaration that would identify the DTD or schema according to which they are marked up, and if they do, it usually doesn't point to a publicly accessible version of that DTD or schema. Things like entity references may be unresolvable, given only the text and not the system in which it is usually processed. The list goes on: in short, it's as though the data has suddenly found itself in Union Station in its pajamas: it is not properly dressed for its new environment.
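
A trivial ingest check, sketched here on the assumption that the texts arrive as standalone XML files (the specific tests are illustrative, not a complete diagnosis), can flag deracinated resources before they break a mining pipeline:

```python
import sys
from xml.etree import ElementTree

# Flag "deracinated" files on arrival: no DOCTYPE declaration, or
# entity references the parser cannot resolve without the home
# system's DTDs and catalogs.
def check(path):
    text = open(path, encoding="utf-8", errors="replace").read()
    if "<!DOCTYPE" not in text:
        print(path, ": no DOCTYPE declaration")
    try:
        ElementTree.fromstring(text)
    except ElementTree.ParseError as err:
        print(path, ":", err)  # often an undefined entity reference

for path in sys.argv[1:]:
    check(path)
```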


Fidelity vs. normalization:

Digital representations of printed text—especially in the humanities—tend to be faithful to the original in ways that often make it difficult to do text-mining. A simple example involves line-end hyphenation: faithful encoders tend to record line-end hyphenation, and textual editors will probably thank them for doing that, because from the point of view of textual editing, line-end hyphenation can carry meaning. Text-miners will not thank them, however, because the interposition of a hyphen in the middle of a word makes it difficult to tell that the word is actually the same as its non-hyphenated (normal) form. That makes it difficult to get accurate token counts, which in turn makes it difficult to get accurate document frequencies for tokens, and so on.
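
A minimal sketch of the normalization step, assuming the simplest case in which a line-end hyphen is encoded as a hyphen followed by a line break (real collections mark this in richer ways, and a production version would consult a lexicon before joining, since some hyphens belong to the word):

```python
import re

# Rejoin words split by line-end hyphenation, so that "diffi-\ncult"
# is counted as one token, "difficult", not as two fragments.
def dehyphenate(text):
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)

print(dehyphenate("it is diffi-\ncult to count"))  # it is difficult to count
```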


Scale:

Most text-mining in the world at large is done with small chunks of text: newspaper articles or email messages probably define the outer limit of size in most cases. Some texts in the humanities are short—Emily Dickinson's poetry, for example—but most are novels, plays, histories, and those generally run to hundreds of pages. Scale in the unit of text does matter, even if we're not talking about billions of words per unit. In one collection that we work with in Nora—UNC's “Documenting the American South” collection—there are about 600,000 unique tokens: in a universe of billions or trillions of tokens, that's not a big number, but it could be large enough to be “expensive” in Bill Punch's terms, if you were waiting on the other side of a web browser.


Complexity:

Most text-mining in the world at large is done with unstructured text. Structure adds complexity to the analysis, unless you ignore it—but why ignore information that some human being, familiar with the text and its uses, considered important enough to encode? What virtual intelligence could we leverage about the unstructured component of these collections (the text between the tags) by paying attention to the tags? What might we leverage from more densely tagged texts that would help us infer structure in less densely tagged texts? And then there's a different level of complexity, namely the complexity of the texts themselves, and beyond that, the complexity of the interests of scholars: I think Tanya's talk this morning gave some sense of the spiral of expectations and requirements that will arise out of the initial exposure to these tools, once the tools are out there with the collections. We've begun in Nora with individual words, and some work on named entities and their relationships, but there's a great deal more to look at, as features: phrases with semantic significance, syntactic units, affect, metaphor, and so on.


Abstracting idiosyncrasy:

In Nora, we've tried hard to produce a generalized client that can work with arbitrary collections and (within a paradigm such as classification) more or less arbitrary purposes on the user's part. In order for such a client to know what collections are available on a server, how to navigate those collections and what works are available within them, what units of text are available for analysis, and what the peculiarities of a given work or collection might be, we access over the network a configuration file in which certain kinds of information about the collections can be expressed. This is particularly important as an abstraction of idiosyncratic characteristics of collections, and idiosyncratic knowledge about them—which, in turn, must be encoded in some standardized, accessible, programmable way if you want to create general-purpose clients. Similarly, the configuration file can be used to record information that would be useful in ingesting a collection and normalizing it for text-mining. At this end of the process, abstracting configuration information promotes automated derivative representations of collections for text-mining, and helps to discourage manual data-munging. In Nora, this idiosyncratic, collection-specific information is expressed in an XML data structure that is written by humans who know the characteristics of the collection (or using automated metadata extraction routines written by them), and read by software. The grammar of this XML file provides constructs that allow us to map multiple files to a work (for example, different volumes of a triple-decker novel might have been encoded in separate <TEI.2> documents), or to identify a component level (for example, a chapter) that is available for the purposes of data mining. We call this configuration file “Nora Chunk.”


If you examine a Nora Chunk file, you will see these constructs at work; a hypothetical fragment follows.
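
To be clear, the element names below are invented for illustration (the real Nora Chunk grammar is documented with the project); the sketch just shows the kind of mapping described above, parsed with Python's standard library:

```python
from xml.etree import ElementTree

# Hypothetical Nora Chunk fragment: three volume files map to one
# work, and chapters are exposed as the unit available for mining.
FRAGMENT = """
<collection id="wright-american-fiction">
  <work id="novel-042" title="A Triple-Decker">
    <file src="novel-042-vol1.xml"/>
    <file src="novel-042-vol2.xml"/>
    <file src="novel-042-vol3.xml"/>
    <chunk-level element="div" type="chapter"/>
  </work>
</collection>
"""

root = ElementTree.fromstring(FRAGMENT)
for work in root.iter("work"):
    files = [f.get("src") for f in work.iter("file")]
    unit = work.find("chunk-level")
    print(work.get("id"), files, unit.get("type"))
```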

WordHoard has also developed its own properties file, for similar reasons, and we think that going forward, in MONK, we should try to harmonize these, and also consider how these files and their syntax, developed out of necessity on the way to doing something else, might relate to more generalized collection-level metadata schemes like METS (Metadata Encoding and Transmission Standard).1


II. Challenges for text-mining with distributed tools

Keeping it simple and cheap:

People who maintain collections in libraries, or publishers who provide them to libraries, might be persuaded that it is worth also providing tools that make those collections more useful or more interesting, but not if doing so introduces lots of new complications, or requires lots of new resources. Keeping it simple means making sure that data preparation for text-mining requires minimal human intervention, and that any new information that results from the process is either contained in that reusable configuration file or is automatically reproducible. Keeping it cheap is a different kind of challenge: an automatically generated representation of a text for the purposes of text-mining necessarily contains a lot of stuff that isn't in the original representation (all that part-of-speech tagging, for example), and it can be many times the size of the original collection. Text is cheap, in terms of the storage it requires, but many times cheap might no longer be cheap. And beyond storage, there's processing: text-mining with large collections in real time is processor-intensive and I/O-intensive; doing it across the network can also be bandwidth-intensive, depending on how the data flow is constructed and how much processing you want to be able to do on the client side.


Recombinant and re-usable data sets:

Ideally, as suggested earlier, we would have an environment that allows users of these tools to share their results with one another, and an environment that allows them to share the intermediate artifacts in the research process—the preprocessed sub-collections that the individual user has assembled to answer a particular question—because that pre-processing is what takes most of the time: the analysis itself takes relatively little time. Therefore—if we supposed that two users might be interested in the same sub-collection—there would be a value in keeping that sub-collection around once it had been preprocessed. How would we identify such sub-collections? Where would we store them, especially if they were drawn from multiple home-collections? How long would we keep them around before deciding nobody else was going to use them? Would they be necessary for checking the work done with them, or would we have confidence that the same texts, reassembled, would constitute the same sub-collection?
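
On the identification question, one plausible answer (my sketch, not a design either project has adopted) is a content-derived name: hash the sorted document identifiers together with the version of the preprocessing pipeline, so that the same texts, reassembled and reprocessed the same way, always yield the same identifier, and a cached derivative can be reused or confidently regenerated:

```python
import hashlib

def subcollection_id(doc_ids, pipeline_version):
    """A stable name for a preprocessed sub-collection: the same
    documents run through the same pipeline always hash alike."""
    key = pipeline_version + "\n" + "\n".join(sorted(doc_ids))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

# Hypothetical document identifiers from two home collections:
print(subcollection_id({"uiuc:novel-042", "unc:docsouth-117"}, "nora-pre-0.3"))
```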


Amelioration vs. reliability:

We also need an environment that allows users to enrich and improve the collections that they're working with, adding user-supplied metadata, where metadata might be anything from normalized spelling to suggested corrections of underlying texts, to richer thematic or contextual information—but the user needs to be able to do that without undermining the integrity of the original collection or diminishing another user's confidence in that integrity. If we create such an environment, where will it reside, and how will it be managed? Is it something that individual collections (libraries, publishers) will participate in, in some web-like way, or will it have to be totally distributed (peer-to-peer, on the end-users' machines) or totally centralized (and who would then be the host)? If user input actually does produce improvements to underlying collections, will those collections then need to be re-processed for future analysis? If not, how would stand-off improvements to the text-base be reflected in the derivative representation that's used for text-mining?
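
One way such a layer might be modeled (a sketch only; the names and fields are mine, not a proposal from either project) is as standoff annotation: user contributions live in a separate store, keyed to stable document identifiers and character offsets, and the library's base text is never touched:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StandoffNote:
    doc_id: str   # stable identifier of the immutable base text
    start: int    # character offsets into that text
    end: int
    kind: str     # "correction", "normalization", "theme", ...
    value: str    # proposed reading or label
    author: str

def apply_view(text, notes):
    """Render a reading view with corrections applied; the base
    text itself (the library's copy) is never modified."""
    for n in sorted(notes, key=lambda n: n.start, reverse=True):
        if n.kind == "correction":
            text = text[:n.start] + n.value + text[n.end:]
    return text

base = "It was the best of tymes"
notes = [StandoffNote("doc-1", 19, 24, "correction", "times", "reader-7")]
print(apply_view(base, notes))  # It was the best of times
```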


III. Challenges for text-mining across distributed collections

Access to XML:

In most cases, even though the underlying source of library-, publisher-, and scholar-produced electronic texts is XML, and the information embedded in those XML tags is almost always richer than what gets presented in a web browser, there's almost never an easy way—at the site of public dissemination—to get at that XML. The web user almost always gets a rendered HTML or XHTML version, with no option to request the underlying source. This wouldn't be a problem if all the texts you wanted to work with were in a collection that had text-mining tools mounted along with it, but it is a problem if you are interested in working with texts that are not presented along with tools, or if you need to create a sub-collection by pulling together texts from across collections.


IP in distributed collections:

Texts that expect conditions of use to be enforced by the system in which they are embedded, before you get to the text itself, usually don't carry with them information that would stipulate their conditions of use—and if they did, we would need a framework like Shibboleth2 to be widely adopted and further developed, in order to resolve the user's rights under those conditions. One of the major reasons for not pursuing the strategy that both Nora and WordHoard used in the first round of their development—creating or aggregating the collections on which to work—was that we thought it would be no fun to spend all of our time administering other people's intellectual property regimes, figuring out which of our users had permission to use which of their texts, and so on. The intermediate solution is to provide one instance of the tools per collection, and assume that whoever has access to the tools also has access to the texts, but that's an expedient that's only slightly less limiting than the expedient of creating your own collections. What we really want, obviously, is for the tools to be free-range, even if the texts are caged.

The need to read:

One of the things that's become clear in both Nora and WordHoard is that users need access to at least substantial context and often the full text of individual documents at multiple points during the text-mining or text-analysis process. For example, most of the text-mining that we have done in the Nora project falls into the category of supervised learning, which is to say that human users need to rank or rate individual texts for their likeness to a particular target: in order to do that, the user needs to actually see and read the text. Once texts have been rated and that training set has been submitted to the classification software, users need to be able to see what characteristics of previously unexamined texts caused them to be sorted into the target category by software. As was suggested this morning, humanities users are not so interested in knowing that software predicts a particular text will be perceived as sentimental, or that a particular speech in a play was likely to have been delivered by a woman or a man—these are things they are perfectly capable of determining without the help of the computer. However, they are interested in knowing why software might make that prediction. In other words, they want to see the evidence, and it's hard to show evidence without showing text, in text-mining.
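
To make the supervised-learning loop concrete, here is a minimal sketch (using scikit-learn as a stand-in for Nora's D2K routines, with invented two-document training data) of both halves: training on user-rated texts, and then surfacing the word-level evidence behind a prediction:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Texts the user has read and rated (1 = "sentimental", 0 = not).
rated_texts = [
    "tears and tender sorrow filled her gentle heart",
    "the committee reviewed the quarterly report",
]
labels = [1, 0]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(rated_texts), labels)

# For a previously unexamined text, rank the tokens that most favor
# the target class: the "evidence" a humanities user wants to see.
new = vec.transform(["her heart filled with sorrow at the report"])
log_odds = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
names = vec.get_feature_names_out()
for j in sorted(new.nonzero()[1], key=lambda j: log_odds[j], reverse=True):
    print(names[j], round(log_odds[j], 2))
```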


Combining result sets:

One of the major challenges for text-mining across collections will be the difficulty of getting accurate statistical information by aggregating results (rather than aggregating collections). We might be able to calculate raw frequency by aggregating results, but can we calculate more relative measures (for example, inverse document frequency) by aggregating scores across different collections? Who else is worrying about this?
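
For inverse document frequency, at least, there is a workable answer, provided the collections exchange counts rather than finished scores: document frequency and collection size are both additive across disjoint collections, so global IDF can be recomputed exactly. A sketch, with invented numbers:

```python
import math

# Each collection reports sufficient statistics, not scores:
# (number of documents, {term: number of documents containing term}).
collections = [
    (1200, {"whale": 40, "liberty": 300}),
    (800,  {"whale": 5,  "liberty": 90}),
]

N = sum(n for n, _ in collections)
df = {}
for _, counts in collections:
    for term, d in counts.items():
        df[term] = df.get(term, 0) + d

# Identical to computing IDF over the union of the collections;
# note that averaging each collection's own IDF values would not be.
idf = {term: math.log(N / d) for term, d in df.items()}
print(idf)
```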


IV. Shared Cyberinfrastructure:

Google et al.:

In fact, if anyone on the planet is tuned in to such problems with respect to the analysis of electronic text, it is Google, and we should consider them as allies in research. And even though, as Roy Tennant says, Google is more like Microsoft than they are like us, in fact Google researchers publish interesting information about their technology—perhaps not the very latest developments, but things that are in use and were developed in-house, like the Google File System3 and Bigtable,4 which might be useful in large or distributed digital libraries. We might also investigate MapReduce,5 Sawzall,6 and other interesting tools and methods coming out of Google to see if they could help to manage some of the problems of distributed or parallel analysis. We can see research and tools coming out of other text-collecting activities like the Internet Archive, the Open Content Alliance, and the Million Book Project as well, and all of this should serve as a reminder that there are certainly others working on problems that we share in humanities text-mining.
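
The core MapReduce idea is small enough to sketch: a map phase that emits key-value pairs from each document independently (and so can be parallelized across machines), and a reduce phase that combines pairs sharing a key. Here is the canonical word-count example, in plain Python rather than Google's C++ framework:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Runs independently per document, so it parallelizes trivially.
    return [(token, 1) for token in doc.split()]

def reduce_phase(pairs):
    # Combine all values emitted under the same key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

docs = ["call me ishmael", "call the king"]
print(reduce_phase(chain.from_iterable(map_phase(d) for d in docs)))
# {'call': 2, 'me': 1, 'ishmael': 1, 'the': 1, 'king': 1}
```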


Aquifer:

Closer to home, the Digital Library Federation's Aquifer7 initiative is about collection-sharing, and I would argue that it should be expanded to include shared collection processing—not just federated searching, which is what's now envisioned, but processing on sub-collections drawn from participating libraries. DLF is also a potentially important institutional framework within which to work out some of the social and technical issues that would need to be addressed in order for shared collection processing to become possible. Incidentally, in order to convince the DLF that it's worth working on this kind of problem, some indication of demand from users—that is, people like you—will be important.


NSF-CI:

The NSF Office of Cyberinfrastructure is supposed to be looking at the needs and opportunities for shared computational, software, and human infrastructure. Some of the needs that we have in the humanities (like the ones I have been discussing, I would suggest) are structurally and functionally like those that researchers in other disciplines, with other kinds of data, may encounter. We might have some needs (leveraging markup, or working with longer texts, for example) that are peculiar to the humanities, but in general I think we need what other research communities need, in terms of infrastructure: flexible, authenticated, and distributed access to trusted repositories of domain-specific data, the ability to parallelize processor-intensive jobs across computational resources, data curation for work-products, and so on.


One NSF-funded project that seems particularly relevant to some of the issues I've raised is the Pathways project,8 a conceptual framework for using digital content in the course of scholarship. Pathways aims to

develop broadly applicable models and protocols to support a loosely-coupled, highly distributed, interoperable scholarly communication system. A graph-based information model will provide a layer of abstraction over heterogeneous resources (data, content, and services). A service-oriented process model will enable the expression and invocation of multi-stage compositional, computational, and transformational information flows.


The actual engineering of the models and protocols that Pathways describes is, in turn, the sort of thing that is being looked at in GENI, the NSF's “Global Environment for Networking Innovations.” GENI is intended

to explore new networking capabilities that will advance science and stimulate innovation and economic growth. . . . Ultimately, the research enabled by GENI is expected to lead to a next generation of network capabilities and services. Specifically, the GENI initiative envisions the creation of new networking and distributed system architectures. . . .

GENI will be a kind of laboratory for modeling and testing these new architectures, and what I was earlier calling the “post-MONK” project, conceived in terms of Pathways, could, I think, provide an interesting test case for GENI experimentation, and might (in actual implementation) benefit from the virtualized computing that GENI wants to establish as an environment for experimentation.


Humanities Cyberinfrastructure:

In general, I think we need more projects, more conversations, and more community efforts that bring together domain experts, relevant technical experts, and collection experts, focused on some sort of processing of humanities data beyond searching and browsing. We also need more systematic studies of the uses to which scholars and scientists put information in digital form, studies of the tools they use and how they use them, of the tools they need and what they need them for. Some of those studies are being done, of course, by faculty in schools of library and information science and by faculty in libraries, but more collaboration between those communities and the NSF cyberinfrastructure community is really needed now, in order to make large—and distributed—collections work. Finally, we need some interesting and experimental tools to be deployed alongside existing library collections, in order to drive demand for more distributed, shared collection-processing. And when we reach the point of actually attempting text-mining across the Big Ten, we'll need collections that envision the possibility of being recontextualized, as well as a cyberinfrastructure that supports parallelized analysis, single sign-on authentication, and access to significant snippets of text even where the user has no direct rights to the whole. This sounds like a job for all the king's men (and women)—not just the engineers and the computer scientists, but librarians, humanists, and others.


Why?

At the beginning of this talk, I said I would talk a little bit about why someone would want to do text-mining across humanities digital libraries. I'd like to offer someone else's reason, rather than one of my own, and while I don't want to make Martin Mueller an accomplice in citing this, it does remind me a bit of his story about the importance of forgetting. The reason comes from Franco Moretti, and he calls it “distant reading”:

The United States is the country of close reading, so I don’t expect this idea to be particularly popular. But the trouble with close reading (in all of its incarnations, from the new criticism to deconstruction) is that it necessarily depends on an extremely small canon. This may have become an unconscious and invisible premiss by now, but it is an iron one nonetheless: you invest so much in individual texts only if you think that very few of them really matter. Otherwise, it doesn’t make sense. And if you want to look beyond the canon (and of course, world literature will do so: it would be absurd if it didn’t!) close reading will not do it. It’s not designed to do it, it’s designed to do the opposite. At bottom, it’s a theological exercise—very solemn treatment of very few texts taken very seriously—whereas what we really need is a little pact with the devil: we know how to read texts, now let’s learn how not to read them. Distant reading: where distance, let me repeat it, is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text: devices, themes, tropes—or genres and systems. And if, between the very small and the very large, the text itself disappears, well, it is one of those cases when one can justifiably say, Less is more. If we want to understand the system in its entirety, we must accept losing something. We always pay a price for theoretical knowledge: reality is infinitely rich; concepts are abstract, are poor. But it’s precisely this ‘poverty’ that makes it possible to handle them, and therefore to know. This is why less is actually more.9

With that on the table, I'll just close by inviting you all to attend Digital Humanities 2007, which will be held in early June at the University of Illinois in Urbana-Champaign. Moretti will be a keynote speaker there, and I'll be the local host. Submissions of abstracts and panel proposals are being accepted up until the 15th of this month, at digitalhumanities.org...


1 http://www.loc.gov/standards/mets/

2 http://shibboleth.internet2.edu/

3 http://labs.google.com/papers/gfs.html

4 http://labs.google.com/papers/bigtable.html

5 http://labs.google.com/papers/mapreduce.html

6 http://labs.google.com/papers/sawzall.html

7 “DLF Aquifer is a Digital Library Federation initiative. Our purpose is to promote effective use of distributed digital library content for teaching, learning, and research in the area of American culture and life.” http://www.diglib.org/aquifer/

8 http://www.infosci.cornell.edu/pathways/

9 http://www.newleftreview.net/?page=article&view=2094
