Computational Methods in Humanities Research

John Unsworth

Florence, Italy, December 2009

 

1. What are computational methods in the humanities? What's the difference between using a computer and using computational methods?

 

Because the computer is a general-purpose modeling machine, it tends to blur distinctions among the different activities it enables and the different functions it performs. Are we word-processing or doing email? Are we doing research or shopping? Are we entertaining ourselves or working? But even though to an observer all our activities might look the same, the goals, rhetoric, consequences, and benefits of the various things we do with computers are not the same. I would bet that everyone here uses a web browser, a word-processor, and email as basic tools in their professional life, and I expect that many of you are also in the humanities. Even so, you do not all do humanities computing – nor should you, for heaven's sake – any more than you should all be medievalists, or modernists, or linguists. However, if you are in any of these disciplines, one of the many things you can do with computers is to use computational methods, in which the computer is used as a tool for modeling and analyzing humanities data and our understanding of it. Today, I simply want to point out that such activity is entirely distinct from using the computer when it models the typewriter, or the telephone, or the movie theater, or any of the many other things it can model.

There are any number of tools for modeling and analysis, depending on the nature of the source material: XML is a way of modeling text; MPEG is a way of modeling audio and video; GIS is a way of modeling geographic data, with other kinds of information layered on top of it; and we have various ways of modeling other kinds of information. The point, in each case, is that there should be some way of validating the model—some way of determining whether it is internally consistent, and some other way of determining whether it corresponds accurately to important features of the thing it models, even though the selection of those features and the importance given them will, inevitably, reflect the subjective interests and purposes of the person doing the modeling. Still, a model is a form of knowledge representation, and knowledge is always situated—in a person, and with a purpose—so, beyond accurately expressing those features of the object on which all observers can agree, the measure of success is not objective accuracy, but rather expressive completeness.
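To make the idea of a checkable model concrete, here is a minimal sketch, in Python, of what validation can mean in practice. The encoding, the element names, and the adequacy test are my own inventions for illustration, not drawn from any particular project: one check (well-formedness) asks whether the model is internally consistent, while the other asks whether it captures a feature the modeler has decided is significant.

import xml.etree.ElementTree as ET

# A toy XML model of a short poem; the element names are hypothetical.
encoded_poem = """
<poem title="The Sick Rose">
  <stanza n="1">
    <l>O Rose thou art sick.</l>
    <l>The invisible worm,</l>
  </stanza>
</poem>
"""

# Internal consistency: the parser rejects markup that is not well-formed.
try:
    root = ET.fromstring(encoded_poem)
except ET.ParseError as err:
    raise SystemExit(f"model is not internally consistent: {err}")

# Expressive adequacy (a subjective test): did we encode the feature we
# decided mattered, namely that every verse line is marked with <l>?
lines = root.findall(".//l")
print(f"{len(lines)} verse lines encoded in '{root.get('title')}'")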

In addition to expressing the perspectives and purposes of the modeler, new perspectives on familiar materials can become available to others, as a result of the creation of digital primary resources. As an example here, I offer The William Blake Archive, which presents full-color images, newly transcribed texts, and editorial description and commentary on all of Blake's illuminated books, with non-illuminated materials (manuscript materials, individual plates and paintings, commercial engravings, etc.) now coming online. The Blake Archive makes it practical to teach Blake as a visual artist, by the simple fact of the economics of image reproduction on the web, and this is a fundamental change from the way I was taught Blake, through Erdman's text-only synthetic edition (which is also, by the way, available on the site).

There’s a deeper impact of digitization, though, beyond increased access: that deeper impact is realized by those who do the digitization, provided that they are subject-area experts who are aware of the complexity of the source materials. In the act of representation, seemingly simple questions, like "is this poem a separate work, or is it part of a larger set of poems?" can be unavoidable—requiring some decision at the level of markup, for example—and they can also raise issues that are critical to understanding the work in question. However we may decide such questions, we are both informed and constrained by our own decisions, when subsequent and related issues arise. Likewise, with images, when we digitize, we choose file-type, compression, color-correction, and other settings based on what we consider valuable and significant in the image—and when our chosen strategy is applied across a large body of images, or when others come to our digital surrogate with purposes we hadn't shared or predicted, we are bound to confront the fact that our surrogate has been shaped by the perspective from which it was produced. In this sense, the real value of digitization for humanities scholarship is that it externalizes what we think we know about the materials we work with, and in so doing, it shows us where we have overlooked, or misunderstood, or misrepresented significant features of those materials.

No better example of this struggle between materials and intentions could be found, I think, than the documentation on the “Editorial Commentary” pages of the British Library’s Nineteenth-Century Serials Edition project (http://www.ncse.ac.uk/commentary/index.html), which lay out the choice of materials, problems raised by multiple editions in serials, the construction of a “datamap” and a “concept map” for the materials, structural “segmentation policies,” and the metadata schema that evolved during the course of the project team's effort to analyze and represent its six 19th-century serials. I'll quote just briefly from a now-disappeared “work in progress” page that was once on the NCSE site (and is no longer even in the Internet Archive), for its description of developing the NCSE datamap, in order to explain what I mean by this deeper impact of digitization. The datamap is a map of "data fields" in which the content of the NCSE primary materials will be represented, and it maps the relationships between those fields. Once an initial sketch of the map was prepared, it was tested against the primary sources in "a page turning exercise in which the team assimilated new data fields occurring in the source materials into the map and also reconfigured the map as appropriate." The team that went through this exercise noted that "this work required interpretation at every stage, our abstract conceptualisation of the source materials becoming increasingly concretely represented in the map as it was developed." Even so, the data don't always obey the map:

The creation of the map has flagged up some potential challenges in the way in which our data might be rendered. As is evident from the map there are instances where relationships between its fields skip levels. (e.g. department items) and some items 'float' and can exist at almost any level (e.g. price). The dilemma facing ncse is thus whether to enforce an artificial framework upon the sources (top-down) or to attempt to adapt the framework to the sources (bottom-up).

 

For me, this is very reminiscent of the exercise of developing the original SGML Document Type Definition for the Rossetti Archive, in the course of which we went through an iterative process of modeling the components of Rossetti's paintings and poetry, an exercise that forced an explicit discussion of the nature of these materials, the relations between their parts, and the rules that could be deduced to govern the markup that would represent them. I guarantee that, in both of these cases, unless we had been digitizing the materials in question, and unless the scholar-expert had been party to that digitization, these discussions would never have taken place, and this explicit specification of the scholar's understanding of the materials would never have emerged. But these are the benefits of the early stages of digital humanities—the handmade phase, if you will, where the focus tends to be on scholarly editing as the analytic activity enabled by modeling the source material in digital form.

Beyond modeling, and beyond the hand-made phase of digitization, what does it mean to speak of computational methods? The word “method” implies a way of doing something; there should be something that can be computed on the basis of the representation, whether that’s a matter of information retrieval, algorithmic transformation, statistical profiling or comparison—essentially, I would say “computational methods” involve some kind of analysis, and that analysis produces some kind of (reproducible) results. Those results are not, themselves, the end of the story: in the humanities, empirical results are most likely to be the beginning of the story—the evidence for an argument, the occasion for an essay, which still needs to be argued and essayed, in the same way we’ve always done.

 

2. What are the conditions that call for computational methods?

 

In the handmade phase, we could choose to digitize, but we could also choose not to: scholarly editions, for example, can still be produced without digitizing the source materials. However, when we move from handcraft to industrial-scale digitization, we are required to consider computational methods in a different light. The primary condition that calls for computational methods is the availability of a large amount of data in digital form, with the possibility of reprocessing that data into other, purpose-built, representations. With respect to humanities research that focuses on text, we are certainly in that industrial phase: Google Books, as of October, had scanned about 10 million books. The HathiTrust, the shared digital repository that stores materials scanned out of the collections of some of the major research libraries in the U.S., had about 4.5 million volumes as of last month. Only some of this material is public domain, but the Google Books Settlement provides for the creation of at least two research centers that will provide access to the in-copyright material, for researchers in various disciplines who want to do “non-consumptive research” with it (where “non-consumptive” means, basically, that you’re not supposed to be taking material out of the research environment).

As Franco Moretti points out, in Graphs, Maps, Trees, humanities scholarship normally focuses on a “minimal fraction of the literary field”:

. . . a canon of two hundred novels, for instance, sounds very large for nineteenth-century Britain (and is much larger than the current one), but is still less than one per cent of the novels that were actually published: twenty thousand, thirty, more, no one really knows—and close reading won’t help here, a novel a day every day of the year would take a century or so... And it's not even a matter of time, but of method: a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn't a sum of individual cases: it's a collective system, that should be grasped as such, as a whole.

I think that what Moretti calls “the quantitative approach to literature” acquires a special importance when millions of books are equally at your fingertips, all eagerly responding to your Google Book Search: you can no longer as easily ignore the books you don't know, nor can you grasp the collective systems they make up without some new strategy—a strategy for using computational methods to grapple with profusion.

 

However, in order to exercise these strategies, in order to use computational methods, it is almost always necessary to be able to reprocess texts into new representations: transforming them, for example, into databases or indexes, or adding information about parts of speech, normalized spelling, and so on. Particular purposes require particular representations, and different data-types will offer different features for analysis, but the basic point is that in order to do more than search and browse, it is almost always going to be necessary to reprocess the data, and one would usually wish to begin that reprocessing with the richest form of the source material.
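As a small illustration of what such reprocessing might look like, here is a hedged sketch in Python (standard library only; the spelling table and the sample sentence are invented for the example) that normalizes historical spellings and builds a simple inverted index, the kind of purpose-built representation that searching or counting would then run against.

import re
from collections import defaultdict

# A tiny, invented normalization table; a real project would derive one
# from a historical-spelling resource rather than write it by hand.
NORMALIZED = {"hee": "he", "louyd": "loved", "hir": "her", "depely": "deeply"}

def tokens(text):
    """Split a text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(documents):
    """Map each normalized token to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for pos, tok in enumerate(tokens(text)):
            index[NORMALIZED.get(tok, tok)].append((doc_id, pos))
    return index

docs = {"sample": "Hee louyd hir depely, and he loved her deeply."}
index = build_index(docs)
print(index["loved"])   # both spellings now retrieve under one normalized form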

 

3. What is the potential of such methods? What kinds of research questions can be addressed computationally?

 

When working with texts, computational methods can help us answer questions having to do with any number of empirical features of those texts and their authors, including vocabulary, syntax, grammar, sound, structure, reference, location, genre, gender, metaphor, intertextuality, and many other things. For example, we might examine

·             historical trends in the use of language (for example, is there a golden age of the passive voice?)

·             distinctive patterns of language that are characteristic of an author, by comparison to other authors of the same period (for example, what are the words that Jane Austen avoids, by comparison to her peers?)

·             features that distinguish one genre from another, or one mode from another (for example, comedy vs. tragedy, or sentimentalism vs. realism)

·             features that distinguish male from female authors in the same period

·             the role of certain ur-texts, like the Bible, in shaping later texts

·             authorship attribution, for example in multi-authored works like encyclopedias

and so on.

 

Given the ability to reprocess texts, these questions can generally be answered at a level of specificity that would be impossible for a human reader to achieve, simply because the computer can keep track of empirical evidence at a very granular level. The role of the human interpreter is to understand and validate the methods by which the evidence is produced, and then to make sense of that evidence, in an argument.
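To take the second question in the list above as an example, here is a minimal, hypothetical sketch in Python of how one might surface words an author avoids: compare the relative frequency of each word in the author's corpus against a reference corpus and sort by the ratio. The two word lists are placeholders standing in for real corpora, and a serious study would prefer a more careful statistic (such as the log-likelihood measure mentioned below in connection with MONK).

from collections import Counter

def relative_freqs(tokens):
    """Word frequencies as a share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def underused(author_tokens, reference_tokens, top=10):
    """Words proportionally rarer in the author than in the reference corpus."""
    author = relative_freqs(author_tokens)
    reference = relative_freqs(reference_tokens)
    scores = {word: author.get(word, 0.0) / freq for word, freq in reference.items()}
    return sorted(scores, key=scores.get)[:top]

# Placeholder token lists stand in for, say, Austen versus her contemporaries.
austen = "the heart of the matter was civility and conduct".split()
peers = "the gothic castle and the storm and the ghost of the night".split()
print(underused(austen, peers))   # words used only by the reference corpus score lowest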

Similarly, when working with other kinds of raw material—music, images, maps, 3D models, etc.—whatever empirical features that material offers will be available to computational methods, and those methods will support whatever meaningful questions can be asked on the basis of such features. Taking the whole process full circle, one form of validation may eventually be production. In music composition, for example, computers have been able to learn algorithmic composition, using the features that characterize a particular composer well enough to produce new compositions that are plausible as works of that composer (see, for example, Computers and Musical Style by David Cope, professor emeritus of music at the University of California at Santa Cruz; or listen to http://bit.ly/4QARWn for a couple of samples of the work of his program, named Emily Howell). In music, this is a matter of getting the syntax right; in language, this form of validation will be more difficult, because of the semantic component, but one day, we may see new works of fiction in the manner of famous now-dead authors, produced by computers. This is really just the generative inverse of analysis.
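Cope's own system is far more sophisticated, but the underlying idea, learning the regularities of a body of work and then generating new sequences that obey them, can be suggested with a toy sketch. The Python fragment below is my illustration, not Cope's method: it learns note-to-note transitions from an invented melody and samples a new melody from them.

import random
from collections import defaultdict

def learn_transitions(melody):
    """Count which note tends to follow which in the training melody."""
    table = defaultdict(list)
    for current, following in zip(melody, melody[1:]):
        table[current].append(following)
    return table

def generate(table, start, length=8):
    """Random-walk a new melody through the learned transitions."""
    notes = [start]
    for _ in range(length - 1):
        options = table.get(notes[-1])
        if not options:          # dead end: stop early
            break
        notes.append(random.choice(options))
    return notes

training = ["C", "E", "G", "E", "C", "E", "G", "C"]   # invented training melody
print(generate(learn_transitions(training), start="C"))

Getting the syntax right, in the sense described above, is what such models attempt; the semantic dimension of language is what makes the analogous trick so much harder for fiction.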

 

 

4. What has been the impact of such methods? Have computational methods changed the way we study and teach the humanities?

 

Certainly, both the hand-made and the industrial phases of digitization have had profound impacts on how we study and teach, or if they haven’t, they should have. The profusion of texts makes it all the more important that we teach students to understand the importance of editions, and to distinguish between reliable and unreliable editions. The presence of true scholarly editions in electronic form makes it possible to provide both students and researchers with unprecedented depth of access to the process and variety of artistic production (think back to the Blake example, or look at some of the scholarly editions produced by the University of Virginia Press, such as Melville’s Typee with its manuscript and an analysis of the process of revision that led to the final text).

To take a different kind of example, there have been a number of interesting digital humanities projects based on correlating textual data with maps, for the purpose of analysis. I have first-hand experience with several of these, including the Valley of the Shadow project, which mapped military records and information from diaries and newspapers to produce interactive battle maps of some of the major campaigns in the American Civil War, and The Salem Witch Trials project, which mapped documentary records from the trials to produce an interactive record of the location and spread of witchcraft accusations in the Massachusetts Bay colony.

In the case of the Civil War, these maps, combined with other data, helped to produce insights about the daily lives of individuals during the war that no research to date has been able to match: read Ed Ayers’s book, In the Presence of Mine Enemies, to see exactly what kind of impact the coordination of very granular information from many different sources can have on the telling of history—this is a book that couldn’t have been written without digitized primary source material. Ben Ray, in the Salem Witch Trials project, was able to use the combination of maps and trial records to ascertain that a popularly held belief about the geographic concentration of accusers in one part of town and accused in the other was simply not true—and that, moreover, there were more accusations of witchcraft outside of Salem, in the larger colony, than inside it. These are results that derive more or less directly from the act of digitization, by domain experts.

In my more recent experience, I’ve been working to develop tools that leverage digitized representation for the purpose of machine-aided analysis—in this case, text-mining. Over the last four years, I have worked with faculty, students, and computer experts at half a dozen different institutions in the United States and Canada, and at the National Center for Supercomputing Applications, to develop MONK, a workbench for text-mining across literary collections.

The full release of The MONK Project, available by authentication to about 50,000 faculty and 400,000 students at a dozen universities in the Midwest, includes about a thousand works of British literature from the 16th through the 19th century, provided by The Text Creation Partnership (EEBO and ECCO) and ProQuest (Chadwyck-Healey Nineteenth-Century Fiction), along with Martin Mueller's edition of Shakespeare (thirty-seven plays and five works of poetry), plus over five hundred works of American literature from the 18th and 19th centuries, provided by libraries at Indiana University, the University of North Carolina at Chapel Hill, and the University of Virginia.

MONK stands for Metadata Offer New Knowledge, and the metadata MONK provides is at the word level (part of speech, lemmata, position in the text, n-grams, etc.) for each of the 150 million words in this corpus. Behind the workbench interface, MONK's quantitative analytics (naive Bayesian analysis, support vector machines, Dunning's log-likelihood, and raw frequency comparisons) are run through a toolkit developed at NCSA, called SEASR. Users typically start a project with one of the toolsets that has been predefined by the MONK team. Each toolset is made up of individual tools (e.g. a search tool, a browsing tool, a rating tool, and a visualization), and these tools are applied to worksets of texts selected by the user from the MONK datastore. Worksets and results can be saved for later use or modification, and results can be exported in some standard formats (e.g., CSV files).
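Of the analytics just listed, Dunning's log-likelihood is the easiest to show in miniature. The sketch below is my own restatement of the standard formula in Python, not MONK's actual code (which ran through SEASR); it scores how surprisingly often a single word occurs in one corpus relative to another, given the sizes of both.

import math

def dunning_g2(count_a, total_a, count_b, total_b):
    """Dunning's log-likelihood (G2) for one word's counts in corpora A and B."""
    expected_a = total_a * (count_a + count_b) / (total_a + total_b)
    expected_b = total_b * (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    if count_a:
        g2 += count_a * math.log(count_a / expected_a)
    if count_b:
        g2 += count_b * math.log(count_b / expected_b)
    return 2.0 * g2

# Invented counts: a word appearing 120 times in a million words of corpus A
# and 40 times in a million words of corpus B.
print(round(dunning_g2(120, 1_000_000, 40, 1_000_000), 2))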

In the process of designing MONK, we worked with humanities doctoral students and junior faculty who had specific research questions they wanted to answer, using these tools. For example, Sarah Steger was interested in sentimentalism in British fiction, and specifically, in what distinguished sentimental from non-sentimental fiction, at the level of vocabulary—and, by extension, at the level of subject matter. She started by running naïve Bayes routines on a training set of 409 mid-Victorian novels that she classified as either sentimental or unsentimental. The larger testbed, to which the software applied Sarah’s training data, was 3,921 novels; ultimately, the software returned 1,348 chapters as sentimental, along with detailed information about the language use that was characteristic of the sentimental chapters, and that distinguished them from non-sentimental chapters. She was able to get fairly definitive results on the words that separate sentimental from unsentimental fiction, as well as learning that Dickens seems to be the archetype of sentimentality in this period of British fiction.
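The workflow Steger followed can be suggested with a hedged sketch. The code below is not MONK (whose classifiers ran server-side through SEASR); it is a generic scikit-learn approximation of the same steps in Python: train a naïve Bayes classifier on labeled chapters, apply it to unlabeled ones, and inspect the vocabulary that pushes a chapter toward the sentimental class. The example chapters and labels are invented placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training chapters, labeled 1 (sentimental) or 0 (unsentimental).
train_texts = [
    "tears fell upon the orphan's pale and trembling hand",
    "the ledger recorded the firm's quarterly accounts in full",
    "her gentle heart broke as the child whispered farewell",
    "the committee adjourned after reviewing the railway schedule",
]
train_labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_texts), train_labels)

# Classify an unlabeled chapter from the larger testbed.
unseen = ["the dying mother pressed a tearful kiss upon the orphan"]
print(classifier.predict(vectorizer.transform(unseen)))   # -> [1]

# Which words weigh most toward the sentimental class in this toy model?
vocabulary = vectorizer.get_feature_names_out()
weights = classifier.feature_log_prob_[1] - classifier.feature_log_prob_[0]
print(sorted(zip(weights, vocabulary), reverse=True)[:5])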

 

5. What are the limitations of such methods? What research questions cannot be addressed computationally?

 

In general, research questions that are wholly intuitive in nature, or that do not make use of empirical evidence in source material, will not lend themselves to computational methods. Aesthetic appreciations, likewise, don’t benefit much from these methods. Arguments that depend on the performance of the critic may be assisted by evidence of the sort that these methods can provide, but then again they may not be.

Reflecting on our experience in the MONK project, where we based our analysis on meticulously prepared texts with in-depth linguistic information, my colleague and co-investigator Martin Mueller gave the following, fairly exhaustive, account of the limitations of our methods. He said,

“The computer has no understanding of what a word is, but it follows instructions to 'count as' a word any string of alphanumerical characters that is not interrupted by non-alphabetical characters, notably blank space, but also punctuation marks, and some other symbols. 'Tokenization' is the name for the fundamental procedure in which the text is reduced to an inventory of its 'tokens' or character strings that count as words. This is an extraordinarily reductive procedure. It is very important to have a grasp of just how reductive it is in order to understand what kinds of inquiry are disabled and enabled by it. A word token is the spelling or surface form of a word. MONK performs a variety of operations that supply each token with additional 'metadata'. Take something like 'hee louyd hir depely'. This comes to exist in the MONK textbase as something like

hee_pns31_he louyd_vvd_love hir_pno31_she depely_av-j_deep

Because the textbase 'knows' that the surface 'louyd' is the past tense of the verb 'love', the individual token can be seen as an instance of several types: the spelling, the part of speech, and the lemma or dictionary entry form of a word.”
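A hedged sketch may help make that reduction, and the enrichment layered on top of it, concrete. The lookup table below simply hard-codes the information in Mueller's example; the real MONK pipeline assigned part-of-speech tags and lemmas with trained tools rather than a hand-made dictionary, so this is only meant to show the shape of the data: each token mapped to a spelling, a part of speech, and a lemma.

import re

# Hand-coded metadata for the four tokens in Mueller's example; the tags
# (pns31, vvd, etc.) are copied from the passage above, not computed here.
WORD_DATA = {
    "hee":    ("pns31", "he"),
    "louyd":  ("vvd",   "love"),
    "hir":    ("pno31", "she"),
    "depely": ("av-j",  "deep"),
}

def adorn(text):
    """Reduce a text to tokens, then attach (spelling, pos, lemma) to each."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [(tok, *WORD_DATA.get(tok.lower(), ("?", tok.lower()))) for tok in tokens]

for spelling, pos, lemma in adorn("hee louyd hir depely"):
    print(f"{spelling}_{pos}_{lemma}")   # hee_pns31_he, louyd_vvd_love, ... (one per line)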

 

Conclusion:

 

What’s really changed? Well, perhaps nothing, for humanities scholarship that isn’t primarily interested in modeling its source material in order to understand its structure or ontology, or scholarship that isn’t especially interested in the evidence that source material offers for empirical arguments. But if your scholarship depends, to some extent at least, on empirical evidence, or if you are interested in the features that the computer can “understand,” or if you are interested in correlating different kinds of evidence along some shared dimension, then computational methods could change your work entirely, could lead to new answers to old questions, or, even better, to altogether new questions. And as we approach, in the next decade, a time when all the books (not archives, but books) in research libraries are digitized, it may become harder to ignore the capabilities that computational methods offer to the scholar and the teacher. This is, in fact, how change has come to other disciplines—on the heels of a transformation of the bulk of their data from analog to digital. Our day is coming soon—in fact, it’s already upon us, so it’s time to begin thinking about how to cope with it.