How Not to Read a Million Books

by Tanya Clement, Sara Steger, John Unsworth, Kirsten Uszkalo

October, 2008


[Figure 1] First of all, where does the trope of “a million books” come from? It originates, as far as I know, with the Universal Library and its Million Books Project, which began in 2001. The Universal Library is directed by Raj Reddy, professor and former Dean of Computer Science at Carnegie Mellon University; the million books project (funded by NSF and others) was a kind of very large pilot, aimed at digitizing a million books (“less than 1% of all books in all languages ever published”1), beginning with partners in India and later expanding to China and Egypt. The “million book” goal was accomplished in 2007, by which time it had been eclipsed by some large commercial projects, including most notably Google Print (now known as Google Book Search), which had begun in secret in 2002 and was unveiled at the Frankfurt Book Fair in October 2004, and which had Harvard's library as one of its initial partners. Google Books aims to scan as many as 30 million books, a number equal to all the titles in WorldCat, and for all we know, they are already about halfway there.2 Libraries and others have been digitizing books for years, but these massive digitization projects really changed the landscape, and they raised the question “What do you do with a million books?”—a question first asked, I think, by Greg Crane, in D-Lib Magazine, in March of 2006.3 My answer to that question is that whatever you do, you don't read them, because you can't.

[Figure 2] As Franco Moretti points out, in Graphs, Maps, Trees, we focus on a “minimal fraction of the literary field”:

. . . a canon of two hundred novels, for instance, sounds very large for nineteenth-century Britain (and is much larger than the current one), but is still less than one per cent of the novels that were actually published: twenty thousand, thirty, more, no one really knows—and close reading won’t help here, a novel a day every day of the year would take a century or so... And it's not even a matter of time, but of method: a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn't a sum of individual cases: it's a collective system, that should be grasped as such, as a whole.4

I think that what Moretti calls “the quantitative approach to literature” acquires a special importance when millions of books are equally at your fingertips, all eagerly responding to your Google Book Search: you can no longer as easily ignore the books you don't know, nor can you grasp the collective systems they make up without some new strategy—a strategy for not reading.

[Figure 3] Martin Mueller is my collaborator and co-PI on the MONK project, and professor of classics and English at Northwestern University. Martin is fond of citing this poem about not reading, called “The Spectacles”:

Korf reads avidly and fast.
Therefore he detests the vast
bombast of the repetitious,
twelvefold needless, injudicious.

Most affairs are settled straight
just in seven words or eight;
in as many tapeworm phrases
one can prattle on like blazes.

Hence he lets his mind invent
a corrective instrument:
Spectacles whose focal strength
shortens texts of any length.

Thus, a poem such as this,
so beglassed one would just -- miss.
Thirty-three of them will spark
nothing but a question mark.5

Korf is the kind of reader for whom some text-mining tools are intended: I'm sure he would approve of text-summarization technology, for example—the sort of thing that tells you what a newspaper article is about, so you don't have to go through the tiresome and ink-stained exercise of actually reading it.

[Figure 4] What we're trying to do, by contrast, in the Mellon-funded MONK project, is to use text-mining techniques as a provocation for reading, but also to cast the net for that provocation much more broadly than one could do without computers. In other words, although we expect that our users may end up reading, even reading closely, we begin by not reading. Sometimes we don't read single unreadable texts, sometimes we don't read the collected works of an author, sometimes we don't read all of the books in the MONK datastore—but we never don't not read.

[Figure 5] That MONK datastore currently includes a small fraction of the literary field that Moretti describes, but still, it includes enough to be interesting. It has the complete text of approximately 1,200 works, including 300 American novels published between 1851 and 1875, 250 British novels published between 1780 and 1900, 300 plays by Shakespeare and his contemporaries, 30 works of 16th- and 17th-century poetry, and 300 works of 16th- and 17th-century prose, including fiction, sermons, travel literature, and witchcraft texts. 250 of these works, by 104 authors, come from Chadwyck-Healey's Nineteenth-Century Fiction collection; 658 works by 366 authors come from the Text Creation Partnership at the University of Michigan, with an emphasis on Early English Books Online; 244 works by 172 authors come from The Wright American Fiction collection at Indiana University. Taken together, these 1,152 works contain about 81.5 million words, and they represent a reasonable sample of printed literature in English from the 16th, 17th, 18th, and 19th centuries. There's also a single 20th-century text that we've been working with—Gertrude Stein's The Making of Americans—about which more in a moment. And we've just learned that we'll be able to include a substantial subset of the Early American Fiction collection from the University of Virginia: that set of texts, along with the Wright American Fiction collection, makes up a very solid representation of American fiction before 1875, and one that we will be able to make publicly available for text-mining in just a few months, when the MONK project ends.

[Figure 6] In the process of assembling these texts into the MONK datastore, each text is transformed (using routines written by Brian Pytlik Zillig at the University of Nebraska) from its native markup to markup that follows an XML schema designed (again, at Nebraska) for analytic purposes. Each text is then analyzed by software (written by Phil Burns at Northwestern University) that identifies word boundaries, sentence boundaries, standard spellings, parts of speech, and lemmata. This software, called MorphAdorner, differs from other NLP toolkits in having special features for dealing with the orthographic and morphological variation of dialectal or Early Modern English texts. Finally, the texts are ingested into a database (also designed at Northwestern, by John Norstad) that provides Java access methods for extracting data for many purposes, including searching for objects; direct presentation in end-user applications as tables, lists, concordances, or visualizations; getting feature counts and frequencies for analysis by data-mining and other analytic procedures; and getting tokenized streams of text for analysis of collocation and repetition, and other pattern-matching operations.
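
To give a concrete, if much simplified, sense of what this kind of adornment produces, here is a minimal Python sketch using NLTK as a stand-in for MorphAdorner; the sample sentence and the crude part-of-speech mapping are mine, and NLTK has none of MorphAdorner's special handling of Early Modern spelling:

    # A sketch only -- not the MONK pipeline, which is Java-based and uses MorphAdorner.
    # It illustrates the adornment described above: sentence and word boundaries,
    # parts of speech, and lemmata, using NLTK as a stand-in.
    import nltk
    from nltk.stem import WordNetLemmatizer

    # one-time downloads: punkt, averaged_perceptron_tagger, wordnet
    text = "Korf reads avidly and fast. Most affairs are settled straight."

    lemmatizer = WordNetLemmatizer()
    for sentence in nltk.sent_tokenize(text):       # sentence boundaries
        tokens = nltk.word_tokenize(sentence)       # word boundaries
        for word, pos in nltk.pos_tag(tokens):      # parts of speech
            # crude mapping from Penn Treebank tags to WordNet's categories
            wn_pos = {"V": "v", "J": "a", "R": "r"}.get(pos[0], "n")
            print(word, pos, lemmatizer.lemmatize(word.lower(), wn_pos))

Each printed row pairs a word with a part of speech and a lemma: roughly the kinds of per-token information the paragraph above describes.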

[Figure 7] MONK's analytic routines include supervised learning methods for text-classification, like Naive Bayesian analysis and support vector machines (the sorts of things that the spam filter in your email client uses, with a little training from you, to decide whether to put something in your junk mailbox), as well as tools for unsupervised text-classification (for fully automated clustering of texts), and tools for evaluating probability (for example, measuring word frequencies in a single work vs. in other works of the same period). These analytics are run using something called the Software Environment for the Advancement of Scholarly Research (SEASR), developed by Michael Welge's Automated Learning Group at the National Center for Supercomputing Applications, with help from Amit Kumar, at the University of Illinois.
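
By way of illustration, here is a hedged sketch of supervised text-classification of the sort just described, a naive Bayesian model trained on labeled examples and then asked about new passages. It uses scikit-learn rather than SEASR, and the training snippets and labels are invented:

    # A toy example of Naive Bayesian text-classification, analogous in spirit to a
    # spam filter or to MONK's supervised routines, but using scikit-learn as a
    # stand-in for SEASR. The snippets and labels below are invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "her tears fell upon the child's pale cheek",             # sentimental
        "the committee adjourned after reviewing the accounts",   # unsentimental
        "he pressed her hand and wept at the bedside",            # sentimental
        "the survey of the northern counties was complete",       # unsentimental
    ]
    train_labels = ["sentimental", "unsentimental", "sentimental", "unsentimental"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    # classify an unseen passage, and inspect the model's confidence
    print(model.predict(["she kissed the dying child farewell"]))
    print(model.predict_proba(["the ledger was balanced at noon"]))

The workflow is the point: label a training set, fit a model, then ask for "more like these"; the arithmetic behind the labels is the same whether the classifier is sorting junk mail or sentimental chapters.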

[Figure 8] MONK's user interface combines these texts and tools to enable literary research through the discovery, exploration, and visualization of patterns—which may be patterns in a single work, in a subset of the MONK collection, or across everything we have. The interface for this is being developed by colleagues at the University of Alberta (Matt Bouchard and others, under the supervision of Stan Ruecker), McMaster University (Andrew MacDonald and others, under the supervision of Stefan Sinclair), the University of Maryland (Anthony Don and others, under the supervision of Catherine Plaisant), and the University of Illinois (Amit Kumar and Duane Searsmith). It is designed as a kind of workbench, where users can assemble collections, choose tools, create worksets, save results of analysis, and submit those results to various kinds of visualization, or export them for use in other systems; I should point out that some of the things I will show you are experiments done along the way that have not been fully incorporated in the workbench. In general, though, the MONK user interface is a browser-based client: it talks to the datastore and to SEASR over the Web, and these transactions are mediated by middleware developed at Illinois—an intermediate layer of software that turns input from the client into queries for the datastore, ships results from the datastore off to SEASR's analytics engine, and takes results from SEASR and sends them back to the client, managing queuing and communication all the while.

Without a doubt, though, the most important component of the MONK project is the scholarly user. These patient and persistent people provide real-world requirements for the tools that we are trying to build, and they are involved in every step of the process, from the initial design onward. I'll present here several use cases—two pursued by graduate students in English (Sara Steger at the University of Georgia and Tanya Clement at the University of Maryland), one by a junior faculty member in English at Simon Fraser University, and two by a senior faculty member, Martin Mueller, in English and Classics at Northwestern University. These experiments will be presented in order of increasing breadth but decreasing depth, from an examination of the structure of a single work (Gertrude Stein's The Making of Americans), to tracing the emergence of archetypes (the gentleman devil; the haggard witch) across dozens of texts, to understanding the characteristics of a literary movement (sentimentalism) by examining hundreds of exemplars, to identifying the sources of a multi-volume, multi-author encyclopedia.

The Making of Americans

[Figure 9] My first example, then, is Tanya Clement's work on Gertrude Stein's The Making of Americans: this work is part of her dissertation in the English Department at the University of Maryland, and it will also appear in the next issue of Literary and Linguistic Computing, published by Oxford University Press. The Making of Americans, Tanya writes,

was criticized by [those] like Malcolm Cowley who said Stein's “experiments in grammar” made this novel “one of the hardest books to read from beginning to end that has ever been published.”6 More recent scholars have attempted to aid its interpretation by charting the correspondence between structures of repetition and the novel's discussion of identity and representation. Yet, the use of repetition in Making is far more complicated than manual practices or traditional word-analysis programs (such as those that make concordances or measure word-frequency occurrence) could indicate. The highly repetitive nature of the text, comprising almost 900 pages and 3174 paragraphs with only approximately 5,000 unique words,7 makes keeping track of lists of repetitive elements unmanageable and ultimately incomprehensible.

[Figure 10] Thinking that she might get a bird's-eye view of repetition in the novel by using text-mining tools, Tanya applied “a frequent pattern analysis algorithm”8 that looked at every sequence of three words in the book, and at patterns across those trigrams, but “executing the algorithm on Making generated thousands of patterns since each slight variation in a repetition generated a new pattern.”9 Working with the MONK partners at the University of Maryland's Human Computer Interaction Lab, Tanya's use-case helped to drive the development of a new tool called FeatureLens. FeatureLens, in Tanya's words, highlights

[Figure 11] trends such as co-occurring patterns that tend to increase or decrease in frequency 'suddenly' across a distribution of data points (area “A” in Figure 1).10 It allows the user to choose particular patterns for comparison (area “B”), charts those patterns across the text’s nine chapters at the chapter level and at the paragraph11 level (area “C”), and facilitates finding these patterns in context (area “D”). Ultimately, while text mining allowed me to use statistical methods to chart repetition across thousands of paragraphs, FeatureLens facilitated my ability to read the results by allowing me to sort those results in different ways and view them within the context of the text. As a result, by visualizing clustered patterns across the text’s 900 pages of repetitions, I discovered two sections that share, verbatim, 495 words and form a bridge over [Figure 12] the center of the text. This discovery provides a new key for reading the text as a circular text with two corresponding halves, which substantiates and extends the critical perspective that Making is neither inchoate nor chaotic, but a highly systematic and controlled text. This perspective will change how scholars read and teach The Making of Americans.
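
To make the trigram idea concrete, here is a minimal sketch of how repeated three-word sequences can be counted; it is neither the CLOSET algorithm nor FeatureLens, and the filename is hypothetical:

    # A naive count of repeated trigrams (three-word sequences) in a text -- an
    # illustration of the raw material behind the pattern analysis described above,
    # not the CLOSET algorithm or FeatureLens. The filename is hypothetical.
    import re
    from collections import Counter

    text = open("making_of_americans.txt", encoding="utf-8").read()
    words = re.findall(r"[a-z']+", text.lower())

    trigrams = Counter(zip(words, words[1:], words[2:]))

    # the most heavily repeated three-word sequences
    for trigram, count in trigrams.most_common(20):
        if count > 1:
            print(count, " ".join(trigram))

Even this naive counting shows why the raw results overwhelm: every slight variation in a repetition produces a new trigram, which is exactly the problem FeatureLens was built to make readable.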

Working with the Devil

Kirsten Uszkalo is a junior faculty member in English at Simon Fraser University, and she is, as she says, “working with the Devil right now—trying to figure out when the gentleman devil showed up in English witchcraft tracts” and also investigating the literary emergence of his diabolical creature, the witch. Kirsten says,

[Figure 13] I began the first trial run of the witchcraft sample set, starting with a small sample set -- those of the anonymous witchcraft tracts which we had morphadorned. This was actually a pretty useful way to begin to look at what witchcraft meant in early modern England, because the tracts themselves not only tell the story of the social, literary, and legal representation of witchcraft in small text chunks, but also provide a useful span across the chronology of witchcraft texts. I set down to write a kind of classification system, based on the schema I had developed in “hard coding” the paper versions of the texts, with flags and highlighters. One of the most fascinating results of this pass through was a result I had not intended to find, but one which spoke volumes about Richard Head’s construction of Mother Shipton’s diabolical upbringing. MONK returned a sample I had not realized was there and may have been a kind of source text, or framework, on which Head drew for an expanded version of Shipton’s life. Although I was looking for sex as a signifier of diabolism, I found that a number of the texts I was getting returned spoke to the idea of the prodigious birth as a sign of the Devil’s presence. This makes all sorts of sense; witchcraft, monster babies, comets, and two-headed chickens all belong to the same kind of “enquirer” genre in early modern England. So although Head’s version of Shipton’s life has not been digitized, the computer returned a result which shares eerie similarities because it is part of the same genre. . . MONK showed what it should have shown; it returned findings like those I told it to find. However, beyond showing what couldn’t be seen with the naked eye, the computer suggested connections I hadn’t thought of. It ranked texts based on similarities I hadn’t “ranked” as there. In searching the computer results, I was able to see sub-groupings of texts in the displayed results. In that way, it did not just tell me what was similar, but presented texts which illustrated similarities I hadn’t thought of. I’ve had to rethink essential ideas like the ways witches are linked, the way sex functions, and the role of the monstrous in early modern witchcraft.

Sentimentalism

Sara Steger, a graduate student in English at the University of Georgia, has (like Tanya Clement) been using MONK as a research tool for her dissertation, which is on sentimentalism in British Literature. Sara describes the questions she's examining with MONK as follows:

Can you train the computer to recognize sentimentality and return "more like these"? I have results from running naïve Bayes routines on a training set of 409 mid-Victorian novels that I classified as sentimental and unsentimental. The testbed was 3,921 novels and the system returned 1,348 chapters as sentimental. I'm going through these results and am talking with [experienced data miners] about ways to assess the success of this experiment. It seems that the system is really good at recognizing sentimentality in Dickens.

[Figure 14] Sara used Dunning's log likelihood to compare various aspects of sentimental novels with the rest of her testbed of texts. Martin Mueller was the person who first called our attention in MONK to the simplicity and power of Dunning's log likelihood as a technique for understanding the differences that make a difference. Martin learned about it from Paul Rayson (Lancaster University), who has used it in his Wmatrix program. Martin notes that

Dunning's log likelihood ratio is a statistic that does more or less the same thing as a chi-square, but it is supposed to be more suitable for textual data. It supports a 'figure and ground' operation where you choose one set of texts or words, called the Analysis Corpus, compare it with another set of texts or words, the Reference Corpus, and find words that are, in comparison with the Reference Corpus, disproportionately common or rare in the Analysis Corpus. The resultant log likelihood ratio maps to a probability table that you interpret in the same way as a chi-square statistic. For the analysis of lexical differences, Dunning's log likelihood ratio is a powerful tool whose results can be easily interpreted by users who do not understand the details of the underlying math. And the elegant word-cloud program Wordle, from IBM's ManyEyes project, turns out to be excellent for visualizing Dunning results.
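
For those who want to see the arithmetic, here is a small sketch of Dunning's log likelihood for a single word, computed over the two-by-two table of the word versus everything else in the Analysis and Reference Corpora; the counts in the example are invented:

    # Dunning's log likelihood (G-squared) for one word, from a 2x2 contingency
    # table: the word vs. all other words, in the Analysis vs. the Reference Corpus.
    import math

    def dunning_g2(word_a, total_a, word_r, total_r):
        """2 * sum of observed * ln(observed / expected) over the four cells."""
        observed = [word_a, total_a - word_a, word_r, total_r - word_r]
        word_share = (word_a + word_r) / (total_a + total_r)
        expected = [
            total_a * word_share, total_a * (1 - word_share),
            total_r * word_share, total_r * (1 - word_share),
        ]
        return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

    # e.g. a word that appears 900 times in 2,000,000 words of sentimental chapters
    # but only 400 times in 4,000,000 words of everything else (invented counts)
    print(dunning_g2(900, 2_000_000, 400, 4_000_000))

A large value means the word is disproportionately common (or rare) in the Analysis Corpus; as Martin says, the result can then be read against a probability table in the same way as a chi-square statistic.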

[Figure 15] Just to show a couple of Martin's initial examples here, he produced a visualization of the “words Jane Austen Avoids” by comparison to other novelists of her era . . . and he produced these visualizations of [Figure 16] the vocabulary characteristic of male authors, [Figure 17] and of female authors, in the same period.

In using these same tools and techniques to explore sentimentalism in British fiction, Sara Steger was particularly interested in that most sentimental of situations, the deathbed scene. Sara says,

[Figure 18] I've run both a comparison of my training set of sentimental vs. unsentimental and a comparison using the machine-classified texts. Not surprisingly, "mother," "child," "heart," and "love" stand out as markers of the sentimental in both lists. . . . [Figure 19] The words that are over-represented in deathbed scenes set the scene, hinting at descriptions of the bed, the room, the pillow, the chamber, and even the hospital. Words corresponding to illness are also prominent, including “fever,” “sick,” “nurse,” “doctor,” and “sick-room.” Moreover, that word cloud is a reminder of how death is a domestic affair. Not only is there an emphasis on that most domestic of spaces, the bedroom, but the vocabulary emphasizes intimate relationships—“mamma,” “papa,” “darling,” and “child.” This latter word also crosses over into demonstrating a concern with innocence and diminutiveness, especially when read alongside the related “baby” and “little.” Moreover, the visualization reflects the thematic importance of last words and touches—the “lips” that “speak,” “whisper,” or “kiss,” the “last” “farewell,” the final “breath.” . . . . While a close reader may be able to get a sense of which words are used more often in sentimental scenes, the algorithm enabled me to discover information that a scholar would never be able to obtain without these technologies: that which is absent. What the word cloud does not include is almost as informative as what it does. Given the prominence of mourning in Victorian culture, there is almost no trace of the formal trappings of mourning in this snapshot of deathbed scenes. While the words “coffin,” “archdeacon,” and “grave” appear, the visualization shows that the topos is much more concerned with describing the death than with detailing the mourning. A description of the moment is sufficient to convey the “good death” of the character; the burial and the mourning – the public moments in the church and graveyard – are largely absent.

[Figure 20] This makes the list and visualization of the words that are under-represented in deathbed scenes even more striking. One of the most under-represented words is “holy,” and it is followed by “church,” “saint,” “faith,” “believe” and “truth.” It seems the Victorian deathbed scene is more concerned with relationships, marked by words such as “forgiveness,” “mercy,” “forgive,” and “comfort” than with personal convictions and declarations of faith. . . . Words that have to do with business and class (“money,” “power,” “business,” “lord,” and “gentleman”) don't belong at the deathbed. . . . Tellingly, words of uncertainty also appear prominent in the visualization of under-used words in deathbed scenes, including “suppose,” “perhaps,” and “doubt.” The deathbed scene, by nature a scene of resolution, leaves no room for incertitude. Altogether, the words that are not used, the “negatives,” serve as a sort of shadow to the “positives,” giving dimension to the themes and patterns that stood out in the first visualization.

Literary DNA

[Figure 21] Martin Mueller has provided some of the most imaginative use-cases for MONK, and I recommend to your attention his writings on the MONK wiki—they are a pleasure and an education to read. Martin is one of those people who is open to ideas from all quarters, and one of his sources of inspiration is his daughter Rachel, a biologist who works on sequencing DNA. They often talk about the resemblances between tracing DNA sequences across genomes and tracing verbal borrowings across literary corpora. He became very interested in the use of sequence alignment techniques by Mark Olsen and his team at Philologic, who have been tracking the sources of Diderot's Encyclopédie.12 This work is very much along the lines of Tanya Clement's analysis of repetition in The Making of Americans. In the Philologic application of sequence analysis, Martin writes,

the fragments are fragments of text considered as overlapping n-grams [an n-gram is just a series of n words]. These overlapping 'shingles' of text are then examined for patterns of repetition. A variant and simpler version of this technology was used by Brian Vickers in a recent TLS essay, where he argued for Kyd's authorship of several Elizabethan plays, including the old Lear Play, on the basis of phrasal echoes or shingles (discovered by submitting King Lear and three of the plays of Thomas Kyd to some open source plagiarism detection software).13 We will work with the Philologic folks to track shingle patterns across the entire corpus of Early Modern drama.
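
Here is a minimal sketch of shingling itself: break two texts into overlapping word n-grams and look for the shingles they share. The filenames and the choice of five-word shingles are illustrative, and this is a far cruder procedure than the sequence alignment the Philologic group uses:

    # Overlapping word n-grams ("shingles") shared between two texts -- the basic
    # move behind the sequence-alignment and plagiarism-detection work described
    # above. The filenames and n=5 are illustrative.
    import re

    def shingles(text, n=5):
        words = re.findall(r"[a-z']+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    text_one = open("king_lear.txt", encoding="utf-8").read()
    text_two = open("spanish_tragedy.txt", encoding="utf-8").read()

    shared = shingles(text_one) & shingles(text_two)
    for shingle in sorted(shared):
        print(" ".join(shingle))
    print(len(shared), "shared five-word shingles")

Phrasal echoes surface as shared shingles; deciding what they mean (borrowing, a common source, or a shared genre) remains the reader's job.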

Meaning and Mining

I began with a skeptical verse, and I want to end with a skeptical essay, called “Meaning and Mining: the Impact of Implicit Assumptions in Data Mining for the Humanities,” by D. Sculley and Bradley M. Pasanek. This essay is forthcoming in the same issue of Literary and Linguistic Computing that will include Tanya Clement's essay on Stein, and it begins with this headnote, from Hans-Georg Gadamer's Truth and Method:

[Figure 22] A person who is trying to understand a text is always projecting. He projects a meaning for the text as a whole as some initial meaning emerges in the text. Again, the initial meaning only emerges because he is reading the text with particular expectations in regard to a certain meaning. Working out this fore-projection, which is constantly revised in terms of what emerges as he penetrates into the meaning, is understanding what is there.

Pasanek and Sculley argue that text-mining is of doubtful value for literary studies, because readers—especially professional readers—can make meaning out of almost anything: “just because results are statistically valid and humanly interpretable does not guarantee that they are meaningful.” Specifically, they point out that there are some key practices and assumptions in data-mining and machine learning that literary users of these tools may not fully appreciate or understand, and that without an understanding of fundamental principles, it is easy to misinterpret or overstate the significance of statistical results:

The temptation in applying machine learning methods to humanities data is to interpret a computed result as some form of proof or determinate answer. In this case, the validity of the evidence lies inherent in the technology. This can be problematic when the methods are treated as a black box, a critic ex machina.

[Figure 23] After reviewing some of those fundamental principles, they lay out a case study in which they applied text-mining techniques to a database of metaphors culled from 18th-century British political writing, in order to demonstrate or disprove George Lakoff's hypothesis “that political debates are contests between root conceptual metaphors (Lakoff, 2002). Party affiliation is rooted in metaphorically structured mental models (in 'pictures' not 'propositions').” They apply Support Vector Machines (one of the tools we use in MONK) to sort the metaphors into clusters, and then ask whether the resulting clusters support Lakoff's hypothesis. I won't go into the details of this part of the argument here, though if you look at the figure in the bottom right-hand cell in each of these two tables, and consider that a random distribution in this case would be about .33, you can get a sense of how much and then how little Lakoff's hypothesis seems to be supported by the data, as Pasanek and Sculley alternately “prove” and “disprove” it several times over, using different representations of the data, different forms of cross-validation, and different clustering techniques. “One of the ironies here,” say the authors,

is that machine learning methods, which seemed so promising as a way of performing what Moretti calls distant reading or what Martin Mueller calls, perhaps even more provocatively, not-reading, is that they require us to trade in a close reading of the original text for something that looks like a close reading of experimental results – a reading that must navigate ambiguity and contradiction. Where we had hoped to explain or understand those larger structures within which an individual text has meaning in the first place, we find ourselves acting once again as interpreters. The confusion matrix, authored in part by the classifier, is a new text, albeit a strange sort of text, one that sends us back to those texts it purports to be about.
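
Their caution is easy to reproduce in miniature. The sketch below is not Pasanek and Sculley's experiment (the metaphors, the party labels, and the tools, here scikit-learn's support vector machine, are all invented stand-ins), but it shows the shape of the exercise: the same classifier, cross-validated under two different representations of the same data, may report different accuracies, and each accuracy is itself a result that has to be interpreted.

    # A toy demonstration of the point above: identical data and classifier, two
    # different representations, potentially two different cross-validated scores.
    # The metaphors and party labels are invented; this is not their dataset.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    metaphors = [
        "the body politic sickens", "the ship of state founders",
        "the nation is a family", "government is a strict father",
        "the state is a machine", "liberty is a flame",
        "the constitution is a contract", "the crown is a yoke",
    ]
    party = ["whig", "whig", "tory", "tory", "whig", "tory", "whig", "tory"]

    for vectorizer in (CountVectorizer(), TfidfVectorizer()):
        model = make_pipeline(vectorizer, LinearSVC())
        scores = cross_val_score(model, metaphors, party, cv=4)
        print(type(vectorizer).__name__, scores.mean())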

I would agree with Pasanek and Sculley on that point, and I think we see it in the use-cases that I've presented here—Tanya Clement reads the patterns of repetition in order to discover the structure of a notoriously unreadable novel; Kirsten Uszkalo reads the clustering of descriptive language in 17th-century texts to identify the emergence of ideas about the supernatural; Sara Steger reads the predictions of sentimentalism as a way of thinking about the components of a genre; and Martin Mueller reads “shingles” of word sequences produced by plagiarism detection techniques as a way of suggesting the authorship of entries in an encyclopedia.

[Figure 24] Pasanek and Sculley make some sensible recommendations for best practices in text-mining for the humanities, including:

  1. Make assumptions explicit
  2. Use multiple representations and methodologies
  3. Report all trials (including failures)
  4. Make data available and methods reproducible
  5. Engage in peer review of methodology

I might be inclined to add one more, which is to avoid approaching the application of this technology as a matter of proving the truth of a hypothesis. For literary purposes, as I suggested at the outset of this paper and elsewhere, I think it makes more sense to think of text-mining tools as offering provocations, surfacing evidence, suggesting patterns and structures, or adumbrating trends. Whereas text-mining is usually about prediction, accuracy, and ground truth, in literary study, I think it is more about surprise, suggestion, and negative capability—and on that point, Sculley and Pasanek concur:

[Figure 25] The virtue of automated analysis is not ready delivery of objective truth, but instead the more profound virtue of bringing us up short, of disturbing us in our preconceptions and our basic assumptions so that we can exist, if only for a moment, in uncertainties, mysteries, and doubts. Should we learn to forestall interpretation, we may come to revise our prejudices, theories, and fore-projections in terms of what emerges.

[Figure 26] The tendency to leap to conclusions is understandable, not least because it is impossible to operate without preconceptions or to make sense of things without paying selective attention across a field of information. But the value of these tools, especially with a large full-text collection, is that they can bring to your attention works that otherwise might be overlooked, they can expose patterns that are so fine-grained that they would otherwise escape notice, and they can allow you to not-read a million books on your way to reading a period, or reading a genre, or even reading a book.


Notes

1 http://www.ulib.org/ULIBAboutUs.htm#goalsBkMark

2 Jeffrey Toobin, “Google's Moon Shot: The Quest for The Universal Library.” New Yorker, Feb. 5, 2007. http://www.newyorker.com/reporting/2007/02/05/070205fa_fact_toobin

3 http://www.dlib.org/dlib/march06/crane/03crane.html

4 Franco Moretti. Graphs, Maps, Trees: Abstract Models for a Literary History. 3-4.

5 Christian Morgenstern, “Die Brille” from Galgenlieder, 1905. Trans. Max Knight, 1964.

6 Cowley, Malcolm (2000). “Gertrude Stein, Writer or Word Scientist.” The Critical Response to Gertrude Stein. Westport, CT: Greenwood Press, 147-150, 148.

7 Please see http://www.wam.umd.edu/~tclement/samplesMoa.html for a comparison chart showing that texts such as Moby-Dick or Ulysses, which have approximately half as many words as The Making of Americans, also have, respectively, three times and five times as many unique words. (Clement's note)

8 J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets,” Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000. (Clement's note)

9 Examples of frequent co-occurring patterns from the text may be found at ftp://ftp.ncsa.uiuc.edu/alg/tanya/withstem (any file on this list opens in a browser to reveal thousands of patterns). (Clement's note)

10 All figures are located on my samples page at http://wam.umd.edu/~tclement/Nebraska2008Samples.html. (Clement's note)

11 Each line in Area C represents five paragraphs in order that the user may see the whole text at once. (Clement's note)

12 The Encyclopédie ou Dictionnaire raisonné des sciences, des arts et des métiers, par une Société de Gens de lettres was published under the direction of Diderot and d'Alembert, with 17 volumes of text and 11 volumes of plates between 1751 and 1772. Containing 72,000 articles written by more than 140 contributors, the Encyclopédie was a massive reference work for the arts and sciences, as well as a machine de guerre which served to propagate the ideas of the French Enlightenment. The impact of the Encyclopédie was enormous. Through its attempt to classify learning and to open all domains of human activity to its readers, the Encyclopédie gave expression to many of the most important intellectual and social developments of its time. -- ARTFL web site, http://www.lib.uchicago.edu/efts/ARTFL/projects/encyc/

13 Brian Vickers, “Thomas Kyd, Secret Sharer.” TLS, April 18, 2008. 13-15.
