Computational analysis of literature

     Work with Julian Brooke, Adam Hammond, and Krishnapriya Vishnubhotla

Distinguishing voices in T.S. Eliot’s The Waste Land: T.S. Eliot’s poem The Waste Land is a notoriously challenging example of modernist poetry, mixing the independent viewpoints of over ten distinct characters without any clear demarcation of which voice is speaking when. Brooke, Hammond, and Hirst (2012, 2013, 2015a) apply unsupervised techniques in computational stylistics to distinguish the particular styles of these voices, offering a computer’s perspective on longstanding debates in literary analysis. Their work includes a model for stylistic segmentation that looks for points of maximum stylistic variation, a k-means clustering model for detecting non-contiguous speech from the same voice, and a stylistic profiling approach that makes use of lexical resources built from a much larger collection of literary texts. Evaluating against an expert interpretation, they show clear progress in distinguishing the voices of The Waste Land as compared to appropriate baselines, and they also offer quantitative evidence both for and against that particular interpretation.
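The idea of segmenting at points of maximum stylistic variation can be illustrated with a minimal sketch. It assumes a sliding-window comparison: at each position, a feature vector for the tokens just before is compared with one for the tokens just after, and peaks in the resulting curve are candidate voice switches. The two toy features, the function-word list, and all function names are hypothetical choices for illustration, not the feature set or implementation of the cited papers.

```python
import math

# Hypothetical function-word list, for illustration only.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "i", "you", "we"}

def features(tokens):
    """Toy stylistic features of a window: mean word length, function-word ratio."""
    n = len(tokens)
    avg_len = sum(len(t) for t in tokens) / n
    func_ratio = sum(t.lower() in FUNCTION_WORDS for t in tokens) / n
    return (avg_len, func_ratio)

def change_curve(tokens, window):
    """Map each position with a full window on both sides to a change score."""
    curve = {}
    for i in range(window, len(tokens) - window + 1):
        before = features(tokens[i - window:i])
        after = features(tokens[i:i + window])
        # Euclidean distance between the two windows' feature vectors.
        curve[i] = math.dist(before, after)
    return curve

def peak_positions(curve):
    """Positions whose score is strictly higher than both neighbours."""
    return [
        i for i in curve
        if curve[i] > curve.get(i - 1, -1.0) and curve[i] > curve.get(i + 1, -1.0)
    ]
```

In practice the features would need to be normalized so that no single one dominates the distance, and the window size trades off sensitivity against noise.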

Quantifying free indirect discourse in the work of Virginia Woolf and James Joyce: Modernist authors such as Virginia Woolf and James Joyce greatly expanded the use of ‘free indirect discourse’, a form of third-person narration that is strongly influenced by the language of a viewpoint character. Traditional approaches to analyzing characterization, such as those based on the work of Burrows, rely on common words. By contrast, the nature of free indirect discourse and the sparseness of the data require an understanding of the stylistic connotations of rarer words and expressions, which cannot be gleaned directly from the target texts. To this end, Brooke, Hammond, and Hirst (2017) apply methods introduced in their recent work to derive information with regard to six stylistic aspects from a large corpus of texts from Project Gutenberg. They thus build high-coverage, finely grained lexicons that include common multiword collocations. Using this information along with student annotations of two modernist texts, Woolf’s To the Lighthouse and Joyce’s The Dead, they confirm that free indirect discourse does, at a stylistic level, reflect a mixture of narration and direct speech, and they investigate the extent to which social attributes of the various characters (in particular age, class, and gender) are reflected in their lexical stylistic profile.
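A lexical stylistic profile of the kind described above can be sketched as an average of lexicon scores over the words and collocations found in a span of text. The lexicon entries, their scores, and the two dimension names below are placeholders invented for illustration; they are not the six stylistic aspects or the lexicons of the cited work.

```python
# Hypothetical style lexicon: word or multiword collocation -> score per dimension.
LEXICON = {
    "perhaps": {"formal": 0.2, "colloquial": 0.1},
    "of course": {"formal": 0.1, "colloquial": 0.6},
    "damn": {"formal": -0.8, "colloquial": 0.9},
    "nevertheless": {"formal": 0.9, "colloquial": -0.7},
}
DIMENSIONS = ("formal", "colloquial")

def style_profile(tokens, lexicon=LEXICON, max_len=2):
    """Average lexicon scores over matched entries, preferring longer matches.

    Returns None when no token or collocation in the span is in the lexicon.
    """
    totals = dict.fromkeys(DIMENSIONS, 0.0)
    matched = 0
    i = 0
    while i < len(tokens):
        step = 1
        for n in range(max_len, 0, -1):
            if i + n > len(tokens):
                continue  # not enough tokens left for an n-word entry
            key = " ".join(tokens[i:i + n]).lower()
            if key in lexicon:
                for d in DIMENSIONS:
                    totals[d] += lexicon[key][d]
                matched += 1
                step = n  # skip past the matched collocation
                break
        i += step
    if matched == 0:
        return None
    return {d: totals[d] / matched for d in DIMENSIONS}
```

Profiles computed this way for spans annotated as narration, direct speech, and free indirect discourse could then be compared dimension by dimension.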

Classifying character voices in modern drama:  According to the literary theory of Mikhail Bakhtin, a dialogic novel is one in which characters speak in their own distinct voices, rather than serving as mouthpieces for their authors. Vishnubhotla, Hammond, and Hirst (2019) use text classification to determine which authors best achieve dialogism, looking at a corpus of plays from the late nineteenth and early twentieth centuries. They find that the SAGE model of text generation, which highlights deviations from a background lexical distribution, is an effective method of weighting the words of characters’ utterances. Their results show that it is indeed possible to distinguish characters by their speech in the plays of canonical writers such as George Bernard Shaw, whereas characters are clustered more closely in the works of lesser-known playwrights.
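The intuition of weighting words by their deviation from a background distribution can be sketched as follows. This is not the SAGE estimator itself, which fits sparse deviations under a regularizing prior; it is only a smoothed log-ratio of a character's word frequencies to background frequencies, offered as a simplified stand-in. The function name and the additive smoothing parameter are assumptions made for illustration.

```python
import math
from collections import Counter

def deviation_weights(character_tokens, background_tokens, smoothing=1.0):
    """Smoothed log-ratio of a character's word probabilities to background ones.

    Positive weights mark words the character uses more than the background does.
    """
    char_counts = Counter(character_tokens)
    bg_counts = Counter(background_tokens)
    vocab = set(char_counts) | set(bg_counts)
    # Additive smoothing keeps every probability non-zero.
    char_total = sum(char_counts.values()) + smoothing * len(vocab)
    bg_total = sum(bg_counts.values()) + smoothing * len(vocab)
    weights = {}
    for word in vocab:
        p_char = (char_counts[word] + smoothing) / char_total
        p_bg = (bg_counts[word] + smoothing) / bg_total
        weights[word] = math.log(p_char / p_bg)
    return weights
```

Here the background would typically be the pooled speech of all characters in a play, so that a character's weights emphasize what sets that voice apart before the utterances are passed to a classifier.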

Can computational analysis contribute to literary analysis? Hammond, Brooke, and Hirst (2013, 2016) use their own work, described above, to resolve the dilemma of close versus distant reading: whether large-scale computational analysis of literature (“distant reading”) can provide insights equal to or complementary to those of human “close reading”. Their starting point is modernist dialogism: the ethically charged, politically inflected tendency of modernist writers to include mutually differentiated and often ideologically opposed voices in their works. They use the insights available at the scale of big data to model and explore dialogism as a concrete phenomenon in modernist texts. By developing new quantitative metrics that are trained on large datasets yet easily interpretable by humans, they build an important bridge between the scales of big and small data, and also between the disciplines of computer science and literary studies. Their approach is specifically tailored to modernist literary studies, developing its computational style-based methodology in response to modernist-era accounts of the politics and ethics of genre (Mikhail Bakhtin’s “dialogism” and Erich Auerbach’s “multipersonal representation of consciousness”).


Brooke, Julian; Hammond, Adam; and Hirst, Graeme. “Unsupervised stylistic segmentation of poetry with change curves and extrinsic features.” Proceedings, Workshop on Computational Linguistics for Literature, Montreal, June 2012, 26–35.  [PDF]

Brooke, Julian; Hirst, Graeme; and Hammond, Adam. “Clustering voices in The Waste Land.” Proceedings, Second ACL Workshop on Computational Linguistics for Literature, Atlanta, June 2013, 41–46.  [PDF]

Brooke, Julian; Hammond, Adam; and Hirst, Graeme. “Distinguishing voices in The Waste Land using computational stylistics.” Linguistic Issues in Language Technology, 12(2), 2015a.  [PDF]

Brooke, Julian; Hammond, Adam; and Hirst, Graeme. “GutenTag: An NLP-driven tool for digital humanities research in the Project Gutenberg Corpus.” Proceedings, Fourth Workshop on Computational Linguistics for Literature, Denver, June 2015b, 42–47.  [PDF]

Brooke, Julian; Hammond, Adam; and Hirst, Graeme. “Using models of lexical style to quantify free indirect discourse in modernist fiction.” Digital Scholarship in the Humanities, 32(2), 2017, 234–250.  [PDF]

Hammond, Adam; Brooke, Julian; and Hirst, Graeme. “A tale of two cultures: Bringing literary analysis and computational linguistics together.” Proceedings, Second ACL Workshop on Computational Linguistics for Literature, Atlanta, June 2013, 1–8.  [PDF]

Hammond, Adam; Brooke, Julian; and Hirst, Graeme. “Modeling modernist dialogism: Close reading with big data.” Reading Modernism with Machines: Digital humanities and modernist literature, edited by Shawna Ross and James O’Sullivan. London: Palgrave Macmillan, 2016, 49–77.  [PDF]

Vishnubhotla, Krishnapriya; Hammond, Adam; and Hirst, Graeme.  “Are fictional voices distinguishable? Classifying character voices in modern drama.”  Proceedings, 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Minneapolis, June 2019.  [PDF]

Graeme Hirst

Professor of Computational Linguistics

University of Toronto, Department of Computer Science