Computational analysis of literature

Work with Julian Brooke, Adam Hammond, and Krishnapriya Vishnubhotla

Distinguishing voices in T.S. Eliot’s The Waste Land: T.S. Eliot’s poem The Waste Land is a notoriously challenging example of modernist poetry, mixing the independent viewpoints of over ten distinct characters without any clear demarcation of which voice is speaking when. Brooke, Hammond, and Hirst (2012, 2013, 2015a) apply unsupervised techniques in computational stylistics to distinguish the particular styles of these voices, offering a computer’s perspective on longstanding debates in literary analysis. Their work includes a model for stylistic segmentation that looks for points of maximum stylistic variation, a k-means clustering model for detecting non-contiguous speech from the same voice, and a stylistic profiling approach that makes use of lexical resources built from a much larger collection of literary texts. Evaluating their models against an expert interpretation of the poem, they show clear progress in distinguishing the voices of The Waste Land as compared to appropriate baselines, and they also offer quantitative evidence both for and against that particular interpretation.
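The idea of segmenting at points of maximum stylistic variation can be illustrated with a minimal sketch. It assumes that style is approximated by raw word frequencies in a fixed window on either side of each candidate boundary, and that variation is measured by cosine distance; these choices, and the function names `profile`, `cosine_distance`, `change_curve`, and `best_boundary`, are illustrative assumptions and not the features or model used in the cited papers.

```python
from collections import Counter
import math


def profile(tokens):
    """Relative frequency of each token in a window (a crude style profile)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cosine_distance(p, q):
    """1 minus the cosine similarity of two sparse frequency profiles."""
    dot = sum(value * q.get(word, 0.0) for word, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / (norm_p * norm_q)


def change_curve(tokens, window):
    """Map each position to the distance between the windows before and after it."""
    curve = {}
    for i in range(window, len(tokens) - window + 1):
        left = profile(tokens[i - window:i])
        right = profile(tokens[i:i + window])
        curve[i] = cosine_distance(left, right)
    return curve


def best_boundary(tokens, window):
    """Position where the two adjacent windows differ most (assumed voice switch)."""
    curve = change_curve(tokens, window)
    return max(curve, key=curve.get)


# Toy input: an archaic "voice" followed by a colloquial one.
toy_tokens = ["thou", "art"] * 3 + ["ok", "yeah"] * 3
boundary = best_boundary(toy_tokens, window=4)
```

On the toy input, the curve peaks where the vocabulary changes completely. A real poem would need richer stylistic features and more than one boundary, for example by taking several local maxima of the curve.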

Quantifying free indirect discourse in the work of Virginia Woolf and James Joyce: Modernist authors such as Virginia Woolf and James Joyce greatly expanded the use of ‘free indirect discourse’, a form of third-person narration that is strongly influenced by the language of a viewpoint character. Unlike traditional approaches to analyzing characterization using common words, such as those based on the work of Burrows, the nature of free indirect discourse and the sparseness of the data require an understanding of the stylistic connotations of rarer words and expressions, which cannot be gleaned directly from the target texts. To this end, Brooke, Hammond, and Hirst (2017) applied methods introduced in their recent work to derive information with regard to six stylistic aspects from a large corpus of texts from Project Gutenberg. They thus build high-coverage, finely grained lexicons that include common multiword collocations. Using this information along with student annotations of two modernist texts, Woolf’s To the Lighthouse and Joyce’s The Dead, they confirm that free indirect discourse does, at a stylistic level, reflect a mixture of narration and direct speech, and they investigate the extent to which social attributes of the various characters (in particular age, class, and gender) are reflected in their lexical stylistic profile.

Classifying character voices in modern drama: According to the literary theory of Mikhail Bakhtin, a dialogic novel is one in which characters speak in their own distinct voices, rather than serving as mouthpieces for their authors. Vishnubhotla, Hammond, and Hirst (2019) use text classification to determine which authors best achieve dialogism, looking at a corpus of plays from the late nineteenth and early twentieth centuries. They find that the SAGE model of text generation, which highlights deviations from a background lexical distribution, is an effective method of weighting the words of characters’ utterances. Their results show that it is indeed possible to distinguish characters by their speech in the plays of canonical writers such as George Bernard Shaw, whereas characters are clustered more closely in the works of lesser-known playwrights.
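The notion of weighting words by their deviation from a background lexical distribution can be sketched as follows. This is not SAGE itself, which fits sparse additive deviations under a regularizing prior; it is a simplified stand-in that assumes add-one smoothed log-frequency ratios, and the function name `log_deviation` and the toy data are hypothetical.

```python
from collections import Counter
import math


def log_deviation(character_tokens, background_tokens, smoothing=1.0):
    """Smoothed log-ratio of a character's word frequencies to the background.

    Positive weights mark words the character uses more than the background
    does; negative weights mark words the character uses less.
    """
    vocab = set(character_tokens) | set(background_tokens)
    char_counts = Counter(character_tokens)
    bg_counts = Counter(background_tokens)
    char_total = len(character_tokens) + smoothing * len(vocab)
    bg_total = len(background_tokens) + smoothing * len(vocab)
    return {
        word: math.log((char_counts[word] + smoothing) / char_total)
        - math.log((bg_counts[word] + smoothing) / bg_total)
        for word in vocab
    }


# Toy data: the background is all speech in a play, including the character's own.
background = ["indeed", "sir", "indeed", "well", "well", "sir"]
character = ["indeed", "sir", "indeed"]
weights = log_deviation(character, background)
```

In this toy case "indeed" receives a positive weight, "well" a negative one, and "sir" a weight of about zero, since the character uses it at the background rate. Weights of this kind could then serve as features for a text classifier.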

Can computational analysis contribute to literary analysis? Hammond, Brooke, and Hirst (2013, 2016) use their own work, described above, to resolve the dilemma of close versus distant reading: whether large-scale computational analysis of literature (“distant reading”) can provide insights equal to or complementary to those of human “close reading”. Their starting point is modernist dialogism: the ethically charged, politically inflected tendency of modernist writers to include mutually differentiated and often ideologically opposed voices in their works. They use the insights available at the scale of big data to model and explore dialogism as a concrete phenomenon in modernist texts. By developing new quantitative metrics that are trained on large datasets yet easily interpretable by humans, they build an important bridge between the scales of big and small data, and also between the disciplines of computer science and literary studies. Their approach is specifically tailored to modernist literary studies, developing its computational style-based methodology in response to modernist-era accounts of the politics and ethics of genre (Mikhail Bakhtin’s “dialogism” and Erich Auerbach’s “multipersonal representation of consciousness”).

Quotation attribution in literary novels: Prior models for quotation attribution in literary novels assume varying levels of available information in their training and test data, which poses a challenge for in-the-wild inference. We approach quotation attribution as a set of four interconnected sub-tasks: character identification, coreference resolution, quotation identification, and speaker attribution. We benchmark state-of-the-art models on each of these sub-tasks independently, using a large dataset of annotated coreferences and quotations in literary novels (the Project Dialogism Novel Corpus). We also train and evaluate models for the speaker attribution task in particular, showing that a simple sequential prediction model achieves accuracy scores on par with state-of-the-art models.

The Emotion Dynamics of Literary Novels: Stories are rich in the emotions they exhibit in their narratives and evoke in the readers. The emotional journeys of the various characters within a story are central to their appeal. Computational analysis of the emotions of novels, however, has rarely examined the variation in the emotional trajectories of the different characters within them, instead considering the entire novel to represent a single story arc. In this work, we use character dialogue to distinguish between the emotion arcs of the narration and the various characters. We analyze the emotion arcs of the various characters in a dataset of English literary novels using the framework of Utterance Emotion Dynamics. Our findings show that the narration and the dialogue largely express disparate emotions through the course of a novel, and that the commonalities or differences in the emotional arcs of stories are more accurately captured by those associated with individual characters. [Work with Saif M. Mohammad, National Research Council of Canada.]
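A per-speaker emotion arc of the kind described above can be sketched as a rolling average of lexicon scores over each speaker's words. The tiny valence lexicon `TOY_VALENCE`, the function names, and the window size are all assumptions made for illustration; the sketch does not reproduce the Utterance Emotion Dynamics metrics or the lexicons used in the cited work.

```python
from collections import defaultdict

# Hypothetical valence scores in [0, 1]; a real study would use a full lexicon.
TOY_VALENCE = {
    "joy": 0.9, "love": 0.9, "fine": 0.6,
    "dull": 0.3, "grief": 0.1, "dread": 0.1,
}


def rolling_arc(words, lexicon, window):
    """Rolling mean of lexicon scores; words missing from the lexicon are skipped."""
    scores = [lexicon[w] for w in words if w in lexicon]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def arcs_by_speaker(utterances, lexicon, window):
    """Build one arc per speaker from (speaker, text) pairs in narrative order."""
    words = defaultdict(list)
    for speaker, text in utterances:
        words[speaker].extend(text.lower().split())
    return {
        speaker: rolling_arc(tokens, lexicon, window)
        for speaker, tokens in words.items()
    }


# Toy novel: narration treated as one "speaker" alongside a character.
toy_utterances = [
    ("narrator", "dull dull fine"),
    ("Anne", "joy love"),
    ("Anne", "grief dread"),
]
arcs = arcs_by_speaker(toy_utterances, TOY_VALENCE, window=2)
```

Here the narrator's arc stays in a narrow band while Anne's falls steeply, which is the kind of contrast between narration and character dialogue that a single whole-novel arc would average away.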

References

Brooke, Julian; Hammond, Adam; and Hirst, Graeme. “Unsupervised stylistic segmentation of poetry with change curves and extrinsic features.” Proceedings, Workshop on Computational Linguistics for Literature, Montreal, June 2012, 26–35. [PDF]

Brooke, Julian; Hirst, Graeme; and Hammond, Adam. “Clustering voices in The Waste Land.” Proceedings, Second ACL Workshop on Computational Linguistics for Literature, Atlanta, June 2013, 41–46. [PDF]

Brooke, Julian; Hammond, Adam; and Hirst, Graeme. “Distinguishing voices in The Waste Land using computational stylistics.” Linguistic Issues in Language Technology, 12(2), 2015a. [PDF]

Brooke, Julian; Hammond, Adam; and Hirst, Graeme. “GutenTag: An NLP-driven tool for digital humanities research in the Project Gutenberg Corpus.” Proceedings, Fourth Workshop on Computational Linguistics for Literature, Denver, June 2015b, 42–47. [PDF]

Brooke, Julian; Hammond, Adam; and Hirst, Graeme. “Using models of lexical style to quantify free indirect discourse in modernist fiction.” Digital Scholarship in the Humanities, 32(2), 2017, 234–250. [PDF]

Hammond, Adam; Brooke, Julian; and Hirst, Graeme. “A tale of two cultures: Bringing literary analysis and computational linguistics together.” Proceedings, Second ACL Workshop on Computational Linguistics for Literature, Atlanta, June 2013, 1–8. [PDF]

Hammond, Adam; Brooke, Julian; and Hirst, Graeme. “Modeling modernist dialogism: Close reading with big data.” Reading Modernism with Machines: Digital humanities and modernist literature, edited by Shawna Ross and James O’Sullivan. London: Palgrave Macmillan, 2016, 49–77. [PDF]

Vishnubhotla, Krishnapriya; Hammond, Adam; and Hirst, Graeme. “Are fictional voices distinguishable? Classifying character voices in modern drama.” Proceedings, 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Minneapolis, June 2019. [PDF]

Vishnubhotla, Krishnapriya; Rudzicz, Frank; Hirst, Graeme; and Hammond, Adam. “Improving automatic quotation attribution in literary novels.” Proceedings, 61st Annual Meeting of the Association for Computational Linguistics, Volume 2: Short Papers, Toronto, July 2023, 737–746. [PDF]

Vishnubhotla, Krishnapriya; Hammond, Adam; and Hirst, Graeme. “The Project Dialogism Novel Corpus: A dataset for quotation attribution in literary texts.” Proceedings, 13th Conference on Language Resources and Evaluation, Marseille, June 2022, 5838–5848. [PDF]

Vishnubhotla, Krishnapriya; Hirst, Graeme; Hammond, Adam; and Mohammad, Saif M. “The emotion dynamics of literary novels.” Findings of the Association for Computational Linguistics: ACL 2024, August 2024. [PDF]

Vishnubhotla, Krishnapriya. Computational Measures of Language Variation in Textual Utterances. PhD thesis, Department of Computer Science, University of Toronto, May 2024. [PDF]