Computational analysis of authors’ writing styles, including the detection of plagiarism and sexual predation
Using smaller samples: Traditional methods of automatic authorship discrimination rely on large samples of text, typically complete novels. Hirst and Feiguina (2007) presented a method for shorter samples, based on the frequency of bigrams of syntactic labels that arise from partial parsing of the text. They showed that this method, alone or combined with other classification features, achieves high accuracy in discriminating the work of Anne and Charlotte Brontë, which is very difficult to do by traditional methods. Moreover, high accuracy is achieved even on fragments of text little more than 200 words long.
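As a rough illustration of the kind of feature involved, the following minimal sketch computes relative frequencies of adjacent label pairs from a sequence of syntactic labels. The label sequence and the function name `label_bigram_frequencies` are hypothetical; the partial parser and the exact label set used by Hirst and Feiguina are not specified here, so this should not be read as their implementation.

```python
from collections import Counter
from typing import Dict, List, Tuple


def label_bigram_frequencies(labels: List[str]) -> Dict[Tuple[str, str], float]:
    """Return the relative frequency of each adjacent pair of labels."""
    bigrams = list(zip(labels, labels[1:]))
    if not bigrams:
        return {}
    counts = Counter(bigrams)
    total = len(bigrams)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical partial-parse output for a short fragment (illustrative labels only).
fragment_labels = ["NP", "VP", "NP", "PP", "NP", "VP", "NP"]
features = label_bigram_frequencies(fragment_labels)
```

The resulting dictionary could serve as a feature vector for any standard classifier; because the features are relative frequencies, fragments of different lengths remain comparable.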
Detecting stylistic inconsistencies: As part of a larger project to develop an aid for writers that would help to eliminate stylistic inconsistencies within a document, Graham, Hirst, and Marthi (2005) experimented with neural networks to find the points in a text at which its stylistic character changes. The best results, well above baseline, were achieved with time-delay networks that used features related to the author's syntactic preferences, whereas low-level and vocabulary-based features were not found to be useful. An alternative approach with character bigrams was not successful. (This research is a continuation of our project on .)
Detecting plagiarism: Stylistic inconsistencies in a supposedly single-authored document may indicate plagiarism: parts of the text may have been copied from elsewhere. Brooke and Hirst (2012c) developed a new approach that looks for sudden stylistic changes within a document using low-level textual features and extrinsically measured characteristics.
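The general idea of looking for a sudden stylistic change can be sketched as follows. This is a deliberately simplified illustration, not the method of Brooke and Hirst: the three features (mean word length, mean sentence length, comma rate), the unscaled Euclidean distance, and the function names are all assumptions chosen for brevity.

```python
import math
import re
from typing import List


def style_vector(paragraph: str) -> List[float]:
    """Low-level features: mean word length, mean sentence length, comma rate."""
    words = re.findall(r"[A-Za-z']+", paragraph)
    sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
    n_words = max(len(words), 1)
    return [
        sum(len(w) for w in words) / n_words,
        len(words) / max(len(sentences), 1),
        paragraph.count(",") / n_words,
    ]


def euclidean(a: List[float], b: List[float]) -> float:
    """Unscaled Euclidean distance (an assumption; real systems would normalise)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def largest_style_jump(paragraphs: List[str]) -> int:
    """Return i such that the change between paragraphs i and i+1 is largest."""
    if len(paragraphs) < 2:
        raise ValueError("need at least two paragraphs")
    vectors = [style_vector(p) for p in paragraphs]
    jumps = [euclidean(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    return max(range(len(jumps)), key=jumps.__getitem__)


paragraphs = [
    "The cat sat. It was warm. We ate.",
    "The dog ran. It was fun. We hid.",
    "Notwithstanding considerable institutional resistance, the committee "
    "ultimately endorsed comprehensive organisational restructuring.",
]
boundary = largest_style_jump(paragraphs)  # index of the paragraph before the jump
```

In this toy input the first two paragraphs are stylistically alike and the third differs sharply, so the largest jump falls between the second and third paragraphs.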
Lexical style: The role of the lexicon has been ignored or minimized in most work on computational stylistics. In his doctoral dissertation research, Julian Brooke aimed to fill that gap, demonstrating the key role that the lexicon plays in stylistic variation. The work brings together a number of diverse perspectives, including aesthetic, functional, and sociological aspects of style.
Brooke and Hirst created stylistic lexical resources from large mixed-register corpora. They adapted statistical techniques from approaches to topic and sentiment analysis to induce stylistic lexicons for formality (Brooke, Wang, and Hirst 2010a, 2010b; Brooke and Hirst 2014), readability (Brooke et al. 2012), and three component dimensions of style: colloquial vs. literary, concrete vs. abstract, and subjective vs. objective (Brooke and Hirst 2013a, 2013c). A key novelty of the work is considering multiple correlated styles in a single model.
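To give a feel for what inducing a stylistic lexicon from a corpus can mean, here is a minimal sketch that scores a word by how often it shares a document with formal versus informal seed words. The seed sets, the toy corpus, and the scoring formula are assumptions made for illustration; the cited papers use more sophisticated statistical models, which this sketch does not reproduce.

```python
from typing import List, Set


def formality_score(
    word: str,
    documents: List[List[str]],
    formal_seeds: Set[str],
    informal_seeds: Set[str],
) -> float:
    """Score in [-1, 1]: positive if the word co-occurs mostly with formal seeds."""
    formal_hits = 0
    informal_hits = 0
    for doc in documents:
        tokens = set(doc)
        if word not in tokens:
            continue
        others = tokens - {word}  # do not let a seed word vote for itself
        formal_hits += bool(others & formal_seeds)
        informal_hits += bool(others & informal_seeds)
    total = formal_hits + informal_hits
    return 0.0 if total == 0 else (formal_hits - informal_hits) / total


# Toy mixed-register corpus (hypothetical data).
docs = [
    ["we", "hereby", "commence", "the", "proceedings"],
    ["the", "committee", "shall", "commence", "hereby"],
    ["gonna", "grab", "some", "food", "dude"],
    ["dude", "we", "gonna", "commence", "lol"],
]
formal = {"hereby", "shall"}
informal = {"gonna", "dude"}

commence_score = formality_score("commence", docs, formal, informal)
grab_score = formality_score("grab", docs, formal, informal)
```

Here "commence" appears with formal seeds in two documents and with informal seeds in one, so it receives a mildly positive score, while "grab" appears only alongside informal seeds and scores at the informal extreme.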
The stylistic lexicons were used in a variety of tasks that are relevant to style, in particular tasks relevant to genre and demographic variables, showing that the use of lexical resources compares well to more traditional approaches, in some cases offering information that is simply not available to a system based on surface features. In particular, there was a focus on the task of native language identification, offering a novel method for deriving lexical information from native language texts, and using a cross-corpus supervised approach to show definitively that lexical features are key to high performance on this task (Brooke and Hirst 2012b, 2012d, 2013b, 2013d, 2013e). Other tasks included genre differentiation (Brooke and Hirst 2013f); distinguishing sociolinguistic factors (Brooke and Hirst 2012a); predicting the clipping of long words to shorter ones (such as hippopotamus to hippo) (Brooke, Wang, and Hirst 2011); the detection of plagiarism (Brooke and Hirst 2012c); and various tasks in literary analysis of works by T.S. Eliot and Virginia Woolf (see ).
Detecting sexual predation: Detecting pedophiles who prey on minors in Internet chatrooms is an important problem for law enforcement. But sexual predators may be subtle, and their chats must be distinguished from sexually explicit chats between consenting adults. Morris and Hirst (2012) developed a system that can identify predators in a chatroom with high accuracy by looking at subtleties of their utterances and their chat behaviour.
Does cognitive decline attenuate an author's individual style? As part of our work on detecting Alzheimer's disease by looking at people's writing, Hirst and Feng (2012) considered the question of whether cognitive decline, while causing simplification of a writer's language, also leads to the decline of their individual style. The results were equivocal, as different frameworks yielded contrary results, but an SVM classifier was able to make age discriminations, or nearly so, for all three authors whom we studied, thereby casting doubt on the underlying axiom that an author's essential style is invariant in the absence of cognitive decline.
Labelled network motifs as an indicator of an author's style: In representations of texts as complex networks, nodes of the network represent the words, and edges represent some relationship, usually word co-occurrence. Such network models are able to capture textual patterns. Marinho, Hirst, and Amancio (2018) devised a hybrid classifier, called labelled subgraphs, that combines the frequency of common words with small structures found in the topology of the network. The approach is illustrated in two contexts, authorship attribution and the identification of translationese. In the former, a set of novels written by different authors is analysed. To identify translationese, texts from the Canadian Hansard and the European Parliament were classified as original or translated. The results suggest that labelled subgraphs are able to represent texts and that this should be further explored in other tasks, such as the analysis of text complexity, language proficiency and machine translation.
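The network representation described above can be sketched in a few lines. The example below builds a directed word-adjacency network and counts one simple three-node structure (a chain a -> b -> c) in which a chosen common word sits in the middle. The choice of motif, the adjacency window of one word, and the function names are assumptions for illustration only; the motif inventory and classifier of Marinho, Hirst, and Amancio are not reproduced here.

```python
from collections import defaultdict
from typing import Dict, List, Set


def cooccurrence_network(tokens: List[str]) -> Dict[str, Set[str]]:
    """Directed network: an edge u -> v whenever v immediately follows u."""
    successors: Dict[str, Set[str]] = defaultdict(set)
    for u, v in zip(tokens, tokens[1:]):
        if u != v:  # ignore self-loops from repeated words
            successors[u].add(v)
    return dict(successors)


def labelled_chain_counts(
    network: Dict[str, Set[str]], labels: Set[str]
) -> Dict[str, int]:
    """For each labelled word b, count chains a -> b -> c with a != c."""
    predecessors: Dict[str, Set[str]] = defaultdict(set)
    for u, targets in network.items():
        for v in targets:
            predecessors[v].add(u)
    counts: Dict[str, int] = {}
    for b in labels:
        ins = predecessors.get(b, set())
        outs = network.get(b, set())
        counts[b] = sum(1 for a in ins for c in outs if a != c)
    return counts


tokens = "the cat saw the dog and the dog saw the cat".split()
network = cooccurrence_network(tokens)
motif_counts = labelled_chain_counts(network, {"the", "and"})
```

Attaching the word label to a position in the motif is what distinguishes a labelled subgraph from a purely topological count: two texts with identical network shapes can still differ in which common words occupy the central positions.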
Brooke, Julian and Hirst, Graeme. “Factors of formality: A dimension of register in a sociolinguistic corpus.” Georgetown University Round Table on Languages and Linguistics 2012: Measured Language: Quantitative Approaches to Acquisition, Assessment, Processing and Variation, Washington D.C., March 2012a. 
Brooke, Julian and Hirst, Graeme. “Measuring interlanguage: Native language identification with L1-influence metrics.” Proceedings, 8th ELRA Conference on Language Resources and Evaluation (LREC 2012), Istanbul, May 2012b. 
Brooke, Julian and Hirst, Graeme. “Paragraph clustering for intrinsic plagiarism detection using a stylistic vector-space model with extrinsic features.” Proceedings, PAN 2012 Lab: Uncovering Plagiarism, Authorship and Social Software Misuse -- at the CLEF 2012 Conference and Labs of the Evaluation Forum: Information Access Evaluation meets Multilinguality, Multimodality, and Visual Analytics, Rome, September 2012c. 
Brooke, Julian and Hirst, Graeme. “Robust, lexicalized native language identification.” Proceedings, 24th International Conference on Computational Linguistics (COLING-2012), Mumbai, December 2012d, 391–408. 
Brooke, Julian and Hirst, Graeme. “A multi-dimensional Bayesian approach to lexical style.” Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, June 2013a, 673–679. 
Brooke, Julian and Hirst, Graeme. “Using other learner corpora in the 2013 NLI shared task.” Proceedings, 8th Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta, June 2013b, 188–196. 
Brooke, Julian and Hirst, Graeme. “Hybrid models for lexical acquisition of correlated styles.” Proceedings, 6th International Joint Conference on Natural Language Processing, Nagoya, October 2013c, 82–90. 
Brooke, Julian and Hirst, Graeme. “Native language detection with ‘cheap’ learner corpora.” In: Granger, Sylviane; Gilquin, Gaëtanelle; and Meunier, Fanny (editors) Twenty Years of Learner Corpus Research: Looking back, moving ahead. Louvain-la-Neuve: Presses universitaires de Louvain, 2013d, 37–47. 
Brooke, Julian and Hirst, Graeme. “Investigating the influence of multi-L1 learner corpora variables on native language identification.” Learner Corpus Research Conference, Bergen, Norway, September 2013e.
Brooke, Julian and Hirst, Graeme. “Multidimensional analysis versus latent semantic analysis for constructing a register space: Are hand-coded features needed or is bags-of-words enough?” Presented at the Linguistic Society of Belgium conference on Genre- and Register-related Text and Discourse Features in Multilingual Corpora, Brussels, January 2013f.
Brooke, Julian and Hirst, Graeme. “Supervised ranking of co-occurrence profiles for acquisition of continuous lexical attributes.” Proceedings, 25th International Conference on Computational Linguistics (COLING-2014), Dublin, August 2014, 2172–2183. 
Brooke, Julian; Wang, Tong; and Hirst, Graeme. “Inducing lexicons of formality from corpora.” Workshop on Methods for the Automatic Acquisition of Language Resources and their Evaluation Methods, 7th Language Resources and Evaluation Conference, Valletta, Malta, May 2010a, 17–22.
Brooke, Julian; Wang, Tong; and Hirst, Graeme. “Automatic acquisition of lexical formality.” Proceedings, 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, August 2010b, Poster volume pages 90–98. 
Brooke, Julian; Wang, Tong; and Hirst, Graeme. “Predicting word clipping with latent semantic analysis.” Proceedings, 5th International Joint Conference on Natural Language Processing, Chiang Mai, November 2011, 1392–1396. 
Brooke, Julian; Tsang, Vivian; Jacob, David; Shein, Fraser; and Hirst, Graeme. “Building readability lexicons with unannotated corpora.” Proceedings, Workshop on Predicting and Improving Text Readability for Target Reader Populations, Montreal, June 2012, 33–39. 
Graham, Neil; Hirst, Graeme; and Marthi, Bhaskara. “Segmenting documents by stylistic character.” Natural Language Engineering, 11(4), December 2005, 397–415.
Hirst, Graeme and Feiguina, Ol'ga. “Bigrams of syntactic labels for authorship discrimination of short texts.” Literary and Linguistic Computing, 22(4), 2007, 405–417. 
Hirst, Graeme and Feng, Vanessa Wei. “Changes in style in authors with Alzheimer's disease.” English Studies (special issue on stylometry and authorship attribution), 93(3), May 2012, 357–370. 
Marinho, Vanessa Q.; Hirst, Graeme; and Amancio, Diego R. “Labelled network motifs reveal stylistic subtleties in written texts.” Journal of Complex Networks, 6(4), 1 August 2018, 620–638.
Morris, Colin and Hirst, Graeme. “Identifying sexual predators by SVM classification with lexical and behavioral features.” Proceedings, PAN 2012 Lab: Uncovering Plagiarism, Authorship and Social Software Misuse -- at the CLEF 2012 Conference and Labs of the Evaluation Forum: Information Access Evaluation meets Multilinguality, Multimodality, and Visual Analytics, Rome, September 2012. 
Professor of Computational Linguistics
University of Toronto, Department of Computer Science