GutenTag is an NLP-driven tool for digital humanities research in the Project Gutenberg corpus.

The high-level goal of the project is to create an ongoing two-way flow of resources between computational linguists and digital humanists, allowing computational linguists to identify pressing problems in the large-scale analysis of literary texts, while giving digital humanists access to a wider variety of NLP tools for exploring literary phenomena. GutenTag is intended to be a standalone software tool for non-programmers, but the source code is also available and we welcome others in the computational linguistics community to contribute to its development or adapt it as needed.

At its simplest level, GutenTag is a corpus reader. A second facet of the tool is a corpus filter: it uses information contained explicitly within the Project Gutenberg (PG) database and/or derived automatically from other sources to let researchers build subcorpora that reflect their exact analytic needs. A third feature gives GutenTag its name: the tool has access to tagging models which represent the intersection of literary analysis needs and existing NLP methods. The output of GutenTag is either an XML corpus with tags (at both text and metatextual levels) based on the TEI encoding standard or, if desired, a direct statistical analysis of the distribution of tags across different subcorpora. None of the features mentioned above is intended to be static: GutenTag is a tool that will grow and improve with feedback from the digital humanities community and new methods from the computational linguistics community.

For more information about GutenTag in general, please read the paper which introduced it.

Project GutenTag is led by Julian Brooke (University of Melbourne) and Adam Hammond (San Diego State University).

The GutenTag interface proceeds in three steps.

First, you define subcorpora. Next, you decide whether you want to export or analyze these subcorpora. Finally, you specify your export or analysis options.

In the first step, you specify which Project Gutenberg texts you are interested in. Individual subcorpora can be defined by a broad range of parameters, such as genre, author gender, and publication date. Further, you are able to specify particular subsections of texts such as character speech, introductions, and stage directions. You can select multiple subcorpora for export or analysis.
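
To make the idea of parameter-based subcorpus definition concrete, here is a minimal Python sketch. It is not GutenTag's own code or API: the TextRecord class, its field names, and the build_subcorpus function are hypothetical, and the records are placeholders rather than real Project Gutenberg metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextRecord:
    # Hypothetical metadata fields; not GutenTag's actual schema.
    title: str
    genre: str
    author_gender: str
    pub_year: Optional[int]


def build_subcorpus(records, genre=None, author_gender=None, year_range=None):
    """Return the records that match every criterion that is not None."""
    selected = []
    for rec in records:
        if genre is not None and rec.genre != genre:
            continue
        if author_gender is not None and rec.author_gender != author_gender:
            continue
        if year_range is not None:
            start, end = year_range
            # Texts with an unknown publication year cannot satisfy a date filter.
            if rec.pub_year is None or not (start <= rec.pub_year <= end):
                continue
        selected.append(rec)
    return selected


# Placeholder records, invented for illustration only.
records = [
    TextRecord("Placeholder Novel", "fiction", "female", 1850),
    TextRecord("Placeholder Play", "drama", "male", 1895),
    TextRecord("Placeholder Tale", "fiction", "male", None),
]

fiction_1800s = build_subcorpus(records, genre="fiction", year_range=(1800, 1899))
print([rec.title for rec in fiction_1800s])
```

Each additional parameter narrows the selection, and calling the function several times with different parameters corresponds to defining several subcorpora for later export or analysis.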

If you choose to export your subcorpora, you are given a range of options, such as whether you would like GutenTag to output plain text or TEI XML; whether you would like your texts to be part-of-speech tagged; and the maximum number of texts to export.

If you choose to analyze your subcorpora, you are given the option of adding lexical tags. Once you have specified your options, your results are displayed on a separate page.
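
The kind of comparison this analysis step supports, the distribution of tags across subcorpora, can be sketched in a few lines of Python. This is only an illustration under assumptions: the tag names ("positive", "negative"), the token lists, and the tag_distribution function are all hypothetical and do not reflect GutenTag's actual tag set or output format.

```python
from collections import Counter


def tag_distribution(tagged_tokens):
    """Map each tag to its share of all tagged tokens in one subcorpus."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# Hypothetical (token, lexical tag) pairs for two subcorpora.
subcorpus_a = [("joy", "positive"), ("grief", "negative"),
               ("delight", "positive"), ("calm", "positive")]
subcorpus_b = [("sorrow", "negative"), ("gloom", "negative")]

dist_a = tag_distribution(subcorpus_a)
dist_b = tag_distribution(subcorpus_b)

# Print each tag's share side by side; a tag absent from a subcorpus counts as 0.
for tag in sorted(set(dist_a) | set(dist_b)):
    print(f"{tag}: A={dist_a.get(tag, 0.0):.2f} B={dist_b.get(tag, 0.0):.2f}")
```

Using relative shares rather than raw counts keeps subcorpora of different sizes comparable.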

Web Version
Download GutenTag