Discourse structure, rhetorical parsing, and text summarization

Daniel Marcu has developed a first-order formalization of the high-level, rhetorical structure of text. He proposed, analyzed theoretically, and compared empirically four algorithms for determining the valid text structures of a sequence of units among which some rhetorical relations hold. Two of the algorithms apply model-theoretic techniques; the other two apply proof-theoretic techniques. An exploratory corpus analysis of cue phrases then provided the means for applying the formalization to unrestricted natural language texts. A set of empirically motivated algorithms was designed to determine the elementary textual units of a text, to hypothesize rhetorical relations that hold among these units, and, eventually, to derive the discourse structure of that text. The process of finding the discourse structure of unrestricted natural language texts is called rhetorical parsing.
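To make the first two steps of rhetorical parsing concrete, the following is a minimal sketch in Python of cue-phrase-driven segmentation and relation hypothesizing. It is an illustration only, not Marcu's algorithms: the cue table `CUE_RELATIONS`, the relation names, and the functions `segment` and `hypothesize` are all hypothetical, and the sketch assumes that every cue phrase opens a new unit and signals a relation with the unit immediately before it.

```python
import re

# Hypothetical cue-phrase table (an assumption for illustration):
# each cue maps to one relation name it may signal.
CUE_RELATIONS = {
    "because": "CAUSE",
    "although": "CONCESSION",
    "but": "CONTRAST",
    "for example": "EXAMPLE",
}

# Longer cues come first so that multi-word cues are matched whole.
_CUE_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(c) for c in sorted(CUE_RELATIONS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def segment(text):
    """Split text into units, starting a new unit at each cue phrase."""
    units = []
    start = 0
    for match in _CUE_PATTERN.finditer(text):
        chunk = text[start:match.start()].strip(" ,;")
        if chunk:
            units.append(chunk)
        start = match.start()
    tail = text[start:].strip(" ,;.")
    if tail:
        units.append(tail)
    return units


def hypothesize(units):
    """Return (relation, i, j) triples: unit j opens with a cue phrase,
    so a relation is hypothesized between it and the preceding unit i."""
    hypotheses = []
    for j, unit in enumerate(units):
        match = _CUE_PATTERN.match(unit)
        if match and j > 0:
            relation = CUE_RELATIONS[match.group(1).lower()]
            hypotheses.append((relation, j - 1, j))
    return hypotheses


units = segment("The match was cancelled because it rained, but the fans stayed.")
relations = hypothesize(units)
```

A real rhetorical parser would have to handle cue phrases that are ambiguous, that are used non-rhetorically, or that relate non-adjacent units; this sketch ignores all three cases.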

Marcu has also explored two applications of the text theory that he proposed. The first is a discourse-based summarization system, which was shown to significantly outperform both a baseline algorithm and a commercial system. The second is a set of text planning algorithms that natural language generation systems can use to construct text plans when the high-level communicative goal is to map an entire knowledge pool into text.
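The summary above does not say how the discourse-based summarizer selects content. As a hedged illustration of the general idea, the sketch below assumes a discourse tree in which each child of a relation is marked as a nucleus (more central) or a satellite, and ranks units by how close to the root they are "promoted" through chains of nuclei. The classes `Leaf` and `Node` and the functions `promotion`, `salience_depths`, and `summarize` are hypothetical names introduced here, not part of any published system.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Union


@dataclass
class Leaf:
    unit: int  # index of an elementary textual unit


@dataclass
class Node:
    relation: str            # rhetorical relation name (illustrative)
    children: List["Tree"]   # sub-trees, in text order
    nuclei: List[bool]       # nuclei[i] is True if children[i] is a nucleus


Tree = Union[Leaf, Node]


def promotion(tree: Tree) -> Set[int]:
    """Units promoted to this node: a leaf promotes itself; an internal
    node promotes whatever its nucleus children promote."""
    if isinstance(tree, Leaf):
        return {tree.unit}
    promoted: Set[int] = set()
    for child, is_nucleus in zip(tree.children, tree.nuclei):
        if is_nucleus:
            promoted |= promotion(child)
    return promoted


def salience_depths(tree: Tree, depth: int = 0,
                    best: Union[Dict[int, int], None] = None) -> Dict[int, int]:
    """Map each unit to the shallowest depth at which it is promoted."""
    if best is None:
        best = {}
    for unit in promotion(tree):
        best[unit] = min(best.get(unit, depth), depth)
    if isinstance(tree, Node):
        for child in tree.children:
            salience_depths(child, depth + 1, best)
    return best


def summarize(tree: Tree, k: int) -> List[int]:
    """Pick the k units promoted closest to the root, in text order."""
    depths = salience_depths(tree)
    ranked = sorted(depths, key=lambda u: (depths[u], u))
    return sorted(ranked[:k])


# Units 0 and 1 form a concession whose nucleus is unit 1;
# unit 2 is a satellite elaborating on that span.
example = Node(
    "ELABORATION",
    [Node("CONCESSION", [Leaf(0), Leaf(1)], [False, True]), Leaf(2)],
    [True, False],
)
one_unit_summary = summarize(example, 1)
two_unit_summary = summarize(example, 2)
```

In this example unit 1 is promoted all the way to the root, so it is selected first; unit 2 follows because it is attached one level below the root, and unit 0 comes last.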

References:

Also: Marcu 1998.
