The Shmoop Corpus:
A Dataset of Stories with Loosely Aligned Summaries

Atef Chaudhury1,2
Makarand Tapaswi1,2,4
Seung Wook Kim1,2,3
Sanja Fidler1,2,3

1University of Toronto
2Vector Institute
3NVIDIA
4Inria Paris


An excerpt from the corpus showing Shmoop summary paragraphs (left) and their chronological alignment to story paragraphs (right).

Understanding stories is a challenging reading comprehension problem for machines, as it requires reading a large volume of text and following long-range dependencies. In this paper, we introduce the Shmoop Corpus: a dataset of 231 stories that are paired with detailed multi-paragraph summaries for each individual chapter (7,234 chapters), where the summary is chronologically aligned with respect to the story chapter. From the corpus, we construct a set of common NLP tasks, including Cloze-form question answering and a simplified form of abstractive summarization, as benchmarks for reading comprehension on stories. We then show that the chronological alignment provides a strong supervisory signal that learning-based methods can exploit, leading to significant improvements on these tasks. We believe that the unique structure of this corpus provides an important foothold towards making machine story comprehension more approachable.






Paper

Atef Chaudhury, Makarand Tapaswi, Seung Wook Kim, Sanja Fidler

The Shmoop Corpus:
A Dataset of Stories with Loosely Aligned Summaries

arXiv: 1912.13082



Aligning Summaries with Stories


We show that chronological constraints improve alignment performance. Interestingly, state-of-the-art NLP models such as BERT are outperformed by a simple TF-IDF-based representation of summary and story paragraphs. We believe this may be due to the complexity of stories: they feature several named characters, require tracking long-range dependencies, and vary widely in linguistic style (e.g., from Shakespeare to modern prose).
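To make the idea of chronologically constrained alignment concrete, here is a minimal sketch, not the method from the paper. It assumes TF-IDF cosine similarity as the paragraph score and enforces the chronological constraint by requiring the aligned story indices to be non-decreasing, solved with a simple dynamic program. The function names (`tokenize`, `tfidf_vectors`, `align_monotonic`) and the exact TF-IDF weighting are illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on any non-alphanumeric character.
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def tfidf_vectors(docs):
    # L2-normalised TF-IDF vectors stored as {word: weight} dicts.
    # The smoothed IDF used here is one common choice (an assumption).
    tokenized = [tokenize(d) for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {w: c * (math.log((1 + n) / (1 + df[w])) + 1.0) for w, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors


def cosine(a, b):
    # Dot product of two already-normalised sparse vectors.
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def align_monotonic(summary_paras, story_paras):
    # Returns one story index per summary paragraph, with indices
    # non-decreasing, maximising the total cosine similarity.
    if not summary_paras or not story_paras:
        return []
    vecs = tfidf_vectors(list(summary_paras) + list(story_paras))
    s_vecs = vecs[: len(summary_paras)]
    t_vecs = vecs[len(summary_paras):]
    sim = [[cosine(s, t) for t in t_vecs] for s in s_vecs]
    n, m = len(s_vecs), len(t_vecs)

    # best[i][j]: best total score for summary paragraphs 0..i
    # when summary paragraph i is aligned to story paragraph j.
    best = [[0.0] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    best[0] = list(sim[0])
    for i in range(1, n):
        run_best, run_arg = -math.inf, 0
        for j in range(m):
            # Running maximum over predecessors k <= j keeps order monotone.
            if best[i - 1][j] > run_best:
                run_best, run_arg = best[i - 1][j], j
            best[i][j] = sim[i][j] + run_best
            back[i][j] = run_arg

    # Trace back from the best final state.
    j = max(range(m), key=lambda k: best[n - 1][k])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path
```

Dropping the running-maximum constraint (taking the best story paragraph independently for each summary paragraph) gives an unconstrained baseline against which the effect of chronological ordering can be compared.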



Story Understanding

Task 1: Cloze-Form Question Answering

Story: Oedipus the King

Question:
_______ reenters and demands that anyone with information about the former king's murder speak up. He curses the murderer.



Multiple Choice Answers:
1) Creon 2) Jocasta 3) Oedipus 4) Teiresias 5) Laius 6) Polybus 7) Apollo 8) Sphinx 9) Corinth 10) Thebes


Aligned Paragraphs:

OEDIPUS
Thebans, if any knows the man by whom Laius, son of Labdacus, was slain, I summon him to make clean shrift to me. And if he shrinks, let him reflect that thus Confessing he shall 'scape the capital charge; For the worst penalty that shall befall him Is banishment--unscathed he shall depart. But if an alien from a foreign land Be known to any as the murderer, Let him who knows speak out, and he shall have Due recompense from me and thanks to boot.
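
A Cloze-form question like the one above can be thought of as a summary sentence with one entity mention blanked out. The following is a minimal sketch of that idea, assuming the candidate entity list is already available; `make_cloze` is a hypothetical helper for illustration and is not taken from the dataset's construction code.

```python
import re


def make_cloze(sentence, entities, blank="_______"):
    # Blank out the earliest whole-word mention of any candidate entity.
    # Returns (question, answer), or None if no entity is mentioned.
    earliest = None
    for name in entities:
        match = re.search(r"\b" + re.escape(name) + r"\b", sentence)
        if match and (earliest is None or match.start() < earliest[0].start()):
            earliest = (match, name)
    if earliest is None:
        return None
    match, name = earliest
    question = sentence[: match.start()] + blank + sentence[match.end():]
    return question, name
```

The remaining entities in the list then serve as the incorrect multiple-choice answers.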


Task 2: Multiple-Choice Abstractive Summarization

Story: A Christmas Carol

Complete the summary:
Scrooge throws out his famous ...



Multiple Choice Answers:
1) ... come over for Christmas dinner, but Scrooge isn't having any of it.
2) ... but what about the whole Jesus's birth thing?
3) ... catchphrase - Bah! Humbug!.
4) ... guys show up asking for any donations for the poor.
5) ... the cellar bursts open and out of it comes Marley's ghost!


Aligned Paragraphs:

"A merry Christmas, uncle! God save you!" cried a cheerful voice. It was the voice of Scrooge's nephew, who came upon him so quickly that this was the first intimation he had of his approach.

"Bah!" said Scrooge, "Humbug!"

He had so heated himself with rapid walking in the fog and frost, this nephew of Scrooge's, that he was all in a glow; his face was ruddy and handsome; his eyes sparkled, and his breath smoked again.
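
The multiple-choice format above pairs a sentence prefix with one true ending and several distractors. A minimal sketch of building such an item follows. It assumes that distractors are endings drawn from other summary sentences and that the split point is a fixed number of words. Both choices are illustrative assumptions rather than the procedure used for the corpus, and `make_completion_item` is a hypothetical name.

```python
import random


def make_completion_item(sentences, target_idx, prefix_words=5, num_choices=5, seed=0):
    # Split the target sentence into a prefix and its true ending, then
    # mix the true ending with endings taken from the other sentences.
    rng = random.Random(seed)

    def split(sentence):
        words = sentence.split()
        return " ".join(words[:prefix_words]), " ".join(words[prefix_words:])

    prefix, answer = split(sentences[target_idx])
    if not answer:
        raise ValueError("target sentence is too short to split")

    # Candidate distractors: non-empty, de-duplicated endings of other sentences.
    pool = [split(s)[1] for i, s in enumerate(sentences) if i != target_idx]
    pool = [e for e in dict.fromkeys(pool) if e and e != answer]
    distractors = rng.sample(pool, min(num_choices - 1, len(pool)))

    choices = distractors + [answer]
    rng.shuffle(choices)
    return {
        "prefix": prefix + " ...",
        "choices": ["... " + c for c in choices],
        "label": choices.index(answer),  # position of the true ending
    }
```

A model can then be scored by how often it selects the choice at `label`, optionally conditioning on the aligned story paragraphs.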




We thank Shmoop for creating an amazing learning resource and allowing us to use their summary data for research purposes. The project was supported by DARPA Explainable AI (XAI) and NSERC Cohesa. This webpage template was borrowed from Poly-RNN++.