Prefix verb semantic compositionality data

This page provides our semantic compositionality–tagged prefix verb data. The verbs were drawn from the New York Times section of the Gigaword corpus and are annotated for semantic compositionality in the context of given example sentences. The paper gives more detail on how we gathered the data and on what exactly the annotators were asked to do.

Data format

The data are available in two sets, “standard” and “extra,” described below. Each set is divided into training, development, and test portions that match the splits we used in the paper. Each file is tab-delimited with the following fields/columns:

  1. The annotation (this is different for the two sets, more info below)
  2. The original verb
  3. The verb in hyphenated form to indicate the prefix-stem separation
  4. The original text in which the verb occurs
  5. The zero-based index of the word in the text, if the text is treated as a space-delimited list of words
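As a rough illustration of the format, the following Python sketch parses one line into the five fields listed above and uses the zero-based index to recover the verb token from the text. The names `PrefixVerbExample`, `parse_line`, and `token_at_index` are hypothetical, and the sample line is invented for illustration rather than taken from the data. The sketch also assumes that the annotation and the index are stored as integers.

```python
from dataclasses import dataclass


@dataclass
class PrefixVerbExample:
    annotation: int  # field 1: meaning depends on the set (standard or extra)
    verb: str        # field 2: the original verb
    hyphenated: str  # field 3: the verb with the prefix-stem separation marked
    text: str        # field 4: the original text in which the verb occurs
    index: int       # field 5: zero-based index of the word in the text


def parse_line(line: str) -> PrefixVerbExample:
    """Split one tab-delimited line into its five fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    annotation, verb, hyphenated, text, index = fields
    # Assumption: fields 1 and 5 are integers.
    return PrefixVerbExample(int(annotation), verb, hyphenated, text, int(index))


def token_at_index(example: PrefixVerbExample) -> str:
    """Treat the text as a space-delimited list of words and pick one by index."""
    return example.text.split(" ")[example.index]


# Invented sample line, not taken from the data files.
sample = "1\treopen\tre-open\tThe store will reopen next week .\t3"
example = parse_line(sample)
print(example.hyphenated, token_at_index(example))
```

The sketch only returns the indexed token; it does not assume that the token is identical to the verb in the second field.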

Standard set

This set contains the data described in the paper. In this set, the first column marks each verb for semantic compositionality: 1 indicates that the verb is compositional, and 0 indicates that it is not.

The standard data set is available here.

Extra set

This set contains data not used in the paper. For every verb that was marked as non-compositional, the annotators were asked to indicate whether the sentence could be re-phrased using the verb stem in a different form, such as a syntactic or morphological variant, that sufficiently captures the verb’s meaning (the noun stigma for the verb destigmatize, for example). As with the standard set used in the paper, a bidirectional entailment between the original sentence and the re-phrasing is required. In this set, the annotation is indicated in the first column; 1 indicates that the sentence can be re-phrased in the aforementioned way, and 0 indicates that it cannot. Only the verbs that were marked as non-compositional in the standard set are included in the extra set.
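The relationship between the two sets can be checked with a short Python sketch: collect the instances marked 0 (non-compositional) in the standard set, then list any extra-set line that does not correspond to one of them. The function names and the sample lines are hypothetical. The sketch assumes that an instance can be identified by its verb, text, and index, and that these three fields are stored identically in both sets; neither assumption is confirmed by this page.

```python
def instance_key(line: str) -> tuple:
    """Identify an instance by (verb, text, index); the annotation is ignored."""
    fields = line.rstrip("\r\n").split("\t")
    return (fields[1], fields[3], fields[4])


def unexpected_extra_lines(standard_lines, extra_lines):
    """Return extra-set lines with no non-compositional match in the standard set."""
    non_compositional = set()
    for line in standard_lines:
        annotation = line.split("\t", 1)[0]
        if annotation == "0":  # 0 = not compositional in the standard set
            non_compositional.add(instance_key(line))
    return [line for line in extra_lines
            if instance_key(line) not in non_compositional]


# Invented sample lines, not taken from the data files.
standard = [
    "1\treopen\tre-open\tThe store will reopen next week .\t3",
    "0\tdestigmatize\tde-stigmatize\tThey hope to destigmatize the illness .\t3",
]
extra = [
    "1\tdestigmatize\tde-stigmatize\tThey hope to destigmatize the illness .\t3",
]
print(unexpected_extra_lines(standard, extra))
```

An empty result means that every extra-set line matched a non-compositional instance under the assumed key.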

The extra data set is available here.

Using the data in your work

If you use our data in your work, please cite our prefix verb compositionality paper (BibTeX here):

Shane Bergsma, Aditya Bhargava, Hua He, and Grzegorz Kondrak. 2010. Predicting the semantic compositionality of prefix verbs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 293–303, Cambridge, MA. Association for Computational Linguistics.