This page includes datasets and resources for public use and research in natural language processing, computational linguistics, and cognitive science.

Historical noun-verb compositions dataset described here includes 10k entries of historical syntactic usages of verb-noun compositions (e.g., abandon ship.dobj) that emerged over the 150 years from 1850 to 2000 in the Google Syntactic-Ngrams English corpus.
Historical adjective usages dataset described here includes emergent adjective-noun co-occurrences over the 150 years from 1850 to 2000, collected from the Google Books corpus.
  • Reference: Grewal, K. and Xu, Y. (2021) Chaining algorithms and historical adjective extension. In Nina Tahmasebi, Lars Borin, Adam Jatowt, Yang Xu, and Simon Hengchen (eds.), Computational approaches to semantic change (Language Variation). Berlin: Language Science Press.
Urban Dictionary dataset described here includes over 2,600 form-meaning pairs of slang terms (e.g., ace: "an expert or master specialist") recorded in Urban Dictionary.
Euphemism-taboo pairs dataset described here includes 106 euphemism-taboo pairs (e.g., lavatory-toilet) recorded in a collection of sources.
Child overextension dataset described here includes over 230 word-referent pairs of overextended noun usages (e.g., ball → "balloon") recorded in young children.
Moral vignettes dataset described here includes over 700 moral vignettes (e.g., "people are starving animals to death") and their human-judged moral categories.
Symmetry inference sentence dataset (SIS) described here includes 400 sentences of naturalistic usage for 40 English verbs, which range from highly symmetric predicates (e.g., A marry B = B marry A) to asymmetric ones (e.g., A eat B != B eat A).
Corpus of Chinese Dynastic Histories dataset (CCDH) described here contains text corpora of 24 Chinese Dynastic Histories spanning approximately 2,000 years, from the 3rd century BCE to the 18th century CE.