Yang Xu: Datasets

This page includes datasets and resources for public use.

OpenSub-Slang dataset described here is a large benchmark (7,488 entries of slang usages extracted from movie subtitles) for evaluating informal language processing in (large) language models.

Reference: Sun, Z., Hu, Q., Gupta, R., Zemel, R., and Xu, Y. (2024) Toward informal language processing: Knowledge of slang in large language models. In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Historical noun-verb compositions dataset described here includes 10k entries of historical syntactic usages of verb-noun compositions (e.g., abandon ship.dobj) that emerged over the past 150 years (1850-2000) in the Google Syntactic-Ngrams English corpus.

Reference: Yu, L. and Xu, Y. (2021) Predicting emergent linguistic compositions through time: Syntactic frame extension via multimodal chaining. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.

Historical adjective usages dataset described here includes emergent adjective-noun co-occurrences over the past 150 years (1850-2000) collected from the Google Books corpus.

Reference: Grewal, K. and Xu, Y. (2021) Chaining algorithms and historical adjective extension. In Nina Tahmasebi, Lars Borin, Adam Jatowt, Yang Xu, and Simon Hengchen (eds.), Computational approaches to semantic change (Language Variation). Berlin: Language Science Press.

Urban Dictionary dataset described here includes over 2,600 form-meaning pairs of slang terms (e.g., ace: "an expert or master specialist") recorded in Urban Dictionary.

Reference: Sun, Z., Zemel, R., and Xu, Y. (2021) A computational framework for slang generation. Transactions of the Association for Computational Linguistics, 9, 462-478.

Euphemism-taboo pairs dataset described here includes 106 euphemism-taboo pairs (e.g., lavatory-toilet) recorded in a collection of sources.

Reference: Kapron-King, A. and Xu, Y. (2021) A diachronic evaluation of gender asymmetry in euphemism. In Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change, ACL.

Child overextension dataset described here includes over 230 word-referent pairs of overextended noun usages (e.g., ball → "balloon") recorded in young children.

Reference: Ferreira Pinto Jr., R. and Xu, Y. (2021) A computational theory of child overextension. Cognition, 206, 104472.

Moral vignettes dataset described here includes over 700 moral vignettes (e.g., "people are starving animals to death") and human-judged moral categories.

Reference: Xie, J. Y., Hirst, G., and Xu, Y. (2020) Contextualized moral inference. arXiv preprint arXiv:2008.10762.

Symmetry inference sentence dataset (SIS) described here includes 400 sentences of naturalistic usage for 40 English verbs that range from highly symmetric (e.g., A marry B = B marry A) to asymmetric (e.g., A eat B != B eat A) predicates.

Reference: Tanchip, C., Yu, L., Xu, A., and Xu, Y. (2020) Inferring symmetry in natural language. In Findings of the Association for Computational Linguistics: EMNLP 2020.

Corpus of Chinese Dynastic Histories dataset (CCDH) described here contains text corpora of 24 Chinese Dynastic Histories spanning approximately 2,000 years, from the 3rd century BCE to the 18th century CE.

Reference: Zinin, S. and Xu, Y. (2020) Corpus of Chinese dynastic histories: Gender analysis over two millennia. In Proceedings of the 12th International Conference on Language Resources and Evaluation.