CodeSwitch Reddit corpus, described in the paper:
"CodeSwitch-Reddit: Exploration of Written Multilingual Discourse in Online Discussion Forums", Rabinovich et al., 2019.

cs_main_reddit_corpus.csv: main dataset comprising English-{Tagalog, Greek, Romanian, Indonesian, Russian} code-switched posts.

cs_additional_reddit_corpus.csv: additional dataset comprising English-{Spanish, Turkish, Arabic, Croatian, Albanian} code-switched posts. This set is less accurate than the main corpus: a smaller share of its posts are true code-switching. We nonetheless recognize its potential usefulness and release it as an addendum to the main corpus, for possible further cleanup and preprocessing.

eng_monolingual_reddit_corpus.csv: monolingual English posts from the set of country-specific subreddits.
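
A quick way to inspect the three files is sketched below in Python, using only the standard library. It is a minimal sketch under stated assumptions, not part of the official release: this README does not list the column names, so the code only reports whatever is in the first row and counts the remaining rows. The UTF-8 encoding, the presence of a header row, the current directory as the data location, and the helper name summarize_csv are all assumptions made for illustration.

```python
import csv
from pathlib import Path

# File names as listed in this README.
CORPUS_FILES = [
    "cs_main_reddit_corpus.csv",
    "cs_additional_reddit_corpus.csv",
    "eng_monolingual_reddit_corpus.csv",
]


def summarize_csv(path, encoding="utf-8"):
    """Return (first_row, number_of_remaining_rows) for one CSV file.

    Assumption: the first row is a header and the file is UTF-8 encoded.
    """
    # newline="" lets the csv module handle line breaks inside quoted fields,
    # which matters for multi-line posts.
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, [])
        n_rows = sum(1 for _ in reader)
    return header, n_rows


if __name__ == "__main__":
    data_dir = Path(".")  # assumed location of the downloaded files
    for name in CORPUS_FILES:
        path = data_dir / name
        if path.exists():
            header, n_rows = summarize_csv(path)
            print(f"{name}: {n_rows} rows, columns = {header}")
        else:
            print(f"{name}: not found in {data_dir.resolve()}")
```

Rows are counted by the CSV parser rather than by raw line count, so a post that contains line breaks inside a quoted field is counted once.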


Please contact ellarabi@gmail.com for any questions.
