English-Hindi language identification data

This page provides our language origin–tagged data set. We tagged the data for our language identification experiments, which tested whether language identification could improve transliteration results. Our language identification paper (cited below) gives more details on the tagging process, including how ambiguous names were handled.

The data

The data are available here. The whitepaper for the shared task describes the overall XML format; our language identification tags are added as an extra langid attribute for each Name element (in addition to the ID attribute). The langid attribute is "Hi" for a name of Indian origin or "En" for a name of non-Indian origin.
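As a rough illustration of how the langid attribute might be read, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. Only the Name element and its ID and langid attributes come from the description above; the root element, the SourceName and TargetName child elements, and the two sample entries are assumptions made up for this example, so check them against the shared task whitepaper and the actual files before relying on them.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample, NOT taken from the real data. Only <Name>, ID, and
# langid are described on this page; every other element name is assumed.
SAMPLE = """
<TransliterationCorpus>
  <Name ID="1" langid="Hi">
    <SourceName>aditya</SourceName>
    <TargetName ID="1">आदित्य</TargetName>
  </Name>
  <Name ID="2" langid="En">
    <SourceName>george</SourceName>
    <TargetName ID="1">जॉर्ज</TargetName>
  </Name>
</TransliterationCorpus>
"""


def count_langids(root):
    """Count Name elements per langid value ("Hi" or "En")."""
    return Counter(name.get("langid") for name in root.iter("Name"))


def names_with_langid(root, langid):
    """Return the ID attribute of every Name element tagged with langid."""
    return [name.get("ID") for name in root.iter("Name")
            if name.get("langid") == langid]


root = ET.fromstring(SAMPLE)
counts = count_langids(root)
hindi_ids = names_with_langid(root, "Hi")
print(counts["Hi"], counts["En"], hindi_ids)
```

For a real file, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE); the two helper functions rely only on the Name element and its attributes, so they should not need changes if the child element names differ.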

Because the data that we tagged come from the NEWS 2009 Shared Task on Transliteration, the data are available under the Microsoft Research License Agreement for non-commercial use only.

Using the data in your work

If you use our tagged data in your work, please cite our language identification paper (BibTeX here):

Aditya Bhargava and Grzegorz Kondrak. 2010. Language identification of names with SVMs. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 693–696, Los Angeles, USA. Association for Computational Linguistics.

If you would like to use the original transliteration data, please check with A. Kumaran at Microsoft Research India about the citation requirements for that data.