G2P-transliteration improvement scripts

This page provides the reproducibles for our papers on improving grapheme-to-phoneme conversion (G2P) and machine transliteration with supplemental transcriptions and transliterations. By reproducibles we mean scripts and instructions for reproducing the experimental results given the same data sets. You can grab the first (ACL 2011) paper here and the second (NAACL 2012) paper here.

The scripts

The scripts for the first paper are available in this tarball, while the ones for the second paper can be found here. For both, read the README in the base directory first. It details which software versions are expected, how the files are organized, and what is included. Note that what is provided is almost entirely scripts; we can’t provide any corpora due to their respective licenses. We also don’t include any of the external programs that we use; the README lists all such programs and says where you can get them.

Using the scripts in your work

I discourage direct use of the provided code: it’s meant primarily for educational purposes and is, in general, a mess and likely riddled with gross inefficiencies. If the code or the ideas in the papers go into academic research, please cite the ACL 2011 paper or the NAACL 2012 paper as relevant (BibTeX here and here):

Aditya Bhargava and Grzegorz Kondrak. 2011. How do you pronounce your name? Improving G2P with transliterations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 399–408, Portland, USA. Association for Computational Linguistics.

Aditya Bhargava and Grzegorz Kondrak. 2012. Leveraging supplemental representations for sequential transduction. In Human Language Technologies: The 2012 Conference of the North American Chapter of the Association for Computational Linguistics, pages 396–406, Montréal, Canada. Association for Computational Linguistics.