A generic character aligned machine transliteration system for Indic languages
A typical problem in machine translation is the handling of out-of-vocabulary (OOV) terms. These are usually names of places, names of people, or technical terms that cannot easily be translated from one language to another, or that become obfuscated when translated. Such terms are instead transliterated, i.e., converted syllable by syllable, or syllable group by syllable group, from one language to another while trying to preserve the pronunciation. Although a large number of transliteration systems have been built over the years, they suffer from several problems. Firstly, any machine learning system is only as good as the dataset used to train it. For resource-poor languages, therefore, such systems either do not exist or perform extremely poorly. Secondly, most transliteration systems are overfitted to the source language. With the proliferation of the Internet and social media, however, language mixing is fairly common, and most such systems fail when words derived from other languages are introduced. In this research, we aim to build transliteration systems that better model the language under consideration and that incorporate additional features to offset the overfitting problem described above. We also explore how inherent similarities between languages can be used to bootstrap transliteration systems for resource-poor languages. Finally, we explore how classical techniques from machine translation and information retrieval can be adapted to the problem at hand to build better and more robust systems.