A Bootstrapped Approach to Multilingual Text Stream Parsing
The ubiquitous hashtag has transformed how news stories are reported and shared across social media networks. Such text streams are often massively multilingual, with about 50 different languages on average, and contain a combination of subjective user opinion, objective and evolving information about the story, and unrelated spam. This is in addition to the usual challenges of processing social media content, such as lack of grammar, stylized spellings, and the use of slang, emojis and emoticons. Further, language-dense regions frequently exhibit code switching and code mixing, where users switch between languages within a single post, with or without retaining a single writing system. So far, most research on parsing such streams has resorted to piecemeal and language-specific approaches. In this work, we propose a processing pipeline with two salient features. First, we show how the topical and temporal relationships between posts can be utilized for language-agnostic discourse interpretation. Second, we show how bootstrapping for incremental parsing can improve system performance, and we propose an end-to-end pipeline to that effect. We explore how this pipeline can be utilized for two sample use cases: question answering and summarization.