Title: Mining multilingual and multiscript Twitter data: unleashing the language and script barrier

Authors: Bidhan Sarkar; Nilanjan Sinhababu; Manob Roy; Pijush Kanti Dutta Pramanik; Prasenjit Choudhury

Addresses: Department of Computer Science and Engineering, National Institute of Technology, Durgapur, West Bengal, India; Department of Computer Science and Engineering, Sanaka Educational Trust's Group of Institutions, Durgapur, West Bengal, India; Department of Computer Science and Engineering, National Institute of Technology, Durgapur, West Bengal, India; Department of Computer Science and Engineering, National Institute of Technology, Durgapur, West Bengal, India; Department of Computer Science and Engineering, National Institute of Technology, Durgapur, West Bengal, India

Abstract: Micro-blogging sites like Twitter have become an opinion hub where views on diverse topics are expressed. Interpreting, comprehending and analysing this emotion-rich information can unearth many valuable insights. The job is trivial if the tweets are in English. But lately, the increasing use of native languages for communication has posed a great challenge for social media mining. Things become more complicated when people use the Roman script to write non-English languages. India, being a country with a diverse collection of scripts and languages, faces this problem acutely. We have developed a system that automatically identifies and classifies native-language tweets, irrespective of the script used. By converting all tweets to English, we eliminate the 'script vs language' problem. The new approach we formulated consists of script identification, language analysis, and clustered mining. Considering English and the top two Indian languages, we found that the proposed framework gives better precision than the prevailing approaches.
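
The script identification stage named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that a tweet's dominant script can be read from Unicode character names, and it assumes, purely for illustration, that the scripts of interest are Latin, Devanagari and Bengali (the abstract does not name the two Indian languages). The function names `char_script` and `identify_script` are hypothetical.

```python
import unicodedata
from collections import Counter

# Assumption: these three scripts are chosen for illustration only;
# the abstract does not say which scripts the system handles.
KNOWN_SCRIPTS = ("DEVANAGARI", "BENGALI", "LATIN")


def char_script(ch):
    """Return a coarse script label for one character, or None for non-letters."""
    if not ch.isalpha():
        return None  # digits, punctuation, spaces and combining marks are skipped
    # Unicode names begin with the script, e.g. "DEVANAGARI LETTER KA".
    name = unicodedata.name(ch, "")
    for script in KNOWN_SCRIPTS:
        if name.startswith(script):
            return script
    return "OTHER"


def identify_script(text):
    """Return the most frequent script among the letters of `text`."""
    counts = Counter(s for s in map(char_script, text) if s is not None)
    if not counts:
        return "UNKNOWN"  # no letters at all
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for tweet in ("नमस्ते दुनिया", "আমি ভালো আছি", "mujhe yeh pasand hai"):
        print(identify_script(tweet), "<-", tweet)
```

The last sample is Hindi written in Roman script, and the sketch labels it as Latin. That is the 'script vs language' gap the abstract describes: script identification alone cannot settle the language, so a separate language analysis step is still needed.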

Keywords: Twitter mining; language classification; script identification; Indic language; preprocessing; naive Bayes; support vector machine; LDA.

DOI: 10.1504/IJBIDM.2020.103847

International Journal of Business Intelligence and Data Mining, 2020 Vol.16 No.1, pp.107 - 127

Received: 11 May 2017
Accepted: 07 Aug 2017
Published online: 02 Dec 2019
