Statistical machine translation for Indic languages Sudhansu Bala Das, Divyajyoti Panda, Tapas Kumar Mishra, Bidyut Kr. Patra Natural Language Processing, 2025 Statistical Machine Translation (SMT) systems use various probabilistic and statistical Natural Language Processing (NLP) methods to automatically translate from one language to another while retaining the originality of the context. This paper discusses the development of bilingual SMT models for translating English into fifteen low-resource Indic languages (ILs) and vice versa. The process of building the SMT model is described and explained using a workflow diagram. The Samanantar and OPUS corpora are used for training, and the Flores200 corpus is used for fine-tuning and testing. The paper also highlights various preprocessing methods used to deal with corpus noise. The Moses open-source SMT toolkit is investigated for the system’s development. The impact of distance-based reordering and Morpho-syntactic Descriptor Bidirectional Finite-State Encoder (msd-bidirectional-fe) reordering on ILs is compared. The paper also compares the SMT models with Neural Machine Translation (NMT) for ILs. All the experiments assess translation quality using standard metrics such as BiLingual Evaluation Understudy, Rank-based Intuitive Bilingual Evaluation Score, Translation Edit Rate, and Metric for Evaluation of Translation with Explicit Ordering. The results show that msd-bidirectional-fe reordering performs better than the distance-based reordering model for ILs. It is also noticed that even though the IL-English and English-IL systems are trained using the same corpus, the former performs better on all the evaluation metrics. The comparison between SMT and NMT shows that SMT performs better for some languages, while NMT performs better for others.
Multilingual Neural Machine Translation for Indic to Indic Languages Sudhansu Bala Das, Divyajyoti Panda, Tapas Kumar Mishra, Bidyut Kr. Patra, Asif Ekbal ACM Transactions on Asian and Low Resource Language Information Processing, 2024 The method of translation from one language to another without human intervention is known as Machine Translation (MT). Multilingual neural machine translation (MNMT) is a technique for MT that builds a single model for multiple languages. It is preferred over other approaches, since it decreases training time and improves translation in low-resource contexts, i.e., for languages that have insufficient corpus. However, good-quality MT models are yet to be built for many scenarios such as for Indic-to-Indic Languages (IL-IL). Hence, this article is an attempt to address and develop the baseline models for low-resource languages i.e., IL-IL (for 11 Indic Languages (ILs)) in a multilingual environment. The models are built on the Samanantar corpus and analyzed on the Flores-200 corpus. All the models are evaluated using standard evaluation metrics i.e., Bilingual Evaluation Understudy (BLEU) score (with the range of 0 to 100). This article examines the effect of the grouping of related languages, namely, East Indo-Aryan (EI), Dravidian (DR), and West Indo-Aryan (WI) on the MNMT model. From the experiments, the results reveal that related language grouping is beneficial for the WI group only while it is detrimental for the EI group and it shows an inconclusive effect on the DR group. The role of pivot-based MNMT models in enhancing translation quality is also investigated in this article. Owing to the presence of large good-quality corpora from English (EN) to ILs, MNMT IL-IL models using EN as a pivot are built and examined. To achieve this, English-Indic Language (EN-IL) models are developed with and without the usage of related languages. 
Results show that the use of related language grouping is advantageous specifically for EN to ILs. Thus, related language groups are used for the development of pivot MNMT models. It is also observed that the usage of pivot models greatly improves MNMT baselines. Furthermore, the effect of transliteration on ILs is also analyzed in this article. To explore transliteration, the best MNMT models from the previous approaches (in most cases, the pivot model using related groups) are determined and built on a corpus transliterated from the corresponding scripts to a modified Indian language Transliteration script (ITRANS). The outcome of the experiments indicates that transliteration helps the models built for lexically rich languages, with the largest increases in BLEU score observed for Malayalam (ML) and Tamil (TA), i.e., 6.74 and 4.72, respectively. The BLEU score using transliteration models ranges from 7.03 to 24.29. The best model obtained is the Punjabi (PA)-Hindi (HI) language pair trained on the PA-WI transliterated corpus.
Deep Learning-based POS Tagger and Chunker for Odia Language Using Pre-trained Transformers Tusarkanta Dalai, Tapas Kumar Mishra, Pankaj K. Sa ACM Transactions on Asian and Low Resource Language Information Processing, 2024 Developing effective natural language processing (NLP) tools for low-resourced languages poses significant challenges. This article focuses on Part-of-speech (POS) tagging and chunking, which pertain to the identification and categorization of linguistic units within sentences. POS tagging and chunking have already produced positive results in English and other European languages. However, in Indian languages, particularly in the Odia language, they are not yet well explored because of the lack of supporting tools and resources and the language's complex morphology. This study presents the building of a manually annotated dataset for the Odia phrase chunking task and the development of a deep learning-based model specifically tailored to accommodate the distinctive properties of the language. The Odia chunking corpus was annotated with inside-outside-begin labels, using a purpose-designed Odia chunking tagset. We utilize the constructed Odia chunking dataset to build an Odia chunker based on deep learning techniques, employing state-of-the-art architectures. Various techniques, such as Recurrent Neural Networks, Convolutional Neural Networks, and transformer-based models, are investigated to determine the most effective approach for Odia POS tagging and chunking. In addition, we conduct experiments utilizing diverse input representations, including Odia word embeddings, character-level representations, and sub-word units, to effectively capture the complex linguistic characteristics of the Odia language. Numerous experiments are conducted that evaluate the performance of our Odia POS tagger and chunker, employing standard evaluation metrics and making comparisons with existing approaches. 
The results demonstrate that our transformer-based tagger and chunker achieve superior accuracy and robustness in identifying and categorizing linguistic POS tags and chunks within Odia sentences. They outperform existing work and exhibit consistent performance across diverse linguistic contexts and sentence structures. The developed Odia POS tagger and chunker have enormous potential for a variety of NLP applications, including information extraction, syntactic parsing, and machine translation, all of which are tailored to the low-resource Odia language. This work contributes to developing NLP tools and technologies for low-resource languages, thereby facilitating enhanced language processing capabilities in various linguistic contexts.
Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning Based Approaches Tusarkanta Dalai, Tapas Kumar Mishra, Pankaj K. Sa ACM Transactions on Asian and Low Resource Language Information Processing, 2023 Automatic part-of-speech (POS) tagging is a preprocessing step of many natural language processing tasks, such as named entity recognition, speech processing, information extraction, word sense disambiguation, and machine translation. It has already gained promising results in English and European languages. However, in Indian languages, particularly in the Odia language, it is not yet well explored because of the lack of supporting tools and resources and the morphological richness of the language. Unfortunately, we were unable to locate an open source POS tagger for the Odia language, and only a handful of attempts have been made to develop POS taggers for the Odia language. The main contribution of this research work is to present statistical approaches such as the maximum entropy Markov model and conditional random field (CRF), as well as deep learning based approaches, including the convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM) to develop the Odia POS tagger. A publicly accessible corpus annotated with the Bureau of Indian Standards (BIS) tagset is used in our work. However, most languages around the globe use datasets annotated with the Universal Dependencies (UD) tagset. Hence, to maintain uniformity, the Odia dataset should use the same tagset. Thus, following the BIS and UD guidelines, we constructed a mapping from the BIS tagset to the UD tagset. The maximum entropy Markov model, CRF, Bi-LSTM, and CNN models are trained using the Indian Languages Corpora Initiative corpus with the BIS and UD tagsets. We have experimented with various feature sets as input to the statistical models to prepare a baseline system and observed the impact of constructed feature sets. 
The deep learning based model includes the Bi-LSTM network, the CNN network, the CRF layer, character sequence information, and a pre-trained word vector. Seven different combinations of neural sequence labeling models are implemented, and their performance measures are investigated. It has been observed that the Bi-LSTM model with the character sequence feature and pre-trained word vector achieved a result with 94.58% accuracy.
Improving Multilingual Neural Machine Translation System for Indic Languages Sudhansu Bala Das, Atharv Biradar, Tapas Kumar Mishra, Bidyut Kr. Patra ACM Transactions on Asian and Low Resource Language Information Processing, 2023 The Machine Translation System (MTS) serves as an effective tool for communication by translating text or speech from one language to another. Recently, neural machine translation (NMT) has become popular for its performance and cost-effectiveness. However, NMT systems are restricted in translating low-resource languages as a huge quantity of data is required to learn useful mappings across languages. The need for an efficient translation system becomes obvious in a large multilingual environment like India. Indian languages (ILs) are still treated as low-resource languages due to the unavailability of corpora. In order to address such an asymmetric nature, the multilingual neural machine translation (MNMT) system evolves as an ideal approach in this direction. The MNMT converts many languages using a single model, which is extremely useful in terms of training process and lowering online maintenance costs. It is also helpful for improving low-resource translation. In this article, we propose an MNMT system to address the issues related to low-resource language translation. Our model comprises two MNMT systems, i.e., for English-Indic (one-to-many) and for Indic-English (many-to-one) with a shared encoder-decoder containing 15 language pairs (30 translation directions). Since most IL pairs have only a scanty amount of parallel corpora, not sufficient for training any machine translation model, we explore various augmentation strategies to improve overall translation quality through the proposed model. A state-of-the-art transformer architecture is used to realize the proposed model. 
In addition, the article addresses the use of language relationships (in terms of dialect, script, etc.), particularly about the role of high-resource languages of the same family in boosting the performance of low-resource languages. Moreover, the experimental results also show the advantage of back-translation and domain adaptation for ILs to enhance the translation quality of both source and target languages. Using all these key approaches, our proposed model emerges to be more efficient than the baseline model in terms of evaluation metrics, i.e., BLEU (BiLingual Evaluation Understudy) score for a set of ILs.
NIT Rourkela Machine Translation (MT) System Submission to WAT 2022 for MultiIndicMT: An Indic Language Multilingual Shared Task Proceedings International Conference on Computational Linguistics Coling, 2022
The Linear Arboricity Conjecture for Graphs with Large Girth TK Mishra arXiv preprint arXiv:2512.11240, 2025
Development of a Low-Cost Named Entity Recognition System for Odia Language using Deep Active Learning T Dalai, TK Mishra, PK Sa, P Mohanty, C Swain, AK Nayak Proceedings of the Workshop on Beyond English: Natural Language Processing …, 2025
A thresholding method for Improving translation Quality for Indic MT task SB Das, LR Rodrigues, TK Mishra, BK Patra Proceedings of the First Workshop on Advancing NLP for Low-Resource …, 2025
OdNER: NER resource creation and system development for low-resource Odia language T Dalai, A Das, TK Mishra, PK Sa Natural Language Processing Journal 11, 100139, 2025 Citations: 6
Comparative analysis of subword tokenization approaches for Indian languages SB Das, S Choudhury, TK Mishra, BK Patra arXiv preprint arXiv:2505.16868, 2025 Citations: 5
Statistical machine translation for Indic languages SB Das, D Panda, TK Mishra, BK Patra Natural Language Processing 31 (2), 328-345, 2025 Citations: 25
Investigating the Effect of Backtranslation for Indic Languages SB Das, S Choudhury, TK Mishra, BK Patra Proceedings of the First Workshop on Natural Language Processing for Indo …, 2025 Citations: 5
HGR-FYOLO: a robust hand gesture recognition system for the normal and physically impaired person using frozen YOLOv5 A Sen, S Dombe, TK Mishra, R Dash Multimedia Tools and Applications 83 (30), 73797-73815, 2024 Citations: 8
Novel Human Machine Interface via Robust Hand Gesture Recognition System using Channel Pruned YOLOv5s Model A Sen, TK Mishra, R Dash arXiv preprint arXiv:2407.02585, 2024 Citations: 3
Multilingual Neural Machine Translation for Indic to Indic Languages S Bala Das, D Panda, T Kumar Mishra, B Kr. Patra, A Ekbal ACM Transactions on Asian and Low-Resource Language Information Processing …, 2024 Citations: 36
Deep Learning-based POS Tagger and Chunker for Odia Language Using Pre-trained Transformers T Dalai, TK Mishra, PK Sa ACM Transactions on Asian and Low-Resource Language Information Processing …, 2024 Citations: 15
An approach for mistranslation removal from popular dataset for Indic MT Task SB Das, LR Rodrigues, TK Mishra, BK Patra arXiv preprint arXiv:2401.06398, 2024 Citations: 5
On the size of an r-wise fractional L-intersecting family TK Mishra Journal of Combinatorics 15 (1), 77-87, 2024 Citations: 2
Deep Learning-Based Hand Gesture Recognition System and Design of a Human–Machine Interface A Sen, TK Mishra, R Dash Neural Processing Letters 55 (9), 12569-12596, 2023 Citations: 33
Improving multilingual neural machine translation system for Indic languages SB Das, A Biradar, TK Mishra, BK Patra ACM Transactions on Asian and Low-Resource Language Information Processing …, 2023 Citations: 61
Part-of-speech tagging of Odia language using statistical and deep learning based approaches T Dalai, TK Mishra, PK Sa ACM Transactions on Asian and Low-Resource Language Information Processing …, 2023 Citations: 41
A novel hand gesture detection and recognition system based on ensemble-based convolutional neural network A Sen, TK Mishra, R Dash Multimedia Tools and Applications 81 (28), 40043-40066, 2022 Citations: 39
MOST CITED SCHOLAR PUBLICATIONS
Improving multilingual neural machine translation system for Indic languages SB Das, A Biradar, TK Mishra, BK Patra ACM Transactions on Asian and Low-Resource Language Information Processing …, 2023 Citations: 61
Part-of-speech tagging of Odia language using statistical and deep learning based approaches T Dalai, TK Mishra, PK Sa ACM Transactions on Asian and Low-Resource Language Information Processing …, 2023 Citations: 41
A novel hand gesture detection and recognition system based on ensemble-based convolutional neural network A Sen, TK Mishra, R Dash Multimedia Tools and Applications 81 (28), 40043-40066, 2022 Citations: 39
Multilingual Neural Machine Translation for Indic to Indic Languages S Bala Das, D Panda, T Kumar Mishra, B Kr. Patra, A Ekbal ACM Transactions on Asian and Low-Resource Language Information Processing …, 2024 Citations: 36
Deep Learning-Based Hand Gesture Recognition System and Design of a Human–Machine Interface A Sen, TK Mishra, R Dash Neural Processing Letters 55 (9), 12569-12596, 2023 Citations: 33
Statistical machine translation for Indic languages SB Das, D Panda, TK Mishra, BK Patra Natural Language Processing 31 (2), 328-345, 2025 Citations: 25
Blockchain: Basics, applications, challenges and opportunities J Arya, A Kumar, AP Singh, TK Mishra, PHJ Chong Jan, 2021 Citations: 19
Deep Learning-based POS Tagger and Chunker for Odia Language Using Pre-trained Transformers T Dalai, TK Mishra, PK Sa ACM Transactions on Asian and Low-Resource Language Information Processing …, 2024 Citations: 15
Fractional L-intersecting Families N Balachandran, R Mathew, TK Mishra The Electronic Journal of Combinatorics 26 (2), 2.40, 2019 Citations: 13
Source code auto-completion using various deep learning models under limited computing resources M Sharma, TK Mishra, A Kumar Complex & Intelligent Systems 8 (5), 4357-4368, 2022 Citations: 9
HGR-FYOLO: a robust hand gesture recognition system for the normal and physically impaired person using frozen YOLOv5 A Sen, S Dombe, TK Mishra, R Dash Multimedia Tools and Applications 83 (30), 73797-73815, 2024 Citations: 8
NIT Rourkela machine translation (MT) system submission to WAT 2022 for MultiIndicMT: An Indic language multilingual shared task SB Das, A Biradar, TK Mishra, BK Patra Proceedings of the 9th Workshop on Asian Translation, 73-77, 2022 Citations: 7
A Combinatorial Proof of Fisher’s Inequality R Mathew, TK Mishra Graphs and Combinatorics 36 (6), 1953-1956, 2020 Citations: 7
Boundary detection in dynamic wireless sensor networks using convex hull techniques TK Mishra, J Sadhu, A Kumar 2020 IEEE Calcutta Conference (CALCON), 368-372, 2020 Citations: 7
OdNER: NER resource creation and system development for low-resource Odia language T Dalai, A Das, TK Mishra, PK Sa Natural Language Processing Journal 11, 100139, 2025 Citations: 6
Modular and Fractional L-Intersecting Families of Vector Spaces SS Rogers Mathew, Tapas Kumar Mishra, Ritabrata Ray The Electronic Journal of Combinatorics 29 (1), P1.45, 2022 Citations: 6
Comparative analysis of subword tokenization approaches for Indian languages SB Das, S Choudhury, TK Mishra, BK Patra arXiv preprint arXiv:2505.16868, 2025 Citations: 5
Investigating the Effect of Backtranslation for Indic Languages SB Das, S Choudhury, TK Mishra, BK Patra Proceedings of the First Workshop on Natural Language Processing for Indo …, 2025 Citations: 5
An approach for mistranslation removal from popular dataset for Indic MT Task SB Das, LR Rodrigues, TK Mishra, BK Patra arXiv preprint arXiv:2401.06398, 2024 Citations: 5
Bisecting and D-secting families for set systems N Balachandran, R Mathew, TK Mishra, SP Pal Discrete Applied Mathematics 280, 2-13, 2020 Citations: 5