Sainik Kumar Mahata

@iem.edu.in

Assistant Professor, Computer Science and Engineering
Institute of Engineering and Management



                 

https://researchid.co/sainik.mahata

EDUCATION

M.Tech, B.E

RESEARCH INTERESTS

Natural Language Processing

23

Scopus Publications

359

Scholar Citations

9

Scholar h-index

9

Scholar i10-index

Scopus Publications

  • Consensus-Based Machine Translation for Code-Mixed Texts
    Sainik Kumar Mahata, Dipankar Das, and Sivaji Bandyopadhyay

    Association for Computing Machinery (ACM)
    Multilingualism in India is widespread due to its long history of foreign acquaintances. This leads to the presence of an audience familiar with conversing using more than one language. Additionally, due to the social media boom, the usage of multiple languages to communicate has become extensive. Hence, a translation system that can serve novice and monolingual users is the need of the hour. Such translation systems can be developed by methods such as statistical machine translation and neural machine translation, where each approach has its advantages as well as disadvantages. In addition, the parallel corpus needed to build a translation system for code-mixed data is not readily available. In the present work, we present two translation frameworks that leverage the individual advantages of these pre-existing approaches by building an ensemble model that takes a consensus of the final outputs of the preceding approaches and generates the target output. The developed models were used for translating English-Bengali code-mixed data (written in Roman script) into their equivalent monolingual Bengali instances. A code-mixed to monolingual parallel corpus was also developed to train the preceding systems. Empirical results show improved BLEU and TER scores of 17.23 and 53.18, and 19.12 and 51.29, respectively, for the two developed frameworks.

  • Exploring the Role of Automated Video Inspection and Recognition in Security Enhancement
    Darothi Sarkar, Monalisa Dey, Sahini Das, Soham Bangal, Aniket Kar, Sainik Kumar Mahata, and Anupam Mondal

    IEEE
    Security has become the highest priority for people, organisations, and governments in a world that is growing more interconnected and complicated. The use of automated video inspection and recognition systems is becoming a very popular and effective method for strengthening security measures. This is a technology-driven system that integrates image processing, Artificial Intelligence, and Machine Learning to identify specific objects, analyse their behaviour and movement, and recognize specific trends to detect unusual activities. This review paper explores and compares the technologies and approaches that have been used in automated video inspection and recognition systems. It also explores the real-world applications and challenges of automated video inspection systems, providing a comprehensive grasp of this vital part of security and monitoring systems.

  • Design Evaluation and Uses of Paraphraser Content Generation
    Soumyajit Chowdhury, Pawan Shaw, and Sainik Kumar Mahata

    IEEE
    Paraphrasing is a challenging and important task in NLP, which involves generating alternative versions of text that preserve the original meaning. Paraphrasing can benefit many NLP applications, such as summarization, translation, and sentiment analysis, by improving the diversity, quality, and readability of the generated text. However, paraphrasing is not a trivial task, as it requires a deep understanding of the semantics and syntax of the text, as well as the ability to produce fluent and natural language. Existing paraphrasing tools are often limited in their scope, accuracy, and flexibility, and they cannot handle complex and diverse texts. In this project, we present Paraphraser, a novel and comprehensive software tool that can perform paraphrasing on any text, using advanced natural language processing techniques, deep learning models, and rich language resources. Paraphraser can generate multiple paraphrased versions of text, ranging from subtle rewording to significant revisions, while maintaining the original meaning and style. Paraphraser can also optimize the paraphrased text for various purposes, such as avoiding plagiarism, ensuring message clarity, and reaching a wider audience. Paraphraser is a versatile and reliable tool that can be used in various domains, such as content marketing, academic writing, and search engine optimization, enhancing the originality and attractiveness of textual content. We demonstrate the effectiveness and usefulness of Paraphraser through various experiments and evaluations, and we show how Paraphraser can improve the performance of various NLP tasks and applications.

  • Text summarization implementing abstractive and extractive methods
    Debjyoti Ghosh, Abhirup Mazumder, and Sainik Kumar Mahata

    IEEE
    How often do we come across paragraphs which contain important information but are too long to read? Most people tend to overlook humongous paragraphs at the expense of losing out on crucial information. This leads to a gap in topics which otherwise connect other significant concepts to make a meaningful learning experience, something we term a knowledge void. This report aims to highlight the importance of summarization using two different methods: abstractive and extractive. We discuss the methods in detail, including the methodologies, architectures, and algorithms involved. This includes the preprocessing of data, the introduction of word embeddings, the application of algorithms like TextRank, building sequence-to-sequence models using LSTMs, applying an encoder-decoder architecture, and other advanced NLP techniques. We also evaluate our work using appropriate evaluation metrics. We experimented with different approaches, such as unidirectional LSTMs, bidirectional LSTMs, a variety of tokenizers, and the incorporation of an attention layer, to obtain the model with optimal accuracy and consistency.

  • Exploring Summarization of Scientific Tables: Analysing Data Preparation and Extractive to Abstractive Summary Generation


  • Simplification of English and Bengali Sentences for Improving Quality of Machine Translation
    Sainik Kumar Mahata, Avishek Garain, Dipankar Das, and Sivaji Bandyopadhyay

    Springer Science and Business Media LLC

  • Preparation of Sentiment tagged Parallel Corpus and Testing Its Effect on Machine Translation
    Sainik Kumar Mahata, Amrita Chandra, Dipankar Das, and Sivaji Bandyopadhyay

    Springer Singapore

  • Classification of COVID19 tweets using Machine Learning Approaches


  • Sentiment Classification of Code-Mixed Tweets using Bi-Directional RNN and Language Tags


  • Performance Gain in Low Resource MT with Transfer Learning: An Analysis concerning Language Families
    Sainik Kumar Mahata, Subhabrata Dutta, Dipankar Das, and Sivaji Bandyopadhyay

    ACM
    Translation systems require a huge amount of parallel data to produce quality translations, but acquiring such data for low-resource languages is difficult. To counter this, recent research has shown that languages can be combined and used to augment the low-resource data through transfer learning. While the gain in performance from transfer learning is apparent, we investigate the correlation between the performance gain and the position of the concerned languages within a language family. We further probe and try to correlate the performance gain with the degree of vocabulary sharing between the concerned languages.

  • Normalization of Numeronyms using NLP Techniques
    Avishek Garain, Sainik Kumar Mahata, and Subhabrata Dutta

    IEEE
    This paper presents a method that applies Natural Language Processing to normalize numeronyms so that they are understandable by humans. We approach the problem in two ways, viz., a semi-supervised approach and a supervised approach. For the semi-supervised approach, we make use of the state-of-the-art Damerau-Levenshtein distance between words. We then apply Cosine Similarity to select the normalized text and reach greater accuracy in solving the problem. For the supervised approach, we use a deep learning architecture. Our approach achieves accuracy figures of 71% and 72% for Bengali and English, respectively, with the semi-supervised approach, and 89% with the supervised approach.

  • JUNLP at SemEval-2020 Task 9: Sentiment Analysis of Hindi-English code mixed data using Grid Search Cross Validation


  • JUNLP@Dravidian-CodeMix-FIRE2020: Sentiment classification of code-mixed tweets using bi-directional RNN and language tags


  • Analyzing Code-Switching Rules for English–Hindi Code-Mixed Text
    Sainik Kumar Mahata, Sushnat Makhija, Ayushi Agnihotri, and Dipankar Das

    Springer Singapore

  • Code-mixed to monolingual translation framework
    Sainik Kumar Mahata, Soumil Mandal, Dipankar Das, and Sivaji Bandyopadhyay

    ACM
    The use of multilingualism among the new generation is widespread in the form of code-mixed data on social media, and therefore a robust translation system is required for catering to the novice and monolingual users. In this work, we present a translation framework that uses a translation-transliteration strategy for translating code-mixed data into their equivalent monolingual instances. One of the goals of this work is to translate a code-mixed source (written in Roman script) to a Bengali target (written in Devanagari script), where the source may contain English, along with transliterated Bengali. Finally, to convert the output to a more readable form, it is reordered using a target language model. The decisive advantage of the proposed framework is that it does not require a code-mixed to monolingual parallel corpus for training and decoding. On testing the framework, it achieved BLEU and TER scores of 16.47 and 55.45, respectively. Since the proposed framework deals with various sub-modules, we dive deeper into the importance of each of them, analyze the errors and finally, discuss some improvement strategies.

  • MTIL2017: Machine translation using recurrent neural network on statistical machine translation
    Sainik Kumar Mahata, Dipankar Das, and Sivaji Bandyopadhyay

    Walter de Gruyter GmbH
    Machine translation (MT) is the automatic translation of the source language to its target language by a computer system. In the current paper, we propose an approach of using recurrent neural networks (RNNs) over traditional statistical MT (SMT). We compare the performance of the phrase table of SMT to the performance of the proposed RNN and in turn improve the quality of the MT output. This work has been done as a part of the shared task problem provided by MTIL2017. We have constructed the traditional MT model using the Moses toolkit and have additionally enriched the language model using external data sets. Thereafter, we have ranked the phrase tables using an RNN encoder-decoder module created originally as a part of the GroundHog project of LISA lab.

  • JUMT at WMT2019 news translation task: A hybrid approach to machine translation for Lithuanian to English


  • Sentiment analysis at SEPLN (TASS)-2019: Sentiment analysis at tweet level using deep learning



  • BUCC2017: A hybrid approach for identifying parallel sentences in comparable corpora


  • WMT2016: A Hybrid Approach to Bilingual Document Alignment


  • Tamper detection of electrocardiographic signal using watermarked bio-hash code in wireless cardiology
    Nilanjan Dey, Monalisa Dey, Sainik Kumar Mahata, Achintya Das, and Sheli Sinha Chaudhuri

    Inderscience Publishers
    The current globalised era is marked with a rapid increase in the use of wireless media to exchange information over globally distributed locations. This advancement and growth of technologically mediated information helps to provide medical care from a distant location by exchanging biomedical information amongst various hospitals and diagnostic centres across the world. However, while transmitting, the medical information becomes highly vulnerable to miscellaneous attacks like tampering and hacking. A watermark is added in the Electrocardiographic (ECG) signal to increase the level of security to help protect the integrity of the data and decrease the chances of wrong diagnosis. In this current work, a technique is proposed to detect undesirable modifications, if present, in a transmitted biomedical ECG signal. The proposed method is based on bio–hashing and reversible watermarking techniques.

  • Electrocardiogram feature based inter-human biometric authentication system
    Monalisa Dey, Nilanjan Dey, Sainik Kumar Mahata, Sayan Chakraborty, Suvojit Acharjee, and Achintya Das

    IEEE
    Biometrics integrates various technologies to identify an individual by exploiting their physiological and behavioral characteristics, which are unique and measurable. This paper proposes a novel technique for the development of a robust and secure biometric authentication system. In this current work, an inter-human ECG-Hash code is generated by performing an inner product between the Electrocardiogram (ECG) feature matrices of two different individuals located remotely. The individuals will have each other's ECG features stored in their databases. The accuracy of the system increases as the authentication mechanism requires traits from both the individuals between whom the transmission is taking place. Moreover, the use of ECG features as a biometric trait enhances the security aspects of the system, as traits like fingerprints or facial features may be compromised with age or otherwise.

RECENT SCHOLAR PUBLICATIONS

  • Consensus-Based Machine Translation for Code-Mixed Texts
    SK Mahata, D Das, S Bandyopadhyay
    ACM Transactions on Asian and Low-Resource Language Information Processing 2024

  • Text summarization implementing abstractive and extractive methods
    D Ghosh, A Mazumder, SK Mahata
    2023 7th International Conference on Electronics, Materials Engineering 2023

  • Exploring the Role of Automated Video Inspection and Recognition in Security Enhancement
    D Sarkar, M Dey, S Das, S Bangal, A Kar, SK Mahata, A Mondal
    2023 7th International Conference on Electronics, Materials Engineering 2023

  • Design Evaluation and Uses of Paraphraser Content Generation
    S Chowdhury, P Shaw, SK Mahata
    2023 7th International Conference on Electronics, Materials Engineering 2023

  • Exploring Summarization of Scientific Tables: Analysing Data Preparation and Extractive to Abstractive Summary Generation.
    M Dey, SK Mahata, D Das
    International Journal for Computers & Their Applications 30 (4) 2023

  • Breast Cancer Classification Using Deep Convolutional Neural Networks
    M Dey, A Mondal, SK Mahata, D Sarkar
    Proceedings of International Conference on Computational Intelligence, Data 2022

  • Sentiment Analysis using Machine Translation
    SK Mahata, A Mondal, M Dey, D Sarkar
    Applications of Machine intelligence in Engineering, 371-377 2022

  • An Automatic Summarization System to Understand the Impact of COVID-19 on Education
    A Mondal, M Dey, SK Mahata, D Sarkar
    Applications of Machine intelligence in Engineering, 379-386 2022

  • Simplification of English and Bengali sentences for improving quality of machine translation
    SK Mahata, A Garain, D Das, S Bandyopadhyay
    Neural Processing Letters 54 (4), 3115-3139 2022

  • Investigating the roles of sentiment in machine translation
    SK Mahata, D Das, S Bandyopadhyay
    Machine Translation 35 (4), 687-709 2021

  • Disease prediction from drug information using machine learning
    S Das, S Kumar Mahata, A Das, K Deb
    American Journal of Electronics & Communication 1 (4), 16-21 2021

  • Classification of COVID19 tweets using machine learning approaches
    A Mondal, S Mahata, M Dey, D Das
    Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop 2021

  • Sentiment classification of code-mixed tweets using bi-directional RNN and language tags
    S Mahata, D Das, S Bandyopadhyay
    Proceedings of the First Workshop on Speech and Language Technologies for 2021

  • Preparation of sentiment tagged parallel corpus and testing its effect on machine translation
    SK Mahata, A Chandra, D Das, S Bandyopadhyay
    Proceedings of International Conference on Big Data, Machine Learning and 2021

  • Performance Gain in Low Resource MT with Transfer Learning: An Analysis concerning Language Families
    SK Mahata, S Dutta, D Das, S Bandyopadhyay
    Proceedings of the 12th Annual Meeting of the Forum for Information 2020

  • JUNLP@ICON2020: Low Resourced Machine Translation for Indic Languages
    S Mahata, D Das, S Bandyopadhyay
    Proceedings of the 17th International Conference on Natural Language 2020

  • JUNLP@Dravidian-CodeMix-FIRE2020: Sentiment classification of code-mixed tweets using bi-directional RNN and language tags
    SK Mahata, D Das, S Bandyopadhyay
    arXiv preprint arXiv:2010.10111 2020

  • Development of POS tagger for English-Bengali code-mixed data
    T Raha, SK Mahata, D Das, S Bandyopadhyay
    arXiv preprint arXiv:2007.14576 2020

  • JUNLP@SemEval-2020 Task 9: Sentiment analysis of Hindi-English code mixed data using grid search cross validation
    A Garain, SK Mahata, D Das
    arXiv preprint arXiv:2007.12561 2020

  • JUNLP@SemEval-2020 Task 9: Sentiment analysis of Hindi-English code mixed data
    A Garain, SK Mahata, D Das
    arXiv preprint arXiv:2007.12561 2020

MOST CITED SCHOLAR PUBLICATIONS

  • MTIL2017: Machine translation using recurrent neural network on statistical machine translation
    SK Mahata, D Das, S Bandyopadhyay
    Journal of Intelligent Systems 2018
    Citations: 60

  • Preparing Bengali-English code-mixed corpus for sentiment analysis of Indian languages
    S Mandal, SK Mahata, D Das
    arXiv preprint arXiv:1803.04000 2018
    Citations: 40

  • Tamper detection of electrocardiographic signal using watermarked bio–hash code in wireless cardiology
    N Dey, M Dey, SK Mahata, A Das, SS Chaudhuri
    International Journal of Signal and Imaging Systems Engineering 8 (1), 46-58 2015
    Citations: 35

  • SMT vs NMT: a comparison over Hindi & Bengali simple sentences
    SK Mahata, S Mandal, D Das, S Bandyopadhyay
    arXiv preprint arXiv:1812.04898 2018
    Citations: 29

  • Electrocardiogram feature based inter-human biometric authentication system
    M Dey, N Dey, SK Mahata, S Chakraborty, S Acharjee, A Das
    2014 International Conference on Electronic Systems, Signal Processing and 2014
    Citations: 27

  • Code-mixed to monolingual translation framework
    SK Mahata, S Mandal, D Das, S Bandyopadhyay
    Proceedings of the 11th Annual Meeting of the Forum for Information 2019
    Citations: 17

  • WMT2016: A hybrid approach to bilingual document alignment
    S Mahata, D Das, S Pal
    Proceedings of the First Conference on Machine Translation: Volume 2, Shared 2016
    Citations: 15

  • Classification of COVID19 tweets using machine learning approaches
    A Mondal, S Mahata, M Dey, D Das
    Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop 2021
    Citations: 12

  • BUCC2017: A Hybrid Approach for Identifying Parallel Sentences in Comparable Corpora
    SK Mahata, D Das, S Bandyopadhyay
    10th Workshop on Building and Using Comparable Corpora, 56-59 2017
    Citations: 12

  • Simplification of English and Bengali sentences for improving quality of machine translation
    SK Mahata, A Garain, D Das, S Bandyopadhyay
    Neural Processing Letters 54 (4), 3115-3139 2022
    Citations: 9

  • Analyzing code-switching rules for English–Hindi code-mixed text
    SK Mahata, S Makhija, A Agnihotri, D Das
    Emerging Technology in Modelling and Graphics: Proceedings of IEM Graph 2018 2020
    Citations: 9

  • A Novel Approach of Steganography using Hill Cipher
    SK Mahata, PM Anupam Mondal, Deepak Kumar
    International Journal of Computer Application, 29-31 2013
    Citations: 9

  • Sentiment classification of code-mixed tweets using bi-directional RNN and language tags
    S Mahata, D Das, S Bandyopadhyay
    Proceedings of the First Workshop on Speech and Language Technologies for 2021
    Citations: 8

  • JUNLP@Dravidian-CodeMix-FIRE2020: Sentiment classification of code-mixed tweets using bi-directional RNN and language tags
    SK Mahata, D Das, S Bandyopadhyay
    arXiv preprint arXiv:2010.10111 2020
    Citations: 8

  • JUNLP@SemEval-2020 Task 9: Sentiment analysis of Hindi-English code mixed data using grid search cross validation
    A Garain, SK Mahata, D Das
    arXiv preprint arXiv:2007.12561 2020
    Citations: 8

  • Normalization of numeronyms using NLP techniques
    A Garain, SK Mahata, S Dutta
    2020 IEEE Calcutta Conference (CALCON), 7-9 2020
    Citations: 8

  • Development of POS tagger for English-Bengali code-mixed data
    T Raha, SK Mahata, D Das, S Bandyopadhyay
    arXiv preprint arXiv:2007.14576 2020
    Citations: 7

  • Disease prediction from drug information using machine learning
    S Das, S Kumar Mahata, A Das, K Deb
    American Journal of Electronics & Communication 1 (4), 16-21 2021
    Citations: 6

  • Sentiment analysis at SEPLN (TASS)-2019: Sentiment analysis at tweet level using deep learning
    A Garain, SK Mahata
    arXiv preprint arXiv:1908.00321 2019
    Citations: 6

  • A Novel Approach to Cryptography using Modified Substitution Cipher and Hybrid Crossover Technique
    SK Mahata, S Nogaja, S Srivastava, M Dey, S Som
    International Journal of Computer Applications 2013
    Citations: 6