@iitrpr.ac.in
Assistant Professor, Department of Electrical Engineering
Indian Institute of Technology Ropar (IIT Ropar)
Dr. Santosh Kumar Vipparthi, a senior member of IEEE, has more than 11 years of teaching and industry experience. He is currently an Assistant Professor in the Department of Electrical Engineering, Indian Institute of Technology Ropar (IIT Ropar). Before this, he served as an Assistant Professor in the Mehta Family School of Data Science and Artificial Intelligence at the Indian Institute of Technology Guwahati (IIT Guwahati) and in the Department of Computer Science and Engineering at Malaviya National Institute of Technology (MNIT), Jaipur, an Institute of National Importance and one of the top NITs, fully funded by the Ministry of Education, Government of India (2013-2022). Dr. Vipparthi's research interests include computer vision and deep learning. He has successfully supervised three PhD scholars and 11 M.Tech students.
Computer Vision, Deep Learning, Facial Expression Recognition, Change Detection
Scopus Publications
Scholar Citations
Scholar h-index
Scholar i10-index
Satya Narayan, Arka Prokash Mazumdar, and Santosh Kumar Vipparthi
Elsevier BV
Monu Verma, Murari Mandal, Satish Kumar Reddy, Yashwanth Reddy Meedimale, and Santosh Kumar Vipparthi
Elsevier BV
Gopa Bhaumik, Monu Verma, Mahesh Chandra Govil, and Santosh Kumar Vipparthi
Springer Science and Business Media LLC
Shruti S. Phutke, Ashutosh Kulkarni, Santosh Kumar Vipparthi, and Subrahmanyam Murala
IEEE
Blind image inpainting is a crucial restoration task that does not require additional mask information to restore corrupted regions. Yet it remains a relatively underexplored research area because of the difficulty of discriminating between corrupted and valid regions. The few existing approaches to blind image inpainting sometimes fail to produce plausible inpainted images, since they follow the common practice of first predicting the corrupted regions and then inpainting them. To skip the corrupted-region prediction step and obtain better results, in this work we propose a novel end-to-end architecture for blind image inpainting consisting of a wavelet query multi-head attention transformer block and omni-dimensional gated attention. The proposed wavelet query multi-head attention in the transformer block provides encoder features, via processed wavelet coefficients, as the query to the multi-head attention. Further, the proposed omni-dimensional gated attention effectively provides all-dimensional attentive features from the encoder to the respective decoder. Our approach is compared numerically and visually with existing state-of-the-art methods for blind image inpainting on different standard datasets. The comparative and ablation studies prove the effectiveness of the proposed approach for blind image inpainting. The testing code is available at: https://github.com/shrutiphutke/Blind_Omni_Wav_Net
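As a rough illustration of the wavelet-query idea described above, the following PyTorch sketch derives the attention query from Haar wavelet coefficients of the encoder features, while the keys and values come from the features themselves. The module name, head count, and choice of a Haar transform are assumptions for illustration, not details of the published Blind_Omni_Wav_Net block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_dwt(x):
    # 2-D Haar transform: returns LL, LH, HL, HH subbands at half resolution.
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 4
    lh = (a - b + c - d) / 4
    hl = (a + b - c - d) / 4
    hh = (a - b - c + d) / 4
    return ll, lh, hl, hh

class WaveletQueryAttention(nn.Module):
    """Cross-attention whose query comes from processed Haar wavelet
    coefficients of the encoder feature map (illustrative sketch only)."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.query_proj = nn.Conv2d(4 * channels, channels, kernel_size=1)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feat):                          # feat: (B, C, H, W)
        B, C, H, W = feat.shape
        subbands = torch.cat(haar_dwt(feat), dim=1)   # (B, 4C, H/2, W/2)
        q = self.query_proj(subbands)                 # processed coefficients
        q = q.flatten(2).transpose(1, 2)              # (B, N/4, C) query tokens
        kv = feat.flatten(2).transpose(1, 2)          # (B, N, C) key/value tokens
        out, _ = self.attn(self.norm(q), kv, kv)      # wavelet-query attention
        out = out.transpose(1, 2).reshape(B, C, H // 2, W // 2)
        out = F.interpolate(out, size=(H, W), mode='bilinear', align_corners=False)
        return feat + out                             # residual connection

feat = torch.randn(1, 32, 64, 64)
print(WaveletQueryAttention(32)(feat).shape)          # torch.Size([1, 32, 64, 64])
```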
Florin-Alexandru Vasluianu, Tim Seizinger, Radu Timofte, Shuhao Cui, Junshi Huang, Shuman Tian, Mingyuan Fan, Jiaqi Zhang, Li Zhu, Xiaoming Wei, et al.
IEEE
This work reviews the results of the NTIRE 2023 Challenge on Image Shadow Removal. The solutions described were proposed for a novel dataset that captures a wide range of object-light interactions. It consists of 1200 roughly pixel-aligned pairs of real shadow-free and shadow-affected images, captured in a controlled environment. The data was captured in a white-box setup, using professional equipment for lighting and data acquisition. The challenge registered 144 participants, of which 19 teams were compared in the final ranking. The proposed solutions extend the work on shadow removal, improving on the performance of state-of-the-art methods.
Monu Verma, Priyanka Lubal, Santosh Kumar Vipparthi, and Mohamed Abdel-Mottaleb
IEEE
Existing neural architecture search (NAS) methods comprise linearly connected convolution operations and use an ample search space to find task-driven convolutional neural networks (CNNs). These CNN models are computationally expensive and diminish the quality of receptive fields for tasks such as micro-expression recognition (MER) with limited training samples. Therefore, we propose a refined neural architecture search strategy to search for a tiny CNN architecture for MER. We introduce a refined hybrid module (RHM) for the inner-level search space and an optimal path explore network (OPEN) for the outer-level search space. The RHM focuses on discovering optimal cell structures by incorporating a multilateral hybrid spatiotemporal operation space; spatiotemporal attention blocks are embedded to refine the aggregated cell features. The OPEN search space aims to trace an optimal path between the cells to generate a tiny spatiotemporal CNN architecture instead of covering all possible tracks. The combination of the RHM and OPEN search spaces enables the NAS method to robustly search for and design an effective and efficient framework for MER. Compared with contemporary works, experiments reveal that RNAS-MER is capable of bridging the gap between NAS algorithms and MER tasks. Furthermore, RNAS-MER achieves new state-of-the-art performance on challenging MER benchmarks, including UAR scores of 0.8511, 0.7620, 0.9078, and 0.8235 on the COMPOSITE, SMIC, CASME-II, and SAMM datasets, respectively.
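The cell-level search described above can be pictured with a DARTS-style continuous relaxation over a small spatiotemporal operation space. The candidate operators and the `MixedOp` name below are illustrative assumptions; the paper's RHM and OPEN search spaces are considerably richer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Candidate spatiotemporal operations for one cell edge (illustrative set).
def make_ops(c):
    return nn.ModuleList([
        nn.Conv3d(c, c, kernel_size=(3, 1, 1), padding=(1, 0, 0)),  # temporal
        nn.Conv3d(c, c, kernel_size=(1, 3, 3), padding=(0, 1, 1)),  # spatial
        nn.Conv3d(c, c, kernel_size=3, padding=1),                  # full 3-D
        nn.Identity(),                                              # skip
    ])

class MixedOp(nn.Module):
    """DARTS-style relaxation: a softmax over learnable architecture
    weights blends the candidates; after search, an argmax keeps a
    single (tiny) operation per edge."""
    def __init__(self, c):
        super().__init__()
        self.ops = make_ops(c)
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

# x: (batch, channels, frames, height, width)
x = torch.randn(2, 8, 16, 28, 28)
print(MixedOp(8)(x).shape)   # torch.Size([2, 8, 16, 28, 28])
```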
Anamika Satrawala, Arka Prokash Mazumdar, and Santosh Kumar Vipparthi
Elsevier BV
Monu Verma, M. Satish Kumar Reddy, Yashwanth Reddy Meedimale, Murari Mandal, and Santosh Kumar Vipparthi
Institute of Electrical and Electronics Engineers (IEEE)
Facial microexpressions offer useful insights into subtle human emotions. This unpremeditated emotional leakage exhibits the true emotions of a person. However, the minute temporal changes in video sequences are very difficult to model for accurate classification. In this article, we propose a novel spatiotemporal architecture search algorithm, AutoMER, for microexpression recognition (MER). Our main contribution is a new parallelogram-design-based search space for efficient architecture search. We introduce a spatiotemporal feature module named 3-D singleton convolution for cell-level analysis. Furthermore, we present four such candidate operators and two 3-D dilated convolution operators to encode the raw video sequences in an end-to-end manner. To the best of our knowledge, this is the first attempt to discover 3-D convolutional neural network (CNN) architectures with a network-level search for MER. The models searched using the proposed AutoMER algorithm are evaluated over five microexpression datasets: CASME-I, SMIC, CASME-II, CAS(ME)2, and SAMM. The generated models quantitatively outperform the existing state-of-the-art approaches. AutoMER is further validated with different configurations, such as the downsampling rate factor, multiscale singleton 3-D convolution, parallelogram design, and multiscale kernels. Overall, five ablation experiments were conducted to analyze the operational insights of the proposed AutoMER.
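A minimal sketch of what a 3-D singleton convolution cell operator might look like, interpreted here (as an assumption, since the paper defines its own operator) as a per-channel depthwise 3-D convolution followed by a pointwise mix:

```python
import torch
import torch.nn as nn

class Singleton3DConv(nn.Module):
    """A guess at a '3-D singleton convolution': a depthwise (per-channel)
    3-D convolution followed by a 1x1x1 pointwise mix. This interpretation
    is an assumption for illustration, not the paper's exact operator."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.depthwise = nn.Conv3d(channels, channels, kernel_size,
                                   padding=pad, groups=channels)
        self.pointwise = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x):                   # x: (B, C, T, H, W)
        return self.pointwise(self.depthwise(x))

clip = torch.randn(1, 16, 8, 32, 32)        # an 8-frame clip of 16-channel features
print(Singleton3DConv(16)(clip).shape)      # torch.Size([1, 16, 8, 32, 32])
```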
Murari Mandal and Santosh Kumar Vipparthi
Institute of Electrical and Electronics Engineers (IEEE)
Murari Mandal and Santosh Kumar Vipparthi
Institute of Electrical and Electronics Engineers (IEEE)
Visual change detection in video is one of the essential tasks in computer vision applications. Recently, a number of supervised deep learning methods have achieved top performance on the benchmark datasets for change detection. However, the inconsistent training-testing data division schemes adopted by these methods have led to the documentation of incomparable results. We address this crucial issue through well-defined propositions for benchmark comparative analysis. Existing works evaluate models in a scene-dependent setup, which makes it difficult to assess a model's generalization capability on completely unseen videos and also leads to inflated results. Therefore, in this paper, we present a completely scene-independent evaluation strategy for a comprehensive analysis of model design for change detection. We propose well-defined scene-independent and scene-dependent experimental frameworks for training and evaluation on the benchmark CDnet 2014, LASIESTA, and SBMI2015 datasets. A cross-data evaluation is performed with the PTIS dataset to further measure the robustness of the models. We designed a fast and lightweight online end-to-end convolutional network called ChangeDet (58.8 fps, 1.59 MB model size) to achieve robust performance on completely unseen videos. ChangeDet estimates the background through a sequence of maximum multi-spatial receptive feature (MMSR) blocks using past temporal history. Contrasting features are produced through the assimilation of the temporal median and contemporary features from the current frame. These features are then processed through an encoder-decoder to detect pixel-wise changes. The proposed ChangeDet outperforms the existing state-of-the-art methods on all four benchmark datasets.
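The temporal-median contrast cue mentioned in the abstract can be sketched in a few lines of PyTorch. This toy function only mirrors the general idea of contrasting a median background with the current frame; the MMSR blocks and encoder-decoder of ChangeDet are not reproduced here.

```python
import torch

def temporal_contrast(history, current):
    """Toy version of the median-based contrast cue used in background
    subtraction pipelines such as ChangeDet (not the authors' exact MMSR
    blocks): subtract the per-pixel temporal median of past frames from
    the current frame.
    history: (T, C, H, W) past frames; current: (C, H, W)."""
    background = history.median(dim=0).values   # per-pixel temporal median
    contrast = (current - background).abs()     # high where the scene changed
    return background, contrast

frames = torch.rand(50, 3, 240, 320)            # 50 past RGB frames
bg, fg_cue = temporal_contrast(frames, torch.rand(3, 240, 320))
mask = (fg_cue.mean(dim=0) > 0.25).float()      # crude pixel-wise change mask
```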
Prafulla Saxena, Kuldeep Biradar, Dinesh Kumar Tyagi, and Santosh Kumar Vipparthi
IEEE
Monu Verma, Santosh Kumar Vipparthi, and Girdhari Singh
Institute of Electrical and Electronics Engineers (IEEE)
Anamika Satrawala, Arka Prokash Mazumdar, and Santosh Kumar Vipparthi
IEEE
Driver behavior analysis is one of the critical issues that must be addressed to prevent traffic accidents. It contributes to many real-time applications, such as usage-based insurance (UBI), pay-as-you-drive (PAY-D) pricing, and insurance premium calculation. Driver Behavior Profiling-Prognosis (DBP-P) is considered a quantitative risk-assessment parameter for road accidents and is a fusion of two sub-processes: behavior scoring and classification of driving patterns. The selection of features such as speed or acceleration is the essential and decisive factor in characterizing driving behavior. Although a number of such schemes exist in the literature, most of them score each vehicle independently. Such scores, however, do not clearly indicate a driver's driving quality or risk of collision with other vehicles. Therefore, to overcome these limitations, this paper proposes a relative, adaptive, and distributed driver behavior profiling technique, named Distributed Adaptive Recommendation & Time-stamp based Estimation of Driver-Behaviour (DARTED), which generates driving scores to quantify and classify driver behavior as good or bad. Moreover, driver scores can be computed at each timestamp with a classified label, which can be used in various applications aimed at collision analysis. The experimental results indicate that the proposed method achieves significant accuracy in different traffic scenarios. The model may help researchers study and enhance their understanding of driver behavior and may serve many real-time industrial applications.
Murari Mandal, Yashwanth Reddy Meedimale, M. Satish Kumar Reddy, and Santosh Kumar Vipparthi
Institute of Electrical and Electronics Engineers (IEEE)
Sachin Dube, Kuldeep Biradar, Santosh Kumar Vipparthi, and Dinesh Kumar Tyagi
Springer International Publishing
Deepti Sharma, Kuldeep M. Biradar, Santosh K. Vipparthi, and Ramesh B. Battula
Springer International Publishing
Gopa Bhaumik, Monu Verma, Mahesh Chandra Govil, and Santosh Kumar Vipparthi
Springer Singapore
Monika Choudhary, Satyendra Singh Chouhan, Emmanuel S. Pilli, and Santosh Kumar Vipparthi
Elsevier BV
Monu Verma, Ayushi Gupta, and Santosh K. Vipparthi
IEEE
Hand gesture recognition (HGR) is a challenging task whose performance is influenced by various factors such as illumination variations, cluttered backgrounds, and spontaneous capture. Conventional CNNs for HGR follow a two-stage pipeline to deal with these challenges: complex signs, illumination variations, and complex, cluttered backgrounds. Existing approaches require expert knowledge as well as auxiliary computation in the first stage to remove such complexities from the input images. Therefore, in this paper, we propose a novel end-to-end compact CNN framework, the fine-grained feature attentive network for hand gesture recognition (Fit-Hand), to address the challenges discussed above. The pipeline of the proposed architecture consists of two main units: a FineFeat module and a dilated convolutional (Conv) layer. The FineFeat module extracts fine-grained feature maps by employing an attention mechanism over multiscale receptive fields. The attention mechanism is introduced to capture effective features by enlarging the average behavior of the multiscale responses. Moreover, the dilated convolution provides global features of hand gestures through a larger receptive field. In addition, an integration layer combines the features of the FineFeat module and the dilated layer, which enhances the discriminability of the network by capturing complementary contextual information about hand postures. The effectiveness of Fit-Hand is evaluated using subject-dependent (SD) and subject-independent (SI) validation setups over seven benchmark datasets: MUGD-I, MUGD-II, MUGD-III, MUGD-IV, MUGD-V, Finger Spelling, and OUHANDS. Furthermore, to investigate deeper insights into the proposed Fit-Hand framework, we performed ten ablation studies.
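To make the FineFeat idea concrete, here is a hedged PyTorch sketch in which responses from several receptive fields are averaged into an attention map that re-weights the fused features, and a dilated branch supplies larger-context features. The kernel sizes, sigmoid attention, and fusion by addition are illustrative assumptions, not the published design.

```python
import torch
import torch.nn as nn

class FineFeatSketch(nn.Module):
    """Illustrative fine-grained multiscale attention block: responses
    from several receptive fields are averaged into an attention map
    that re-weights the fused features. A sketch of the idea only, not
    the published FineFeat module."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in (1, 3, 5)
        ])
        self.dilated = nn.Conv2d(in_ch, out_ch, 3, padding=2, dilation=2)

    def forward(self, x):
        scales = [b(x) for b in self.branches]
        fused = sum(scales) / len(scales)        # average multiscale response
        attention = torch.sigmoid(fused)         # attention from the average
        fine = fused * attention                 # re-weighted fine features
        context = self.dilated(x)                # larger receptive field
        return fine + context                    # integrate both cues

x = torch.randn(1, 3, 128, 128)
print(FineFeatSketch(3, 32)(x).shape)            # torch.Size([1, 32, 128, 128])
```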
Gaurav Jain, Amit M. Joshi, Ravi Kumar Maddila, and Santosh Kumar Vipparthi
IEEE
Hemoglobin is a protein in red blood cells (RBCs) that supplies oxygen to the human body. A person's hemoglobin becomes glycosylated as the level of blood sugar increases. Glycated hemoglobin (HbA1c), which measures the glucose attached to hemoglobin, is a widely used measure of glycemic control. Different methods are adopted for the measurement of HbA1c; several invasive methods are widely used in pathology laboratories across the globe. This paper summarizes the current status of non-invasive HbA1c and blood glucose measurement techniques.
Murari Mandal, Vansh Dhar, Abhishek Mishra, Santosh Kumar Vipparthi, and Mohamed Abdel-Mottaleb
Institute of Electrical and Electronics Engineers (IEEE)
Change detection is an elementary task in computer vision and video processing applications. Recently, a number of supervised methods based on convolutional neural networks have reported high performance on the benchmark datasets. However, their success depends on the availability of a certain proportion of annotated frames from the test video during training. Thus, their performance on completely unseen videos, or in a scene-independent setup, is undocumented in the literature. In this work, we present a scene independent evaluation (SIE) framework to test supervised methods on completely unseen videos and obtain generalized models for change detection. In addition, a scene dependent evaluation (SDE) is performed to document a comparative analysis with existing approaches. We propose a fast (25 fps) and lightweight (0.13 million parameters, 1.16 MB model size) end-to-end 3D-CNN-based change detection network (3DCD) with multiple spatiotemporal learning blocks. The proposed 3DCD consists of a gradual reductionist block for background estimation from past temporal history. It also enables motion saliency estimation, multi-schematic feature encoding-decoding, and finally foreground segmentation through several modular blocks. The proposed 3DCD outperforms the existing state-of-the-art approaches evaluated in both the SIE and SDE setups on the benchmark CDnet 2014, LASIESTA, and SBMI2015 datasets. To the best of our knowledge, this is the first attempt to present results in clearly defined SDE and SIE setups on three change detection datasets.
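The gradual reduction of temporal history can be pictured as 3-D convolutions that stride over the time axis until the frame history collapses to a single background-like feature map. The sketch below, with assumed layer widths and history length, illustrates only this reduction step, not the full 3DCD network.

```python
import torch
import torch.nn as nn

class TemporalReduction(nn.Module):
    """Sketch of a 'gradual reductionist' background estimator: 3-D
    convolutions stride over the temporal axis, halving the history at
    each stage until a single feature map remains. Layer counts and
    widths are illustrative, not the 3DCD design. Assumes the history
    length is a power of two."""
    def __init__(self, c=16, history=8):
        super().__init__()
        layers, t = [], history
        while t > 1:                               # halve T at each stage
            layers += [nn.Conv3d(c, c, (2, 3, 3), stride=(2, 1, 1),
                                 padding=(0, 1, 1)), nn.ReLU(inplace=True)]
            t //= 2
        self.reduce = nn.Sequential(*layers)

    def forward(self, clip):                       # clip: (B, C, T, H, W)
        return self.reduce(clip).squeeze(2)        # (B, C, H, W) background

clip = torch.randn(1, 16, 8, 60, 80)               # 8-frame feature history
print(TemporalReduction(16, history=8)(clip).shape)  # torch.Size([1, 16, 60, 80])
```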
Monu Verma, Santosh Kumar Vipparthi, and Girdhari Singh
Institute of Electrical and Electronics Engineers (IEEE)
Microexpressions are hard to spot due to fleeting and involuntary movements of facial muscles, and interpreting microemotions from video clips is a challenging task. In this article, we propose affective-motion imaging, which cumulates the rapid, short-lived variational information of microexpressions into a single response. Moreover, we propose AffectiveNet, an affective-motion feature learning network that can perceive subtle changes and learn the most discriminative dynamic features to describe emotion classes. AffectiveNet comprises two blocks: the MICRoFeat block and the MFL block. The MICRoFeat block conserves scale-invariant features, which allows the network to capture both coarse and tiny edge variations, whereas the MFL block learns micro-level dynamic variations from two different intermediate convolutional layers. The effectiveness of the proposed network is tested over four datasets using two experimental setups: person-independent and cross-dataset validation. The experimental results show that the proposed network outperforms state-of-the-art MER approaches by a significant margin.
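One plausible reading of affective-motion imaging is sketched below, under the assumption that the cumulated single response is an accumulation of absolute frame-to-frame differences (the paper's exact formulation may differ):

```python
import torch

def affective_motion_image(clip):
    """Assumed reading of 'affective-motion imaging': accumulate the
    frame-to-frame variation of a micro-expression clip into a single
    response image (an illustration, not the paper's exact formulation).
    clip: (T, H, W) grayscale frames in [0, 1]."""
    diffs = (clip[1:] - clip[:-1]).abs()            # short-lived variations
    motion = diffs.sum(dim=0)                       # cumulate over time
    return motion / motion.max().clamp(min=1e-8)    # normalize to [0, 1]

clip = torch.rand(30, 112, 112)         # a 30-frame clip
ami = affective_motion_image(clip)      # single-image input for a 2-D CNN
print(ami.shape)                        # torch.Size([112, 112])
```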
Satya Narayan, S. K. Vipparthi, and A. P. Mazumdar
IEEE
Feature extraction is one of the most important techniques in many pattern recognition applications. More specifically, the performance of a hand gesture detection and recognition system depends on the robustness of the designed feature descriptor. In this paper, we propose a parity check based descriptor (PCBD) for hand gesture recognition. The descriptor extracts intensity variations by establishing the bit-plane relationship between neighboring pixels. Bit-level thresholding is used to encode the patterns, and the extracted features are used to train an SVM classifier on the HGRI database, improving the efficiency of hand gesture recognition with greater discriminability and low memory requirements. The experimental results show better performance of the proposed method compared to existing state-of-the-art approaches.
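The bit-plane parity idea can be illustrated with a toy NumPy encoder that XORs a chosen bit-plane of each pixel with its eight neighbors and packs the parity bits into an LBP-like code. The bit-plane index and packing order are assumptions for illustration; the published PCBD encoding differs in its details.

```python
import numpy as np

def parity_bitplane_code(img, bit=4):
    """Toy descriptor in the spirit of PCBD (an illustrative guess, not
    the published encoding): take one bit-plane of the image, XOR the
    centre bit with its 8 neighbours, and pack the parity bits into an
    8-bit code per pixel."""
    plane = (img >> bit) & 1                       # selected bit-plane
    h, w = plane.shape
    centre = plane[1:h-1, 1:w-1]
    offsets = [(-1,-1), (-1,0), (-1,1), (0,1), (1,1), (1,0), (1,-1), (0,-1)]
    code = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for k, (dy, dx) in enumerate(offsets):
        neigh = plane[1+dy:h-1+dy, 1+dx:w-1+dx]
        code |= ((centre ^ neigh) << k).astype(np.uint8)  # parity bit k
    return code

img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
hist = np.bincount(parity_bitplane_code(img).ravel(), minlength=256)  # feature vector
```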
Satya Narayan, S. K. Vipparthi, and A. P. Mazumdar
IEEE
Hand gesture recognition is a vital aspect of robotic vision models. This paper presents a fusion-based approach for hand gesture recognition. In this approach, we first extract the Gaussian scale space of an image and compute features at different scales. Kirsch's convolution masks are then applied to the feature map. The aim of the proposed approach is to remove unwanted information and extract scale-, rotation-, and illumination-invariant patterns from hand gestures. The final feature vector is aggregated through the concatenation of multiscale histograms. A support vector machine classifier is trained on the extracted features. Moreover, we evaluate the efficiency of the proposed method by conducting experiments on three distinct databases, viz. Thomson, Bochum, and HGRI. The proposed method achieves classification accuracies of 94.25%, 92.77%, and 95.78%, respectively, on the investigated databases, outperforming the existing approaches for hand gesture recognition.
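The described pipeline (Gaussian scale space, Kirsch compass masks, concatenated multiscale histograms, then an SVM) maps naturally onto a short NumPy/SciPy sketch. The scale values, bin count, and normalization below are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve

def kirsch_masks():
    """The eight 3x3 Kirsch compass masks, generated by rotating the
    border ring of the base north mask."""
    ring = np.array([5, 5, 5, -3, -3, -3, -3, -3])   # clockwise from top-left
    idx = [(0,0), (0,1), (0,2), (1,2), (2,2), (2,1), (2,0), (1,0)]
    masks = []
    for r in range(8):
        m = np.zeros((3, 3))
        for v, (y, x) in zip(np.roll(ring, r), idx):
            m[y, x] = v
        masks.append(m)
    return masks

def multiscale_kirsch_descriptor(img, sigmas=(1.0, 2.0, 4.0), bins=16):
    """Hedged sketch of the fusion idea: Gaussian scale space -> Kirsch
    edge responses -> concatenated per-scale histograms."""
    feats = []
    for s in sigmas:
        smooth = gaussian_filter(img.astype(float), sigma=s)   # one scale
        edges = np.max([np.abs(convolve(smooth, m)) for m in kirsch_masks()],
                       axis=0)                                  # max compass response
        hist, _ = np.histogram(edges, bins=bins)
        feats.append(hist / max(hist.sum(), 1))                 # normalized histogram
    return np.concatenate(feats)                                # feed to an SVM

img = np.random.randint(0, 256, (64, 64))
print(multiscale_kirsch_descriptor(img).shape)                  # (48,)
```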