@iimranchi.ac.in
Assistant Professor, Information Systems & Business Analytics
Indian Institute of Management Ranchi
I am currently an Assistant Professor in the Information Systems & Business Analytics area at IIM Ranchi, India. I previously served as a Post-doctoral Fellow (PDF) in the Management Science Division of the Business School at the University of Edinburgh, UK. I received my Ph.D. from the Department of Industrial & Systems Engineering of IIT Kharagpur (India), and both my ME and BE degrees from the Department of Production Engineering of Jadavpur University (India). My research covers theoretical improvements to, and applications of, data analytics using machine learning (ML), data mining (DM), and Operations Research (OR) approaches. So far, I have published 19 journal papers, 21 book chapters, and 23 conference papers. I serve as a reviewer for 48 peer-reviewed top-tier journals, including Information Sciences, Automation in Construction, Applied Soft Computing, International Journal of Industrial Ergonomics, Computers & Industrial Engineering, and Safety Science.
Ph.D. (Jul, 2014 - Aug, 2019) - Department of Industrial & Systems Engineering, IIT Kharagpur, India
M.E. (Aug, 2012 - May, 2014) - Department of Production Engineering, Jadavpur University, India.
B.E. (Aug, 2005 - May, 2009) - Department of Production Engineering, Jadavpur University, India.
Information Systems, Machine Learning, Operations Research
Sobhan Sarkar, Arup Ratan Paramanik, and Biswajit Mahanty
Elsevier BV
Anima Pramanik, Sobhan Sarkar, and Sankar K. Pal
Elsevier BV
Subhajit Bag, Rahul Golder, Sobhan Sarkar, and Saptashwa Maity
Elsevier BV
Arup Ratan Paramanik, Sobhan Sarkar, and Bijan Sarkar
Elsevier BV
Saptashwa Maity, Soujatya Khan, and Sobhan Sarkar
IEEE
Twitter is one of the best places to learn how people feel about current affairs related to cryptocurrency forecasts. No robust model has been available for determining user preference among the major cryptocurrencies in the market on the basis of tweets made by major investors. Moreover, the statistical dependency and the feature importance of the various features that contribute to the prices of these cryptocurrencies have been a major concern too. This study proposes a novel two-phase robust hybrid approach for determining both user preference and the importance of the features that contribute to the pricing of the top cryptocurrencies in the market. It determines user preference on the basis of the subjectivity and polarity scores obtained from sentiment polarity classification of the tweets. It also determines both the statistical dependency and the feature importance of the contributing features with the help of SHapley Additive exPlanations (SHAP) scores. We have used two types of datasets for our detailed study. Additionally, knowledge graphs have been used to describe the capacity to recognise semantic data. With accuracy, precision, recall, F-1 score, and AUC-ROC values of 92%, 88%, 85%, 95%, and 94%, respectively, our proposed approach outperformed conventional machine learning techniques.
Sobhan Sarkar and Anima Pramanik
IEEE
In this study, a new measure of imbalance is introduced to compute the extent of imbalance for multi-class data. In the case of binary datasets, the Imbalance Ratio (IR) can be used to measure the amount of imbalance. However, in the case of multi-class datasets, since it only takes into account the frequency of the most frequent majority class and the least frequent minority class, it fails to encapsulate any properties of the intermediate classes. The Imbalance Degree (ID) was proposed to overcome the issues of IR by considering information from the intermediate classes as well. Nevertheless, it required choosing a distance metric that largely influenced the results and could lead to unfavorable results. It also assumed that the number of minority classes determined the extent of the imbalance, without considering their individual contributions, which is not correct. Thus, ID cannot be chosen as an authentic metric if this assumption is breached. Furthermore, another metric called the Likelihood Ratio Imbalance Degree (LRID) was proposed to make the metric independent of the number of minority classes in the data. However, it considered the imbalance to be directional and assumed both positive and negative values for the individual contributions from classes. In this study, we obtain a more authentic procedure to measure the extent of imbalance using statistical divergence from balanced class distributions.
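As a minimal illustration of the limitation described above, the Python sketch below computes the Imbalance Ratio for two hypothetical sets of class counts that share the same largest and smallest class but differ in their intermediate classes. For contrast, it also computes the Kullback-Leibler divergence from the uniform (balanced) distribution. This divergence is only an illustrative choice, and the function names and counts are assumptions; the abstract does not state which divergence the study actually uses.

```python
import math

def imbalance_ratio(counts):
    """IR: size of the largest class divided by the size of the smallest."""
    return max(counts) / min(counts)

def kl_from_balanced(counts):
    """KL divergence of the empirical class distribution from the uniform one.

    Illustrative choice of divergence only; not necessarily the study's measure.
    """
    total = sum(counts)
    k = len(counts)
    return sum((c / total) * math.log((c / total) * k) for c in counts if c > 0)

# Hypothetical class counts: same extremes, different intermediate classes.
counts_a = [100, 100, 100, 10]
counts_b = [100, 10, 10, 10]

ir_a, ir_b = imbalance_ratio(counts_a), imbalance_ratio(counts_b)
kl_a, kl_b = kl_from_balanced(counts_a), kl_from_balanced(counts_b)

print(ir_a, ir_b)  # identical, because IR only looks at the extremes
print(kl_a, kl_b)  # different, because every class contributes
```

In this toy case IR cannot tell the two datasets apart, while a divergence from the balanced distribution does, and it is non-negative by construction.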
Shisam Bhattacharyya, Sobhan Sarkar, Bishal Dey Sarkar, and Ramkrishna Manatkar
Institute of Electrical and Electronics Engineers (IEEE)
Prasanta Kumar Dey, Soumyadeb Chowdhury, Amelie Abadie, Emilia Vann Yaroson, and Sobhan Sarkar
Informa UK Limited
Sobhan Sarkar, Anima Pramanik, and J. Maiti
Elsevier BV
Subasish Das, Eun Sug Park, and Sobhan Sarkar
Informa UK Limited
Sobhan Sarkar, Numan Ejaz, J. Maiti, and Anima Pramanik
Springer Science and Business Media LLC
Sobhan Sarkar, Sammangi Vinay, Chawki Djeddi, and J. Maiti
Springer Science and Business Media LLC
Classifying or predicting occupational incidents using both structured and unstructured (text) data is an unexplored area of research. Unstructured texts, i.e., incident narratives, are often unutilized or underutilized. Besides the explicit information, a large amount of hidden information is present in a dataset, which cannot be explored by traditional machine learning (ML) algorithms. Few studies examine the use of deep neural networks (DNNs) in the domain of incident prediction, or their parameter optimization for achieving better predictive power. To address these issues, key terms are first extracted from the unstructured texts using LDA-based topic modeling. These key terms are then added to the predictor categories to form the feature vector, which is further processed for noise reduction and fed to the adaptive moment estimation (ADAM)-based DNN (i.e., ADNN) for classification, as ADAM is superior to GD, SGD, and RMSProp. To evaluate the effectiveness of our proposed method, a comparative study has been conducted against several state-of-the-art methods on five benchmark datasets. Moreover, a case study of an integrated steel plant in India has been presented to validate the proposed model. Experimental results reveal that ADNN outperforms the other methods in terms of accuracy. Therefore, the present study offers a robust methodological guide that enables us to handle the issues of unstructured data and hidden information when developing a predictive model.
Arup Ratan Paramanik, Sobhan Sarkar, and Bijan Sarkar
Elsevier BV
Sobhan Sarkar, Numan Ejaz, J. Maiti, and Anima Pramanik
Springer Science and Business Media LLC
Maria Rella Riccardi, Filomena Mauriello, Sobhan Sarkar, Francesco Galante, Antonella Scarano, and Alfonso Montella
MDPI AG
The study aims to investigate the factors that are associated with fatal and severe vehicle–pedestrian crashes in Great Britain by developing four parametric models and five non-parametric tools to predict the crash severity. Even though the models have already been applied to model the pedestrian injury severity, a comparative analysis to assess the predictive power of such modeling techniques is limited. Hence, this study contributes to the road safety literature by comparing the models by their capabilities of identifying the significant explanatory variables, and by their performances in terms of the F-measure, the G-mean, and the area under curve. The analyses were carried out using data that refer to the vehicle–pedestrian crashes that occurred in the period of 2016–2018. The parametric models confirm their advantages in offering easy-to-interpret outputs and understandable relations between the dependent and independent variables, whereas the non-parametric tools exhibited higher classification accuracies, identified more explanatory variables, and provided insights into the interdependencies among the factors. The study results suggest that the combined use of parametric and non-parametric methods may effectively overcome the limits of each group of methods, with satisfactory prediction accuracies and the interpretation of the factors contributing to fatal and serious crashes. In the conclusion, several engineering, social, and management pedestrian safety countermeasures are recommended.
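For readers unfamiliar with the performance measures named above, the following Python sketch computes the F-measure and the G-mean from a binary confusion matrix, using their common definitions (harmonic mean of precision and recall, and geometric mean of sensitivity and specificity). The counts are hypothetical and are not taken from the study.

```python
import math

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def g_mean(tp, fp, fn, tn):
    """Geometric mean of sensitivity (recall) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# Hypothetical confusion matrix for a rare "severe crash" class.
tp, fp, fn, tn = 30, 20, 10, 940

f1 = f_measure(tp, fp, fn)
gm = g_mean(tp, fp, fn, tn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f1, gm, accuracy)
```

With a rare positive class, plain accuracy stays high even when many positives are missed or misflagged, which is one reason imbalance-aware measures such as these are often reported alongside it.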
Anima Pramanik, Kavya Venkatagiri, Sobhan Sarkar, and Sankar K. Pal
IEEE
One of the most concerning safety hazards for elderly people is abnormal falls in public places. Vision-based fall detection using ambient cameras is a popular non-intrusive solution. Recent research uses Slow Feature Analysis (SFA), which can efficiently learn slowly varying, invariant shape features from input signals. Another popular recent approach to motion detection is deep learning. However, fall events in actual cases are diverse, which complicates the detection task. Additionally, it is difficult to acquire fall-related data; hence, fall events are simulated to generate a training dataset, which results in smaller datasets. Considering these complications, we present a novel method that combines SFA, deep learning models, namely a Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM), and a rule base. The CNN is used to extract the object region, thereby reducing the region of interest (RoI). Two shape features, the aspect ratio and the area of the RoI, are fed as input to the LSTM to retrieve the temporal information, which is further used for rule generation, thereby increasing the detection accuracy. The efficacy of the proposed method for various features, namely aspect ratio, area, and aspect ratio+area, is demonstrated on the UR Fall data with accuracies of 95.2%, 93.8%, and 96.36%, respectively.
Shashank Sadafule, Sobhan Sarkar, and Shaomin Wu
IEEE
The performance of classification models is often measured using the area under the curve (AUC). The non-parametric estimate of this metric only considers the ranks of the test instances and fails to consider the predicted scores of the model. Consequently, not all the valuable information about the model’s output is utilized. To address this issue, the present paper introduces a new metric, called Gamma AUC (G-AUC), that takes into account both ranks and scores. The parameter G tackles the problem of overfitting scores into the metric. To validate the proposed metric, we tested it on 20 UCI datasets with 10 state-of-the-art models. Of all the values of the parameter G that we tested, four yielded a p-value below 0.05 for the alternative hypothesis that G-AUC on the training sets correlates more strongly with AUC on the test sets than training-set AUC itself does. Furthermore, for all values of G considered, G-AUC selected the better model more often than AUC did.
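The rank-only behaviour of the non-parametric AUC estimate can be seen in a short Python sketch. It computes AUC as the fraction of positive-negative pairs that are ordered correctly (ties count one half) for two hypothetical models that rank six instances identically but with very different score margins. The function name, labels, and scores are illustrative assumptions; G-AUC itself is not implemented here because the abstract does not give its formula.

```python
def rank_auc(labels, scores):
    """Non-parametric AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
# Two hypothetical models with the same ordering but different margins.
confident = [0.95, 0.90, 0.60, 0.55, 0.10, 0.05]
hesitant = [0.53, 0.52, 0.51, 0.50, 0.49, 0.48]

auc_confident = rank_auc(labels, confident)
auc_hesitant = rank_auc(labels, hesitant)

print(auc_confident, auc_hesitant)  # equal: only the ordering matters
```

Both models receive the same AUC even though one separates the classes by wide margins and the other barely separates them at all, which is the information loss the abstract refers to.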
Subhajit Bag, Anmol Kumar, and Sobhan Sarkar
IEEE
Session-based recommender systems have evolved as a new paradigm in recent years, intending to capture short-term yet dynamic user preferences to give more timely and accurate suggestions that are responsive to changes in session context. However, the sparsity of user-item interaction data has been a significant issue, as a colossal amount of memory is needed to store such sparse data. Seasonality is another major issue in recommendation systems, as customers’ interests vary considerably across different time intervals. In our study, we resolve the above-mentioned issues by using graph collaborative filtering and creating feature bins. As a case study, we used sequential data from YooChoose customers to validate the efficacy of our proposed methodology. Further, we use five state-of-the-art graph neural network models to get the best recommendation. The performance of those models is evaluated using the NDCG (Normalized Discounted Cumulative Gain) and ROC-AUC (Area under the Receiver operating characteristic curve) metrics. We find that a Residual Gated Convolutional Neural Network with four layers and the Adam optimizer gave the best recommendations.
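To make the NDCG metric mentioned above concrete, here is a minimal Python sketch using one common formulation (linear gain with a log2 position discount). The relevance list is a hypothetical ranked session result, not data from the study, and the study may use a different gain or cutoff.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical ranked list: 1 = item the user interacted with, 0 = not.
ranked = [0, 1, 0, 1]

score = ndcg(ranked)
perfect = ndcg([1, 1, 0, 0])

print(score, perfect)
```

The score falls below 1 whenever relevant items appear lower in the list than they could, so it rewards models that place the right items near the top.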
Varun Balakrishna, Subhajit Bag, and Sobhan Sarkar
IEEE
Online product reviews are now common on e-commerce platforms. Before making a purchase, people frequently consult product reviews to assess the quality of the item. However, the review system has been seriously harmed by a huge number of review spammers, who frequently cooperate to promote or denigrate specific products. Earlier research uses machine learning techniques to identify singleton suspicious reviews and reviewers without considering the meta-data. In this study, we utilise the meta-data of consumers’ reviews to identify review spammer organisations using state-of-the-art community detection techniques. Due to the diversity of behavioural indicators, group spammers are challenging to identify. We propose that clustering the singleton spammers using the meta-data (location and time) of the reviews is the key to identifying group spammers (and their fraudulent reviews). We propose filling out the review-product matrix using the product and review information and text. We then use this matrix to deduce the hidden reviewer-product connections, which addresses the absence of explicit behavioural signals for singleton reviewers. Subsequently, we build a bipartite graph using the review-product matrix. Experiments on a real-world Yelp dataset, using review meta-data that existing algorithms frequently overlook, demonstrated the effectiveness of our methodology in detecting group spammers.
Aditya Kumar Singh, Rahul Golder, and Sobhan Sarkar
IEEE
In Consumer Review Analysis (CRA), identifying the context of reviews is of paramount importance. Businesses need to supply their underlying sectors with a structured and classified list of the consumer feedback available on various online platforms. However, reviews and feedback are generally available in a very unorganized manner and need to be tagged and distributed properly to the appropriate sectors. To address the problem, we propose a comprehensive model that employs sequential clustering, sentiment prediction, and subsequent ranking of reviews. To validate the proposed model, data from a Samsung smartphone manufacturing firm was used. The robustness and stability of our model have been examined through different performance indices: the Silhouette Index (SI), the Davies-Bouldin Index (DBI), and the Calinski-Harabasz Score (CHS). Our analysis shows a distinct categorization of reviews based on their contexts, with minimal noise in the classification measures. Our custom coefficient, the Relevant Voting Score (RVS), has been found to rank the reviews in an accurate priority list, thereby helping the sectors to consider only the most important customer feedback.
Subhajit Bag, Saptashwa Maity, and Sobhan Sarkar
IEEE
Distracted driving plays a pivotal role in road accidents. Therefore, prediction of the crash severity due to distracted driving is essential. Although several machine learning techniques exist for such prediction, it is difficult to use them in case of the unavailability of class labels and class imbalance issues. Moreover, there is a severe lack of research considering environmental factors and driver’s behaviour to predict the crash severity. To address the issues, in this study, a robust two-phase ensemble prediction model has been developed, considering the geolocation information and driver’s behaviour. An analysis of the unlabeled and high-dimensional data is generally challenging. We perform dimensionality reduction using t-SNE, followed by agglomerative hierarchical clustering to get labelled data. We have used Synthetic Minority Over-sampling Technique (SMOTE) to mitigate the class imbalance issue. Subsequently, we observe that some localities have much more severe crashes, so we develop a feature considering the geolocation information. Then, we create a novel predictor called Robust Two-Phase Ensemble Predictor (R2PEP) to predict the crash severity. The performance of the proposed model has been compared with five state-of-the-art algorithms using a dataset we obtained from the Nevada Department of Transportation. The comparison demonstrates the superiority of our model over the other models, with an accuracy of 99.6%.
Piran Karkaria, Rahul Golder, and Sobhan Sarkar
IEEE
Combating fake news on social media is a critical challenge in today's digital age, especially when misinformation is spread regarding vital matters such as the Covid-19 pandemic. Manual verification of all content is infeasible; hence, Artificial Intelligence is used to classify fake news. Our ensemble model uses multiple Natural Language Processing techniques to analyze the truthfulness of the text in tweets. We create custom parameters that analyze the consistency and truthfulness of domains contained in hyperlinked URLs. We then combine these parameters with the results of our deep learning models to achieve classification with greater than 99% accuracy. We have proposed a novel method to calculate a custom coefficient, the Combined Metric of Prediction Uncertainty (CMPU), which is a measure of how uncertain the model is of its classification of a given tweet. Using CMPU, we have proposed the creation of a priority queue following which the tweets classified with the lowest certainty can be manually verified. By manually verifying 3.93% of tweets, we were able to improve the accuracy from 99.02% to 99.77%.
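The idea of queueing the least certain classifications for manual review can be sketched in Python with the standard-library heapq module. The uncertainty function below is a simple stand-in based on how close the predicted probability is to 0.5; it is not the CMPU described above, whose formula the abstract does not give. The tweet identifiers, probabilities, and review budget are all hypothetical.

```python
import heapq

def uncertainty(p_fake):
    """Stand-in uncertainty: 0 when p is 0 or 1, 1 when p is 0.5.

    Hypothetical placeholder, not the CMPU coefficient from the abstract.
    """
    return 1.0 - abs(2.0 * p_fake - 1.0)

# Hypothetical model outputs: predicted probability that each tweet is fake.
predictions = {"tweet_a": 0.99, "tweet_b": 0.52, "tweet_c": 0.03, "tweet_d": 0.40}

# heapq is a min-heap, so negate the uncertainty to pop the least certain first.
queue = [(-uncertainty(p), tweet_id) for tweet_id, p in predictions.items()]
heapq.heapify(queue)

budget = 2  # number of tweets a human reviewer can check
to_verify = [heapq.heappop(queue)[1] for _ in range(budget)]

print(to_verify)
```

Spending the manual-review budget on the least certain items first is what allows a small share of verified tweets to produce a measurable gain in overall accuracy.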
Anima Pramanik, Sobhan Sarkar, Chawki Djeddi, and J. Maiti
Springer International Publishing
Saptashwa Maity, Arjav Rastogi, Chawki Djeddi, Sobhan Sarkar, and J. Maiti
Springer International Publishing
Sobhan Sarkar, Anima Pramanik, J. Maiti, and Genserik Reniers
Elsevier BV