Department of Artificial Intelligence
Wroclaw University of Science and Technology
I'm an Assistant Professor in the Department of Artificial Intelligence at the Wroclaw University of Science and Technology, where I earned both my Ph.D. in computer science (2018) and my M.Sc. Eng. degree (2012). I serve as the AI/ML Team Leader and Senior ML/NLP Data Scientist for the CLARIN-BIZ project. I have worked on natural language processing (NLP) for over a decade, with a particular interest in machine learning techniques. I have authored over 60 scientific papers, presented at prominent conferences including ACL and ICDM. My current work focuses on deep learning models for subjective tasks such as emotion and sentiment analysis, as well as cross-lingual knowledge transfer and language-agnostic models. My contributions have been integral to projects such as CrisisDetector, StockBrief, Sentimenti, and CLARIN-PL. I enjoy teaching data science, AI's role in NLP, and the construction of sophisticated deep neural network models.
Computer Science, Artificial Intelligence, Computer Science Applications, Signal Processing
Teddy Ferdinan and Jan Kocoń
Elsevier BV
Mateusz Kochanek, Igor Cichecki, Oliwier Kaszyca, Dominika Szydło, Michał Madej, Dawid Jędrzejewski, Przemysław Kazienko, and Jan Kocoń
MDPI AG
The rapid evolution of large language models (LLMs), in particular OpenAI's GPT-3.5-turbo and GPT-4, indicates a growing interest in advanced computational methodologies. This paper proposes a novel approach to synthetic data generation and knowledge distillation through prompt engineering. The potential of LLMs is used to address the problem of unbalanced training datasets for other machine learning models, a common issue that is also a crucial determinant of final model quality and performance. Three prompting strategies are considered: basic, composite, and similarity prompts. Although the initial results do not match the performance of models trained on comprehensive datasets, the similarity-prompt method shows considerable promise and outperforms the other strategies. Our investigation of these rebalancing methods opens pathways for future research on leveraging continuously developed LLMs for the generation of high-quality synthetic data, which could have an impact on many large-scale engineering applications.
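The abstract above does not spell out the three prompting strategies; as a loose illustration of the general rebalancing idea only, here is a minimal pure-Python sketch in which a stubbed `generate()` stands in for an LLM call and `build_similarity_prompt()` is a hypothetical helper, not the paper's actual prompts.

```python
from collections import Counter

def build_similarity_prompt(examples, label):
    # Show existing minority-class texts and ask for one more like them.
    shown = "\n".join(f"- {t}" for t in examples)
    return (f"Here are examples labeled '{label}':\n{shown}\n"
            f"Write one new, similar sentence with the same label.")

def rebalance(dataset, generate):
    """Upsample every minority class to the majority-class size.

    `dataset` is a list of (text, label) pairs; `generate` maps a prompt
    to one synthetic text (an LLM API call in the real setting).
    """
    counts = Counter(label for _, label in dataset)
    target = max(counts.values())
    synthetic = []
    for label, n in counts.items():
        seeds = [t for t, lab in dataset if lab == label]
        for _ in range(target - n):
            prompt = build_similarity_prompt(seeds, label)
            synthetic.append((generate(prompt), label))
    return dataset + synthetic

# Stub in place of a real LLM call.
fake_llm = lambda prompt: "synthetic example"
data = [("great film", "pos"), ("loved it", "pos"), ("awful", "neg")]
balanced = rebalance(data, fake_llm)
```

With a real model, `generate` would wrap an API call and the synthetic texts would be filtered for quality before training; the sketch only shows the class-count bookkeeping.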
Jan Kocoń, Igor Cichecki, Oliwier Kaszyca, Mateusz Kochanek, Dominika Szydło, Joanna Baran, Julita Bielaniewicz, Marcin Gruza, Arkadiusz Janz, Kamil Kanclerz, et al.
Elsevier BV
Olga Czeranowska, Karol Chlasta, Piotr Miłkowski, Izabela Grabowska, Jan Kocoń, Krzysztof Hwaszcz, Jan Wieczorek, and Agata Jastrzębowska
Elsevier BV
Przemysław Kazienko, Julita Bielaniewicz, Marcin Gruza, Kamil Kanclerz, Konrad Karanowski, Piotr Miłkowski, and Jan Kocoń
Elsevier BV
Jan Kocoń
IEEE
Sentiment analysis involves using WordNets enriched with emotional metadata, which are valuable resources. However, manual annotation is time-consuming and expensive, resulting in only a few WordNet Lexical Units being annotated. This paper introduces two new techniques for automatically propagating sentiment annotations from a partially annotated WordNet to its entirety and to a WordNet in a different language: Multilingual Structured Synset Embeddings (MSSE) and Cross-Lingual Deep Neural Sentiment Propagation (CLDNS). We evaluated the proposed MSSE+CLDNS method extensively using Princeton WordNet and Polish WordNet, which have many inter-lingual relations. Our results show that the MSSE+CLDNS method outperforms existing propagation methods, indicating its effectiveness in enriching WordNets with emotional metadata across multiple languages. This work provides a solid foundation for large-scale, multilingual sentiment analysis and is valuable for academic research and practical applications.
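The MSSE and CLDNS methods above are neural; as a simplified, hypothetical illustration of the underlying idea of spreading sentiment annotations through WordNet relations, here is a toy label-propagation sketch (the function, graph, and scores are made up for illustration, not the paper's method).

```python
def propagate_sentiment(edges, seeds, iterations=20):
    """Propagate polarity scores from seed synsets over relation edges.

    `edges`: dict mapping each synset to its list of related synsets
    (e.g. hypernymy or inter-lingual links);
    `seeds`: dict mapping annotated synsets to a polarity in [-1, 1].
    Seeds stay fixed; each unlabeled node repeatedly takes the mean
    polarity of its neighbors until the scores settle.
    """
    scores = {node: seeds.get(node, 0.0) for node in edges}
    for _ in range(iterations):
        updated = {}
        for node, nbrs in edges.items():
            if node in seeds:
                updated[node] = seeds[node]  # annotated units are fixed
            elif nbrs:
                updated[node] = sum(scores[m] for m in nbrs) / len(nbrs)
            else:
                updated[node] = scores[node]
        scores = updated
    return scores

# Tiny chain a - b - c: b sits between a positive and a negative seed.
scores = propagate_sentiment(
    {"a": ["b"], "b": ["a", "c"], "c": ["b"]},
    {"a": 1.0, "c": -1.0},
)
```

In this toy graph the unannotated node "b" averages its two seed neighbors and settles at 0.0, i.e. neutral; the actual MSSE+CLDNS pipeline learns embeddings and a deep classifier rather than averaging.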
Stanisław Woźniak and Jan Kocoń
IEEE
In the era of artificial intelligence, data is gold but costly to annotate. This paper demonstrates a groundbreaking solution to this dilemma using ChatGPT for text augmentation in sentiment analysis. We leverage ChatGPT's generative capabilities to create synthetic training data that significantly improves the performance of smaller models, making them competitive with, and sometimes superior to, their larger counterparts. This enables models to be both efficient and effective, reducing computational cost, inference time, and memory usage without compromising quality. Our work marks a key advancement in the cost-effective development and deployment of robust sentiment analysis models.
Kamil Kanclerz, Julita Bielaniewicz, Marcin Gruza, Jan Kocoń, Stanisław Woźniak, and Przemysław Kazienko
IEEE
Data annotated by humans is a source of knowledge: it describes the peculiarities of the problem and thereby fuels the decision process of the trained model. Unfortunately, the annotation process for subjective natural language processing (NLP) problems like offensiveness or emotion detection is often very expensive and time-consuming. One inevitable risk is spending part of the funds and annotator effort on annotations that provide no additional knowledge about the specific task. To minimize these costs, we propose a new model-based approach that selects the tasks to be annotated individually for each text in a multi-task scenario. Experiments carried out on three datasets, dozens of NLP tasks, and thousands of annotations show that our method allows up to a 40% reduction in the number of annotations with negligible loss of knowledge. The results also emphasize that the diversity of data required to train a model efficiently depends on the subjectivity of the annotation task. We also measured the relations between subjective tasks by evaluating the model in single-task and multi-task scenarios. Moreover, for some datasets, training only on the labels predicted by our model improved the efficiency of task selection, acting as a self-supervised learning regularization technique.
Piotr Miłkowski, Konrad Karanowski, Patryk Wielopolski, Jan Kocoń, Przemysław Kazienko, and Maciej Zięba
IEEE
Designing predictive models for subjective problems in natural language processing (NLP) remains challenging, mainly because of their non-deterministic nature and the different ways different humans perceive the same content. This can be addressed by Personalized Natural Language Processing (PNLP), where the model exploits additional information about the reader to make more accurate predictions. However, current approaches require complete information about the recipients to be embedded directly, and recent methods focus on deterministic inference or simple frequency-based estimations of the probabilities. In this work, we overcome these limitations by proposing a novel approach that captures the uncertainty of the forecast using conditional normalizing flows. This allows us to model complex multimodal distributions and to compare various models using negative log-likelihood (NLL). In addition, the available sampling function allows for various interpretations of possible reader perception. We validated our method on three challenging, subjective NLP tasks, including emotion recognition and hate speech detection. A comparative analysis of generalized and personalized approaches revealed that our personalized solutions significantly outperform the baseline and provide more precise uncertainty estimates. We also present the impact on text interpretability and uncertainty studies. The information brought by the developed methods makes it possible to build hybrid models whose effectiveness surpasses classic solutions. Finally, we analyze and visualize the decision probabilities for texts with a high entropy of annotations and for annotators with mixed views.
Bartłomiej Koptyra, Anh Ngo, Łukasz Radliński, and Jan Kocoń
Springer Nature Switzerland
Jan Kocoń, Joanna Baran, Kamil Kanclerz, Michał Kajstura, and Przemysław Kazienko
Springer Nature Switzerland
Małgorzata Wierzba, Monika Riegel, Jan Kocoń, Piotr Miłkowski, Arkadiusz Janz, Katarzyna Klessa, Konrad Juszczyk, Barbara Konat, Damian Grimling, Maciej Piasecki, et al.
Springer Science and Business Media LLC
Emotion lexicons are useful in research across various disciplines, but the availability of such resources remains limited for most languages. While existing emotion lexicons typically comprise words, it is a particular meaning of a word (rather than the word itself) that conveys emotion. To mitigate this issue, we present the Emotion Meanings dataset, a novel dataset of 6000 Polish word meanings. The word meanings are derived from the Polish wordnet (plWordNet), a large semantic network interlinking words by means of lexical and conceptual relations. The word meanings were manually rated for valence and arousal, along with a variety of basic emotion categories (anger, disgust, fear, sadness, anticipation, happiness, surprise, and trust). The annotations were found to be highly reliable, as demonstrated by the similarity between data collected in two independent samples: unsupervised (n = 21,317) and supervised (n = 561). Although we found the annotations to be relatively stable for female, male, younger, and older participants, we share both summary data and individual data to enable emotion research on different demographically specific subgroups. The word meanings are further accompanied by the relevant metadata, derived from open-source linguistic resources. Direct mapping to Princeton WordNet makes the dataset suitable for research on multiple languages. Altogether, this dataset provides a versatile resource that can be employed for emotion research in psychology, cognitive science, psycholinguistics, computational linguistics, and natural language processing.
Joanna Baran and Jan Kocoń
IEEE
Neuro-symbolic approaches explore ways to combine neural networks with traditional symbolic knowledge. These methods are gaining attention due to their efficiency and their need for less data compared to currently used deep models. This work investigated several neuro-symbolic models for sentiment analysis, focusing on a variety of ways to add linguistic knowledge to a transformer-based architecture. English and Polish WordNets were used as a knowledge source, together with their polarity extensions (SentiWordNet, plWordNet Emo). The neuro-symbolic methods using knowledge during fine-tuning were neither better nor worse than the baseline model. However, a statistically significant gain of about three percentage points in F1-macro was obtained for the SentiLARE model, which applied domain data (word sentiment labels) already at the pretraining stage; the gain was most visible for medium-sized training sets. Developing an effective neuro-symbolic model is therefore not trivial. The conclusions drawn from this work indicate a further need for a detailed study of these approaches, especially in natural language processing. In the context of sentiment classification, this could help design more efficient AI systems that can be deployed in business or marketing.
Joanna Szolomicka and Jan Kocoń
IEEE
The paper addresses the important problem of multilingual and language-agnostic approaches to the aspect-based sentiment analysis (ABSA) task, using modern transformer-based approaches. We propose a new dataset based on automatic translation of the Polish AspectEmo dataset, together with cross-lingual transfer of tags describing aspect polarity. The result is the MultiAspectEmo dataset, translated into five other languages: English, Czech, Spanish, French, and Dutch. We also present the original TrAsp (Transformer-based Aspect Extraction and Classification) method, which significantly outperforms methods from the literature on the ABSA task. In addition, we present multilingual and language-agnostic variants of this method, evaluated on the MultiAspectEmo and SemEval2016 datasets. We also test various language models for the ABSA task, including compressed models that give promising results while significantly reducing inference time and memory usage.
Wojciech Korczynski and Jan Kocoń
IEEE
Transformer models like BERT have significantly improved performance on many NLP tasks, e.g., sentiment analysis. However, their large number of parameters makes real-world applications difficult because of computational costs and latency. Many compression methods based on quantization, weight pruning, and knowledge distillation have been proposed to solve this problem. In this work, we explore several of these task-specific and task-agnostic methods by comparing their effectiveness and quality on the MultiEmo sentiment analysis dataset. Additionally, we analyze their ability to generalize and capture sentiment features by conducting domain-sentiment experiments. The results show that the compression methods reduce model size by 8.6 times and inference time by 6.9 times compared to the original model while maintaining unimpaired quality. Smaller models perform better on tasks with less data and retain greater generalization ability after fine-tuning because they are less prone to overfitting. The best trade-off is obtained with the task-agnostic XtremeDistil model.
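As a toy illustration of one of the compression families mentioned above (quantization), here is a pure-Python sketch of symmetric 8-bit weight quantization; the helper names are made up for illustration, and real pipelines rely on library support with integer kernels rather than Python lists.

```python
def quantize_int8(weights):
    """Map float weights to int8 values with a single shared scale.

    Symmetric quantization: the largest absolute weight maps to 127,
    so every weight is stored as round(w / scale) with |q| <= 127.
    """
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero input
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [x * scale for x in q]

w = [0.5, -1.27, 0.03, 1.0]
q, s = quantize_int8(w)
recovered = dequantize(q, s)

# Rounding guarantees each recovered weight is within half a step of the original.
assert all(abs(a - b) <= s / 2 + 1e-12 for a, b in zip(w, recovered))
```

Storing `q` as int8 takes a quarter of the memory of float32 weights; the quality-preserving results reported in the abstract come from far more careful schemes (per-channel scales, distillation), not from this naive version.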
Julita Bielaniewicz, Kamil Kanclerz, Piotr Miłkowski, Marcin Gruza, Konrad Karanowski, Przemysław Kazienko, and Jan Kocoń
IEEE
As humans, we experience a wide range of feelings and reactions. One of these is laughter, often related to a personal sense of humor and the perception of funny content. Due to its subjective nature, recognizing humor is a very challenging NLP task. Here, we present a new, personalized approach to predicting humor in text that takes into account both the text and the context of the content receiver. For that purpose, we proposed four Deep-SHEEP learning models that exploit user preference information in different ways. The experiments were conducted on four datasets: Cockamamie, HUMOR, Jester, and Humicroedit. The results show that applying an innovative personalized, user-centric perspective significantly improves performance compared to generalized methods. Moreover, even for random text embeddings, our personalized methods outperform the generalized ones in the subjective humor modeling task. We also argue that user-related data reflecting an individual sense of humor is as important as the evaluated text itself. Different types of humor were investigated as well.
Karol Gawron, Konrad Wojtasik, Bartłomiej Bojanowski, Arkadiusz Janz, Jan Kocoń, Tomasz Krupa, Agnieszka Kukałowicz, Piotr Miłkowski, Maciej Piasecki, Michał Pogoda, et al.
Springer International Publishing