The advantages of our proposed Saturn coefficient over continuity and trustworthiness for UMAP dimensionality reduction evaluation Davide Chicco, Simone Melzi, Francesca Gasparini, Giuseppe Jurman PeerJ Computer Science, 2026 Understanding the structure of a dataset is an easy task when the dimensions are two or three, but it can become extremely difficult when a dataset consists of tens, hundreds, or thousands of variables. Dimensionality reduction methods are computational techniques with solid mathematical foundations that allow for the projection of high-dimensional datasets into smaller data spaces. These low-dimensional representations of the original data, usually consisting of two variables, can then be plotted and inspected by researchers to gain an understanding of the original data structure. Uniform Manifold Approximation and Projection (UMAP) is one of the most effective and popular algorithms for dimensionality reduction, and has been proven effective on biomedical datasets, in particular. Even though UMAP is commonly utilized by thousands of researchers worldwide, no consensus has been reached on how to assess the output of dimensionality reduction informatively: to date, researchers often evaluate UMAP’s outcomes by eyeballing its two-dimensional plots each time. Of course, this approach is rather arbitrary, as different individuals might interpret a 2D plot in a different way. Some numerical coefficients for assessing UMAP’s conservation of global and local structure exist (continuity and trustworthiness, respectively), but they suffer from several flaws and can be misleading in multiple cases. To address these issues, we present here our Saturn coefficient, a new simple statistical metric that expresses the conservation of local structure and the conservation of global structure in UMAP through a real value ranging from 0 (no preservation) to 1 (complete preservation).
In this study, we describe the rationale behind our Saturn coefficient and validate its results compared to continuity and trustworthiness on four artificial datasets and ten real-world biomedical datasets. Additionally, we propose a novel validation procedure based on the preservation of the clusters found by HDBSCAN (hierarchical density-based spatial clustering of applications with noise) in the original dataset within its dimensionality reduction representation (HDBSCANess). Our results demonstrate the validity of our Saturn coefficient across all artificial datasets and in seven out of ten real-world biomedical datasets. We therefore recommend the use of our Saturn coefficient to anyone wishing to assess UMAP results: our statistic, for example, can be used to test several sets of UMAP hyperparameters and to select the best configuration among them. Moreover, we also provide the software implementation of our Saturn coefficient as a standalone R package openly available on CRAN at https://doi.org/10.32614/CRAN.package.SaturnCoefficient and as a standalone Python package openly available on PyPI at https://pypi.org/project/SaturnScore .
Coherent cross-modal generation of synthetic biomedical data to advance multimodal precision medicine Raffaele Marchesi, Nicolò Lazzaro, Walter Endrizzi, Gianluca Leonardi, Matteo Pozzi, et al. PLOS Computational Biology, 2026 Integration of multimodal, multi-omics data is critical for advancing precision medicine, yet its application is frequently limited by incomplete datasets where one or more modalities are missing. To address this challenge, we developed a generative framework capable of synthesizing any missing modality from an arbitrary subset of available modalities. We introduce Coherent Denoising, a novel ensemble-based generative diffusion method that aggregates predictions from multiple specialized, single-condition models and enforces consensus during the sampling process. We compare this approach against a multi-condition generative model that uses a flexible masking strategy to handle arbitrary subsets of inputs. The results show that our architectures successfully generate high-fidelity data that preserve the complex biological signals required for downstream tasks. We demonstrate that the generated synthetic data can be used to maintain the performance of predictive models on incomplete patient profiles and can leverage counterfactual analysis to guide the prioritization of diagnostic tests. We validated the framework’s efficacy on a large-scale multimodal, multi-omics cohort from The Cancer Genome Atlas (TCGA) of over 10,000 samples spanning 20 tumor types, using data modalities such as copy-number alterations (CNA), transcriptomics (RNA-Seq), proteomics (RPPA), and histopathology (WSI). This work establishes a robust and flexible generative framework to address sparsity in multimodal datasets, providing a key step toward improving precision oncology.
Comment on “Using genomic data and machine learning to predict antibiotic resistance: A tutorial paper” Davide Chicco, Giuseppe Jurman PLOS Computational Biology, 2025 A recent study by Faye Orcales and colleagues proposes a teaching curriculum on supervised machine learning applied to genomics data aimed at predicting antibiotic resistance. The article describes a traditional machine learning pipeline step-by-step in a way that is accessible to anyone, including novices. However, the authors provide a misleading piece of advice in the “Evaluating model performance” section, where they recommend that readers use accuracy and the F1 score for binary classification. We write this short formal comment on that article to reaffirm and explain why accuracy and the F1 score should be avoided in the evaluation of binary classification and why the Matthews correlation coefficient (MCC) should be employed instead. We also take this opportunity to warn readers about the dangers of k-fold cross-validation, which is suggested as a standard method for dividing data into training set and test set, but has several flaws and pitfalls.
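The contrast between accuracy, the F1 score, and the MCC described in the entry above can be illustrated with a minimal sketch. It is not taken from the comment itself: the function name `binary_metrics` and the confusion-matrix counts are hypothetical, chosen only to show how an imbalanced test set can yield high accuracy and F1 while the MCC stays close to zero. Returning 0.0 when a denominator is zero is a common convention and is an assumption here.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, F1 score, MCC) for a 2x2 confusion matrix."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total

    # F1 ignores true negatives entirely.
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0

    # MCC uses all four cells of the confusion matrix.
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Assumption: report 0.0 when the MCC is undefined (zero denominator).
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return accuracy, f1, mcc


# Hypothetical imbalanced test set: 95 positives and 5 negatives,
# with a classifier that labels almost every sample as positive.
accuracy, f1, mcc = binary_metrics(tp=90, fp=4, tn=1, fn=5)
print(f"accuracy = {accuracy:.3f}")
print(f"F1 score = {f1:.3f}")
print(f"MCC      = {mcc:.3f}")
```

In this illustrative case only one of the five negative samples is classified correctly, which accuracy and F1 barely register because positives dominate the data, whereas the MCC is high only when both classes are predicted well.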
The Venus score for the assessment of the quality and trustworthiness of biomedical datasets Davide Chicco, Alessandro Fabris, Giuseppe Jurman BioData Mining, 2025 Biomedical datasets are the mainstays of computational biology and health informatics projects, and can be found on multiple data platforms online or obtained from wet-lab biologists and physicians. The quality and the trustworthiness of these datasets, however, can sometimes be poor, producing bad results in turn, which can harm patients and data subjects. To address this problem, policy-makers, researchers, and consortia have proposed diverse regulations, guidelines, and scores to assess the quality and increase the reliability of datasets. Although generally useful, they are often incomplete and impractical. The guidelines of Datasheets for Datasets, in particular, are too numerous; the requirements of the Kaggle Dataset Usability Score focus on non-scientific requisites (for example, including a cover image); and the European Union Artificial Intelligence Act (EU AI Act) sets forth sparse and general data governance requirements, which we tailored to datasets for biomedical AI. Against this backdrop, we introduce our new Venus score to assess the data quality and trustworthiness of biomedical datasets. Our score ranges from 0 to 10 and consists of ten questions that anyone developing a bioinformatics, medical informatics, or cheminformatics dataset should answer before the release. In this study, we first describe the EU AI Act, Datasheets for Datasets, and the Kaggle Dataset Usability Score, presenting their requirements and their drawbacks. To do so, we reverse-engineer the weights of the influential Kaggle Score for the first time and report them in this study. We distill the most important data governance requirements into ten questions tailored to the biomedical domain, comprising the Venus score.
We apply the Venus score to twelve datasets from multiple subdomains, including electronic health records, medical imaging, microarray and bulk RNA-seq gene expression, cheminformatics, physiologic electrogram signals, and medical text. Analyzing the results, we surface fine-grained strengths and weaknesses of popular datasets, as well as aggregate trends. Most notably, we find a widespread tendency to gloss over sources of data inaccuracy and noise, which may hinder the reliable exploitation of data and, consequently, research results. Overall, our results confirm the applicability and utility of the Venus score to assess the trustworthiness of biomedical data.
Neuropsychological tests and machine learning: identifying predictors of MCI and dementia progression Carlotta Cazzolli, Marco Chierici, Monica Dallabona, Chiara Guella, Giuseppe Jurman Aging Clinical and Experimental Research, 2025 Background: Early prediction of progression in dementia is of major importance for providing patients with adequate clinical care, with considerable impact on the organization of the whole healthcare system. Aims: The main task is tailoring robust and consolidated machine learning models to detect which neuropsychological tests are more effective in predicting a patient’s mental status. From a translational medicine perspective, such an identification tool should find its place in the clinician’s toolbox as a support throughout their daily diagnostic routine. A second objective involves predicting the patient’s diagnosis based on the results of the cognitive assessment. Methods: 281 patients with MCI or dementia diagnosis were assessed through 14 commonly administered neuropsychological tests designed to evaluate different cognitive domains. A suite of machine learning models, trained on different subsets of data, was used to detect the most informative tests and to predict the patient’s diagnosis. Two external validation datasets containing MMSE and FAB tests were involved in this second task. Results: The tests qualitatively and statistically associated with cognitive decline are MMSE, FAB, BSTR, AM, and VSF, of which at least three were also considered the most informative by machine learning. An average accuracy of 73% was obtained in the diagnosis prediction on three subsets of original and external data. Discussion: Detecting the most informative tests could reduce the duration of visits and prevent the cognitive assessment from being biased by external factors. The predictions of machine learning models represent a useful baseline for the clinician’s actual diagnosis and a reliable insight into the future development of the patient’s cognitive status.
Machine Learning Analysis Applied to Prediction of Early Progression Independent of Relapse Activity in Multiple Sclerosis Patients Valentina Poretto, Walter Endrizzi, Matteo Betti, Stefano Bovo, Angelo Bellinvia, et al. European Journal of Neurology, 2025 Background: Predicting prognosis in people with multiple sclerosis (pwMS) at early disease stages remains an unmet need. Machine learning (ML) strategies have demonstrated good reliability when applied to prediction in medicine. This study aimed at developing a predictive algorithm by comparing different ML approaches, using routine demographic, clinical and radiological data from a large multicentric cohort of newly diagnosed pwMS. Methods: Demographic, clinical, radiological and biochemical data were retrospectively collected at three Italian MS centers at baseline and four timepoints thereafter (6, 12, 24, and 36 months). Data from the first evaluation and subsequent 2‐year follow‐up were analyzed, comparing different ML models (Random Forest, Extra Trees, XGBoost, Logistic Regression and Support Vector Classifier) to predict progression independent of relapse activity (PIRA) at year 3. To understand how features impacted the selected model's output, an ML explainability analysis was performed on the whole cohort and on specific subsets of patients: those aged under 45 and those NEDA‐3 at the 2‐year follow‐up. Results: Data from 719 pwMS (age 34.6 ± 11.2 years; female sex 501 (70%)) were analyzed. Ninety‐two pwMS (13%) developed PIRA at year 3. Random Forest achieved the highest score, with a test set area under the ROC curve (AUC) of 0.75 ± 0.06. Features with the highest predictive impact were Expanded Disability Status Scale at 24 months, age at symptom onset and disease duration at baseline. Conclusion: Our results showed the feasibility of applying ML techniques to predict short‐term PIRA in newly diagnosed pwMS by using routine clinical practice data, paving the way for tailored and personalized approaches.
Mapping B cells and the immune landscape of tertiary lymphoid structures reveals their clinical impact in neuroblastoma Ombretta Melaiu, Marco Chierici, Paula Gragera, Nicolò Lazzaro, Lucia L Petrilli, et al. Journal for ImmunoTherapy of Cancer, 2025 Background: Immunotherapy has transformed cancer treatment, highlighting the importance of effective antitumor immunity to fight cancer. However, its success in pediatric cancer remains limited, underscoring the urgent need to identify new immunotherapeutic targets. In this study, we explored the clinical relevance of B cells and tertiary lymphoid structures (TLSs) in neuroblastoma (NB), a pediatric tumor with a heterogeneous immune landscape. Methods: We analyzed 87 treatment-naïve NB specimens, spanning both localized and metastatic disease, previously characterized for T-cell and dendritic cell (DC) infiltration. B cells were detected by immunohistochemistry, and plasma cells were quantified using multiplex immunofluorescence. Spatial organization and functional status of immune cells within TLSs were assessed by imaging mass cytometry using a 29-antibody panel. In parallel, gene expression profiles were obtained through NanoString PanCancer Immune Profiling and further validated using publicly available bulk and single-cell RNA-sequencing data from untreated and treated NB samples. These transcriptomic datasets were used to support protein-level findings and to identify prognostic gene signatures. Results: B-cell infiltration in NB tumors strongly correlated with the presence of T cells and DCs at both protein and transcriptomic levels, and was associated with improved prognosis. Similar to other solid tumors, B cells in NB were either scattered throughout the tumor or organized into TLSs of varying maturity.
Spatial proteomic and transcriptomic analyses revealed that localized tumors often contain mature TLSs, with functional B cells capable of antigen presentation and immunoglobulin expression, alongside high levels of cytotoxic T cells. In contrast, metastatic tumors primarily exhibited immature TLSs, with evidence of B-cell and T-cell dysfunction. Importantly, we identified gene signatures associated with B cells and TLSs that not only predicted survival in NB but were also prognostic in multiple adult cancers. Conclusions: Our findings highlight a central role for B cells and TLSs in shaping the immune microenvironment of NB. Their presence and maturation status are linked to clinical outcome, suggesting their potential as prognostic biomarkers and targets for novel immunotherapeutic strategies in pediatric oncology.
Bottlenecks in advancing and applying multiomic data integration—common data resources as rate-limiting drivers—the high-impact use case of atherosclerotic cardiovascular disease Stephanie Bezzina Wettinger, Kanita Karaduzovic-Hadziabdic, Ritienne Attard, Rosienne Farrugia, Brooke N Wolford, et al. Briefings in Bioinformatics, 2025 Despite striking successes in identifying novel biomarkers for improved patient stratification and predicting disease progression, numerous challenges remain in the effective integration and exploitation of multiomic data in biomedical applications beyond cancer, for which most bioinformatics strategies are developed and validated. That focus on cancer severely limits the effective development and advancement of algorithms in machine learning and artificial intelligence that do not suffer degraded out-of-domain performance. Generalizability and interpretability of models, however, are also required for robust insights that may translate into clinical practice. Work across different independent datasets is critical for establishing models robust towards unwanted variation in assays, protocols, and cohort populations. Disease-specific context like ethnicity, socioeconomic background, sex, lifestyle, disease phase, and tissue type also strongly affect molecular profiles. We here discuss atherosclerotic cardiovascular disease (ASCVD) as a high-impact non-cancer use case for the challenges remaining in the development and application of the latest bioinformatics approaches to multiomics data integration. ASCVD remains the leading cause of death globally. Disease aetiology, progression, and therapy outcome depend on a complex interplay of genetic, environmental, and lifestyle factors. Integrating these diverse data types effectively remains a challenge but holds transformative potential for personalized medicine. Discovery and access to data of sufficient diversity and extent form key bottlenecks. 
We here compile a first comprehensive overview of key data sets in ASCVD to complement the established cancer-focused resources as a foundation for future effective development and application of state-of-the-art bioinformatics tools for multiomic data integration.
Scoring Tumor-Infiltrating Lymphocytes in breast DCIS: A guideline-driven artificial intelligence approach Proceedings of Machine Learning Research, 2024
AI Slipping on Tiles: Data Leakage in Digital Pathology Nicole Bussola, Alessia Marcolini, Valerio Maggio, Giuseppe Jurman, Cesare Furlanello Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2021
Integrating deep and radiomics features in cancer bioimaging A. Bizzego, N. Bussola, D. Salvalai, M. Chierici, V. Maggio, et al. 2019 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2019), 2019
A machine learning pipeline for discriminant pathways identification Annalisa Barla, Giuseppe Jurman, Roberto Visintainer, Margherita Squillario, Michele Filosi, et al. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2012
Deriving the kernel from training data Stefano Merler, Giuseppe Jurman, Cesare Furlanello Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2007
Proteome profiling without selection bias A. Barla, B. Irler, S. Merler, G. Jurman, S. Paoli, et al. Proceedings IEEE Symposium on Computer Based Medical Systems, 2006
Semisupervised profiling of gene expressions and clinical data Silvano Paoli, Giuseppe Jurman, Davide Albanese, Stefano Merler, Cesare Furlanello Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2006
Exact bagging with k-Nearest neighbour classifiers Bruno Caprile, Stefano Merler, Cesare Furlanello, Giuseppe Jurman Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2004
Control of selection bias in microarray data analysis Minerva Biotecnologica, 2003
Gene Selection and Classification by Entropy-based Recursive Feature Elimination Proceedings of the International Joint Conference on Neural Networks, 2003