Scalable integration of multiomic single-cell data using generative adversarial networks Valentina Giansanti, Francesca Giannese, Oronza A Botrugno, Giorgia Gandolfi, Chiara Balestrieri, et al. Bioinformatics, 2024 Motivation: Single-cell profiling has become a common practice to investigate the complexity of tissues, organs, and organisms. Recent technological advances are expanding our capabilities to profile various molecular layers beyond the transcriptome such as, but not limited to, the genome, the epigenome, and the proteome. Depending on the experimental procedure, these data can be obtained from separate assays or from the very same cells. Yet, integration of more than two assays is currently not supported by the majority of the available computational frameworks. Results: We here propose a Multi-Omic data integration framework based on Wasserstein Generative Adversarial Networks suitable for the analysis of paired or unpaired data with a high number of modalities (>2). At the core of our strategy is a single network trained on all modalities together, limiting the computational burden when many molecular layers are evaluated. Availability and implementation: Source code of our framework is available at https://github.com/vgiansanti/MOWGAN
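As a rough illustration of the quantity a Wasserstein GAN is built around, the sketch below computes the empirical Wasserstein-1 distance between two one-dimensional samples. It is not taken from the MOWGAN code: the function name `wasserstein_1d` and the toy samples are assumptions made for this example, and a real WGAN estimates the distance with a trained critic network on high-dimensional data rather than by sorting.

```python
def wasserstein_1d(xs, ys):
    """Empirical Wasserstein-1 distance between two equal-sized 1-D samples."""
    if not xs or len(xs) != len(ys):
        raise ValueError("this sketch assumes two non-empty samples of equal size")
    # In one dimension the optimal transport plan pairs the sorted values,
    # so the distance is the mean absolute difference of the sorted samples.
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Hypothetical toy samples standing in for "real" and "generated" values.
real = [0.0, 1.0, 2.0, 3.0]
generated = [0.5, 1.5, 2.5, 3.5]
print(wasserstein_1d(real, generated))  # 0.5: every point is shifted by 0.5
```

A smaller value means the generated sample is closer to the real one, which is the signal a Wasserstein critic provides to the generator during training.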
Chromatin Velocity reveals epigenetic dynamics by single-cell profiling of heterochromatin and euchromatin Martina Tedesco, Francesca Giannese, Dejan Lazarević, Valentina Giansanti, Dalia Rosano, et al. Nature Biotechnology, 2022 Recent efforts have succeeded in surveying open chromatin at the single-cell level, but high-throughput, single-cell assessment of heterochromatin and its underlying genomic determinants remains challenging. We engineered a hybrid transposase including the chromodomain (CD) of the heterochromatin protein-1α (HP-1α), which is involved in heterochromatin assembly and maintenance through its binding to trimethylation of the lysine 9 on histone 3 (H3K9me3), and developed a single-cell method, single-cell genome and epigenome by transposases sequencing (scGET-seq), that, unlike single-cell assay for transposase-accessible chromatin with sequencing (scATAC-seq), comprehensively probes both open and closed chromatin and concomitantly records the underlying genomic sequences. We tested scGET-seq in cancer-derived organoids and human-derived xenograft (PDX) models and identified genetic events and plasticity-driven mechanisms contributing to cancer drug resistance. Next, building upon the differential enrichment of closed and open chromatin, we devised a method, Chromatin Velocity, that identifies the trajectories of epigenetic modifications at the single-cell level. Chromatin Velocity uncovered paths of epigenetic reorganization during stem cell reprogramming and identified key transcription factors driving these developmental processes. scGET-seq reveals the dynamics of genomic and epigenetic landscapes underlying any cellular processes.
Nested Stochastic Block Models applied to the analysis of single cell data Leonardo Morelli, Valentina Giansanti, Davide Cittaro BMC Bioinformatics, 2021 Single cell profiling has proven to be a powerful tool in molecular biology for understanding the complex behaviours of heterogeneous systems. The definition of the properties of single cells is the primary endpoint of such analysis; cells are typically clustered to underpin the common determinants that can be used to describe functional properties of the cell mixture under investigation. Several approaches have been proposed to identify cell clusters; while this is a matter of active research, one popular approach is based on community detection in neighbourhood graphs by optimisation of modularity. In this paper we propose an alternative and principled solution to this problem, based on Stochastic Block Models. We show that such an approach is not only suitable for identification of cell groups, but also provides a solid framework to perform other relevant tasks in single cell analysis, such as label transfer. To encourage the use of Stochastic Block Models, we developed a Python library, , that is compatible with the popular framework.
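The modularity score that the popular clustering approach mentioned above optimises can be sketched in a few lines. This is a minimal illustration of Newman modularity on a toy graph, not the Stochastic Block Model method proposed in the paper; the function name `modularity`, the edge list, and the group labels are all assumptions made for this example.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the same group
    degree_sum = defaultdict(int)  # summed degree of the nodes in each group
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Q = sum over groups of (fraction of internal edges - expected fraction).
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Hypothetical toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

print(modularity(edges, two_groups))  # 5/14, about 0.357
print(modularity(edges, one_group))   # 0.0
```

Splitting the graph into its two triangles scores higher than lumping all nodes together, which is the behaviour modularity-based community detection exploits.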
Fast analysis of scATAC-seq data using a predefined set of genomic regions Valentina Giansanti, Ming Tang, Davide Cittaro F1000Research, 2020 Background: Analysis of scATAC-seq data has recently been scaled to thousands of cells. While processing of other types of single cell data was boosted by the implementation of alignment-free techniques, pipelines available to process scATAC-seq data still require large computational resources. We propose here an approach based on pseudoalignment, which reduces execution times and hardware needs at little cost in precision. Methods: Public data for 10k PBMC were downloaded from the 10x Genomics website. Reads were aligned to various references derived from DNase I Hypersensitive Sites (DHS) using kallisto and quantified with bustools. We compared our results with the publicly available ones derived by cellranger-atac. We subsequently tested our approach on scATAC-seq data for the K562 cell line. Results: We found that kallisto does not introduce biases in quantification of known peaks; the cell groups identified are consistent with the ones identified by the standard method. We also found that cell identification is robust when analysis is performed using a DHS-derived reference in place of de novo identification of ATAC peaks. Lastly, we found that our approach is suitable for reliable quantification of gene activity based on scATAC-seq signal, and thus allows for efficient labelling of cell groups based on marker genes. Conclusions: Analysis of scATAC-seq data by means of kallisto produces results in line with standard pipelines while being considerably faster; using a set of known DHS sites as reference does not affect the ability to characterize the cell populations.
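The idea behind pseudoalignment against a predefined set of regions can be sketched with a toy k-mer index. This is only an illustration under strong simplifying assumptions, not kallisto's implementation: the function names, the region sequences, and the choice of k = 5 are hypothetical, and the sketch ignores reverse-complement strands, sequencing errors, and the graph-based index that kallisto actually uses.

```python
def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(regions, k):
    """Map every k-mer to the set of reference regions that contain it."""
    index = {}
    for name, seq in regions.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def pseudoalign(read, index, k):
    """Return the regions compatible with all indexed k-mers of the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer)
        if hits is None:
            continue  # simplification: skip k-mers absent from the reference
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Hypothetical reference regions sharing the subsequence "ACGGTC".
regions = {"region_A": "ACGTACGGTC", "region_B": "TTGACGGTCA"}
index = build_index(regions, k=5)

print(pseudoalign("GTACGG", index, k=5))  # {'region_A'}
print(pseudoalign("ACGGTC", index, k=5))  # both regions: the read is ambiguous
print(pseudoalign("AAAAAA", index, k=5))  # set(): no k-mer matches
```

No base-by-base alignment is computed: a read is assigned by set intersection alone, which is why this family of methods is fast when the reference is a fixed set of known sites.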
SoC-based computing infrastructures for scientific applications and commercial services: Performance and economic evaluations Daniele D’Agostino, Alfonso Quarati, Andrea Clematis, Lucia Morganti, Elena Corni, et al. Future Generation Computer Systems, 2019 Energy consumption is now one of the most relevant issues in operating computing infrastructures, from traditional High Performance Computing Centers to Cloud Data Centers. Low power System-on-Chip (SoC) architectures, originally developed in the context of mobile and embedded technologies, are becoming attractive also for scientific and industrial applications given their increasing computing performance, coupled with relatively low costs and power demands. In this paper, we investigate the performance of the most representative SoCs for a computationally intensive N-body benchmark, a simple deep learning based application, and a real-life application taken from the field of molecular biology. The goal is to assess the trade-off among time-to-solution, energy-to-solution, and economic aspects that these architectures can achieve, for both scientific and commercial purposes, in comparison with the traditional server-grade architectures adopted in present infrastructures.
Parallel Computing in Deep Learning: Bioinformatics Case Studies Valentina Giansanti, Stefano Beretta, Daniele Cesini, Daniele D'Agostino, Ivan Merelli Proceedings of the 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2019), 2019 In the last two decades deep learning has attracted a lot of attention internationally, solving problems in different application domains and achieving results beyond expectations. For example, it has been applied in bioinformatics, game playing, image processing, object detection, robotics, and drug discovery. One of the main reasons for the increased use of deep learning algorithms is the need for approaches to analyse the large amount of data produced in every field, which has led researchers to dedicate their work to deep learning development. One of the main topics discussed to date is the possibility of training deep models in parallel, so as to reduce the time otherwise needed to find the hyperparameters and to reach the result faster.
Comparing Deep and Machine Learning Approaches in Bioinformatics: A miRNA-Target Prediction Case Study Valentina Giansanti, Mauro Castelli, Stefano Beretta, Ivan Merelli Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2019 MicroRNAs (miRNAs) are small non-coding RNAs with a key role in post-transcriptional regulation of gene expression, thanks to their ability to bind target mRNAs through complementary base pairing. Given their role, it is important to identify their targets, and various tools have been proposed for this purpose. However, their results can differ substantially, so the community is now moving toward the deployment of integration tools, which should perform better than the individual ones.
Integration of machine learning methods to dissect genetically imputed transcriptomic profiles in Alzheimer’s Disease Carlo Maj, Tiago Azevedo, Valentina Giansanti, Oleg Borisov, Giovanna Maria Dimitri, et al. Frontiers in Genetics, 2019 The genetic component of many common traits is associated with gene expression, and several variants act as expression quantitative trait loci (eQTLs), regulating gene expression in a tissue-specific manner. In this work, we applied tissue-specific cis-eQTL gene expression prediction models on the genotype of 808 samples including controls, subjects with mild cognitive impairment, and patients with Alzheimer's Disease. We then dissected the imputed transcriptomic profiles by means of different unsupervised and supervised machine learning approaches to identify potential biological associations. Our analysis suggests that unsupervised and supervised methods can provide complementary information, which can be integrated for a better characterization of the underlying biological system. In particular, a variational autoencoder representation of the transcriptomic profiles, followed by a support vector machine classification, has been used for tissue-specific gene prioritizations. Interestingly, the achieved gene prioritizations can be efficiently integrated as a feature selection step for improving the accuracy of deep learning classifier networks. The identified gene-tissue information suggests a potential role for inflammatory and regulatory processes in gut-brain axis related tissues. In line with the expected low heritability that can be apportioned to eQTL variants, we were able to achieve only relatively low prediction capability with deep learning classification models. However, our analysis revealed that the classification power strongly depends on the network structure, with recurrent neural networks being the best performing network class. Interestingly, cross-tissue analysis suggests a potentially greater role of models trained in brain tissues also by considering dementia-related endophenotypes. Overall, the present analysis suggests that the combination of supervised and unsupervised machine learning techniques can be used for the evaluation of high dimensional omics data.