Showing papers by "Gianluca Pollastri" published in 2023


Journal Article
TL;DR: In this paper, the authors compare mutual information, the coefficients of two linear models, and three deep learning networks for single-sequence secondary structure prediction, using DeepLIFT analysis to assess the effect of each residue at each position.
Abstract: Over the last several decades, predicting protein structures from amino acid sequences has been a core task in bioinformatics. Nowadays, the most successful methods employ multiple sequence alignments and can predict the structure with excellent performance. These predictions take advantage of all the amino acids at a given position and their frequencies. However, the effect of single amino acid substitutions in a specific protein tends to be hidden by the alignment profile. For this reason, single-sequence-based predictions attract interest even after accurate multiple-alignment methods have become available: the use of single sequences ensures that the effects of substitution are not confounded by homologous sequences. This work aims to understand how the single-sequence secondary structure prediction of a residue is influenced by the surrounding residues, and how different prediction methods use single-sequence information to predict the structure. We compare mutual information, the coefficients of two linear models, and three deep learning networks. For the deep learning algorithms, we use DeepLIFT analysis to assess the effect of each residue at each position in the prediction. Mutual information and linear models quantify direct effects, whereas DeepLIFT applied to deep learning networks quantifies both direct and indirect effects. Our analysis shows how different network architectures use the information of single protein sequences and highlights their differences with respect to linear models. In particular, the deep learning implementations take into account context and single-position information differently, with the best results obtained using the BERT architecture.
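As an illustration of the mutual-information baseline mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It estimates the mutual information (in bits) between the amino acid found at a given offset from a residue and that residue's secondary structure label. The function names, the window size, and the toy sequences and labels are all hypothetical and chosen only for demonstration.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of mutual information (bits) between the two items of each pair."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts to avoid extra divisions
        mi += p_ab * math.log2(p_ab * n * n / (left[a] * right[b]))
    return mi


def mi_by_offset(sequences, structures, max_offset):
    """MI between the residue at position i + offset and the structure label at i."""
    result = {}
    for offset in range(-max_offset, max_offset + 1):
        pairs = []
        for seq, ss in zip(sequences, structures):
            for i in range(len(seq)):
                j = i + offset
                if 0 <= j < len(seq):
                    pairs.append((seq[j], ss[i]))
        result[offset] = mutual_information(pairs)
    return result


# Toy, made-up data (assumption): H = helix, E = strand, C = coil.
sequences = ["AEELLKKA", "GSPGVTVY"]
structures = ["HHHHHHHH", "CCCCEEEE"]
profile = mi_by_offset(sequences, structures, max_offset=2)
```

Here `profile[0]` measures the direct dependence between a residue and its own label, while non-zero offsets measure the contribution of neighbouring positions. With real data, the estimate would need far more sequences than this toy set to be meaningful.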

Journal Article
TL;DR: In this article, a CCCTC-binding factor (CTCF) binding predictor based on Random Forest is proposed, which employs different epigenetic data and genomic features to predict the binding of CTCF.
Abstract: Motivation: One of the most relevant mechanisms involved in the determination of chromatin structure is the formation of structural loops, which are also related to the conservation of chromatin states. Many of these loops are stabilized by CCCTC-binding factor (CTCF) proteins at their base. Despite the relevance of chromatin structure and the key role of CTCF, the role of the epigenetic factors that are involved in the regulation of CTCF binding, and thus in the formation of structural loops in the chromatin, is not thoroughly understood. Results: Here we describe a CTCF binding predictor based on Random Forest that employs different epigenetic data and genomic features. Importantly, given the ability of Random Forests to determine the relevance of features for the prediction, our approach also shows how the different types of descriptors impact the binding of CTCF, confirming previous knowledge on the relevance of chromatin accessibility and DNA methylation, while also demonstrating the effect of epigenetic modifications on the activity of CTCF. We compared our approach against other predictors and found improved performance in terms of areas under PR and ROC curves (PRAUC-ROCAUC), outperforming current state-of-the-art methods.

Posted Content
16 Jan 2023-bioRxiv
TL;DR: In this article, a cascade method based on a random forest algorithm is proposed to infer epigenetic marks and, by doing so, to reduce the number of experimentally determined marks required to assign chromatin states.
Abstract: Structural changes of chromatin modulate access to DNA for all proteins involved in transcription. These changes are linked to variations in epigenetic marks, whose patterns allow chromatin to be classified into different functional states. Importantly, alterations in chromatin states are known to be linked with various diseases. For example, there are abnormalities in epigenetic patterns in different types of cancer. For most of these diseases, there is not enough epigenomic data available to accurately determine chromatin states for the affected cells, mainly due to the high cost of performing this type of experiment, but also because of insufficient sample quantity or sample degradation. In this work we describe a cascade method based on a random forest algorithm to infer epigenetic marks and, by doing so, to reduce the number of experimentally determined marks required to assign chromatin states. Our approach identified several relationships between patterns of different marks, which strengthens the evidence in favor of a redundant epigenetic code.