Author

Samantha Riccadonna

Other affiliations: University of Trento
Bio: Samantha Riccadonna is an academic researcher from Fondazione Bruno Kessler. The author has contributed to research on topics including Metric (mathematics) and Support vector machine. The author has an h-index of 17 and has co-authored 40 publications receiving 2154 citations. Previous affiliations of Samantha Riccadonna include the University of Trento.

Papers
Journal Article
Leming Shi, Gregory Campbell, Wendell D. Jones, Fabien Campagne, +198 more authors; 55 institutions
TL;DR: Predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans are generated.
Abstract: Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis.

753 citations

Journal Article
TL;DR: RNA-seq outperforms microarray in DEG verification as assessed by quantitative PCR, with the gain mainly due to its improved accuracy for low-abundance transcripts, and classifiers to predict MOAs perform similarly when developed using data from either platform.
Abstract: The concordance of RNA-sequencing (RNA-seq) with microarrays for genome-wide analysis of differential gene expression has not been rigorously assessed using a range of chemical treatment conditions. Here we use a comprehensive study design to generate Illumina RNA-seq and Affymetrix microarray data from the same liver samples of rats exposed in triplicate to varying degrees of perturbation by 27 chemicals representing multiple modes of action (MOAs). The cross-platform concordance in terms of differentially expressed genes (DEGs) or enriched pathways is linearly correlated with treatment effect size (R² ≈ 0.8). Furthermore, the concordance is also affected by transcript abundance and biological complexity of the MOA. RNA-seq outperforms microarray (93% versus 75%) in DEG verification as assessed by quantitative PCR, with the gain mainly due to its improved accuracy for low-abundance transcripts. Nonetheless, classifiers to predict MOAs perform similarly when developed using data from either platform. Therefore, the endpoint studied and its biological complexity, transcript abundance and the genomic application are important factors in transcriptomic research and for clinical and regulatory decision making.

410 citations

Journal Article
08 Aug 2012-PLOS ONE
TL;DR: It is shown that the Confusion Entropy, a measure of performance in multiclass problems, has a strong (monotone) relation with the multiclass generalization of a classical metric, the Matthews Correlation Coefficient.
Abstract: We show that the Confusion Entropy, a measure of performance in multiclass problems, has a strong (monotone) relation with the multiclass generalization of a classical metric, the Matthews Correlation Coefficient. Analytical results are provided for the limit cases of general no-information (n-face dice rolling) of the binary classification. Computational evidence supports the claim in the general case.

310 citations
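
As an illustration of the multiclass Matthews Correlation Coefficient mentioned in the entry above, here is a minimal Python sketch. The abstract gives no formulas, so the code assumes the commonly used confusion-matrix form of the multiclass generalization (often written R_K); the function name `multiclass_mcc` and the example matrices are hypothetical. The Confusion Entropy itself is not computed here.

```python
import math

def multiclass_mcc(cm):
    """Multiclass MCC from a square confusion matrix (rows: true, columns: predicted).

    Assumes the R_K form: (c*s - sum(p_k*t_k)) / sqrt((s^2 - sum(p_k^2)) * (s^2 - sum(t_k^2))).
    """
    n = len(cm)
    s = sum(sum(row) for row in cm)                          # total number of samples
    c = sum(cm[k][k] for k in range(n))                      # correctly classified samples
    t = [sum(cm[k]) for k in range(n)]                       # true count per class
    p = [sum(cm[i][k] for i in range(n)) for k in range(n)]  # predicted count per class
    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = math.sqrt((s * s - sum(pk * pk for pk in p)) *
                    (s * s - sum(tk * tk for tk in t)))
    return num / den if den else 0.0  # convention assumed here: 0 when undefined

# Hypothetical 3-class examples: perfect prediction and a "dice rolling" classifier
perfect = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]
uniform = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
mcc_perfect = multiclass_mcc(perfect)
mcc_uniform = multiclass_mcc(uniform)
```

Under these assumptions the perfect matrix scores 1 and the uniform (no-information) matrix scores 0, which matches the limit case the abstract describes as n-face dice rolling.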

Journal Article
TL;DR: A novel implementation in ANSI C of the MINE family of algorithms for computing maximal information-based measures of dependence between two variables in large datasets, with the aim of a low memory footprint and ease of integration within bioinformatics pipelines is introduced.
Abstract: Summary: We introduce a novel implementation in ANSI C of the MINE family of algorithms for computing maximal information-based measures of dependence between two variables in large datasets, with the aim of a low memory footprint and ease of integration within bioinformatics pipelines. We provide the libraries minerva (with the R interface) and minepy for Python, MATLAB, Octave and C++. The C solution reduces the large memory requirement of the original Java implementation, has good upscaling properties and offers a native parallelization for the R interface. Low memory requirements are demonstrated on the MINE benchmarks as well as on large (n = 1340) microarray and Illumina GAII RNA-seq transcriptomics datasets. Availability and implementation: Source code and binaries are freely available for download under GPL3 licence at http://minepy.sourceforge.net for minepy and through the CRAN repository http://cran.r-project.org for the R package minerva. All software is multiplatform (MS Windows, Linux and OSX). Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

180 citations

Journal Article
TL;DR: A methodological description of nutritional metabolomics is provided that reflects on the state-of-the-art techniques used in the laboratories of the Food Biomarker Alliance as well as points of reflections to harmonize this field.
Abstract: The life sciences are currently being transformed by an unprecedented wave of developments in molecular analysis, which include important advances in instrumental analysis as well as biocomputing. In light of the central role played by metabolism in nutrition, metabolomics is rapidly being established as a key analytical tool in human nutritional studies. Consequently, an increasing number of nutritionists integrate metabolomics into their study designs. Within this dynamic landscape, the potential of nutritional metabolomics (nutrimetabolomics) to be translated into a science, which can impact on health policies, still needs to be realized. A key element to reach this goal is the ability of the research community to join, to collectively make the best use of the potential offered by nutritional metabolomics. This article, therefore, provides a methodological description of nutritional metabolomics that reflects on the state-of-the-art techniques used in the laboratories of the Food Biomarker Alliance (funded by the European Joint Programming Initiative "A Healthy Diet for a Healthy Life" (JPI HDHL)) as well as points of reflections to harmonize this field. It is not intended to be exhaustive but rather to present a pragmatic guidance on metabolomic methodologies, providing readers with useful "tips and tricks" along the analytical workflow.

161 citations


Cited by
Journal Article
TL;DR: This article shows how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, by first explaining the mathematical properties, and then the asset of MCC in six synthetic use cases and in a real genomics scenario.
Abstract: To evaluate binary classifications and their confusion matrices, scientific researchers can employ several statistical rates, according to the goal of the experiment they are investigating. Despite being a crucial issue in machine learning, no widespread consensus has been reached on a unified elective chosen measure yet. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular adopted metrics in binary classification tasks. However, these statistical measures can dangerously show overoptimistic inflated results, especially on imbalanced datasets. The Matthews correlation coefficient (MCC), instead, is a more reliable statistical rate which produces a high score only if the prediction obtained good results in all of the four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset. In this article, we show how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, by first explaining the mathematical properties, and then the asset of MCC in six synthetic use cases and in a real genomics scenario. We believe that the Matthews correlation coefficient should be preferred to accuracy and F1 score in evaluating binary classification tasks by all scientific communities.

2,358 citations
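
The contrast between accuracy, F1 score and MCC described in the entry above can be sketched in a few lines of Python. This is a minimal illustration using the standard textbook definitions of the three measures; the function name `binary_scores` and the confusion-matrix counts are hypothetical and chosen only to mimic an imbalanced dataset.

```python
import math

def binary_scores(tp, fn, tn, fp):
    """Return (accuracy, F1, MCC) for one binary confusion matrix."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: report 0 when the MCC denominator vanishes
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return accuracy, f1, mcc

# Hypothetical imbalanced dataset: 95 positives, 5 negatives
acc, f1, mcc = binary_scores(tp=90, fn=5, tn=1, fp=4)
print(f"accuracy={acc:.2f}  F1={f1:.2f}  MCC={mcc:.2f}")
```

With these counts, accuracy and F1 are both above 0.9 while MCC stays below 0.2, because only one of the five negatives is classified correctly. This is the kind of overoptimistic result on imbalanced data that the abstract warns about.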

Journal Article
21 Apr 2016-Cell
TL;DR: It is concluded that transcript levels by themselves are not sufficient to predict protein levels in many scenarios and to thus explain genotype-phenotype relationships and that high-quality data quantifying different levels of gene expression are indispensable for the complete understanding of biological processes.

1,996 citations

Journal Article
TL;DR: By digitally separating tumor, stromal and normal gene expression, two tumor subtypes are identified and validated, including a 'basal-like' subtype that has worse outcome and is molecularly similar to basal tumors in bladder and breast cancers.
Abstract: Pancreatic ductal adenocarcinoma (PDAC) remains a lethal disease with a 5-year survival rate of 4%. A key hallmark of PDAC is extensive stromal involvement, which makes capturing precise tumor-specific molecular information difficult. Here we have overcome this problem by applying blind source separation to a diverse collection of PDAC gene expression microarray data, including data from primary tumor, metastatic and normal samples. By digitally separating tumor, stromal and normal gene expression, we have identified and validated two tumor subtypes, including a 'basal-like' subtype that has worse outcome and is molecularly similar to basal tumors in bladder and breast cancers. Furthermore, we define 'normal' and 'activated' stromal subtypes, which are independently prognostic. Our results provide new insights into the molecular composition of PDAC, which may be used to tailor therapies or provide decision support in a clinical setting where the choice and timing of therapies are critical.

1,333 citations

Journal Article
TL;DR: Why interactome networks are important to consider in biology, how they can be mapped and integrated with each other, what global properties are starting to emerge from interactome network models, and how these properties may relate to human disease are detailed.
Abstract: Complex biological systems and cellular networks may underlie most genotype to phenotype relationships. Here, we review basic concepts in network biology, discussing different types of interactome networks and the insights that can come from analyzing them. We elaborate on why interactome networks are important to consider in biology, how they can be mapped and integrated with each other, what global properties are starting to emerge from interactome network models, and how these properties may relate to human disease.

1,323 citations

Journal Article
TL;DR: Orange is a machine learning and data mining suite for data analysis through Python scripting and visual programming, which features interactive data analysis and component-based assembly of data mining procedures.
Abstract: Orange is a machine learning and data mining suite for data analysis through Python scripting and visual programming. Here we report on the scripting part, which features interactive data analysis and component-based assembly of data mining procedures. In the selection and design of components, we focus on the flexibility of their reuse: our principal intention is to let the user write simple and clear scripts in Python, which build upon C++ implementations of computationally-intensive tasks. Orange is intended both for experienced users and programmers, as well as for students of data mining.

1,294 citations