
The Perseus computational platform for comprehensive analysis of (prote)omics data.

Abstract
A main bottleneck in proteomics is the downstream biological analysis of highly multivariate quantitative protein abundance data generated using mass-spectrometry-based analysis. We developed the Perseus software platform (http://www.perseus-framework.org) to support biological and biomedical researchers in interpreting protein quantification, interaction and post-translational modification data. Perseus contains a comprehensive portfolio of statistical tools for high-dimensional omics data analysis covering normalization, pattern recognition, time-series analysis, cross-omics comparisons and multiple-hypothesis testing. A machine learning module supports the classification and validation of patient groups for diagnosis and prognosis, and it also detects predictive protein signatures. Central to Perseus is a user-friendly, interactive workflow environment that provides complete documentation of computational methods used in a publication. All activities in Perseus are realized as plugins, and users can extend the software by programming their own, which can be shared through a plugin store. We anticipate that Perseus's arsenal of algorithms and its intuitive usability will empower interdisciplinary analysis of complex large data sets.


Perseus platform for proteomics data
The Perseus computational platform for comprehensive
analysis of (prote)omics data
Stefka Tyanova¹, Tikira Temu¹, Pavel Sinitcyn¹, Arthur Carlson¹, Marco Y. Hein², Tamar Geiger³, Matthias Mann⁴ and Jürgen Cox¹*
¹ Computational Systems Biochemistry, Max-Planck Institute of Biochemistry, Martinsried, Germany.
² Cellular and Molecular Pharmacology, University of California San Francisco, San Francisco, CA, USA.
³ Human Molecular Genetics and Biochemistry, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel.
⁴ Proteomics and Signal Transduction, Max-Planck Institute of Biochemistry, Martinsried, Germany.
*Correspondence: cox@biochem.mpg.de
Running title: Perseus platform for proteomics data

Abstract
A main bottleneck in proteomics is the downstream biological analysis of highly
multivariate quantitative protein abundance data. The Perseus software supports
researchers in interpreting protein quantification, interaction and posttranslational
modification data. It contains a comprehensive portfolio of statistical tools for
high-dimensional omics data analysis covering normalization, pattern recognition, time
series analysis, cross-omics comparisons and multiple hypothesis testing. A machine
learning module supports classification and validation of patient groups for
diagnosis and prognosis, also detecting predictive protein signatures. Central to
Perseus is a user-friendly, interactive workflow environment providing complete
documentation of computational methods used in a publication. All activities in
Perseus are realized as plugins and users can extend the software by programming
their own, which can be shared through a plugin store. Perseus combines a powerful
arsenal of algorithms with intuitive usability by biomedical domain experts, making
it suitable for interdisciplinary analysis of complex large datasets.

A decade ago proteomics projects were still labor-intensive and cumbersome, and
high-quality results required semi-manual analysis of spectra for identification and
quantification. Today, mass spectrometry (MS)-based shotgun proteomics is reaching a
level of maturity that makes it a powerful and broadly applicable technology for
researchers in biology and biomedical sciences [1, 2]. Consistent automatic processing of
spectra and the identification of peptides, proteins and posttranslational modifications
(PTMs) with the help of search engines [3-7] and reliable workflows have become standard
computational tasks for which satisfactory solutions exist for single studies as well as
community-wide data re-analysis [8-10]. Sophisticated computational proteomics platforms
offer complete solutions including the quantification of proteins and PTMs over many
samples in a large variety of labeling or label-free formats [11]. Public repositories for the
storage and dissemination of MS-based proteomics data exist in practical forms [12, 13].
Yeast systems biology can make use of complete proteome quantification [14] in many
different conditions or stimuli with modest measurement effort [15]. Starting with a cohort
of human samples, protein expression matrices with sample-wise ratios or relative
abundances can readily be obtained for more than 10,000 proteins [16-19].
These advances have shifted the bottleneck to the biological interpretation of quantitative
abundance and PTM data and to translating the high-dimensional molecular data into
relevant findings within the domain of a particular biological or medical investigator.
Many potentially important findings are not currently extracted from the data simply
because the computational methods and algorithms that would highlight them are not in
the hands of the researcher with the necessary domain knowledge to appreciate the
meaning of the findings. There are often barriers between informatics and biological
researchers, which need to be bridged in order to translate omics technologies to valuable
biological or medical discoveries.
Here, we address this problem by creating a computational platform that fulfils two
potentially conflicting objectives: (1) All methods should be statistically sound, powerful
and comprehensive. (2) It should still be intuitive and easy to use for the domain expert in
a biomedical discipline who is not a computational expert. To reach these goals we
developed the Perseus platform in close collaboration with biologists, with whom we
analyzed projects involving multiple, diverse and distinct data types and experimental
approaches. Experienced Perseus users can perform essentially all the computational
tasks alone, even with little or no formal bioinformatic training. They can still involve
programmers and bioinformatics specialists to extend the functionality of Perseus with
plug-ins that add to the Perseus workflow as custom activities. Here we describe the
functionalities available in version 1.5.4.0 of Perseus.
Comprehensive workflow-based data analysis platform
Downstream analysis of proteomic data is a multi-faceted and demanding field that
integrates many aspects of bioinformatics, statistics and machine learning. It is common
practice to hire bioinformaticians to help biological researchers with
various analytical problems. Often these efforts result in multiple small scripts that are
tedious to maintain and scale and that require the help of the developer to be re-used or
stitched together. This approach is bound to turn downstream data analysis into a major
bottleneck for scientific projects and discoveries. Furthermore, the results may be of
questionable validity when there is no clear documentation and transparency about the
methods and scripts employed. We thus set out to develop the Perseus platform as a
holistic software that allows continuous expansion of scalable analytical tools, their
smooth integration and re-usability while providing the user with explicit documentation
of the analysis steps and parameters. Greater detail on the implementation and download
of Perseus is provided in Box 1.
Perseus offers a wide range of algorithmic activities that cover topics ranging from data
normalization through exploratory multivariate data analysis to integration with other
omics levels (Fig. 1). The following sub-sections describe the various computational and
statistical tools in Perseus. Several complete analysis workflows are available on our
DokuWiki pages (http://coxdocs.org/doku.php?id=perseus:user:use_cases). They contain
step-by-step descriptions of three standard proteomics project types and, together with the
YouTube videos (https://www.youtube.com/channel/UCKYzYTm1cnmc0CFAMhxDO8w), represent a
valuable resource for first-time users. Many activities produce interactive graphical
output for the visualization of data analysis results, which scale easily to very large sets
of input data and therefore allow for thorough inspection by the user even for large-scale
experiments with complex experimental designs and many measured variables. Any plot
can be exported in a number of graphical formats and edited in standard vector graphics
editors upon release of all clipping masks.
The central data type in Perseus is the ‘augmented data matrix’, which typically
represents expression or abundance values of genes or proteins (rows) and biological
samples or technical replicates (columns). It is supplemented by additional data
containers for annotation of the rows, columns and cells of the matrix (see Box 2). These
annotation containers are automatically filled in Perseus with gene or protein information
derived from the publicly available ontologies, pathways and annotation databases.
Sample annotations are used in many activities to define the study design, such as to
designate which samples are replicates, or which belong to different treatments or time
points in a time series analysis.
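To make the idea of the augmented data matrix concrete, the following minimal Python sketch models a numeric matrix together with row and column annotation containers. It is an illustration only and not the Perseus implementation: the class name `AugmentedMatrix`, its fields, and the method names are hypothetical, and cell-level annotations are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class AugmentedMatrix:
    """Hypothetical stand-in for an augmented data matrix (not Perseus code)."""
    values: list[list[float]]        # main matrix: rows = proteins, columns = samples
    column_names: list[str]          # one name per sample column
    row_annotations: dict[str, list[str]] = field(default_factory=dict)     # e.g. gene names
    column_annotations: dict[str, list[str]] = field(default_factory=dict)  # e.g. treatment

    def __post_init__(self):
        # Every annotation container must line up with the matrix dimensions.
        n_rows, n_cols = len(self.values), len(self.column_names)
        if any(len(row) != n_cols for row in self.values):
            raise ValueError("every row needs one value per sample column")
        if any(len(a) != n_rows for a in self.row_annotations.values()):
            raise ValueError("row annotations need one entry per row")
        if any(len(a) != n_cols for a in self.column_annotations.values()):
            raise ValueError("column annotations need one entry per column")

    def columns_in_group(self, annotation: str, label: str) -> list[int]:
        """Indices of the sample columns whose annotation equals `label`."""
        return [i for i, g in enumerate(self.column_annotations[annotation]) if g == label]

    def group_means(self, annotation: str, label: str) -> list[float]:
        """Per-row mean over all samples in one annotated group (e.g. replicates)."""
        idx = self.columns_in_group(annotation, label)
        if not idx:
            raise ValueError(f"no columns annotated with {label!r}")
        return [sum(row[i] for i in idx) / len(idx) for row in self.values]


# Two proteins measured in two control and two treated samples (made-up numbers).
example = AugmentedMatrix(
    values=[[1.0, 3.0, 10.0, 12.0], [5.0, 5.0, 4.0, 6.0]],
    column_names=["ctrl_1", "ctrl_2", "treated_1", "treated_2"],
    row_annotations={"Gene name": ["GENE_A", "GENE_B"]},
    column_annotations={"Treatment": ["control", "control", "treated", "treated"]},
)
control_means = example.group_means("Treatment", "control")
treated_means = example.group_means("Treatment", "treated")
```

Here the column annotation "Treatment" plays the role described above: it tells an activity which samples belong together, so that the same code works for replicates, treatments or time points.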
The main navigation tool is the workflow panel, which is composed of matrices and
activities, and controls the information-flow in a Perseus session (Supplementary Fig.
1). The interactive workflow allows the user to keep track of all steps in the analysis and
to navigate through data matrices and visualization components. It facilitates revisiting
intermediate steps in a complex computational workflow, branching off with alternative
parameter settings or a different combination of activities, and comparing results of
alternative branches to each other. The matrix objects move through the workflow and
are transformed and modified by activities. The workflow itself is a bipartite graph in
which every matrix is connected via an activity to the next matrix. A matrix can have
interactive local visualizations attached (e.g. plots, histograms and heat maps). Activities
can be of a simple single-input structure or they can receive inputs from several matrices
for the purpose of data integration when merging data from two or more different omics
levels (see Box 3).
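The bipartite structure of the workflow, in which matrices and activities alternate, can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the Perseus data model: the names `MatrixNode`, `ActivityNode`, `Workflow`, `log2_transform` and `keep_rows_above` are hypothetical, and the history function follows only the first input of multi-input activities.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional

Matrix = list[list[float]]


@dataclass
class MatrixNode:
    name: str
    data: Matrix
    produced_by: Optional["ActivityNode"] = None  # None for raw input matrices


@dataclass
class ActivityNode:
    name: str
    params: dict
    inputs: list[MatrixNode]


class Workflow:
    """Matrices are only ever connected to each other through activities."""

    def __init__(self) -> None:
        self.matrices: list[MatrixNode] = []
        self.activities: list[ActivityNode] = []

    def load(self, name: str, data: Matrix) -> MatrixNode:
        node = MatrixNode(name, data)
        self.matrices.append(node)
        return node

    def apply(self, name: str, func: Callable[..., Matrix],
              inputs: list[MatrixNode], **params) -> MatrixNode:
        # Record the activity with its parameters, then create the output matrix.
        activity = ActivityNode(name, params, inputs)
        result = MatrixNode(f"{name} result", func(*[m.data for m in inputs], **params), activity)
        self.activities.append(activity)
        self.matrices.append(result)
        return result

    def history(self, node: MatrixNode) -> list[str]:
        """Activities and parameters leading to `node`, oldest first (first input only)."""
        steps = []
        while node.produced_by is not None:
            steps.append(f"{node.produced_by.name} {node.produced_by.params}")
            node = node.produced_by.inputs[0]
        return list(reversed(steps))


def log2_transform(data: Matrix) -> Matrix:
    return [[math.log2(v) for v in row] for row in data]


def keep_rows_above(data: Matrix, threshold: float) -> Matrix:
    return [row for row in data if min(row) >= threshold]


wf = Workflow()
raw = wf.load("raw", [[2.0, 8.0], [1.0, 4.0]])
logged = wf.apply("log2", log2_transform, [raw])
# Branch off twice from the same intermediate matrix with different parameters.
strict = wf.apply("filter", keep_rows_above, [logged], threshold=1.0)
lenient = wf.apply("filter", keep_rows_above, [logged], threshold=0.0)
```

Because each matrix remembers the activity and parameters that produced it, both branches can be compared and every result can be traced back to the raw input, which is the documentation property the workflow panel is meant to provide.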

Frequently Asked Questions (18)
Q1. What are the contributions mentioned in the paper "The Perseus computational platform for comprehensive analysis of (prote)omics data"?

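The paper presents Perseus, a software platform developed at the Max-Planck Institute of Biochemistry (Martinsried, Germany) that supports biological and biomedical researchers in interpreting protein quantification, interaction and post-translational modification data.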

Their guiding principle was to put the expertise of bioinformatics scientists in the hands of all life science researchers, allowing them to focus on their biological questions while benefitting from both powerful statistical tools and cutting edge scalable analytic possibilities without depending on often unavailable specialists. In the future, metabolomics data with relative quantification profiles for a global set of metabolites over several samples, which is similar to label free quantification proteomics data, will be accommodated by Perseus with only slight adaptations such as customization of the annotation of molecular species. One major challenge and opportunity that will drive the future development of Perseus is to bridge the currently existing gap between large-scale proteomics data generation and modeling of signaling pathways and biochemical reactions. As the experimental designs become more and more complex, the functionality of Perseus will be enriched accordingly, building upon its extensible architecture to offer more tools and to support future data types. 

The Learning plug-in in Perseus provides implementation of classification and regression analyses and implements various feature selection methods. 

The machine learning section of Perseus has a cross validation structure for the purpose of measuring how the prediction performance of classification or regression will generalize to independent data that have not been used for model building, thereby avoiding notorious problems such as over-fitting [61].
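The principle of cross validation can be illustrated with a short Python sketch. It assumes a k-fold scheme and uses a nearest-centroid classifier purely because it fits in a few lines; neither choice is taken from the paper, and all function names are hypothetical.

```python
def k_fold_indices(n_samples: int, k: int) -> list[list[int]]:
    """Interleaved split of range(n_samples) into k test folds."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def centroid(rows: list[list[float]]) -> list[float]:
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_nearest_centroid(X: list[list[float]], y: list[str]) -> dict[str, list[float]]:
    """The 'model' is simply one mean profile per class label."""
    return {label: centroid([x for x, l in zip(X, y) if l == label]) for label in set(y)}


def predict(model: dict[str, list[float]], x: list[float]) -> str:
    def squared_distance(c: list[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: squared_distance(model[label]))


def cross_validated_accuracy(X: list[list[float]], y: list[str], k: int) -> float:
    """Each sample is predicted by a model that never saw it during fitting."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [l for i, l in enumerate(y) if i not in held_out]
        model = fit_nearest_centroid(train_X, train_y)
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


# Made-up two-protein profiles for six samples from two patient groups.
profiles = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels = ["healthy"] * 3 + ["disease"] * 3
accuracy = cross_validated_accuracy(profiles, labels, k=3)
```

The key point is that the held-out fold is excluded before the model is fitted, so the reported accuracy estimates performance on independent data rather than on the training samples themselves.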

The time series set of plug-ins of Perseus contains a periodicity analysis component that allows detection of periodic oscillations in protein expression over time. 

GSEA is the forerunner of many methods for analyzing molecular profiling data to determine which sets of genes or proteins are correlated with a phenotypic class distinction.

Other numerical values that serve as annotations such as sequence length, number of identified peptides or posterior error probabilities are stored in ‘Numerical columns’. 

To derive the length of the cycle from the data, a Fourier-based periodicity analysis can be performed that determines the base frequency of periodic expression changes and also allows screening for possible other cycle lengths (e.g. harmonics of the base frequency). 
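A Fourier-based search for the cycle length can be sketched as follows. This is a plain discrete Fourier transform over evenly spaced time points written for illustration; it is an assumption that this resembles the Perseus component, and the function names and the example profile are hypothetical.

```python
import cmath
import math


def power_spectrum(values: list[float]) -> list[tuple[int, float]]:
    """(frequency index k, power) for k = 1 .. n/2 of the mean-centered series."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n) for t, c in enumerate(centered))
        spectrum.append((k, abs(coeff) ** 2))
    return spectrum


def dominant_period(values: list[float], sampling_interval: float) -> float:
    """Period (in the units of sampling_interval) with the highest spectral power."""
    k, _ = max(power_spectrum(values), key=lambda pair: pair[1])
    return len(values) * sampling_interval / k


# Synthetic protein profile: sampled every 2 h for 48 h with a 24 h cycle.
profile = [10 + 3 * math.cos(2 * math.pi * (2 * t) / 24) for t in range(24)]
period_hours = dominant_period(profile, sampling_interval=2.0)
```

Index k counts the number of full cycles within the observation window, so the base frequency appears as the strongest peak and its harmonics would show up at integer multiples of that index.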

Their hope is that this novel platform will contribute to better communication between disciplines and more effective application of computational tools. 

Box 3. Data integration. One of the most laborious and error-prone steps in data analysis is matching and integration of different data types.

The authors believe the latter feature is crucial for the scientific community as it fosters transparency and reproducibility of the reported results. 

Once users have programmed a new plugin they can make it available through the Perseus plugin store (www.perseus-framework.org/plugins).

The authors provide a core set of plugins containing more than 100 activities that are bundled with the standard Perseus download and that can also be re-used in newly developed activities (Supplementary Table 1).

In addition to scores reflecting the reliability of identification and the confidence in the localization of each site in the protein sequence [28, 29], quantitative information is crucial for understanding the functional role of the modification sites.

(a) The amplitude (expression level) and phase (up- or downregulation) are determined by the software by optimizing a cosine function fit to the data. 
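The cosine fit mentioned in this caption can be illustrated with a closed-form least-squares sketch. It assumes that the period is already known and that the time points are evenly spaced over a whole number of cycles, in which case the cosine and sine terms are orthogonal and no iterative optimization is needed. The optimizer actually used by the software is not described here, and the function name `fit_cosine` is hypothetical.

```python
import math


def fit_cosine(times: list[float], values: list[float], period: float):
    """Fit values ~ baseline + amplitude * cos(2*pi*t/period - phase).

    Assumes evenly spaced samples covering a whole number of periods.
    Returns (baseline, amplitude, phase in radians).
    """
    n = len(values)
    omega = 2 * math.pi / period
    baseline = sum(values) / n
    # Project the centered profile onto the cosine and sine components.
    a = 2 / n * sum((v - baseline) * math.cos(omega * t) for t, v in zip(times, values))
    b = 2 / n * sum((v - baseline) * math.sin(omega * t) for t, v in zip(times, values))
    return baseline, math.hypot(a, b), math.atan2(b, a)


# Synthetic profile: 24 h period, sampled every 2 h, peak shifted by pi/3.
times = [2.0 * i for i in range(12)]
values = [5 + 2 * math.cos(2 * math.pi * t / 24 - math.pi / 3) for t in times]
baseline, amplitude, phase = fit_cosine(times, values, period=24.0)
```

The amplitude measures how strongly the protein oscillates, while the phase gives the position of the peak within the cycle, which allows proteins to be ordered by the time of their maximal expression.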

Once an interesting cluster of proteins has been identified, enrichment analysis [25] of biological processes, complexes or pathways is done in a variety of ways, for instance with Fisher’s exact test checking for contingency between cluster membership and the property of interest.
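As an illustration of the enrichment test, the sketch below computes a one-sided Fisher's exact test p-value from the hypergeometric distribution. The one-sided form (testing only for over-representation) is an assumption made for this example, and the function and parameter names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(hits_in_cluster: int, cluster_size: int,
                        hits_total: int, total: int) -> float:
    """Probability of at least `hits_in_cluster` annotated proteins in a random
    cluster of `cluster_size`, drawn from `total` proteins of which
    `hits_total` carry the annotation (cluster members count toward the totals)."""
    upper = min(cluster_size, hits_total)
    favourable = sum(
        comb(hits_total, i) * comb(total - hits_total, cluster_size - i)
        for i in range(hits_in_cluster, upper + 1)
    )
    return favourable / comb(total, cluster_size)


# 20 quantified proteins, 5 carry the annotation, and all 5 fall into a cluster of 5.
p_all_hits = fisher_enrichment_p(5, 5, 5, 20)
# Requiring "at least zero" hits is always satisfied.
p_no_requirement = fisher_enrichment_p(0, 5, 5, 20)
```

In practice such a test is repeated for many annotation terms, so the resulting p-values would still need a multiple-hypothesis correction before individual terms are interpreted.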

The phosphorylation site table is another example, in which such filtering is desirable, as sites with occupancy errors larger than a fixed threshold can be filtered out using a ‘Quality matrix’ containing the site-specific errors. 

Perseus can be downloaded for free from www.perseus-framework.org under acceptance of their freeware license agreement and user account registration.