The Perseus computational platform for comprehensive analysis of (prote)omics data.

- 01 Sep 2016 -

- Vol. 13, Iss: 9, pp 731-740

TLDR

The Perseus software platform was developed to support biological and biomedical researchers in interpreting protein quantification, interaction and post-translational modification data. Its arsenal of algorithms and its intuitive usability are expected to empower interdisciplinary analysis of complex large data sets.

Abstract:

A main bottleneck in proteomics is the downstream biological analysis of highly multivariate quantitative protein abundance data generated using mass-spectrometry-based analysis. We developed the Perseus software platform (http://www.perseus-framework.org) to support biological and biomedical researchers in interpreting protein quantification, interaction and post-translational modification data. Perseus contains a comprehensive portfolio of statistical tools for high-dimensional omics data analysis covering normalization, pattern recognition, time-series analysis, cross-omics comparisons and multiple-hypothesis testing. A machine learning module supports the classification and validation of patient groups for diagnosis and prognosis, and it also detects predictive protein signatures. Central to Perseus is a user-friendly, interactive workflow environment that provides complete documentation of computational methods used in a publication. All activities in Perseus are realized as plugins, and users can extend the software by programming their own, which can be shared through a plugin store. We anticipate that Perseus's arsenal of algorithms and its intuitive usability will empower interdisciplinary analysis of complex large data sets.

Perseus platform for proteomics data

The Perseus computational platform for comprehensive analysis of (prote)omics data

Stefka Tyanova, Tikira Temu, Pavel Sinitcyn, Arthur Carlson, Marco Y. Hein, Tamar Geiger, Matthias Mann and Jürgen Cox

Computational Systems Biochemistry, Max-Planck Institute of Biochemistry, Martinsried, Germany.

Cellular and Molecular Pharmacology, University of California San Francisco, San Francisco, CA, USA.

Human Molecular Genetics and Biochemistry, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel.

Proteomics and Signal Transduction, Max-Planck Institute of Biochemistry, Martinsried, Germany.

*Correspondence: cox@biochem.mpg.de

Running title: Perseus platform for proteomics data

Abstract

A main bottleneck in proteomics is the downstream biological analysis of highly multivariate quantitative protein abundance data. The Perseus software supports researchers in interpreting protein quantification, interaction and posttranslational modification data. It contains a comprehensive portfolio of statistical tools for high-dimensional omics data analysis covering normalization, pattern recognition, time series analysis, cross-omics comparisons and multiple hypothesis testing. A machine learning module supports classification and validation of patient groups for diagnosis and prognosis, also detecting predictive protein signatures. Central to Perseus is a user-friendly, interactive workflow environment providing complete documentation of computational methods used in a publication. All activities in Perseus are realized as plugins and users can extend the software by programming their own, which can be shared through a plugin store. Perseus combines a powerful arsenal of algorithms with intuitive usability by biomedical domain experts, making it suitable for interdisciplinary analysis of complex large datasets.

A decade ago proteomics projects were still labor-intensive and cumbersome, and high-quality results required semi-manual analysis of spectra for identification and quantification. Today, mass spectrometry (MS)-based shotgun proteomics is reaching a level of maturity that makes it a powerful and broadly applicable technology for researchers in biology and biomedical sciences [1, 2]. Consistent automatic processing of spectra and the identification of peptides, proteins and posttranslational modifications (PTMs) with the help of search engines [3-7] and reliable workflows have become standard computational tasks for which satisfactory solutions exist for single studies as well as community-wide data re-analysis [8-10]. Sophisticated computational proteomics platforms offer complete solutions including the quantification of proteins and PTMs over many samples in a large variety of labeling or label-free formats. Public repositories for the storage and dissemination of MS-based proteomics data exist in practical forms [12, 13]. Yeast systems biology can make use of complete proteome quantification in many different conditions or stimuli with modest measurement effort. Starting with a cohort of human samples, protein expression matrices with sample-wise ratios or relative abundances can readily be obtained for more than 10,000 proteins [16-19].

These advances have shifted the bottleneck to the biological interpretation of quantitative abundance and PTM data and to translating the high-dimensional molecular data into relevant findings within the domain of a particular biological or medical investigator. Many potentially important findings are not currently extracted from the data simply because the computational methods and algorithms that would highlight them are not in the hands of the researcher with the necessary domain knowledge to appreciate the meaning of the findings. There are often barriers between informatics and biological researchers, which need to be bridged in order to translate omics technologies into valuable biological or medical discoveries.

Here, we address this problem by creating a computational platform that fulfils two potentially conflicting objectives: (1) All methods should be statistically sound, powerful and comprehensive. (2) It should still be intuitive and easy to use for the domain expert in a biomedical discipline who is not a computational expert. To reach these goals, we developed the Perseus platform in close collaboration with biologists, with whom we analyzed projects involving multiple, diverse and distinct data types and experimental approaches. Experienced Perseus users can perform essentially all the computational tasks alone, even with little or no formal bioinformatic training. They can still involve programmers and bioinformatics specialists to extend the functionality of Perseus with plug-ins that add to the Perseus workflow as custom activities. Here we describe the functionalities available in version 1.5.4.0 of Perseus.

Comprehensive workflow-based data analysis platform

Downstream analysis of proteomic data is a multi-faceted and demanding field that integrates many aspects of bioinformatics, statistics and machine learning. It is common practice to hire bioinformaticians to help the biological researchers with various analytical problems. Often these efforts result in multiple small scripts that are tedious to maintain and scale and that require the help of the developer to be re-used or stitched together. This approach is bound to turn downstream data analysis into a major bottleneck for scientific projects and discoveries. Furthermore, the results may be of questionable validity when there is no clear documentation and transparency about the methods and scripts employed. We thus set out to develop the Perseus platform as a holistic software package that allows continuous expansion of scalable analytical tools, their smooth integration and re-usability while providing the user with explicit documentation of the analysis steps and parameters. Greater detail on the implementation and download of Perseus is provided in Box 1.

Perseus offers a wide range of algorithmic activities that cover topics ranging from data normalization through exploratory multivariate data analysis to integration with other omics levels (Fig. 1). The following sub-sections describe the various computational and statistical tools in Perseus. Several complete analysis workflows are available on our DokuWiki pages (http://coxdocs.org/doku.php?id=perseus:user:use_cases), which contain step-by-step descriptions of three standard proteomics project types. Together with the YouTube videos (https://www.youtube.com/channel/UCKYzYTm1cnmc0CFAMhxDO8w), they represent a valuable resource for first-time users. Many activities produce interactive graphical output for the visualization of data analysis results, which scales easily to very large sets of input data and therefore allows for thorough inspection by the user even for large-scale experiments with complex experimental designs and many measured variables. Any plot can be exported in a number of graphical formats and edited in standard vector graphics editors upon release of all clipping masks.

The central data type in Perseus is the ‘augmented data matrix’, which typically represents expression or abundance values of genes or proteins (rows) and biological samples or technical replicates (columns). It is supplemented by additional data containers for annotation of the rows, columns and cells of the matrix (see Box 2). These annotation containers are automatically filled in Perseus with gene or protein information derived from publicly available ontologies, pathways and annotation databases. Sample annotations are used in many activities to define the study design, such as to designate which samples are replicates, or which belong to different treatments or time points in a time series analysis.
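The idea of a matrix that carries its own row and column annotations can be sketched in a few lines of Python. This is only an illustration: the class `AugmentedMatrix`, its field names and the toy values are assumptions made for this example and do not reflect how Perseus implements the data type.

```python
from dataclasses import dataclass, field


@dataclass
class AugmentedMatrix:
    """Hypothetical stand-in for an augmented data matrix (not Perseus code)."""
    row_names: list[str]        # e.g. protein or gene identifiers
    column_names: list[str]     # e.g. sample names
    values: list[list[float]]   # values[i][j]: abundance of row i in sample j
    row_annotations: dict[str, list[str]] = field(default_factory=dict)
    column_annotations: dict[str, list[str]] = field(default_factory=dict)

    def columns_in_group(self, annotation: str, group: str) -> list[int]:
        """Indices of the samples whose annotation equals the given group."""
        labels = self.column_annotations[annotation]
        return [j for j, label in enumerate(labels) if label == group]

    def group_means(self, annotation: str, group: str) -> list[float]:
        """Row-wise mean over the samples of one group, e.g. over replicates."""
        cols = self.columns_in_group(annotation, group)
        return [sum(row[j] for j in cols) / len(cols) for row in self.values]


# Toy data: two proteins measured in two control and two treated samples.
matrix = AugmentedMatrix(
    row_names=["P1", "P2"],
    column_names=["ctrl_1", "ctrl_2", "treated_1", "treated_2"],
    values=[[20.0, 22.0, 25.0, 27.0],
            [18.0, 18.0, 17.0, 19.0]],
    row_annotations={"Pathway": ["Glycolysis", "Ribosome"]},
    column_annotations={"Treatment": ["ctrl", "ctrl", "treated", "treated"]},
)

control_means = matrix.group_means("Treatment", "ctrl")
```

Here the column annotation "Treatment" plays the role of the study design: any calculation that needs to know which samples are replicates of each other looks the grouping up in the annotation instead of relying on column order.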

The main navigation tool is the workflow panel, which is composed of matrices and activities, and controls the information-flow in a Perseus session (Supplementary Fig. 1). The interactive workflow allows the user to keep track of all steps in the analysis and to navigate through data matrices and visualization components. It facilitates revisiting intermediate steps in a complex computational workflow, branching off with alternative parameter settings or a different combination of activities, and comparing results of alternative branches to each other. The matrix objects move through the workflow and are transformed and modified by activities. The workflow itself is a bipartite graph in which every matrix is connected via an activity to the next matrix. A matrix can have interactive local visualizations attached (e.g. plots, histograms and heat maps). Activities can be of a simple single-input structure or they can receive inputs from several matrices for the purpose of data integration when merging data from two or more different omics levels (see Box 3).
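A workflow of this kind, with matrices and activities as the two node types, can be sketched as follows. The `Workflow` class, the three toy activities and all names are hypothetical and chosen only for illustration; they are not part of the Perseus plugin API.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """Hypothetical bipartite workflow: matrix nodes linked by activity nodes."""
    matrices: dict = field(default_factory=dict)    # matrix name -> matrix
    activities: list = field(default_factory=list)  # (activity, inputs, output)

    def load(self, name, matrix):
        self.matrices[name] = matrix

    def apply(self, activity, func, inputs, output):
        """Run an activity on one or more input matrices and store the result."""
        self.matrices[output] = func(*(self.matrices[name] for name in inputs))
        self.activities.append((activity, tuple(inputs), output))
        return self.matrices[output]

    def history(self, name):
        """Activities that produced the named matrix, oldest first."""
        for activity, inputs, output in self.activities:
            if output == name:
                steps = []
                for parent in inputs:
                    steps += [s for s in self.history(parent) if s not in steps]
                return steps + [activity]
        return []  # a loaded matrix has no upstream activity


def log2_transform(m):
    return [[math.log2(v) for v in row] for row in m]


def center_rows(m):
    return [[v - sum(row) / len(row) for v in row] for row in m]


def join_columns(a, b):
    # Multi-input activity: append the columns of b to the rows of a.
    return [row_a + row_b for row_a, row_b in zip(a, b)]


wf = Workflow()
wf.load("proteome", [[4.0, 16.0], [8.0, 8.0]])
wf.load("transcriptome", [[1.0], [2.0]])
wf.apply("log2", log2_transform, ["proteome"], "proteome_log2")
wf.apply("center rows", center_rows, ["proteome_log2"], "proteome_centered")
wf.apply("join columns", join_columns,
         ["proteome_centered", "transcriptome"], "joined")
```

Because every intermediate matrix stays in `wf.matrices`, an alternative branch can start from any earlier matrix, and `wf.history("joined")` recovers the documented sequence of steps that led to a result.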

References

Yoav Benjamini et al. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society..., 01 Jan 1995.

Chih-Chung Chang et al. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems ..., 06 May 2011.

Vladimir Vapnik. The Nature of Statistical Learning Theory. Book.

Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles.

Paul Shannon et al. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Research, 01 Nov 2003.

Frequently Asked Questions (18)

Q1. What are the contributions mentioned in the paper "The perseus computational platform for comprehensive analysis of (prote)omics data" ?

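The authors developed the Perseus software platform to support biological and biomedical researchers in interpreting protein quantification, interaction and post-translational modification data. It combines a portfolio of statistical tools for high-dimensional omics data analysis with a machine learning module, an interactive workflow environment and a plugin architecture.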

Q2. What have the authors stated for future works in "The perseus computational platform for comprehensive analysis of (prote)omics data" ?

Their guiding principle was to put the expertise of bioinformatics scientists in the hands of all life science researchers, allowing them to focus on their biological questions while benefitting from both powerful statistical tools and cutting-edge scalable analytic possibilities without depending on often unavailable specialists. In the future, metabolomics data with relative quantification profiles for a global set of metabolites over several samples, which is similar to label-free quantification proteomics data, will be accommodated by Perseus with only slight adaptations such as customization of the annotation of molecular species. One major challenge and opportunity that will drive the future development of Perseus is to bridge the currently existing gap between large-scale proteomics data generation and modeling of signaling pathways and biochemical reactions. As the experimental designs become more and more complex, the functionality of Perseus will be enriched accordingly, building upon its extensible architecture to offer more tools and to support future data types.

Q3. What is the learning plug-in for perseus?

The Learning plug-in in Perseus provides implementations of classification and regression analyses as well as various feature selection methods.

Q4. What is the purpose of the machine learning section of Perseus?

The machine learning section of Perseus has a cross validation structure for the purpose of measuring how the prediction performance of classification or regression will generalize to independent data that have not been used for model building, thereby avoiding notorious problems such as over-fitting [61].
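A minimal sketch of k-fold cross validation shows the principle: every sample is predicted by a model that was built without it. The nearest-centroid classifier below is a deliberately simple stand-in chosen for this example, not the classifier used in Perseus, and all function names and data are assumptions.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_centroid_predict(train_x, train_y, x):
    """Assign x to the class whose mean profile is closest (squared distance)."""
    centroids = {}
    for label in set(train_y):
        members = [row for row, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], x))


def cross_validated_accuracy(xs, ys, k=4, seed=0):
    """Fraction of samples classified correctly while held out of training."""
    correct = 0
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        for i in test_idx:
            correct += nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]
    return correct / len(xs)


# Toy data: two clearly separated groups of four samples, two features each.
xs = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [1.0, 0.9],
      [5.0, 5.1], [4.8, 5.2], [5.2, 4.9], [5.1, 5.0]]
ys = ["healthy"] * 4 + ["disease"] * 4

accuracy = cross_validated_accuracy(xs, ys, k=4)
```

Any feature selection would have to be repeated inside each training fold; selecting features on the full data set first would leak information from the held-out samples and inflate the estimate.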

Q5. What is the function of the time series set of plug-ins?

The time series set of plug-ins of Perseus contains a periodicity analysis component that allows detection of periodic oscillations in protein expression over time.

Q6. What is the forerunner of many methods for analyzing molecular profiling data?

GSEA is the forerunner of many methods for analyzing molecular profiling data to determine which sets of genes or proteins are correlated with a phenotypic class distinction.

Q7. What are the other numerical values that serve as annotations?

Other numerical values that serve as annotations such as sequence length, number of identified peptides or posterior error probabilities are stored in ‘Numerical columns’.

Q8. What is the way to determine the length of the cycle?

To derive the length of the cycle from the data, a Fourier-based periodicity analysis can be performed that determines the base frequency of periodic expression changes and also allows screening for possible other cycle lengths (e.g. harmonics of the base frequency).
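The Fourier step can be sketched with a plain discrete Fourier transform: the frequency with the largest power gives the cycle length. This is a minimal illustration assuming evenly spaced time points; the function names and the simulated profile are made up for the example and are not taken from Perseus.

```python
import cmath
import math


def power_spectrum(values):
    """Power at each non-zero DFT frequency index k = 1 .. n // 2."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]  # remove the constant offset
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        spectrum.append((k, abs(coeff) ** 2))
    return spectrum


def dominant_period(values, sampling_interval):
    """Cycle length (in time units) of the strongest periodic component."""
    n = len(values)
    k, _ = max(power_spectrum(values), key=lambda item: item[1])
    return n * sampling_interval / k


# Simulated protein profile: 24 h rhythm sampled every 4 h for 48 h.
profile = [10 + 2 * math.cos(2 * math.pi * t / 24) for t in range(0, 48, 4)]
period = dominant_period(profile, sampling_interval=4.0)
```

Inspecting the remaining entries of `power_spectrum` is one way to screen for other cycle lengths, such as harmonics of the base frequency.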

Q9. What is the hope of the authors?

Their hope is that this novel platform will contribute to better communication between disciplines and more effective application of computational tools.

Q10. What is the laborious step in data analysis?

Box 3. Data integration: One of the most laborious and error-prone steps in data analysis is matching and integration of different data types.

Q11. What is the main reason why Perseus is important to the scientific community?

The authors believe the latter feature is crucial for the scientific community as it fosters transparency and reproducibility of the reported results.

Q12. What is the common way to make a plugin available?

Once users have programmed a new plugin they can make it available through the Perseus plugin store (www.perseus-framework.org/plugins).

Q13. What is the core set of plugins in perseus?

The authors provide a core set of plugins containing more than 100 activities that are bundled with the standard Perseus download and that can also be re-used in newly developed activities (Supplementary Table 1).

Q14. What is the importance of quantitative information for understanding the functional role of the modification sites?

In addition to scores reflecting the reliability of identification and the confidence in the localization of each site in the protein sequence [28, 29], quantitative information is crucial for understanding the functional role of the modification sites.

Q15. How is the amplitude and phase determined by the software?

(a) The amplitude (expression level) and phase (up- or downregulation) are determined by the software by optimizing a cosine function fit to the data.
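For a known cycle length, a cosine fit can be written as a linear least-squares problem. The sketch below assumes evenly spaced samples that cover a whole number of periods with more than two samples per period, in which case the least-squares solution reduces to simple projections; `fit_cosine` and the simulated data are illustrative and are not the optimizer used in Perseus.

```python
import math


def fit_cosine(times, values, period):
    """Fit values ~ baseline + amplitude * cos(omega * (t - peak_time)).

    Assumes evenly spaced samples over a whole number of periods, so the
    cosine and sine regressors are orthogonal and projections suffice.
    """
    n = len(values)
    omega = 2 * math.pi / period
    baseline = sum(values) / n
    a = 2 / n * sum(v * math.cos(omega * t) for t, v in zip(times, values))
    b = 2 / n * sum(v * math.sin(omega * t) for t, v in zip(times, values))
    amplitude = math.hypot(a, b)
    peak_time = (math.atan2(b, a) / omega) % period
    return baseline, amplitude, peak_time


# Simulated profile peaking 6 h into a 24 h cycle, sampled every 4 h for 48 h.
times = list(range(0, 48, 4))
values = [10 + 2 * math.cos(2 * math.pi * (t - 6) / 24) for t in times]
baseline, amplitude, peak_time = fit_cosine(times, values, period=24)
```

With irregular sampling or incomplete cycles the projections no longer give the least-squares solution, and a general linear or nonlinear fit would be needed instead.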

Q16. What is the common way to analyze protein clusters?

Once an interesting cluster of proteins has been identified, enrichment analysis [25] of biological processes, complexes or pathways is done in a variety of ways, for instance with Fisher’s exact test checking for contingency between cluster membership and the property of interest.
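As an illustration of the test, the sketch below computes a one-sided Fisher's exact p-value for enrichment from the 2 x 2 table of cluster membership versus annotation. The one-sided variant and the function name `fisher_enrichment_p` are choices made for this example; the options Perseus offers may differ.

```python
from math import comb


def fisher_enrichment_p(in_cluster_with, in_cluster_without,
                        outside_with, outside_without):
    """One-sided Fisher's exact test for over-representation in the cluster.

    Returns the probability of seeing at least `in_cluster_with` annotated
    proteins in the cluster if membership and annotation were independent.
    """
    cluster_size = in_cluster_with + in_cluster_without
    annotated = in_cluster_with + outside_with
    total = cluster_size + outside_with + outside_without
    upper = min(cluster_size, annotated)
    favourable = sum(
        comb(annotated, i) * comb(total - annotated, cluster_size - i)
        for i in range(in_cluster_with, upper + 1)
    )
    return favourable / comb(total, cluster_size)


# Toy example: 8 proteins, 4 of them in the cluster, 4 carrying the annotation,
# and 3 of the annotated proteins falling inside the cluster.
p_value = fisher_enrichment_p(3, 1, 1, 3)
```

When many annotation terms are tested for the same cluster, the resulting p-values need a multiple-testing correction before they are interpreted.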

Q17. What is the way to filter out phosphorylation site errors?

The phosphorylation site table is another example, in which such filtering is desirable, as sites with occupancy errors larger than a fixed threshold can be filtered out using a ‘Quality matrix’ containing the site-specific errors.
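Filtering by a cell-level quality value can be sketched as masking: a main-matrix entry is kept only if the entry at the same position in the quality matrix passes the threshold. The function names, the toy occupancies and the threshold of 0.2 are assumptions for illustration only.

```python
import math


def mask_by_quality(values, quality, max_error):
    """Replace entries whose quality value exceeds max_error with NaN."""
    return [
        [v if q <= max_error else math.nan for v, q in zip(value_row, quality_row)]
        for value_row, quality_row in zip(values, quality)
    ]


def valid_fraction(row):
    """Share of entries in a row that survived the filtering."""
    return sum(not math.isnan(v) for v in row) / len(row)


# Toy data: occupancies of two sites in three samples with site-specific errors.
occupancies = [[0.8, 0.6, 0.7],
               [0.2, 0.9, 0.4]]
errors = [[0.05, 0.30, 0.10],
          [0.02, 0.04, 0.50]]

filtered = mask_by_quality(occupancies, errors, max_error=0.2)
```

Rows can then be kept or dropped according to `valid_fraction`, for example by requiring a minimum share of reliable values per site.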

Q18. What is the free version of Perseus?

Perseus can be downloaded for free from www.perseus-framework.org under acceptance of their freeware license agreement and user account registration.
