
Journal ArticleDOI

Reply to: Examining microbe-metabolite correlations by linear methods

04 Jan 2021-Nature Methods (Springer Science and Business Media LLC)-Vol. 18, Iss: 1, pp 40-41

Abstract: Author(s): Morton, James T; McDonald, Daniel; Aksenov, Alexander A; Nothias, Louis Felix; Foulds, James R; Quinn, Robert A; Badri, Michelle H; Swenson, Tami L; Van Goethem, Marc W; Northen, Trent R; Vazquez-Baeza, Yoshiki; Wang, Mingxun; Bokulich, Nicholas A; Watters, Aaron; Song, Se Jin; Bonneau, Richard; Dorrestein, Pieter C; Knight, Rob

Summary (2 min read)

Introduction

  • The authors have found that MMvec is a powerful discovery tool, as demonstrated by the other real datasets the authors evaluated in the original article.

Reply to: Examining microbe–metabolite correlations by linear methods

  • James T. Morton 1,2, Daniel McDonald 1,3, Alexander A. Aksenov 4,5, Louis Felix Nothias 4,5, James R. Foulds 6, Robert A. Quinn 7, Michelle H. Badri 8, Tami L. Swenson 9, Marc W. Van Goethem 9, Trent R. Northen 9,10, Yoshiki Vazquez-Baeza 3,11, Mingxun Wang 4,5, Nicholas A. Bokulich 12,13, Aaron Watters 14, Se Jin Song 1,3, Richard Bonneau 8,14,15,16, Pieter C. Dorrestein 4,5 and Rob Knight 1,2,17,18 ✉, replying to T. P. Quinn & I. Erb, Nature Methods https://doi.org/10.1038/s41592-020-01006-1 (2020).
  • It is critical that the authors provide accurate guidance to the community so that scenarios where one method works better than others are well understood.
  • While there may be scenarios where linear methods outperform neural networks, the authors show that there are scenarios where neural networks outperform linear methods.
  • Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/.

Methods

  • The simulations were created by using the generative form of MMvec; the microbe and metabolite factor loadings were randomly generated from a normal distribution to parameterize the MMvec parameters.
  • Microbial counts were then drawn from a multinomial logistic normal distribution and fed into MMvec to generate the metabolite counts.
  • To identify scenarios where CLR correlations underperformed in comparison to MMvec, the authors used Bayesian Optimization to tune the distributions used to generate the simulations.
  • The CLR-transformed correlations suggested by Quinn and Erb were benchmarked on the desert biocrust soils dataset using the R scripts provided in ref.
  • Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

  • The datasets to reproduce the results presented here can be found at https://github.com/knightlab-analyses/multiomic-cooccurrences.

Code availability

  • The analysis software to reproduce the results presented here can be found at https://github.com/knightlab-analyses/multiomic-cooccurrences.

Author contributions

  • J.T.M. performed all analyses and wrote the manuscript.
  • All authors have contributed edits to the manuscript.

Competing interests

  • The remaining authors declare no competing interests.

Additional information

  • Supplementary information is available for this paper at https://doi.org/10.1038/.
  • The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one- or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
  • All manuscripts must include a data availability statement.
  • For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf.
  • Life sciences study design, Randomization: randomization was not necessary, since the data was simulated, not collected.


Lawrence Berkeley National Laboratory
Recent Work
Title
Reply to: Examining microbe-metabolite correlations by linear methods.
Permalink
https://escholarship.org/uc/item/3827k7p9
Journal
Nature methods, 18(1)
ISSN
1548-7091
Authors
Morton, James T
McDonald, Daniel
Aksenov, Alexander A
et al.
Publication Date
2021
DOI
10.1038/s41592-020-01007-0
Peer reviewed
eScholarship.org Powered by the California Digital Library
University of California

Matters arising
https://doi.org/10.1038/s41592-020-01007-0
1 Department of Pediatrics, University of California, San Diego, La Jolla, CA, USA.
2 Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA.
3 Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, USA.
4 Collaborative Mass Spectrometry Innovation Center, University of California, San Diego, La Jolla, CA, USA.
5 Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, CA, USA.
6 Department of Information Systems, University of Maryland–Baltimore County, Baltimore, MD, USA.
7 Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA.
8 Department of Biology, New York University, New York, NY, USA.
9 Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
10 DOE Joint Genome Institute, Walnut Creek, CA, USA.
11 Jacobs School of Engineering, University of California, San Diego, La Jolla, CA, USA.
12 The Pathogen and Microbiome Institute, Northern Arizona University, Flagstaff, AZ, USA.
13 Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ, USA.
14 Flatiron Institute, Simons Foundation, New York, NY, USA.
15 Department of Computer Science, Courant Institute, New York, NY, USA.
16 Center for Data Science, New York University, New York, NY, USA.
17 Department of Bioengineering, University of California, San Diego, La Jolla, CA, USA.
18 Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, USA.
e-mail: rknight@ucsd.edu
Quinn and Erb¹ propose to apply a centered log-ratio (CLR) transform before performing correlation analysis and make the case that, when used correctly, correlation and proportionality can outperform MMvec in identifying microbe–metabolite interactions. While this may be an appealing strategy, it is important to note that the correlations estimated from CLR-transformed data will have a fundamentally different interpretation than the true correlations in the environment, namely:
$$\operatorname{Cov}\left(x_i, y_j\right) \neq \operatorname{Cov}\left(\operatorname{clr}(x)_i, \operatorname{clr}(y)_j\right)$$
where x_i and y_j are the absolute abundances of microbe i and metabolite j, and x and y are the corresponding vectors of microbe and metabolite abundances. Because the absolute abundances are often not available, inferring the true correlations between microbes and metabolites is not tractable (Supplementary Note 1). This phenomenon has been extensively studied in refs. 2–4, and one of our recent studies provides the intuition behind this in the case of differential abundance⁵. Because of this discrepancy, we proposed to use co-occurrence probabilities instead of correlation.
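The discrepancy between the two covariances can be illustrated with a small worked example. The sketch below is not taken from the paper: the abundance values are hypothetical, and the helper names `clr` and `cov` are chosen only for illustration. One taxon is held constant in absolute terms, so its covariance with any metabolite is zero, yet after the CLR transform it appears to co-vary with a metabolite because another taxon in the same sample is growing.

```python
import math

def clr(row):
    """Centered log-ratio: each log value minus the mean log of its row."""
    logs = [math.log(v) for v in row]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (n - 1)

# Hypothetical absolute abundances (rows = samples, columns = features).
# Taxon 0 and metabolite 0 double from sample to sample; all else is constant.
microbes = [[10, 10, 10], [20, 10, 10], [40, 10, 10], [80, 10, 10]]
metabolites = [[5, 5, 5], [10, 5, 5], [20, 5, 5], [40, 5, 5]]

i, j = 1, 0  # a constant taxon paired with the changing metabolite

# Covariance on absolute abundances: taxon 1 never changes, so this is zero.
cov_abs = cov([row[i] for row in microbes], [row[j] for row in metabolites])

# Covariance after CLR: taxon 1 now appears to decline as taxon 0 grows.
clr_microbes = [clr(row) for row in microbes]
clr_metabolites = [clr(row) for row in metabolites]
cov_clr = cov([row[i] for row in clr_microbes], [row[j] for row in clr_metabolites])

print(f"absolute: {cov_abs:.3f}, CLR: {cov_clr:.3f}")
```

In this toy case the absolute covariance is exactly zero while the CLR covariance is negative, which is one concrete way the two quantities can carry different interpretations.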
We relied on simulated data in the original paper⁶ as an artificial ground truth, as is common in the evaluation of omics tools. However, simulated data will always have limitations because of the inability to model unknown features of the real system or because of deliberate simplifications that clarify key points in the model system. Furthermore, it is possible to identify simulations where a proposed model is optimal. In Fig. 1, we used Bayesian Optimization⁷ to identify simulations where MMvec was able to accurately estimate the correct parameters and Pearson underperformed. If the appropriate assumptions are satisfied, MMvec can correctly estimate the co-occurrence probabilities with machine precision.
Therefore, a crucial aspect of the MMvec manuscript was to test performance both on simulations and on real data. Performance on real data is the ultimate test of methods, and we recommend that simulated datasets be complemented with experimentally validated datasets where possible. Accordingly, we applied the same proportionality-based scripts described by Quinn and Erb¹ and evaluated them on one of the real datasets we used in the MMvec paper.
A major obstacle to analyzing real-world microbiome and metabolomics data is sparsity. Traditional compositional methods such as the proposed CLR transform cannot automatically deal with zeros and require imputation as a preprocessing step. This imputation adds bias and is impractical for the sparse datasets typically encountered⁸,⁹. Microbiome and untargeted metabolomics datasets are generally sparse: in large studies, such as the American Gut Project¹⁰, the sparsity for stool samples alone is 99.946%. MMvec was designed to handle sparse data. In the desert biocrust soils dataset (sparsity of 51%; ref. 11) that was used in the MMvec publication, we observe that MMvec dramatically outperformed the newly proposed linear methods (Fig. 2).
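The zero-handling problem described above can be made concrete with a short sketch. It assumes sparsity is measured as the fraction of zero cells in a count table, and it uses a simple additive pseudocount as a stand-in for imputation; the count table, the `sparsity` and `clr` helpers, and the pseudocount values are all hypothetical and are not the procedure used in either paper.

```python
import math

def sparsity(table):
    """Fraction of cells in a count table that are exactly zero."""
    cells = [v for row in table for v in row]
    return sum(1 for v in cells if v == 0) / len(cells)

def clr(row, pseudocount=0.0):
    """CLR of one sample; the pseudocount is added to every cell first."""
    logs = [math.log(v + pseudocount) for v in row]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Hypothetical count table (rows = samples, columns = features).
counts = [[0, 3, 0, 0], [5, 0, 0, 1], [0, 0, 0, 2]]
table_sparsity = sparsity(counts)  # 8 of the 12 cells are zero

# Without imputation the transform is undefined, because log(0) does not exist.
try:
    clr(counts[0])
    raw_clr_defined = True
except ValueError:
    raw_clr_defined = False

# With imputation, the result depends on the arbitrary pseudocount chosen.
clr_one = clr(counts[0], pseudocount=1.0)
clr_half = clr(counts[0], pseudocount=0.5)
shift = clr_one[1] - clr_half[1]  # same observed count, different CLR value

print(f"sparsity={table_sparsity:.2f}, CLR defined on raw zeros: {raw_clr_defined}")
```

The same observed count of 3 maps to different CLR values under the two pseudocounts, which is one simple way to see how the imputation step can introduce bias.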
Contrary to the argument by Quinn and Erb¹ regarding the complexity of neural networks, the MMvec model⁶ is not much more complex than the proposed regression techniques. It is a simple one-layer neural network, which is in effect a two-stage log–bilinear regression.
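To give a sense of how small such a model is, the sketch below implements a generic log-bilinear form: the conditional probability of each metabolite given a microbe is a softmax over inner products of a microbe vector and the metabolite vectors, plus a per-metabolite bias. This is an illustration under that assumption, not MMvec's implementation or API; the names `U`, `V`, `bias` and `conditional_probs` and all numeric values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def conditional_probs(u_i, V, bias):
    """P(metabolite j | microbe i) from a log-bilinear score u_i . v_j + b_j."""
    scores = [sum(a * b for a, b in zip(u_i, v_j)) + b_j
              for v_j, b_j in zip(V, bias)]
    return softmax(scores)

# Hypothetical 2-dimensional embeddings: 2 microbes, 3 metabolites.
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]

probs = [conditional_probs(u_i, V, bias) for u_i in U]
for i, row in enumerate(probs):
    print(f"microbe {i}: " + ", ".join(f"{p:.3f}" for p in row))
```

Each microbe assigns the highest conditional probability to the metabolite whose vector points in the same direction as its own, and each row sums to one.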
Methods similar to MMvec have been successful at the task of learning word co-occurrences. Since Mikolov et al.¹², these models have been designed with an emphasis on practical methods for learning useful word representations at scale, rather than on perfectly modeling the data distribution.
MMvec is only one tool in the arsenal of correlative methods. It is not perfect for every correlation type or dataset and is not a one-size-fits-all solution. However, we have found that MMvec is a powerful discovery tool, as demonstrated by the other real datasets we evaluated in the original article. It is critical that we provide accurate guidance to the community so that scenarios where one method works better than others are well understood. While there may be scenarios where linear methods outperform neural networks, we show that there are scenarios where neural networks outperform linear methods. We appreciate the communication on the topic to the extent that it helps the community better understand the advantages and limitations of the different approaches and prompts the community to continue to innovate in this area.

Reply to: Examining microbe–metabolite correlations by linear methods
James T. Morton 1,2, Daniel McDonald 1,3, Alexander A. Aksenov 4,5, Louis Felix Nothias 4,5, James R. Foulds 6, Robert A. Quinn 7, Michelle H. Badri 8, Tami L. Swenson 9, Marc W. Van Goethem 9, Trent R. Northen 9,10, Yoshiki Vazquez-Baeza 3,11, Mingxun Wang 4,5, Nicholas A. Bokulich 12,13, Aaron Watters 14, Se Jin Song 1,3, Richard Bonneau 8,14,15,16, Pieter C. Dorrestein 4,5 and Rob Knight 1,2,17,18 ✉
replying to T. P. Quinn & I. Erb Nature Methods https://doi.org/10.1038/s41592-020-01006-1 (2020)

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41592-020-01007-0.
Received: 17 March 2020; Accepted: 27 October 2020;
Published: xx xx xxxx
References
1. Quinn, T. P. & Erb, I. Examining microbe–metabolite correlations by linear methods. Nat. Methods https://doi.org/10.1038/s41592-020-01006-1 (2020).
2. Aitchison, J. A concise guide to compositional data analysis. http://www.leg.ufpr.br/lib/exe/fetch.php/pessoais:abtmartins:a_concise_guide_to_compositional_data_analysis.pdf (2003).
3. Filzmoser, P. & Hron, K. Correlation analysis for compositional data. Math. Geosci. 41, 905 (2009).
4. Friedman, J. & Alm, E. J. Inferring correlation networks from genomic survey data. PLoS Comput. Biol. 8, e1002687 (2012).
5. Morton, J. T. et al. Establishing microbial composition measurement standards with reference frames. Nat. Commun. 10, 2719 (2019).
6. Morton, J. T. et al. Learning representations of microbe–metabolite interactions. Nat. Methods 16, 1306–1314 (2019).
7. Nogueira, F. Bayesian Optimization: open source constrained global optimization tool for Python. https://github.com/fmfn/BayesianOptimization (2014).
8. Martín-Fernández, J. A., Barceló-Vidal, C. & Pawlowsky-Glahn, V. Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Math. Geol. 35, 253–278 (2003).
9. Silverman, J. D., Roche, K., Mukherjee, S. & David, L. A. Naught all zeros in sequence count data are the same. Preprint at bioRxiv https://doi.org/10.1101/477794 (2018).
10. McDonald, D. et al. American Gut: an open platform for citizen science microbiome research. mSystems 3, e00031-18 (2018).
11. Swenson, T. L., Karaoz, U., Swenson, J. M., Bowen, B. P. & Northen, T. R. Linking soil biology and chemistry in biological soil crust using isolate exometabolomics. Nat. Commun. 9, 19 (2018).
12. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. in Advances in Neural Information Processing Systems 3111–3119 (2013).
13. Aitchison, J. The statistical analysis of compositional data. J. R. Stat. Soc. B 44, 139–160 (1982).
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
© The Author(s), under exclusive licence to Springer Nature America, Inc. 2021
[Figure 1: six scatter-plot panels, a–f. x axis: estimated co-occurrences; y axis: ground truth co-occurrences. Panel titles: MMvec (R = 0.999, P = 0.000); MMvec (R = 0.947, P = 0.000); Pearson (R = 0.002, P = 0.809); Pearson (R = 0.150, P = 0.000); CLR-transformed Pearson (R = 0.004, P = 0.714); CLR-transformed Pearson (R = 0.050, P = 0.000).]
Fig. 1 | A simulation benchmark comparing MMvec to Pearson. Simulations were obtained through Bayesian Optimization⁷ to showcase scenarios where MMvec outperforms Pearson. a–c, Simulation of a scenario where the microbiome dataset is 99% dense. d–f, Simulation of a scenario where the microbiome dataset is 60% dense. All axes are represented on a log scale. Pearson’s R is used to measure the agreement between the simulated ground truth co-occurrences and the estimated co-occurrences.
[Figure 2: plot titled "Microcoleus molecular detection rate". x axis: top K hits; y axis: true positives. Series: MMvec, Spearman, Pearson, φ, ρ.]
Fig. 2 | Biocrust soils benchmark. A comparison of MMvec to metrics proposed by Quinn and Erb¹. These proposed metrics include Spearman, Pearson, φ and ρ applied after a CLR transformation¹³.
Methods
The simulations were created by using the generative form of MMvec; the microbe and metabolite factor loadings were randomly generated from a normal distribution to parameterize the MMvec parameters. Microbial counts were then drawn from a multinomial logistic normal distribution and fed into MMvec to generate the metabolite counts. To identify scenarios where CLR correlations underperformed in comparison to MMvec, we used Bayesian Optimization to tune the distributions used to generate the simulations.
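The generative steps described above can be sketched roughly as follows. This is a simplified illustration, not the authors' simulation code (which is in the repository listed under Code availability): it assumes standard normal factor loadings, a fixed sequencing depth, no bias terms, and one metabolite observation emitted per microbial read. The function name `simulate` and all of its parameters are hypothetical, and the Bayesian Optimization step that tunes the distributions is omitted.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simulate(n_samples=5, n_microbes=4, n_metabolites=3,
             latent_dim=2, depth=50, seed=0):
    rng = random.Random(seed)
    # Factor loadings drawn from a standard normal distribution.
    U = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(n_microbes)]
    V = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(n_metabolites)]
    # Conditional metabolite probabilities per microbe: softmax of u_i . v_j.
    cond = [softmax([sum(a * b for a, b in zip(u, v)) for v in V]) for u in U]

    microbe_counts, metabolite_counts = [], []
    for _ in range(n_samples):
        # Logistic normal proportions: softmax of normal draws.
        props = softmax([rng.gauss(0, 1) for _ in range(n_microbes)])
        # Multinomial microbial counts at a fixed depth.
        draws = rng.choices(range(n_microbes), weights=props, k=depth)
        x = [draws.count(i) for i in range(n_microbes)]
        # Assumption: each microbial read emits one metabolite observation
        # drawn from that microbe's conditional distribution.
        y = [0] * n_metabolites
        for i, count in enumerate(x):
            for j in rng.choices(range(n_metabolites), weights=cond[i], k=count):
                y[j] += 1
        microbe_counts.append(x)
        metabolite_counts.append(y)
    return microbe_counts, metabolite_counts

microbe_counts, metabolite_counts = simulate()
print(microbe_counts[0], metabolite_counts[0])
```

Because the metabolite counts are generated directly from the conditional probabilities, those probabilities serve as the ground truth against which estimated co-occurrences can be compared.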
The CLR-transformed correlations suggested by Quinn and Erb were benchmarked on the desert biocrust soils dataset using the R scripts provided in ref. 1.
Reporting Summary. Further information on research design is available in the
Nature Research Reporting Summary linked to this article.
Data availability
The datasets to reproduce the results presented here can be found at https://github.com/knightlab-analyses/multiomic-cooccurrences.
Code availability
The analysis software to reproduce the results presented here can be found at
https://github.com/knightlab-analyses/multiomic-cooccurrences.
Author contributions
J.T.M. performed all analyses and wrote the manuscript. All authors have contributed
edits to the manuscript.
Competing interests
M.W. is the founder of Ometa Labs. The remaining authors declare no competing interests.
Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41592-020-01007-0.
Correspondence and requests for materials should be addressed to R.K.
Reprints and permissions information is available at www.nature.com/reprints.
nature research | reporting summary
October 2018
Corresponding author(s):
Rob Knight
Last updated by author(s):
9/10/2020
Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.
Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.
n/a
Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
The statistical test(s) used AND whether they are one- or two-sided
Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested
A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient)
AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted
Give P values as exact values whenever suitable.
For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
Our web collection on statistics for biologists contains articles on many of the points above.
Software and code
Policy information about availability of computer code
Data collection
Only simulation data was used.
Data analysis
All data analysis scripts can be found here: https://github.com/knightlab-analyses/multiomic-cooccurrences
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers.
We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.
Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability
The biocrust soils data was retrieved from the supplemental section in Swenson et al
Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf
