Journal Article

Whole-Genome Regression and Prediction Methods Applied to Plant and Animal Breeding

TL;DR: An overview of available methods for implementing parametric WGR models is provided, selected topics that emerge in applications are discussed, and a general discussion of lessons learned from simulation and empirical data analysis in the last decade is presented.
Abstract: Genomic-enabled prediction is becoming increasingly important in animal and plant breeding and is also receiving attention in human genetics. Deriving accurate predictions of complex traits requires implementing whole-genome regression (WGR) models where phenotypes are regressed on thousands of markers concurrently. Methods exist that allow implementing these large-p with small-n regressions, and genome-enabled selection (GS) is being implemented in several plant and animal breeding programs. The list of available methods is long, and the relationships between them have not been fully addressed. In this article we provide an overview of available methods for implementing parametric WGR models, discuss selected topics that emerge in applications, and present a general discussion of lessons learned from simulation and empirical data analysis in the last decade.
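
To make the large-p-with-small-n setting concrete, below is a minimal R sketch of a WGR fit by ridge regression on simulated marker data. The shrinkage parameter lambda is set arbitrarily, and nothing in this sketch is taken from the paper itself; it only illustrates why such regressions remain feasible when p greatly exceeds n.

```r
## Minimal whole-genome ridge regression sketch (simulated data; lambda is
## set ad hoc). With p >> n, solve the n x n "dual" system instead of the
## p x p one: b_hat = t(X) %*% solve(X %*% t(X) + lambda * I_n, y).
set.seed(1)
n <- 200; p <- 2000                       # small n, large p
X <- matrix(rbinom(n * p, 2, 0.3), n, p)  # marker genotypes coded 0/1/2
b <- rnorm(p, sd = 0.05)                  # true marker effects
y <- X %*% b + rnorm(n)                   # phenotypes

Xc <- scale(X, center = TRUE, scale = FALSE)
lambda <- 100                             # shrinkage parameter (illustrative)
alpha  <- solve(Xc %*% t(Xc) + lambda * diag(n), y - mean(y))
b_hat  <- crossprod(Xc, alpha)            # shrunken marker-effect estimates
gebv   <- Xc %*% b_hat                    # genomic predicted values
cor(gebv, Xc %*% b)                       # accuracy vs. true genetic values
```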
Citations
Journal Article
TL;DR: BOLT-LMM is presented, which requires only a small number of O(MN)-time iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes.
Abstract: Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts and may not optimize power. All existing methods have O(MN²) time cost (where N is the number of samples and M is the number of SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here we present a far more efficient mixed-model association method, BOLT-LMM, which requires only a small number of O(MN)-time iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to 9 quantitative traits in 23,294 samples from the Women's Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for genome-wide association studies in large cohorts.
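
To make the contrast between architectures concrete, here is a small R sketch of the infinitesimal prior versus a non-infinitesimal mixture prior of the kind described above. The parameter values are illustrative assumptions, not BOLT-LMM's settings.

```r
## Infinitesimal vs. non-infinitesimal (mixture) architectures, in miniature.
set.seed(2)
M <- 1e4                                   # number of SNPs
b_inf <- rnorm(M, sd = sqrt(1 / M))        # infinitesimal: every SNP contributes
pi1   <- 0.01                              # assumed fraction of large-effect SNPs
comp  <- rbinom(M, 1, pi1)                 # mixture-component indicator
b_mix <- ifelse(comp == 1,
                rnorm(M, sd = sqrt(0.5 / (pi1 * M))),        # large effects
                rnorm(M, sd = sqrt(0.5 / ((1 - pi1) * M))))  # polygenic background
## Both architectures carry about the same total genetic variance, but the
## mixture concentrates signal in few SNPs, which is the structure the
## mixture prior exploits to gain power over a single-normal prior.
c(var_inf = sum(b_inf^2), var_mix = sum(b_mix^2))
```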

1,232 citations

Journal Article
01 Oct 2014 · Genetics
TL;DR: The BGLR R-package implements a large collection of Bayesian regression models, including parametric variable selection and shrinkage methods and semiparametric procedures, allowing various parametric and nonparametric shrinkage and variable selection procedures to be integrated in a unified and consistent manner.
Abstract: Many modern genomic data analyses require implementing regressions where the number of parameters (p, e.g., the number of marker effects) exceeds sample size (n). Implementing these large-p-with-small-n regressions poses several statistical and computational challenges, some of which can be confronted using Bayesian methods. This approach allows integrating various parametric and nonparametric shrinkage and variable selection procedures in a unified and consistent manner. The BGLR R-package implements a large collection of Bayesian regression models, including parametric variable selection and shrinkage methods and semiparametric procedures (Bayesian reproducing kernel Hilbert spaces regressions, RKHS). The software was originally developed for genomic applications; however, the methods implemented are useful for many nongenomic applications as well. The response can be continuous (censored or not) or categorical (either binary or ordinal). The algorithm is based on a Gibbs sampler with scalar updates and the implementation takes advantage of efficient compiled C and Fortran routines. In this article we describe the methods implemented in BGLR, present examples of the use of the package, and discuss practical issues emerging in real-data analysis.
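
The following short R example fits one of these models with BGLR on the wheat dataset that ships with the package. The chain lengths are kept modest for illustration and should be increased for real analyses.

```r
## Fitting a Bayesian WGR with BGLR on the package's wheat example data.
library(BGLR)
data(wheat)                        # provides wheat.X (markers), wheat.Y (traits)
X <- scale(wheat.X)                # 599 wheat lines x 1279 DArT markers
y <- wheat.Y[, 1]                  # phenotype in the first environment

ETA <- list(list(X = X, model = "BayesB"))  # one regression term; "BRR",
                                            # "BayesA", "BayesC", "BL" also work
fm <- BGLR(y = y, ETA = ETA, nIter = 6000, burnIn = 1000, verbose = FALSE)
head(fm$ETA[[1]]$b)                # posterior mean marker effects
fm$varE                            # posterior mean residual variance
```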

987 citations


Cites background or methods from "Whole-Genome Regression and Prediction Methods Applied to Plant and Animal Breeding"

  • ...…et al. 2001) are becoming increasingly popular for the analysis and prediction of complex traits in plants (e.g., Crossa et al. 2010), animals (e.g., Hayes et al. 2009, VanRaden et al. 2009), and humans (e.g., Yang et al. 2010; Makowsky et al. 2011; Vazquez et al. 2012; de los Campos et al. 2013b)....


  • ...Next we briefly describe the built-in rules implemented in BGLR; these are based on formulas similar to those described by de los Campos et al. (2013) implemented using the prior mode instead of the prior mean....


  • ...Indeed, the choice of the model depends on multiple factors such as the genetic architecture of the trait, marker density, sample size and the span of linkage disequilibrium (e.g., de los Campos et al. 2013a)....


Journal Article
TL;DR: Based on GP results, the authors speculate how GS in germplasm enhancement programs could accelerate the flow of genes from gene bank accessions to elite lines, and how recent advances in hyperspectral image technology could be combined with GS and pedigree-assisted breeding.

826 citations


Cites background from "Whole-Genome Regression and Prediction Methods Applied to Plant and Animal Breeding"

  • ...A large body of GP research has focused on developing efficient parametric and nonparametric statistical and computational models with increased accuracy for predicting nonphenotyped genotypes [13]....


Journal Article
TL;DR: This work applies Bayesian sparse linear mixed model (BSLMM) and compares it with other methods for two polygenic modeling applications: estimating the proportion of variance in phenotypes explained (PVE) by available genotypes, and phenotype (or breeding value) prediction, and demonstrates that BSLMM considerably outperforms either of the other two methods.
Abstract: Both linear mixed models (LMMs) and sparse regression models are widely used in genetics applications, including, recently, polygenic modeling in genome-wide association studies. These two approaches make very different assumptions, so are expected to perform well in different situations. However, in practice, for a given dataset one typically does not know which assumptions will be more accurate. Motivated by this, we consider a hybrid of the two, which we refer to as a “Bayesian sparse linear mixed model” (BSLMM) that includes both these models as special cases. We address several key computational and statistical issues that arise when applying BSLMM, including appropriate prior specification for the hyper-parameters and a novel Markov chain Monte Carlo algorithm for posterior inference. We apply BSLMM and compare it with other methods for two polygenic modeling applications: estimating the proportion of variance in phenotypes explained (PVE) by available genotypes, and phenotype (or breeding value) prediction. For PVE estimation, we demonstrate that BSLMM combines the advantages of both standard LMMs and sparse regression modeling. For phenotype prediction it considerably outperforms either of the other two methods, as well as several other large-scale regression methods previously suggested for this problem. Software implementing our method is freely available from http://stephenslab.uchicago.edu/software.html.
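
The hybrid model can be sketched generatively in a few lines of R. The data and parameter values below are simulated and purely illustrative; real analyses would use the authors' GEMMA software.

```r
## Generative sketch of the BSLMM decomposition: a handful of large, sparse
## effects plus a dense polygenic background. Illustrative only.
set.seed(3)
n <- 300; p <- 1000
X <- scale(matrix(rbinom(n * p, 2, 0.4), n, p))
beta_sparse <- numeric(p)
idx <- sample(p, 10)                       # a few SNPs with large effects
beta_sparse[idx] <- rnorm(10, sd = 0.3)
u <- X %*% rnorm(p, sd = sqrt(0.3 / p))    # polygenic term: tiny effects everywhere
y <- X %*% beta_sparse + u + rnorm(n)
## An LMM alone models u but over-shrinks the 10 large effects; a sparse
## regression alone recovers the 10 SNPs but ignores u. BSLMM nests both.
```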

764 citations

Journal Article
TL;DR: Some of the limitations and pitfalls of prediction analysis are discussed, and it is shown how naive implementations can lead to severe bias and misinterpretation of results.
Abstract: The success of genome-wide association studies (GWASs) has led to increasing interest in making predictions of complex trait phenotypes, including disease, from genotype data. Rigorous assessment of the value of predictors is crucial before implementation. Here we discuss some of the limitations and pitfalls of prediction analysis and show how naive implementations can lead to severe bias and misinterpretation of results.
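
One classic pitfall of this kind can be reproduced in a few lines of R: selecting markers on the full dataset before cross-validation leaks test information into training and inflates apparent accuracy. The simulation below is a minimal illustration, not an analysis from the paper.

```r
## Naive vs. proper cross-validation when markers are pre-selected.
## Phenotype is pure noise, so honest predictive accuracy should be near 0.
set.seed(4)
n <- 100; p <- 5000
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)                              # independent of the genotypes

folds <- split(1:n, rep(1:5, each = 20))

## Naive: choose the 10 most associated SNPs using ALL samples, then CV.
pval <- apply(X, 2, function(x) cor.test(x, y)$p.value)
top  <- order(pval)[1:10]
r_naive <- mean(sapply(folds, function(test) {
  fit <- lm(y[-test] ~ X[-test, top])
  cor(y[test], cbind(1, X[test, top]) %*% coef(fit))
}))

## Proper: redo the marker selection inside each training fold.
r_proper <- mean(sapply(folds, function(test) {
  pv  <- apply(X[-test, ], 2, function(x) cor.test(x, y[-test])$p.value)
  tp  <- order(pv)[1:10]
  fit <- lm(y[-test] ~ X[-test, tp])
  cor(y[test], cbind(1, X[test, tp]) %*% coef(fit))
}))
c(naive = r_naive, proper = r_proper)      # naive is inflated despite no signal
```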

657 citations

References
Journal Article
TL;DR: A new method for estimation in linear models, the 'lasso', is proposed; it minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant.
Abstract: We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.
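
As an illustration of the estimator described above, the following R snippet fits a lasso with the glmnet package (one standard implementation, not part of the original paper) on simulated data.

```r
## Lasso fit with glmnet: the L1 constraint zeroes out most coefficients,
## combining variable selection and shrinkage as described above.
library(glmnet)
set.seed(5)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))        # only 5 truly nonzero coefficients
y <- drop(X %*% beta + rnorm(n))

cvfit <- cv.glmnet(X, y, alpha = 1)        # alpha = 1 selects the lasso penalty
b_hat <- coef(cvfit, s = "lambda.min")     # sparse coefficient vector
sum(b_hat != 0)                            # few coefficients survive
```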

40,785 citations


"Whole-Genome Regression and Predict..." refers background in this paper

  • ...Many of the penalized regressions (e.g., LASSO and EN) also allow this, but their use in GS is much more limited....


  • ...A similar reasoning can be used to show the equivalence for the LASSO and in general for Bridge regression....


  • ...Using this penalty induces a solution that may involve zeroing out some regression coefficients and shrinkage estimates of the remaining effects; therefore LASSO combines variable selection and shrinkage of estimates....


  • ...However, LASSO and subset selection approaches have two important limitations....


  • ...Another special case, known as the least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996), occurs with $\gamma = 1$, yielding the L1 penalty: $J(\beta) = \sum_{j=1}^{p} |\beta_j|$....


Book
28 Jul 2013
TL;DR: In this book, the authors describe the important ideas in these areas in a common conceptual framework; the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.
Abstract: During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting---the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for ``wide'' data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

19,261 citations

Journal Article
TL;DR: An analogy between images and statistical mechanics systems is made; the analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations, resulting in a highly parallel "relaxation" algorithm for MAP estimation.
Abstract: We make an analogy between images and statistical mechanics systems. Pixel gray levels and the presence and orientation of edges are viewed as states of atoms or molecules in a lattice-like physical system. The assignment of an energy function in the physical system determines its Gibbs distribution. Because of the Gibbs distribution/Markov random field (MRF) equivalence, this assignment also determines an MRF image model. The energy function is a more convenient and natural mechanism for embodying picture attributes than are the local characteristics of the MRF. For a range of degradation mechanisms, including blurring, nonlinear deformations, and multiplicative or additive noise, the posterior distribution is an MRF with a structure akin to the image model. By the analogy, the posterior distribution defines another (imaginary) physical system. Gradual temperature reduction in the physical system isolates low energy states ("annealing"), or what is the same thing, the most probable states under the Gibbs distribution. The analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations. The result is a highly parallel "relaxation" algorithm for MAP estimation. We establish convergence properties of the algorithm and we experiment with some simple pictures, for which good restorations are obtained at low signal-to-noise ratios.
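
The coordinate-wise sampling scheme this paper helped popularize, the Gibbs sampler, can be illustrated on a toy target. The bivariate normal below is a stand-in for the MRF posteriors in the paper, chosen only because its full conditionals are known in closed form.

```r
## A minimal Gibbs sampler: alternately draw each coordinate from its full
## conditional distribution. Toy target: bivariate normal, correlation rho.
set.seed(6)
rho <- 0.9; n_iter <- 5000
draws <- matrix(0, n_iter, 2)
x <- 0; z <- 0
for (t in 1:n_iter) {
  x <- rnorm(1, mean = rho * z, sd = sqrt(1 - rho^2))  # x | z
  z <- rnorm(1, mean = rho * x, sd = sqrt(1 - rho^2))  # z | x
  draws[t, ] <- c(x, z)
}
cor(draws[-(1:500), ])[1, 2]   # approximately rho after burn-in
```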

18,761 citations


"Whole-Genome Regression and Predict..." refers methods in this paper

  • ...Gibbs sampler: Among the many MCMC algorithms the Gibbs sampler (Geman and Geman 1984; Casella and George 1992) is the most commonly used....
