
Empirical Asset Pricing via Machine Learning

TL;DR: The authors performed a comparative analysis of machine learning methods for the canonical problem of empirical asset pricing: measuring asset risk premia, and demonstrated large economic gains to investors using machine learning forecasts, in some cases doubling the performance of leading regression-based strategies from the literature.
Abstract: We perform a comparative analysis of machine learning methods for the canonical problem of empirical asset pricing: measuring asset risk premia. We demonstrate large economic gains to investors using machine learning forecasts, in some cases doubling the performance of leading regression-based strategies from the literature. We identify the best performing methods (trees and neural networks) and trace their predictive gains to allowance of nonlinear predictor interactions that are missed by other methods. All methods agree on the same set of dominant predictive signals which includes variations on momentum, liquidity, and volatility. Improved risk premium measurement through machine learning simplifies the investigation into economic mechanisms of asset pricing and highlights the value of machine learning in financial innovation.

Summary (7 min read)

1.1 Primary Contributions

  • First, the authors provide a new set of benchmarks for the predictive accuracy of machine learning methods in measuring risk premia of the aggregate market and individual stocks.
  • A portfolio strategy that times the S&P 500 with neural network forecasts enjoys an annualized out-of-sample Sharpe ratio of 0.77, versus the 0.51 Sharpe ratio of a buy-and-hold investor.
  • The authors' research highlights gains that can be achieved in prediction and identifies the most informative predictor variables.
  • The authors' view is that the best way for researchers to understand the usefulness of machine learning in the field of asset pricing is to apply and compare the performance of each of its methods in familiar empirical problems.
  • One may be interested in potentially distinguishing among different components of expected returns such as those due to systematic risk compensation, idiosyncratic risk compensation, or even due to mispricing.

1.2 What is Machine Learning?

  • The definition of "machine learning" is inchoate and is often context specific.
  • The authors use the term to describe (i) a diverse collection of high-dimensional models for statistical prediction, combined with (ii) so-called "regularization" methods for model selection and mitigation of overfit, and (iii) efficient algorithms for searching among a vast number of potential model specifications.
  • This flexibility brings hope of better approximating the unknown and likely complex data generating process underlying equity risk premia.
  • Finally, with many predictors it becomes infeasible to exhaustively traverse and compare all model permutations.
  • Element (iii) describes clever machine learning tools designed to approximate an optimal specification with manageable computational cost.

1.3 Why Apply Machine Learning to Asset Pricing?

  • A number of aspects of empirical asset pricing make it a particularly attractive field for analysis with machine learning methods.
  • Two main research agendas have monopolized modern empirical asset pricing: the first seeks to describe and understand differences in expected returns across assets, while the second focuses on dynamics of the aggregate market equity risk premium.
  • Further complicating the problem is ambiguity regarding the functional forms through which the high-dimensional predictor set enters into risk premia.
  • Welch and Goyal (2008) analyze nearly 20 predictors for the aggregate market return.
  • Second, with methods ranging from generalized linear models to regression trees and neural networks, machine learning is explicitly designed to approximate complex nonlinear associations.

1.4 What Specific Machine Learning Methods Do We Study?

  • The authors select a set of candidate models that are potentially well suited to address the three empirical challenges outlined above.
  • They constitute the canon of methods one would encounter in a graduate level machine learning textbook.
  • This includes linear regression, generalized linear models with penalization, dimension reduction via principal components regression (PCR) and partial least squares (PLS), regression trees (including boosted trees and random forests), and neural networks.
  • The authors exclude support vector machines as these share an equivalence with other methods that they study and are primarily used for classification problems.
  • Nonetheless, their list is designed to be representative of predictive analytics tools from various branches of the machine learning toolkit.

1.5 Main Empirical Findings

  • The authors conduct a large scale empirical analysis, investigating nearly 30,000 individual stocks over 60 years from 1957 to 2016.
  • This suggests that allowing for (potentially complex) interactions among the baseline predictors is a crucial aspect of nonlinearities in the expected return function.
  • Trees and neural networks improve upon this further, generating monthly out-of-sample R²'s between 1.08% and 1.80%.
  • The evidence for economic gains from machine learning forecasts, in the form of portfolio Sharpe ratios, is likewise impressive.
  • The most successful predictors are price trends, liquidity, and volatility.

1.6 What Machine Learning Cannot Do

  • Machine learning has great potential for improving risk premium measurement, which is fundamentally a problem of prediction.
  • But these improved predictions are only measurements.
  • The measurements do not tell us about economic mechanisms or equilibria.
  • Machine learning methods on their own do not identify deep fundamental associations among asset prices and conditioning variables.
  • When the objective is to understand economic mechanisms, machine learning may still be useful.

1.7 Literature

  • The authors' work extends the empirical literature on stock return prediction, which comes in two basic strands.
  • These traditional methods have potentially severe limitations that more advanced statistical tools in machine learning can help overcome.
  • Khandani et al. (2010) and Butaru et al. (2016) use regression trees to predict consumer credit card delinquencies and defaults.
  • Recently, variations of machine learning methods have been used to study the cross section of stock returns.

2 Methodology

  • This section describes the collection of machine learning methods that the authors use in their analysis.
  • The third element in each subsection describes computational algorithms for efficiently identifying the optimal specification among the permutations encompassed by a given method.
  • As the authors present each method, they aim to provide a sufficiently in-depth description of the statistical model so that a reader having no machine learning background can understand the basic model structure without needing to consult outside sources.
  • By maintaining the same form over time and across different stocks, the model leverages information from the entire panel which lends stability to estimates of risk premia for any individual asset.

2.1 Sample Splitting and Tuning via Validation

  • Important preliminary steps (prior to discussing specific models and regularization approaches) are to understand how the authors design disjoint sub-samples for estimation and testing and to introduce the notion of "hyperparameter tuning".
  • In particular, the authors divide their sample into three disjoint time periods that maintain the temporal ordering of the data.
  • The authors construct forecasts for data points in the validation sample based on the estimated model from the training sample.
  • Tuning parameters are chosen by evaluating forecasts on the validation sample, while the model parameters themselves are estimated from the training data alone.
  • Hyperparameter tuning amounts to searching for a degree of model complexity that tends to produce reliable out-of-sample performance; a minimal sketch of this split-and-tune loop follows below.
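
The following is a minimal Python sketch of the split-and-tune procedure described above. The date cutoffs, the ridge learner, and the penalty grid are illustrative assumptions rather than the paper's exact configuration (the paper re-splits recursively with an expanding training window).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

def temporal_split(panel: pd.DataFrame, train_end, valid_end):
    """Split a date-stamped panel into three disjoint, temporally ordered samples."""
    train = panel[panel["date"] <= train_end]
    valid = panel[(panel["date"] > train_end) & (panel["date"] <= valid_end)]
    test = panel[panel["date"] > valid_end]
    return train, valid, test

def tune_on_validation(train, valid, features, target="ret", alphas=(0.1, 1.0, 10.0)):
    """Estimate parameters on the training data only; choose the tuning parameter
    (here, a ridge penalty) by forecast error on the validation sample."""
    best = (None, np.inf, None)
    for alpha in alphas:
        model = Ridge(alpha=alpha).fit(train[features], train[target])
        mse = np.mean((valid[target].to_numpy() - model.predict(valid[features])) ** 2)
        if mse < best[1]:
            best = (alpha, mse, model)
    return best[0], best[2]
```

Once the tuning parameter is fixed, the selected model is evaluated only on the test sample, whose observations never enter estimation or tuning.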

2.2 Simple Linear

  • The authors begin their model description with the least complex method in their analysis, the simple linear predictive regression model estimated via ordinary least squares (OLS).
  • While the authors expect this to perform poorly in their high dimension problem, they use it as a reference point for emphasizing the distinctive features of more sophisticated methods.
  • The simple linear model imposes that the conditional expectation g can be approximated by a linear function of the raw predictor variables and the parameter vector θ: g(z_{i,t}; θ) = z_{i,t}'θ (equation (3) in the paper). This specification does not allow for nonlinear effects or interactions between predictors.
  • The convenience of the baseline ℓ2 (least squares) objective function is that it offers analytical estimates and thus avoids sophisticated optimization and computation; a minimal OLS sketch follows below.
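
A minimal sketch of the baseline specification g(z_{i,t}; θ) = z_{i,t}'θ fit by OLS on simulated data; the data and dimensions are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_pred = 5_000, 10                      # stacked stock-month observations
Z = rng.normal(size=(n_obs, n_pred))           # raw predictors z_{i,t}
theta_true = rng.normal(scale=0.02, size=n_pred)
r = Z @ theta_true + rng.normal(scale=0.10, size=n_obs)   # next-month excess returns

# The l2 (least squares) objective has a closed-form solution, so no
# iterative optimizer is required.
theta_hat, *_ = np.linalg.lstsq(Z, r, rcond=None)
r_hat = Z @ theta_hat                          # fitted risk-premium forecasts
```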

2.2.1 Extension: Robust Objective Functions

  • This allows the econometrician to tilt estimates towards observations that are more statistically or economically informative.
  • This imposes that every month has the same contribution to the model regardless of how many stocks are available that month.
  • This value-weighted loss function underweights small stocks in favor of large stocks, and is motivated by the economic rationale that small stocks represent a large fraction of the traded universe by count while constituting a tiny fraction of aggregate market capitalization.
  • Convexity of the least squares objective (4) places extreme emphasis on large errors, thus outliers can undermine the stability of OLS-based predictions.
  • The Huber loss, H, is a hybrid of squared loss for relatively small errors and absolute loss for relatively large errors, where the combination is controlled by a tuning parameter, ξ, that can be optimized adaptively from the data; see the formula sketch below.
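
For concreteness, a standard parameterization of the Huber function (quadratic for small errors, linear beyond the threshold) is shown below; treat it as a generic definition rather than a transcription of the paper's equation, with ξ tuned adaptively on the validation sample.

```latex
H(x;\,\xi) =
\begin{cases}
  x^{2}, & |x| \le \xi, \\
  2\xi\,|x| - \xi^{2}, & |x| > \xi.
\end{cases}
```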

2.3 Penalized Linear

  • The simple linear model is bound to fail in the presence of many predictors.
  • Crucial for avoiding overfit is reducing the number of estimated parameters.
  • The statistical model for their penalized linear model is the same as the simple linear model in equation (3).
  • The authors focus on the popular "elastic net" penalty, which takes the form φ(θ; λ, ρ) = λ(1 − ρ) Σ_j |θ_j| + ½ λρ Σ_j θ_j², combining the lasso (ℓ1) and ridge (ℓ2) penalties through an overall penalty weight λ and a mixing parameter ρ.
  • Ridge is a shrinkage method that helps prevent coefficients from becoming unduly large in magnitude, while the lasso component imposes sparsity; a minimal elastic net sketch follows below.
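
A minimal elastic net sketch using scikit-learn; the penalty strength alpha and the lasso/ridge mix l1_ratio play the roles of λ and ρ and would be tuned on the validation sample, so the values below are placeholders.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
Z = rng.normal(size=(5_000, 100))              # many, partly redundant predictors
theta = np.zeros(100)
theta[:5] = 0.05                               # only a handful truly matter
r = Z @ theta + rng.normal(scale=0.10, size=5_000)

# l1_ratio interpolates between ridge (0: pure shrinkage) and
# lasso (1: shrinkage plus variable selection).
enet = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(Z, r)
n_selected = int(np.sum(enet.coef_ != 0))      # sparsity induced by the l1 term
```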

2.4 Dimension Reduction: PCR and PLS

  • Penalized linear models use shrinkage and variable selection to manage high dimensionality by forcing the coefficients on most regressors near or exactly to zero.
  • This can produce suboptimal forecasts when predictors are highly correlated.
  • In the first step, principal components analysis (PCA) combines regressors into a small set of linear combinations that best preserve the covariance structure among the predictors.
  • Likewise, the predictive coefficients are estimated by regressing returns on the reduced set of components rather than on the full predictor set.
  • Intuitively, PCR seeks the K linear combinations of Z that most faithfully mimic the full predictor set, whereas PLS chooses components based on their covariance with the forecast target; a minimal sketch of both follows below.
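
A minimal sketch of both dimension reduction approaches on simulated data; the number of components K is a tuning parameter chosen on the validation sample, and all dimensions below are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
Z = rng.normal(size=(5_000, 100))
r = 0.02 * Z[:, :5].sum(axis=1) + rng.normal(scale=0.10, size=5_000)
K = 10                                          # number of components to keep

# PCR: components chosen to preserve predictor covariance, then OLS on them.
components = PCA(n_components=K).fit_transform(Z)
pcr_forecast = LinearRegression().fit(components, r).predict(components)

# PLS: components chosen for their covariance with the forecast target.
pls_forecast = PLSRegression(n_components=K).fit(Z, r).predict(Z).ravel()
```

The difference in the component step is the key design choice: PCR is unsupervised and ignores the target when forming components, while PLS tilts the components toward directions that covary with returns.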

2.5 Generalized Linear

  • When the "true" model is complex and nonlinear, restricting the functional form to be linear introduces approximation error due to model misspecification.
  • Let g*(z_{i,t}) denote the true model and g(z_{i,t}; θ) the functional form specified by the econometrician.
  • And let ĝ(z_{i,t}; θ̂) and r̂_{i,t+1} denote the fitted model and its ensuing return forecast.


  • There are many potential choices for spline functions.
  • Because higher order terms enter additively, forecasting with the generalized linear model can be approached with the same estimation tools as in Section 2.2.
  • Because series expansion quickly multiplies the number of model parameters, the authors use penalization to control degrees of freedom.
  • The authors' choice of penalization function is specialized for the spline expansion setting and is known as the group lasso; a sketch of the expansion step follows below.
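
A minimal sketch of the series-expansion step: each predictor is expanded into spline basis terms and the expanded design is fit with a penalized regression. The paper's group lasso keeps or drops all spline terms of a given predictor together; a plain lasso is substituted here only because it ships with scikit-learn, so this is an approximation of that scheme.

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
Z = rng.uniform(-1.0, 1.0, size=(5_000, 20))
r = 0.05 * np.sin(2.0 * Z[:, 0]) + rng.normal(scale=0.10, size=5_000)

# Expand every predictor into a handful of spline basis functions.
# Higher-order terms enter additively, so estimation stays linear in parameters.
spline = SplineTransformer(degree=2, n_knots=5, include_bias=False)
Z_expanded = spline.fit_transform(Z)            # many more columns than Z

# Penalization controls the degrees of freedom created by the expansion.
glm_fit = Lasso(alpha=0.001, max_iter=10_000).fit(Z_expanded, r)
```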

2.6 Boosted Regression Trees and Random Forests

  • The model in (13) captures individual predictors' nonlinear impact on expected returns, but does not account for interactions among predictors.
  • While expanding univariate predictors with K basis functions multiplies the number of parameters by a factor of K, multi-way interactions increase the parameterization combinatorially.
  • At each new level, the authors choose a sorting variable from the set of predictors and the split value to maximize the discrepancy among average outcomes in each bin.
  • Branching halts when the number of leaves or the depth of the tree reach a pre-specified threshold that can be selected adaptively using a validation sample.
  • Next, a second simple tree (with the same shallow depth L) is used to fit the prediction residuals from the first tree, and the procedure continues additively; a bare-bones boosting sketch follows below.
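
The residual-fitting logic can be written in a few lines; this is a bare-bones squared-error boosting sketch with shallow trees and a shrinkage weight, not the paper's exact GBRT implementation, whose depth, number of trees, and learning rate are tuned on the validation sample.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(Z, r, n_trees=100, depth=2, learning_rate=0.1):
    """Additively grow shallow trees, each fit to the residuals of the ensemble so far."""
    forecast = np.zeros_like(r, dtype=float)
    trees = []
    for _ in range(n_trees):
        residual = r - forecast
        tree = DecisionTreeRegressor(max_depth=depth).fit(Z, residual)
        forecast += learning_rate * tree.predict(Z)   # shrink each tree's contribution
        trees.append(tree)
    return trees, forecast

rng = np.random.default_rng(4)
Z = rng.normal(size=(5_000, 10))
r = 0.05 * Z[:, 0] * Z[:, 1] + rng.normal(scale=0.10, size=5_000)  # interaction signal
trees, fitted = boost(Z, r)
```

A random forest instead averages many deep trees grown on bootstrap samples with random predictor subsets; both methods are available off the shelf as GradientBoostingRegressor and RandomForestRegressor.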

2.7 Neural Networks

  • The final nonlinear method that the authors analyze is the artificial neural network.
  • Arguably the most powerful modeling device in machine learning, neural networks have theoretical underpinnings as "universal approximators" for any smooth predictive association (Hornik et al., 1989; Cybenko, 1989).
  • The right panel of Figure 2 shows an example with one hidden layer that contains five neurons.
  • Training a very deep neural network is challenging because it typically involves a large number of parameters, because the objective function is highly non-convex, and because the recursive calculation of derivatives (known as "back-propagation") is prone to exploding or vanishing gradients.
  • In each step of the optimization algorithm, the parameter guesses are gradually updated to reduce prediction errors in the training sample; a minimal network sketch follows below.
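
A minimal sketch of a shallow feed-forward network like the one-hidden-layer, five-neuron example in Figure 2, trained with a gradient-based optimizer. The paper's networks (NN1 through NN5) are deeper and use additional training refinements, so the configuration below is illustrative only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)
Z = rng.normal(size=(5_000, 10))
r = 0.05 * np.tanh(Z[:, 0] * Z[:, 1]) + rng.normal(scale=0.10, size=5_000)

# One hidden layer with five neurons; 'adam' performs the gradual gradient-based
# parameter updates, and early stopping (on a held-out fraction) guards against overfit.
nn = MLPRegressor(hidden_layer_sizes=(5,), activation="relu", solver="adam",
                  learning_rate_init=0.01, early_stopping=True, max_iter=500,
                  random_state=0).fit(Z, r)
r_hat = nn.predict(Z)
```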

2.8 Performance Evaluation

  • To assess predictive performance for individual excess stock return forecasts, the authors calculate an out-of-sample R², R²_oos = 1 − Σ_{(i,t)∈T3} (r_{i,t+1} − r̂_{i,t+1})² / Σ_{(i,t)∈T3} r²_{i,t+1}, where T3 indicates that fits are only assessed on the testing subsample, whose data never enter into model estimation or tuning (see the sketch after this list).
  • In certain circumstances, early stopping and weight-decay are shown to be equivalent.
  • The authors adapt Diebold-Mariano to their setting by comparing the cross-sectional average of prediction errors from each model, instead of comparing errors among individual returns.
  • This modified Diebold-Mariano test statistic, which is now based on a single time series d_{12,t+1} of error differences with little autocorrelation, is more likely to satisfy the mild regularity conditions needed for asymptotic normality and in turn provide appropriate p-values for their model comparison tests.
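
A minimal sketch of both evaluation tools. The out-of-sample R² follows the convention described above of benchmarking against a forecast of zero (no demeaning), and the modified Diebold-Mariano comparison reduces to a t-statistic on the monthly series of cross-sectionally averaged squared-error differences; any autocorrelation adjustment is omitted here for brevity.

```python
import numpy as np

def r2_oos(r, r_hat):
    """Pooled out-of-sample R^2 on the test subsample, with a zero-forecast benchmark."""
    r, r_hat = np.asarray(r), np.asarray(r_hat)
    return 1.0 - np.sum((r - r_hat) ** 2) / np.sum(r ** 2)

def modified_dm_stat(err1, err2):
    """err1, err2: (T x N) matrices of test-sample forecast errors from two models.
    Average squared errors across stocks each month, then t-test the T differences."""
    d = np.nanmean(err1 ** 2, axis=1) - np.nanmean(err2 ** 2, axis=1)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
```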

2.9 Variable Importance and Marginal Relationships

  • The authors goal in interpreting machine learning models is modest.
  • The authors aim to identify covariates that have an important influence on the cross-section of expected returns while simultaneously controlling for the many other predictors in the system.
  • The authors consider two different notions of importance.
  • The first is the reduction in panel predictive R² from setting all values of predictor j to zero, while holding the remaining model estimates fixed (used, for example, in the context of dimension reduction by Kelly et al., 2019).
  • The second, proposed in the neural networks literature by Dimopoulos et al. (1995), is the sum of squared partial derivatives (SSD) of the model with respect to each input variable j, which summarizes the sensitivity of model fits to changes in that variable; a minimal sketch of the first measure follows below.
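
A minimal sketch of the first importance measure: set one predictor to zero at a time, keep the fitted model fixed, and record the drop in predictive R². The model object is a placeholder with a predict method; the SSD measure would instead sum squared (numerical) partial derivatives of the fit with respect to each input.

```python
import numpy as np

def r2_oos(r, r_hat):
    return 1.0 - np.sum((r - r_hat) ** 2) / np.sum(r ** 2)

def zero_out_importance(model, Z, r):
    """Reduction in R^2 from zeroing out each predictor, holding the model fixed."""
    base = r2_oos(r, model.predict(Z))
    importance = np.empty(Z.shape[1])
    for j in range(Z.shape[1]):
        Z_j = Z.copy()
        Z_j[:, j] = 0.0                        # knock out predictor j only
        importance[j] = base - r2_oos(r, model.predict(Z_j))
    return importance
```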

3.1 Data and Over-arching Model

  • The authors obtain monthly total individual equity returns from CRSP for all firms listed in the NYSE, AMEX, and NASDAQ.
  • These include 94 stock-level characteristics (61 of which are updated annually, 13 updated quarterly, and 20 updated monthly).
  • The structure of their feature set in (21) allows purely stock-level information to enter expected returns via c_{i,t}, in analogy with the risk exposure function β_{i,t}, and also allows aggregate economic conditions to enter in analogy with the dynamic risk premium λ_t.
  • Each time the authors refit, they increase the training sample by one year; a minimal sketch of this expanding-window scheme follows below.
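
A minimal sketch of the recursive refitting scheme: the training window grows by one year per iteration, a fixed-length validation window rolls forward just ahead of it, and the refit model forecasts the following test year. The column names, the feature list, and the fit_and_tune helper are placeholders.

```python
import pandas as pd

def expanding_window_forecasts(panel, features, first_test_year, last_test_year,
                               n_valid_years, fit_and_tune):
    """Refit once per test year with an expanding training sample."""
    out = []
    for test_year in range(first_test_year, last_test_year + 1):
        valid_start = test_year - n_valid_years
        train = panel[panel["year"] < valid_start]          # expands every iteration
        valid = panel[(panel["year"] >= valid_start) & (panel["year"] < test_year)]
        test = panel[panel["year"] == test_year]
        model = fit_and_tune(train, valid)                  # user-supplied estimator
        out.append(test.assign(forecast=model.predict(test[features])))
    return pd.concat(out)
```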

3.2 The Cross Section of Individual Stocks

  • The authors also report these R²_oos within subsamples that include only the top 1,000 stocks or bottom 1,000 stocks by market value.
  • To quantify the complexity of GBRT, the authors report the number of features used in the boosted tree ensemble at each re-estimation point.
  • The baseline pattern, in which OLS fares poorly, regularized linear models improve on it, and nonlinear models dominate, carries over into subsamples.
  • The authors highlight how inference changes under a conservative Bonferroni multiple comparisons correction that divides the significance level by the number of comparisons (for example, with 30 comparisons a 5% level becomes roughly 0.17% per test).

3.3 Which Covariates Matter?

  • The authors now investigate the relative importance of individual covariates for the performance of each model using the importance measures described in Section 2.9.
  • Characteristics are ordered so that the highest total ranks are on top and the lowest ranking characteristics are at the bottom.
  • Within each model, the authors calculate the Pearson correlation between relative importances from SSD and the R 2 measure.
  • Noise variables appear among the least informative characteristics, along with sin stocks, dividend initiation/omission, cashflow volatility, and other accounting variables.
  • Nonlinear methods (trees and neural networks) place great emphasis on exactly those predictors ignored by linear methods, such as term spreads and issuance activity.

3.3.1 Marginal Association Between Characteristics and Expected Returns

  • Figure 6 traces out the model-implied marginal impact of individual characteristics on expected excess returns.
  • First, Figure 6 illustrates that machine learning methods identify patterns similar to some well known empirical phenomena.
  • Second, the linear model finds no predictive association between returns and either size or volatility, while trees and neural networks find large sensitivity of expected returns to both of these variables.
  • According to NN3, a firm that drops from median size to the 20th percentile of the size distribution experiences an increase in its annualized expected return of roughly 2.4% (0.002 × 12 × 100), and a firm whose volatility rises from the median to the 80th percentile experiences a decrease of around 3.0% per year; these nonlinear methods detect predictive associations that the linear model misses. A sketch of how such marginal effects can be traced out follows below.
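
Marginal associations of the kind plotted in Figure 6 can be traced with a simple "vary one characteristic, hold the rest at their median" loop, sketched below. The fitted model, the characteristic index, and the grid are placeholders; the [-1, 1] grid assumes characteristics have been rank-transformed onto that interval.

```python
import numpy as np

def marginal_effect(model, Z, j, grid):
    """Model-implied expected return as characteristic j moves along `grid`,
    with all other characteristics held at their cross-sectional median."""
    base = np.median(Z, axis=0)
    curve = []
    for value in grid:
        z = base.copy()
        z[j] = value
        curve.append(model.predict(z.reshape(1, -1))[0])
    return np.array(curve)

# Example (placeholders): trace the size characteristic over a [-1, 1] grid.
# curve = marginal_effect(nn_model, Z_test, size_index, np.linspace(-1, 1, 41))
```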

3.3.2 Interaction Effects

  • The favorable performance of trees and neural networks indicates a benefit to allowing for potentially complex interactions among predictors.
  • Machine learning models are often referred to as "black boxes".
  • The upper-left figure shows that the short-term reversal effect is strongest and is essentially linear among small stocks (blue line).
  • Finally, the lower-right panel shows that NN3 estimates no interaction effect between size and accruals: the size lines are simply vertical shifts of the univariate accruals curve.
  • Furthermore, the dominant macroeconomic interactions are stable over time.

3.4 Portfolio Forecasts

  • Next, the authors compare forecasting performance of machine learning methods for aggregate portfolio returns.
  • First, because all of their models are optimized for stock-level forecasts, portfolio forecasts provide an additional indirect evaluation of the model and its robustness.
  • Second, aggregate portfolios tend to be of broader economic interest because they represent the risky-asset savings vehicles most commonly held by investors (via mutual funds, ETFs, and hedge funds).
  • Last but not least, the portfolio results are one step further "out-of-sample" in that the optimization routine does not directly account for the predictive performance of the portfolios.

3.4.1 Pre-specified Portfolios

  • The authors build bottom-up forecasts by aggregating individual stock return predictions into portfolios.
  • The authors form bottom-up forecasts for 30 of the most well-known portfolios in the empirical finance literature.
  • Table 5 reports the monthly out-of-sample R² over their 30-year testing sample.
  • Consistent with their other results, the strongest and most consistent trading strategies are those based on nonlinear models, with neural networks the best overall.
  • In the case of NN3, the Sharpe ratio from timing the S&P 500 index is 0.77, or 26 percentage points higher than a buy-and-hold position; a minimal sketch of the bottom-up aggregation follows below.
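
A minimal sketch of the bottom-up aggregation: a portfolio's return forecast is the weighted sum of its constituents' stock-level forecasts, after which the same out-of-sample R² and Sharpe ratio metrics can be computed at the portfolio level. The weight matrix (for example, value weights for the S&P 500 or characteristic-sorted portfolios) is an input supplied by the user.

```python
import numpy as np

def bottom_up_forecasts(stock_forecasts, weight_matrix):
    """stock_forecasts: length-N vector of individual stock return forecasts.
    weight_matrix: (P x N) portfolio weights, one row per portfolio (rows sum to 1).
    Returns a length-P vector of portfolio-level return forecasts."""
    return np.asarray(weight_matrix) @ np.asarray(stock_forecasts)

def annualized_sharpe(monthly_excess_returns):
    """Annualized Sharpe ratio of a monthly excess-return series."""
    r = np.asarray(monthly_excess_returns)
    return np.sqrt(12.0) * r.mean() / r.std(ddof=1)
```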

3.4.2 Machine Learning Portfolios

  • Next, rather than assessing forecast performance among pre-specified portfolios, the authors design a new set of portfolios to directly exploit machine learning forecasts.
  • All stocks are sorted into deciles based on their predicted returns for the next month.
  • In a linear factor model, the tangency portfolio of the factors themselves represents the maximum Sharpe ratio portfolio in the economy.
  • The value-weighted decile spread Sharpe ratio is 1.33, which is slightly lower than that for the NN4 model; a minimal sketch of the decile-sort construction follows below.
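
A minimal sketch of the prediction-sorted portfolios: each month stocks are ranked into deciles by their one-month-ahead forecast, and the spread portfolio goes long decile 10 and short decile 1. The column names and the equal weighting within deciles are illustrative choices.

```python
import pandas as pd

def decile_spread_returns(df):
    """df columns: 'date', 'forecast', 'ret' (realized next-month excess return).
    Returns the monthly long-short (decile 10 minus decile 1) return series."""
    def one_month(g):
        decile = pd.qcut(g["forecast"], 10, labels=False) + 1
        return g.loc[decile == 10, "ret"].mean() - g.loc[decile == 1, "ret"].mean()
    return df.groupby("date").apply(one_month)
```

The resulting monthly spread series can then be annualized with the same Sharpe ratio helper as in the previous sketch.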

4 Conclusion

  • At the highest level, the findings demonstrate that machine learning methods can help improve the empirical understanding of asset prices.
  • Neural networks and, to a lesser extent, regression trees, are the best performing methods.
  • The authors track down the source of their predictive advantage to accommodation of nonlinear interactions that are missed by other methods.
  • Lastly, the authors find that all methods agree on a fairly small set of dominant predictive signals, the most powerful predictors being associated with price trends including return reversal and momentum.
  • With better measurement through machine learning, risk premia are less shrouded in approximation and estimation error, and the challenge of identifying reliable economic mechanisms behind asset pricing phenomena thus becomes less steep.


NBER WORKING PAPER SERIES
EMPIRICAL ASSET PRICING VIA MACHINE LEARNING
Shihao Gu
Bryan Kelly
Dacheng Xiu
Working Paper 25398
http://www.nber.org/papers/w25398
NATIONAL BUREAU OF ECONOMIC RESEARCH
1050 Massachusetts Avenue
Cambridge, MA 02138
December 2018, Revised September 2019
We benefitted from discussions with Joseph Babcock, Si Chen (Discussant), Rob Engle, Andrea
Frazzini, Amit Goyal (Discussant), Lasse Pedersen, Lin Peng (Discussant), Alberto Rossi
(Discussant), Guofu Zhou (Discussant), and seminar and conference participants at Erasmus
School of Economics, NYU, Northwestern, Imperial College, National University of Singapore,
UIBE, Nanjing University, Tsinghua PBC School of Finance, Fannie Mae, U.S. Securities and
Exchange Commission, City University of Hong Kong, Shenzhen Finance Institute at CUHK,
NBER Summer Institute, New Methods for the Cross Section of Returns Conference, Chicago
Quantitative Alliance Conference, Norwegian Financial Research Conference, EFA, China
International Conference in Finance, 10th World Congress of the Bachelier Finance Society,
Financial Engineering and Risk Management International Symposium, Toulouse Financial
Econometrics Conference, Chicago Conference on New Aspects of Statistics, Financial
Econometrics, and Data Science, Tsinghua Workshop on Big Data and Internet Economics, Q
group, IQ-KAP Research Prize Symposium, Wolfe Research, INQUIRE UK, Australasian
Finance and Banking Conference, Goldman Sachs Global Alternative Risk Premia Conference,
AFA, and Swiss Finance Institute. We gratefully acknowledge the computing support from the
Research Computing Center at the University of Chicago. The views expressed herein are those
of the authors and do not necessarily reflect the views of the National Bureau of Economic
Research.
At least one co-author has disclosed a financial relationship of potential relevance for this
research. Further information is available online at http://www.nber.org/papers/w25398.ack
NBER working papers are circulated for discussion and comment purposes. They have not been
peer-reviewed or been subject to the review by the NBER Board of Directors that accompanies
official NBER publications.
© 2018 by Shihao Gu, Bryan Kelly, and Dacheng Xiu. All rights reserved. Short sections of text,
not to exceed two paragraphs, may be quoted without explicit permission provided that full
credit, including © notice, is given to the source.

Empirical Asset Pricing via Machine Learning
Shihao Gu, Bryan Kelly, and Dacheng Xiu
NBER Working Paper No. 25398
December 2018, Revised September 2019
JEL No. C45,C55,C58,G11,G12
ABSTRACT
We perform a comparative analysis of machine learning methods for the canonical problem of
empirical asset pricing: measuring asset risk premia. We demonstrate large economic gains to
investors using machine learning forecasts, in some cases doubling the performance of leading
regression-based strategies from the literature. We identify the best performing methods (trees
and neural networks) and trace their predictive gains to allowance of nonlinear predictor
interactions that are missed by other methods. All methods agree on the same set of dominant
predictive signals which includes variations on momentum, liquidity, and volatility. Improved
risk premium measurement through machine learning simplifies the investigation into economic
mechanisms of asset pricing and highlights the value of machine learning in financial innovation.
Shihao Gu
University of Chicago Booth School of Business
5807 S. Woodlawn
Chicago, IL 60637
shihaogu@chicagobooth.edu
Bryan Kelly
Yale School of Management
165 Whitney Ave.
New Haven, CT 06511
and NBER
bryan.kelly@yale.edu
Dacheng Xiu
Booth School of Business
University of Chicago
5807 South Woodlawn Avenue
Chicago, IL 60637
dachxiu@chicagobooth.edu

1 Introduction
In this article, we conduct a comparative analysis of machine learning methods for finance. We do
so in the context of perhaps the most widely studied problem in finance, that of measuring equity
risk premia.
1.1 Primary Contributions
Our primary contributions are two-fold. First, we provide a new set of benchmarks for the predictive accuracy of machine learning methods in measuring risk premia of the aggregate market and individual stocks. This accuracy is summarized two ways. The first is a high out-of-sample predictive R² relative to preceding literature that is robust across a variety of machine learning specifications. Second, and more importantly, we demonstrate the large economic gains to investors using machine learning forecasts. A portfolio strategy that times the S&P 500 with neural network forecasts enjoys an annualized out-of-sample Sharpe ratio of 0.77, versus the 0.51 Sharpe ratio of a buy-and-hold investor. And a value-weighted long-short decile spread strategy that takes positions based on stock-level neural network forecasts earns an annualized out-of-sample Sharpe ratio of 1.35, more than doubling the performance of a leading regression-based strategy from the literature.

Return prediction is economically meaningful. The fundamental goal of asset pricing is to understand the behavior of risk premia.[1] If expected returns were perfectly observed, we would still need theories to explain their behavior and empirical analysis to test those theories. But risk premia are notoriously difficult to measure: market efficiency forces return variation to be dominated by unforecastable news that obscures risk premia. Our research highlights gains that can be achieved in prediction and identifies the most informative predictor variables. This helps resolve the problem of risk premium measurement, which then facilitates more reliable investigation into economic mechanisms of asset pricing.
Second, we synthesize the empirical asset pricing literature with the field of machine learning. Relative to traditional empirical methods in asset pricing, machine learning accommodates a far more expansive list of potential predictor variables and richer specifications of functional form. It is this flexibility that allows us to push the frontier of risk premium measurement. Interest in machine learning methods for finance has grown tremendously in both academia and industry. This article provides a comparative overview of machine learning methods applied to the two canonical problems of empirical asset pricing: predicting returns in the cross section and time series. Our view is that the best way for researchers to understand the usefulness of machine learning in the field of asset pricing is to apply and compare the performance of each of its methods in familiar empirical problems.

[1] Our focus is on measuring conditional expected stock returns in excess of the risk-free rate. Academic finance traditionally refers to this quantity as the "risk premium" due to its close connection with equilibrium compensation for bearing equity investment risk. We use the terms "expected return" and "risk premium" interchangeably. One may be interested in potentially distinguishing among different components of expected returns such as those due to systematic risk compensation, idiosyncratic risk compensation, or even due to mispricing. For machine learning approaches to this problem, see Gu et al. (2019) and Kelly et al. (2019).

1.2 What is Machine Learning?
The definition of “machine learning” is inchoate and is often context specific. We use the term to
describe (i) a diverse collection of high-dimensional models for statistical prediction, combined with
(ii) so-called “regularization” methods for model selection and mitigation of overfit, and (iii) efficient
algorithms for searching among a vast number of potential model specifications.
The high-dimensional nature of machine learning methods (element (i) of this definition) enhances
their flexibility relative to more traditional econometric prediction techniques. This flexibility brings
hope of better approximating the unknown and likely complex data generating process underlying
equity risk premia. With enhanced flexibility, however, comes a higher propensity of overfitting
the data. Element (ii) of our machine learning definition describes refinements in implementation
that emphasize stable out-of-sample performance to explicitly guard against overfit. Finally, with
many predictors it becomes infeasible to exhaustively traverse and compare all model permutations.
Element (iii) describes clever machine learning tools designed to approximate an optimal specification
with manageable computational cost.
1.3 Why Apply Machine Learning to Asset Pricing?
A number of aspects of empirical asset pricing make it a particularly attractive field for analysis with
machine learning methods.
1) Two main research agendas have monopolized modern empirical asset pricing research. The first seeks to describe and understand differences in expected returns across assets. The second focuses on dynamics of the aggregate market equity risk premium. Measurement of an asset's risk premium is fundamentally a problem of prediction: the risk premium is the conditional expectation of a future realized excess return. Machine learning, whose methods are largely specialized for prediction tasks, is thus ideally suited to the problem of risk premium measurement.

2) The collection of candidate conditioning variables for the risk premium is large. The profession has accumulated a staggering list of predictors that various researchers have argued possess forecasting power for returns. The number of stock-level predictive characteristics reported in the literature numbers in the hundreds and macroeconomic predictors of the aggregate market number in the dozens.[2] Additionally, predictors are often close cousins and highly correlated. Traditional prediction methods break down when the predictor count approaches the observation count or predictors are highly correlated. With an emphasis on variable selection and dimension reduction techniques, machine learning is well suited for such challenging prediction problems by reducing degrees of freedom and condensing redundant variation among predictors.

3) Further complicating the problem is ambiguity regarding the functional forms through which the high-dimensional predictor set enters into risk premia. Should predictors enter linearly? If nonlinearities are needed, which form should they take? Must we consider interactions among predictors? Such questions rapidly proliferate the set of potential model specifications. The theoretical literature offers little guidance for winnowing the list of conditioning variables and functional forms. Three aspects of machine learning make it well suited for problems of ambiguous functional form. The first is its diversity. As a suite of dissimilar methods it casts a wide net in its specification search. Second, with methods ranging from generalized linear models to regression trees and neural networks, machine learning is explicitly designed to approximate complex nonlinear associations. Third, parameter penalization and conservative model selection criteria complement the breadth of functional forms spanned by these methods in order to avoid overfit biases and false discovery.

[2] Green et al. (2013) count 330 stock-level predictive signals in published or circulated drafts. Harvey et al. (2016) study 316 "factors," which include firm characteristics and common factors, for describing stock return behavior. They note that this is only a subset of those studied in the literature. Welch and Goyal (2008) analyze nearly 20 predictors for the aggregate market return. In both stock and aggregate return predictions, there presumably exists a much larger set of predictors that were tested but failed to predict returns and were thus never reported.
1.4 What Specific Machine Learning Methods Do We Study?
We select a set of candidate models that are potentially well suited to address the three empirical challenges outlined above. They constitute the canon of methods one would encounter in a graduate-level machine learning textbook.[3] This includes linear regression, generalized linear models with penalization, dimension reduction via principal components regression (PCR) and partial least squares (PLS), regression trees (including boosted trees and random forests), and neural networks. This is not an exhaustive analysis of all methods. For example, we exclude support vector machines as these share an equivalence with other methods that we study[4] and are primarily used for classification problems. Nonetheless, our list is designed to be representative of predictive analytics tools from various branches of the machine learning toolkit.
1.5 Main Empirical Findings
We conduct a large scale empirical analysis, investigating nearly 30,000 individual stocks over 60 years from 1957 to 2016. Our predictor set includes 94 characteristics for each stock, interactions of each characteristic with eight aggregate time series variables, and 74 industry sector dummy variables, totaling more than 900 baseline signals. Some of our methods expand this predictor set much further by including nonlinear transformations and interactions of the baseline signals. We establish the following empirical facts about machine learning for return prediction.

Machine learning shows great promise for empirical asset pricing. At the broadest level, our main empirical finding is that machine learning as a whole has the potential to improve our empirical understanding of expected asset returns. It digests our predictor data set, which is massive from the perspective of the existing literature, into a return forecasting model that dominates traditional approaches. The immediate implication is that machine learning aids in solving practical investment problems such as market timing, portfolio choice, and risk management, justifying its role in the business architecture of the fintech industry.

Consider as a benchmark a panel regression of individual stock returns onto three lagged stock-level characteristics: size, book-to-market, and momentum. This benchmark has a number of attrac-

[3] See, for example, Hastie et al. (2009).
[4] See, for example, Jaggi (2013) and Hastie et al. (2009), who discuss the equivalence of support vector machines with the lasso. For an application of the kernel trick to the cross section of returns, see Kozak (2019).

Frequently Asked Questions (12)
Q1. What have the authors contributed in "Nber working paper series empirical asset pricing via machine learning" ?

The authors contribute a comparative analysis of machine learning methods for the canonical problem of measuring asset risk premia. They provide new benchmarks for out-of-sample predictive accuracy at both the individual-stock and aggregate-market level, demonstrate large economic gains to investors (for example, a neural network strategy that times the S&P 500 earns an annualized out-of-sample Sharpe ratio of 0.77 versus 0.51 for buy-and-hold), identify trees and neural networks as the best performing methods, and trace their gains to nonlinear predictor interactions missed by other methods.

For characteristic-based portfolios, nonlinear machine learning methods help improve Sharpe ratios by anywhere from a few percentage points to over 24 percentage points. 

While value-weight portfolios are less sensitive to trading cost considerations, it is perhaps more natural to study equal weights in their analysis because their statistical objective functions minimize equally weighted forecast errors. 

The most common machine learning device for imposing parameter parsimony is to append a penalty to the objective function in order to favor more parsimonious specifications. 

Ensemble methods demonstrate more reliable performance and are scalable for very large datasets, leading to their increased popularity in recent literature. 

The final advantage of analyzing predictability at the portfolio level is that the authors can assess the economic contribution of each method via its contribution to risk-adjusted portfolio return performance. 

Bottom-up portfolio forecasts allow us to evaluate a model's ability to transport its asset predictions, which occur at the finest asset level, into broader investment contexts.

The challenge is how to assess the incremental predictive content of a newly proposed predictor while jointly controlling for the gamut of extant signals (or, relatedly, handling the multiple comparisons and false discovery problem). 

The number of units in the input layer is equal to the dimension of the predictors, which the authors set to four in this example (denoted z1, ..., z4). 

To initialize the network, similarly define the input layer using the raw predictors, x^(0) = (1, z_1, ..., z_N)'. Eldan and Shamir (2016) formally demonstrate that depth, even if increased by one layer, can be exponentially more valuable than increasing width in standard feed-forward neural networks.
