
Empirical Asset Pricing via Machine Learning

TL;DR: The authors performed a comparative analysis of machine learning methods for the canonical problem of empirical asset pricing: measuring asset risk premia, and demonstrated large economic gains to investors using machine learning forecasts, in some cases doubling the performance of leading regression-based strategies from the literature.
Abstract: We perform a comparative analysis of machine learning methods for the canonical problem of empirical asset pricing: measuring asset risk premia. We demonstrate large economic gains to investors using machine learning forecasts, in some cases doubling the performance of leading regression-based strategies from the literature. We identify the best performing methods (trees and neural networks) and trace their predictive gains to allowance of nonlinear predictor interactions that are missed by other methods. All methods agree on the same set of dominant predictive signals which includes variations on momentum, liquidity, and volatility. Improved risk premium measurement through machine learning simplifies the investigation into economic mechanisms of asset pricing and highlights the value of machine learning in financial innovation.

Summary (7 min read)

1.1 Primary Contributions

  • First, the authors provide a new set of benchmarks for the predictive accuracy of machine learning methods in measuring risk premia of the aggregate market and individual stocks.
  • A portfolio strategy that times the S&P 500 with neural network forecasts enjoys an annualized out-of-sample Sharpe ratio of 0.77, versus the 0.51 Sharpe ratio of a buy-and-hold investor.
  • The authors' research highlights gains that can be achieved in prediction and identifies the most informative predictor variables.
  • The authors' view is that the best way for researchers to understand the usefulness of machine learning in the field of asset pricing is to apply and compare the performance of each of its methods in familiar empirical problems.
  • One may be interested in potentially distinguishing among different components of expected returns such as those due to systematic risk compensation, idiosyncratic risk compensation, or even due to mispricing.

1.2 What is Machine Learning?

  • The definition of "machine learning" is inchoate and is often context specific.
  • The authors use the term to describe (i) a diverse collection of high-dimensional models for statistical prediction, combined with (ii) so-called "regularization" methods for model selection and mitigation of overfit, and (iii) efficient algorithms for searching among a vast number of potential model specifications.
  • This flexibility brings hope of better approximating the unknown and likely complex data generating process underlying equity risk premia.
  • Finally, with many predictors it becomes infeasible to exhaustively traverse and compare all model permutations.
  • Element (iii) describes clever machine learning tools designed to approximate an optimal specification with manageable computational cost.

1.3 Why Apply Machine Learning to Asset Pricing?

  • A number of aspects of empirical asset pricing make it a particularly attractive field for analysis with machine learning methods.
  • Two main research agendas have monopolized modern empirical asset pricing: the first seeks to describe and understand differences in expected returns across assets, while the second focuses on dynamics of the aggregate market equity risk premium.
  • Further complicating the problem is ambiguity regarding the functional forms through which the high-dimensional predictor set enters into risk premia.
  • Welch and Goyal (2008) analyze nearly 20 predictors for the aggregate market return.
  • Second, with methods ranging from generalized linear models to regression trees and neural networks, machine learning is explicitly designed to approximate complex nonlinear associations.

1.4 What Specific Machine Learning Methods Do We Study?

  • The authors select a set of candidate models that are potentially well suited to address the three empirical challenges outlined above.
  • They constitute the canon of methods one would encounter in a graduate level machine learning textbook.
  • This includes linear regression, generalized linear models with penalization, dimension reduction via principal components regression (PCR) and partial least squares (PLS), regression trees (including boosted trees and random forests), and neural networks.
  • The authors exclude support vector machines as these share an equivalence with other methods that they study and are primarily used for classification problems.
  • Nonetheless, their list is designed to be representative of predictive analytics tools from various branches of the machine learning toolkit.

1.5 Main Empirical Findings

  • The authors conduct a large scale empirical analysis, investigating nearly 30,000 individual stocks over 60 years from 1957 to 2016.
  • This suggests that allowing for (potentially complex) interactions among the baseline predictors is a crucial aspect of nonlinearities in the expected return function.
  • Trees and neural networks improve upon this further, generating monthly out-of-sample R²'s between 1.08% and 1.80%.
  • The evidence for economic gains from machine learning forecasts, in the form of portfolio Sharpe ratios, is likewise impressive.
  • The most successful predictors are price trends, liquidity, and volatility.

1.6 What Machine Learning Cannot Do

  • Machine learning has great potential for improving risk premium measurement, which is fundamentally a problem of prediction.
  • But these improved predictions are only measurements.
  • The measurements do not tell us about economic mechanisms or equilibria.
  • Machine learning methods on their own do not identify deep fundamental associations among asset prices and conditioning variables.
  • When the objective is to understand economic mechanisms, machine learning may still be useful.

1.7 Literature

  • The authors' work extends the empirical literature on stock return prediction, which comes in two basic strands.
  • These traditional methods have potentially severe limitations that more advanced statistical tools in machine learning can help overcome.
  • Khandani et al. (2010) and Butaru et al. (2016) use regression trees to predict consumer credit card delinquencies and defaults.
  • Recently, variations of machine learning methods have been used to study the cross section of stock returns.

2 Methodology

  • This section describes the collection of machine learning methods that the authors use in their analysis.
  • The third element in each subsection describes computational algorithms for efficiently identifying the optimal specification among the permutations encompassed by a given method.
  • As the authors present each method, they aim to provide a sufficiently in-depth description of the statistical model so that a reader having no machine learning background can understand the basic model structure without needing to consult outside sources.
  • By maintaining the same form over time and across different stocks, the model leverages information from the entire panel which lends stability to estimates of risk premia for any individual asset.

2.1 Sample Splitting and Tuning via Validation

  • Important preliminary steps (prior to discussing specific models and regularization approaches) are to understand how the authors design disjoint sub-samples for estimation and testing and to introduce the notion of "hyperparameter tuning".
  • In particular, the authors divide their sample into three disjoint time periods that maintain the temporal ordering of the data.
  • The authors construct forecasts for data points in the validation sample based on the estimated model from the training sample.
  • Tuning parameters are chosen by evaluating forecasts on the validation sample, while the model parameters themselves are estimated from the training data alone.
  • Hyperparameter tuning amounts to searching for a degree of model complexity that tends to produce reliable out-of-sample performance; a minimal sketch of this split-and-tune loop follows below.
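
The following is a minimal Python sketch of the split-and-tune procedure described above. The date cutoffs, the ridge learner, and the penalty grid are illustrative assumptions rather than the paper's exact configuration (the paper re-splits recursively with an expanding training window).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

def temporal_split(panel: pd.DataFrame, train_end, valid_end):
    """Split a date-stamped panel into three disjoint, temporally ordered samples."""
    train = panel[panel["date"] <= train_end]
    valid = panel[(panel["date"] > train_end) & (panel["date"] <= valid_end)]
    test = panel[panel["date"] > valid_end]
    return train, valid, test

def tune_on_validation(train, valid, features, target="ret", alphas=(0.1, 1.0, 10.0)):
    """Estimate parameters on the training data only; choose the tuning parameter
    (here, a ridge penalty) by forecast error on the validation sample."""
    best = (None, np.inf, None)
    for alpha in alphas:
        model = Ridge(alpha=alpha).fit(train[features], train[target])
        mse = np.mean((valid[target].to_numpy() - model.predict(valid[features])) ** 2)
        if mse < best[1]:
            best = (alpha, mse, model)
    return best[0], best[2]
```

Once the tuning parameter is fixed, the selected model is evaluated only on the test sample, whose observations never enter estimation or tuning.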

2.2 Simple Linear

  • The authors begin their model description with the least complex method in their analysis, the simple linear predictive regression model estimated via ordinary least squares (OLS).
  • While the authors expect this to perform poorly in their high dimension problem, they use it as a reference point for emphasizing the distinctive features of more sophisticated methods.
  • The simple linear model imposes that the conditional expectation g can be approximated by a linear function of the raw predictor variables and the parameter vector θ: g(z_{i,t}; θ) = z_{i,t}'θ (equation (3) in the paper). This specification does not allow for nonlinear effects or interactions between predictors.
  • The convenience of the baseline ℓ2 (least squares) objective function is that it offers analytical estimates and thus avoids sophisticated optimization and computation; a minimal OLS sketch follows below.
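
A minimal sketch of the baseline specification g(z_{i,t}; θ) = z_{i,t}'θ fit by OLS on simulated data; the data and dimensions are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_pred = 5_000, 10                      # stacked stock-month observations
Z = rng.normal(size=(n_obs, n_pred))           # raw predictors z_{i,t}
theta_true = rng.normal(scale=0.02, size=n_pred)
r = Z @ theta_true + rng.normal(scale=0.10, size=n_obs)   # next-month excess returns

# The l2 (least squares) objective has a closed-form solution, so no
# iterative optimizer is required.
theta_hat, *_ = np.linalg.lstsq(Z, r, rcond=None)
r_hat = Z @ theta_hat                          # fitted risk-premium forecasts
```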

2.2.1 Extension: Robust Objective Functions

  • This allows the econometrician to tilt estimates towards observations that are more statistically or economically informative.
  • This imposes that every month has the same contribution to the model regardless of how many stocks are available that month.
  • This value-weighted loss function underweights small stocks in favor of large stocks, and is motivated by the economic rationale that small stocks represent a large fraction of the traded universe by count while constituting a tiny fraction of aggregate market capitalization.
  • Convexity of the least squares objective (4) places extreme emphasis on large errors, thus outliers can undermine the stability of OLS-based predictions.
  • The Huber loss, H, is a hybrid of squared loss for relatively small errors and absolute loss for relatively large errors, where the combination is controlled by a tuning parameter, ξ, that can be optimized adaptively from the data; see the formula sketch below.
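
For concreteness, a standard parameterization of the Huber function (quadratic for small errors, linear beyond the threshold) is shown below; treat it as a generic definition rather than a transcription of the paper's equation, with ξ tuned adaptively on the validation sample.

```latex
H(x;\,\xi) =
\begin{cases}
  x^{2}, & |x| \le \xi, \\
  2\xi\,|x| - \xi^{2}, & |x| > \xi.
\end{cases}
```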

2.3 Penalized Linear

  • The simple linear model is bound to fail in the presence of many predictors.
  • Crucial for avoiding overfit is reducing the number of estimated parameters.
  • The statistical model for their penalized linear model is the same as the simple linear model in equation (3).
  • The authors focus on the popular "elastic net" penalty, which takes the form φ(θ; λ, ρ) = λ(1 − ρ) Σ_j |θ_j| + ½ λρ Σ_j θ_j², combining the lasso (ℓ1) and ridge (ℓ2) penalties through an overall penalty weight λ and a mixing parameter ρ.
  • Ridge is a shrinkage method that helps prevent coefficients from becoming unduly large in magnitude, while the lasso component imposes sparsity; a minimal elastic net sketch follows below.
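
A minimal elastic net sketch using scikit-learn; the penalty strength alpha and the lasso/ridge mix l1_ratio play the roles of λ and ρ and would be tuned on the validation sample, so the values below are placeholders.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
Z = rng.normal(size=(5_000, 100))              # many, partly redundant predictors
theta = np.zeros(100)
theta[:5] = 0.05                               # only a handful truly matter
r = Z @ theta + rng.normal(scale=0.10, size=5_000)

# l1_ratio interpolates between ridge (0: pure shrinkage) and
# lasso (1: shrinkage plus variable selection).
enet = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(Z, r)
n_selected = int(np.sum(enet.coef_ != 0))      # sparsity induced by the l1 term
```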

2.4 Dimension Reduction: PCR and PLS

  • Penalized linear models use shrinkage and variable selection to manage high dimensionality by forcing the coefficients on most regressors near or exactly to zero.
  • This can produce suboptimal forecasts when predictors are highly correlated.
  • In the first step, principal components analysis (PCA) combines regressors into a small set of linear combinations that best preserve the covariance structure among the predictors.
  • Likewise, the predictive coefficients are estimated by regressing returns on the reduced set of components rather than on the full predictor set.
  • Intuitively, PCR seeks the K linear combinations of Z that most faithfully mimic the full predictor set, whereas PLS chooses components based on their covariance with the forecast target; a minimal sketch of both follows below.
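
A minimal sketch of both dimension reduction approaches on simulated data; the number of components K is a tuning parameter chosen on the validation sample, and all dimensions below are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
Z = rng.normal(size=(5_000, 100))
r = 0.02 * Z[:, :5].sum(axis=1) + rng.normal(scale=0.10, size=5_000)
K = 10                                          # number of components to keep

# PCR: components chosen to preserve predictor covariance, then OLS on them.
components = PCA(n_components=K).fit_transform(Z)
pcr_forecast = LinearRegression().fit(components, r).predict(components)

# PLS: components chosen for their covariance with the forecast target.
pls_forecast = PLSRegression(n_components=K).fit(Z, r).predict(Z).ravel()
```

The difference in the component step is the key design choice: PCR is unsupervised and ignores the target when forming components, while PLS tilts the components toward directions that covary with returns.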

2.5 Generalized Linear

  • When the "true" model is complex and nonlinear, restricting the functional form to be linear introduces approximation error due to model misspecification.
  • Let g*(z_{i,t}) denote the true model and g(z_{i,t}; θ) the functional form specified by the econometrician.
  • And let ĝ(z_{i,t}; θ̂) and r̂_{i,t+1} denote the fitted model and its ensuing return forecast.


  • There are many potential choices for spline functions.
  • Because higher order terms enter additively, forecasting with the generalized linear model can be approached with the same estimation tools as in Section 2.2.
  • Because series expansion quickly multiplies the number of model parameters, the authors use penalization to control degrees of freedom.
  • The authors' choice of penalization function is specialized for the spline expansion setting and is known as the group lasso; a sketch of the expansion step follows below.
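
A minimal sketch of the series-expansion step: each predictor is expanded into spline basis terms and the expanded design is fit with a penalized regression. The paper's group lasso keeps or drops all spline terms of a given predictor together; a plain lasso is substituted here only because it ships with scikit-learn, so this is an approximation of that scheme.

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
Z = rng.uniform(-1.0, 1.0, size=(5_000, 20))
r = 0.05 * np.sin(2.0 * Z[:, 0]) + rng.normal(scale=0.10, size=5_000)

# Expand every predictor into a handful of spline basis functions.
# Higher-order terms enter additively, so estimation stays linear in parameters.
spline = SplineTransformer(degree=2, n_knots=5, include_bias=False)
Z_expanded = spline.fit_transform(Z)            # many more columns than Z

# Penalization controls the degrees of freedom created by the expansion.
glm_fit = Lasso(alpha=0.001, max_iter=10_000).fit(Z_expanded, r)
```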

2.6 Boosted Regression Trees and Random Forests

  • The model in (13) captures individual predictors' nonlinear impact on expected returns, but does not account for interactions among predictors.
  • While expanding univariate predictors with K basis functions multiplies the number of parameters by a factor of K, multi-way interactions increase the parameterization combinatorially.
  • At each new level, the authors choose a sorting variable from the set of predictors and the split value to maximize the discrepancy among average outcomes in each bin.
  • Branching halts when the number of leaves or the depth of the tree reach a pre-specified threshold that can be selected adaptively using a validation sample.
  • Next, a second simple tree (with the same shallow depth L) is used to fit the prediction residuals from the first tree, and the procedure continues additively; a bare-bones boosting sketch follows below.
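
The residual-fitting logic can be written in a few lines; this is a bare-bones squared-error boosting sketch with shallow trees and a shrinkage weight, not the paper's exact GBRT implementation, whose depth, number of trees, and learning rate are tuned on the validation sample.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(Z, r, n_trees=100, depth=2, learning_rate=0.1):
    """Additively grow shallow trees, each fit to the residuals of the ensemble so far."""
    forecast = np.zeros_like(r, dtype=float)
    trees = []
    for _ in range(n_trees):
        residual = r - forecast
        tree = DecisionTreeRegressor(max_depth=depth).fit(Z, residual)
        forecast += learning_rate * tree.predict(Z)   # shrink each tree's contribution
        trees.append(tree)
    return trees, forecast

rng = np.random.default_rng(4)
Z = rng.normal(size=(5_000, 10))
r = 0.05 * Z[:, 0] * Z[:, 1] + rng.normal(scale=0.10, size=5_000)  # interaction signal
trees, fitted = boost(Z, r)
```

A random forest instead averages many deep trees grown on bootstrap samples with random predictor subsets; both methods are available off the shelf as GradientBoostingRegressor and RandomForestRegressor.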

2.7 Neural Networks

  • The final nonlinear method that the authors analyze is the artificial neural network.
  • Arguably the most powerful modeling device in machine learning, neural networks have theoretical underpinnings as "universal approximators" for any smooth predictive association (Hornik et al., 1989; Cybenko, 1989).
  • The right panel of Figure 2 shows an example with one hidden layer that contains five neurons.
  • Training a very deep neural network is challenging because it typically involves a large number of parameters, because the objective function is highly non-convex, and because the recursive calculation of derivatives (known as "back-propagation") is prone to exploding or vanishing gradients.
  • In each step of the optimization algorithm, the parameter guesses are gradually updated to reduce prediction errors in the training sample; a minimal network sketch follows below.
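
A minimal sketch of a shallow feed-forward network like the one-hidden-layer, five-neuron example in Figure 2, trained with a gradient-based optimizer. The paper's networks (NN1 through NN5) are deeper and use additional training refinements, so the configuration below is illustrative only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)
Z = rng.normal(size=(5_000, 10))
r = 0.05 * np.tanh(Z[:, 0] * Z[:, 1]) + rng.normal(scale=0.10, size=5_000)

# One hidden layer with five neurons; 'adam' performs the gradual gradient-based
# parameter updates, and early stopping (on a held-out fraction) guards against overfit.
nn = MLPRegressor(hidden_layer_sizes=(5,), activation="relu", solver="adam",
                  learning_rate_init=0.01, early_stopping=True, max_iter=500,
                  random_state=0).fit(Z, r)
r_hat = nn.predict(Z)
```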

2.8 Performance Evaluation

  • To assess predictive performance for individual excess stock return forecasts, the authors calculate an out-of-sample R², R²_oos = 1 − Σ_{(i,t)∈T3} (r_{i,t+1} − r̂_{i,t+1})² / Σ_{(i,t)∈T3} r²_{i,t+1}, where T3 indicates that fits are only assessed on the testing subsample, whose data never enter into model estimation or tuning (see the sketch after this list).
  • In certain circumstances, early stopping and weight-decay are shown to be equivalent.
  • The authors adapt Diebold-Mariano to their setting by comparing the cross-sectional average of prediction errors from each model, instead of comparing errors among individual returns.
  • This modified Diebold-Mariano test statistic, which is now based on a single time series d_{12,t+1} of error differences with little autocorrelation, is more likely to satisfy the mild regularity conditions needed for asymptotic normality and in turn provide appropriate p-values for their model comparison tests.
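
A minimal sketch of both evaluation tools. The out-of-sample R² follows the convention described above of benchmarking against a forecast of zero (no demeaning), and the modified Diebold-Mariano comparison reduces to a t-statistic on the monthly series of cross-sectionally averaged squared-error differences; any autocorrelation adjustment is omitted here for brevity.

```python
import numpy as np

def r2_oos(r, r_hat):
    """Pooled out-of-sample R^2 on the test subsample, with a zero-forecast benchmark."""
    r, r_hat = np.asarray(r), np.asarray(r_hat)
    return 1.0 - np.sum((r - r_hat) ** 2) / np.sum(r ** 2)

def modified_dm_stat(err1, err2):
    """err1, err2: (T x N) matrices of test-sample forecast errors from two models.
    Average squared errors across stocks each month, then t-test the T differences."""
    d = np.nanmean(err1 ** 2, axis=1) - np.nanmean(err2 ** 2, axis=1)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
```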

2.9 Variable Importance and Marginal Relationships

  • The authors goal in interpreting machine learning models is modest.
  • The authors aim to identify covariates that have an important influence on the cross-section of expected returns while simultaneously controlling for the many other predictors in the system.
  • The authors consider two different notions of importance.
  • The first is the reduction in panel predictive R² from setting all values of predictor j to zero, while holding the remaining model estimates fixed (used, for example, in the context of dimension reduction by Kelly et al., 2019).
  • The second, proposed in the neural networks literature by Dimopoulos et al. (1995), is the sum of squared partial derivatives (SSD) of the model with respect to each input variable j, which summarizes the sensitivity of model fits to changes in that variable; a minimal sketch of the first measure follows below.
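
A minimal sketch of the first importance measure: set one predictor to zero at a time, keep the fitted model fixed, and record the drop in predictive R². The model object is a placeholder with a predict method; the SSD measure would instead sum squared (numerical) partial derivatives of the fit with respect to each input.

```python
import numpy as np

def r2_oos(r, r_hat):
    return 1.0 - np.sum((r - r_hat) ** 2) / np.sum(r ** 2)

def zero_out_importance(model, Z, r):
    """Reduction in R^2 from zeroing out each predictor, holding the model fixed."""
    base = r2_oos(r, model.predict(Z))
    importance = np.empty(Z.shape[1])
    for j in range(Z.shape[1]):
        Z_j = Z.copy()
        Z_j[:, j] = 0.0                        # knock out predictor j only
        importance[j] = base - r2_oos(r, model.predict(Z_j))
    return importance
```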

3.1 Data and Over-arching Model

  • The authors obtain monthly total individual equity returns from CRSP for all firms listed in the NYSE, AMEX, and NASDAQ.
  • These include 94 stock-level characteristics (61 of which are updated annually, 13 updated quarterly, and 20 updated monthly).
  • The structure of their feature set in (21) allows purely stock-level information to enter expected returns via c_{i,t}, in analogy with the risk exposure function β_{i,t}, and also allows aggregate economic conditions to enter in analogy with the dynamic risk premium λ_t.
  • Each time the authors refit, they increase the training sample by one year; a minimal sketch of this expanding-window scheme follows below.
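
A minimal sketch of the recursive refitting scheme: the training window grows by one year per iteration, a fixed-length validation window rolls forward just ahead of it, and the refit model forecasts the following test year. The column names, the feature list, and the fit_and_tune helper are placeholders.

```python
import pandas as pd

def expanding_window_forecasts(panel, features, first_test_year, last_test_year,
                               n_valid_years, fit_and_tune):
    """Refit once per test year with an expanding training sample."""
    out = []
    for test_year in range(first_test_year, last_test_year + 1):
        valid_start = test_year - n_valid_years
        train = panel[panel["year"] < valid_start]          # expands every iteration
        valid = panel[(panel["year"] >= valid_start) & (panel["year"] < test_year)]
        test = panel[panel["year"] == test_year]
        model = fit_and_tune(train, valid)                  # user-supplied estimator
        out.append(test.assign(forecast=model.predict(test[features])))
    return pd.concat(out)
```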

3.2 The Cross Section of Individual Stocks

  • The authors also report these R²_oos within subsamples that include only the top 1,000 stocks or bottom 1,000 stocks by market value.
  • To quantify the complexity of GBRT, the authors report the number of features used in the boosted tree ensemble at each re-estimation point.
  • The baseline pattern, in which OLS fares poorly, regularized linear models improve on it, and nonlinear models dominate, carries over into subsamples.
  • The authors highlight how inference changes under a conservative Bonferroni multiple comparisons correction that divides the significance level by the number of comparisons (for example, with 30 comparisons a 5% level becomes roughly 0.17% per test).

3.3 Which Covariates Matter?

  • The authors now investigate the relative importance of individual covariates for the performance of each model using the importance measures described in Section 2.9.
  • Characteristics are ordered so that the highest total ranks are on top and the lowest ranking characteristics are at the bottom.
  • Within each model, the authors calculate the Pearson correlation between relative importances from SSD and the R 2 measure.
  • Noise variables appear among the least informative characteristics, along with sin stocks, dividend initiation/omission, cashflow volatility, and other accounting variables.
  • Nonlinear methods (trees and neural networks) place great emphasis on exactly those predictors ignored by linear methods, such as term spreads and issuance activity.

3.3.1 Marginal Association Between Characteristics and Expected Returns

  • Figure 6 traces out the model-implied marginal impact of individual characteristics on expected excess returns.
  • First, Figure 6 illustrates that machine learning methods identify patterns similar to some well known empirical phenomena.
  • Second, the linear model finds no predictive association between returns and either size or volatility, while trees and neural networks find large sensitivity of expected returns to both of these variables.
  • According to NN3, a firm that drops from median size to the 20th percentile of the size distribution experiences an increase in its annualized expected return of roughly 2.4% (0.002 × 12 × 100), and a firm whose volatility rises from the median to the 80th percentile experiences a decrease of around 3.0% per year; these nonlinear methods detect predictive associations that the linear model misses. A sketch of how such marginal effects can be traced out follows below.
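
Marginal associations of the kind plotted in Figure 6 can be traced with a simple "vary one characteristic, hold the rest at their median" loop, sketched below. The fitted model, the characteristic index, and the grid are placeholders; the [-1, 1] grid assumes characteristics have been rank-transformed onto that interval.

```python
import numpy as np

def marginal_effect(model, Z, j, grid):
    """Model-implied expected return as characteristic j moves along `grid`,
    with all other characteristics held at their cross-sectional median."""
    base = np.median(Z, axis=0)
    curve = []
    for value in grid:
        z = base.copy()
        z[j] = value
        curve.append(model.predict(z.reshape(1, -1))[0])
    return np.array(curve)

# Example (placeholders): trace the size characteristic over a [-1, 1] grid.
# curve = marginal_effect(nn_model, Z_test, size_index, np.linspace(-1, 1, 41))
```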

3.3.2 Interaction Effects

  • The favorable performance of trees and neural networks indicates a benefit to allowing for potentially complex interactions among predictors.
  • Machine learning models are often referred to as "black boxes".
  • The upper-left figure shows that the short-term reversal effect is strongest and is essentially linear among small stocks (blue line).
  • Finally, the lower-right panel shows that NN3 estimates no interaction effect between size and accruals: the size lines are simply vertical shifts of the univariate accruals curve.
  • Furthermore, the dominant macroeconomic interactions are stable over time.

3.4 Portfolio Forecasts

  • Next, the authors compare forecasting performance of machine learning methods for aggregate portfolio returns.
  • First, because all of their models are optimized for stock-level forecasts, portfolio forecasts provide an additional indirect evaluation of the model and its robustness.
  • Second, aggregate portfolios tend to be of broader economic interest because they represent the risky-asset savings vehicles most commonly held by investors (via mutual funds, ETFs, and hedge funds).
  • Last but not least, the portfolio results are one step further "out-of-sample" in that the optimization routine does not directly account for the predictive performance of the portfolios.

3.4.1 Pre-specified Portfolios

  • The authors build bottom-up forecasts by aggregating individual stock return predictions into portfolios.
  • The authors form bottom-up forecasts for 30 of the most well-known portfolios in the empirical finance literature.
  • Table 5 reports the monthly out-of-sample R² over their 30-year testing sample.
  • Consistent with their other results, the strongest and most consistent trading strategies are those based on nonlinear models, with neural networks the best overall.
  • In the case of NN3, the Sharpe ratio from timing the S&P 500 index is 0.77, or 26 percentage points higher than a buy-and-hold position; a minimal sketch of the bottom-up aggregation follows below.
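
A minimal sketch of the bottom-up aggregation: a portfolio's return forecast is the weighted sum of its constituents' stock-level forecasts, after which the same out-of-sample R² and Sharpe ratio metrics can be computed at the portfolio level. The weight matrix (for example, value weights for the S&P 500 or characteristic-sorted portfolios) is an input supplied by the user.

```python
import numpy as np

def bottom_up_forecasts(stock_forecasts, weight_matrix):
    """stock_forecasts: length-N vector of individual stock return forecasts.
    weight_matrix: (P x N) portfolio weights, one row per portfolio (rows sum to 1).
    Returns a length-P vector of portfolio-level return forecasts."""
    return np.asarray(weight_matrix) @ np.asarray(stock_forecasts)

def annualized_sharpe(monthly_excess_returns):
    """Annualized Sharpe ratio of a monthly excess-return series."""
    r = np.asarray(monthly_excess_returns)
    return np.sqrt(12.0) * r.mean() / r.std(ddof=1)
```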

3.4.2 Machine Learning Portfolios

  • Next, rather than assessing forecast performance among pre-specified portfolios, the authors design a new set of portfolios to directly exploit machine learning forecasts.
  • All stocks are sorted into deciles based on their predicted returns for the next month.
  • In a linear factor model, the tangency portfolio of the factors themselves represents the maximum Sharpe ratio portfolio in the economy.
  • The value-weighted decile spread Sharpe ratio is 1.33, which is slightly lower than that for the NN4 model; a minimal sketch of the decile-sort construction follows below.
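
A minimal sketch of the prediction-sorted portfolios: each month stocks are ranked into deciles by their one-month-ahead forecast, and the spread portfolio goes long decile 10 and short decile 1. The column names and the equal weighting within deciles are illustrative choices.

```python
import pandas as pd

def decile_spread_returns(df):
    """df columns: 'date', 'forecast', 'ret' (realized next-month excess return).
    Returns the monthly long-short (decile 10 minus decile 1) return series."""
    def one_month(g):
        decile = pd.qcut(g["forecast"], 10, labels=False) + 1
        return g.loc[decile == 10, "ret"].mean() - g.loc[decile == 1, "ret"].mean()
    return df.groupby("date").apply(one_month)
```

The resulting monthly spread series can then be annualized with the same Sharpe ratio helper as in the previous sketch.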

4 Conclusion

  • At the highest level, the findings demonstrate that machine learning methods can help improve the empirical understanding of asset prices.
  • Neural networks and, to a lesser extent, regression trees, are the best performing methods.
  • The authors track down the source of their predictive advantage to accommodation of nonlinear interactions that are missed by other methods.
  • Lastly, the authors find that all methods agree on a fairly small set of dominant predictive signals, the most powerful predictors being associated with price trends including return reversal and momentum.
  • With better measurement through machine learning, risk premia are less shrouded in approximation and estimation error, and the challenge of identifying reliable economic mechanisms behind asset pricing phenomena thus becomes less steep.


NBER WORKING PAPER SERIES
EMPIRICAL ASSET PRICING VIA MACHINE LEARNING
Shihao Gu
Bryan Kelly
Dacheng Xiu
Working Paper 25398
http://www.nber.org/papers/w25398
NATIONAL BUREAU OF ECONOMIC RESEARCH
1050 Massachusetts Avenue
Cambridge, MA 02138
December 2018, Revised September 2019
We benefitted from discussions with Joseph Babcock, Si Chen (Discussant), Rob Engle, Andrea
Frazzini, Amit Goyal (Discussant), Lasse Pedersen, Lin Peng (Discussant), Alberto Rossi
(Discussant), Guofu Zhou (Discussant), and seminar and conference participants at Erasmus
School of Economics, NYU, Northwestern, Imperial College, National University of Singapore,
UIBE, Nanjing University, Tsinghua PBC School of Finance, Fannie Mae, U.S. Securities and
Exchange Commission, City University of Hong Kong, Shenzhen Finance Institute at CUHK,
NBER Summer Institute, New Methods for the Cross Section of Returns Conference, Chicago
Quantitative Alliance Conference, Norwegian Financial Research Conference, EFA, China
International Conference in Finance, 10th World Congress of the Bachelier Finance Society,
Financial Engineering and Risk Management International Symposium, Toulouse Financial
Econometrics Conference, Chicago Conference on New Aspects of Statistics, Financial
Econometrics, and Data Science, Tsinghua Workshop on Big Data and Internet Economics, Q
group, IQ-KAP Research Prize Symposium, Wolfe Research, INQUIRE UK, Australasian
Finance and Banking Conference, Goldman Sachs Global Alternative Risk Premia Conference,
AFA, and Swiss Finance Institute. We gratefully acknowledge the computing support from the
Research Computing Center at the University of Chicago. The views expressed herein are those
of the authors and do not necessarily reflect the views of the National Bureau of Economic
Research.
At least one co-author has disclosed a financial relationship of potential relevance for this
research. Further information is available online at http://www.nber.org/papers/w25398.ack
NBER working papers are circulated for discussion and comment purposes. They have not been
peer-reviewed or been subject to the review by the NBER Board of Directors that accompanies
official NBER publications.
© 2018 by Shihao Gu, Bryan Kelly, and Dacheng Xiu. All rights reserved. Short sections of text,
not to exceed two paragraphs, may be quoted without explicit permission provided that full
credit, including © notice, is given to the source.

Empirical Asset Pricing via Machine Learning
Shihao Gu, Bryan Kelly, and Dacheng Xiu
NBER Working Paper No. 25398
December 2018, Revised September 2019
JEL No. C45,C55,C58,G11,G12
ABSTRACT
We perform a comparative analysis of machine learning methods for the canonical problem of
empirical asset pricing: measuring asset risk premia. We demonstrate large economic gains to
investors using machine learning forecasts, in some cases doubling the performance of leading
regression-based strategies from the literature. We identify the best performing methods (trees
and neural networks) and trace their predictive gains to allowance of nonlinear predictor
interactions that are missed by other methods. All methods agree on the same set of dominant
predictive signals which includes variations on momentum, liquidity, and volatility. Improved
risk premium measurement through machine learning simplifies the investigation into economic
mechanisms of asset pricing and highlights the value of machine learning in financial innovation.
Shihao Gu
University of Chicago Booth School of Business
5807 S. Woodlawn
Chicago, IL 60637
shihaogu@chicagobooth.edu
Bryan Kelly
Yale School of Management
165 Whitney Ave.
New Haven, CT 06511
and NBER
bryan.kelly@yale.edu
Dacheng Xiu
Booth School of Business
University of Chicago
5807 South Woodlawn Avenue
Chicago, IL 60637
dachxiu@chicagobooth.edu

1 Introduction
In this article, we conduct a comparative analysis of machine learning methods for finance. We do
so in the context of perhaps the most widely studied problem in finance, that of measuring equity
risk premia.
1.1 Primary Contributions
Our primary contributions are two-fold. First, we provide a new set of benchmarks for the predictive accuracy of machine learning methods in measuring risk premia of the aggregate market and individual stocks. This accuracy is summarized two ways. The first is a high out-of-sample predictive R² relative to preceding literature that is robust across a variety of machine learning specifications. Second, and more importantly, we demonstrate the large economic gains to investors using machine learning forecasts. A portfolio strategy that times the S&P 500 with neural network forecasts enjoys an annualized out-of-sample Sharpe ratio of 0.77, versus the 0.51 Sharpe ratio of a buy-and-hold investor. And a value-weighted long-short decile spread strategy that takes positions based on stock-level neural network forecasts earns an annualized out-of-sample Sharpe ratio of 1.35, more than doubling the performance of a leading regression-based strategy from the literature.

Return prediction is economically meaningful. The fundamental goal of asset pricing is to understand the behavior of risk premia.[1] If expected returns were perfectly observed, we would still need theories to explain their behavior and empirical analysis to test those theories. But risk premia are notoriously difficult to measure: market efficiency forces return variation to be dominated by unforecastable news that obscures risk premia. Our research highlights gains that can be achieved in prediction and identifies the most informative predictor variables. This helps resolve the problem of risk premium measurement, which then facilitates more reliable investigation into economic mechanisms of asset pricing.
Second, we synthesize the empirical asset pricing literature with the field of machine learning. Relative to traditional empirical methods in asset pricing, machine learning accommodates a far more expansive list of potential predictor variables and richer specifications of functional form. It is this flexibility that allows us to push the frontier of risk premium measurement. Interest in machine learning methods for finance has grown tremendously in both academia and industry. This article provides a comparative overview of machine learning methods applied to the two canonical problems of empirical asset pricing: predicting returns in the cross section and time series. Our view is that the best way for researchers to understand the usefulness of machine learning in the field of asset pricing is to apply and compare the performance of each of its methods in familiar empirical problems.

[1] Our focus is on measuring conditional expected stock returns in excess of the risk-free rate. Academic finance traditionally refers to this quantity as the "risk premium" due to its close connection with equilibrium compensation for bearing equity investment risk. We use the terms "expected return" and "risk premium" interchangeably. One may be interested in potentially distinguishing among different components of expected returns such as those due to systematic risk compensation, idiosyncratic risk compensation, or even due to mispricing. For machine learning approaches to this problem, see Gu et al. (2019) and Kelly et al. (2019).

1.2 What is Machine Learning?
The definition of “machine learning” is inchoate and is often context specific. We use the term to
describe (i) a diverse collection of high-dimensional models for statistical prediction, combined with
(ii) so-called “regularization” methods for model selection and mitigation of overfit, and (iii) efficient
algorithms for searching among a vast number of potential model specifications.
The high-dimensional nature of machine learning methods (element (i) of this definition) enhances
their flexibility relative to more traditional econometric prediction techniques. This flexibility brings
hope of better approximating the unknown and likely complex data generating process underlying
equity risk premia. With enhanced flexibility, however, comes a higher propensity of overfitting
the data. Element (ii) of our machine learning definition describes refinements in implementation
that emphasize stable out-of-sample performance to explicitly guard against overfit. Finally, with
many predictors it becomes infeasible to exhaustively traverse and compare all model permutations.
Element (iii) describes clever machine learning tools designed to approximate an optimal specification
with manageable computational cost.
1.3 Why Apply Machine Learning to Asset Pricing?
A number of aspects of empirical asset pricing make it a particularly attractive field for analysis with
machine learning methods.
1) Two main research agendas have monopolized modern empirical asset pricing research. The first seeks to describe and understand differences in expected returns across assets. The second focuses on dynamics of the aggregate market equity risk premium. Measurement of an asset's risk premium is fundamentally a problem of prediction: the risk premium is the conditional expectation of a future realized excess return. Machine learning, whose methods are largely specialized for prediction tasks, is thus ideally suited to the problem of risk premium measurement.

2) The collection of candidate conditioning variables for the risk premium is large. The profession has accumulated a staggering list of predictors that various researchers have argued possess forecasting power for returns. The number of stock-level predictive characteristics reported in the literature numbers in the hundreds and macroeconomic predictors of the aggregate market number in the dozens.[2] Additionally, predictors are often close cousins and highly correlated. Traditional prediction methods break down when the predictor count approaches the observation count or predictors are highly correlated. With an emphasis on variable selection and dimension reduction techniques, machine learning is well suited for such challenging prediction problems by reducing degrees of freedom and condensing redundant variation among predictors.

3) Further complicating the problem is ambiguity regarding the functional forms through which the high-dimensional predictor set enters into risk premia. Should predictors enter linearly? If nonlinearities are needed, which form should they take? Must we consider interactions among predictors? Such questions rapidly proliferate the set of potential model specifications. The theoretical literature offers little guidance for winnowing the list of conditioning variables and functional forms. Three aspects of machine learning make it well suited for problems of ambiguous functional form. The first is its diversity. As a suite of dissimilar methods it casts a wide net in its specification search. Second, with methods ranging from generalized linear models to regression trees and neural networks, machine learning is explicitly designed to approximate complex nonlinear associations. Third, parameter penalization and conservative model selection criteria complement the breadth of functional forms spanned by these methods in order to avoid overfit biases and false discovery.

[2] Green et al. (2013) count 330 stock-level predictive signals in published or circulated drafts. Harvey et al. (2016) study 316 "factors," which include firm characteristics and common factors, for describing stock return behavior. They note that this is only a subset of those studied in the literature. Welch and Goyal (2008) analyze nearly 20 predictors for the aggregate market return. In both stock and aggregate return predictions, there presumably exists a much larger set of predictors that were tested but failed to predict returns and were thus never reported.
1.4 What Specific Machine Learning Methods Do We Study?
We select a set of candidate models that are potentially well suited to address the three empirical challenges outlined above. They constitute the canon of methods one would encounter in a graduate-level machine learning textbook.[3] This includes linear regression, generalized linear models with penalization, dimension reduction via principal components regression (PCR) and partial least squares (PLS), regression trees (including boosted trees and random forests), and neural networks. This is not an exhaustive analysis of all methods. For example, we exclude support vector machines as these share an equivalence with other methods that we study[4] and are primarily used for classification problems. Nonetheless, our list is designed to be representative of predictive analytics tools from various branches of the machine learning toolkit.
1.5 Main Empirical Findings
We conduct a large scale empirical analysis, investigating nearly 30,000 individual stocks over 60 years from 1957 to 2016. Our predictor set includes 94 characteristics for each stock, interactions of each characteristic with eight aggregate time series variables, and 74 industry sector dummy variables, totaling more than 900 baseline signals. Some of our methods expand this predictor set much further by including nonlinear transformations and interactions of the baseline signals. We establish the following empirical facts about machine learning for return prediction.

Machine learning shows great promise for empirical asset pricing. At the broadest level, our main empirical finding is that machine learning as a whole has the potential to improve our empirical understanding of expected asset returns. It digests our predictor data set, which is massive from the perspective of the existing literature, into a return forecasting model that dominates traditional approaches. The immediate implication is that machine learning aids in solving practical investment problems such as market timing, portfolio choice, and risk management, justifying its role in the business architecture of the fintech industry.

Consider as a benchmark a panel regression of individual stock returns onto three lagged stock-level characteristics: size, book-to-market, and momentum. This benchmark has a number of attrac-

[3] See, for example, Hastie et al. (2009).
[4] See, for example, Jaggi (2013) and Hastie et al. (2009), who discuss the equivalence of support vector machines with the lasso. For an application of the kernel trick to the cross section of returns, see Kozak (2019).

Frequently Asked Questions (12)
Q1. What have the authors contributed in "Nber working paper series empirical asset pricing via machine learning" ?

The authors contribute a comparative analysis of machine learning methods for the canonical problem of measuring asset risk premia. They provide new benchmarks for out-of-sample predictive accuracy at both the individual-stock and aggregate-market level, demonstrate large economic gains to investors (for example, a neural network strategy that times the S&P 500 earns an annualized out-of-sample Sharpe ratio of 0.77 versus 0.51 for buy-and-hold), identify trees and neural networks as the best performing methods, and trace their gains to nonlinear predictor interactions missed by other methods.

For characteristic-based portfolios, nonlinear machine learning methods help improve Sharpe ratios by anywhere from a few percentage points to over 24 percentage points. 

While value-weight portfolios are less sensitive to trading cost considerations, it is perhaps more natural to study equal weights in their analysis because their statistical objective functions minimize equally weighted forecast errors. 

The most common machine learning device for imposing parameter parsimony is to append a penalty to the objective function in order to favor more parsimonious specifications. 

Ensemble methods demonstrate more reliable performance and are scalable for very large datasets, leading to their increased popularity in recent literature. 

The final advantage of analyzing predictability at the portfolio level is that the authors can assess the economic contribution of each method via its contribution to risk-adjusted portfolio return performance. 

Bottom-up portfolio forecasts allow us to evaluate a model's ability to transport its asset predictions, which occur at the finest asset level, into broader investment contexts.

The challenge is how to assess the incremental predictive content of a newly proposed predictor while jointly controlling for the gamut of extant signals (or, relatedly, handling the multiple comparisons and false discovery problem). 

The number of units in the input layer is equal to the dimension of the predictors, which the authors set to four in this example (denoted z1, ..., z4). 

To initialize the network, similarly define the input layer using the raw predictors, x^(0) = (1, z_1, ..., z_N)'. Eldan and Shamir (2016) formally demonstrate that depth, even if increased by one layer, can be exponentially more valuable than increasing width in standard feed-forward neural networks.
