The Replication Crisis That Wasn’t

March 24, 2021 - Cliff Asness

The estimable Larry Swedroe and our friends at Alpha Architect have saved me a lot of work by wonderfully summarizing my colleagues’[1] new paper and the literature it addresses. Thus, I’ll keep this mercifully brief.

[1] Well, some of them are my colleagues. I think this Theis guy may be totally made up.

Data mining is the idea that backtested results are generally overstated, as researchers try many different things and don’t report failures but, rather, have a large bias to report successes. If that is the case, then traditional inference (“hey it’s a two t-statistic!”) is not valid.[2] It has always been a concern, and researchers have always taken it seriously. I can tell you that we have been obsessed with this issue since before the New York Rangers won their first Stanley Cup in 54 years (and now, sadly, a new 27 year drought is well underway), before anyone knew who Jerry Seinfeld was (unless they watched Johnny), and almost as important, since there was a nasty wall in Berlin.[3] But, even back then we knew about this issue. The term “replication crisis” had not been invented yet, but we talked about overfitting, data mining, and the “file-drawer” effect (good results get published, bad ones go into the drawer from which nothing returns…). Whether one is just trying to understand markets, or actually implementing these strategies for clients and/or themselves, the concern that highly touted results may be exaggerated is an obvious worry!

[2] To perhaps state the obvious, if you try 100 random things on average one will appear 1/100 unlikely.

[3] For the last one I’m reaching back to my days in the finance Ph.D. program (that’s where I was in 1989).
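To make the “try 100 random things” point concrete, here is a minimal simulation sketch. It is an illustration only, not anything from the paper: the function names, the 240-month sample length, and the normally distributed returns are all assumptions. Every simulated “factor” has a true expected return of exactly zero, yet some of them will typically clear the two t-statistic bar by luck alone.

```python
import random
import statistics


def t_stat(returns):
    """t-statistic of the mean return against zero."""
    n = len(returns)
    return statistics.mean(returns) / (statistics.stdev(returns) / n ** 0.5)


def simulate_zoo(n_factors=100, n_months=240, seed=0):
    """Draw n_factors strategies whose TRUE mean return is zero
    (hypothetical setup) and return the t-statistic of each one."""
    rng = random.Random(seed)
    return [
        t_stat([rng.gauss(0.0, 1.0) for _ in range(n_months)])
        for _ in range(n_factors)
    ]


t_stats = simulate_zoo()
n_significant = sum(abs(t) > 2.0 for t in t_stats)
best = max(t_stats)

print(f"{n_significant} of {len(t_stats)} worthless factors have |t| > 2")
print(f"best t-statistic among them: {best:.2f}")
```

If only the winners get written up and the rest go in the file drawer, the published record looks far better than the truth, which is exactly the worry.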

We have always, as almost anyone who’s ever had to sit through a presentation from us can confirm, addressed this in two ways: out-of-sample evidence and theory/story. A factor found to have a positive realized return in one place over a limited time period,[4] no matter how good the return, is always suspect. How will it hold up in other geographies, time periods, or even asset classes? As the great economist Henny Youngman often said, take Fama and French’s original work on the value factor.[5] They did this mainly using one value factor (price-to-book) in one country (the USA) in one asset class (within equities) over a period that now seems way too short (1963 through the late 1980s).[6] I know I’m old, but we now have more out-of-sample data than they had in-sample data.[7] Since then, researchers have tested it in other countries, using other valuation measures, for selecting equity exposure across countries, and in multiple other asset classes. Researchers have even extended the original results back to 1929 or even further. And, of course, there’s that little matter of the factors holding up since the late 1980s too.[8]

[4] Think of this as often the first paper published on a pretty new finding.

[5] No, even as a Chicago hyper-partisan I don’t claim that Fama and French get sole credit for the value factor, or much more extreme, value investing in general (I promise that someone in the Roman Empire was saying “don’t buy that, it has a price-to-book of X!!!” and then someone else laughed at them for a while as it went to XX, the original cynic had to cover his position and accept exile, but ultimately it fell to III). But, they did produce the seminal work on it, and the dates and coverage of that work are a reasonable place to declare “in-sample.”

[6] I often, jokingly but accurately, refer to 1990-present as my personal out-of-sample period as it does roughly qualify as out-of-sample for the original results and my own career.

[7] I’m not yet talking about going to other geographies, asset classes, or back in time. Just about running their specific tests going forward.

[8] I’ve often told the story that standing at Goldman Sachs in the early to mid-1990s we didn’t have the option of observing the next 25-30 years out-of-sample data (if so, “buy Apple” would’ve trumped any factor strategy). A bad career strategy would’ve been for me to tell the partners I reported to “here’s the plan, we wait 25-30 more years, and if these things hold up out-of-sample, we pounce on them!” So we consoled ourselves with the other stuff: tests of other measures, geographies, asset classes, etc. That worked. But, it’s now nice to observe that the true out-of-sample through time results that we obviously couldn’t access back then (no Pym Particles) turned out ok. Recall, I think that even the famous disappointing out-of-sample results for value in the USA (it’s been much stronger ex USA) are actually pretty darn good.

Taken as a whole, the results have been an outstanding confirmation of the initial results for the value factor (and I will be discussing other factors soon). I say this as one very conscious of (nay, obsessed with) value’s recent difficulties (and as one with the temerity to claim these, until a recent partial resurgence, excruciating results don’t really change our long-term estimate of the value factor’s expected return). And, while you need to account for valuation changes and its negative correlation with momentum to see the strength of the simple price-to-book value factor in the USA since 1990 (BTW, adjusting for valuation changes and examining value in combination with momentum are legit!), the totality of the out-of-sample evidence is way stronger than just that.[9]

[9] Again, across geographies, back as well as forward in time from the original tests, across asset classes, etc. Of course it’s better some places than others. For instance, again, the value factor has been much stronger ex-USA than in the USA post-1990. Even if the market was trying to set the expected return to value the same everywhere (which by no means are we saying it did or should) randomness and better measurement (e.g., measuring value for commodities is hard) will lead to quite varied outcomes. But, again, I don’t think anyone can read the post Fama-French (including subsequent work by Fama-French) literature and not agree with my positive assessment.

I’ve only discussed value but I could write a very similar paragraph for momentum (itself having some pretty cool super long-term backtests). Other stalwarts of the quant factor pantheon, like quality and low-risk investing, are similar, if not as exhaustive, success stories.[10]

[10] Quality has its own out-of-sample success stories but we are more limited in when and where we can test it. Though, I may have been too negative on low-risk investing as it might actually rival value in its ubiquity.

In addition to out-of-sample evidence, we also require[11] some understanding of why the factor should work to begin with. These don’t have to be provable single stories. For instance, both a believer in efficient markets and one who thinks markets are very warped by behavioral finance can have good (but separate) reasons to believe in the value factor. But there has to be some sense to it. Famous examples used to illustrate the perils of data mining include timing the market using butter production in Bangladesh or the winner of the Super Bowl. These are edifying and make us laugh. But it’s harder to judge when the researcher is dealing with less obviously ridiculous[12] measures. If your new factor is made up of measures from Compustat, CRSP, and FRED, it will likely never be as obviously ex ante ridiculous as these cautionary tales. All I can say here is that we try very hard to have high standards!

[11] Ok, “require” may be too strong a word. If a factor, say a high frequency one, worked super well in tons of places over very different time periods, but we didn’t have a good story, we might put some weight on it, but less than if we also had a story!

[12] Sorry, I refuse to be super bullish on stocks this year because Tom Brady became an even bigger G.O.A.T. (and I also refuse to even look up Bangladesh butter production, though I hope it’s doing well).

Data mining isn’t the only criticism faced by the factor literature. A more basic problem is that a backtest might never have been right to begin with. A “replication” crisis is most specifically about this – not being able to even replicate the original work.[13] At AQR, we usually try to replicate academic research to see how it holds up, and, similarly, when an internal researcher finds a new result, we have someone else independently replicate the work to ensure that it is valid.

[13] An erroneous backtest probably won’t hold up out-of-sample for a somewhat more direct reason than just data mining.

Next, a backtest could’ve been pristine (i.e., correctly implemented and unexaggerated by repeated iterations), had a great and in fact true story behind it, but cease to work going forward as more and more people learn about the pattern and invest in the factor; perhaps because of this self-same pristine research. Indeed, enough investment based on the factor can arbitrage away its efficacy, although a strategy can actually still work going forward even when “everyone” knows about it, but only up to some point (enough capital can drive away any edge).[14] It’s very fair to worry that a strategy might be arbitraged away once revealed and over-invested in.[15] It’s a whole other thing to just assume it must be the case.

[14] We are lucky in the sense that what we do is pretty big capacity and it takes a heck of a lot of capital to make something like “value” go away.

[15] The second part is crucial. Many times I’ve mentioned that a strategy doesn’t get arbed away by people knowing about it but by people over-doing it. For instance, you still sometimes hear that value’s difficulties in the last decade (and especially 2018-2020) are because too many people now know about it (you also heard that in 1999-2000). That’s a plausible story before looking at any facts, but the facts are not kind to it. As I’ve beaten to death in quite a few places, value’s problems have not arisen because it’s too popular and well-known, but because it’s too hated. Kind of, you know, backwards from this typical wise-sounding but facile explanation. However, I do look forward to it potentially being arbitraged away over the next ten years as I’ll then be in my mid-60s and (perhaps! – it’ll be hard to drag me away) retiring with ten years of tail- instead of head-winds behind me, and behind our clients (pity my successors and the super-tight arbitraged away value spread they may inherit!).

So, none of these are brand new issues. But there is a relatively new and growing literature on it. There have been numerous papers over the last few years (again, see Larry’s recent summary) examining the “factor zoo” (a term coined by the great John Cochrane). “Zoo” isn’t generally used as a compliment and is not so used here. While specific papers of course vary, the general consensus has been that factors have been disappointing since their “discovery” (either measured by out-of-sample returns as a fraction of back-tested returns, or by the fraction of factors that hold up out-of-sample, or by the really terrible ones that didn’t even replicate in-sample). Apparently the Zoo was indeed too big, data mining too prevalent, and we’ve all learned a well-deserved lesson. Time to move on.

Not so fast! These papers, most of them well-done, have often stirred up financial media reaction that is, well, less well-done. Stories like “investors only get 50-60% of a backtest going forward because it’s been r-b-trajed!!!”, with a tone that factor investors are ivory tower idiots who can’t cope when confronted with how the real world works beyond their backtests, are not uncommon.[16][17] As I will soon discuss, we think these same 50-60% type results are a cause for celebration, and we have thought so for near thirty years.

[16] Nor is my work here proof we’re not ivory tower idiots.

[17] Of course, I typically react very calmly and rationally to such silly headlines.

Since data mining has been such a concern of ours for decades, we take this burgeoning literature quite seriously. Some of our responses have been to note that we don’t believe in 100s of factors. We believe in a handful of factor types and not even in every factor beloved by the literature. If you measure value (again, using it as the example) 100 different ways and average them, you are not testing 100 independent factors as is, unfortunately, sometimes asserted. If you average them you are not cherry picking the best way but creating, in our view, a more robust factor (and thus, this should be viewed as more positive evidence for value not being data-mined).[18] Thus we’ve always thought the factor zoo is smaller than some others do, and we note that we only believe in the parts of it that overcome our many hurdles. We have never felt the need or desire to defend every paper ever written on factor investing.[19]

[18] If you chose the best few of a hundred and then believed they’d be as superior going forward that would be very different (and a bad idea).

[19] Please note, my colleagues’ new paper that is the subject of this blog actually goes further and does in fact test all the factors, including ones AQR doesn’t bet on, and finds the results are still, as a group, O.K. out-of-sample. Of course, we’d still argue for investing in our subset of this exhaustive list.
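The difference between averaging many value measures and cherry picking the best one can be sketched with a toy simulation. Everything here is a hypothetical setup chosen for illustration (the function names, the 0.3% monthly premium, the noise levels, the 120-month samples); it is not AQR’s methodology or the paper’s. All 100 simulated measures share the same true premium, so any in-sample gap between the “best” measure and the simple average is pure selection luck.

```python
import random
import statistics


def simulate_measures(n_measures=100, n_months=120, premium=0.3, seed=1):
    """Monthly returns for n_measures noisy versions of ONE underlying factor.
    Each month: shared premium + shared shock + measure-specific noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_months):
        common = premium + rng.gauss(0.0, 2.0)
        rows.append([common + rng.gauss(0.0, 2.0) for _ in range(n_measures)])
    return rows  # rows[month][measure]


def mean_by_measure(rows):
    """Average monthly return of each measure over the sample."""
    return [statistics.mean(column) for column in zip(*rows)]


in_sample = simulate_measures(seed=1)
out_of_sample = simulate_measures(seed=2)  # fresh draw, same true premium

is_means = mean_by_measure(in_sample)
oos_means = mean_by_measure(out_of_sample)

# Cherry picking: choose the measure that looked best in-sample.
best = max(range(len(is_means)), key=is_means.__getitem__)

# Averaging: the equal-weight composite of all measures.
composite_is = statistics.mean(is_means)
composite_oos = statistics.mean(oos_means)

print(f"cherry-picked measure: in-sample {is_means[best]:.2f}, "
      f"out-of-sample {oos_means[best]:.2f}")
print(f"equal-weight composite: in-sample {composite_is:.2f}, "
      f"out-of-sample {composite_oos:.2f}")
```

By construction the cherry-picked measure’s in-sample mean can never be below the composite’s, even though neither is truly better. Expecting that in-sample edge to persist is the bad idea described in footnote 18.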

Finally, at its most simple, we have never, not for a second, believed that one should expect to get results going forward that are as good as a backtest. We may fight hard, and on net successfully, to control data mining, but that doesn’t mean we can eliminate it. Some always creeps in to even the best-intentioned backtests. For rough (very rough) justice, we’ve always started from an assumption that you’d get half the backtested results going forward (of course, specific situations might call for different estimates[20]). Thus, oddly, when papers come out saying that if you invested in all the published factors right after they were known you’d only get, like, half the backtest going forward, others see a problem while we generally pump our fists in a kind of obnoxious “yes!!!”-like motion.

[20] Perhaps while examining a strategy you are conscious that more data mining occurred than normal, or the opposite, that it was a very clean new backtest of something you already have tested in many other places. Also a hundred year backtest would probably get more weight than a hundred day backtest.

So, what’s left to do? I’ve handled the “it’s all data mining or arbitraged away” critics just fine without my sagacious colleagues, no? Well, you might have noticed that my counter-arguments to critics of factor investing are piecemeal. They aren’t a formal test showing that we’re likely right – they are just a series of guideposts that give us great confidence. Well, along come Jensen, Kelly, and Pedersen[21] to test, and test brilliantly, what we have argued, largely anecdotally, for many years.[22] They form a framework (warning, their paper is a lot more “mathy” than mine are these days, oh to be a young geek again) that is quite painstaking in broad factor inclusion and replication, that accounts for the correlations among factors (again, there ain’t really 100s of factors), and, most importantly imho, doesn’t start from the, frankly, silly notion that “it’s a failure if we don’t get 100% of a backtest out-of-sample.”[23] This Bayesian framework starts with the simple intuitive prior that the factors have zero expected return. The in-sample results influence us to raise our estimate from this start of zero, but not one-for-one. If you started out thinking something was likely zero with some confidence, then you observed a bunch of data that said it was positive, you would be foolish to think “oh, now it’s as good as what I just observed.” Your prior counts too. The idea is simple and intuitive. It’s a needed and brilliant formalization of something we, and we believe many other factor investors, have always done. Simply put, we do not assume you should do as well as a backtest, and this new paper formalizes and tests this notion.

[21] Is it just me or do all listings of co-authors sound like a law firm?

[22] Anecdotally sounds too negative. We think they are really good anecdotes using a lot of data not just pontificating. But, still, anecdotes.

[23] I’m not criticizing the academic papers that preceded this one, but rather the popular interpretations of those papers that get so much play.
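The “prior of zero, raised by the data but not one-for-one” logic can be sketched with the textbook normal-normal update. This is a deliberately simplified stand-in, not the paper’s actual model (which is considerably more elaborate and handles many correlated factors jointly). The function name and every number below are hypothetical.

```python
def shrink_backtest(backtest_mean, standard_error, prior_sd, prior_mean=0.0):
    """Posterior mean of a factor's expected return under a normal prior
    and a normal likelihood (textbook conjugate update, assumed here).

    backtest_mean  : average return observed in the backtest
    standard_error : sampling uncertainty of that average
    prior_sd       : how far from prior_mean we think true returns can plausibly be
    Returns (posterior_mean, weight_on_backtest).
    """
    prior_var = prior_sd ** 2
    sample_var = standard_error ** 2
    weight = prior_var / (prior_var + sample_var)  # between 0 and 1
    posterior_mean = prior_mean + weight * (backtest_mean - prior_mean)
    return posterior_mean, weight


# A 6% backtest whose standard error equals the prior's spread gets cut in half.
estimate, weight = shrink_backtest(backtest_mean=6.0, standard_error=2.0, prior_sd=2.0)
print(f"expected return going forward: {estimate:.1f}% (weight on backtest {weight:.2f})")
# prints: expected return going forward: 3.0% (weight on backtest 0.50)

# A longer, cleaner backtest (smaller standard error) earns more weight.
estimate_long, weight_long = shrink_backtest(backtest_mean=6.0, standard_error=1.0, prior_sd=2.0)
print(f"with a tighter backtest: {estimate_long:.1f}% (weight {weight_long:.2f})")
```

Under these assumed numbers, the “half the backtest” rule of thumb falls out whenever the noise in the backtest is about as large as the prior’s spread, and the weight rises as the evidence gets longer and cleaner.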

Their results are rather startlingly (even to me) positive for the field in general. I’ll leave you in the hands of Larry for, again, a very well done summary and just quote from the main paper’s abstract here:

“The majority of asset pricing factors: (1) can be replicated, (2) can be clustered into 13 themes, the majority of which are significant parts of the tangency portfolio, (3) work out of sample in a new large data set covering 93 countries, and (4) have evidence that is strengthened (not weakened) by the large number of observed factors.”

In plainer words, done properly (using a consistent methodology, accounting for the simultaneous testing of many factors, comparing to a fair yardstick and not 100% of a backtest, studying the broadest set of factors in the most places, using global data, and even making their code and data available online) the study of ex post “after the research paper came out and told everybody the good news” factor results is extremely supportive of factor investing.[24]

[24] Their results are actually a little more extreme than even I would’ve forecast. Again, we pride ourselves in investing in a subset of factors where perhaps we have a prior higher than zero they should work. They don’t use any such latitude but rather find life would be pretty good if you just invested in them all!

I think this is one of the most important papers in our field for a long time. I am, of course, incredibly biased, both by my own interests and the esteem in which I hold my colleagues (ex-Theis of course; I don’t trust that guy as far as I could throw a Danish sumo wrestler).[25] So I do encourage you to read the actual paper and decide for yourself if I’m right. But I am.[26]

[25] Bryan and Lasse are afraid somebody will take my comments about Theis, who I’ve not had the pleasure of meeting, seriously. I am always running into this problem. Suffice it to say that Bryan and Lasse rave about Theis, and I rave about Bryan and Lasse, so by the transitive property I rave about Theis.

[26] If this sounds arrogant please note that we’ve discussed ad nauseam, and just lived through a pretty clear example, that none of this makes factor investing easy in real life!

Oh, and I’m not sure this blog qualifies as brief, so I may have lied in the beginning when I promised you that. But, it would’ve been a lot less brief without Larry’s excellent summary, so a big thank you to him!

Disclosures

The views and opinions expressed herein are those of the author and do not necessarily reflect the views of AQR Capital Management, LLC, its affiliates or its employees.

Past performance is no guarantee of future results.

This document has been provided to you solely for information purposes and does not constitute an offer or solicitation of an offer or any advice or recommendation to purchase any securities or other financial instruments and may not be construed as such. There can be no assurance that an investment strategy will be successful. Historic market trends are not reliable indicators of actual future market behavior or future performance of any particular investment which may differ materially and should not be relied upon as such. This material should not be viewed as a current or past recommendation or a solicitation of an offer to buy or sell any securities or to adopt any investment strategy.

AQR Capital Management, LLC, (“AQR”) provides links to third-party websites only as a convenience, and the inclusion of such links does not imply any endorsement, approval, investigation, verification or monitoring by us of any content or information contained within or accessible from the linked sites. If you choose to visit the linked sites, you do so at your own risk, and you will be subject to such sites’ terms of use and privacy policies, over which AQR.com has no control. In no event will AQR be responsible for any information or content within the linked sites or your use of the linked sites. Information contained on third party websites that AQR may link to is not reviewed in its entirety for accuracy and AQR assumes no liability for the information contained on these websites.

This document is not research and should not be treated as research. This document does not represent valuation judgments with respect to any financial instrument, issuer, security or sector that may be described or referenced herein and does not represent a formal or official view of AQR. This document has been prepared solely for informational purposes. The information contained herein is only as current as of the date indicated, and may be superseded by subsequent market events or for other reasons. Nothing contained herein constitutes investment, legal, tax or other advice nor is it to be relied on in making an investment or other decision.
