As the old adage has it, before it gets better it will get worse.
Three interesting recent exhibits
Over the last couple of weeks we have seen three interesting additional exhibits:
First, a quartet of researchers from Harvard tried to straighten out the public record established a few months earlier by the Open Science Collaboration (OSC), a group of 270 researchers from psychology: namely, that results in psychology are mostly not replicable. Gilbert and his colleagues make the remarkable claim that there is “no evidence for a replicability crisis in psychological science.”
Second, on the same day, an author made the astonishing claim in Slate magazine that 20 years of studies of ego depletion, an influential and seemingly robust body of findings, have recently dissolved into thin air.
Third, a group of 18 researchers in economics published the long-awaited findings of another replication project and, after the dismal findings of a similar attempt reported by two researchers from the St. Louis Fed last year, had mostly good news for economists. A colleague of mine from psychology sent me the write-up from The Economist with these words:
“So all is fine in the house of experimental economics then…”
Is there a crisis, if not in economics, then in psychology? Or in the social sciences as such?
In the following I will briefly comment on each of these three events from the last couple of weeks and the heat that they have produced. I will then try to cast a broader net before I come to an assessment of the current state of affairs in the social sciences.
So, is all fine in the house of experimental economics?
As I pointed out to my colleague, a successful replication of 11 out of 18 experiments published in a couple of top journals (which have acceptance rates of well below 5 percent and huge selection biases) says little about the state of the art, i.e. about replicability and reproducibility, in economics. I like to believe that evidence production in economics is more stable than in psychology because economists’ experimentation practices are less laissez-faire, but I fear that we also have a lot of false positives. In work that I have done with Le Zhang (currently under second-round review), we have shown that dictator game experiments published in the top experimental economics journal were typically severely underpowered, inviting those pesky false positives. While Camerer et al. ran their replications under an exacting standard of a required power of 0.9, until recently most (well, at least most dictator game) experiments in economics were not properly powered up.
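The power arithmetic at issue here can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the calculation from the Zhang paper; the effect size d = 0.3 is an illustrative assumption, as is the 30-subjects-per-cell design.

```python
from statistics import NormalDist

_N = NormalDist()  # standard normal distribution

def power_two_sample(n_per_group, d, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of means
    for a standardized effect size d (normal approximation)."""
    z_crit = _N.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5
    return (1 - _N.cdf(z_crit - shift)) + _N.cdf(-z_crit - shift)

def n_for_power(d, target=0.9, alpha=0.05):
    """Smallest per-group sample size reaching the target power."""
    n = 2
    while power_two_sample(n, d, alpha) < target:
        n += 1
    return n

# Illustrative assumption: a modest true effect of d = 0.3.
# Thirty subjects per cell then buys only about 21% power,
# while Camerer et al.'s standard of 0.9 requires roughly 234 per cell.
```

On these assumed numbers, a typical 30-subjects-per-cell design sits at roughly a fifth of the power that Camerer et al. demanded of their replications.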
It is also worth recalling briefly that a few months earlier two Federal Reserve economists came to the alarming conclusion that economics research is usually not replicable. Their conclusion was based on an attempt to replicate 67 empirical papers in 13 reputable academic journals, of which they could, without assistance from the original researchers, replicate only a third of the results. With the original researchers’ assistance that percentage increased to about half. A good summary of their study can be found here. This is an arguably even more troubling result: you would think that empirical findings, which draw on pre-existing data sets, should be easier to replicate than experimental ones, which are subject to the additional variance that experimenters’ choices of design and implementation details entail. This replication attempt indicates the magnitude of the problems that economics might have.
Indeed, we have seen for a couple of decades now that effects (for example, many so-called cognitive illusions) have not survived serious attempts at replication, or maybe I should say at reproduction. Take, as a recent non-laboratory example, the controversy over reference dependence and the alleged propensity of taxi drivers to shoot for income targets, thereby violating the neo-classical optimizing model of labor supply theory. In a recent article, Hank Farber, using a much larger and more complete data set for New York taxi drivers than Camerer et al. (QJE 1997) had, finds that “income reference dependence is not an important factor in the daily labor supply decisions of taxi drivers”. My colleague Tess Stafford had come to a similar conclusion earlier, demonstrating how the results in Camerer et al. (QJE 1997) can be made to appear and disappear. Hint: proper metrics is key. Complete and large data sets also help.
I could parade many examples (endowment effects anyone? Loss aversion? Conjunction fallacy?) where serious questions have been raised about the replicability and reproducibility of effects claimed in the heuristics-and-biases literature.
On balance, then, there is reason to believe that economists have a way to go: they ought to continue to improve their data collection and sharing efforts, to reflect on the design and implementation of their experiments and, very importantly, on the appropriate econometric assessment of the evidence produced. The house of experimental economics, I fear, is not yet in good order.
Is all fine in the house of experimental psychology?
The Harvard quartet’s critique of the OSC initiative, and its provocative conclusion that “the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%”, has been widely dissected by the OSC as well as by individual OSC members (e.g., Brian Nosek and Elizabeth Gilbert here). A number of commentators had their take on the situation published in popular media (e.g., Katie Palmer in WIRED and Ed Yong in The Atlantic, http://www.theatlantic.com/science/archive/2016/03/psychologys-replication-crisis-cant-be-wished-away/472272/; for others, see the Retraction Watch list or Mayo’s summary of the recent developments on repligate), and some highly qualified (albeit not always completely disinterested) parties such as David Funder, Andrew Gelman, Daniel Lakens, Uri Simonsohn, and Sanjay Srivastava have done so in more specialized outlets. (Follow the links attached to the names.)
The latter three have, at least to my mind, pretty thoroughly demolished the case that Gilbert and his colleagues presented. As have Funder and Gelman (and some of the commenters on Gelman’s piece).
Funder and Gelman also step back from the battle and look at the war that is really being waged here, and by doing so they provide some much-needed light where there is currently way too much heat.
Funder, for example, points out that the OSC study “is not the only, and was far from the first, sign that we have a problem”; he is too modest to point out that he himself provided a lengthy contribution and problem description more than a decade ago.
Funder, seemingly unaware of the replication crisis that economists are dealing with, also points out that “other fields have replicability problems too”; he mentions specifically biochemistry, molecular biology, and medical research, including cancer biology studies.
He then asks, “if Gilbert & Co. are right, are we to take it that the concerns in our sister sciences are also overblown?” It is a rhetorical question to which his answer is pretty clear. He concludes with a useful discussion of “the ultimate source of unreliable scientific research”, locating it in a tightening market for academic jobs and opportunities, the emerging “academic star” system, and other perverse incentives for academics.
The apparent debunking of 20 years of ego-depletion findings, mentioned at the beginning as one of the three prominent developments of the last two weeks, is just another illustration of the current unsatisfactory state of affairs.
So, is there a crisis?
You have to live in an ivory tower to believe that there is not. It seems obvious to me that there is, and that before it gets better, it will get worse. That’s because suddenly everyone is talking about it and has become interested in it. And a general sentiment has developed, and has even found its way into editorial practices, that flashy results which barely clear conventional hurdles ought not to be trusted.
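That sentiment has simple arithmetic behind it. The standard positive-predictive-value calculation asks what share of statistically significant findings reflect true effects; the prior of 0.1 and both power figures below are illustrative assumptions, not estimates from any study discussed here.

```python
def ppv(prior, power, alpha=0.05):
    """Positive predictive value: the share of 'significant' findings
    that reflect true effects, given the prior probability that a
    tested hypothesis is true, the study's power, and the alpha level."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# Illustrative assumption: 1 in 10 tested hypotheses is true.
low = ppv(prior=0.1, power=0.2)    # underpowered studies: ~0.31
high = ppv(prior=0.1, power=0.9)   # well-powered studies: ~0.67
```

On these assumed numbers, and even before publication bias and p-hacking enter the picture, a barely significant result from an underpowered study deserves the skepticism it is now getting.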
Some observers have taken the current debate as a cue to rethink the way we do science. In psychology, for example, in the wake of huge and somewhat nasty controversies over the reality of various priming effects, a replication recipe has been proposed, and it would indeed be a good start if it were widely implemented. Likewise, the various offerings of pre-registration, while not completely uncontroversial, are a welcome move in the right direction, if the rather stunning results from NHLBI-funded trials recently reported in PLOS One are any indication.
How deep the crisis is, is a question that is harder to answer. That’s because any such answer depends on what our measuring rod is, and ought to be. Are we looking for what some people call direct replication, or are we really interested in what some people have called conceptual replication and yet others have called reproduction? Ben Strickland makes an excellent case for conceptual replication here, arguing that what we really ought to be after is reproducibility of robust effects. Rolf Zwaan makes a related argument here.
This of course ties into the important question of the appropriate choice of design and implementation characteristics, as well as the question of the correct statistical evaluation, which is another, related battlefield.
In sum, there can be little doubt that there is a crisis. There is, for all I can see, no “crisis of the crisis”: no reason to think the diagnosis itself is overblown. And it seems fair to say, to judge by the evidence that has been forthcoming, that the sense that there is a crisis has both widened and deepened.
That there is a crisis, and that it is widening and deepening, at least for now, is the bad news. The good news is that overdue discussions about replicability and reproducibility, and everything connected to them, do take place, and do so in a serious manner. Mostly.
The widening sense of a widening and deepening crisis is upping the level of the game; it is, for example, encouraging to see the increasing offerings of pre-registered studies, the increased opportunities to publish replications and reproductions, and the fact that many journals now require submission of data files before publication. Similarly, it is encouraging to see platforms such as Retraction Watch emerge and, clearly, stay for good.