New Things Under the Sun is a living literature review; as the state of the academic literature evolves, so do we. This post highlights some recent updates.
Same Data, Same Question, Different Answers
The post “One question, many answers” looked at the “many analyst” literature, wherein a bunch of different researchers and research teams independently try to answer the same set of questions, using the exact same dataset. Surprisingly, it’s not at all uncommon for different teams to arrive at different conclusions. I’ve added to this post another recent paper, Menkveld et al. (2021):
Finally, Menkveld et al. (2021) wrangles 164 teams of economists to test six different hypotheses about financial markets using a big dataset of European trading data. Testing these hypotheses required participants to define and build their own measures and indices, and to see whether these had increased or decreased over time. As should be no surprise by now, the teams came up with an enormous range of estimates. For example, on one hypothesis - how has the share of client volume in total volume changed - 4% of teams found it had increased, 46% found it had declined, and 50% found no statistically significant change over time.
The updated post integrates discussion of Menkveld et al. (2021) throughout, where it echoes the findings of other papers in this genre, for example, in its finding that the dispersion of expertise does not seem to account for much of the dispersion in results. Instead, the post argues this dispersion stems from some of the inadequacies in our “methodological technology.” There are many different points at which researchers can make different, defensible research choices, and those differences add up.
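To get a feel for how those choices add up, here is a minimal toy simulation of my own (nothing in it comes from Menkveld et al.; every choice name and number below is invented). A handful of defensible decisions multiply into dozens of analysis paths, and small nudges at each step can push otherwise identical “teams” toward opposite conclusions:

```python
import itertools
import random

random.seed(0)

# Hypothetical analysis choices a team might defensibly make. None of these
# come from Menkveld et al. (2021); they are placeholders for illustration.
choices = {
    "sample window":    ["full period", "post-2012 only"],
    "outlier handling": ["keep", "winsorize", "drop"],
    "volume measure":   ["share of trades", "share of value"],
    "trend test":       ["OLS slope", "first-vs-last-year difference"],
}

# Assume each option nudges the estimated trend by a small random amount
# around a true trend of zero (purely illustrative numbers).
nudge = {option: random.gauss(0, 1.0)
         for options in choices.values() for option in options}

# One estimate per possible combination of choices.
estimates = [sum(nudge[option] for option in path)
             for path in itertools.product(*choices.values())]

# Treat |estimate| > 1 as a stand-in for a "statistically significant" trend.
n = len(estimates)
print(f"{n} distinct analysis paths")  # 2 * 3 * 2 * 2 = 24
print(f"estimates range from {min(estimates):.1f} to {max(estimates):.1f}")
print(f"share concluding 'increase': {sum(e > 1 for e in estimates) / n:.0%}")
print(f"share concluding 'decline':  {sum(e < -1 for e in estimates) / n:.0%}")
```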
One place researchers can make different decisions is at step one: what counts as evidence that answers the stated research question? Another recent paper - Auspurg and Brüderl (2021) - suggests such differences were an important factor in the divergent outcomes found in one of the most famous of these studies, Silberzahn et al. (2018).
Auspurg and Brüderl (2021) provides some interesting detail on what drove different answers in [Silberzahn et al. (2018)], by digging back into the original study’s records. After analyzing each team’s submitted reports, Auspurg and Brüderl argue that the 29 teams were actually trying to answer (broadly) four different questions.
Recall [Silberzahn et al. (2018)’s] research prompt was “are soccer players with dark skin tone more likely than those with light skin tone to receive red cards from referees?” Auspurg and Brüderl argue some teams interpreted this quite literally, and sought to compute the simple average difference in the risk of red cards among dark- and light-skinned players, with no effort to adjust for any other systematic differences between the players. Others thought this was a question specifically about racial bias. For them, the relevant hypothetical was the average difference in risk of a red card between two players who were identical except for their skin tone. Yet others interpreted the question as asking “if we are trying to predict the risk of red cards, does skin tone show up as one of the most important factors?” And still others thought of the whole project as being about maximizing the methodological diversity used to tackle a question, and saw their role as trying out novel and unusual methodologies, rather than whatever approach they thought most likely to arrive at the right answer!
Menkveld and coauthors’ paper on financial markets provides some other evidence that tighter bounds on what counts as evidence can reduce, though not eliminate, the dispersion of answers. Recall this paper asked researchers to test six different hypotheses. Some of these hypotheses were relatively ambiguous, such as “how has market efficiency changed over time?”, leaving it to researchers to define and implement a measure of market efficiency. Other hypotheses permitted much less scope for judgment, such as “how has the share of client volume in total volume changed?” The dispersion of answers for the more tightly defined questions was much narrower than for the more nebulous questions.
The updated post also discusses some promising evidence that when teams are allowed to discuss each other’s results and offer feedback, this can substantially reduce the dispersion in their results.
More Evidence Publication Bias is Real
The many analysts literature is worrying enough, but publication bias compounds the problem it identifies. Publication bias is when the probability a result gets published depends on the result itself. In general, we worry that there is a preference for novel results that identify some new statistical relationship, as opposed to results that find no statistically significant correlation between variables. This can create a biased picture of the evidence, because if so-called “null results” are not publishable, a review of the literature will seem to find unanimous evidence for some statistical relationship.
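A toy simulation of my own (not drawn from any of the papers discussed below) makes the mechanism concrete: suppose the true effect is zero, every study’s result is pure noise, and null results are only rarely published. The published record then looks like near-unanimous evidence for a relationship that isn’t there:

```python
import random

random.seed(1)

N_STUDIES = 100_000
PUBLISH_NULL_PROB = 0.05  # illustrative: null results only rarely get published

n_sig_all = n_published = n_sig_published = 0
for _ in range(N_STUDIES):
    # The true effect is zero, so each study's z-statistic is pure noise.
    z = random.gauss(0.0, 1.0)
    significant = abs(z) > 1.96
    n_sig_all += significant
    # Significant results always get published; null results only occasionally.
    if significant or random.random() < PUBLISH_NULL_PROB:
        n_published += 1
        n_sig_published += significant

print(f"share significant among all studies run:   {n_sig_all / N_STUDIES:.0%}")        # ~5%
print(f"share significant among published studies: {n_sig_published / n_published:.0%}")  # ~50%
```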
The post “Publication bias is real” reviews various lines of evidence on the existence of publication bias and its magnitude. I’ve added to this post a new short section on experimental papers.
As a first step, let’s consider some papers that use experiments to explicitly see whether reviewers treat papers differently, depending on the results. In each of these papers, reviewers receive descriptions of papers (or actual papers) that are basically identical, except for the results. For one random set of reviewers, the papers (or descriptions of papers) obtain statistically significant results; in the other, these results are changed to be statistically insignificant. But as much as possible, what the reviewers see is otherwise unchanged. The papers then compare the recommendations and ratings of the two groups of reviewers to see if the non-significant results are rated more poorly or given lower recommendations than the significant ones.
We have three papers from different fields. Emerson et al. (2010) has 110 actual reviewers at orthopedic journals do a standard peer review of one of two different fictitious papers, which are identical but for the results. Berinsky et al. (2021) email short descriptive vignettes of research papers to all faculty in US political science departments that grant PhDs and have respondents fill out surveys about these vignettes, getting about 1,000 responses. Similarly, Chopra et al. (2022) collect about 500 responses from economists at top 200 departments on short descriptive vignettes of economics papers. These studies varied a bit in exactly how they measured support for publication and what other dimensions they studied, but in all cases reviewers believed papers with statistically significant results were better candidates for publication. The figure below tracks, in dark blue, the probability a given reviewer would support publication among the reviewers who saw a statistically significant finding, while light blue illustrates the same for reviewers who saw a statistically insignificant result of an otherwise identical paper.
To emphasize - the only difference in the papers or paper vignettes that respondents read in the above figure was whether the result was described as statistically significant or not. Holding everything else fixed - the research question, the methodology, the quality of the writing, the sample size, etc - reviewers were less likely to recommend the versions of the papers that found non-significant results be published.
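For readers who want to see the shape of this comparison, here is a minimal sketch of how one might test whether acceptance rates differ between the two randomized groups of reviewers. The counts are invented for illustration; they are not the actual numbers from any of the three papers:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical counts, invented for illustration (not the actual numbers from
# Emerson, Berinsky, or Chopra): reviewers recommending publication, by version seen.
sig_accept, sig_total = 87, 100    # reviewers shown the statistically significant version
null_accept, null_total = 61, 100  # reviewers shown the statistically insignificant version

p_sig, p_null = sig_accept / sig_total, null_accept / null_total
p_pool = (sig_accept + null_accept) / (sig_total + null_total)

# Two-proportion z-test: do acceptance rates differ between the randomized groups?
se = sqrt(p_pool * (1 - p_pool) * (1 / sig_total + 1 / null_total))
z = (p_sig - p_null) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"acceptance rates: {p_sig:.0%} (significant) vs {p_null:.0%} (null)")
print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```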
The rest of the post looks at other evidence that takes a variety of complementary approaches.
Weaker Methods → Worse Bias?
Finally, the post “Why is publication bias worse in some disciplines than others?” seeks to get some answers about why we have publication bias, and more specifically, why some fields seem to have it worse than others. This is a subject where the experimental literature discussed above has been really clarifying, I think. I have largely rewritten a discussion of possible reasons why publication bias might vary across fields:
Suppose the root cause of publication bias is that journals want to highlight notable research, in order to be relevant to their readership. There are at least two different ways this can lead to publication bias, depending on what journals view as “notable” research.
First, it might be that journals consider surprising results to be the most notable. After all, if we’re not surprised by research, doesn’t that imply we already sort-of knew the result? And what would be the point of that? But this leads to publication bias if results that challenge the prevailing wisdom are easier to publish than results that support it. In aggregate, the weight of evidence is distorted because we do not observe the bulk of the boring evidence that just supports the conventional wisdom.
This could lead to variation in publication bias across fields if fields vary in the breadth of what is considered surprising. For example, we could imagine one field that is very theoretically contested, with different theories making very different predictions. In that field, perhaps everything is surprising in light of some theory and so most results are publishable. In this field, we might not observe much evidence of publication bias. In another field (social science?), perhaps there is an unstated assumption that most hypotheses are false and so null results are perceived as boring and hence difficult to publish. In this field, we would observe a lot of evidence of publication bias.
A second way that a preference for notable research can lead to bias has to do with a field’s skepticism towards its own empirical methods. Suppose you have a theory that predicts a positive relationship between two variables, but when you test the theory you get a null result. That’s a surprising result, and so under the first theory of publication bias it’s notable and therefore more attractive to publishers. But your willingness to recommend such a paper for publication might depend on whether you think the result is reliable. If you trust the empirical methods - you believe that if you replicated the methods you would get the same result - then you might recommend it for publication. But if you are working in a field where you know empirical work is very hard to do right, then it becomes a lot more likely that the surprising result merely reflects inadequacies in the study.
Fields do seem to differ substantially in the reliability of their empirical methods. Fields differ in how feasible it is to precisely measure the phenomena of interest, rather than relying on noisy proxies. Fields differ in their ability to run tightly controlled repeatable experiments and isolate specific phenomena of interest from a web of complex interactions. Fields differ in the number of observations they have to work with. It might be that fields with imprecise empirical methods are much more hesitant to publish null results, because there is much less signal imputed to null results. Nine times out of ten, a null result just means the data was too noisy to see the signal.
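A quick back-of-the-envelope calculation (my own, with made-up power numbers) shows why a null result carries so little signal in a low-power field: by Bayes’ rule, a null result from a study that would probably have missed a real effect anyway should barely move your beliefs.

```python
def prob_effect_given_null(prior, power, alpha=0.05):
    """P(a real effect exists | the study reports a null result), via Bayes' rule."""
    p_null_if_effect = 1 - power     # a real effect, but the study misses it
    p_null_if_no_effect = 1 - alpha  # no effect, and the study correctly finds nothing
    numerator = p_null_if_effect * prior
    return numerator / (numerator + p_null_if_no_effect * (1 - prior))

PRIOR = 0.5  # illustrative: before seeing the study, the hypothesis is a coin flip

for label, power in [("high-power field", 0.90), ("low-power field", 0.20)]:
    posterior = prob_effect_given_null(PRIOR, power)
    print(f"{label}: P(effect is real | null result) = {posterior:.0%}")

# high-power field: ~10% -- a null result is strong evidence there is no effect
# low-power field:  ~46% -- a null result barely moves beliefs at all
```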
A field with very reliable empirical methods might be more willing to publish null results than a field where empirical work is “more art than science.” A tragic outcome of this theory of publication bias is that it predicts publication bias will be worst in the fields with the weakest empirical methods, exacerbating the already rough state of empirical work in those fields!
These theories of publication bias make somewhat different predictions. Under the surprise theory, null results will be easier to publish when positive results are expected, because that’s when null results are most surprising. Under the skepticism theory, null results will be harder to publish when positive results are expected, because that’s when it’s more likely that the researchers messed up the test. That said, it could be that both theories are true to some degree, and pull in different directions.
The post goes on to discuss how experimental evidence on publication bias, discussed in the preceding section of this email, sheds light on these theories.
To begin, we can turn to a strand of literature that conducts experiments on the ultimate source of publication bias: the people helping to decide what gets published. In these experiments, a set of randomly selected peer reviewers or editors view one version of a description of a (fictitious) research project and another set of peer reviewers view a description of the same project, but with the results changed from a positive result to a null result. We then see if those shown the version with a null result rate the project as less publishable than the one with a positive result.
For example, in Chopra et al. (2022), about 500 economists assess a series of short descriptive vignettes of research projects, some of which describe positive results and some of which describe null results. As discussed here, indeed, positive results were more likely to be recommended for publication than null results. But Chopra and coauthors introduce a second experiment too: within each of these groups, some participants additionally see information on what a survey of experts expects the result to be. That lets us see how our expectations about what a result “should” be inform its perceived suitability for publication. As discussed above, this can help us distinguish between the surprise and skepticism theories of publication bias. Are null results easier or harder to publish when reviewers are told the result is expected to be positive?
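Before getting to their answer, here is a toy sketch of the comparison this design enables - a simple difference-in-differences over acceptance rates, with every number invented for illustration rather than taken from Chopra and coauthors:

```python
# Hypothetical acceptance rates by (result shown, what reviewers were told experts expect).
# Every number is invented, chosen only to illustrate the skepticism-theory pattern;
# these are not Chopra et al.'s estimates.
accept_rate = {
    ("positive result", "experts expect positive"): 0.80,
    ("null result",     "experts expect positive"): 0.45,
    ("positive result", "experts expect null"):     0.70,
    ("null result",     "experts expect null"):     0.55,
}

penalty_if_positive_expected = (accept_rate[("positive result", "experts expect positive")]
                                - accept_rate[("null result", "experts expect positive")])
penalty_if_null_expected = (accept_rate[("positive result", "experts expect null")]
                            - accept_rate[("null result", "experts expect null")])

# Interaction: is the null-result penalty bigger when a positive result was expected?
# The surprise theory predicts a smaller penalty; the skepticism theory a bigger one.
interaction = penalty_if_positive_expected - penalty_if_null_expected
print(f"null-result penalty when positive expected: {penalty_if_positive_expected:.0%}")
print(f"null-result penalty when null expected:     {penalty_if_null_expected:.0%}")
print(f"interaction (skepticism predicts > 0):      {interaction:+.0%}")
```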
Chopra et al. (2022) finds that a null result is extra unlikely to be recommended for publication if it flies in the face of the expert consensus, consistent with the skepticism theory of publication bias. To provide some further insight into what’s going on, Chopra and coauthors run a further experiment with about 100 economics graduate students and early career researchers, again showing them research vignettes that differ across groups in terms of whether the main results are statistically significant. But this time, they directly ask their respondents to rate the statistical precision of the main result on a scale from 1 to 5, where 1 is “very imprecisely estimated” and 5 is “very precisely estimated.” Even though the statistical precision is exactly the same in the positive and null result versions of the vignettes, respondents rated the precision of the null results as significantly lower.
That suggests to me that when economists like myself see an unexpected null result, we tend to think that’s probably due to a weak study design and hence not very informative about the true state of the world.
Another paper in this literature provides additional evidence that reviewers see null results as indicative of something going wrong in the paper’s methods. Emerson et al. (2010) sends 110 reviewers at 2 leading orthopedic journals an article for review: roughly half the reviewers get a version of the paper with a positive result; the others get a basically identical paper, except the main result is negative. As with Chopra et al. (2022), reviewers were more likely to recommend acceptance for the version with a positive result.
As with Chopra and coauthors, this seems to be because reviewers didn’t trust the null results. Despite the fact that each paper had a word-for-word identical “methods” section, reviewers who reviewed a version of the paper with positive results rated the paper’s methodological validity at 8.2 out of 10, while those who saw the version with a null result rated the validity at 7.5 out of 10 (a difference that was statistically significant). Emerson and coauthors also embed another clever check inside their experiment: in each paper they purposefully inserted various errors and then read the review reports to see how many of these errors were detected by reviewers. The set of errors was identical across both versions of the paper, but on average reviewers of the null result version of the paper detected 0.8 of them, while reviewers of the positive result version detected 0.4. That suggests reviewers of the null result papers were reading more skeptically, with an eye towards seeing whether the null result reflected ground truth or merely weak methods.
Another way we can see how publication is affected by our expectation of what a result “should” be is with replication studies. Specifically, is it easier to publish a replication study if it finds results that are the same as the original study or different? In this case the results are a bit more in favor of the surprise theory of publication bias.
Berinsky et al. (2021) test this, also by providing short descriptions of research projects to reviewers, this time in political science departments. Once again, some reviewers see versions of these descriptions with positive results, others with null results. But Berinsky and coauthors also look specifically at the willingness to publish replications of prior work. Just as is the case with original work, they find a bias against replications that get null results. But in this case the effect is slightly ameliorated if the original study had a positive finding. In other words, in the specific case of replications, if people “expect” a positive result, because the original study was positive, they are slightly more likely to be willing to publish a null result than would be the case if the original study was also a null result. However, the opposite does not seem to be the case: a replication that gets a positive result is just as likely to be published, whether the original study was positive or a null result.
The post then discusses a 2013 study by Doucouliagos and Stanley, which examines the extent of publication bias across different subfields in economics, and - also consistent with the skepticism theory - argues bias is worse in fields where theory points in one particular direction than in fields where theory permits a wider array of results.
Finally, the post has a new conclusion:
To sum up, there seems to be significant variation in the extent of publication bias across different scientific fields, with the social sciences and some parts of biology seeming to suffer from this problem more acutely than other areas. There’s not a ton of work in this area (email me if you know of more!), but what is available seems to suggest one reason for this may be that reviewers tend to be more skeptical of null results in fields with less reliable methods. In some experiments on reviewers, we see evidence that reviewers read papers with null results more skeptically: they spot more errors, they rate the methodology as weaker, and they judge the results to be less precisely estimated. And experiments and observational data from economics also suggest reviewers tend to be extra skeptical of results that go against the prevailing consensus. In a world where evidence that surprises is most interesting, we would predict that kind of paper to be easier to publish; but that logic only holds if you think the null result reflects a genuine fact about the world.
And in fact, we know from other work that social scientists are right to exercise skepticism towards their own empirical methods. In the many analysts literature (discussed in some detail here), multiple teams of social scientists are given the same data and asked to answer the same question, yet quite frequently arrive at very different conclusions!
Unfortunately, if this mechanism is right, it suggests fields with lower levels of empirical reliability are more likely to additionally face the burden of publication bias. It is precisely in the fields where lots of empirical work is needed to average out a semi-reliable answer that we see the most bias in what empirical work gets published!
That said, the study of publication bias is itself a field where empirical work might be shaky. And that should give us pause about the strength of this conclusion. Indeed - who knows if others have looked for the same relationships elsewhere and gotten different results, but have been unable to publish them? (Joking! Mostly.)
Until Next Time
Thanks for reading! As always, if you want to chat about this post or innovation in general, let’s grab a virtual coffee. Send me an email at mattclancy at hey dot com and we’ll put something in the calendar.
New Things Under the Sun is produced in partnership with the Institute for Progress, a Washington, DC-based think tank. You can learn more about their work by visiting their website.