Why is publication bias worse in some fields than in others?

Hard sciences don't seem to have so many problems; why?


Publication bias is real. In the social sciences, more than one study finds that statistically significant results seem to be about three times more likely to be published than insignificant ones. Some estimates from medicine aren’t so bad, but a persistent bias in favor of positive results remains. What about science more generally? 

To answer that question, you need a way to measure bias across fields that might be very different in terms of their methodology. One way is to look for a correlation between the size of the standard error and the estimated size of an effect. In the absence of publication bias, there shouldn’t be a relationship between these two things. To see why, suppose your study has a lot of data points. In that case, you should be able to get a very precise estimate that is close to the actual population average of the thing you’re studying. On the other hand, if your study has very few data points, you’ll get a very imprecise estimate, including a high probability of getting something very much bigger than the actual population mean and a high probability of getting something very much smaller. But over lots of studies, if there’s no publication bias, you’ll get some abnormally high estimates, some abnormally low ones, but in the end they’ll cancel each other out. If, however, small estimates are systematically excluded from publication, then you’ll end up with a robust correlation between the size of your standard errors and the size of your effects. The extent of this correlation is a way to measure the extent of publication bias in a given literature.
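
To make that logic concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration (a true effect of 0.2, standard errors spread evenly between 0.05 and 1.0, and a crude selection rule where only positive, statistically significant estimates get published); it is not the procedure used in any of the papers discussed here.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_studies(n_studies, true_effect, seed):
    """Draw (standard_error, estimate) pairs around one true effect."""
    rng = random.Random(seed)
    studies = []
    for _ in range(n_studies):
        se = rng.uniform(0.05, 1.0)            # assumed range of precision
        estimate = rng.gauss(true_effect, se)  # noisier when se is large
        studies.append((se, estimate))
    return studies


def se_effect_correlation(studies):
    """Correlation between standard errors and estimated effects."""
    ses = [se for se, _ in studies]
    estimates = [est for _, est in studies]
    return pearson(ses, estimates)


all_studies = simulate_studies(20000, true_effect=0.2, seed=0)
# Assumed selection rule: only positive, significant estimates get published.
published = [(se, est) for se, est in all_studies if est > 1.96 * se]

corr_all = se_effect_correlation(all_studies)
corr_published = se_effect_correlation(published)
print(f"no selection:   {corr_all:+.2f}")
print(f"with selection: {corr_published:+.2f}")
```

Without selection, the correlation should hover around zero; once the small and insignificant estimates are dropped, imprecise studies only show up when they happen to find big effects, and the correlation turns strongly positive.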

(A downside of this approach is that it will only work in disciplines where this framework makes sense; where research is primarily about measuring effect sizes with noisy data. But enough disciplines do this that it’s a start.)

Fanelli, Costas, and Ioannidis (2017) obtain 1,910 meta-analyses drawn from all areas of science, and pull from these 33,355 datapoints from original underlying studies. For each meta-analysis, they compute the correlation between the standard error and the size of the estimated effect; they then do a weighted average across the different meta-analyses to generate a sort of average over the meta-analyses in fields they cover. In general, the more positive the estimate, the stronger the correlation between standard errors and effect size, implying stronger publication bias. Results below:

Note that the social sciences (up at the top) have pretty high measures of bias, estimated with a lot of precision, while many (but not all) of the biological fields also have fairly high bias. But also note the bottom two rows, which seem to exhibit no bias: computer science, chemistry, engineering, geosciences, and mathematics.

As noted already though, this method of measuring bias might not be appropriate for all fields, since it is rigidly defined in terms of sampling from noisy data. Fanelli (2010) uses a simpler, but more flexible measure of publication bias. Fanelli analyses a random sample of 2,434 papers from all disciplines that include some variation of the phrase “test the hypothesis.” For each paper, Fanelli determined whether the authors argued they had found positive evidence for their hypothesis or not (that is, they either found no evidence in favor of the hypothesis, or actually found contrary results). As a rough and ready test of publication bias, he then looked at the share of hypotheses in each field for which positive support was found. He finds that between 75 and 90% of hypotheses mentioned in published papers are supported, across different disciplines. But there are some significant differences across disciplines.

Fanelli cuts papers into six categories: three domains (physical sciences, biological sciences, and social sciences), each further sub-divided into pure science and applied science. There are no major differences among applied papers across the three domains - bias seems to be quite high in every case. But in the pure science fields, physical sciences tended to find support for less than 80% of their hypotheses while social sciences tended to find support for nearly 90% of the hypotheses investigated. Biology is in the middle.

Taken together, these two studies suggest the social sciences have bigger problems with publication bias than do the biological sciences, which tend to have more problems than the hard sciences. Why?

What Drives Bias Across Fields?

Let me run through three possible explanations, before looking at one of the few studies that can provide evidence. Note that there are probably more possible explanations, and they’re not mutually exclusive. As far as I can tell, there is very little work on this question (please email me if you know of other relevant work - I would love to hear about it). 

First, variation in publication bias could be related to the nature of publication in different fields. If it’s easier to draft and push an article through peer review in some fields than in others then some fields may end up getting more results out there (even if they’re not out there in a top-ranked journal). In the social sciences, we have some evidence that the biggest difference between null results and strong results is that most null results are never even written up and submitted for publication. Maybe that’s because it’s too much work for too little reward. In a field where writing up and publishing results from an experiment somewhere is easy, it might be worth doing, if only to add another line to the CV.

Second, variation in publication bias could be related to the nature of data in different fields. It may be easier in some fields to tightly control for noise in data, or to obtain many more observations, than in others. In economics, a big sample might be hundreds of thousands of observations. In physics, the Large Hadron Collider generates 30 petabytes of data per year. In fields where clean data is plentiful, it might not be the case that when you run an experiment sometimes you find support for a hypothesis and sometimes you don’t. You always find the same thing, or at least always come to the same conclusion about statistical significance. In that case, you won’t find much of a relationship between the size of standard errors and effect sizes: within the range of observed standard errors, everything is either significant or not.

Lastly, it may be that fields differ in their criteria for deciding what is worth publishing. The root cause of publication bias is that journals want to highlight notable research, in order to be relevant to their readership. But what counts as notable research? 

Suppose that empirical research is most notable when it provides support for specific theories. In that case, a question in which multiple competing theories make different predictions might exhibit less publication bias. If there is a theory that predicts a null result, and another theory that predicts a statistically significant result, and we don’t have good evidence on which theory is correct, then either result is notable and helps us understand how the world works. Consequently, a journal should be more willing to publish either result.

It might be the case, for example, that the hard sciences have sufficiently established theories such that null results are quite surprising when they are found, and hence easier to publish. In the social sciences, in contrast, we’re just not there yet. Instead, we have an unstated assumption that most hypotheses are false. When we fail to find evidence for one of these hypotheses, it’s not surprising or notable, and so harder to publish.

A bit of evidence from economics

To shed a little light on these questions, let’s look at one more study of differential bias. We’ve seen some evidence that bias varies across major disciplines. But we also have some evidence that bias varies within a particular discipline. 

Doucouliagos and Stanley (2013) look at 87 different meta-analyses from empirical economics and measure the extent of publication bias in each of the literatures covered, using the approach described above, where standard errors are compared with effect sizes. In the figure below, they classify anything smaller than 1 as exhibiting little to modest selection bias, anything between 1 and 2 as exhibiting substantial selection bias, and anything over 2 as exhibiting severe selection bias. They find there are plenty of results in each category.

What drives different levels of bias in economics?

I think it is less likely that variation in publication bias within economics is driven by different publication standards within the different meta-analyses covered. In many cases, these literatures are publishing in the exact same journals, but on different questions.

Doucouliagos and Stanley provide a bit more evidence that publication bias might be related to data though. My subjective read on the quality of data across economic fields is that macroeconomics has the toughest time getting lots of clean data. And Doucouliagos and Stanley do find publication bias seems to be more extreme in macroeconomics than in other fields.

But Doucouliagos and Stanley (2013) is really set up to test the third explanation: that differences in the range of values permitted by theory explain a big chunk of the variation in publication bias across fields. How are you going to measure that though?

Doucouliagos and Stanley take a few different approaches. First, they just use their own judgement to code up each meta-analysis as pertaining to a question where theory predicts empirical results can go either way (i.e., positive or negative). Second, they use their own reading of the meta-analyses or draw on surveys (where they exist) to assess whether there is “considerable” debate around this area of research. Whereas they claim their first measure is non-controversial and that most economists would agree with how they code things, they acknowledge the second criterion is a subjective one.

By both of these measures, they find that when theory admits a wider array of results, there is less evidence of publication bias. And the effects are pretty large. A field whose theory they code as admitting positive and negative results has a lot less bias than one that doesn’t - the difference is large enough to drop from “severe” selection bias to “little or no” selection bias, for example. 

But maybe we’re worried at this point that we have the direction of causality exactly backwards. Maybe it’s not that wider theory permits a wider array of results to be published. Maybe it’s that a wider array of published results leads theorists to come up with wider theories to accommodate this evidence. Doucouliagos and Stanley have two responses here. First, there is a difference between the breadth of results published and publication bias and they try to control for the former to really isolate the latter. After all, it is possible for a field to have both selection bias and a wide breadth of results published. Their methodology can separately identify both, at least in theory, and so they can check if there is more selection bias when there is more accommodating theory, even when two fields have an otherwise similarly large array of results to explain.

But in practice, I wonder if controlling for this is hard to do. So I am a fan of the second approach they take to address this issue. There are some theories in economics where there just really isn’t much wiggle room about which way the results are supposed to go. One of them is studies estimating demand. Except for some exotic cases, economists expect that if you hold all else constant, when prices go up, demand should go down, and vice-versa. We even permit ourselves to call this the “law” of demand. Economists almost uniformly will believe that apparent violations of this can be explained by a failure to control for confounding factors. They will strongly resist the temptation to derive new theories that predict demand and price go up or down together. 

Moreover, it isn’t controversial to identify which meta-analyses are about estimating demand and which are not. So for their final measure, Doucouliagos and Stanley look at estimates of bias in studies that estimate demand and those that don’t. And they find studies that estimate demand exhibit much more selection bias than those that don’t (even more than in their measures about extent of debate or what theory permits). In other words, when economists get results that say there is no relationship between price and demand, or that demand goes up when prices go up, these results appear less likely to be published.

So, at least in this context, if your theory admits a wider array of “notable” findings, then you seem to have less trouble getting findings published. Of course, this is just one study, so I want to be cautious leaning too heavily on it. Indeed - who knows if others have looked for the same relationships elsewhere and gotten different results, but have been unable to publish them? (Joking! Mostly.)



Publication Bias is a Real Thing

Estimating the size of the file drawer problem


Publication bias is when academic journals make publication of a paper contingent on the results obtained. Typically, though not always, it’s assumed that means results that indicate a statistically significant correlation are publishable, and those that don’t are not (where statistically significant typically means a result that would be expected to occur by random chance less than 5% of the time). A rationale for this kind of preference is that the audience of journals is looking for new information that leads them to revise their beliefs. If the default position is that most novel hypotheses are false, then a study that fails to find evidence in favor of a novel hypothesis isn’t going to lead anyone to revise their beliefs. Readers could have skipped it without it impacting their lives, and journals would do better by highlighting other findings. (Frankel and Kasy 2021 develop this idea more fully.)

But that kind of policy can generate systematic biases. In another newsletter, I looked at a number of studies that show when you give different teams of researchers the same narrow research question and the same dataset, it is quite common to get very different results. For example, in Breznau et al. (2021), 73 different teams of researchers are asked to answer the question “does immigration lower public support for social policies?” Despite each team being given the same initial dataset, the results generated by these teams spanned the whole range of possible conclusions, as indicated in the figure below: in yellow, more immigration led to less support for social policies, in blue the opposite, and in gray there was no statistically significant relationship between the two.

Seeing all the results in one figure like this tells us, more or less, that there is no consistent relationship between immigration and public support for social policies (at least in this dataset). But what if this had not been a paper reporting on the coordinated efforts of 73 research teams, committed to publishing whatever they found? What if, instead, this paper was a literature review reporting on the published research of 73 different teams that tackled this question independently? And what if these teams could only get their results published if they found the “surprising” result that more immigration leads to more public support for social policies? In that case, a review of the literature would generate a figure like the following:

In this case, the academic enterprise would be misleading you. More generally, if we want academia to develop knowledge of how the world works that can then be spun into technological and policy applications, we need it to give us an accurate picture of the world. 

On the other hand, maybe it’s not that bad. Maybe researchers who find different results can get their work published, though possibly in lower-ranked journals. If that’s the case, then a thorough review of the literature could still recover the full distribution of results that we originally highlighted. That’s not so hard in the era of Google Scholar and Sci-Hub.

So what’s the evidence? How much of an issue is publication bias?

Ideal Studies

Ideally, what we would like to do is follow a large number of roughly similar research projects from inception to completion, and observe whether the probability of publication depends on the results of the research. In fact, there are several research settings where something quite close to this is possible. This is because many kinds of research require researchers to submit proposals or pre-register studies before they can begin investigation. A large literature attempts to track the subsequent history of such projects to see what happens to them.

Dwan et al. (2008) reports on the results of 8 different healthcare meta-analyses of this type, published between 1991 and 2008. The studies variously draw on the pool of proposals submitted to different institutional review boards, or drug trial registries, focusing on randomized control trials where a control group is given some kind of placebo and a treatment group some kind of treatment. The studies then perform literature searches and correspond with the authors of these studies to see if the proposals or pre-registered trials end up in the publication record. 

In every case, results are more likely to be published if they are positive than if the results are negative or inconclusive. In some cases, the biases are small. One study (Dickersin 1993) of trials approved for funding by the NIH found 98% of positive trial outcomes were published, but only 85% of negative results were (i.e., positive results were 15% more likely to be published - though, as a commenter below points out, you could just as easily say null results are not published 15% of the time, and that is 7.5 times as much as positive results are not published). In other cases, they’re quite large. In Decullier (2005)’s study of protocols submitted to the French Research Ethics Committees, 69% of positive results were published, but only 28% of negative or null results were (i.e., positive results were about 2.5 times more likely to be published). 
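
The two ways of framing the Dickersin numbers, and the Decullier ratio, are easy to check. This is just arithmetic on the percentages quoted above; the function name is mine.

```python
def relative_rate(rate_a, rate_b):
    """How many times larger rate_a is than rate_b."""
    return rate_a / rate_b


# Dickersin (1993): share of positive vs. negative results published
dickersin_positive, dickersin_negative = 0.98, 0.85

# Framing 1: how much more likely is a positive result to be published?
published_ratio = relative_rate(dickersin_positive, dickersin_negative)

# Framing 2: how much more likely is a negative result to go unpublished?
unpublished_ratio = relative_rate(1 - dickersin_negative, 1 - dickersin_positive)

# Decullier (2005): 69% of positive vs. 28% of negative or null results
decullier_ratio = relative_rate(0.69, 0.28)

print(f"Dickersin, published:   {published_ratio:.2f}x")
print(f"Dickersin, unpublished: {unpublished_ratio:.1f}x")
print(f"Decullier, published:   {decullier_ratio:.2f}x")
```

Same data, very different-sounding headline, depending on whether you compare publication rates or non-publication rates.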

These kinds of studies are in principle possible for any kind of research project that leaves a paper trail, but they are most common in medicine (which has these institutional review boards and drug trial registries). However, Franco, Malhotra, and Simonovits (2014) are able to perform a similar exercise for 228 studies in the social sciences that relied on the NSF’s Time-sharing Experiments in the Social Sciences program. Under this program, researchers can apply to have questions added to a nationally representative survey of US adults. Typically, the questions are manipulated so that different populations receive different versions of the question (i.e., different question framing or visual stimulus), allowing the researchers to conduct survey-based experiments. This paper approaches the ideal in several ways:

  • The proposals are vetted for quality via a peer review process, ensuring all proposed research questions clear some minimum threshold

  • The surveys are administered in the same way by the same firm in most cases, ensuring some degree of similarity of methodology across studies

  • All the studies are vetted to ensure they have sufficient statistical power to detect effects of interest

That is, in this setting, what really differentiates research projects is the results they find, rather than the methods or sample size.

Franco, Malhotra, and Simonovits start with the set of all approved proposals from 2002-2012 and then search through the literature to find all published papers or working papers based on the results of this NSF program survey. For more than 100 results, they could find no such paper, and so they emailed the authors to ask what happened and also to summarize the results of the experiment. 

The headline result is that 92 of 228 studies got strong results (i.e., statistically significant, consistent evidence in favor of the hypothesis). Of these, 62% were published. In contrast, 49 studies got null results (i.e., no statistically significant difference between the treatment and control groups). Of these, just 22% were published. Another 85 studies got mixed results, and half of these were published. In other words, studies that got strong results were 2.8 times as likely to be published as those that got null results.

The figure below plots the results from both studies, with the share of studies with positive results that were published on the vertical axis and the share of studies with negative results that were published on the horizontal axis. The blue dots correspond to 8 different healthcare meta-analyses from Dwan et al. (2008), and the orange ones to four different categories of social science research in Franco, Malhotra, and Simonovits (2014). For there to be no publication bias, we would expect dots to lie close to the 45-degree line, on either side. Instead, we see they all lie above it, indicating positive results are more likely to be published than negative ones. The worst offenders are in the social sciences. Of the 60 psychology experiments Franco, Malhotra, and Simonovits study, 76% of positive results resulted in a publication, compared to 14% of null results. In the 36 sociology experiments, 54% of positive results got a publication and none of the null results did.

Blue dots are from biomedical meta-analyses in Dwan et al. (2008); Orange are different social science groupings from Franco, Malhotra, and Simonovits (2014)

Franco, Malhotra, and Simonovits (2014) has an additional interesting finding. As we have seen, most null results do not result in a publication for this type of social science. But in fact, of the 49 studies with null results, 31 were never submitted to any journal, and indeed were never even written up. In contrast, just 4 of the 92 strong results were not written up. In fact, if you restrict attention only to results that get written up, 11 of 18 null results that are written up get published - 61% - and 57 of 88 strong results that get written up get published - 65%. Almost the same, once you make the decision to write them up.
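
Here is a small sketch of that comparison, using only counts quoted in this post. I am taking 92 as the number of strong results (the headline figure) and backing out the number written up by subtracting the ones that were never written up; the variable names are mine.

```python
# Counts quoted in the text for Franco, Malhotra, and Simonovits (2014)
strong_total, strong_unwritten, strong_published = 92, 4, 57
null_total, null_unwritten, null_published = 49, 31, 11

# Unconditional publication rates: published / all studies of that type
strong_rate = strong_published / strong_total
null_rate = null_published / null_total

# Rates conditional on the results being written up at all
strong_written = strong_total - strong_unwritten
null_written = null_total - null_unwritten
strong_rate_if_written = strong_published / strong_written
null_rate_if_written = null_published / null_written

print(f"all studies:     strong {strong_rate:.0%}, null {null_rate:.0%}, "
      f"ratio {strong_rate / null_rate:.1f}")
print(f"written up only: strong {strong_rate_if_written:.0%}, "
      f"null {null_rate_if_written:.0%}, "
      f"ratio {strong_rate_if_written / null_rate_if_written:.1f}")
```

The gap between the two ratios is the whole story: nearly all of the difference in publication rates comes from the decision to write up, not from what happens afterwards.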

So why don’t null results get written up and submitted for publication? Franco, Malhotra, and Simonovits got detailed explanations from 26 researchers. Most of them (15) said they felt the project had no prospect for publication given the null results. But in other cases, the results just took a backseat to other priorities, such as other projects. And in two cases, the authors eventually did manage to publish on the topic… by getting positive results from a smaller convenience sample(!).

Why is this interesting? It shows us that scientists actively change their behavior in response to their beliefs about publication bias. In some cases, they just abandon projects that they feel face a hopeless path to publication. In other cases, they might change their research practices to try and get a result they think will have a better shot at publication, for example by re-doing the study with a new sample. The latter is related to p-hacking, where researchers bend their methods to generate publishable results. More on that at another time.

Less than Ideal Studies

These kinds of ideal studies are great when they are an option, but in most cases we won’t be able to observe the complete set of published and unpublished (and even unwritten) research projects. However, a next-best option is when we have good information about what the distribution of results should look like, in the absence of publication bias. We can then compare the actual distribution of published results to what we know it should look like to get a good estimate of how publication bias is distorting things.

Among other things, Andrews and Kasy (2019) show how to do this using data on replications. To explain the method, let’s think of a simple model of research. When we do empirical research, in a lot of cases what’s actually going on is we get our data, clean it up, make various decisions about analysis, and end up with an estimate of some effect size, as well as an estimate of the precision of our estimate - the standard error. Andrews and Kasy ask us to ignore all that detail and just think of research as like drawing a random “result” (an estimated effect size and standard error) from an unknown distribution of possible results. The idea they’re trying to get at is just that if we did the study again and again, we might get different results each time, but if we did it enough we would see they cluster around some kind of typical result, which we could estimate with the average. In the figure below, results are clustered around an average effect size of -0.5 and an average standard error of 0.4.

If we weren’t worried about publication bias, then we could just look at all these results and get a pretty good estimate of the “truth.” But because of publication bias we’re worried that we’re not seeing the whole picture. Maybe there is no publication bias at all and the diagram above gives us a good estimate. Or maybe results that are not statistically significant are only published 10% of the time. That would look something like this, with the light gray dots results that aren’t published in this example.

In this case, the unpublished results tend to be closer to zero for any given level of standard error. By including them, the average effect size is cut down by 40%.
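
A rough simulation of this thought experiment is below. It is only a sketch under assumptions I picked for illustration (a true effect of -0.5, standard errors spread evenly between 0.1 and 0.7, significance at the usual 1.96 cutoff, and insignificant results published 10% of the time), so the size of the distortion it produces will not match the figure exactly.

```python
import random


def simulate(n_studies=20000, true_effect=-0.5,
             publish_prob_insignificant=0.1, seed=1):
    """Return (all estimates, published estimates) under selective publication."""
    rng = random.Random(seed)
    all_estimates, published_estimates = [], []
    for _ in range(n_studies):
        se = rng.uniform(0.1, 0.7)             # assumed spread of precision
        estimate = rng.gauss(true_effect, se)
        all_estimates.append(estimate)
        significant = abs(estimate) > 1.96 * se
        # Significant results are always published; others only sometimes.
        if significant or rng.random() < publish_prob_insignificant:
            published_estimates.append(estimate)
    return all_estimates, published_estimates


def mean(values):
    return sum(values) / len(values)


all_estimates, published_estimates = simulate()
mean_all = mean(all_estimates)
mean_published = mean(published_estimates)
print(f"average of all results:       {mean_all:.2f}")
print(f"average of published results: {mean_published:.2f}")
```

The published average overstates the size of the effect, because the results that get dropped are disproportionately the ones close to zero.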

What Andrews and Kasy show is that if you have a sample of results that isn’t polluted by publication bias, you can compare that to the actual set of published results to infer the extent of publication bias. So where do you get a sample of results that isn’t polluted by publication bias? One option is from systematic replications.

Two such projects are Camerer et al. (2016)’s replication of 18 laboratory experiments published in two top economics journals between 2011 and 2014, and Open Science Collaboration (2015)’s replication of 100 experiments published in three top psychology journals in 2008. In both cases, the results of the replication efforts are bundled together and published as one big article. We have confidence in each case that all replication results get published; there is no selection where successful or unsuccessful replications are excluded from publication.

We then compare the distribution of effect sizes and standard errors, as published in those journals, to an “unbiased” distribution that is derived from the replication projects. For example, if half of the replicated results are not statistically significant, and we think that’s an unbiased estimate, then we should expect half of the published results to be insignificant too in the absence of publication bias. If instead we find that just a quarter of the published results are not statistically significant, then published significant results outnumber insignificant ones three to one, even though the two should be equally common. That tells us significant results are three times more likely to be published. Using a more sophisticated version of this general idea, Andrews and Kasy estimate that null results (statistically insignificant at the standard 5% level) are just 2-3% as likely to be published in these journals as statistically significant results!
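
The simple version of this calculation can be written down in a few lines. To be clear, this is my own back-of-the-envelope formula for the toy example in the previous paragraph, not Andrews and Kasy's estimator: it compares the odds of being insignificant among published results to the same odds in the unbiased (replication) sample.

```python
def relative_publication_probability(share_insig_unbiased, share_insig_published):
    """Publication probability of an insignificant result, relative to a
    significant one.

    If insignificant results are published with relative probability p, and a
    share q of all results is insignificant, then the share s of published
    results that is insignificant satisfies s = q*p / (q*p + (1 - q)).
    Solving for p gives the ratio of the two odds below.
    """
    odds_published = share_insig_published / (1 - share_insig_published)
    odds_unbiased = share_insig_unbiased / (1 - share_insig_unbiased)
    return odds_published / odds_unbiased


# Toy example from the text: half of replications are insignificant,
# but only a quarter of published results are.
p = relative_publication_probability(0.5, 0.25)
print(f"insignificant results are {p:.2f} times as likely to be published")
print(f"i.e., significant results are {1 / p:.1f} times more likely")
```

Plugging in equal shares (say 0.5 and 0.5) returns 1, which is the no-bias benchmark.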

Now, note the important words “in these journals” there. We’re not saying null results are 2-3% as likely to be published as significant results; only that they are 2-3% as likely to be published in these top disciplinary journals. As we’ve seen, the studies from the previous section indicate publication bias exists but is usually not on the order of magnitude seen here. We would hope non-significant results rejected for publication in these top journals would still find a home somewhere else (but we don’t know that for sure).

Even Less Ideal Studies

This technique works when you have systematic replication data to generate an unbiased estimate of the “true” underlying probability of getting different effect sizes and standard errors. But replications remain rare. Fortunately, there is a large literature on ways to estimate the presence of publication bias even without any data on the distribution of unpublished results. I won’t cover this whole literature (Christensen and Miguel 2018 is a good recent overview). Here I’ll talk about Andrews and Kasy’s approach, and I plan to cover some of the others in a future newsletter.

Let’s go back to this example plot from the last section. Suppose we have this set of published results. How can we know if this is just what the data looks like, or if this is skewed by publication bias?

As I noted before, statistically significant results are basically ones where the estimated effect size is “large” relative to imprecision of that estimate, here given by the standard error. For any point on this diagram, we can precisely calculate which results will be statistically significant and which won’t be. It looks sort of like this, with results in the red area not statistically significant.
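
In code, the rule is a one-liner. This sketch assumes the usual two-sided test at the 5% level with a normal approximation, which is where the 1.96 comes from; the example numbers are made up, apart from the -0.5 and 0.4 averages mentioned earlier.

```python
def is_significant(effect, standard_error, critical_value=1.96):
    """True if the estimate is large relative to its standard error.

    Assumes a two-sided test at the 5% level under a normal approximation.
    """
    return abs(effect) > critical_value * standard_error


# A few illustrative (effect size, standard error) pairs
examples = [(-0.5, 0.4), (-0.5, 0.2), (2.0, 0.8), (0.3, 0.8)]
for effect, se in examples:
    label = "significant" if is_significant(effect, se) else "not significant"
    print(f"effect {effect:+.1f}, standard error {se:.1f}: {label}")
```

For any standard error, the insignificant zone is the band of effect sizes between -1.96 and +1.96 standard errors, which is why that zone widens as the standard error grows.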

Notice the red area of statistical insignificance is eerily unpopulated; an indication that we’re missing observations (these are called funnel plots, in the literature).

What Andrews and Kasy propose is something like this: for a given standard error, how many results do you see in the zone of insignificant results versus outside it? For instance, look at the regions highlighted in blue in the figure below.

If we assume that without publication bias there is no systematic relationship between standard errors and effect sizes (i.e., without publication bias, a big standard error is equally likely for a big effect as a small one), then we should expect to see a similar spread of effect sizes across each rectangle. In the bottom one, they range pretty evenly from -1 to 3, but in the top one they are all clustered around 2. Also, in the top area, most of the area lies in the “red zone” of statistical insignificance and so we should expect to observe a lot of statistically insignificant results there - the majority of observations, in fact. But instead, just 1 of the 6 observations is statistically insignificant. This kind of information about the difference between what we should expect to see versus what we actually see can also be used to infer the strength of publication bias. In this case, instead of using replications to get at the “true” distribution, it’s like we’re using the distribution of effect sizes for precise results to help tell us about what kinds of effect sizes we should see for less precise results. After all, they should be basically the same, since usually an imprecise result is just a precise result with fewer datapoints. There isn’t anything fundamentally different about them.

Andrews and Kasy apply this methodology to Wolfson and Belman (2015), a meta-analysis of studies on the impact of the minimum wage on employment. They estimate results that find a negative impact of the minimum wage on employment (at conventional levels of statistical significance) are a bit more than 3 times as likely to be published as papers that don’t. Unlike the replication based studies, this estimate is a closer match to the results we found for the ideal data studies, where social science papers that found strong positive results were around 2.8 times as likely to be published as those that didn’t.

So, to sum up: yes, publication bias is a real thing. You can see its statistical fingerprints in published data. And when you’re lucky enough to have much better data, such as on the distribution of results that would occur without publication bias, or when you can actually see what happens to unpublished research, you find the same thing: positive results have an easier time getting published.

So what do you do about it? Well, you can find ways to reduce the extent of bias, for example by getting journals to precommit to publishing papers based on the methods and the significance of the question asked, rather than the results. Or you can create avenues for the publication of non-significant findings, either in journals or as draft working papers. Recall however that Franco, Malhotra, and Simonovits found most null results weren’t even written up at all - it may be that if non-significant work can be published in some outlets but will still be ignored by other researchers, researchers might not bother to write them up and instead choose to allocate their time to pursuing research that might attract attention.

Alternatively, if you can identify publication bias, you can correct for it with statistical tools. Andrews and Kasy, as well as others, have developed ways to infer the “true” estimate of an effect by estimating the likely value of unpublished research. Indeed, Andrews and Kasy make such a tool freely available on the web.

But we can also look more deeply at the underlying causes of publication bias. It turns out that the extent of publication bias varies widely across disciplines and sub-disciplines. That’s a bit surprising. Why is that the case? The plan for next week is to dig into precisely that question.


An example of high returns to publicly funded R&D

Quasi-experimental evidence from R&D grants to small businesses


In last week’s newsletter, we looked at a thought experiment by Jones and Summers that pretty convincingly argued the average return on a dollar of R&D was really high. That would seem to suggest we should be spending a lot more on R&D.

But the devil is in the details. How, exactly, should you increase your R&D spending? Today, let’s look at one kind of program that seems to work and would be an excellent candidate for more funds: the US’ Small Business Innovation Research (SBIR) program and the European Union’s SME instrument (which was modeled on the SBIR). 

Grants for Small Business Innovation

The SBIR and SME instrument programs are competitive grant programs in which small businesses submit proposals for R&D grants from the public sector. Each program consists of two phases, where the first phase involves significantly less money than the second. For the SBIR program, a phase 1 award is typically up to $150,000 and a phase 2 award is roughly $1mn; for the SME instrument, phase 1 is just €50,000 and phase 2 is €0.5-2.5mn. In the US, firms apply for phase 1 first, and then phase 2, whereas in the EU firms can apply to either straightaway. Broadly speaking, the money is intended to be used for R&D type projects.

They’re pretty competitive. In the US, an application to the SBIR program run by the Department of Energy will typically take a full-time employee 1-2 months to complete and has about a 15% chance of winning a phase 1 award; conditional on winning a phase 1 award, firms have about a 50% chance of winning a phase 2 award (for an overall chance of about 8%). In the EU, the probability of winning in phase 1 is about 8%, and in phase 2, just 5%.

So, both programs involve the government attempting to pick winning ideas, and then giving the winners money to fund R&D. How well do they work? Does the money generate innovations? Does it get the good return on investment that Jones and Summers’ thought experiment implies should be possible?

Evaluating the Impact of Grants

Two recent papers look at this question using the same method. Howell (2017) looks at the Department of Energy’s SBIR program ($884mn disbursed over 30 years), while Santoleri et al. (2020) looks at the SME instrument program (€1.3bn disbursed over 2014-2017). Each paper has access to details on all applicants to the program, not merely the winners. That means they can follow the trajectory of businesses that apply and get a grant as well as those that apply but fail to get an R&D grant.

But to assess the impact of money on innovation, they can’t just compare the winners to the losers, because the government isn’t randomly allocating the money: it’s actively trying to sniff out the best ideas. That means the winning applicants would probably have done better than the losers, even if they hadn’t received any R&D funds, since someone thought they had a more promising idea. But the way these programs are administered has a few quirks that allow researchers to estimate the causal impact of getting money.

During the period studied, each program held lots of smaller competitions devoted to a specific sector or technology. For example, the Department of Energy might solicit proposals for projects related to Solar Powered Water Desalination. Within each of these competitions, proposals are scored by technical experts and then ranked from best to worst. After this ranking is made, overall budgets are drawn up for the competitions (without reference to the applications received), and the best projects get funded until the money runs out. For example, the Department of Energy might receive 11 proposals for its solar powered water desalination topic, but (without looking at the quality of the proposals) decide it will only be able to fund 3. The top three each get their $150,000, and the fourth gets nothing.

The important thing is that, for applications right around the cut-off, although there is a big change in the amount of money received, there shouldn’t be a big change in the quality of the proposals (in our example, the difference between third and fourth place shouldn’t be abnormally large). That is, although we don’t have perfect randomisation, we have something pretty close. Proposals on either side of the cutoff for funding differ only a bit in the quality of their proposals, but experience a huge difference in their ability to execute on those proposals because some of them get money and some don’t. It’s a pretty close estimate of the causal impact of getting the money. 
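Here is a rough sketch of that comparison on simulated data. It is not the specification used in either paper. The quality trend of 0.2 per rank, the funding effect of 0.3, and the noise level are all invented for illustration, and the funding cutoff is placed at position 0.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def simulate_applicants(n_competitions=5000, true_effect=0.3, quality_slope=0.2, seed=1):
    """Simulated (position, funded, outcome) rows; the funding cutoff sits at position 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_competitions):
        for position in (-2.5, -1.5, -0.5, 0.5, 1.5, 2.5):
            funded = position > 0
            # Outcome rises with proposal quality (position) and jumps if funded.
            noise = rng.gauss(0, 1)
            outcome = quality_slope * position + (true_effect if funded else 0.0) + noise
            rows.append((position, funded, outcome))
    return rows


rows = simulate_applicants()
funded = [(x, y) for x, f, y in rows if f]
unfunded = [(x, y) for x, f, y in rows if not f]

# Naive comparison: average winner minus average loser (mixes quality with funding).
naive_gap = sum(y for _, y in funded) / len(funded) - sum(y for _, y in unfunded) / len(unfunded)

# Discontinuity estimate: fit a trend on each side and compare the fits at the cutoff.
intercept_funded, _ = fit_line(*zip(*funded))
intercept_unfunded, _ = fit_line(*zip(*unfunded))
jump_at_cutoff = intercept_funded - intercept_unfunded

print(f"naive gap: {naive_gap:.2f}, jump at cutoff: {jump_at_cutoff:.2f}, true effect: 0.30")
```

The naive gap is inflated because winners were better proposals to begin with. The jump at the cutoff strips out the quality trend and should land close to the effect that was built into the simulation.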

What’s the Impact

Each paper looks at a couple measures of impact. A natural place to start when evaluating the impact of small business innovation is patents. In each case, patents are weighted by how many citations they end up receiving (while citations might be a problematic measure of knowledge flows, they seem to be quite good as a way of measuring the value of patents: better patents seem to get more citations). In the text, I’ll just call them patents, but you should think of them as “patents adjusted for quality.” The papers produce some nice figures (Howell left, Santoleri et al. right).

These figures nicely illustrate the way the impact of cash is assessed in these papers. Especially in the left figure, you can see that, as we worried, proposals that are ranked more highly by the SBIR program do tend to get more patents: the SBIR program does have the ability to judge which projects are more likely to succeed. Even looking at projects that don’t get funding (the ones to the left of the red line), Howell’s measure of patenting rises by about 0.2 log points at each rank, as we move from -3 to -2 to -1. If that trend were to continue, we would expect the +1 ranked proposal to jump another 0.2, to something a bit less than 0.8. But instead, it jumps more than twice as much. That’s pretty suggestive that it was the funding that mattered. Estimated more precisely, Howell finds getting one of the SBIR’s phase 1 grants increases patenting by about 30%. Santoleri et al. (2020) get similar results, estimating a 30-40% increase in patenting from getting a phase 2 grant (though note that the phase 2 grants in the EU tend to be a lot larger than the phase 1 US grants).

To the extent we’re happy with patents as a measure of innovation, we’ve already shown that the program successfully manages to buy innovations. But the papers actually document a voluminous set of additional indicators all associated with a healthy and flourishing innovative business (in all cases below, SBIR refers to phase 1 and SME refers to phase 2):

  • Winning an SBIR grant doubles the probability of getting venture capital funding; winning an SME instrument grant triples the probability of getting private equity funding

  • Winning an SBIR grant increases annual revenue by $1.3-1.7mn, compared to an average of $2mn

  • Winning an SME instrument grant increases the growth rate of company assets by 50-100%, the growth rate of employment by 20-30%, and significantly decreases the chances of the firm failing.

Benchmarking Value for Money

OK, so winning money helps firms. Is that surprising? Do we really need scientists to tell us that? In fact, it’s not guaranteed. When Wang, Li, and Furman (2017) apply this methodology to a similar program in China they don’t find the money makes a statistically significant difference. That could be for a lot of reasons (discussed in the paper), but the main point is simply that we can’t take for granted seemingly obvious results like “giving firms money helps them.” 

But still, even if we find R&D grants help firms, that doesn’t necessarily imply it’s a good use of funds. We want to know the return on this R&D investment. That’s challenging because although we know the cost of these programs, it’s hard to put a solid monetary value on the benefits that arise from them, which is what we would need to do to calculate a benefits cost ratio.

So let’s take a different tack. One thing we can measure reasonably well is whether firms get a patent. So let’s just see how many patents these programs generate per R&D dollar and compare that to the number of patents per R&D dollar that the private sector generates. If we assume the private sector knows what it’s doing in terms of getting a decent return on R&D investment, then that gives us a benchmark against which we can assess the performance of these government programs.

So how many patents per dollar does the private sector get? If you divide US patent grants (from domestic companies) over 2010-2017 by R&D funded by US businesses in the same year, you pretty consistently get a ratio of around 0.5 patents per million dollars of R&D (details here). That’s about the same ratio as this post finds, looking only at 7 top tech companies. 

To be clear, the point isn’t that each patent costs 2 million dollars of R&D. R&D doesn’t just go into patents. This report found in 2008 that only about 20% of companies that did R&D reported a patent. Taking that as a benchmark, suppose that only 20% of inventions get patented; in that case, we could think of this as telling us that every $2mn in R&D generates 5 “innovations” of which one gets patented. As long as SBIR/SME grant recipients have a similar ratio between innovation and patenting as other US R&D performing firms, then looking at patents per R&D for them is an OK benchmark for the productivity of R&D spending.
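The back-of-the-envelope arithmetic in that paragraph can be written out as follows. The 20% patenting share is the assumption borrowed from the report mentioned above, and the cost per “innovation” is implied by the two inputs rather than a figure from any of the papers.

```python
PATENTS_PER_MILLION_RD = 0.5          # private-sector benchmark discussed above
SHARE_OF_INNOVATIONS_PATENTED = 0.20  # assumption: only 1 in 5 innovations is patented

rd_per_patent_mn = 1 / PATENTS_PER_MILLION_RD                # $mn of R&D per patent
innovations_per_patent = 1 / SHARE_OF_INNOVATIONS_PATENTED   # innovations behind each patent
rd_per_innovation_mn = rd_per_patent_mn / innovations_per_patent

print(f"R&D per patent: ${rd_per_patent_mn:.1f}mn")
print(f"Innovations per patent: {innovations_per_patent:.0f}")
print(f"R&D per innovation: ${rd_per_innovation_mn:.1f}mn")
```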

If that sounds good enough to you, read on! If not, I say a bit more about this in an extra note at the end of this post. Feel free to check that out and then come back here if you’re feeling skeptical.

So do these programs generate patents at a similar rate of 0.5 per million? Yes!

Value for money in the SBIR Program

This isn’t something that Howell (2017) or Santoleri et al. (2020) calculate directly, but you can back out estimates from their results using a method described in the appendix of our next paper, Myers and Lanahan (2021). Myers and Lanahan estimate Howell’s results imply the DOE SBIR program gets about 0.8-1.3 patents per million dollars. Applying their method to the range of estimates in Santoleri et al. (2020) and converting into dollars, you get something in the ballpark of 0.7 patents per million dollars in the SME instrument program (see the extra notes section at the bottom for more on where that number comes from). In either case, that compares pretty favorably with a rough estimate of 0.5 patents per million R&D dollars for the US private sector.

That’s reassuring, but it’s not exactly what we’re interested in. As stated at the outset, Jones and Summers’ thought experiment implies that R&D is a really good investment once you take into account all the social benefits. What we have here is evidence that the SBIR and SME instrument programs can probably match the private sector in terms of figuring out how to wisely spend R&D dollars to purchase innovations. Frankly, that seems plausible to me. It just means governments, working with outside technical experts (that even the private sector might need to turn to) could do about as well as the private sector. But these results don’t tell us much about the benefits that accrue from these R&D investments that aren’t captured by the patents the grant recipients get.

But that’s what Myers and Lanahan (2021) is about. What they would like to see is how giving R&D money to different technology sectors leads to more patents in that sector by grant recipients, as well as other impacts on patenting in general. For example, if we give a million dollars to a couple firms working on solar powered water desalination, how many new solar water patents do we get from those grant recipients? What about solar water patents from other people? What about patents that aren’t about solar powered water desalination at all?

Like Howell, they’re going to look at the Department of Energy’s SBIR program. They need to use a different quirk of the SBIR though, because they’re not comparing firms that get funds to firms that don’t; they’re comparing entire technology fields that get more money to fields that get less money. 

Instead, they rely on the fact that some US states have programs to match SBIR funding with local funds. Importantly, DOE doesn’t take that into account when deciding how to dole out funds. For example, in 2006, North Carolina began partially matching the funds received by SBIR winners in the state. If a bunch of winning applicants in solar technology happen to reside in, say, North Carolina in 2008 instead of South Carolina in 2008 or North Carolina in 2005, then those recipients get their funds partially matched by the state and solar technology research, as a field, gets an unexpected windfall of R&D dollars. What Myers and Lanahan end up with is something close to random R&D money drops for different kinds of technologies. 

Myers and Lanahan use variation in this unexpected “windfall” money to generate estimates of the return on R&D dollars. For this to work, you have to believe there are no systematic differences between SBIR recipients that reside in states with matching programs and those that don’t, and they present some evidence that this is the case.

One more hurdle though. You can crudely measure innovation by counting patents. And, with some difficulty, you can come up with estimates of more-or-less random R&D allocations to different technologies. But if you want to see how the one affects the other, you have to link patents to SBIR technology areas. Myers and Lanahan accomplish this with natural language processing. For every SBIR grant competition, they analyze the text of the competition description and identify the patent technology categories whose patents are textually most similar to this description. When, say, solar technology gets a big windfall of R&D money, they can see what happens to the number of patents in the patent technology categories that historically have been textually closest to the DOE’s description of what it was looking to fund. And this is also how they measure the broader impact of SBIR money on other technologies. When solar gets a big windfall, what happens to the number of patents in technology categories that are not solar, but are kind of “close” to solar technology (as measured by text similarity)?

OK! So that’s what they do. What do they find? More money for a technology means more patents!

In the above figure, each dot is a patent technology group and compares the funding received by that technology to subsequent patenting. The figure looks at the patents of DOE SBIR recipients only and nicely illustrates the importance of doing the extra work of trying to estimate “windfall” funding. The steep dotted green line is what you get if you just tally up all the funding the SBIR program gives to different technologies - it looks like a little more funding gives you a lot more patents. But this is biased by the fact that the technologies that get the most money were already promising (that’s why they got money!). The flatter dark blue line is the relationship between quasi-random windfall money and patenting. It’s still the case that more money gets you more patents, but the relationship isn’t as strong as the green one. But this is the more informative estimate on the actual impact of cash. 
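The gap between the two lines can be reproduced with a toy simulation. This is only a sketch of the logic, not Myers and Lanahan's actual specification. Every number is invented: each field has an unobserved “promise” that attracts planned funding and also raises patenting on its own, while windfall money is assigned independently of promise.

```python
import random


def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate_fields(n_fields=20000, true_patents_per_dollar=0.5, seed=2):
    """Return (total_funding, windfall, patents) lists for simulated technology fields."""
    rng = random.Random(seed)
    total_funding, windfall, patents = [], [], []
    for _ in range(n_fields):
        promise = rng.gauss(0, 1)                    # unobserved quality of the field
        planned = 3.0 + promise + rng.gauss(0, 0.2)  # funders steer money to promising fields
        matched = rng.uniform(0, 1)                  # windfall, unrelated to promise
        funding = planned + matched
        output = true_patents_per_dollar * funding + promise + rng.gauss(0, 0.5)
        total_funding.append(funding)
        windfall.append(matched)
        patents.append(output)
    return total_funding, windfall, patents


total_funding, windfall, patents = simulate_fields()
naive_slope = ols_slope(total_funding, patents)  # analogue of the steep line
windfall_slope = ols_slope(windfall, patents)    # analogue of the flatter line

print(f"naive slope: {naive_slope:.2f}, windfall slope: {windfall_slope:.2f}, true: 0.50")
```

The naive slope picks up the fact that promising fields get both more money and more patents. The windfall slope recovers something close to the effect built into the simulation.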

Using estimates based on windfall funding, an extra million dollars is associated with SBIR recipients being granted about 0.5 additional patents. Which is pretty typical (or so I’ve argued here). But the more important finding is that’s only a fraction of the overall benefit. When there’s more R&D in a given technology sector, we typically think that creates new opportunities for R&D from other firms, because they can learn from the discoveries made by the grant recipient. Indeed, other papers have found spillovers are often just as important, or even more important, than the direct benefits to the R&D performer.

Myers and Lanahan get at this in two ways. First, they look for an impact of R&D funding not only on the patent technology classes that are closest to SBIR’s description of the funding competition, but also ones that are more textually distant. Typically, a bigger share of extra patent activity comes from classes that are not the closest fields, but still closer than a random patent (consistent with other work). Second, they look at patents held by people who are not SBIR recipients themselves, but who live closer or farther away from SBIR recipients.

So, looking only at SBIR recipients, an extra million tends to produce an extra 0.5 patents. Looking at patents belonging to anyone in the same county as an SBIR recipient - a group we might assume is likely to contain people with similar technical expertise and possibly overlapping social and professional networks - an extra million tends to produce an extra 1.4 patents (across a wide range of technology fields). And looking at all US patents (from inventors residing anywhere in the world), an extra million tends to produce an extra 3 patents.

If all those patents are equally valuable, that would imply when the SBIR gives out money, the innovation outputs created by recipients are only a small part of the overall effect (0.5 of 3 total patents). Of course, all those patents are not, in fact, equally valuable. The ones created by the grant recipients tend to be more highly cited than the ones that we’re attributing to knowledge spillovers. Still, Myers and Lanahan estimate that after adjusting for the quality of patents, half the value generated by an SBIR grant is reflected in the patents of non-recipients working on different (but not too different) technology.

Prospects for Scaling Up

Whew! That’s a lot. To sum up: we’ve got some good theoretical reasons to think the return on R&D is very high, on average. If we look at a specific R&D program that gives R&D grants to small firms, the grants are effective at funding innovation at about the same level as the private sector could manage. And if we try to assess the broader impact of that funding, we find including all the social benefits gives us a return at least twice as high as the ones we got by focusing just on the grant recipients; and those were already decent! All together, more evidence that we ought to be spending more on R&D.

Lastly, we have good reason to think these effects can also be maintained if we scale up these programs. The design of Howell (2017) and Santoleri et al. (2020) is premised on estimating the impact of R&D funding on firms right around the cut-off. For the purposes of scaling up, that’s great news, because if we increased funding the firms that would get extra money would be ones that are closest to the cut-off.


Extra credit

  • One reason patents per R&D dollar might be a bad benchmark in this case is if we think small firms like the ones getting R&D grants are more likely to seek patents than the typical R&D performing firm. There are some good reasons to think that’s the case: basically, they’re small but they aim to grow on the back of their technologies and so they need all the protection they can get. But Howell (2017) and Santoleri et al. (2020) find the median firm in these programs still has zero patents (even after winning an award). If just 20% of firms that do R&D also have a patent, then these grant recipients can’t be much more than twice as likely to get patents as everyone else. Alternatively, if you compare patents per dollar of small firms to large ones for the USA, they don’t look that different in aggregate. Nonetheless, I fully concede R&D is certainly a noisy predictor of patents; but I’m out of other ideas.

  • Santoleri et al. (2020) find the mean patents per firm is 4 among phase II applicants, and that getting a phase II grant increases cite-weighted patenting by 15-40%. That implies between 0.15 x 4 = 0.6 and 0.4 x 4 = 1.6 patents for every €0.5-2.5mn, or 0.2-3.2 patents per million euros, or 0.2-2.9 patents per million dollars. Using the midpoint of each you get 0.7 patents per million dollars.
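For anyone who wants to check the arithmetic in the last bullet, here is a short sketch. The exchange rate of 1.1 dollars per euro is an assumption, chosen because it is what the conversion from 3.2 to 2.9 implies; the bullet does not state the rate.

```python
MEAN_PATENTS_PER_FIRM = 4
EFFECT_LOW, EFFECT_HIGH = 0.15, 0.40            # 15-40% increase in patenting
GRANT_LOW_EUR_MN, GRANT_HIGH_EUR_MN = 0.5, 2.5  # phase 2 grant size, millions of euros
USD_PER_EUR = 1.1                               # assumed exchange rate (not stated in the bullet)

extra_patents_low = EFFECT_LOW * MEAN_PATENTS_PER_FIRM    # fewest extra patents
extra_patents_high = EFFECT_HIGH * MEAN_PATENTS_PER_FIRM  # most extra patents

# Widest possible range: fewest patents on the largest grant, most on the smallest.
per_mn_eur_low = extra_patents_low / GRANT_HIGH_EUR_MN
per_mn_eur_high = extra_patents_high / GRANT_LOW_EUR_MN
per_mn_usd_low = per_mn_eur_low / USD_PER_EUR
per_mn_usd_high = per_mn_eur_high / USD_PER_EUR

# Midpoint estimate: midpoint of extra patents over midpoint of grant size.
mid_patents = (extra_patents_low + extra_patents_high) / 2
mid_grant_eur_mn = (GRANT_LOW_EUR_MN + GRANT_HIGH_EUR_MN) / 2
midpoint_per_mn_usd = mid_patents / mid_grant_eur_mn / USD_PER_EUR

print(f"range per million euros: {per_mn_eur_low:.1f}-{per_mn_eur_high:.1f}")
print(f"range per million dollars: {per_mn_usd_low:.1f}-{per_mn_usd_high:.1f}")
print(f"midpoint per million dollars: {midpoint_per_mn_usd:.1f}")
```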

What are the returns to R&D?

A Thought Experiment by Jones and Summers


Jones and Summers (2021) is a new working paper that attempts to calculate the social return on R&D - that is, how much value does a dollar of R&D create? The paper is like something out of another time; the argument is so simple and straightforward that it could have been made at any point in the last 60 years. It requires no new math or theoretical insights; just basic accounting and some simple data. The main insight is simply in how to frame the problem. 

What I want to do in this post is walk through Jones and Summers’ simple “thought experiment.” At the end, we’ll have a new argument that the returns on R&D are quite high and that we should probably be spending much more on R&D. Next week we’ll look at some empirical data to see if it matches the intuition of the thought experiment. (Spoiler: It does)

Taking an R&D Break

Let’s start with a model of long-run changes in material living standards that is so simple it’s hard to argue with: 

  • R&D is an activity that consumes some of the economy’s resources

  • R&D is the only way new technologies come into existence

  • Growth in GDP per capita comes entirely from new technologies (at least, in the long run)

We’ll re-examine all of these points later, but for now let’s accept them and move on. 

This model helps clarify what it means to compute the returns to R&D. If we do more R&D, we have to use more of the economy’s resources, but in return we’ll get more GDP per capita. So computing the returns to R&D is really about computing how much growth changes when we spend a bit more on R&D. Specifically, if we increase R&D by, say, 1%, what will the expected impact be on GDP per capita? 

That’s actually a really hard question to answer! And the clever thing Jones and Summers do is they don’t ask it. Instead, they ask a different question which is much easier to answer: what would happen if we took a break from R&D for a year?

Why is this easier to answer? Because in our simple model, if we stop all R&D, we stop all growth! We no longer have to estimate “how much” growth we get for an extra dollar of R&D. We know that if we stop all R&D, we stop all growth. Simple as that!

Let’s get more specific. Suppose under normal circumstances, we spend a constant share of GDP on R&D. Let’s label the share s. In return, the economy grows by a long-run average that we’ll call g. In the USA, between 1953 and 2019, the annual share of GDP spent on R&D was about 2.5%, so s = 0.025. Over the same time period, GDP per capita (adjusted for inflation) grew by about 1.8% per year, so g = 0.018. If we hit “pause” on R&D for one year, then in that year we save 2.5% of GDP (since we don’t have to spend it on R&D), but GDP per capita stays stuck at its current level for one year, instead of growing by 1.8%. 

But that’s not a full accounting of the benefits or the costs of doing R&D. In the next year, our R&D break will end and we’ll start spending 2.5% of GDP on R&D again. But because we took that break and didn’t grow in the previous year, GDP will be smaller than it would otherwise have been. Since we always spend 2.5% of GDP on R&D, we’ll be devoting a bit less money to R&D than we otherwise would (since it will be 2.5% of a smaller GDP). And because we didn’t grow in the previous year, we’ll also be growing from a lower level than we would have been if we hadn’t taken our R&D break. And that will be true in the next period, and the next, and the next: in every year until the end of time, GDP per capita will be 1.8% lower than it would have been if we had not taken that R&D break.

Adding up all these costs and benefits over time requires us to do some calculations using the interest rate r, which is how economists value dollars at different points in time. In the USA, a common interest rate to use might be 5%, so that r = 0.05. Jones and Summers show the math shakes out so that the ratio from here to infinity of benefits from R&D to costs of R&D is:

Benefits-to-Cost Ratio = g/(sr)

In other words, on average the return on a dollar spent on R&D is equal to the long-run average growth rate, divided by the share of GDP spent on R&D and the interest rate. With g = 0.018, s = 0.025, and r = 0.05, this gives us a benefits to cost ratio of 14.4. Every dollar spent on R&D gets transformed into $14.40! 
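As a quick check on that number, here is a small sketch. The first function is just the formula above. The second is a deliberately simplified way to see where the formula comes from, and it is my shorthand rather than Jones and Summers' derivation: it treats the benefit as a loss of g (as a share of GDP) in every future year, discounted at r, and the cost as the one-off saving of s.

```python
def benefit_cost_ratio(g, s, r):
    """Benefit-cost ratio of R&D: growth rate over (R&D share times interest rate)."""
    return g / (s * r)


def discounted_growth_loss(g, r, years):
    """Present value of losing a share g of GDP every year, from next year on."""
    return sum(g / (1 + r) ** t for t in range(1, years + 1))


g, s, r = 0.018, 0.025, 0.05

baseline = benefit_cost_ratio(g, s, r)
# Over a long horizon the discounted sum approaches g / r, so dividing by s
# approaches the formula above.
approximation = discounted_growth_loss(g, r, years=500) / s

print(f"formula: {baseline:.1f}, long-horizon sum: {approximation:.1f}")
```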

One thing I really like about this result is that you do not need any advanced math to derive it. It’s just a consequence of algebra and the proposed model of how growth and R&D are linked. In the video below, I show how to get this result without using any math more advanced than algebra.

Can that really be all there is to it? Well, no. If we look more critically at the assumptions that went into generating this number, we can get different benefit-cost ratios. But the core result of Jones and Summers is not any exact number. It’s that whatever number you believe is most accurate, it’s much more than 1. R&D is a reliable money-printing machine: you put in a dollar, and you get back out much more than a dollar. 

But let’s turn now to some objections to the simple argument I’ve made so far.

Is there really no growth without R&D?

Starting at the beginning, we might question the assumption that R&D resources are really the only way to get improvements in per capita living standards. If that’s wrong, and growth can happen without R&D, then our thought experiment would be over-estimating the returns to R&D, since growth wouldn’t actually go to zero if we (hypothetically) stopped all R&D for that year.

There are two ways we could get growth without doing R&D. First, it may be that we can get new technologies without spending resources on R&D. Second, we could get growth without new technologies.

The latter case is basically excluded by assumption in economics, at least for countries operating at the technological frontier. In 1956, Robert Solow and Trevor Swan argued that countries cannot indefinitely increase their material living standards by investing in more and more capital. That’s because the returns to investment drop as you run out of useful things to build, until you reach the point where the returns to investment are offset by the cost of upkeep. To keep growth going, you need to discover new useful things to build. You need new technology.

On the other hand, the first objection - that we may be able to get new technologies without spending resources on R&D - has more going for it. For instance, a common understanding of innovation is that it’s about flashes of insight, serendipity, and ideas that come to you in the shower. Good ideas sometimes just come to us without being sought.

The trouble with this notion of innovation is that in almost all cases, the free idea is only part of the story. It might provide a roadmap, but there is still a long journey from the idea to the execution, and that journey typically requires resources to be expended. In terms of our thought experiment, if it still takes R&D to translate an unplanned inspiration into growth, then we are actually measuring the returns to R&D correctly. If we turned off R&D, those insights wouldn’t get realized, and so growth would freeze until we began R&D again.

But maybe that’s not always the case. In The Secret of Our Success, Joseph Henrich gives a (fictional) example of how a package of hunting techniques could evolve over several generations without diverting any economic resources to innovation. In the example, proto-humans use sticks to fish termites out of a nest to eat, but one of them mistakenly believes the stick must be sharpened (their mother taught them the technique with a stick that happened to be sharp). One day, they accidentally plunge their sharp stick into an abandoned termite mound and impale a rodent - they have “invented” a spear. The proto-humans start using the sticks to impale prey. A generation later, another proto-human sees rabbits leaving tracks in the mud and going back into their hole, and realizes it is possible to follow tracks to the hole and use the spear, instead of just hoping to see an animal. Bit by bit, cumulative cultural evolution can happen, leading to a steadily more technologically sophisticated society.

These kinds of processes still happen today. In learning-by-doing models of innovation, firms get more productive as they gain experience in a production process. The process by which this happens is likely another form of evolution, with workers and managers tinkering with their process and selectively retaining the changes that improve productivity. We could call this kind of tinkering R&D if we wanted, but it’s almost certainly not part of the national statistics. 

But here’s the rub. With modern learning-by-doing, we typically think of firms and workers finding efficiencies and productivity hacks in production processes that are novel. And where do new and unfamiliar production processes come from? In the modern world, typically they are the result of purposeful R&D. If that’s the case, then in the long run we are once again accounting correctly for the costs and benefits of R&D. In this case, turning off R&D for one year would delay by one year the creation of new production processes that would then experience rapid learning-by-doing gains in subsequent years.

Of course, there still might be learning-by-doing with older technologies. But learning-by-doing models typically assume progress is very, very slow in mature technologies because there are not many beneficial tweaks left to discover. The process has already been optimized.

That’s pretty consistent with what we know about growth in the era before much purposeful R&D. Tinkering and cumulative cultural evolution is probably the right model for innovation before the industrial revolution, and as best as we can tell, growth during that era was painfully slow. Nearly zero, compared to today’s standards.

All that said, if you still believe growth can happen without R&D, then you can still use Jones and Summers’ approach to compute the benefits of R&D and adjust the estimate to take all this into account. It’s just that now you need to take only the fraction of growth that comes from R&D as your benefit. I have argued that almost all long-run growth comes from R&D. But if you think it’s just 50%, then that would cut the benefit-cost ratio of R&D in half - to a still very high 7.2.

What about other costs?

A second objection to our initial estimate of the returns to R&D makes the opposite point: R&D is not costlessly translated into growth. New ideas must be laboriously spun out into new products and infrastructure that are then disseminated across the economy, before growth benefits are realized. Focusing exclusively on the R&D costs overstates the returns to R&D by understating the full costs of getting growth.

Take the covid-19 vaccine as an example. Pfizer has said the R&D costs of developing the vaccine were nearly $1bn. But once Pfizer had an FDA-approved vaccine, the benefits were not instantly realized by society. Instead, the shots needed to be manufactured and put into arms, and the cost of building that manufacturing capacity ought to be accounted as part of the cost of deriving a benefit from the R&D.

We don’t know exactly how much the US spends on “embodying” newly discovered ideas in physical form so that they can affect growth. But we do know that since 1960, the total US private sector investment in new capital (not merely upkeep or replacement of existing capital) has been about 4.0% of GDP per year. Not all of that is the upgrading of capital to incorporate new ideas. Some of it is just extending existing forms of capital over a growing population (think building new houses). But it’s a plausible upper bound on how much we spend turning ideas into tangible things. 

If we add the 4.0% of GDP spent annually on net investment to the 2.5% spent explicitly on R&D, we get a revised estimate that the US spends 6.5% of GDP per year on creating and building new technologies. If we return to our original estimate for the benefit-cost ratio of R&D, but use s = 0.065 instead of s = 0.025, we get that the benefit-cost ratio is 5.5. Every dollar spent on R&D plus investment still generates $5.50 in value!
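The ratios quoted so far can be reproduced in a few lines. This is only a sketch: it assumes the benefit-cost ratio takes the form g / (s × r), which is the form of the formula quoted later in this post, with g = 1.8% growth, r = 5%, and s the share of GDP spent. The function name and the `share_from_rd` parameter are my own labels, not anything from Jones and Summers.

```python
def benefit_cost_ratio(g, s, r, share_from_rd=1.0):
    """Benefit-cost ratio g / (s * r), scaled by the share of growth credited to R&D.

    Assumed form of the calculation; the parameter names are illustrative.
    """
    return share_from_rd * g / (s * r)


G = 0.018  # long-run growth in GDP per capita
R = 0.05   # interest rate used for discounting

baseline = benefit_cost_ratio(G, s=0.025, r=R)                        # R&D only
half_credit = benefit_cost_ratio(G, s=0.025, r=R, share_from_rd=0.5)  # only 50% of growth from R&D
with_investment = benefit_cost_ratio(G, s=0.065, r=R)                 # R&D plus net investment

print(round(baseline, 1), round(half_credit, 1), round(with_investment, 1))
# prints: 14.4 7.2 5.5
```

Changing `s` or `share_from_rd` is an easy way to see how sensitive the headline number is to what you count as a cost and what you count as a benefit.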

Does R&D Instantly Impact Growth?

OK, so it’s important to count costs correctly. By the same token, we may believe the benefits of R&D are overstated. The simple framework I laid out above assumed if you pause R&D, you pause growth at the same time. Clearly that’s incorrect. 

In reality, R&D is not instantly translated into growth. About 17% of US R&D is spent on basic research - that is, science that is not necessarily directed towards any specific technological application. As I’ve argued before, this kind of investment does eventually lead to technological innovation, but it takes time: twenty years is not a bad estimate of how long it takes to go from science to technology.

Invested at 5% annually, $1 today is worth $2.65 in twenty years. Alternatively, $1 received in twenty years is only worth $0.38 today (since you can invest the $0.38 at 5% per year and end up back with $1 in twenty years). The implication is that benefits that arrive in the more distant future should be more discounted in our accounting framework. For example, if we believed spending R&D resources today only had an impact on growth in 20 years, then we would want to discount our estimate of the benefits to 38% of the levels we came up with when we naively assumed the benefits of R&D arrived instantly. That would imply a benefit-cost ratio of 5.5, as compared to the 14.4 we initially computed.

But that’s surely a big over-estimate, since only 17% of R&D is spent on basic science. The other 83% is spent on applied science and development, both of which have much shorter time horizons. Just to illustrate, let me assume 17% of R&D has a 20-year time horizon (38% discount), 33% of R&D has a 10-year horizon (61% discount), and the remaining 50% of R&D has a 5-year horizon (78% discount). In that case, the average discount we should apply, due to the fact that R&D is not instantly translated into growth, is 66%. That implies a benefit-cost ratio of 9.5, as compared to the original 14.4. Again - the point is not any specific number. Just that under a lot of sensible assumptions, the return is a lot more than 1!
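For readers who want to check the discounting, here is a minimal sketch. It assumes simple annual compounding at 5% and uses the illustrative 17/33/50 split from the paragraph above; the function and variable names are hypothetical.

```python
def discount_factor(years, rate=0.05):
    """Present value of $1 received `years` from now, with annual compounding."""
    return 1 / (1 + rate) ** years


# (share of R&D spending, years until it affects growth): the illustrative split above
horizons = [(0.17, 20), (0.33, 10), (0.50, 5)]

average_discount = sum(share * discount_factor(years) for share, years in horizons)

print(round(discount_factor(20), 2))      # value today of $1 in twenty years
print(round(average_discount, 2))         # spending-weighted average discount
print(round(14.4 * average_discount, 1))  # discounted benefit-cost ratio
```

The weighted average comes out at roughly 0.66, which is where the 9.5 figure comes from.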

What about other benefits?

So far we have looked at some ways in which the benefit-cost ratio is over-estimated. Of these, I think the argument that we should include investment as part of the cost of getting a benefit from R&D is a good one, as well as the argument that we should discount the benefits by time since they don’t arrive instantly. Combining those estimates gives us a benefit-cost ratio on the order of 3.6 (i.e., 0.66 * 0.018 / (0.065 * 0.05)). Every dollar spent on R&D + investment gets us at least $3.60 in value!

But we also have plenty of reasons why we could argue it is inappropriate to simply use GDP per capita as our measure of the benefits of R&D. There are many benefits from R&D that may not show up in the GDP numbers: reduced carbon emissions from alternative energy sources and greater fuel efficiency; the reduction in work hours that more productive technology has allowed us to realize over the last century; the increased value of leisure time due to the internet; the years of life saved by the covid-19 vaccine; indeed, the years of life saved by biomedical innovation overall. 

Jones and Summers take a stab at an estimate for the benefits of biomedical innovation that do not show up in GDP. Biomedical innovation is probably the single largest sector of our innovation system: probably 20-30% of total R&D spending. One way to try and get at the non-GDP benefits of this biomedical innovation is to estimate the value people place on longer lives using things like their spending to reduce their risk of death. I’m not sure how much confidence we want to put in those numbers, but Jones and Summers estimate that a range of reasonable estimates would lead us to increase the estimated benefits of R&D by 20-140%. Taking my tentatively favored benefit-cost ratio of 3.6 as our starting line, scaling up the benefits by 20-140% gets us a range of 4.3-8.6.
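A quick check of this range, as a sketch: the combined ratio is recomputed from the formula given earlier, and the scaling starts from the rounded 3.6 used in the text. Variable names are illustrative.

```python
# Combined adjustment: time discount (0.66) and the broader cost base (s = 0.065)
combined = 0.66 * 0.018 / (0.065 * 0.05)  # a little above 3.6

# Scale the rounded 3.6 up by 20% and by 140% for non-GDP biomedical benefits
low = 3.6 * (1 + 0.20)
high = 3.6 * (1 + 1.40)

print(round(combined, 2), round(low, 1), round(high, 1))
```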

Estimating the general non-GDP benefits of innovation beyond biomedical innovation is probably an inherently subjective task. But here’s one attempt at a thought experiment to get a sense of how much value you get out of innovations that isn’t reflected in GDP. Suppose there was a magical genie (such things happen in thought experiments) who offered to set you on one of two parallel timelines. 

The first is our own timeline, where innovation will happen the same as it has been for a century, and GDP per capita growth will continue to be 1.8% per year. The second timeline is a weird one where technology is frozen at our current level, but (magically) everyone gets richer at a rate of 2.25% per year (as long as you do R&D) - 25% faster than in our current timeline. That is, in the second timeline, you get a bit more money, but you don’t have access to new products and services that innovation would bring. If we were to compute the benefits to cost ratio of R&D in that second world it would be 25% higher than in our timeline, since growth is 25% faster.

If GDP per capita is a good metric of the value of innovation, you’ll clearly choose the second timeline. But if you pick the first one, it means you value access to the newly invented technologies at a level that is at least 25% above their measured impact on GDP per capita. 

It’s kind of hard to think what choice you would actually make in this scenario, since choosing between different growth rates is a very foreign decision to most of us. So consider an alternative formulation where the genie offers you the following choice:

  • A cash payment (right now) equal to 20% of your current income, plus the opportunity to purchase products and services developed between now and 2031

  • A cash payment (right now) equal to 25% of your current income

Which do you choose?

The first choice is basically where you will expect to be in the year 2031; 1.8% growth compounded over 10 years means you’ll have 20% more income. And in the year 2031, you’ll also have access to all technologies invented between now and then. The second choice gives you a growth rate that is about 25% higher than the other, but no access to the non-monetary benefits of innovation - just the cash. Again, if you pick the first choice, you are saying GDP per capita undervalues the benefits of innovation over the next decade by at least 25%. And so you should scale up your assessment of the benefits to cost ratio of R&D by 25%.

What if option #2 was a payment equal to 30% of your current income? If you would still prefer option #1 in that case, then you think GDP per capita undervalues the benefits of innovation by 50%. And so you should scale up your assessment of the benefits to cost ratio of R&D by 50%. And so on.
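The logic of the genie's offer can be written down directly. This is a sketch under the assumptions in the text: 1.8% growth compounded for ten years, rounded to a 20% income gain, against which each cash-only offer is compared. The helper name is hypothetical.

```python
growth = 0.018
years = 10

# Income gain from ten years of 1.8% growth (the post rounds this to 20%)
income_gain = (1 + growth) ** years - 1


def implied_undervaluation(cash_offer, baseline_gain=0.20):
    """How much bigger the cash-only offer is than the baseline income gain.

    If you still prefer the baseline plus new technology, GDP per capita
    undervalues innovation by at least this fraction.
    """
    return cash_offer / baseline_gain - 1


print(round(income_gain, 3))
print(round(implied_undervaluation(0.25), 2))  # the 25% cash offer
print(round(implied_undervaluation(0.30), 2))  # the 30% cash offer
```

The largest cash offer you would still turn down gives a lower bound on how much you think GDP per capita undervalues innovation.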

What if the average doesn’t matter?

All told, Jones and Summers’ thought experiment essentially argues that R&D is a money-printing machine. Ignoring benefits that don’t accrue to GDP, every dollar you put into your R&D machine gets you back more than $3.60 in value. Possibly much more. So why don’t we use this money printing machine much more? Why are we only spending $2.50 out of every $100 on R&D?

There are two main reasons. First, the value created by R&D is distributed widely throughout society and does not primarily accrue to the R&D funder. If I put $1 of my own cash into the R&D machine, I’m not getting back $3.60. Very likely I might get back less than the dollar I put in. The private sector funds about 70% of US R&D and for them the average social return on R&D doesn’t really matter. What matters is the private return that the firm will receive.

But that doesn’t account for why the US government doesn’t spend more on R&D. Presumably, it should care about the social return. One obvious possibility is that decision-makers in government face incentives that don’t reward R&D spending. Maybe election cycles are too short for any politician to get credit for funding more R&D; maybe the R&D funded by government gets implemented by businesses who get all the credit; maybe government is just skeptical of academic theory. I don’t know!

But another possibility is that it’s a problem related to knowledge. The average return to R&D must be quite high if we buy the argument just made. But that doesn’t mean the next dollar we spend will earn the average return. Maybe we funded the best R&D ideas first, and every additional dollar is spent on a successively less promising R&D project. Maybe the supply of talented scientists and inventors is already maxed out. If we want to argue that R&D should be increased, we want to know the marginal return to R&D; that is, how much extra GDP we will get if we spend another dollar on R&D, given what we’re already spending. As I said at the outset of this post, that’s a much harder question to answer. But there are some attempts to answer it, and they also find a quite high rate of return. We’ll look at those next week.


One Study, Many Results

Replication is hard, even without publication bias


Science is commonly understood as being a lot more certain than it is. In popular science books and articles, an extremely common approach is to pair a deep dive into one study with an illustrative anecdote. The implication is that’s enough: the study discovered something deep, and the anecdote made the discovery accessible. Or take the coverage of science in the popular press (and even the academic press): most coverage of science revolves around highlighting the results of a single new (cool) study. Again, the implication is that one study is enough to know something new. This isn’t universal, and I think coverage has become more cautious and nuanced in some outlets during the era of covid-19, but it’s common enough that for many people “believe science” is a sincere mantra, as if science made pronouncements in the same way religions do. 

But that’s not the way it works. Single studies - especially in the social sciences - are not certain. In the 2010s, it became clear that a lot of studies (maybe the majority) do not replicate. The failure of studies to replicate is often blamed (not without evidence) on a bias towards publishing new and exciting results. Consciously or subconsciously, that leads scientists to employ shaky methods that get them the results they want, but which don’t deliver reliable results.

But perhaps it’s worse than that. Suppose you could erase publication bias and just let scientists choose whatever method they thought was the best way to answer a question. Freed from the need to find a cool new result, scientists would pick the best method to answer a question and then, well, answer it.

The many-analysts literature shows us that’s not the case though. The truth is, the state of our “methodological technology” just isn’t there yet. There remains a core of unresolvable uncertainty and randomness in the best of circumstances. Science isn’t certain. 

Crowdsourcing Science

In many-analyst studies, multiple teams of researchers test the same previously specified hypothesis, using the exact same dataset. In all the cases we’re going to talk about today, publication is not contingent on results, so we don’t have scientists cherry-picking the results that make their work look most interesting; nor do we have replicators cherry-picking results to overturn prior results. Instead, we just have researchers applying judgment to data in the hopes of answering a question. Even so, results can be all over the map.

Let’s start with a really recent paper in economics: Huntington-Klein et al. (2021). In this paper, seven different teams of researchers tackle two research questions that had been previously published in top economics journals (but which were not so well known that the replicators knew about them). In each case, the papers were based on publicly accessible data, and part of the point of the exercise was to see how different decisions about building a dataset from the same public sources lead to different outcomes. In the first case, researchers used variation across US states in compulsory schooling laws to assess the impact of compulsory schooling on teenage pregnancy rates. 

Researchers were given a dataset of schooling laws across states and times, but to assess the impact of these laws on teen pregnancy, they had to construct a dataset on individuals from publicly available IPUMS data. In building the data, researchers diverged in how they handled different judgement calls. For example:

  • One team dropped data on women living in group homes; others kept them.

  • Some teams counted teenage pregnancy as pregnancy after the age of 14, but one counted pregnancy at the age of 13 as well.

  • One team dropped data on women who never had any children.

  • In Ohio, schooling was compulsory until the age of 18 in every year except 1944, when the compulsory schooling age was 8. Was this a genuine policy change? Or a typo? One team dropped this observation, but the others retained it.

Between this and other judgement calls, no team assembled exactly the same dataset. Next, the teams needed to decide how, exactly, to perform the test. Again, each team differed a bit in terms of what variables it chose to control for and which it didn’t. Race? Age? Birth year? Pregnancy year?

It’s not immediately obvious which decisions are the right ones. Unfortunately, they matter a lot! Here were the seven teams’ different results.

Depending on your dataset construction choices and exact specification, you can find that compulsory schooling lowers teenage pregnancy, increases it, or has no impact at all! (There was a second study as well - we will come back to that at the end.)

This isn’t the first paper to take this approach. An early paper in this vein is Silberzahn et al. (2018). In this paper, 29 research teams composed of 61 analysts sought to answer the question “are soccer players with dark skin tone more likely to receive red cards from referees?” This time, teams were given the same data but still had to make decisions about what to include and exclude from analysis. The data consisted of information on all 1,586 soccer players who played in the first male divisions of England, Germany, France and Spain in the 2012-2013 season, and for whom a photograph was available (to code skin tone). There was also data on player interactions with all referees throughout their professional careers, including how many of these interactions ended in a red card and a bunch of additional variables.

As in Huntington-Klein et al. (2021), the teams adopted a host of different statistical techniques, data cleaning methods, and exact specifications. While everyone included “number of games” as one variable, just one other variable was included in more than half of the teams’ regression models. Unlike Huntington-Klein et al. (2021), in this study, there was also a much larger set of different statistical estimation techniques. The resulting estimates (with 95% confidence intervals) are below.

Is this good news or bad news? On the one hand, most of the estimates lie between 1 and 1.5 (the estimates are odds ratios, so 1 means no effect). On the other hand, about a third of the teams cannot rule out zero impact of skin tone on red cards; the other two thirds find a positive effect that is statistically significant at standard levels. In other words, if we picked two of these teams’ results at random and called one the “first result” and the other a “replication,” they would only agree on whether the result is statistically significant or not about 55% of the time!

Let’s look at another. Breznau et al. (2021) get 73 teams, comprising 162 researchers to answer the question “does immigration lower public support for social policies?” Again, each team was given the same data. This time, that consisted of responses to surveys about support for government social policies (example: “On the whole, do you think it should or should not be the government’s responsibility to provide a job for everyone who wants one?”), measures of immigration (at the country level), and various country-level explanatory variables such as GDP per capita and the Gini coefficient. The results spanned the spectrum of possible conclusions.

Slightly more than half of the results found no statistically significant link between immigration levels and support for policies - but a quarter found more immigration reduced support, and more than a sixth found more immigration increased support. If you picked two results at random, they would agree on the direction and statistical significance of the results less than half the time!
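The "agree about 55% of the time" and "less than half the time" figures follow from a simple calculation: if two results are drawn independently, the chance they land in the same category is the sum of the squared category shares. Here is a sketch. The soccer shares use the "one third, two thirds" split from the text; the immigration shares are round numbers I picked to match the verbal description ("slightly more than half", "a quarter", "more than a sixth"), not the paper's exact counts.

```python
def agreement_probability(shares):
    """Chance that two independently drawn results fall in the same category."""
    assert abs(sum(shares) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(p * p for p in shares)


# Red cards: roughly 2/3 significant, 1/3 not significant
soccer = agreement_probability([2 / 3, 1 / 3])

# Immigration: assumed shares for no link, negative link, positive link
immigration = agreement_probability([0.57, 0.25, 0.18])

print(round(soccer, 3), round(immigration, 3))
```

Note that agreement falls quickly as results spread across more categories, even when one category holds a majority.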

We could do more studies, but the general conclusion is the same: when many teams answer the same question, beginning with the same dataset, it is quite common to find a wide spread of conclusions (even when you remove motivations related to beating publication bias).

At this point, it’s tempting to hope the different results stem from differing levels of expertise, or differing quality of analysis. “OK,” we might say, “different scientists will reach different conclusions, but maybe that’s because some scientists are bad at research. Good scientists will agree.” But as best as these papers can tell, that’s not a very big factor.

The study on soccer players tried to answer this in a few ways. First, the teams were split into two groups based on various measures of expertise (teaching classes on statistics, publishing on methodology, etc). The half with greater expertise was more likely to find a positive and statistically significant effect (78% of teams, instead of 68%), but the variability of their estimates was the same across the groups (just shifted in one direction or another). Second, the teams graded each other on the quality of their analysis plans (without seeing the results). But in this case, the quality of the analysis plan was unrelated to the outcome. This was the case even when they only looked at the grades given by experts in the statistical technique being used. 

The last study also split its research teams into groups based on methodological expertise or topical expertise. In neither case did it have much of an impact on the kind of results discovered.

So don’t assume the results of a given study are the definitive answer to the question. It’s quite likely that a different set of researchers, tackling the exact same question and starting with the exact same data, would have obtained a different result. Even if they had the same level of expertise!

Resist Science Nihilism!

But while most people probably overrate the degree of certainty in science, there also seems to be a sizable online contingent that has embraced the opposite conclusion. They know about the replication crisis and the unreliability of research, and have concluded the whole scientific operation is a scam. This goes too far in the opposite direction.

For example, a science nihilist might conclude that if expertise doesn’t drive the results above, then it must be that scientists simply find whatever they want to find, and that their results are designed to fabricate evidence for whatever they happen to believe already. But that doesn’t seem to be the case, at least in these multi-analyst studies. In both the study of soccer players and the one on immigration, participating researchers reported their beliefs before doing their analysis. In both cases there wasn’t a statistically significant correlation between prior beliefs and reported results.

If it’s not expertise and it’s not preconceived beliefs that drive results, what is it? I think it really is simply that research is hard and different defensible decisions can lead to different outcomes. Huntington-Klein et al. (2021) perform an interesting exercise where they apply the same analysis to different teams’ data, or alternatively, apply different analysis plans to the same dataset. That exercise suggests roughly half of the divergence in the teams’ conclusions stems from different decisions made in the dataset construction stage and half from different decisions made about analysis. There’s no silver bullet - just a lot of little decisions that add up.

More importantly, while it’s true that any scientific study should not be viewed as the last word on anything, studies still do give us signals about what might be true. And the signals add up.

Looking at the above results, while I am not certain of anything, I come away thinking it’s slightly more likely that compulsory schooling reduces teenage pregnancy, pretty likely that dark skinned soccer players get more red cards, and that there is no simple meaningful relationship between immigration and views on government social policy. Given that most of the decisions are defensible, I go with the results that show up more often than not. 

And sometimes, the results are pretty compelling. Earlier, I mentioned that Huntington-Klein et al. (2021) actually investigated two hypotheses. In the second, Huntington-Klein et al. (2021) ask researchers to look at the effect of employer-provided healthcare on entrepreneurship. The key to identification is that in the US, people become eligible for publicly provided health insurance (Medicare) at age 65. But people’s personalities and opportunities tend to change more slowly and idiosyncratically; they don’t suddenly change on a person’s 65th birthday. So the study looks at how rates of entrepreneurship compare between groups just older than the 65 threshold and those just under it. Again, researchers have to build a dataset from publicly available data. Again, every team made different decisions, such that none of the datasets are exactly alike. Again, researchers must decide exactly how to test the hypothesis, and again they choose slight variations in how to test it. But this time, at least the estimated effects line up reasonably well.

I think this is pretty compelling evidence that there’s something really going on here - at least for the time and place under study.

And it isn’t necessary to have teams of researchers generate the above kinds of figures. “Multiverse analysis” asks researchers to explicitly consider how their results change under all plausible changes to the data and analysis; essentially, it asks individual teams to try and behave like a set of teams. In economics (and I’m sure in many other fields - I’m just writing about what I know here), something like this is supposedly done in the “robustness checks” section of a paper. In this part of a study, the researchers show how their results are or are not robust to alternative data and analysis decisions. The trouble has long been that robustness checks have been selective rather than systematic; the fear is that researchers highlight only the robustness checks that make their core conclusion look good and bury the rest.
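To make the idea concrete, here is a toy multiverse analysis. Everything in it is an assumption for illustration: the data are synthetic, the three "defensible decisions" are invented, and the estimator is a plain difference in means rather than anything a real paper would stop at. The point is only the structure: enumerate every combination of choices and report the estimate from each one.

```python
import itertools
import random
import statistics

random.seed(0)

# Synthetic data (illustrative only): a true treatment effect of 0.3 plus noise
rows = []
for _ in range(2000):
    treated = random.random() < 0.5
    age = random.randint(13, 19)
    outcome = 0.3 * treated + 0.1 * (age - 16) + random.gauss(0, 1)
    rows.append({"treated": treated, "age": age, "outcome": outcome})

# Hypothetical judgement calls, each with more than one defensible option
choices = {
    "min_age": [13, 14],
    "max_age": [18, 19],
    "trim_outliers": [False, True],
}


def estimate(rows, min_age, max_age, trim_outliers):
    """Difference in mean outcomes (treated minus control) under one set of choices."""
    sample = [r for r in rows if min_age <= r["age"] <= max_age]
    if trim_outliers:
        sample = [r for r in sample if abs(r["outcome"]) < 3]
    treated = [r["outcome"] for r in sample if r["treated"]]
    control = [r["outcome"] for r in sample if not r["treated"]]
    return statistics.mean(treated) - statistics.mean(control)


# Run every combination of choices: 2 x 2 x 2 = 8 specifications
results = {}
for combo in itertools.product(*choices.values()):
    spec = dict(zip(choices.keys(), combo))
    results[combo] = estimate(rows, **spec)

# Report all specifications, sorted by estimate, rather than a hand-picked few
for combo, effect in sorted(results.items(), key=lambda kv: kv[1]):
    print(combo, round(effect, 3))
```

Reporting the full sorted list, rather than a selected subset, is what separates this from a selective robustness section.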

But I wonder if this is changing. The robustness checks section of economics papers has been steadily ballooning over time, contributing to the novella-like length of many modern economics papers (the average length rose from 15 pages to 45 pages between 1970 and 2012). Some papers are now beginning to include figures like the following, which show how the core results change when assumptions change and which closely mirror the results generated by multiple-analyst papers. Notably, this figure includes many sets of assumptions that show results that are not statistically different from zero (the authors aren’t hiding everything).

Economists complain about how difficult these requirements make the publication process (and how unpleasant they make it to read papers), but the multiple-analyst work suggests it’s probably still a good idea, at least until our “methodological technology” catches up so that you don’t have a big spread of results when you make different defensible decisions.

More broadly, I take away three things from this literature:

  • Failures to replicate are to be expected, given the state of our methodological technology, even in the best circumstances, even if there’s no publication bias.

  • Form your ideas based on suites of papers, or entire literatures, not primarily on individual studies.

  • There is plenty of randomness in the research process for publication bias to exploit. More on that in the future.

