Like the rest of New Things Under the Sun, this article will be updated as the state of the academic literature evolves; you can read the latest version here.
You can listen to this post above, or via most podcast apps here.
Recommendation: My Open Philanthropy colleague, Ajeya Cotra, has teamed up with Kelsey Piper at Vox to launch a newsletter about “a possible future in which AI is functionally making all the most important decisions in our economy and society.” I would have put the newsletter on my Substack recommendations, but it’s not on Substack, so I’m plugging it here. If you are thinking about AI these days - and who isn’t? - check it out!
A frequent worry is that our scientific institutions are risk-averse and shy away from funding transformative research projects that are high risk, in favor of relatively safe and incremental science. Why might that be?
Let’s start with the assumption that high-risk, high-reward research proposals are polarizing: some people love them, some hate them. It’s not actually clear this is true,1 but it seems plausible and for the purposes of this post I’m just going to take it as given. If this is true, and if our scientific institutions pay closer attention to bad reviews than good reviews, then that could be a driver of risk aversion. Let’s look at three channels through which negative assessments may have outsized weight in decision-making, and how this might bias science away from transformative research.
Reviewer Preferences
Let’s start with individual reviewers: how does the typical scientist feel about riskier research?
As far as I know, we don’t have good data directly on how academic peer reviewers feel about high-risk / high-reward research proposals. There is some work on how academic scientists treat novelty at the publication stage, but there might be some big differences between how risky research is judged at the proposal versus the publication stage (an argument developed in more detail in Gross and Bergstrom 2021). For one, after the research is done, you can often see if the risk paid off!
In this post I’m going to focus on work looking at research proposals. To learn about the preferences of peer reviewers, I’m going to look at Krieger and Nanda (2022), which provides some granular information about how working scientists in industry think about which kinds of pharmaceutical research projects to fund. Krieger and Nanda study an internal startup program at the giant pharmaceutical company Novartis. The program was meant to identify and rapidly fund “transformative, breakthrough innovation” developed by teams of scientists working within Novartis. Over 150 Novartis teams submitted applications for the funding, and these were screened down to a shortlist of 12 that pitched their proposals to a selection committee.
These pitches were made over video chat, due to covid-19, which meant they could be viewed by lots of people at once. About 60 additional Novartis research scientists watched some or all of the pitches, and Krieger and Nanda got them to score each research proposal on a variety of criteria and then allocate hypothetical money to the different proposals. What’s particularly interesting for us is that we can see how scientists rated different aspects of a proposal, and how that relates to their ultimate decision about what to (hypothetically) fund.
Participants in the study rated each proposal on:
Transformative potential (more creative, non-standard is better)
Breadth of applicability (more and higher value propositions)
Timescale to first prototype (within 18 months is better)
Feasibility/path to execution (more feasible is better)
Team (does the team have the skill and network to achieve the goal)
These different scores were aggregated into a weighted average that put extra weight on feasibility and the team, but put the most weight on a proposal’s transformative potential. (After all, that’s what the program was set up to fund.) Next, the study participants were asked how much money from a hypothetical budget to allocate to different projects. Note, when they’re doing this allocation, they can clearly see the weighted average of the scores they gave on each criterion, so it is obvious which proposals are supposed to get funding if you strictly follow the scoring formula that Novartis devised.
No surprise, Krieger and Nanda find that proposals with a higher score tend to get more hypothetical funding. But they also find that, all else equal, reviewers penalize projects that have greater variation among the different criteria. That is, when comparing two projects with the same weighted average, study participants give more money to a project if most of its criteria are close to the overall weighted average, and less money if some criteria are well above the average and some well below. That implies negative attributes of a project “count” for more in the minds of reviewers. Even if bad scores on some criteria are counterbalanced by higher scores on others, these kinds of projects still get less (hypothetical) funding than less uneven proposals.
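To make this concrete, here is a minimal sketch (the weights and scores are hypothetical, not Krieger and Nanda’s actual formula or data) of how two proposals can share essentially the same weighted average while one is much more uneven across criteria:

```python
# A minimal sketch with hypothetical weights and scores: the weights are chosen
# only so that transformative potential counts most, followed by feasibility
# and the team, as described above.
import statistics

criteria = ["transformative", "breadth", "timescale", "feasibility", "team"]
weights = [0.35, 0.10, 0.10, 0.225, 0.225]  # hypothetical weights, sum to 1

proposals = {
    "even":   [6.0, 6.0, 6.0, 6.0, 6.0],    # solid on every criterion
    "uneven": [9.0, 6.0, 6.0, 3.67, 3.67],  # transformative, but hard to execute
}

for name, scores in proposals.items():
    weighted_avg = sum(w * s for w, s in zip(weights, scores))
    dispersion = statistics.pstdev(scores)  # spread of criterion scores
    print(f"{name}: weighted average = {weighted_avg:.2f}, dispersion = {dispersion:.2f}")

# Both proposals have roughly the same weighted average of about 6.0, but the
# "uneven" one pairs high transformative potential with low feasibility - and in
# the study, that's the kind of proposal that gets less hypothetical money.
```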
But we can be even more precise. This bias against proposals with low scores on some dimensions and high scores on others is mostly driven by a particular type of divergence: proposals rated as having high transformative potential but low feasibility tend to be the most penalized. That’s consistent with peer reviewers themselves being a source of bias against novel projects. They can recognize a project is high-risk and high-reward, but when asked which projects to give research funding to, they shy away from them in favor of lower-risk but lower-reward projects.
Note, though, that this data comes from industry scientists, whose risk preferences may differ from those of their academic peers. So interpret with caution. Let’s next turn to some studies specifically about academia.
Random Averages
The previous section was about possible biases among individual reviewers. But most of the time, research proposals are evaluated by multiple reviewers, and then the scores across reviewers are averaged. And that system can introduce different problems.
One way that averaging across reviewers leads to sensitivity to negative reviews is the fact that money for science tends to be tight, which means only research proposals that receive high average scores tend to be funded. If a single negative review can pull your score below this funding threshold, then negative reviews may exert excessive influence.
For example, proposals submitted to the UK’s Economic and Social Research Council (ESRC) are typically scored by 3-4 reviewers on a 6-point scale, and usually only proposals that receive average scores above 4.5 make it to the stage where a panel deliberates on which proposals to fund. Jerrim and de Vries (2020) look at over 4,000 ESRC research proposals made over 2013-2019 and find 81% of proposals with an average score of 5.75-6 from the peer reviewers get funded, but only 24% of proposals with an average score of 4.5-5. That is to say, if you have three reviewers who love a proposal and rate it a maximum 6/6, it’ll be funded 81% of the time, but if you add one more reviewer who hates it and gives it a 1/6, then the average of 4.75 implies it only has a 24% chance of being funded.
Of course, maybe that’s a feature, not a bug, if negative reviews actually do spot serious weaknesses. But before getting into that, we might first ask if this scenario is actually plausible in the first place: could it really be the case that three people rate a project 6/6 and another rates it 1/6? If three people think a project is outstanding, isn't it pretty unlikely that a fourth person would think it’s actually poor? This gets into the question of how consistent peer review scores are with each other, which is itself a large literature. But at least for their sample of ESRC proposals, Jerrim and de Vries find inter-reviewer correlations are very weak. Any particular reviewer’s score is only a tiny bit predictive of their peers’ scores. That means a score of 1/6 is less likely when three other reviewers rate it 6/6 - but not that much less likely than random (though on average only 4% of reviewers give proposals a score of 1/6).
So it is true that one really bad review can substantially reduce the probability of getting funded. But that doesn’t necessarily mean the system isn’t working exactly as it should; perhaps the bad review noticed serious flaws in the proposal that the other reviewers missed? Even so, there are two reasons that this seemingly innocuous procedure (get expert feedback and average it) can lead to excessive risk aversion for a funder.
First, scores are asymmetrically distributed. In Jerrim and de Vries’ data, the average score is 4.4, and more than half of reviews are a 5 or 6. If you believe a proposal is really bad it’s feasible to strongly signal your dislike by giving it a score of 1, which is 3.4 below the average. But if you really love a proposal, it’s hard to signal that with your scoring: the best you can do is give it a 6, which is just 1.6 above the average. When you average out people who really love and really hate a project, the haters have more leverage over the final score.2
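Here’s a quick back-of-the-envelope using just the numbers above (the 1-6 scale and the mean score of about 4.4; the starting panel of three reviewers sitting exactly at the mean is made up for illustration):

```python
# On a 1-6 scale with a mean review around 4.4, a single hater can drag a
# proposal's average down much further than a single fan can pull it up,
# simply because 1 is farther below the mean than 6 is above it.
mean_score = 4.4
baseline_panel = [mean_score] * 3  # three reviewers sitting at the overall mean

def average_with_extra_reviewer(existing_scores, extra_score):
    scores = existing_scores + [extra_score]
    return sum(scores) / len(scores)

avg_with_fan = average_with_extra_reviewer(baseline_panel, 6)    # add one lover
avg_with_hater = average_with_extra_reviewer(baseline_panel, 1)  # add one hater

print(f"baseline average: {mean_score:.2f}")
print(f"add a 6/6 review: {avg_with_fan:.2f} (moves the average by +{avg_with_fan - mean_score:.2f})")
print(f"add a 1/6 review: {avg_with_hater:.2f} (moves the average by {avg_with_hater - mean_score:.2f})")

# The fan lifts the average by 0.40; the hater drags it down by 0.85, more than
# twice as far.
```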
Second, low levels of inter-reviewer correlation imply there’s a lot of randomness in the reviewing process. That could be bad for transformative research proposals, if they are weirder and end up getting more reviews. For example, a proposal that combines ideas from disparate sources might need more reviewers to be adequately vetted, since it would need to pull in reviewers familiar with each of the idea’s sources. That could be a problem because, in general, there will be more variation in the average scores of proposals that receive fewer reviews.
For example, in Jerrim and de Vries’ data, on average about 25% of reviewers rate proposals as 6/6. If you have a proposal that sits squarely in a given research niche, the panel might feel comfortable with just two reviewers from that niche. With two reviewers, the probability you get uniformly outstanding reviews will be on the order of 25% x 25% = 6%. But if you have a proposal that draws on ideas from multiple domains, a panel might want to have more than one reviewer from each of those niches. If you end up with five reviewers, the probability you get uniformly outstanding reviews is on the order of 25% to the fifth power, or 0.1%!3
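Spelling out that arithmetic (treating reviewers as independent, which is an assumption, though one roughly in the spirit of the weak inter-reviewer correlations discussed above):

```python
# If each reviewer independently has about a 25% chance of awarding a 6/6
# (as in Jerrim and de Vries' data), the chance that every reviewer on a panel
# is that enthusiastic falls off quickly as the panel grows.
p_outstanding = 0.25  # probability any one reviewer gives a 6/6

for n_reviewers in [2, 3, 4, 5]:
    p_all_outstanding = p_outstanding ** n_reviewers
    print(f"{n_reviewers} reviewers: P(all give 6/6) = {p_all_outstanding:.3%}")

# 2 reviewers: about 6.3%; 5 reviewers: about 0.1%. A multi-domain proposal that
# needs a bigger panel has a much smaller shot at a uniformly glowing set of reviews.
```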
Contagious Negativity
The above problems show up in the first stage of the ESRC grant process, where you solicit expert feedback and average it. But in the ESRC, there is also a second stage, where a panel comes together to debate the proposals, using the peer review scores as an input into their decision. Something like this deliberation process is also used to disburse most US biomedical grants, where NIH study sections come together to discuss proposals.
Lane et al. (2022) conduct an experiment with the review of grants that documents another way the review process is particularly sensitive to negative feedback. Lane and coauthors help run two different (real) grant competitions for biomedical translational research. Between the two competitions, they received about 100 proposals, which they sent out to over 350 reviewers, who reviewed and scored proposals on a 9-point scale.
After reviewers rate a proposal, the experimental intervention happens: all of the reviewers are given an opportunity to revise their score before final submission. But a subset of the reviewers are additionally shown information about the range of scores given by their peers (though in fact, the ranges shown are experimentally manipulated and not necessarily the true reviewer scores). What Lane and coauthors want to see is how learning about peer scores affects the decision to revise one’s own score.
They find a general tendency for people to revise their scores in the direction of their peers. But this tendency is asymmetric. If you learn your peers all gave higher scores than you, on average people choose to raise their own score by about 0.5 (out of 9 possible points). But if you learn all your peers gave a lower score, on average people choose to lower their score by more like 0.8. This suggests that in an environment where peer reviewers can discuss and share their views, more polarizing proposals will tend to get penalized, as people tend to revise their views down more easily than up. The difference of 0.5 versus 0.8 is not a huge effect, but it’s pretty sticky across lots of different ways of analyzing the data.
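To see why that asymmetry penalizes polarizing proposals, here’s a stylized sketch that borrows only the two revision magnitudes above; the 9-point scores and the even split between fans and detractors are made up, and this is not Lane et al.’s actual experimental design:

```python
# A stylized example: a polarizing proposal splits reviewers evenly between
# fans and detractors. After seeing peer scores, each side revises toward the
# other, but by the asymmetric amounts reported above: +0.5 when your peers
# scored higher than you, -0.8 when they scored lower.
import statistics

fan_scores = [8.0, 8.0, 8.0]        # hypothetical reviewers who love the proposal
detractor_scores = [3.0, 3.0, 3.0]  # hypothetical reviewers who hate it

average_before = statistics.mean(fan_scores + detractor_scores)

revised_fans = [s - 0.8 for s in fan_scores]              # fans see lower peer scores
revised_detractors = [s + 0.5 for s in detractor_scores]  # detractors see higher peer scores

average_after = statistics.mean(revised_fans + revised_detractors)

print(f"average before discussion: {average_before:.2f}")
print(f"average after discussion:  {average_after:.2f}")

# The polarizing proposal drifts down by about 0.15 points, while a proposal
# everyone agreed on wouldn't move at all.
```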
Lots of little biases?
In this post, I looked at three different papers that each illustrate a different way that the typical grant review process might be biased against high-risk, high-reward projects. Reviewers themselves might not like research proposals that are high-risk, high-reward. Averaging peer review scores and using those consensus scores to select which proposals to fund might further bias funding against risky work, if riskier proposals get more polarized reviews or simply more reviews. And if peers openly share and discuss their views, proposals that are more polarizing might see the consensus weighted more to the downside than the upside.
This is not normally the way I like to structure posts for New Things Under the Sun. Normally, I prefer to look more closely at one specific claim, and try to find a set of papers that shed light on it from different angles. For example, this other post looked at yet another potential reason for biases against novel research (the absence of expertise that can confidently vouch for the weirdest ideas), but did so drawing on several different papers. For today’s post, that wasn’t possible, as I couldn’t find different papers shedding light on the same mechanisms for risk aversion in science (though there are probably some out there - if you know any, email me and I’ll update this post).
But there is also a meta-reason for grouping these three distinct mechanisms for biases against novelty together. There is a widespread perception that the way we fund science is too risk averse. How can such a bias survive?
One reason may be that, protests notwithstanding, the people who matter prefer risk aversion. And there might be good reasons for that. For example, maybe long-run political support for funding science is imperiled by excessive embrace of weird research that rarely pans out. That certainly seems possible, and there certainly are people who use weird (to laymen) science grants as a weapon against government spending in general.
But another possibility is that we actually do want to get rid of risk aversion, but we don’t know how. For that possibility to hold, there can’t be some single feature of our grant review processes that obviously and unambiguously leads to excessive risk aversion. If there were, we would just change that process, without waiting for an academic study (much of how we structure research isn’t historically guided by evidence anyway!).
But what if risk aversion stems from myriad little biases that add up? What if each source of bias individually has effects that are too small for intuition and casual observation to detect, though a well-designed study with lots of data would be able to sniff them out? In an environment where we haven’t traditionally designed our scientific institutions with much evidence from the academic study of science, it seems to me that kind of bias could persist a very long time, even if we really did want to excise it.
Thanks for reading! As always, if you want to chat about this post or innovation in general, let’s grab a virtual coffee. Send me an email at matt.clancy@openphilanthropy.org and we’ll put something in the calendar.
For example, Barnett et al. (2018) don’t see any evidence that polarizing research proposals get more citations on average.
Maybe there is greater scope for symmetrical feedback in the written comments, but I’m not so sure. In the data, 25% of people give the maximum score of 6/6, which is meant to correspond to “outstanding.” If those 6/6 scores are accompanied by glowing comments, it may be hard for a truly exceptional proposal to stand out from the crowd, even in its written text.
Using Jerrim and de Vries (2020)’s data on the distribution of reviewer scores and the probability of winning across different average scores, I wrote a quick and dirty simulation that implies the average paper with three reviewers gets funded about 21% of the time, while the average paper with four reviewers gets funded about 18% of the time. Note, that’s the case even if the underlying distribution of individual reviewer scores is the same for both groups. The difference is entirely down to the higher variance in average scores among three-review proposals, as compared to four-review proposals.
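Here is a simplified stand-in for that kind of simulation: the score distribution and the single funding bar below are assumptions chosen for illustration (calibrated loosely to the figures quoted in the post), not Jerrim and de Vries’ empirical funding probabilities, so the exact percentages will differ from the ones above.

```python
# A simplified illustration of the footnote's point: with an identical
# per-review score distribution, averages over three reviews are noisier than
# averages over four, so more three-review proposals clear a high funding bar.
# The distribution and the bar are assumptions, not the paper's empirical values.
import random
import statistics

SCORES = [1, 2, 3, 4, 5, 6]
SCORE_PROBS = [0.04, 0.09, 0.12, 0.17, 0.33, 0.25]  # assumed; mean is about 4.4
FUNDING_BAR = 5.0  # illustrative threshold, well above the mean score

def simulate(n_reviewers, n_proposals=200_000):
    averages = []
    for _ in range(n_proposals):
        scores = random.choices(SCORES, weights=SCORE_PROBS, k=n_reviewers)
        averages.append(sum(scores) / n_reviewers)
    spread = statistics.stdev(averages)
    share_funded = sum(avg >= FUNDING_BAR for avg in averages) / n_proposals
    return spread, share_funded

for n in (3, 4):
    spread, share_funded = simulate(n)
    print(f"{n} reviewers: std of average scores = {spread:.2f}, share clearing the bar = {share_funded:.1%}")

# The three-review averages are more spread out, and a larger share of them land
# above the bar, even though each individual review is drawn from the same
# distribution in both cases.
```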