How a field fixes itself: the applied turn in economics

  

Getting an academic field to change its ways is hard. Independent scientists can’t just go rogue - they need research funding, they need to publish, and they need (or want) tenure. All those things require convincing a community of your peers to accept the merit of your research agenda. And yet, humans are humans. They can be biased. And if they are biased towards the status quo, it can be hard for change to take root. 

But it does happen. And I think changes in the field of economics are a good illustration of some of the dynamics that make that possible.

The Applied Turn in Economics

In 1983, economist Edward Leamer, writing about empirical work in economics, said:

Hardly anyone takes data analysis seriously. Or perhaps more accurately, hardly anyone takes anyone else’s data analysis seriously.

But in subsequent decades, economics made a famous “applied turn.” The share of empirical papers in three top journals rose from a low of 38% in the year of Leamer’s paper to 72% by 2011! The field also began giving its top awards - the economics Nobel and the John Bates Clark Medal for the best economist under 40 - to empirical economists. These new empirical papers were characterized by careful attention to credibly detecting causation (rather than mere correlation).

This applied turn seems to have worked out pretty well. In 2010, Joshua Angrist and Jorn-Steffen Pischke wrote a popular follow-up to Leamer’s article titled The Credibility Revolution in Empirical Economics: How Better Research Design is Taking the Con out of Econometrics. The title is an apt summary of the argument. More quantitatively, a 2020 paper by Angrist et al. tried to assess how this empirical work is received inside and outside of economics by looking at citations to different kinds of articles. Consistent with Leamer’s complaints, when looking at total citations received by papers, Angrist et al. find empirical papers faced a significant citation penalty prior to the mid-1990s.

When they restrict their attention to citations of economics papers by non-econ social science papers, they find empirical papers now benefit from a citation premium, especially since the 2000s. This suggests the applied turn has been viewed favorably, even by academics who are outside of economics. 

Elsewhere, Alvaro de Menard, a participant in an experiment to see if betting markets could predict whether papers will replicate, produced this figure summarizing the market’s views on the likelihood of replication for different fields (higher is more likely to replicate).

Among the social sciences, economics is perceived to be the most likely to replicate (though note the mean likelihood of replication is still below 70%!). I should also add at this point that if you are a skeptic about the validity of modern economics research, I think you can still get a lot out of the rest of this post, since it’s about how a field comes to embrace new methods, not so much whether those new methods “really” are better.

So. How did this remarkable turn-around happen? 

Changing a Field is Hard

Changing a field is hard. Advancing in academia is all about convincing your peers that you do good work. The people you need to convince are humans, and humans are biased. In particular, they may be biased, to a greater or lesser degree, towards the methods and frameworks that are already prevalent in the field.

Bias might reflect the cynical attitudes of researchers who have given up on advancing truth and knowledge, and just want to protect their turf. But bias can also emerge from other motives. Maybe biased researchers simply haven’t been trained in the new method and don’t appreciate its value. Or maybe they choose to value different aspects of methods, like interpretability over rigor. Or maybe there really is disagreement over the value of a new method - even today there is lots of debate about the proper role of experimental methods in economics. Lastly, it may just be that people subconsciously evaluate their own arguments more favorably than those advanced by other people.

Akerlof and Michaillat have a nice little 2018 paper on these dynamics that shows how even a bit of bias can keep a field stuck in a bad paradigm. Suppose there are two paradigms, an old one and a new (and better) one. How does this new paradigm take root and spread? Assume scientists are trained in one paradigm or the other and then have to convince their peers that their work is good enough to get tenure. If they get tenure, they go on to train the next generation of scientists in their paradigm.

Research is risky, and whatever paradigm they are from, it’s uncertain whether scientists will get tenure. The probability they get tenure depends on the paradigm of the person evaluating them (for simplicity, let’s assume every untenured scientist is evaluated by a single evaluator).

The probability a candidate gets tenure depends on both their own paradigm and their evaluator’s:

  • New paradigm candidate, new paradigm evaluator: 65%

  • New paradigm candidate, old paradigm evaluator: 35%

  • Old paradigm candidate, old paradigm evaluator: 39%

  • Old paradigm candidate, new paradigm evaluator: 21%

In this example, the new paradigm is better, in the sense that an unbiased evaluator would give one of its adherents tenure more often than someone trained in the old paradigm (50% of the time, versus 30%). But in this model, people are biased. Not 100% biased, in the sense that they will only accept work done in their own paradigm. Instead, they are 30% biased: they apply a proportional 30% penalty to anyone from the other paradigm and a proportional 30% bonus to anyone from their own (multiplying the unbiased tenure probability by 0.7 or 1.3). This means the old paradigm evaluators still favor candidates from their own field, but not by much (39% versus 35%). On the other hand, the new paradigm people are also biased, and since their paradigm really is better, the two effects compound and they are much more likely to grant tenure to people in their own paradigm (65% vs 21%).

What’s the fate of the new paradigm? It depends.

Suppose the new paradigm has only been embraced by 5% of the old guard, and 5% of the untenured scientists. If scientists cannot choose who evaluates them, and instead get matched with a random older scientist to serve as an evaluator, then in the first generation:

  • 95% of the new paradigm scientists are evaluated by old paradigm scientists, and only 35% of them are granted tenure

  • 5% of the new paradigm scientists are evaluated by new paradigm scientists and 65% of them get tenure

  • 95% of the old paradigm scientists are evaluated by old paradigm-ers, and 39% of them get tenure

  • 5% of the old paradigm scientists are evaluated by new paradigm-ers and only 21% of these get tenure. 

Thus, the new paradigm scientists who get tenure amount to just 1.8% of the original untenured cohort (5% x (0.95 x 0.35 + 0.05 x 0.65)), while the old paradigm scientists who get tenure amount to 36.2% (95% x (0.95 x 0.39 + 0.05 x 0.21)). All told, 38% of the untenured scientists get tenure. They become the new “old guard” who will train and then evaluate the next generation of scientists.

In this second generation of tenured scientists, 4.8% belong to the new paradigm and 95.2% belong to the old paradigm. That is, the share of scientists in the new paradigm has shrunk. If you repeat this exercise starting with 4.8% of scientists in the new paradigm, the new paradigm fares even worse, because they are less likely to receive favorable evaluators than the previous generation. Their share shrinks to 4.6% in the third generation; then 4.4% in the fourth; and so on, down to less than 0.1% of the population by the 25th generation.
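
For readers who like to see the mechanics, here is a minimal sketch of these dynamics in code. The numbers are taken from the example above; the function name and the choice to match each candidate to a single random evaluator simply restate the setup in this post, and are not Akerlof and Michaillat’s actual code.

```python
def next_share(share, p_new=0.50, p_old=0.30, bias=0.30):
    """Share of the next tenured generation in the new paradigm, given the
    current share, when each candidate is judged by one randomly drawn
    evaluator who applies a proportional bonus/penalty of `bias`."""
    # Expected tenure rate for each type of candidate, averaging over
    # whether their randomly assigned evaluator shares their paradigm.
    tenure_new = share * p_new * (1 + bias) + (1 - share) * p_new * (1 - bias)
    tenure_old = share * p_old * (1 - bias) + (1 - share) * p_old * (1 + bias)
    newly_tenured_new = share * tenure_new        # e.g. 5% x 0.365 = 1.8%
    newly_tenured_old = (1 - share) * tenure_old  # e.g. 95% x 0.381 = 36.2%
    return newly_tenured_new / (newly_tenured_new + newly_tenured_old)

share = 0.05
for generation in range(5):
    share = next_share(share)
    print(f"{share:.3f}")   # 0.048, 0.046, 0.044, ... the new paradigm dwindles

share = 0.10                # starting above the ~8.3% threshold instead
for generation in range(5):
    share = next_share(share)
    print(f"{share:.3f}")   # now the share grows each generation
```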

Escaping the Trap

In contrast, if the share of scientists adopting the new paradigm had been a bit bigger - 10% instead of 5%, for example - then in every generation the share of new paradigm scientists would get a bit bigger. That would make it progressively easier to get tenure if you are a new paradigm scientist, since it becomes increasingly likely you’ll be evaluated by someone favorable to your methods.

Akerlof and Michaillat show that if a new paradigm is going to take root, it needs to get itself above a certain threshold. In the example I just gave, the threshold is 8.3%. In general, this threshold is determined by just two factors: the superiority of the new paradigm and the extent of bias. The better the new paradigm and the smaller the bias, the lower this threshold, and therefore the smaller the share of scientists that needs to be “converted” to the new paradigm for it to take root.

The applied turn in economics looks reasonably good on these two factors.

Let’s talk about the new paradigm’s superiority first. Here, we are talking about the extent to which the new paradigm really is better than the old one, as judged by an unbiased evaluator. In general, the harder it is to deny the new paradigm is better, the better the outlook for the new paradigm. In my example above, if the new paradigm earns tenure from an unbiased evaluator 55% of the time instead of 50%, that is enough (starting from the same 5% share) to outweigh the 30% bias penalty: the new paradigm will grow in each generation.

One way to establish a paradigm’s superiority is if it is able to answer questions or resolve anomalies that the prior paradigm could not (this is closely related to Kuhn’s classic model of paradigms in science). Arguably, this was the case with the new quasi-experimental and experimental methods in economics. As retold in Angrist and Pischke’s “The Credibility Revolution,” a 1986 paper by LaLonde was quite influential.

LaLonde had data on an actual economic experiment: in the mid-1970s, a national job training program was randomly given to some qualified applicants and not to others (who were left to fend for themselves). Comparing outcomes in the treated and control groups indicated the program raised incomes by a bit under $900 in 1979. In contrast, economics’ older methods produced highly variable results, with some estimates off by more than $1,000 in either direction.

The second factor that matters in Akerlof and Michaillat’s model is the extent of bias. If the bias in this example were 20% instead of 30%, the new paradigm would gain more converts in each generation, further raising the tenure prospects for new paradigm-ers.

I suspect certain elements of quasi-experimental methods also helped reduce the bias these methods faced when they were taking root in economics. It’s probably important here that economics has a long history of thinking of experiments as an ideal but regrettably unattainable method. When some people showed that these methods were indeed usable, even in an economic context, it was easier to accept them than it would have been if the field had long insisted experiments were not valuable in the first place. Moreover, some quasi-experimental ideas could easily be understood in terms of long-standing empirical approaches in economics (like differentiating between exogenous and endogenous variables).

So, given all that, the applied turn in economics probably had a relatively low bar to clear. But is that the whole story? I don’t think so, for two reasons.

First, this version of the story is really flattering to economists. Basically, it says the applied turn happened because economists were not that biased against this kind of paradigm shift, and because good work convinced them of its value. But we should be skeptical of explanations that flatter our self-image.

Second, even supposing the threshold economics needed to clear was a small one, there was still a bar to clear: how did economics get above it? In any new paradigm, there will tend to be a very small number of people who initially adopt it. How does this microscopic group get above the threshold?

One possibility is simple luck. This is a bit obscured in my presentation; the dynamics I describe are only the averages. The new paradigm lot could get lucky and be assigned to a greater share of new paradigm tenured evaluators than the average, or they might catch a few breaks from old paradigm evaluators. If a bit of luck pushes them over a crucial threshold (in this case, 8.3%), then they can expect to keep growing in each generation. Luck is especially important when fields are small.

In economics, the important role of a single economist, Orley Ashenfelter, is often highlighted. Panhans and Singleton wrote a short history of the rise of experimental methods in economics, noting:

...Ashenfelter eventually returned to a faculty position at Princeton. In addition to his promoting the use of quasi-experimental methods in economics through both his research and his role as editor of the American Economic Review beginning in 1985, he also advised some of the most influential practitioners who followed in this tradition, including 1995 John Bates Clark Medal winner David Card and, another of the foremost promoters of quasi-experimental methods, Joshua Angrist.

We will return to Ashenfelter in a bit. But if he had not been in this position at this time (editor of the flagship journal in economics and an enthusiastic proponent of the new methods), it’s possible that things might have turned out differently for the applied turn.

All of these factors are probably part of the story of the applied turn in economics. But a 2020 preprint by O’Connor and Smaldino suggests a fourth factor that I think also turned out to be very important.

O’Connor and Smaldino suggest interdisciplinarity can also be an avenue for new paradigms to take root in a field. The intuition is simple: start with a model like Akerlof and Michaillat’s, but assume there is also some probability that you will be evaluated by someone from another field. This doesn’t have to be a tenure decision - it could also represent peer review from outside your discipline that helps build a portfolio of published research.

If that other field has embraced the new paradigm, then this introduces a new penalty for those using the old paradigm, since they get dinged anytime they are reviewed by an outsider, and it raises the payoff to adopting the new paradigm, since its adherents benefit anytime an outsider reviews them. If we add a 10% probability that you are reviewed by someone from another field that has completely embraced the new paradigm, that is also enough to ensure the share of new paradigm scientists grows in each generation.
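
To see O’Connor and Smaldino’s point in the same toy setup as before, we can give every candidate some chance of drawing an outside evaluator from a field that has fully embraced the new paradigm. This is my own adaptation of the earlier sketch, not their actual model; the 10% figure and the tenure probabilities come from the examples in this post.

```python
def next_share_with_outsiders(share, p_new=0.50, p_old=0.30, bias=0.30,
                              outside_prob=0.10):
    """Like the earlier sketch, but with probability `outside_prob` the
    evaluator comes from another field that has fully adopted the new
    paradigm (and so treats new-paradigm candidates as its own)."""
    inside = 1 - outside_prob
    tenure_new = (inside * (share * p_new * (1 + bias)
                            + (1 - share) * p_new * (1 - bias))
                  + outside_prob * p_new * (1 + bias))
    tenure_old = (inside * (share * p_old * (1 - bias)
                            + (1 - share) * p_old * (1 + bias))
                  + outside_prob * p_old * (1 - bias))
    newly_new = share * tenure_new
    newly_old = (1 - share) * tenure_old
    return newly_new / (newly_new + newly_old)

share = 0.05
for generation in range(5):
    share = next_share_with_outsiders(share)
    print(f"{share:.3f}")   # the 5% minority now grows, instead of shrinking
```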

Outside Reviewers in Economics

At first glance, this wouldn’t seem relevant. Economics has a reputation for being highly insular, with publication in one of five top economics journals a prerequisite for tenure in many departments. When Leamer was writing in 1983, well under 5% of citations made in economics articles were to other social science journals (a share that compares unfavorably to the other social sciences). Even if there was another field using the same quasi-experimental methods as would later become popular in economics, its members weren’t going to be serving as reviewers for many economics articles.

But economics is unusual in having a relatively large non-academic audience for its work: the policy world. Economists testify before Congress at twice the rate of all other social scientists combined, and this was especially true in the period Leamer was writing. I think policy-makers of various stripes played a role analogous to the one O’Connor and Smaldino argue can be played by other academic disciplines.

The quasi-experimental methods that came to prominence in the applied turn began in applied microeconomics, more specifically in the world of policy evaluation. Policy-makers had a need to evaluate different policies, and experimental methods were viewed as more credible by this group than existing alternatives (such as assuming a model and then estimating its parameters). Panhans and Singleton’s 2017 history highlights these roots in policy evaluation:

The late 1960s and early 1970s saw field experiments gain a new role in social policy, with a series of income maintenance (or negative income tax) experiments conducted by the US federal government. The New Jersey experiment from 1968 to 1972 “was the first large-scale attempt to test a policy initiative by randomly assigning individuals to alternative programs” (Munnell 1986). Another touchstone is the RAND Health Insurance Experiment, started in 1971, which lasted for fifteen years.

This takes us back to Orley Ashenfelter. Upon graduating with a PhD from Princeton in 1970, Ashenfelter was appointed director of the Office of Evaluation at the US Department of Labor. As Panhans and Singleton write:

[Ashenfelter] recalled about his time on the project using difference-in-differences to evaluate job training that “a key reason why this procedure was so attractive to a bureaucrat in Washington, D.C., was that it was a transparent method that did not require elaborate explanation and was therefore an extremely credible way to report the results of what, in fact, was a complicated and difficult study” (Ashenfelter 2014, 576). He continued: “It was meant, in short, not to be a method, but instead a way to display the results of a complex data analysis in a transparent and credible fashion.” Thus, as policymakers demanded evaluations of government programs, the quasi-experimental toolkit became an appealing (and low cost) way to provide simple yet satisfying answers to pressing questions.

This wasn’t the only place that the economics profession turned to quasi-experimental methods to satisfy skeptical policy-makers. In another overview of the applied turn in economics, Backhouse and Cherrier write:

...in 1981 Reagan made plans to slash the Social Sciences NSF budget by 75 percent, forcing economists to spell out the social and policy benefits of their work more clearly. Lobbying was intense and difficult. Kyu Sang Lee (2016) relates how the market organization working group, led by Stanley Reiter, singled out a recent experiment involving the Walker mechanism for allocation of a public good as the most promising example of policy-relevant economic research.

So it seems at least part of the success of the applied turn in economics was due to the existence of a group outside the academic field who favored work in the new “paradigm,” and that this allowed the methods to get a critical toehold in the academic realm.

Through the 1990s and 2000s, the share of articles using quasi-experimental terms rose, led by applied microeconomics. But by the 2010s, the experimental method also began to rise rapidly in another field: economic development. This too was a story about economics’ contact with the policy world, albeit a different set of policy-makers.

Economic Development and the Rise of the RCT

de Souza Leão and Eyal (2019) also argue the rise of a key new method in development economics - the randomized controlled trial (RCT) - was not the inevitable result of the method’s inherent superiority. They point out that the current enthusiasm for RCTs in international development is actually the second such wave of enthusiasm, after an earlier wave dissipated in the early 1980s.

The first wave of RCTs in international development occurred in the 1960s-1980s, and was largely led by large government funders engaging in multi-year evaluations of big projects or agencies. In this first wave, experimental work was not in fashion in academic economics, and experiments were instead run primarily by public health, medicine, and other social scientists, as well as non-academics. Enthusiasm for this approach waned for a variety of reasons. The length and scale of the projects heightened concerns about the ethics of giving some groups access to a potentially beneficial program and not others. Moreover, this criticism was particularly acute when directed at government funders, who are necessarily responsive to political considerations, and who it could be argued have a special duty to provide universal access to potentially beneficial programs. The upshot was that a lot of experiments were not run as initially planned (after political interference), and a lot of time and money was spent on evaluations that weren’t very informative.

The second wave, on the other hand, shows no sign of slowing down. It began after the international development community fragmented following the breakdown of the Washington consensus. International NGOs and philanthropists were a new source of funding for experiments. Compared to governments, international NGOs and philanthropists were more insulated from arguments about the ethics of offering an intervention to only part of a population. They did not claim to represent the general will of the population, and budget constraints usually meant that universal access was infeasible anyway. Moreover, these interventions tended to be shorter and smaller, which also tended to blunt the argument that experiments were unethical. (Though critiques of the ethics of RCTs in economics remain quite common.)

On the economists’ side, an appetite for RCT methods was by this time somewhat established, thanks to their start in applied microeconomics. Moreover, economists were content to study small or even micro interventions, because of a belief that economic theory provided a framework that would allow a large number of small experiments to add up to more than the sum of their parts. Whereas the first wave of RCTs was conducted by a wide variety of academic disciplines, this second wave is dominated by economists. This also creates a critical mass in the field, where economists using RCTs can be confident their work will find a sympathetic audience among their peers.

How to Change a Field

So, to sum up, it’s hard to change a field if people are biased against that change, since any change necessarily has a small number of adherents at the outset. One way change can happen, though, is if an outside group sympathetic to the innovation provides space for its adherents to grow. Above a threshold, the change can be self-perpetuating. In economics, the rise of quasi-experimental methods probably owes something to the policy world, which liked these methods and allowed them to take root. It was also important, however, that these methods credibly established their utility in a few key circumstances, and that they could be framed in a way that was consistent with earlier work.


The next newsletter (February 2) will be about one reason cities are so innovative: because they create encounters between people who would not normally meet.

If you enjoyed this post, you might also enjoy:

Does science self-correct? (evidence from retractions)

Does chasing citations lead to bad science? (sometimes yes, in general no)

How bad is publish-or-perish for the quality of science? (it’s not great)

How useful are learning curves, really?

  

In economic models of “learning-by-doing,” technological progress is an incidental outcome of production: the more a firm or worker does something, the better they get at it. In its stronger form, the idea is formalized as a “learning curve” (also sometimes called an experience curve or Wright’s Law), which asserts that every doubling of total experience leads to a consistent decline in per-unit production costs. For example, every doubling of the cumulative number of solar panels installed is associated with a 20% decline in their cost, as illustrated in the striking figure below (note both axes are in logs).
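
Written as a formula (my notation, not drawn from any particular paper), a learning curve says unit cost $c$ falls as a power law in cumulative output $X$:

$$c(X) = c_0 \left(\frac{X}{X_0}\right)^{-b},$$

so that each doubling of cumulative output multiplies cost by $2^{-b}$; a 20% decline per doubling corresponds to $2^{-b} = 0.8$, or $b \approx 0.32$.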

Learning curves have an important implication: if we want to lower the cost of a new technology, we should increase production. The implications for combating climate change are particularly important: learning curves imply we can make renewable energy cheap by scaling up currently existing technologies.

But are learning curves true? The linear relationship between log costs and log experience seems to be compelling evidence in their favor - it is exactly what learning-by-doing predicts. And similar log-linear relationships are observed in dozens of industries, suggesting learning-by-doing is a universal characteristic of the innovation process.

But what if we’re wrong?

But let’s suppose, for the sake of argument, the idea is completely wrong and there is actually no relationship between cost reductions and cumulative experience. Instead, let’s assume there is simply a steady exponential decline in the unit costs of solar panels: 20% every two years. This decline is driven by some other factor that has nothing to do with cumulative experience. It could be R&D conducted by the firms; it could be advances in basic science; it could be knowledge spillovers from other industries, etc. Whatever it is, let’s assume it leads to a steady 20% cost reduction every two years, no matter how much experience the industry has.

Let’s assume it’s 1976 and this industry is producing 0.2 MW every two years, and that total cumulative experience is 0.4 MW. This industry faces a demand curve - the lower the price, the higher the demand. Specifically, let’s assume every 20% reduction in the price leads to a doubling of demand. Lastly, let’s assume cost reductions are proportionally passed through to prices.

How does this industry evolve over time?

In 1978, costs and prices drop 20%, as they do every two years. The decline in price leads demand to double to 0.4 MW over the next two years. Cumulative experience has doubled from 0.4 to 0.8 MW.

In 1980, costs and prices drop 20% again. The decline in price leads demand to double to 0.8 MW over the next two years. Cumulative experience has doubled from 0.8 MW to 1.6 MW.

In 1982… you get the point. Every two years, costs decline 20% and cumulative experience doubles. If we were to graph the results, we end up with the following:

In this industry, every time cumulative output doubles, costs fall 20%. The result is the same kind of log-linear relationship between cumulative experience and cost as would be predicted by a learning curve.
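
As a sanity check on the arithmetic, here is a minimal simulation of this hypothetical industry, using only the numbers from the example above (the starting cost of 1.0 is an arbitrary normalization). Plotting log cost against log cumulative output gives a straight line with slope ln(0.8)/ln(2), about -0.32: costs fall 20% per doubling of cumulative output, just as a learning curve would predict.

```python
import numpy as np

cost, output, cumulative = 1.0, 0.2, 0.4   # 1976: normalized cost, MW per 2 years, MW total
history = []
for year in range(1976, 2001, 2):
    history.append((year, cost, cumulative))
    cost *= 0.8           # exogenous 20% cost (and price) decline every two years
    output *= 2.0         # every 20% price decline doubles demand
    cumulative += output  # experience accumulates as a consequence, not a cause

years, costs, cums = map(np.array, zip(*history))
slope = np.polyfit(np.log(cums), np.log(costs), 1)[0]
print(f"slope of log cost vs log cumulative output: {slope:.2f}")  # about -0.32
```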

But in this case, the causality is reversed - it is price reductions that lead to increases in demand and production, not the other way around. Importantly, that means the policy implications from learning curves do not hold. If we want to lower the costs of renewable energy, scaling up production of existing technologies will not work. 

This point goes well beyond the specific example I just devised. For any demand curve with a constant price elasticity, it can be shown that constant exponential progress yields the same log-linear relationship predicted by a learning curve. And even when demand doesn’t have constant elasticity, you frequently get something that looks pretty close to a log-linear relationship, especially if there is a bit of noise.
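
A quick sketch of the algebra behind that claim (my notation again): suppose unit cost falls at a constant exponential rate $g$ for reasons unrelated to experience, and demand has constant price elasticity $\varepsilon$, with cost reductions passed through to prices. Then

$$c(t) = c_0 e^{-gt}, \qquad q(t) \propto c(t)^{-\varepsilon} \propto e^{\varepsilon g t},$$

so cumulative output $X(t) = \int_0^t q(s)\,ds$ also grows roughly like $e^{\varepsilon g t}$. Substituting $t \approx \ln X / (\varepsilon g)$ (up to a constant) back into the cost equation gives

$$\ln c \approx \text{const} - \frac{1}{\varepsilon} \ln X,$$

a log-linear “learning curve” with slope $-1/\varepsilon$, even though experience plays no causal role in reducing costs.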

But ok; progress is probably not completely unrelated to experience. What if progress is actually a mix of learning curve effects and constant annual progress? Nordhaus (2014) models this situation, and also throws in growth of demand over time (which we might expect if the population and income are both rising). He shows you’ll still get a constant log-linear relationship between cost and cumulative experience, but now the slope of the line in such a figure is a mix of all these different factors.

In principle, there is a way to solve this problem. If progress happens both due to cumulative experience and due to the passage of time, then you can just run a regression where you include both time and total experience as explanatory variables for cost. To the extent experience varies differently from time, you can separately identify the relative contribution of each effect in a regression model. Voila!
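
Here is a minimal sketch of that regression on made-up data, where (by construction) production booms and busts so that experience is not just a smooth function of time, and where the “true” learning and time effects are set to -0.20 and -0.03. With that kind of variation, ordinary least squares can roughly recover both effects; the point of the next paragraph is that real data rarely cooperates this way.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
years = np.arange(1980, 2020)
# Made-up output path that is deliberately lumpy: a boom, a near-shutdown, another
# boom, so that cumulative experience and calendar time diverge.
output = np.where((years < 1995) | (years >= 2008), 10.0, 0.5)
experience = np.cumsum(output)

true_learning, true_time_trend = -0.20, -0.03     # assumed "true" effects
log_cost = (true_learning * np.log(experience)
            + true_time_trend * (years - years[0])
            + rng.normal(0, 0.02, len(years)))    # a little noise

df = pd.DataFrame({"log_cost": log_cost,
                   "log_experience": np.log(experience),
                   "year": years - years[0]})
fit = smf.ols("log_cost ~ log_experience + year", data=df).fit()
print(fit.params)   # roughly recovers -0.20 on log_experience and -0.03 on year
```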

But the trouble is precisely that, in actual data, experience does not tend to vary independently of time. Most markets tend to grow at a steady exponential rate, and even when they don’t, their cumulative output tends to grow smoothly with time. This point is made pretty starkly by Lafond et al. (2018), who analyze real data on 51 different products and technologies. For each case, they use a subset of the data to estimate one of two forecasting models: one based on learning curves, one based on constant annual progress. They then use the estimated model to forecast cost for the rest of the data and compare the accuracy of the methods. In the majority of cases, the two perform extremely similarly.

To take one illustrative example, the figure below forecasts solar panel costs out to 2024. Beginning in 2015 or so, the dashed line is their forecast and confidence interval for a model assuming constant technological progress (which they call Moore’s law). The red lines are their forecasts and confidence intervals for a model assuming learning-by-doing (which they call Wright’s law). The two forecasts are nearly identical. 

So the main point, so far, is that a consistent decline in costs whenever cumulative output doubles is not particularly strong evidence for learning curves. Progress could be 100% due to learning, 100% due to other factors, or any mix of the two, and you will tend to get a result that looks the same.

But that doesn’t mean learning curves are not true - only that we need to look for different evidence.

A theoretical case for learning curves

One reason I think learning curves are sort-of true is that they just match our intuitions about technology. We have a sense that young technologies make rapid advances and mature ones do not. This is well “explained” by learning curves. By definition, firms do not have much experience with young technologies; therefore it is relatively easy to double your experience. Progress is rapid. For mature technologies, firms have extensive experience, and therefore achieving a doubling of total historical output takes a long time. Progress is slow.

There is a bit of an issue of survivor bias here. Young technologies that do not succeed in lowering their costs never become mature technologies. They just become forgotten. So when we look around at the technologies in widespread use today, they tend to be ones that successfully reduced cost until they were cheap enough to find a mass market, at which point progress plateaued. All along the way, production also increased since demand rises when prices fall. (Even here, it’s possible to think of counter-examples: we’ve been growing corn for hundreds of years, yet yields go up pretty consistently decade on decade)

But even acknowledging survivor bias, I think learning-by-doing remains intuitive. Young technologies have a lot of things that can be improved. If there’s a bit of experimentation that goes alongside production, then many of those experiments will yield improvements simply because we haven’t tried many of them before. Mature technologies, on the other hand, have been optimized for so long that experimentation is rarely going to find an improvement that’s been missed all this time.

There’s even a theory paper that formalizes this idea. Auerswald, Kauffman, Lobo, and Shell (2000) apply models drawn from biological evolution to technological progress. In their model, production processes are represented abstractly as a list of sub-processes. Every one of these sub-processes has a productivity attached to it, drawn from a random distribution. The productivity of the entire technology (i.e., how much output the technology generates per worker) is just the sum of the productivities of all the sub-processes. For instance, in their baseline model, a technology has 100 sub-processes, each with a productivity ranging from 0 to 0.01, so that the productivity of the entire technology ranges from 0 to 1.

In their paper, firms use these technologies to produce a fixed amount of output every period. This bypasses the problem highlighted in the previous section, where lower costs lead to increased production - here production is always the same each period, and is therefore unrelated to cost. As firms produce, they also do a bit of experimentation, changing one or more of their sub-processes at a constant rate. When a change results in an increase in overall productivity, the updated technology gets rolled out to the entire production process next period, and experimentation continues from this new point.

In this way, production “evolves” towards ever higher productivity and ever lower costs. What’s actually happening is that when a production process is “born,” the productivities of all of its sub-processes are drawn at random, so they are all over the map: some high, some low, most average. If you choose a sub-process at random, in expectation its productivity will just be the mean of the random distribution, and so if you change it there is a 50:50 shot that the change will be for the better. So progress is fairly rapid at first.

But since you only keep changes that result in net improvements, the productivity of all the sub-processes gets pulled up as production proceeds. As the technology improves, it gets rarer and rarer that a change to a sub-process leads to an improvement. So tinkering with the production process yields an improvement less and less often. Eventually, you discover the best way to do every sub-process, and then there’s no more scope to improve.

But even though this model gives you progress that gets harder over time, it actually does not generate a learning curve, where a doubling of cumulative output generates a constant proportional increase in productivity. Instead, you get something like the following figure:

To get a figure that has a linear relationship between the log of cumulative output and the log of costs, the authors instead assume (realistically, in my view) that production is complex and sub-processes are interrelated. In their baseline model, every time you change one subprocess, the productivity of four other sub-processes is also redrawn.

In this kind of model, you do observe something like a learning curve. This seems to be because interdependence changes the rate of progress such that it speeds up in early stages and slows down in later ones. The rate of progress is faster at the outset, because every time you change one subprocess, you actually change the productivity of multiple subprocesses that interact with it. Since these changes are more likely to be improvements at the outset, that leads to faster progress when the technology is young, because you can change lots of things at once for the better.

But when a technology matures, the rate of progress slows. Suppose you have a fairly good production process, where most of the sub-processes have high productivity, but there are still some with low productivity. If you were to tinker with one of the low-productivity sub-processes, it’s pretty likely you’ll discover an improvement. But, you can’t just tinker with that one. If you make a change to the one, it will also lead to a change in several other sub-processes. And most of those are likely to be high-productivity. Which means any gains you make on the low-productivity sub-process will probably be offset by declines in the productivity of other ones with which it interacts. 

When you add in these interdependencies between sub-processes, their model generates figures like the following. For much of their life, they look quite a lot like learning curves. (And remember, this is generated with constant demand every period, regardless of cost)
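
Here is a rough sketch of this kind of model in code. It is my own stripped-down version of the idea, not the authors’ code: each period the firm produces a fixed amount of output, tries one random change to a sub-process, and keeps the change only if total productivity rises. Setting `k_spillover` above zero redraws that many other sub-processes along with each change, which is the interdependence just described.

```python
import numpy as np

def simulate(n_sub=100, k_spillover=0, n_periods=20000, output_per_period=1.0, seed=0):
    """Evolutionary search over sub-process productivities, loosely in the
    spirit of Auerswald, Kauffman, Lobo, and Shell (2000)."""
    rng = np.random.default_rng(seed)
    prod = rng.uniform(0, 0.01, n_sub)     # productivity of each sub-process
    cumulative, history = 0.0, []
    for _ in range(n_periods):
        cumulative += output_per_period    # demand is fixed, regardless of cost
        candidate = prod.copy()
        i = rng.integers(n_sub)
        candidate[i] = rng.uniform(0, 0.01)
        if k_spillover:
            others = rng.choice([j for j in range(n_sub) if j != i],
                                size=k_spillover, replace=False)
            candidate[others] = rng.uniform(0, 0.01, k_spillover)
        if candidate.sum() > prod.sum():   # keep only net improvements
            prod = candidate
        history.append((cumulative, 1.0 / prod.sum()))  # unit "cost" ~ 1 / productivity
    return history

independent = simulate(k_spillover=0)      # progress, but not a learning curve
interdependent = simulate(k_spillover=4)   # roughly log-linear for much of its life
```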

What’s encouraging is that the story Auerswald, Kauffman, Lobo, and Shell are telling sounds quite applicable to many real technologies. In lots of technologies there are many sub-components or sub-processes that can be changed, changes may result in improvements or deterioration, and changing one thing frequently affects other sub-components. If you go about this somewhat randomly, you can get something that looks like a learning curve.

Evidence from an auto plant

Another paper, by Levitt, List, and Syverson (2013), uses a wealth of data from an automobile assembly plant to document exactly the kind of learning from experience and experimentation that undergirds learning curves. The paper follows the first year of operation of an auto assembly plant at an unnamed major automaker. Their observations begin after several major changes: the plant went through a major redesign, the firm introduced a new team-based production process, and the vehicle model platform had its first major redesign in six years. Rather than focus on the cost of assembling a car, the paper measures the decline in production defects as production ramps up.

Levitt, List, and Syverson observe a rapid reduction in the number of defects at first, when production is still in early days, followed by a slower rate of decline as production ramps up. Consistent with the learning curve model, the relationship between the log of the defect rate and the log of cumulative production is linear.

Learning-by-doing really makes sense in this context. Levitt, List and Syverson provide some concrete examples of what exactly is being learned in the production process. In one instance, two adjacent components occasionally did not fit together well. As workers and auditors noticed this, they tracked the problem to high variance in the shape of one molded component. By slightly adjusting the chemical composition of the plastic fed into the mold, this variability was eliminated and the problem solved. In another instance, an interior part was sometimes not bolted down sufficiently. In this case, the problem was solved by modifying the assembly procedure for those particular line workers, and adding an additional step for other workers to check the bolt. It seems reasonable to think of these changes as being analogous to changing subprocesses, each of which can be potentially improved and where changes in one process may affect the efficacy of others.

Levitt, List, and Syverson also show that this learning becomes embodied in the firm’s procedures, rather than in the skill sets of the individual workers. Midyear, the plant began running a second shift, and on the second shift’s first full day (after a week of training), its defect rate was identical to that of the first shift’s workers.

This is a particularly nice context to study because there were no major changes to the plant’s production technology during the period under review. There were no newly designed industrial robots installed midway through the year, nor scientific breakthroughs that allowed the workers to be more efficient. It really does seem like what changed over the year was that the plant learned to optimize a fixed production technology.

An Experiment

So we have a bit of theory that shows how learning curves can arise, and we’ve got one detailed case study that seems to match up with the theory pretty well. But it would be really nice if we had experimental data. If we were going to test the learning curve model and we had unlimited resources, the ideal experiment would be to pick a bunch of technologies at random and massively ramp up their production, and then to compare the evolution of their costs to a control set. Better yet, we would ramp up production at different rates, and in a way uncorrelated with time, for example, raising production by a lot but then shutting it down to a trickle in later years. This would give us the variation between time and experience that we need to separately identify the contribution to progress from learning and from other stuff that is correlated with the passage of time. We don’t have that experiment, unfortunately. But we do have World War II.

The US experience in World War II is not a bad approximation of this ideal experiment. The changes in production induced by the war were enormous: the US went from fielding an army of under 200,000 in 1939 to more than 8 million in 1945, while also equipping the Allied nations more generally. Production needs were driven by military exigencies more than by the price and cost of production, which should minimize reverse causality, where cost declines lead to production increases. Production was also highly variable, so that it is possible to separately identify cost reductions associated with time and with cumulative experience. The following figure, for example, illustrates monthly production of Ford’s Armored Car M-20 GBK.

A working paper by Lafond, Greenwald, and Farmer (2020) uses this context to separately identify wartime cost reductions associated with production experience and those associated with time. They use three main datasets:

  • Man hours per unit over the course of the war for 152 different vehicles (mostly aircraft, but also some ships and motor vehicles)

  • Total unit costs per product for 523 different products (though with only two observations per product: “early” cost and “later” costs)

  • Indices of contract prices aggregated up to the level of 10 different war sub-departments

So, in this unique context, we should be able to accurately separate out the effect of learning-by-doing from other things that reduce cost and are correlated with the passage of time. When Lafond, Greenwald, and Farmer do this, they find that cost reductions associated specifically with experience account for 67% of the reduction in man hours, 40% of the reduction in total unit costs, and 46% of the reduction in their index of contract prices. Learning by doing, at least in World War II, was indeed a significant contributor to cost reductions.

Are learning curves useful?

So where do we stand, after all that? I think we have good reason to believe that learning-by-doing is a real phenomenon, roughly corresponding to a kind of evolutionary process. At the same time, it almost certainly accounts for only part of the cost reductions we see in any given case, especially over the long term when there are large changes to production processes. In particular, the historical relationship between cost reductions and cumulative output that we observe in “normal” circumstances is so hopelessly confounded that we really can’t figure out what share accrues to learning-by-doing and what share to other factors.

That means that if we want to lower the costs of renewable energy (or any other new technology), we can probably be confident they will fall to some degree if we just scale up production of the current technology. But we don’t really know how much - historic relationships don’t tell us much. In World War II, at best, we would have gotten about two-thirds of the rate of progress implied by the headline relationship between cost reduction and cumulative output. Other datasets imply something closer to two fifths. Moreover, the evidence reviewed here applies best to situations where we have a standard production process around which we can tinker and iterate to a higher efficiency. If we need to completely change the method of manufacture or the structure of the technology - well, I don’t think we should count on learning by doing to deliver that.


The next newsletter (January 19) will be about the challenges in getting any academic field to embrace new (better) methodologies, and how the field of economics overcame them.

If you enjoyed this post, you might also enjoy:

More people = more ideas?

On Collaborating with an AI, Writing, and Innovation

  

Heads up: this week’s post is a bit different, in that it’s not built primarily from academic papers. We’ll be back to the normal format in 2021 with a post about learning-by-doing and learning curves. Happy Holidays!

GPT-3 is OpenAI's new generative language model. It’s built by feeding a gigantic neural network hundreds of gigabytes of text, pulled from books, Wikipedia, and the internet at large. During training, the network identifies statistical patterns in that text, which get encoded in its structure and weights. The upshot is that you end up with a predictive text generator. Prompt GPT-3 with a bit of text and it will complete the text based on its underlying model of statistical regularities in text. The results range from spooky good to funny bad.

GPT-3 could end up having a big impact on innovation and research, but that’s not what this post is about. Instead, I’m going to talk about how writing stories with GPT-3 is kind of a neat metaphor for innovation in general.

Let's look at an example. The Eye of Thuban is a short, incomplete science fiction-fantasy story about a woman named Vega who lives on an alien world, longs to be a space pilot, and encounters a mysterious artifact that seems to give her powers. It's a joint effort between GPT-3 and the human Arram Sabeti, and while it's not particularly good, it is coherent and recognizably a science-fiction story.

Sabeti generated the story by prompting GPT-3 with the following:

This novel is a science fiction thriller that can be thought of as a strange mix of the fantastic and whimsical worlds of Hayo Miyazaki and Ian M. Banks Culture novels. They’re set in a post singularity world where humanity and its descendants span thousands of worlds, and sentient super intelligent ships with billions of people living on them wander the galaxy.

Chapter 1.

Starting with that, GPT-3 wrote a few sentences of story by predicting what kind of text would be most likely to continue on from this prompt. Sabeti accepted or rejected these sentences based on his own judgment: was it coherent? interesting? If not, he would prompt GPT-3 to try again. If he liked it, Sabeti would add the GPT-3 generated text to what he had and use the story so far as the next prompt for GPT-3. It would generate a few more sentences, based on what came before, which Sabeti would accept or reject. Repeat, until you have a story.
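
In pseudocode, the loop looks something like the sketch below. The functions `generate_continuation` and `human_approves` are hypothetical stand-ins for a call to the language model and for Sabeti's judgment; the sketch is only meant to show the structure of the collaboration, not any actual API.

```python
def write_story(initial_prompt, generate_continuation, human_approves, target_words=2000):
    """Human-in-the-loop generation: the model proposes a continuation, the human
    accepts or rejects it, and accepted text becomes part of the next prompt."""
    story = initial_prompt
    while len(story.split()) < target_words:
        continuation = generate_continuation(story)  # model proposes a few sentences
        if human_approves(continuation):             # human curates
            story += continuation                    # accepted text shapes future prompts
        # otherwise: discard the proposal and sample again from the same prompt
    return story
```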

Who "wrote" the story that emerged? After the initial prompt, GPT-3 wrote all the text. But Sabeti played a crucial role in selecting text that he liked most, in the process shaping future prompts and pulling the story in one direction or another.

More mysteriously, we have very little understanding of what GPT-3 was "thinking" or "drawing on" when it composed its contributions. We don't know what words or phrases in the prompts caught GPT-3's attention, or where the patterns it used to generate text originated in its training data. To Sabeti, it's a black box. Indeed, even to the team at OpenAI that created it, GPT-3 is largely a black box. The way neural nets encode statistical patterns in data is useful, but hard to translate into the kind of explanations that people understand, at least at present.

So, in essence, through a process Sabeti doesn't understand, text is generated. Sabeti evaluates it critically, and then chooses whether to leave it or try again. At some point, he is satisfied with the results and a story is published. This isn't writing as we know it. 

The Mystery of Writing

Or is it? In fact, the kind of creation described above is remarkably similar to how some writers describe their process. George Saunders writes:

we often discuss art this way: the artist had something he “wanted to express”, and then he just, you know … expressed it. We buy into some version of the intentional fallacy: the notion that art is about having a clear-cut intention and then confidently executing same...

The actual process, in my experience, is much more mysterious and more of a pain in the ass to discuss truthfully.

Authors often do not exactly understand where their own ideas come from. Stephen King, in his memoir On Writing, writes:

Good story ideas seem to come quite literally from nowhere, sailing at you right out of the empty sky. Two previously unrelated ideas come together and make something new under the sun. Your job isn't to find these ideas but to recognize them when they show up.

In the same memoir, King describes his process of organic and intuitive writing. He advises writers to come up with a scenario and discover what happens to the characters, rather than setting up the plot in advance and maneuvering them through it. After finishing the first draft, he tells writers to reread it to discover what they were "really" writing about. Rather than sculpture, King likens the whole process to unearthing a fossil. The story is discovered, rather than planned.

King and Saunders don't know where they are going; but they still get somewhere interesting. How? Part of the answer is taste. They may not be able to create a satisfying plot twist or turn of phrase on command, but when they see one, they recognize it. Notice how in the previous quote King says:

Your job isn't to find these ideas but to recognize them when they show up.

If you are capable of creating work and then evaluating it, you are capable of using an evolutionary algorithm to write great fiction. Saunders is most explicit about how this works:

My method is: I imagine a meter mounted in my forehead, with “P” on this side (“Positive”) and “N” on this side (“Negative”). I try to read what I’ve written uninflectedly, the way a first-time reader might (“without hope and without despair”). Where’s the needle? Accept the result without whining. Then edit, so as to move the needle into the “P” zone. Enact a repetitive, obsessive, iterative application of preference: watch the needle, adjust the prose, watch the needle, adjust the prose (rinse, lather, repeat), through (sometimes) hundreds of drafts. Like a cruise ship slowly turning, the story will start to alter course via those thousands of incremental adjustments.

More generally, in Old Masters and Young Geniuses: The Two Lifecycles of Artistic Creativity David Galenson describes two broad approaches to artistic creation: conceptual and experimental. The experimental creator is unsure of where they are going and proceeds by creating, evaluating the results, tweaking them, and repeating over and over again. Galenson provides a wealth of anecdotes supporting this creative style for many famous writers: Charles Dickens, Virginia Woolf, Mark Twain, etc. 

Just as the blind forces of evolution are capable of creating highly complex creatures via a process of mutation, selection, and retention, this process of myopic, evolutionary writing is capable of generating work that surprises its own authors. They didn't know they would end up here, but they recognized it was a good place to be, once they did. Again, Saunders:

The interesting thing, in my experience, is that the result of this laborious and slightly obsessive process is a story that is better than I am in “real life” – funnier, kinder, less full of crap, more empathetic, with a clearer sense of virtue, both wiser and more entertaining.

This method of writing sounds remarkably like collaborative writing with GPT-3 to me. Through a lifetime of reading and writing, great writers develop their own intuitive subconscious map of the regularities in good writing. Like GPT-3, they have a kind of prompt - where are they starting from - and like GPT-3, they have some kind of internalized model of what writing looks like, but for which they might struggle to explain the exact sources and reasons. Then, once their intuition or subconscious tosses out some text, they can evaluate whether it’s any good. If so, they keep it and “add it” to the prompt. If not, they try again. 

Indeed, François Chollet makes this analogy pretty explicitly:

Not so different after all

That said, there are many differences too. The extent to which human brains are “like” digital neural networks is not clear, and it may be that the differences are really important. But for the purposes of this essay, that’s not a big problem. What matters is that both processes involve the generation of text via methods that are hard to describe but effective, and then a second step of conscious evaluation of this output.

A more important difference is that the unedited text generated by Saunders or King might be much better than the text generated by GPT-3. Certainly the Eye of Thuban is not a particularly compelling story. In writing The Eye of Thuban, Sabeti notes that sometimes GPT-3 adds text that is logically inconsistent with what has come earlier - we can assume King and Saunders usually don’t do that.

The quality of the text GPT-3 generates isn’t actually so important though, at least for the end result. Instead, the quality of the unedited text generated by Saunders, King, or GPT-3 mostly affects the amount of time that must be spent curating the text. Indeed, if one was willing to put in a lot of time (like, longer than the lifespan of the universe) monkeys banging on keyboards could eventually replicate Shakespeare. GPT-3 is much, much better than that. It’s also clearly an improvement over what’s come before. The pseudonymous Gwern has extensively played with GPT-3 and its predecessor GPT-2, writing:

With GPT-2-117M poetry, I’d typically read through a few hundred samples to get a good one… But for GPT-3, once the prompt is dialed in, the ratio appears to have dropped to closer to 1:5—maybe even as low as 1:3! I frequently find myself shrugging at the first completion I generate, “not bad!” (Certainly, the quality of GPT-3’s average prompted poem appears to exceed that of almost all teenage poets.) I would have to read GPT-2 outputs for months and probably surreptitiously edit samples together to get a dataset of samples like this page.

Sabeti claims to have written the Eye of Thuban in the course of a few hours, whereas Saunders describes potentially hundreds of iterations on each draft. It may well be that collaborative writing with GPT-3 could come up with something really good, if one was willing to put in a lot more time. 

What GPT-3 does is reduce the time spent iterating toward good work, by exploiting regularities in language to avoid wasting the curator’s time selecting obviously bad passages.

Saunders and King do the same thing - a lifetime of reading, writing, and thinking critically about what they read and write has led them to internalize “good writing,” such that the raw text they come up with is not bad, even before they selectively curate their own writing. Stretching the analogy, we might view the texts Saunders and King read as the training data they use to build their inner models of good writing. Unlike with GPT-3, some of the curation that Sabeti performs on GPT-3’s output probably occurs inside a writer's head, rather than through the process of actually writing text and evaluating it (think of a writer mentally searching through a series of adjectives to find just the right one before putting pen to paper). To the extent the uncurated text is already good, this speeds up their iterative process.

There is one respect, however, in which time may not be sufficient to offset any weaknesses in GPT-3’s generated text. It may be that there are certain turns of phrase and sentences that GPT-3 will never produce, simply because they lie so much at odds with its internalized model of the regularities in writing. If this is true, then no amount of time would suffice to generate good writing from GPT-3. In this regard, GPT-3 could potentially be worse than monkeys pounding on keyboards, since they are at least capable of generating any text. The tradeoff seems unavoidable; when you exploit regularities in language to weed out certain passages and save the collaborator time, you also might weed out good passages that do not exhibit these regularities. We just hope these cases are rare.

Beyond Writing

What we have then, in both collaborative writing with GPT-3 and a certain method of professional writing, is an iterative process of generation and evaluation. Good writing does not emerge fully formed - instead, many texts are generated, and good ones are retained. If we were to generate our texts randomly, we could still generate great writing - but it would take a very long time, because most random text is gibberish. GPT-2 and GPT-3 greatly reduce the time necessary to generate great writing by restricting the generation of texts to those more likely to be good (they are coherent, and obey certain regularities in language). Great writers do both sides of the process - they have good subconscious models of writing, so that their raw text is reasonably good, and they have great taste, so they can prune their output and so direct their writing towards interesting ends.

Collaborative writing with GPT-3 is more than an analogy for good writing though; it’s an analogy for innovation in general.

Let’s pause briefly to consider what “innovation” is. To me, innovation is the emergence of interesting, reproducible novelty. Novel, because it must be something that has not been done before. Reproducible, because the innovation cannot be a one-time miracle, but must be a new class of thing which can serve (in principle) as a blueprint for more of its kind. And interesting, because otherwise, who cares? The challenge with innovation is that most novel things are not interesting. How to find the few that are?

In practice, what we do is use some kind of simplified model of reality to guide our efforts, so that we don’t just try things at random. That “model” could literally be a scientific model of the physical world; these are, after all, ways of representing observed regularities in data. But they could also be much more prosaic: rules of thumb, analogies to other examples, or expert intuition (built up from long study of relevant precedents). When these models are good, they allow us to develop ideas and technologies that would take an eternity to arrive at by evolution or random chance. When they are bad, they restrict us and prevent us from trying things that would have worked if we could only take off our blinders.

In the analogy of collaborative writing with GPT-3 to innovation, our models of the world are analogous to GPT-3, but it is the world itself that is analogous to the human collaborator. The world is messy and complex, and even the best models may miss things. It is only when innovations are brought into reality - whether as clinical trials, prototypes, product launches, or startups - that we see if they are, in fact, “interesting.” Those that are, are retained. Like GPT-3’s text in The Eye of Thuban, they get added to the “prompt” and further work builds on the ideas and innovations that have been retained. For the rest, we “go back to the drawing board” (our simplified model of the world) and try again.


If you liked this post, you might also like the following:

Progress or stagnation in film?

Innovation as combination

Next week, “How useful are learning curves, really?”

How bad is publish-or-perish for the quality of science?


How did we end up in a situation where so many scientific papers do not replicate? Replication isn't the only thing that counts in science, but there are lots of papers that, if they really do describe a regularity or causal mechanism in the world, we should be able to replicate. And we can't. How did we get here?

One theory (not the only one), is that the publish-or-perish system is to blame.

In an influential 2016 paper, Paul Smaldino and Richard McElreath modeled science in action with a simple computer simulation. Their simulation is a highly simplified version of science, but it captures the contours of some fields well. In their simulation, “science” is nothing but hypothesis testing (that is, using statistics and data to assess whether the data is consistent or inconsistent with various hypotheses). One hundred simulated labs pursue various research strategies and attempt to publish their results. In this context, a “research strategy” is basically just three numbers:

  • A measure of how much effort you put into each research project: the more effort you put in, the more accurate your results, but the fewer projects you finish

  • A measure of what kinds of protocols you use to detect a statistically significant event: you can trade off false negatives (incorrectly rejecting a true hypothesis) and false positives (incorrectly affirming a false one)

  • The probability you choose to replicate another lab’s findings or investigate a novel hypothesis

At the end of each period, labs either do or do not finish their project. If they do, they get a positive or null result. They then attempt to publish what they’ve got. Next, a random set of labs is selected and the oldest one “dies.”

Over time, labs accumulate prestige (also a number) based on their publishing record. Prestige matters because at the end of every period, the simulation selects another random set of labs. The one with the highest prestige spawns a new lab which follows similar (though not necessarily identical) research strategies as its “parent.” This is meant to represent how successful researchers propagate their methods, whether by training postdocs who go on to form their own labs or by imitation from new labs that attempt to emulate prestigious work.

Lastly, Smaldino and McElreath assume prestige is allocated according to the following rules:

  • Positive results are easier to publish than null results

  • More publications lead to more prestige

  • Replications give less prestige than novel hypotheses

What happens when you simulate this kind of science is not that surprising: labs with low-effort strategies that adopt protocols conducive to lots of false positives publish more often than those that try to do things “right.” Let's call this kind of research strategy “sloppy” science. Note that it may well be that these labs sincerely believe in their research strategy - there is no need in this model for labs to be devious. But by publishing more often, these labs become more prestigious, and over time they spawn more labs, so that their style of research comes to dominate science. The result is a publication record that is riddled with false positives.
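
To make the selection dynamic concrete, here is a minimal sketch of the same logic in Python. To be clear, this is not Smaldino and McElreath's actual model or code - the parameter values and functional forms are invented for illustration - but it captures the mechanism: low-effort labs finish more projects and report more (often false) positives, and because new labs copy the most-published labs, low effort spreads.

```python
# A minimal sketch (not Smaldino & McElreath's actual code) of the selection
# dynamic. All parameter values are illustrative assumptions.
import random

N_LABS, N_PERIODS = 100, 5000
BASE_RATE = 0.1       # assumed share of tested hypotheses that are true
PUB_NULL_PROB = 0.1   # assumed chance a null result gets published

def new_lab(effort=None):
    # effort in (0, 1]: more effort -> fewer finished projects, fewer false positives
    e = effort if effort is not None else random.uniform(0.1, 1.0)
    return {"effort": min(max(e, 0.05), 1.0), "pubs": 0, "age": 0}

labs = [new_lab() for _ in range(N_LABS)]

for _ in range(N_PERIODS):
    for lab in labs:
        lab["age"] += 1
        if random.random() < 1.0 - 0.5 * lab["effort"]:      # lab finishes a project
            true_hypothesis = random.random() < BASE_RATE
            false_positive = random.random() < 0.3 * (1 - lab["effort"])
            positive_result = true_hypothesis or false_positive
            if positive_result or random.random() < PUB_NULL_PROB:
                lab["pubs"] += 1
    # death: the oldest lab in a random sample dies
    sample = random.sample(labs, 10)
    labs.remove(max(sample, key=lambda l: l["age"]))
    # birth: the most-published lab in another random sample spawns a similar lab
    parent = max(random.sample(labs, 10), key=lambda l: l["pubs"])
    labs.append(new_lab(parent["effort"] + random.gauss(0, 0.02)))

print("mean effort after selection:", sum(l["effort"] for l in labs) / len(labs))
# Mean effort typically drifts downward: sloppy labs out-publish and out-reproduce
# careful ones, even though no lab is being deliberately devious.
```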

In short, Smaldino and McElreath suggest the incentive system in science creates selective pressures under which people who adopt research strategies that lead to non-replicable work thrive and spread their methods. If these selective pressures don't change, no amount of moral exhortation to do better will work; those who listen will always be outcompeted, in the long run, by those who don't. In fact, Smaldino and McElreath show that, despite warnings about poor methodology in behavioral science dating back to at least the 1960s, 44 literature reviews (see figure below) find no increase in the average statistical power of hypothesis tests in the social and behavioral sciences.

Is it a good model?

Smaldino and McElreath’s simulation suggests it’s the incentive schemes (and their effect on selection) we currently use that lead to things like the replication crisis. So what if we changed the incentive system? Two recent papers look at the conduct of science for projects that are just about identical, except for the incentives faced by the researchers. 

First, let’s take a look at a new working paper by Hill and Stein (2020). Smaldino and McElreath basically assert that the competition for prestige (only those with the most prestige “reproduce”) leads to a reduction in effort per research project, which results in inferior work (more likely to be a false positive). Hill and Stein document that this is indeed the case (at least in one specific context): competition for prestige leads to research strategies that produce inferior work fast. They also show this doesn’t have to be the case, if you change the incentives of researchers.

Hill and Stein study structural biology, where scientists try to discover the 3D structure of proteins using modeling and data from x-ray scattering off protein crystals. (Aside: this is the same field that was disrupted last week by the announcement that DeepMind’s AlphaFold had made a big leap in inferring the structure of proteins based on nothing more than DNA sequence data). What makes this setting interesting is a dataset that lets Hill and Stein measure the effort and quality of research projects unusually well.

Structural biology scientists report to a centralized database whenever they take their protein crystals to a synchrotron facility, where they obtain their x-ray data. Later, they also submit their final structures to this database, with a time-stamp. By looking at the gap between the receipt of data and the submission of the final protein model, Hill and Stein can see how much time the scientists spend analyzing their data. This is their measure of how much effort scientists put into a research project.

The database also includes standardized data on the quality of each structural model: for example, how well does the model match the data, what is the resolution of the model, etc. This is a key strength of this data: it’s actually possible to “objectively” rate the quality of research outputs. They use this data to create an index for the “quality” of research. 

Lastly, of course, since scientists report when they take their sample to a synchrotron for data, Hill and Stein know who is working on what. Specifically, they can see if there are many scientists working on the same protein structure.
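
If it helps to see the bookkeeping, here is a hypothetical sketch of how measures like these could be built from deposit records. The column names and values are stand-ins of my own, not the actual schema of the structural biology database.

```python
# Hypothetical sketch of Hill & Stein-style effort and competition measures.
# Column names and dates are invented for illustration.
import pandas as pd

deposits = pd.DataFrame({
    "protein_id":      ["P1", "P1", "P2"],
    "lab_id":          ["A",  "B",  "C"],
    "collection_date": pd.to_datetime(["2015-01-10", "2015-02-01", "2015-03-15"]),
    "deposit_date":    pd.to_datetime(["2015-09-30", "2015-07-20", "2016-06-01"]),
})

# Effort proxy: time between collecting x-ray data and depositing the final structure
deposits["effort_days"] = (deposits["deposit_date"] - deposits["collection_date"]).dt.days

# Competition proxy: number of rival labs working on the same protein
deposits["n_rivals"] = deposits.groupby("protein_id")["lab_id"].transform("nunique") - 1

print(deposits[["protein_id", "lab_id", "effort_days", "n_rivals"]])
```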

The relevant incentive Hill and Stein investigate is the race for priority. There is a norm in science that the first to publish a finding receives the lion’s share of the credit. There are good arguments for this system, but priority can also lead to inefficiency when multiple researchers are working on the same thing and only the first to publish gets credit. In a best case scenario, this race for priority means researchers pour outsized resources into advancing publication by a few days or weeks, with little social benefit. In a worst case scenario, researchers may cut corners to get their work out more quickly and win priority. 

Hill and Stein document that researchers do, in fact, spend less time working with their data to build a protein model, when there are more rivals working on the same protein at the same time. They also show this leads to a measurable decline in the quality of the models. Moreover, based on some rules of thumb about how good a protein model needs to be for application in medical innovation, this quality decline probably has a non-negligible impact on things non-scientists care about, like the development of drugs.

But wait, it gets worse. Why do some proteins attract the attention of lots of scientists, and others not? It’s not random. In fact, Hill and Stein provide evidence that the proteins with the most “potential” (i.e., the ones that will get cited the most in other academic papers when their structure is found) are the ones that attract the most researchers. (Aside: Hill and Stein do this with a LASSO regression that predicts the percentile citation rank of each protein based on the data available on it prior to its structure being discovered). 
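
For intuition, here is a stylized version of that kind of prediction exercise using scikit-learn's LASSO: predict each protein's eventual percentile citation rank from characteristics observable before its structure is solved. The data are simulated stand-ins; this is not Hill and Stein's actual specification.

```python
# Stylized sketch of predicting a protein's citation "potential" with LASSO.
# All data are simulated; only a few of the 20 fake features actually matter.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 20))                        # pre-discovery characteristics
beta = np.zeros(20)
beta[:3] = [0.5, -0.3, 0.2]                         # only three features matter
latent = X @ beta + rng.normal(scale=0.5, size=n)
y = 100 * np.argsort(np.argsort(latent)) / (n - 1)  # percentile citation rank

model = LassoCV(cv=5).fit(X, y)                     # cross-validated penalty choice
potential = model.predict(X)                        # predicted "potential"
print("features kept by the LASSO:", int(np.sum(model.coef_ != 0)))
```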

In short, the most interesting proteins attract the most researchers. The more intense competition, in turn, leads these researchers to shorten the time they spend modeling the protein, in an attempt to get priority. That, in turn, leads to the worst models for exactly the proteins we would most like to know about.

Hill and Stein's paper is about one of the downsides of the priority system. This is a bit different from Smaldino and McElreath, where prestige comes from the number of publications a lab has. However, in Smaldino and McElreath, simulated labs can die at any moment, if they are the oldest one in a randomly selected sample. This means the labs that spawn are the ones able to rapidly accrue a sizable publication record - since if you can't build one fast, you might not live to build one at all. As in Hill and Stein, one way labs do this is by cutting back on the effort they put into each research project.

Different Incentives, Different Results?

However, academics who are judged on their publication record aren't the only people doing structural biology. “Structural genomics” researchers are “federally-funded scientists with a mission to deposit a variety of structures, with the goal of obtaining better coverage of the protein-folding space and mak[ing] future structure discovery easier” (Hill and Stein, p. 4). Hill and Stein argue that this group is much less motivated by publication than the rest of structural biology. For example, only 20% of the proteins they work on end up with an associated academic paper (compared to 80% for the rest of structural biology). So if they aren't driven by publication, is the quality of their work different?

Yes! Unlike the rest of structural biology, this group on average spends more time on proteins with more potential. In the above diagram, they are the red line, which slopes up. And while the quality of the models they generate for the highest-potential proteins is still a bit lower than for the low-potential ones, the relationship is much weaker than it is for those chasing publication.

One other recent study provides some further suggestive evidence that different incentives produce different results - or at least, the perception of different results. Bikard (2018) looks at how research produced in academia is viewed by the private sector, as compared to research produced by the private sector (think papers published by scientists working for business). Specifically, are patents more likely to cite academic or private sector science?

The trouble is this will be an apples-to-oranges comparison if academia and the private sector focus on different research questions. Maybe the private sector thinks academic research is amazing, but simply not relevant to private sector needs most of the time. In that case, they might cite private sector research at a higher rate, but still prefer academic research whenever it is relevant.

To get around this problem, Bikard identifies 39 instances where the same scientific discovery was made independently by academic and industry scientists. He then shows that patents tend to disproportionately cite the industry paper on the discovery, which he argues is evidence that inventors regard academic work skeptically, as compared to work that emerges from industry research.

To identify these cases of simultaneous discovery, Bikard starts with the assumption that if two papers are consistently cited together in the same parenthetical block, like so - (example A, example B) - then they may refer to the same finding. After identifying sets of papers consistently cited together this way, he provides further supporting evidence that this system works. He shows the sets of “twin” papers he locates are extremely similar when analyzed with text analysis algorithms, that they are almost always published within 6 months of each other, and that they are very frequently published literally back-to-back in the same journal (which is one way journals acknowledge simultaneous discovery). 
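
Here is a toy sketch of the first step of that pipeline, just to make the heuristic concrete: scan parenthetical citation blocks and flag pairs of references that almost always appear together. The citation format, the regular expression, and the 90% threshold are simplifying assumptions of mine, not Bikard's exact procedure.

```python
# Toy version of the co-citation heuristic: flag pairs of papers that are
# consistently cited inside the same parenthetical block.
import re
from collections import Counter
from itertools import combinations

citing_sentences = [
    "This effect is well established (Smith 2001; Jones 2001).",
    "Earlier work documents the finding (Smith 2001; Jones 2001; Lee 1999).",
    "Related evidence appears elsewhere (Lee 1999).",
]

pair_counts, cite_counts = Counter(), Counter()
for sentence in citing_sentences:
    for block in re.findall(r"\(([^)]+)\)", sentence):        # parenthetical blocks
        refs = sorted({ref.strip() for ref in block.split(";")})
        cite_counts.update(refs)
        pair_counts.update(combinations(refs, 2))

# Candidate "twins": pairs cited together in nearly all of their appearances
for (a, b), n_together in pair_counts.items():
    share = n_together / min(cite_counts[a], cite_counts[b])
    if share >= 0.9:                                           # assumed threshold
        print(f"possible simultaneous discovery: {a} & {b} ({share:.0%} co-cited)")
```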

This gives Bikard a dataset that, in theory, controls for the “quality” and relevance of the underlying scientific idea being described in the papers, and so provides a nice avenue for seeing how academic work is perceived relative to industry work. When an inventor builds on the scientific discovery and seeks a patent for their invention, they can, in principle, cite either paper or both, since the discovery is the same either way. But Bikard finds papers that emerge from academia were 23% less likely to be cited by patents than an industry paper on the same discovery.

This preference for industry research could reflect a lot of things. But Bikard goes on to interview 48 scientists and inventors about all this and the inventors consistently say things like the following, from a senior scientist at a biotechnology firm:

The principle that I follow is that in academia, the end game is to get the paper published in a as high-profile journal as possible. In industry, the end game is not to get a paper published. The end game is getting a drug approved. It's much, much, much harder, okay? Many, many more hurdles along the way. And so it's a much higher bar - higher standards - because every error, or every piece of fraud along the way, the end game is going to fail. It's not gonna work. Therefore, I have more faith in what industry puts out there as a publication.

So, to sum up, we’ve got evidence that non-academic consumers of science pay more attention to the results that come from outside academia, with some qualitative evidence that this is because academic science is viewed as lower quality. We’ve also got good data from one particular discipline (structural biology) that publication incentives lead to measurably worse outcomes. I wish we had more evidence to go on, but so far what we have is consistent with the simple notion that different incentive systems do seem to get different results in a way that moral exhortation perhaps does not. 

But maybe you already believed incentives matter. In that case, one nice thing about these papers is they provide a sense of the magnitude of how badly academic incentives screw up science. From my perspective, the magnitudes are large enough that we should try to improve the incentives we have, but not so large that I think science is irredeemably broken. Hill and Stein find priority races reduce research time from something like 1.9 years to 1.7 years, not from 1.9 years to something like 0.5 years. And though the quality of the models generated is worse, Hill and Stein do find that, in subsequent years, better structure models eventually become available for proteins with high potential (at significant cost in terms of duplicated research). And even if inventors express skepticism towards academic research, they still cite it at pretty high rates. We have a system that, on the whole, I think continues to produce useful stuff. But it could be better.

If you thought this was interesting, you might also like these posts:

Does chasing citations lead to bad science?

Does science self-correct?

How long does it take to go from science to technology?


An update on this project: I really enjoy writing this newsletter and have gotten good feedback on it. But it takes time (you may have noticed that there were only three newsletters in the last 6 months). So I’m very pleased to announce I’ve received funding from Emergent Ventures and the go-ahead from Iowa State University to carve out a bit of time specifically for this project, at least for the next year. The plan is to release a newsletter like this one every other Tuesday. 

After a year, I’ll re-assess things. If you want to help make this project a success, you can subscribe or tell other people about it. Thanks for your interest everyone!

----

It seems clear that a better understanding of the regularities that govern our world leads, in time, to better technology: science leads to innovation. But how long does this process take? 

Two complementary lines of evidence suggest 20 years is a good rule of thumb.

A First Crack

James Adams was one of the first to take a crack at this in a serious quantitative way. In 1990, Adams published a paper titled "Fundamental Stocks of Knowledge and Productivity Growth" in the Journal of Political Economy. Adams wanted to see how strong the link was between academic science and the performance of private industry in subsequent decades.

Adams had two pieces of data that he needed to knit together. On the science side, he had data on the annual number of new journal articles in nine different fields, from 1908 to 1980. On the private industry side, he had data on the productivity of 18 different manufacturing industries over 1966-1980. Productivity here is a measure of how much output a firm can squeeze out of the same amount of capital, labor, and other inputs. If a firm can get more output (or higher quality outputs) from the same amount of capital and labor, economists usually assume that reflects an improved technology (though it can also mean other things). What Adams basically wanted to do was see if industries experienced a jump in productivity sometime after a jump in the number of relevant scientific articles.

The trouble is that pesky word “relevant.” Adams has data on the number of journal articles in fields like biology, chemistry, and mathematics, but that's not how industry is organized. Industry is divided into sectors like textiles, transportation, and petroleum. What scientific fields are most relevant to the textiles industry? To transportation equipment? To petroleum? 

To knit the data together, Adams used a third set of data: the number of scientists in different fields that work for each industry. To see how much the textiles sector relies on biology, chemistry, or mathematics, he looked at how many biologists, chemists, and mathematicians the sector employed. That data did exist. If they employed a lot of chemists, they probably used chemistry; if they employed lots of biologists, they probably used biology, and so on. He weighted the number of articles in each field by the number of scientists working in that field to get a measure of how much each industry relies on basic science.
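
The weighting itself is simple arithmetic. Here is a toy version with invented numbers, just to show the bookkeeping (this is not Adams' data):

```python
# Toy version of Adams' weighting: an industry's "relevant science" is the
# article count in each field, weighted by the share of the industry's
# scientists who work in that field. All numbers are invented.
articles = {"biology": 1200, "chemistry": 3400, "mathematics": 800}

# scientists employed by the textiles industry, by field (hypothetical)
textiles_scientists = {"biology": 5, "chemistry": 120, "mathematics": 10}

total_scientists = sum(textiles_scientists.values())
relevant_science = sum(
    articles[field] * (n / total_scientists)
    for field, n in textiles_scientists.items()
)
print(f"textiles' weighted science measure: {relevant_science:.0f} article-equivalents")
```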

So now Adams has data on the productivity and relevant scientific base of 18 different manufacturing sectors. He expects more science will eventually lead to more productivity, but he doesn't know how long that will take. If there was a surge in scientific articles in a given year, at what point would Adams expect to see a surge in productivity? If scientific insight can be instantly applied, then the surge in productivity should be simultaneous. But if the science has to work its way through a long series of development steps, then the benefits to industry might show up later. How much later?

To come up with an estimate, Adams basically looked at how strong the correlations were for lags of 5, 10, and 20 years. Specifically, he looked to see which one gave him the strongest statistical fit between scientific articles produced in a five-year span and the productivity increase in a five-year span for industries that use that field's knowledge intensively. Of the time lags he tried, he found the strongest correlation was at 20 years.
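
Here is a stylized sketch of that kind of lag search, with a made-up data-generating process in which the true lag is 20 years. It is not Adams' actual data or econometric specification, just the bare idea of comparing fits at different lags.

```python
# Stylized lag search: correlate productivity with the science measure lagged
# by 5, 10, or 20 years. The fake data are built with a true lag of 20 years.
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(1908, 1981)
articles = rng.normal(100, 10, size=years.size)      # fake annual article counts

prod_years = np.arange(1966, 1981)
true_lag = 20
productivity = 0.01 * articles[np.searchsorted(years, prod_years - true_lag)] \
               + rng.normal(0, 0.02, size=prod_years.size)

for lag in (5, 10, 20):
    lagged_articles = articles[np.searchsorted(years, prod_years - lag)]
    r = np.corrcoef(lagged_articles, productivity)[0, 1]
    print(f"lag {lag:2d} years: correlation with productivity = {r:.2f}")
# With this made-up process, only the 20-year lag lines up strongly.
```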

Adams' study was only a first step, but recent work has largely validated his original findings.

A Bayesian Alternative

Nearly thirty years later, Baldos et al. (2018) tackled a similar problem with different data and a more sophisticated statistical technique. Unlike Adams, they focused on a single sector - agriculture. Like Adams, they had two pieces of data they wanted to knit together.

On the science side, a small group of ag economists has spent a long time assembling a data series on total agricultural R&D spending by US states and the federal government. Until recently, governments were a really big source of agricultural research dollars, and those dollars can be relatively easily identified since they frequently flow through the Department of Agriculture or state experiment stations. The upshot is Baldos et al. have data on public sector agricultural R&D going back to 1908. Meanwhile, on the technology side, the USDA maintains a data series on the total factor productivity of US agriculture from 1949 to the present day. So like Adams, they're going to look for a statistical correlation between productivity and research. Unlike Adams, they're going to use dollars to measure science, and unlike Adams they're going to focus on a single sector (and not a manufacturing sector).

To deal with the fact that we don’t know when research spending impacts productivity growth, they’re going to adopt a Bayesian approach. How long does it take agricultural spending to influence productivity? They don’t know, but they make a few starting assumptions:

  • They assume the impact will follow an inverted-U shape. The idea here is that new knowledge takes time to be developed into applications, and then it takes time for those applications to be adopted across the industry. During this time, the impact of R&D done in one year on productivity in subsequent years is rising. At some point - maybe 20 years later - the impact of that R&D on productivity growth hits its peak. But after that point, the R&D becomes less relevant, so that its impact on productivity growth in subsequent years declines. Eventually, the ideas become obsolete and have no additional impact on increasing productivity.

  • They assume the impact of R&D on productivity after 50 years is basically zero.

  • They assume the peak impact will occur sometime between 10 and 40 years, with the most likely outcome somewhere around 20. This assumption is based on earlier work, similar in spirit to Adams.

Given these assumptions, they are basically entertaining a bunch of different possible distributions for the relationship between R&D spending and productivity. They assume the most likely distribution is one peaking around 20 years, and the farther a distribution is from that, the more unlikely it is. They then use Bayes' rule to update these beliefs, given the data on R&D spending and agricultural productivity. It will turn out that some of those distributions fit the data quite well, in the sense that if that distribution of R&D impacts is true, then the R&D spending data matches productivity pretty well. Others fit quite poorly. They update their beliefs after observing the data, increasing their belief in the distributions that fit the data well and decreasing their belief in the ones that don't.
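
To see the flavor of this, here is a toy grid approximation of the Bayesian update, with an inverted-U lag profile whose peak location is the unknown parameter. Everything in it - the gamma-shaped profile, the fake R&D and productivity series, the noise levels - is a simplification of my own, not the authors' actual model.

```python
# Toy grid-based Bayesian update over the peak lag of an inverted-U R&D lag
# profile. All series and parameters are invented for illustration.
import numpy as np
from scipy.stats import gamma, norm

rng = np.random.default_rng(2)
L = 50                                       # assume no impact after 50 years
rd = rng.normal(10, 2, size=150)             # fake annual R&D spending series

def lag_weights(peak):
    # inverted-U profile peaking at `peak` years, truncated at 50 years
    w = gamma(a=4, scale=peak / 3).pdf(np.arange(L))
    return w / w.sum()

def predicted_growth(peak):
    w = lag_weights(peak)
    return np.array([rd[t - L:t][::-1] @ w for t in range(L, rd.size)])

tfp_growth = predicted_growth(20) + rng.normal(0, 0.2, size=rd.size - L)

peaks = np.arange(10, 41)                    # candidate peak lags, 10-40 years
prior = norm(20, 7).pdf(peaks)               # prior centered on 20 years
log_lik = np.array([
    norm(predicted_growth(p), 0.2).logpdf(tfp_growth).sum() for p in peaks
])
posterior = prior * np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()
print("posterior mode for the peak lag:", peaks[posterior.argmax()], "years")
# With these fake data, the posterior concentrates near the true 20-year peak.
```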

They find that, with 95% probability, science impacts productivity most strongly after 15-24 years, with the best point estimate around 20 years.

Evidence from Citations

So whether for manufacturing or agriculture, using slightly different data and statistical techniques, we find a correlation between productivity growth and basic science that is strongest around 20 years. But at the end of the day, both methods look for correlations between two messy variables separated in time by decades. It’s hard to make this completely convincing.

A cleaner alternative is to look at the citations made by patents to scientific articles. If we assume patents are a decent measure of technology (more on that later) and journal articles a decent measure of science, then citations of journal articles by patents could be a direct measure of technology’s use of science. In the last few years, data on patent citations to journal articles has become available at scale, thanks to advances in natural language processing (see Marx and Fuegi 2019, Marx and Fuegi 2020, and the patCite project).

Do citations to science actually mean patented technologies use the science? Maybe the citations are just meant to pad out the application? There's a lot of suggestive evidence that they do reflect real use. For one, inventors say they do. Arora, Belenzon, and Sheer (2017) use an old 1994 survey from Carnegie Mellon about the use of science by firms. Firms that cite a lot of academic literature in their patents are also more likely to report in the survey that they use science in the development of new innovations.

There's also a variety of evidence that patents that cite scientific papers are different from patents that don't. Watzinger and Schnitzer (2019) scan the text of patents, and find patents that cite science are more likely to include combinations of words that have been rare up until the year the patent was filed. This suggests these patents are doing something new; they don't read like older patents. Maybe they are using brand new ideas or insights that they have obtained from science? They also find these patents tend to be worth more money. Ahmadpoor and Jones (2017) find these science-ey patents also tend to be more highly cited by other patents.

So let’s go ahead and assume citations to journal articles are a decent measure of how technology uses science. What do we learn from doing that?

Marx and Fuegi (2020) apply natural language processing and machine learning algorithms to the raw text of all US patents, to pull out all citations made to the academic literature. They find about 29% of patents make some kind of citation to science. More importantly for our purposes, for patents granted since 1976, the average gap in time between a patent’s filing date and the publication date of the scientific articles it cites is 17 years.

This is pretty close to the twenty years estimated using the other techniques, especially when we consider that the date a patent is filed is not necessarily the date it begins to affect productivity (which is what the other studies were measuring). Any new invention that seeks patent protection might take a few years before it is widely diffused and begins to affect productivity. In that case, a twenty-year estimate is pretty close to spot on.
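
The underlying calculation, at least, is straightforward: pair each patent-to-paper citation with the patent's filing year and the cited article's publication year, and average the difference. A minimal sketch with made-up records:

```python
# Minimal sketch of the average patent-to-paper lag. Records are invented;
# the real exercise runs over millions of citations extracted from patent text.
patent_paper_citations = [
    {"patent_filing_year": 1995, "article_pub_year": 1978},
    {"patent_filing_year": 2003, "article_pub_year": 1990},
    {"patent_filing_year": 2010, "article_pub_year": 1989},
]

lags = [c["patent_filing_year"] - c["article_pub_year"] for c in patent_paper_citations]
print("average patent-to-paper lag:", sum(lags) / len(lags), "years")
```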

Winding paths from science to technology

Of course, there is variation around that 17-year figure. Ahmadpoor and Jones (2017) found that the average shortest time gap between a patent and a cited journal article was just 7 years. And on the other side, there are also longer and more indirect paths from science to technology than direct citation. Ahmadpoor and Jones cite Riemannian geometry as an example. Riemannian geometry was developed by Bernhard Riemann in the 19th century as an abstract mathematical concept, with little or no real-world application until it was incorporated in Einstein's general theory of relativity. Later, those ideas were used to develop the time dilation corrections in GPS satellites. That technology, in turn, has been useful in the development of autonomous vehicles. In a sense then, patents for self-driving tractors owe some of their development to Riemannian geometry, though this would only be detectable by following a long and circuitous route of citation.

When these longer and more circuitous chains of citation are followed, Ahmadpoor and Jones find that 61% of patents are "connected" to science; patents that do not directly cite scientific articles may cite patents that do, or they may cite patents that cite patents that do, and so on. About 80% of science and engineering articles are "connected" to patents, in the sense that some chain of citation links them to a patent. (Technical note: this probably understates the actual linkages, because it is based only on citations listed on a patent’s front page; Marx and Fuegi 2020 show additional citations are commonly found in a patent’s text). 

The authors compute metrics for the average "distance" of a scientific field from technological application, and these distance metrics largely line up with our intuitions. For example, materials science and computer science are fields where we might expect the results to be quite applicable to technology, and indeed, their papers tend to be among the closest to patents - just 2 steps away (cited by a paper that is, in turn, cited by a patent). Atomic/molecular/chemical physics papers would also seem likely to have applications, but only after more investigation, and they tend to be 3 steps removed from technology (cited by a paper that is cited by a paper that is cited by a patent). And as we might expect, the field that is farthest from immediate application is pure mathematics (five steps removed). As expected, these longer citation paths also take longer when measured in time: the average time along the shortest citation path from a math paper to a patent is more than 20 years.
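
The "steps removed" measure can be thought of as a shortest-path search over the citation network: starting from a paper, how many citation hops does it take to reach the nearest patent? Here is a sketch of that idea on a made-up graph (not Ahmadpoor and Jones' data or exact algorithm).

```python
# Breadth-first search for the minimum number of citation steps from a paper
# to the nearest patent. The graph below is invented for illustration.
from collections import deque

# cited_by[x] = documents that cite x
cited_by = {
    "math_paper":        ["physics_paper"],
    "physics_paper":     ["engineering_paper"],
    "engineering_paper": ["patent_123"],
    "cs_paper":          ["patent_456"],
    "patent_123":        [],
    "patent_456":        [],
}

def distance_to_patent(start):
    """Minimum number of citation steps from `start` to any patent."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node.startswith("patent"):
            return dist
        for citer in cited_by.get(node, []):
            if citer not in seen:
                seen.add(citer)
                queue.append((citer, dist + 1))
    return None  # not connected to any patent

print(distance_to_patent("cs_paper"))    # 1: directly cited by a patent
print(distance_to_patent("math_paper"))  # 3: three citation steps from a patent
```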

On the other side of the divide, they also compute the average distance of technology fields to science, and these also align with our intuitions. Chemistry and molecular biology patents would probably be expected to rely heavily on science, and they tend to be slightly more than 1 step removed from science (most directly cite scientific papers). Further downstream, electrical computer patents tend to be two steps removed (they cite a patent that cites a scientific article), and ancient forms of technology like chairs and seats tend to be the farthest from science (five steps removed). The average time along the shortest citation path from a chair/seat patent to a scientific article is also over 20 years.

Converging Evidence

All told, it’s reassuring that two distinct approaches arrive at a similar figure. The patent citation evidence is reasonably direct - we see exactly what technology at what date cites which scientific article, as well as the date that article was published. This line of evidence finds an average gap of about 17 years, with plenty of scope for shorter and longer gaps as well. The trouble with this evidence is that patents are far from a perfect measure of technology. Lots of things do not ever get patented, and lots of patents are for inventions of dubious quality. 

For that reason, it's nice to have a different set of evidence that does not rely on citations or patents at all. When we try to crudely measure “science,” either by counting government dollars or scientific articles, we can detect a correlation between increases in science and the productivity of the industries expected to use it 20 years later - again, with plenty of scope for shorter and longer lags as well.

If you thought this was interesting, you may also enjoy these posts:

How useful is science?

Does chasing citations lead to bad science?
