How common is independent discovery?

Relatively common for ideas we know will be big, but not so much for the rest

Jun 22, 2022

Like the rest of New Things Under the Sun, this article will be updated as the state of the academic literature evolves; you can read the latest version here.

An old divide in the study of innovation is whether ideas come primarily from individual/group creativity, or whether they are “in the air”, so that anyone with the right set of background knowledge will be able to see them. As evidence of the latter, people have pointed to prominent examples of multiple simultaneous discovery:

Isaac Newton and Gottfried Leibniz developed calculus independently of each other
Charles Darwin and Alfred Wallace independently developed versions of the theory of evolution via natural selection
Different inventors in different countries claim to have invented the lightbulb (Thomas Edison in the USA, Joseph Swan in the UK, Alexander Lodygin in Russia)
Alexander Graham Bell and Elisha Gray submitted nearly simultaneous patent applications for the invention of the telephone

In 1922, Ogburn and Thomas compiled a list of nearly 150 examples of multiple independent discovery (often called “twin” discoveries or “multiples”); Wikipedia provides many more. These exercises are meant to show that once a new invention or discovery is “close” to existing knowledge, then multiple people are likely to have the idea at the same time. It also implies scientific and technological advance has some built-in redundancy: if Einstein had died in childhood, someone else would have come up with relativity.

But in fact, all these lists of anecdotes show is that it is possible for multiple people to come up with the same idea. We don’t really know how common it is, because these lists make no attempt to compile a comprehensive population survey of ideas. What do we find if we do try to do that exercise?

Simultaneous Discovery in Papers and Patents

A number of papers have looked at how common it is for multiple independent discovery to occur in academic papers. An early classic is Hagstrom (1974), which reports on a survey of 1,947 academics in the spring of 1966. Hagstrom’s survey asked mathematicians, physicists, chemists, and biologists if they had ever been “anticipated”; today, we would call this getting scooped. Getting scooped isn’t that uncommon: 63% of respondents said they had been scooped at least once in their career, and 16% said they had been scooped more than once.

For our purposes, the most illuminating question in Hagstrom’s survey is “how concerned are you that you might be anticipated in your current research?” Fully 1.2% of respondents said they had already been anticipated on their current project!

Let’s assume people are, on average, halfway through a research project. If they have a constant probability of being scooped through the life of a project, then that implies the probability of getting scooped on any given project is on the order of 2.5%, at least in 1966.
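The back-of-the-envelope step above can be written out explicitly. This is a minimal sketch under the same assumptions as the text (respondents are on average halfway through a project, and the chance of being scooped accrues evenly over the project); the function name `project_scoop_probability` is my own.

```python
def project_scoop_probability(share_already_scooped: float,
                              fraction_complete: float = 0.5) -> float:
    """Scale the share scooped so far up to a whole-project probability.

    Assumes the chance of being scooped accrues evenly over a project,
    so being a fraction of the way through means seeing that fraction
    of the total risk.
    """
    return share_already_scooped / fraction_complete


# 1.2% of Hagstrom's respondents had already been anticipated on their
# current project; assume they were halfway through on average.
print(f"{project_scoop_probability(0.012):.1%}")  # 2.4%
```

That 2.4% is where the "on the order of 2.5%" figure comes from.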

Hill and Stein (2020) get similar results, studying the impact of getting scooped over 1999-2017 for the field of structural biology. Structural biology is a great field for studying how science works because of its unusually good data on the practice of science. Structural biologists try to figure out the 3D structure of proteins and other biological macromolecules using data on the diffraction of x-rays through crystalized proteins. When they have a model that fits the data well, the norm (and often publication requirement) is to submit the model to the Protein Data Bank. This submission is typically confidential until publication, but creates a pre-publication record of completed scientific work, which lets Hill and Stein see when two teams have independently been working on the same thing. Since almost all protein models are submitted to the Protein Data Bank, Hill and Stein really do have something approaching a census of all “ideas” in the field of structural biology, as well as a way of seeing when more than one team “has the same idea” (or more precisely, is working on closely related proteins). Overall, they find 2.9% of proteins involve multiple independent discovery, as defined above, quite close to what Hagstrom reported in 1974.

Painter et al. (2020) take yet another approach to identifying multiple simultaneous invention, this time in the field of evolutionary medicine (2007-2011). Their approach is to identify important new words in the text of evolutionary medicine articles, and then to look for cases where multiple papers introduce the same new word at roughly the same time. In their context, this usually means an idea has been borrowed from another field (where a word for the concept already exists) and they are looking for cases where multiple people independently realized a concept from another field could be fruitfully applied to evolutionary medicine.

To identify important new keywords, they take all the words in evolutionary medicine articles and algorithmically pick out the ones unlikely to be there based on their frequency in American English. This gives them a set of technical words that are not common English words. They build up a dictionary of such terms mentioned in papers published between 1991 and 2006; these are words that are “known” to evolutionary medicine in 2007. Beginning in 2007, they look for papers that introduce new technical words. Lastly, they consider a word to be important if it is mentioned in subsequent years, rather than once and never again.

Over the period they study, there were 3,488 new keywords introduced that went on to appear in at least one subsequent year. Of this set, 197 were introduced by more than one paper in the same year, or 5.6%. As a measure of independent discovery, that’s probably overstated, since it doesn’t correct for the same author publishing more than one paper using the same keywords. Again, I think something in the ballpark of 2-3% sounds plausible. Painter and coauthors go on to focus on a small subset of 5 keywords that were simultaneously introduced by multiple distinct people and which were very important, being mentioned not just again, but in every subsequent year.
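The selection rule described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not Painter and coauthors' actual pipeline: the function name `find_simultaneous_keywords`, the toy papers, and the hand-written `common_words` set (standing in for the American English frequency filter) are all hypothetical.

```python
from collections import defaultdict


def find_simultaneous_keywords(papers, known_terms, common_words):
    """Return (important, simultaneous) sets of new keywords.

    papers: iterable of (paper_id, year, set_of_words) tuples.
    known_terms: technical words already in the pre-2007 dictionary.
    common_words: ordinary English words to ignore (hypothetical stand-in
        for a frequency-based filter).
    """
    first_year = {}                  # word -> year it first appears
    introducers = defaultdict(set)   # word -> papers using it in that first year
    years_seen = defaultdict(set)    # word -> all years it appears in

    # Process papers chronologically so the first sighting is the earliest.
    for paper_id, year, words in sorted(papers, key=lambda p: p[1]):
        for word in words:
            if word in known_terms or word in common_words:
                continue
            years_seen[word].add(year)
            first_year.setdefault(word, year)
            if first_year[word] == year:
                introducers[word].add(paper_id)

    # "Important": the word shows up again in at least one later year.
    important = {w for w, yrs in years_seen.items() if len(yrs) > 1}
    # "Simultaneous": more than one paper introduced it in its first year.
    simultaneous = {w for w in important if len(introducers[w]) > 1}
    return important, simultaneous


# Toy data, invented purely for illustration.
toy_papers = [
    ("p1", 2007, {"microbiome", "the", "selection"}),
    ("p2", 2007, {"microbiome", "hormesis"}),
    ("p3", 2008, {"microbiome", "hormesis"}),
    ("p4", 2008, {"exaptation"}),
]
important, simultaneous = find_simultaneous_keywords(
    toy_papers, known_terms={"selection"}, common_words={"the"}
)
print(sorted(important))     # ['hormesis', 'microbiome']
print(sorted(simultaneous))  # ['microbiome']

# The headline share from the paper's counts: 197 of 3,488 keywords.
print(f"{197 / 3488:.1%}")   # 5.6%
```

Note the sketch counts papers, not authors, so it shares the overstatement problem mentioned above: one author publishing two papers with the same new keyword would register as a simultaneous introduction.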

Bikard (2020) is another attempt to identify instances of multiple independent discovery, though in this case it’s harder to use the data to estimate how common they are. Bikard argues that when the same two papers are frequently cited together in the same parenthetical, then that is evidence they refer to the same underlying idea. Bikard algorithmically identifies a set of 10,927 such pairs of papers in the PubMed database and shows they exhibit a lot of other hallmarks of being multiple independent discoveries: they are textually quite similar, published close in time, and frequently published literally back-to-back in the same journal issue, which is one way journals acknowledge co-discovery.

Given 29.3 million papers in PubMed, if there are only 10,927 instances of multiple discovery, that would naively suggest something on the order of 0.04% of papers having multiple independent discovery. But while Bikard’s publicly available database of twin discoveries is useful for investigating a lot of questions related to science, it’s less useful for ascertaining the probability of independent discovery. That’s because the algorithm requires articles to have the right mix of characteristics to be identified as simultaneous discoveries. For example, in order to identify if two articles are frequently cited together in the same parenthetical block, Bikard needs each paper to receive at least 5 citations, and he needs at least three papers that jointly cite them to have their full text available, so he can see if those citations happen in sequence inside parentheses. It’s unclear to me how many of the 29.3mn papers in PubMed meet these criteria. But we can at least say that as long as at least 1 in 100 papers meet the criteria, then Bikard’s method suggests a rate of simultaneous discovery of roughly 4% or lower, and well below that if a larger share of papers qualify.
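To see how sensitive this is to the unknown share of eligible papers, here is a small sketch. It counts each twin pair once, as the naive calculation above does (counting both papers in a pair would double every figure); the function name `twin_rate` and the particular eligibility shares tried are my own illustrative choices.

```python
TWIN_PAIRS = 10_927          # pairs identified by Bikard (2020)
PUBMED_PAPERS = 29_300_000   # papers in PubMed, per the text


def twin_rate(eligible_share: float) -> float:
    """Twin pairs per eligible paper, if only `eligible_share` of PubMed
    papers could ever have been picked up by the algorithm."""
    return TWIN_PAIRS / (PUBMED_PAPERS * eligible_share)


for share in (1.0, 0.10, 0.01):
    print(f"eligible share {share:.0%}: implied rate {twin_rate(share):.3%}")
```

With every paper eligible the implied rate is about 0.037%; if only 1 in 100 papers were eligible it would be about 3.7%, and it falls as the eligible share grows.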

To close out this section, let’s turn to patents.

Until 2013, the US patent system featured an unusual first-to-invent system wherein patent rights were awarded not to the first person to seek a patent but to the first person to invent it (provided certain conditions were met). This meant that if two groups filed patents for substantively the same invention, the US patent office initiated something called a “patent interference” to determine which group was in fact the first to invent. These patent interferences provide one way to assess how common simultaneous invention is at the US patent office.

Ganguli, Lin, and Reynolds (2020) have data on all 1,329 patent interference decisions from 1998-2014. Of this set, it’s not totally clear how many represent actual simultaneous invention. In a small number of cases (3.5%), the USPTO ruled there had in fact been no interference, but in some cases one party settles or abandons their claim, or ownership of the patents is transferred to a common owner. In these cases, we don’t necessarily know if the patents were the same. But it turns out this doesn’t really matter for making the argument that simultaneous invention is very rare. For the sake of argument, let’s assume all 1,329 patent interference decisions correspond to cases of independent discovery.

On average, it takes a few years for a patent interference decision to be issued. So let’s assume, for the sake of argument, these decisions come from the set of granted patents whose application was submitted between 1996 and 2012. Some 6.3mn patent applications (ultimately granted) were submitted over this time period, which implies 0.02% of patent applications face simultaneous invention. That’s a lot less than the 2-3% we found in some academic papers!
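For anyone who wants to check the arithmetic, here is the calculation as a short sketch. It uses the deliberately generous assumption from the text that every interference decision is a genuine case of simultaneous invention; the variable names are mine.

```python
interference_decisions = 1_329      # Ganguli, Lin, and Reynolds (2020)
granted_applications = 6_300_000    # filed 1996-2012, per the text

# Generous upper bound: treat every decision as true simultaneous invention.
patent_rate = interference_decisions / granted_applications
print(f"{patent_rate:.3%}")  # 0.021%

# How many times smaller is this than the low end (2%) of the paper estimates?
print(round(0.02 / patent_rate))
```

Even under that assumption, the patent figure comes out roughly two orders of magnitude below the estimates for academic papers.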

Inferring the Probability of Rediscovery

All these approaches suggest simultaneous discovery is in fact not very common. But simultaneous discovery is not exactly the benchmark we’re interested in. What I’m actually curious about is the probability someone would eventually rediscover an idea. If Einstein died in childhood, would someone else have found relativity? And would that have been fast or slow?

We can back out a rough estimate of this based on the probability of multiple simultaneous discovery. Suppose I am working on an idea that takes me 1 year to go from idea conception to publication, and I face a 2-3% probability of being scooped by someone else working on the topic during that time. Imagine I drop the idea, but the probability that someone else out there is working on it and will publish remains 2-3% per year. In other words, in every year, the probability nobody publishes on the idea is 97-98%. If that stayed constant, the probability nobody publishes on it in the next 20 years is 54-67%! In other words, these estimates imply that if someone doesn’t publish on an idea, there’s less than even odds someone else will pick up the idea and run with it in the following two decades.

And this estimate is on the high end, for a few reasons. First, assuming a research idea can move from conception to publication in the space of a year is probably too optimistic. If research takes two years and you face a 2-3% probability of being scooped during that longer period, the probability of no independent discovery in 20 years rises to 74-82%. Moreover, a 2-3% probability of simultaneous discovery is actually on the high end in these papers. If it’s closer to 0.1%, it’s all but certain no one will make a discovery someone else missed.
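The numbers in the last two paragraphs all come from the same simple formula: if the chance of being scooped during one project-length period is p, and that chance stays constant, the chance that nobody publishes over n such periods is (1 - p)^n. A minimal sketch, with a function name of my own choosing:

```python
def prob_no_rediscovery(scoop_prob_per_project: float,
                        project_years: float,
                        horizon_years: float = 20) -> float:
    """Probability nobody publishes the idea within `horizon_years`.

    Assumes a constant, independent chance of rediscovery in each
    project-length period (the simplifying assumption used in the text).
    """
    periods = horizon_years / project_years
    return (1 - scoop_prob_per_project) ** periods


for p in (0.02, 0.03):
    one_year = prob_no_rediscovery(p, project_years=1)
    two_year = prob_no_rediscovery(p, project_years=2)
    print(f"p = {p:.0%}: 1-year projects {one_year:.0%}, 2-year projects {two_year:.0%}")
# p = 2%: 1-year projects 67%, 2-year projects 82%
# p = 3%: 1-year projects 54%, 2-year projects 74%

# With a 0.1% chance per year, rediscovery within 20 years is very unlikely.
print(f"{prob_no_rediscovery(0.001, project_years=1):.0%}")  # 98%
```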

This line of argument is suggestive, but there are at least two issues.

First, it assumes the probability of getting scooped is constant over time. Maybe that’s wrong. Maybe, as more knowledge around an area gets filled in, it gets increasingly likely we’ll make a discovery we might otherwise have missed. On the other hand, maybe the opposite is true; as science and technology move on, it might become increasingly unlikely that we’ll make a missed discovery.

Second, it treats all ideas the same. This is a problem, since ideas vary so much in their import. If most random papers are not likely to be rediscovered, but the most important papers are, that has different implications.

Let’s start with the first issue.

Why Not Look at Subsequent Re-Discovery?

For evidence on the probability of rediscovery over the long-run, we need longer-run data. But this is a very hard question to get at, because once a discovery is made, the kind of people capable of independently discovering it are likely to learn about the discovery and then not spend time trying to re-discover it.

In the patent office, for example, only the first to invent (or first to file) gets patent protection. So it behooves any prospective inventor to check to see if an invention has already been patented before spending a lot of time on reinvention. And in academia, most social credit for a discovery goes to the first to discover something. So it behooves researchers to spend some time searching to see if what they want to do has already been done, before embarking on major research projects.

That said, there can be a considerable gap between when a discovery is made and when it is publicly disclosed. For patents, applications are generally not disclosed until 18 months after being filed. Even this is not universal and has only been the case since 2000. In other cases, patent applications are private until the patent is granted, which can take years. Moreover, even if a patent application or grant is public, searching the patent record is an art in and of itself and people might miss things even if the record is public. The point is, people may inadvertently reinvent patented inventions, either because the patent office hasn’t told anyone a similar invention is already under review, or because the inventor simply failed to locate a similar invention in the public patent record. And fortunately for us, the patent office keeps records of why patent applications are blocked. So this can serve as an alternative metric of reinvention.

Lück et al. (2020) look into what happened when the patent office began disclosing patent applications after 18 months, instead of waiting until patents were issued (which can take years). By comparing patent applicants from just before and just after the rule change, they find early disclosure decreased the number of subsequent patent applications whose claims infringed on pending patent applications by 5-15%. If we interpret patent infringement as an indicator of independent reinvention, then that suggests some reinvention occurs because inventors didn’t know a similar idea had already been submitted for patent examination. But that only applies 5-15% of the time. The rest of the time, “reinvention”, if that’s what it is, is not simply a case of someone inventing something that they could not have known about by searching the patent record.

All this is to say, something like reinvention at the patent office does happen at least some of the time. I estimate something like up to 8% of patented inventions might be reinvented over the course of a later decade. How I got to that number is a bit tedious, so I stuck it in an appendix to this post, but basically it’s an upper bound estimate on the share of patents that are cited as a reason for blocking a patent application that is later abandoned. I am not quite sure what to make of this 8% number, other than to remark it isn’t very high over the course of a decade. But it could easily be an underestimate of the true extent of reinvention, since it won’t include people who could have reinvented a technology and chose not to after successfully learning a similar patent was already on file at the patent office. Or it could be an overestimate of how much reinvention occurs, if it includes people who did not independently reinvent, but instead tried to adapt a patented invention they had learned of, and failed to make their adaptation sufficiently distinctive.

So let’s turn to another set of papers that identify periods when the disclosure of discoveries was disrupted for a long time.

Independent Discovery Amid Geopolitical Rivals

Geopolitics gives us a small number of cases where communication of discoveries is significantly impeded for a long time. These cases broadly confirm that it is possible for an idea to be discoverable without being discovered for a long time.

Iaria, Schwarz, and Waldinger (2018) study disruptions to international science in the wake of World War I. As described in more detail here, World War I had the effect of cleaving the international scientific community into two comparatively isolated communities. Iaria, Schwarz, and Waldinger show, for example, that delivery of scientific journals published in enemy countries faced delays in excess of a year after the onset of war, and international conferences featuring speakers from different sides largely ceased until well after the war. Evidence that the war split the scientific community into two groups can also be seen in citations to the work of scientists from the other side. After the onset of war, the share of citations to papers from the other side fell 85%, relative to the share of such papers cited before the war.

While this wasn’t a complete separation, we can see signs that scientists on the two sides quickly began to work on different things, rather than independently discovering the same things in parallel. To get at this, Iaria, Schwarz, and Waldinger use the titles of published academic work, after being translated into a common language (English). They condense these words to their roots, so that each title is now associated with a list of word stems. Using latent semantic analysis, they can infer the similarity of different article titles, based on whether they include word stems that belong to similar topics. In the figure below, they plot the average similarity of a paper title to the 5 most similar titles from one of two groups of countries. In blue, the similarity to papers published in other countries which are on the same side of the war. In red, the similarity to papers published in other countries on the opposite side.

From Iaria, Schwarz, and Waldinger (2018)

To interpret this figure, imagine you’ve got some random paper published in the USA. To be concrete, let’s assume it’s a paper on electricity. Now imagine there are three stacks of papers in front of you, with five papers in each stack. One stack is the set of five papers, published in the USA, whose titles are most similar to the title of your paper on electricity. Another stack is the set of five papers, published by allied powers, but not in the USA, whose titles are closest to your paper on electricity. The last stack is the set of five papers, published in Germany and the other Central Powers, whose titles are closest to your paper on electricity. For each of these stacks, compute how similar your title is, on average, to the titles in the stack.
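The "stack" comparison described above can be sketched in code. To be clear about what this is and isn't: it uses plain word-overlap cosine similarity rather than the latent semantic analysis the authors use, it skips translation and stemming, and the function names and toy titles are hypothetical. It is only meant to show the "average similarity to the five closest titles" step.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def mean_top_k_similarity(title: str, candidate_titles: list, k: int = 5) -> float:
    """Average similarity between `title` and its k most similar candidates."""
    target = Counter(title.lower().split())
    sims = sorted(
        (cosine(target, Counter(t.lower().split())) for t in candidate_titles),
        reverse=True,
    )
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0


# Toy example: one title compared against two made-up groups of titles.
paper = "measurement of electric current in copper wire"
same_side = ["electric current in metals", "resistance of copper wire"]
other_side = ["spectra of distant stars", "electric discharge in gases"]
print(mean_top_k_similarity(paper, same_side) > mean_top_k_similarity(paper, other_side))
```

Repeating this for every paper and every year, and averaging within each group of countries, would give the kind of lines plotted in the figure.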

Prior to 1914, the similarity between your electricity paper and the USA stack was not really different from the similarity between your electricity paper and the Central Powers stack. This corresponds to a period when science was international and so ideas flowed relatively freely around the globe. People in both sets of countries worked on similar stuff. (Note there was a notable difference between the title of this paper and the titles of the allied papers stack, so it seems the USA worked on different kinds of science from its allies, even prior to the war.)

Anyway, in 1914 war breaks out. From that point forward, if you repeat this exercise each year, the similarity between your paper and the Central Powers papers steadily declines, relative to the similarity with the USA papers or even the allied papers, until the end of the war. What’s the upshot? During the years of most severe disruption, the course of science diverges more and more each year. After WWI, journal communication is restored and we no longer see a continued divergence, though neither do we see complete convergence in topics (relations remained frosty and in-person conferences between sides were rare, so a partial split of science remained).

For an even longer period of separation, we can look at the Cold War. During this period, science and technology on either side of the iron curtain developed under some degree of isolation. This fact has been exploited by a few papers which use the unexpected collapse of the USSR as a way to learn how new knowledge spreads and what its impact is (see here and here).

An early paper in this literature is Borjas and Doran (2012), which looks at the impact of the collapse of the USSR on labor market outcomes for American mathematicians. For our purposes today, what is interesting is Borjas and Doran’s documentation of the extent to which Soviet and US math diverged during the Cold War. A few quotes suggest it was not at all uncommon for discoveries on one side of the iron curtain to remain undiscovered on the other side for decades:

In the Soviet Union, for example, the mathematical genius Andrei Kolmogorov developed important results in the area of probability and stochastic processes beginning in the 1930s. In a scenario common throughout Soviet mathematical history, he established a “school” at Moscow State University, attracting some of the best young minds over the next four decades, such as the teenage prodigy Vladimir Arnold in the 1950s. Arnold himself quickly solved Hilbert’s famous “Thirteenth Problem” and initiated the field of symplectic topology… Because the United States did not have the unique Kolmogorov-Arnold Combination, the amount of work done by American mathematicians in these subfields was far less than would have been expected given the size and breadth of the American mathematics community.

Later, Borjas and Doran quote a New York Times article written in 1990, after the iron curtain fell and ideas began to flow more freely:

Persi Diaconis, a mathematician at Harvard, said: “It’s been fantastic. You just have a totally fresh set of insights and results.” Dr. Diaconis said he recently asked [Soviet mathematician] Dr. Reshetikhin for help with a problem that had stumped him for 20 years. “I had asked everyone in America who had any chance of knowing” how to solve a problem… No one could help. But… Soviet scientists had done a lot of work on such problems. “It was a whole new world I had access to,” Dr. Diaconis said.

Further emphasizing the anecdotal evidence that American scientists found much that was valuable but unknown to them from decades of Soviet research, the share of citations to Soviet papers by American mathematicians also rose sharply after their isolation from each other ended.

Share of citations by American mathematicians to Soviet research; from Borjas and Doran (2012)

It is challenging to compare the magnitudes across all these different lines of evidence. On the one hand, the fact that increased isolation of scientific communities is associated with significant divergence in scientific knowledge seems consistent with the evidence from simultaneous discovery that there is no strong guarantee an idea will have a backup discoverer in the next few decades. On the other hand, the evidence from simultaneous discovery suggests divergence should be pretty severe. Taking the probability of simultaneous discovery from the preceding sections very literally, most of the ideas discovered in the USSR will be unknown in the USA if they are actually completely isolated from each other.

Is that the case? This is miles outside my area of expertise, so I can’t really say, but much of Borjas and Doran’s paper documents that, while the math labor market impacts of the collapse of the USSR were pretty large, it is far from clear it had seismic effects on the quality of math. Total papers didn’t change much, nor did the citations to those papers. Maybe that’s because the most important ideas were communicated across the iron curtain. But maybe it’s also because more important ideas are more likely to be independently discovered.

Is the Probability of Independent Discovery the Same?

The second issue with my naive calculations was I assumed every idea had the same probability of multiple independent discovery. It would seem quite sensible that more important discoveries have more people looking for them, and so face a higher rate of multiple independent discovery. That might also explain why it’s possible to draw up compelling lists of multiple discoveries; the most famous discoveries are also the ones most likely to have multiple inventors.

In fact, we do have a lot of evidence this is the case. The cleanest evidence we have on this is probably from structural biology. In a complementary paper, Hill and Stein try to estimate just how important the structure of a given protein is and then directly compare this to how many groups are actively working on figuring out the structure of a protein (something they can do, because data in this field is so good). To estimate how important a discovery might be, they fit a statistical model that tries to predict how many citations a paper about a given protein structure will get, based on data about it that would have been available to any scientist prior to beginning work on the protein. This includes stuff like “how many other papers have been written on this protein in the past” and also stuff like “is this protein found in humans?”

The figure below shows how the predicted citation value of a protein is related to the number of groups doing research on a protein. There is a pretty strong positive correlation: proteins with the kinds of characteristics that get lots of citations attract more investigation. And in fact, since the vertical axis is in log units, the correlation is actually much stronger than it seems - the highest potential proteins appear to garner way more interest. In the main data they use for their analysis, the protein cluster with the most submissions gets almost 50 times as many submissions as the median protein cluster.

We can also get some confirmatory descriptive evidence that doesn’t depend on doing anything fancy, like trying to predict citations based on protein characteristics. In Hill and Stein (2020), papers on proteins subject to multiple discovery tend to get 26 citations in the next five years, as compared to 17 citations among those discovered by just one scientist (or team of scientists).

Hagstrom (1974) offers some fuzzier evidence that is consistent with the view that high impact work is more likely to be discovered simultaneously by multiple people. In his survey of scientists, he found those who had received more citations were more likely to report having been anticipated at some point in their career. In other words, scientists working on topics that went on to be highly cited were also more likely to report being scooped at some point.

Lastly, some evidence from the patent office is also consistent with the notion that more important discoveries are more likely to be independently invented by multiple people. Cotropia and Schwartz (2018) have data on a sample of 1.4mn US patents issued between 1999 and 2007, including whether they were cited as the basis for rejecting a subsequent patent application (filed between 2008 and 2017), because of a lack of novelty. Optimistically, this data can be read as a way of seeing which kinds of patents are most likely to be reinvented later, though I think there are important caveats to this and so I wouldn’t lean on them too heavily. But supposing we set those aside, Cotropia and Schwartz show patents are more likely to have this kind of infringing reinvention occur if the patent is more valuable by various metrics (how much it gets cited, the probability the owner pays the patent’s renewal fees, the probability it’s the subject of litigation, etc.). That’s consistent with the most valuable inventions attracting more inventors, with most of the follower inventors being bounced out by the patent office for failing to be first.

How Much Redundancy is in Innovation?

So all together, what does this imply?

Pick a discovery or innovation at random, and the probability it has much in the way of built-in redundancy is probably pretty small. I think it is quite plausible that for most papers or patents, if you erased them from history, no one else would independently reproduce the work in the next two decades.

But that’s for a discovery selected at random. If you pick a patent or paper at random, in all likelihood it won’t be a particularly impactful patent or paper. With innovation, a small number of hits appear to have a disproportionate impact on the direction of a discipline or industry. It seems plausible that the most promising ideas attract many times as many potential discoverers as a randomly selected paper. If the annual probability of getting scooped on an important paper is 10% instead of 2-3%, that implies something quite different about long-term redundancy. With a 10% annual probability of discovery, the probability that no one makes the discovery in twenty years drops from over 50% to just 12%.
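To make the comparison concrete, here is the same constant-probability calculation applied to the three annual rates mentioned above. It is a sketch under the simplifying assumption of a constant, independent chance of discovery each year; the 10% figure is the hypothetical value from the text, not an estimate from data, and the function name is mine.

```python
def prob_undiscovered(annual_rate: float, years: int = 20) -> float:
    """Chance nobody makes the discovery within `years`, assuming a
    constant and independent annual probability of discovery."""
    return (1 - annual_rate) ** years


for rate in (0.02, 0.03, 0.10):
    print(f"annual rate {rate:.0%}: {prob_undiscovered(rate):.0%} chance of no discovery in 20 years")
# annual rate 2%: 67% chance of no discovery in 20 years
# annual rate 3%: 54% chance of no discovery in 20 years
# annual rate 10%: 12% chance of no discovery in 20 years
```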

That, in turn, suggests there is a lot of redundancy in the most important ideas and inventions, but not in the details. The main trunk of humanity’s scientific and technical knowhow is pretty robust, but the positions of the branches and twigs are not.

(It would be interesting to see if something like this can be systematically detected in the parallel histories of, say, Soviet and US mathematics. We know there were large differences, but if you take the top 100 most highly cited Soviet mathematics papers that were not communicated outside the USSR, how many of them refer to a comparable discovery made on the other side of the iron curtain?)

Where the real fragility lies would seem to be among ideas that are important in retrospect but not in prospect. Those are the ones that don’t attract a lot of attention and so there are not many shots on goal; if a scientist studying the topic gives up, it may be a very long time until someone else makes the discovery. But when those discoveries do happen they can have a big effect.

Bottom line: if we can see an idea is going to be important, there is probably a good chance of multiple independent discovery, which builds in a bit of redundancy. But in all other cases, all bets are off.

As always, if you want to chat about this post or innovation in general, let’s grab a virtual coffee. Send me an email at mattclancy at hey dot com and we’ll put something in the calendar.

New Things Under the Sun is produced in partnership with the Institute for Progress, a Washington, DC-based think tank. You can learn more about their work by visiting their website.

Appendix: Later Reinvention in Patents(?)

In order to be granted, a patented invention needs to be novel, non-obvious, and useful. So maybe one way we can get a handle on how common independent re-discovery is would be to look at how frequently a patent application gets rejected because it isn’t novel: someone else has already had the idea.

Cotropia and Schwartz (2018) have data on this. For a sample of 1.4mn US patents issued between 1999 and 2007, slightly more than 200,000 patents were cited as the basis for rejecting a subsequent patent application (filed between 2008 and 2017), because of a lack of novelty. In other words, over a future 10-year period, for about 14% of patents, a later inventor submitted something to the patent office close enough to them that it was bounced back for being insufficiently original.

Does that mean 14% of inventions would have been rediscovered in time? Well, not exactly.

For one: patents typically make more than one claim, and it may be that only a subset of claims are not novel. These aren’t necessarily one-for-one rediscoveries; there is just some overlap. And naturally, since the patent examination process is a back-and-forth iterative process, an applicant might be wise to try and make a broad claim initially, which they are prepared to walk back if the patent examiner rejects it. These kinds of broader claims are probably more likely to bump up against pre-existing patents.

Is there a way to separate out cases where a patent application is a close match to an existing patent from cases where there is just a bit of overlap that the patent applicant can get around by amending their application? Well, Frakes and Wasserman (2014) have some complementary data on the fate of 1.4mn patent applications filed after March 2001 and decided one way or another by July 2012. Of this set, 56% were rejected at some point due to lack of novelty, and 32% of patent applications were ultimately not granted at all.

That doesn’t mean 32% of patent applications failed because they lacked novelty. An application could be rejected for a lot of other reasons too: too obvious, failure to disclose enough, not patentable material, etc. But just to get an order of magnitude, let’s make an unrealistic assumption and suppose that 32% of patent applications really did fail to be granted because they lacked novelty. Given that 56% of patent applications were dinged for insufficient novelty at some point, and since 32% divided by 56% is about 57%, that suggests at best something like 57% of novelty rejections are strong enough to scuttle the whole patent application. Maybe we can consider these cases of genuine reinvention.

Now let’s go back to the Cotropia and Schwartz data that says 14% of patents were subsequently cited as the basis for rejecting a non-novel patent application. If, at best, 57% of the time these rejections are severe enough to stop the patent application from moving forward, then that implies, at best, something like 8% of patents were reinvented by someone.

So let’s say, at most, something on the order of 8% of patents are reinvented and someone submits something to the patent office close enough to the prior patent that the application is rejected. If we take the number 8% per decade naively, it implies an annual probability of reinvention on the order of 0.8%.
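The chain of back-of-the-envelope steps above can be retraced in a few lines of Python. This is only a sketch of the arithmetic, using the figures quoted in this appendix; the variable names are made up for illustration, and "slightly more than 200,000" is rounded down to 200,000. The last line is an extra assumption not used in the text: it converts the per-decade figure to an annual rate by assuming a constant yearly probability that compounds, rather than dividing by ten.

```python
# Figures quoted in the appendix (Cotropia and Schwartz 2018; Frakes and Wasserman 2014).
patents_in_sample = 1_400_000          # US patents issued 1999 to 2007
patents_cited_in_rejection = 200_000   # "slightly more than 200,000", rounded down here
share_rejected_for_novelty = 0.56      # applications hit with a novelty rejection at some point
share_never_granted = 0.32             # applications ultimately not granted
window_years = 10                      # later applications filed 2008 to 2017

# Step 1: share of patents later cited in a novelty rejection (about 14%).
share_cited = patents_cited_in_rejection / patents_in_sample

# Step 2: upper bound on the share of novelty rejections that sink the application (about 57%).
share_fatal = share_never_granted / share_rejected_for_novelty

# Step 3: upper bound on the share of patents "reinvented" over the decade (about 8%).
reinvented_per_decade = share_cited * share_fatal

# Step 4: naive annual rate, as in the text (about 0.8%).
naive_annual = reinvented_per_decade / window_years

# Assumption (not in the text): constant annual probability, compounded over ten years.
compounded_annual = 1 - (1 - reinvented_per_decade) ** (1 / window_years)

print(f"cited in a rejection:  {share_cited:.1%}")
print(f"fatal rejections:      {share_fatal:.1%}")
print(f"reinvented per decade: {reinvented_per_decade:.1%}")
print(f"naive annual rate:     {naive_annual:.2%}")
print(f"compounded annual:     {compounded_annual:.2%}")
```

The naive and compounded annual rates land very close together, so the choice between them does not change the order of magnitude.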

But as noted in the main piece, there are a lot of additional complications with this number and I hesitate to put much weight on it. Maybe this understates the probability of reinvention, because in most cases a quick patent search reveals to would-be inventors that their idea has already been done and so they stop doing work on it. But maybe it overstates the probability of reinvention, because it includes a lot of people who copied existing ideas and then tried to sneak something past the patent office.

For example: (Clancy 2020, Bikard 2020)

If you are scooped over 2 years with 2% probability, then the probability you are not scooped is 98%. The probability you are not scooped in any given year is the square root of 0.98 and the probability of not being scooped in twenty years is the square root of 0.98 raised to the power of 20, or 0.817. Similarly for 3% annual probability of being scooped.
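The same calculation can be written as a small Python function, which makes it easy to try other inputs. This is a sketch only: the function name and parameters are hypothetical, and it assumes the chance of being scooped is the same in every year.

```python
def prob_not_scooped(p_scooped_in_window, window_years, horizon_years):
    """Probability of not being scooped over horizon_years, given the
    probability of being scooped over a window of window_years.

    Assumes a constant, independent chance of being scooped each year.
    """
    # Annual survival: for a 2-year window this is the square root of (1 - p).
    annual_survival = (1 - p_scooped_in_window) ** (1 / window_years)
    # Compound the annual survival probability over the full horizon.
    return annual_survival ** horizon_years


# The case worked through above: 2% over 2 years, extended to 20 years.
twenty_year_survival = prob_not_scooped(0.02, window_years=2, horizon_years=20)
print(round(twenty_year_survival, 3))  # 0.817
```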

While Hill and Stein do find the penalty for getting scooped is not nearly as high in academia as scientists believe (most scooped scientists publish and end up getting just 20% fewer citations than the scoop-er), their context is cases where scientists really have finished all the work prior to learning they have been scooped. I suspect scientists are much less forgiving in cases where a project is begun after it was possible to learn similar work already had been published. In fact, Hill and Stein show, even in their context, the citation penalty grows significantly as the gap between publications grows.

Note, this was after controlling for the total number of papers published, so this isn’t simply capturing a spurious link, where people who publish more have more opportunities to be anticipated and also have more things that are citable.
