Measuring Knowledge Spillovers: The Trouble with Patent Citations

A paper trail for knowledge spillovers? Only kind of...

Matt Clancy

Apr 20, 2021

As a source of data for studying innovation, patents are really seductive. There’s nothing else quite like them:

detailed descriptions of millions of inventions…
dating back over a century...
from all corners of the economy...
validated by an expert as novel, useful, and non-obvious.

At the same time, they have real biases. Not every invention gets patented (in fact, it may be more common that inventions don’t get patented; a topic for another day), and not every patent represents a useful invention. Worse, the set of things that get patented isn’t just a random sample. There are systematic differences in the kinds of things that get patented and the kinds of things that don’t. It’s a constant source of tension for me, as someone who writes about social science research on innovation, the majority of which is based on patent data: does this cool result hold up when we use alternatives to patent data?

What we’re going to talk about today though is just one aspect of the patent problem: what are patent citations really telling us?

Patent citations are the citations patents make to “prior art”: those include other patents, as well as academic journal articles, patent applications, legal documents, and other stuff. We’re just going to focus on citations to patents today - many of the issues highlighted here don’t necessarily apply to the other kinds of documents.

Why do we care about patent citations? Because one of the most important things about innovation - as compared to other activities - is that knowledge spills over to new applications. Understanding exactly how and why that happens is a big part of the problem of understanding how innovation happens, and citations promise a very clean way to identify these spillovers. Or at least, they seem to. (See the list of posts I’ve made about papers using patent citations to measure knowledge flows at the end of this post.)

I think most researchers who begin to study patents (including myself, when I began to use them as a dataset) start out by thinking of patent citations in the same way that we think of the citations we make in our academic work. For an academic, citations are simply addresses for ideas we are referencing, whether to build on them, critique them, or nod to them. In short, they provide a list of the ideas we are engaging with when we make our own intellectual contribution. If a patent is basically like the invention equivalent of a research project, and a patent citation is basically like the citation in a research paper, then they are a great way to measure the flows and uses of knowledge.

And sometimes, that’s basically what a citation in a patent is! But just as often, it’s not. In this post I’m going to walk through some of the issues we have in using citations as a proxy for knowledge flows, but then ultimately argue they’re still useful in some contexts, and especially when used in concert with different forms of evidence that have different strengths and weaknesses.

Patent Citations Have Many Authors

A first important distinction between citations in academia and citations in patents (to patents) is that whereas the former tend to be added exclusively by the authors, many parties besides the inventor add citations to a patent. This gets to the differing rationales for citations in academic papers and patents.

Citations in patents are mainly about establishing that an invention is eligible for a patent, in the sense that the invention is novel and makes a non-obvious improvement on any existing work. In the USA, inventors have the duty to disclose all relevant information they are aware of (or risk having their patent invalidated upon challenge), which could imply inventors will cite any patents for inventions whose ideas they improved on. But many other citations may be added to show the invention is legally patentable (e.g., citing a famous patent for a GMO crop to establish GMO crops are patentable), or that its improvements are not obvious (e.g., by citing improvements made by other patents, only to argue they are distinct). Importantly, these citations can be added by patent attorneys or by the patent examiner evaluating the application. Or citations may be added by the inventor after they have “completed” their invention and begun the research needed to secure a patent.

So, for a variety of reasons, it may be that a patent is cited even though the inventor was completely unaware of it while doing the R&D that resulted in the invention. To get a sense of how common this is, we can look at a survey by Jaffe, Trajtenberg, and Fogarty, conducted in the 1990s. They simply used the addresses that inventors listed on their patents to mail a bunch of them surveys asking about a citation they had made. Only 38% of 166 respondents knew about the cited patent before or during the invention process; that is, the majority of citations do not represent “knowledge flows” at all, since the inventor only became aware of the cited patent after the invention was complete.

Moreover, in 2002 the US Patent and Trademark Office began to report when a citation was added by a patent examiner, the applicant, or “other parties.” For citations made between 2005 and 2014, around a quarter of citations were added by the patent examiner. Again, that means a large share of citations don’t seem to be measuring knowledge flows, in the sense that they’re not even added by the inventor.

What’s the bottom line? A fraction of citations probably do correspond to genuine knowledge flows - but only a fraction. In Jaffe, Trajtenberg, and Fogarty’s survey, they ask respondents what they learned from the cited patent, and many of the answers correspond to the kind of thing we want citations to be: inventors said they learned about a concept that could be improved, that the idea was technically feasible, or other information useful for development. These kinds of citations are probably in the minority, but they’re there.

Inventors May Play Games

So far we have two problems with citations. First, they are frequently not added by the inventors themselves. Second, even the citations an inventor adds do not necessarily serve as a simple record of the ideas that were useful in the invention process - that’s not what the citations are for. A third problem arises when patent applicants intentionally omit relevant patents and instead choose their citations strategically, in the belief that this will make their application more likely to be granted or their patent less likely to be invalidated.

As noted above, applicants have a legal duty to disclose all relevant information; failure to do so means a patent could - in theory - be invalidated in a subsequent court challenge. This creates an incentive not to withhold relevant information. On the other hand, if you draw the patent examiner’s attention to the existence of a patent covering some aspect of your invention, you may have to settle for a narrower set of claims about what’s protected under your patent. So there is a gamble in play - or at least, the perception of a gamble by the applicant: if you can get away with citing less, you may be able to get a patent covering a wider range of things, but at the risk of your patent not holding up in court. It looks like how patent applicants think about this gamble matters and has changed over time.

Take Lampe (2012). Lampe is looking for evidence that applicants are intentionally withholding relevant citations during their applications. To do that, he makes the assumption that applicants probably know about patents that they or their coauthors have previously cited in other patents. Then, he looks at the citations the patent examiner added; these are citations that the examiner has decided are relevant to the patent application. If the applicant knew about these patents but did not supply them, and the patent examiner thinks they are relevant, it’s possible the applicants were trying to sneak something by the examiner. (Of course, it doesn’t prove anything, but it’s suggestive.)
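To make that test concrete, here is a minimal sketch of how such a measure could be computed. This is not Lampe's actual code or data: the `Citation` record, the `withholding_rate` function, and the toy patents are all hypothetical, and the choice to treat only applicant-added citations on earlier patents as evidence of what an inventor "knew" is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    citing: str    # patent making the citation
    cited: str     # prior patent being cited
    added_by: str  # "applicant" or "examiner"


def withholding_rate(citations, inventors_of, filing_order):
    """Share of 'known' citations that the examiner, not the applicant, added.

    A citation counts as 'known' if some inventor on the citing patent was
    on an earlier patent whose applicant-added citations include the same
    prior art (an assumption of this sketch).
    """
    seen_by = {}  # inventor -> prior art they have cited on earlier patents
    known = withheld = 0
    for patent in filing_order:  # oldest first
        team = inventors_of[patent]
        cites = [c for c in citations if c.citing == patent]
        for c in cites:
            if any(c.cited in seen_by.get(inv, set()) for inv in team):
                known += 1
                withheld += c.added_by == "examiner"
        # Only after scoring this patent, record what its team has now cited.
        for c in cites:
            if c.added_by == "applicant":
                for inv in team:
                    seen_by.setdefault(inv, set()).add(c.cited)
    return withheld / known if known else float("nan")


# Toy data, invented for illustration.
citations = [
    Citation("P1", "X", "applicant"), Citation("P1", "Y", "applicant"),
    Citation("P2", "X", "applicant"), Citation("P2", "Y", "examiner"),
    Citation("P2", "Z", "applicant"),
    Citation("P3", "X", "examiner"), Citation("P3", "Y", "applicant"),
    Citation("P3", "Z", "applicant"), Citation("P3", "W", "examiner"),
]
inventors_of = {"P1": ["A", "B"], "P2": ["A"], "P3": ["B", "C"]}

rate = withholding_rate(citations, inventors_of, ["P1", "P2", "P3"])
print(rate)  # 0.5: of 4 citations the teams plausibly knew, the examiner added 2
```

As in the text, a high rate is only suggestive: an examiner-added citation to a patent the team had cited before could also be an honest oversight.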

Does this happen much? Yeah!

To see how much it matters, let’s pick a patent and scrutinize its citations. Let’s focus on the set of citations it makes to patents that had also been previously cited by one of the co-inventors on this patent. We’re going to assume the co-inventors knew about these patents when they made their application. Even though the co-inventors knew about these patents, about 20% were not added by the applicants, but instead judged relevant and added by the examiner.

Well, maybe it was an honest mistake. But slightly more suspicious is the fact that the extent of this withholding is lowest for the patents typically considered most valuable (and therefore with the most to lose if the patent gets invalidated); that is, patents for drug and chemical technologies, and patents that get the most citations in the future (a common proxy for the value of a patent). On the other hand, larger firms are more likely to withhold potentially relevant citations, possibly because they have more resources to defend challenged patents or maybe because holding so many patents makes them less risk averse.

That all indicates that citations may be missing relevant work. We don’t know what applicants fail to cite and the examiners fail to catch. But there is also the opposite problem: citation of irrelevant work.

Why would a patent applicant cite irrelevant work? One potential rationale is that it could be another way to try and sneak something past the examiner (or, as importantly, the applicant might believe this, whether or not it’s true) by hiding an important citation in tons of meaningless citations, so that the examiner doesn’t have time to scrutinize it. But a more benign rationale could be that the applicant has an overly generous interpretation of the duty to disclose all relevant information. Since a patent can be invalidated for failing to cite relevant material, and since it’s costless for the applicant to cite more things, why not cite everything that is even remotely relevant? This problem can be especially salient for an applicant submitting many linked and interrelated patents; instead of trying to parcel out which citation should be rolled over from one patent to another, why not just copy them all?

Unlike the problem of omitted citations, this one shows some evidence of having become much more severe in the last decade. Kuhn, Younge, and Marco (2020) document the rise of super-citing patents - a relatively small share of patents that cite so many other patents that they skew the entire landscape of citations. It is most common for patents to cite fewer than 20 other patents; in 2014, 75% of patents fell into this category. Sometimes, however, it is appropriate for a patent to cite more patents, perhaps up to 100. In these cases, it becomes difficult for a patent examiner to carefully check every citation offered. Still, 95% of patents in 2014 made fewer than 100 citations, with the vast majority making fewer than 20.

That remaining 5% is causing problems. These patents - which were all but non-existent prior to the year 2000 - cite more than 100 patents each. In fact, this small number of super-citing patents now accounts for nearly half (46%) of all patent citations, even though they comprise under 5% of patents!
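The arithmetic behind that kind of skew is easy to see with a toy example. The sketch below uses made-up citation counts, picked only so that the shares land near the ones quoted above; they are not the actual 2014 data, and `citation_shares` is a hypothetical helper.

```python
def citation_shares(backward_counts, threshold=100):
    """Return (share of patents, share of all citations) accounted for by
    patents that cite more than `threshold` other patents."""
    heavy = [n for n in backward_counts if n > threshold]
    total_patents = len(backward_counts)
    total_cites = sum(backward_counts)
    patent_share = len(heavy) / total_patents if total_patents else 0.0
    cite_share = sum(heavy) / total_cites if total_cites else 0.0
    return patent_share, cite_share


# Invented population of 100 patents: 75 cite 10 patents each,
# 20 cite 50 each, and 5 "super-citers" cite 300 each.
counts = [10] * 75 + [50] * 20 + [300] * 5

patent_share, cite_share = citation_shares(counts)
print(f"{patent_share:.0%} of patents make {cite_share:.0%} of citations")
# prints "5% of patents make 46% of citations"
```

Because each super-citer contributes hundreds of citations, any analysis that weights every citation equally ends up dominated by a handful of patents.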

Moreover, the quality of the citations made by these super-citers is highly dubious. Kuhn, Younge, and Marco (2020) compute the similarity of the text of citing and cited patents (based on the degree to which they contain the same words that are otherwise uncommon). As indicated in the figure below, the similarity of citing and cited patents falls steadily as the number of citations a patent makes rises.

What’s particularly worrying is that Kuhn, Younge, and Marco (2020) show the rising share of low-quality citations is eroding the usefulness of citations for studying patents. The average textual similarity of citing and cited patents has been declining for decades, as the share of citations associated with super-citers rises.
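For readers who want a feel for what a text-similarity score based on shared uncommon words looks like, here is a minimal sketch using TF-IDF weights and cosine similarity, a standard technique of that general kind. I am not claiming this is the measure Kuhn, Younge, and Marco actually use; the function names and the three one-line "patents" are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Map each document name to a {word: weight} vector.

    Words that appear in few documents get a larger weight, so sharing a
    rare word counts for more than sharing a common one.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n = len(docs)
    doc_freq = Counter(word for c in counts.values() for word in c)
    # Smoothed inverse document frequency (one common variant).
    idf = {w: math.log((1 + n) / (1 + d)) + 1 for w, d in doc_freq.items()}
    return {name: {w: tf * idf[w] for w, tf in c.items()}
            for name, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


# Invented one-line stand-ins for patent text.
docs = {
    "citing": "a drought tolerant maize plant with a modified root gene",
    "close": "a maize plant expressing a gene for drought tolerant roots",
    "far": "a hinge for a folding ladder with a locking pin",
}
vectors = tfidf_vectors(docs)
sim_close = cosine(vectors["citing"], vectors["close"])
sim_far = cosine(vectors["citing"], vectors["far"])
print(sim_close > sim_far)  # True: the related pair shares more rare words
```

A citation from "citing" to "far" would score low on a measure like this, which is the sense in which a citation can be called low quality.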

The authors also replicate a few canonical results from the economics of innovation that rely on patent citations, and show that these results are affected by the decline in the quality of citations.

For example, a famous result in the patent literature showed that firms whose patents receive more citations have higher stock market valuations than otherwise observationally similar firms whose patents receive fewer citations. But the magnitude of this correlation halved between 2003 and 2008; patent citations just don’t seem to be “worth” as much as they used to be, as judged by the market.

Another study - which I’ve mentioned before - used patent citations in the 1980s and 1990s to measure local knowledge flows. Essentially, it showed patents were more likely to cite the patents of local inventors than those of distant inventors working on the same kind of technology. This has long been an important line of evidence about the importance of local knowledge, and why innovation tends to happen in cities. Kuhn, Younge, and Marco update this study and show that the results differ significantly if you try to control for the rise of low-quality patent citations. If you do not control for them, the propensity to cite local work has remained stable and consistent; if you adjust for the quality of citations, this propensity has fallen considerably.

Time to give up?

So, uh, citations have problems. But it’s important to remember that, at the end of the day, there is genuinely useful information in a subset of patent citations and that some information is better than none. To begin, we have that old survey evidence that the inventor knew about 38% of the citations on their patent before or during the inventive process. And in another survey from the 1990s, inventors (this time in the EU) rated the patent literature as one of the most important sources of knowledge used to develop innovations (though the survey did not ask if the debt to other patents was reflected in citations).

In the last decade, partially in response to the dawning recognition that patent citations are not nice analogues for academic citations and partially due to the growing sophistication of natural language processing tools, the research community has begun developing much better tools for analysing the raw text of patents. Increasingly, scholars are tracking knowledge flows by looking at the similarity of the textual description of inventions in patents. Somewhat reassuringly, there is a lot of overlap between textual similarity and citations. Younge and Kuhn (2016) show that the similarity of text between patents is as good a predictor of citation as other methods based on the US patent classification system, while Feng (2020) shows that the texts of patents linked by a direct citation are 2.5x as similar to each other as a baseline, which is about the same as for patents that share an inventor.

Given all that, is it time to give up on patent citations? I don’t think so. They’re an imperfect source of information - but that’s life in the social sciences. The best we can do is understand the strengths and weaknesses of the data, and try to find cases where different kinds of data tell a mutually confirming story. In this newsletter, wherever possible, I try to complement patent-based papers with others. It’s not always possible (patent data is one-of-a-kind), but when it’s not, I consider the results more provisional than they would otherwise be, especially if the citation data is of a more recent vintage.

Here are just some posts on work that uses patent citations as a proxy for knowledge flows: