The possibilities and limits of using data to predict scientific discoveries

Published: Feb. 3, 2017

Amidst the vast and varied ecosystem of modern science, the emerging interdisciplinary field known as the “science of science” is exploring a difficult, but provocative, question: In the age of data science, are future discoveries now predictable?

In an article published this week in the journal Science, CU Boulder researcher Aaron Clauset and co-authors Daniel Larremore (Santa Fe Institute) and Roberta Sinatra (Central European University) examine the possibilities and limits of using massive data sets of scientific papers and information on scientific careers to study the social processes that underlie discoveries.

“There is more interest than ever in quantifying scientific behavior,” said Aaron Clauset, an assistant professor in CU Boulder’s Department of Computer Science and a faculty member in the BioFrontiers Institute. “The question is: Can we use the abundant data on the scientific process in order to make better predictions about scientific discoveries, which could improve funding decisions, peer review and hiring decisions?”

Historically, scientific discoveries have fallen on a spectrum between highly expected (such as the Higgs boson, which evidence pointed to years in advance) and entirely unexpected (such as penicillin, which arrived with minimal preceding research). Predicting such advances has value to scientists (when choosing a research field), funding agencies (who want to allocate dollars effectively), hiring committees (who want to hire successful faculty) and taxpayers (who fund a large percentage of research projects).

The recent proliferation of bibliographic databases such as Google Scholar, Web of Science, PubMed, ORCID and others has given researchers new tools for examining the scientific community as a whole, such as the number of citations a given article receives or how many journal articles a given researcher publishes. But do such metrics make some kinds of discoveries easier to predict?

Feedback loops

One problem with using such data to make predictions is that the scientific community, and the incentives it offers scientists, may be structured in a way that creates self-reinforcing feedback loops. In such loops, future opportunity depends on being lucky, which undermines the potential for other, less-heralded projects to advance science.

“We tend to reward and reinvest in people and subjects that have paid off in the past, but there’s no guarantee they will continue to do so. This can create a kind of purifying selection,” said Clauset, who is also an external faculty member at the Santa Fe Institute. “Ecology teaches us that the most robust systems in the face of uncertainty are diverse systems. We may be killing the golden goose of scientific discovery very slowly by focusing on minutiae at the expense of variety.”

Clauset’s data also call into question the conventional academic narrative that scientists reach an early productivity peak followed by a long, slow decline. In a related paper published in December 2016, he and his co-authors analyzed over 200,000 publications from 2,453 tenure-track faculty in all 205 PhD-granting computer science departments in the U.S. and Canada. They found that the conventional pattern accurately described only one-third of faculty, while the remaining two-thirds exhibited a wide variety of productivity patterns over the course of their careers.

Sleeping beauties

Another insight into the unpredictability of scientific advances comes from so-called “sleeping beauties.” While bibliographic data show that some aspects of scientific impact are predictable, the broad existence of “sleeping beauty” papers, which lie dormant for years before a sudden uptick in relevance, implies that some aspects of discovery may be fundamentally unpredictable. A notable example is a now-famous 1935 Albert Einstein paper on quantum mechanics that was only modestly cited for several decades before fairly recently becoming one of the most important papers in quantum mechanics.

“This suggests that there’s another scale to consider, one in which we need to zoom out even farther to understand how these various scientific fields and subfields are interacting with one another,” said Clauset.

The article also states that while publication data are useful in some ways, citations are fundamentally lagging indicators that only look backward, and thus may have limited utility for predicting the future.

Looking forward, Clauset and his co-authors suggest that better predictions could be made using data sets on scientific preprints, workshop papers, conference presentations and rejected grant proposals. Such databases—should they ever become available—might provide additional trends and insights that are not being captured currently by better illustrating how the frontier of scientific discovery is moving.

Overall, the authors state, the limits of data in predicting future advances point to the importance of maintaining a wide-ranging scientific community.

“We would be wise to hedge our bets by building a diverse ecosystem of scientists and approaches to science rather than focus on predicting individual discoveries,” said Clauset.

A workbench in a chemistry laboratory. Photo: Jean-Pierre / Wikipedia