Without Code for DeepMind’s Protein AI, One Lab Wrote Its Own

For biologists who study the structure of proteins, the recent history of their field is divided into two epochs: before CASP14, the 14th biennial round of the Critical Assessment of Protein Structure Prediction conference, and after. In the decades before, scientists had spent years slowly chipping away at the problem of how to predict the structure of a protein from the sequence of amino acids that it comprises. After CASP14, which took place in December 2020, the problem had effectively been solved, by researchers at the Google subsidiary DeepMind.

A research company focused on a branch of artificial intelligence known as deep learning, DeepMind had previously made headlines by building an AI system that beat the Go world champion. But its success at protein structure prediction, which it achieved using a neural network called AlphaFold2, represented the first time it had built a model that could solve a problem of real scientific relevance. Helping scientists figure out what proteins look like can facilitate research into the inner workings of cells and, by revealing ways to inhibit the action of particular proteins, potentially aid in the process of drug discovery. On July 15, the journal Nature published an unedited manuscript detailing the workings of DeepMind’s model, and DeepMind shared its code publicly.

But in the seven months since CASP, another team had taken up that mantle. In June, a full month before the publication of DeepMind’s manuscript, a team led by David Baker, director of the Institute for Protein Design at the University of Washington, released their own model for protein structure prediction. For a month, this model, called RoseTTAFold, was the most successful protein prediction algorithm that other scientists could actually use. Though it did not reach the same peaks of performance as AlphaFold2, the team ensured the model would be accessible to even the least computationally inclined scientist by building a tool that allowed researchers to submit their amino acid sequences and get back predictions, without getting their hands dirty with computer code. A month later, on the very same day that Nature released the DeepMind early manuscript, the journal Science published the Baker lab’s paper describing RoseTTAFold.

Both RoseTTAFold and AlphaFold2 are complex, multilayered neural networks that output predicted 3D structures for a protein when given its amino acid sequence. And they share some interesting design similarities, like a “multitrack” structure that allows them to analyze different aspects of protein structure separately.

These similarities are no coincidence—the University of Washington team designed RoseTTAFold using ideas from the DeepMind team’s 15-minute presentation at CASP, in which they outlined the innovative elements of AlphaFold2. But they were also inspired by the uncertainty that followed that short talk—at that point the DeepMind team had given no indication about when it would give scientists access to its unprecedented technology. Some researchers were worried that a private company might buck standard academic practice and keep its code from the broader community. “Everyone was floored, there was a lot of press, and then it was radio silence, basically,” says Baker. “You’re in this weird situation where there’s been this major advance in your field, but you can’t build on it.”

Baker and Minkyung Baek, a postdoctoral fellow in his lab, saw an opportunity. They might not have had the code that the DeepMind team used to solve the protein structure problem, but they knew that it could be done. And they also knew, in general terms, how DeepMind had done it. “Even at that point, David was saying, ‘This is an existence proof. DeepMind has shown these sorts of methods can work,’” says John Moult, a professor at the University of Maryland College Park’s Institute for Bioscience and Biotechnology Research and organizer of the CASP event. “That was enough for him.”

With no knowledge of when—or if—the DeepMind team might make their tool available to the structural biologists who hoped to use it, Baker and Baek decided to try to build their own version.

Figuring out the three-dimensional structure of proteins is essential to understanding the inner workings of cells, says Dame Janet Thornton, director emeritus of the European Bioinformatics Institute. “The DNA codes for everything, but it doesn’t really do anything,” she says. “It’s the proteins that do all the work.” Scientists have used a variety of experimental techniques to try to figure out protein structure, but sometimes the data simply isn’t informative enough to provide a clear answer.

A computer model that uses a protein’s unique sequence of amino acids to predict what it might look like can help researchers figure out what that confusing data means. For the past 27 years, CASP has given scientists a systematic way to evaluate the performance of their algorithms. “The progress has been consistent, but rather slow,” Thornton says. But with AlphaFold2, she continues, “the improvement was pretty dramatic—more dramatic than we’ve seen for many years, actually. And so in that respect, it was a step change.”

The Baker Lab had achieved the second-best performance at CASP14 with a model of their own, which gave them a solid place to start when it came to reproducing DeepMind’s method. They systematically compared what DeepMind team members had said about AlphaFold2 to their own approach, and, once they had identified DeepMind’s most important advancements, worked on building them into a new model, one by one.

One crucial innovation that they adopted was the idea of a multi-track network. Most neural network models process and analyze data along a single “track,” or path through the network, with successive layers of simulated “neurons” transforming the outputs of the previous layer. It’s a bit like the players in a game of telephone transforming the words they hear into the words they whisper into the ear of the person next to them—only in a neural network information is gradually rearranged into a more useful form, rather than degraded, like in the game.

DeepMind designed AlphaFold2 to segregate different aspects of protein structure information into two separate tracks that fed some information back to each other—like two separate games of telephone happening in parallel, with adjacent players passing some information back and forth. RoseTTAFold, Baker and Baek found, worked best with three.
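The parallel-telephone analogy can be made concrete with a toy sketch. This is not the architecture of AlphaFold2 or RoseTTAFold; the function names, the tanh transform, the mean used as a "summary," and the `mix` parameter are all assumptions chosen purely for illustration. It shows only the general idea of two tracks that each transform their own data layer by layer while passing a little information to each other.

```python
import math


def layer(values, weight):
    """One 'player' in the telephone game: transform every value it receives."""
    return [math.tanh(weight * v) for v in values]


def mean(values):
    return sum(values) / len(values)


def single_track_forward(track, weights):
    """Ordinary single-track network: each layer feeds only the next one."""
    for w in weights:
        track = layer(track, w)
    return track


def two_track_forward(track_a, track_b, weights_a, weights_b, mix=0.5):
    """Two tracks run in parallel. After every layer, each track adds in a
    summary (here, simply the mean) of the other track's current state.
    The mean and the `mix` factor are illustrative assumptions."""
    for w_a, w_b in zip(weights_a, weights_b):
        new_a = layer(track_a, w_a)
        new_b = layer(track_b, w_b)
        summary_a, summary_b = mean(new_a), mean(new_b)
        track_a = [a + mix * summary_b for a in new_a]
        track_b = [b + mix * summary_a for b in new_b]
    return track_a, track_b


if __name__ == "__main__":
    a, b = two_track_forward([0.1, 0.2], [0.3, 0.4, 0.5],
                             weights_a=[1.0, 2.0], weights_b=[0.5, 1.5])
    print(a, b)
```

With `mix=0.0` the two tracks never talk to each other and each behaves exactly like a single-track network; any nonzero `mix` lets what one track has worked out shape the other, which is the property the analogy describes.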

“When you draw some complicated figure, you don’t draw it all at once,” Baek says. “You will just start from very rough sketches, adding some pieces and adding some details step by step. Protein structure prediction is somewhat similar to this kind of process.”

To see how RoseTTAFold worked in the real world, Baker and Baek reached out to structural biologists who had protein structure problems that they couldn’t solve. At 7 pm one evening, David Agard, professor of biochemistry and biophysics at the University of California, San Francisco, sent them the amino acid sequence for a protein produced by bacteria infected with a particular virus. The structure predictions came back by 1 am. In six hours, RoseTTAFold had solved a problem that had bedeviled Agard for two years. “We could actually see how it evolved from a combination of two bacterial enzymes, probably millions of years ago,” Agard says. Now past this bottleneck, Agard and his lab could move forward in figuring out how the protein worked.

Even though RoseTTAFold hadn’t reached the same stratospheric level of performance as AlphaFold2, Baker and Baek knew then that it was time to release their tool into the world. “It was still clearly very useful, because these people were solving biological problems that in many cases had been outstanding for quite a long time,” Baker says. “We decided at that point, ‘Well, it’s good for the scientific community to know about this and have access to this.’” On June 15, they released the tool that allowed anyone to easily run their model, as well as a preprint of their forthcoming Science paper.

Unbeknownst to them, an extensive scientific paper detailing DeepMind’s system was already under review at Nature, according to John Jumper, who leads the AlphaFold project. DeepMind had submitted its manuscript to Nature on May 11.

At that point, the scientific community knew little about DeepMind’s timeline. That changed three days after Baker’s preprint became available, on June 18, when DeepMind CEO Demis Hassabis took to Twitter. “We’ve been heads down working flat out on our full methods paper (currently under review) with accompanying open source code and on providing broad free access to AlphaFold for the scientific community,” he wrote. “More very soon!”

On July 15, the very same day that Baker’s RoseTTAFold paper was published, Nature released DeepMind’s unedited but peer-reviewed AlphaFold2 manuscript. Simultaneously, DeepMind made the code for AlphaFold2 freely available on GitHub. And a week later, the team released an enormous database of 350,000 protein structures that had been predicted by their method. The revolutionary protein prediction tool, and a vast volume of its predictions, were at last in the hands of the scientific community.

According to Jumper, there’s a banal reason why DeepMind’s paper and code weren’t released until more than seven months after the CASP presentation: “We weren’t ready to open source or put out this extremely detailed paper that day,” he says. Once the paper was submitted in May, and the team was working through the peer review process, Jumper says they tried to get the paper out as soon as possible. “We had honestly been pushing as fast as we could,” he says.

The DeepMind team’s manuscript was published through Nature’s Accelerated Article Preview workflow, which the journal most frequently uses for Covid-19 papers. In a statement to WIRED, a spokesperson for Nature wrote that this process is intended “as a service to our authors and readers, in the interests of making particularly note-worthy and time-sensitive peer reviewed research available as quickly as possible.”

Jumper and Pushmeet Kohli, lead of DeepMind’s science team, demurred about whether Baker’s paper factored into the timing of their Nature publication. “From our perspective, we contributed and submitted the paper in May, and so it was out of our hands, in some sense,” Kohli says.

But CASP organizer Moult believes that the University of Washington team’s work may have helped DeepMind scientists convince their parent company to make their research freely available on a shorter timescale. “My sense from knowing them—they are really outstanding scientists—is that they would like to be as open as possible,” Moult says. “There is some tension there, in that it’s a commercial enterprise, and in the end it’s got to make money somehow.” The company that owns DeepMind, Alphabet, has the fourth highest market cap in the world.

Hassabis characterizes the release of AlphaFold2 as a benefit to both the scientific community and Alphabet. “This is all open science and we’re giving this to humanity, no strings attached, the system, the code, and the database,” he said in an interview with WIRED. Asked whether there was any discussion about keeping the code private for commercial reasons, he said, “It’s a good question how we deliver value. Value can be delivered in a lot of different ways, right? One is obviously commercial, but there’s also prestige.”

Baker is quick to praise the DeepMind team for the thoroughness of their paper and code release. In a sense, he says, RoseTTAFold was a hedge against the possibility that DeepMind would not act in the spirit of scientific collaboration. “If they had been less enlightened and decided not to [release the code], then there at least would have been a starting point for the world to build on,” he says.

That said, he feels that if the information had been released earlier, his team could have worked on pushing AlphaFold2 to perform even better or adapting it to the problem of designing artificial proteins, which is the Baker Lab’s main focus. “There’s no doubt that if, say, in the beginning of December, after CASP, they had said, ‘Here’s our code, and this is how we did it’ … we would be way, way further ahead,” Baker says.

And time could be of the essence for some of the real-world applications of protein structure prediction. Understanding the three-dimensional structure of a protein that’s essential to the survival of a pathogen could help scientists develop drugs to fight that pathogen, for instance. The applications could even extend to the pandemic; for example, DeepMind used a version of AlphaFold2 to predict the structures of some SARS-CoV-2 proteins last August.

Baker thinks that questions about information sharing between academia and industry will only grow more pressing. Problems in artificial intelligence require enormous time and resources to solve, and companies like DeepMind have access to personnel and computing power on a scale unimaginable for a university lab. “It’s almost certain that the major advances will continue to be made at companies, and I think this will only accelerate,” Baker says. “There’s going to be internal pressure at those companies [about] whether to make the advances public, like DeepMind did here, or to monetize them.”

Additional reporting by Will Knight.