AI solves the protein-folding problem

by Adam Lee on September 1, 2022

Artificial intelligence is giving rise to an explosion of new technologies. In just the last decade, rapid advances have allowed computers to catch up to humans on tasks from driving a car to sewing clothes to performing surgery.

However, even that understates the speed of technological advancement. What happens when computers surpass humans on tasks that require rational thought and understanding of the physical world?

For example, computers are turning their attention to one of the great unsolved problems of biology. With AI to help us, we may finally be able to crack it. And if we do, we’ll be poised to kick off a golden age of biotechnology and medicine.

The protein-folding problem

The genetic code is one of the marvels of evolution. We know the structure of DNA, the double helix that looks like a spiraling ladder. The rungs of that ladder are pairs of nucleotides. In genes that code for proteins, every set of three nucleotides, called a triplet or codon, stands for one amino acid, and amino acids are the building blocks of proteins.

To make a protein, the cell copies the gene into an intermediate molecule called messenger RNA. The messenger RNA leaves the nucleus and travels to a cellular machine called the ribosome. The ribosome reads the RNA, selects the amino acid for each triplet, and adds it to the growing protein, like welding a link onto a chain.
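To make the ribosome's job concrete, here is a minimal sketch of that read-three-letters, add-one-link loop. The table is an assumption for illustration: it holds only a handful of the 64 real codons, and the names `CODON_TABLE` and `translate` are hypothetical, not taken from any library.

```python
# Partial codon table: a few of the 64 real codons, enough for a demo.
CODON_TABLE = {
    "AUG": "Met",  # methionine, also the usual start signal
    "UUU": "Phe",
    "GGC": "Gly",
    "GCU": "Ala",
    "AAA": "Lys",
    "UGG": "Trp",
    "UAA": None,   # the three stop codons carry no amino acid
    "UAG": None,
    "UGA": None,
}


def translate(mrna: str) -> list[str]:
    """Read an mRNA string three letters at a time and return the amino-acid chain."""
    chain = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon} is not in this partial table")
        amino_acid = CODON_TABLE[codon]
        if amino_acid is None:  # stop codon: the chain is finished
            break
        chain.append(amino_acid)  # weld one more link onto the chain
    return chain


print(translate("AUGUUUGGCGCUAAAUGGUAA"))
# prints ['Met', 'Phe', 'Gly', 'Ala', 'Lys', 'Trp']
```

A real cell does far more than this (finding the start codon, proofreading, and so on), but the core lookup really is this simple.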

Some pairs of amino acids attract each other, while others repel each other. Once synthesis is complete, these physical forces cause the protein to fold into an intricate three-dimensional shape. The protein’s shape determines its function, from the actin and myosin proteins that make muscles contract, to the hemoglobin that carries oxygen in our blood, to the antibodies that latch onto invading viruses.

If we understood this process well enough to design a protein with a desired shape, we could solve any number of problems in biology. We could create magic-bullet drugs to kill cancer without harming the patient, artificial antibodies to halt viruses in their tracks, or new antibiotics that bypass bacterial resistance. We could treat diseases, like Alzheimer’s or Parkinson’s, that are thought to stem from the misfolding of proteins. We could create never-seen-in-nature enzymes to break down plastics, clean up oil spills, or capture carbon from the atmosphere.

Unfortunately, like fluid turbulence or the three-body problem, protein folding is a phenomenon where simple forces interact in dizzyingly complex ways. According to one famous paper, a protein with just 150 amino acids would have 10³⁰⁰ possible configurations—far more than the number of atoms in the observable universe. Brute-force solutions that test every possible configuration are so impractical as to be impossible.
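To get a feel for that number, here is a back-of-the-envelope sketch. Only the 150 amino acids and the 10³⁰⁰ figure come from the text above. The rest are assumptions for illustration: that each amino acid independently picks from a fixed number of states, that the universe holds roughly 10⁸⁰ atoms (a commonly cited estimate), and that a hypothetical computer could check 10¹⁸ configurations per second.

```python
import math

RESIDUES = 150                 # chain length from the text
TOTAL_LOG10 = 300              # 10^300 configurations, as quoted in the text
ATOMS_IN_UNIVERSE_LOG10 = 80   # commonly cited rough estimate (assumption)
CHECKS_PER_SECOND_LOG10 = 18   # hypothetical machine speed (assumption)
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# If every residue independently takes one of k states, the total is k ** RESIDUES.
# Solving k ** 150 == 10 ** 300 gives the implied states per residue.
states_per_residue = 10 ** (TOTAL_LOG10 / RESIDUES)

# Work in log10 so the huge numbers stay readable.
years_log10 = TOTAL_LOG10 - CHECKS_PER_SECOND_LOG10 - math.log10(SECONDS_PER_YEAR)
excess_over_atoms_log10 = TOTAL_LOG10 - ATOMS_IN_UNIVERSE_LOG10

print(f"Implied states per amino acid: {states_per_residue:.0f}")
print(f"Configurations outnumber atoms by a factor of 10^{excess_over_atoms_log10}")
print(f"Brute-force search time: about 10^{years_log10:.1f} years")
```

Under these assumptions, the search would take more than 10²⁷⁴ years. Changing the machine speed by a few orders of magnitude barely dents that exponent, which is why brute force is off the table.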

Until recently, the most reliable way to find the shape of a protein was experimental, through methods like X-ray crystallography. To do this, scientists first have to get the protein molecules to lock together in an orderly crystal, which is itself a laborious, trial-and-error task. Then they chill the crystal to supercold temperatures, bombard it with X-rays from a particle accelerator, and analyze the diffraction patterns. Finally, like solving a “3D jigsaw puzzle,” they have to figure out how the amino acids map onto the fuzzy image that results.

Enter AlphaFold

AlphaFold is an AI system created by the company DeepMind. It’s designed to simulate protein folding by predicting the distance between pairs of amino acids. Like many similar systems, it uses a neural network: a software architecture inspired by the human brain, with different nodes that react to different features of the input, all linked by a complex web of connections that filter down to an output.
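What does "the distance between pairs of amino acids" look like as data? A small sketch may help. It builds the table of pairwise distances from a set of 3D positions. The coordinates are made up for illustration, and this shows only the kind of quantity being predicted, not AlphaFold's actual network or code.

```python
import math

# Hypothetical 3D positions (in angstroms) for four amino acids in a short chain.
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (3.8, 3.8, 0.0),
    (0.0, 3.8, 0.0),
]


def distance_map(points):
    """Return a square table where entry [i][j] is the distance from point i to point j."""
    return [[math.dist(a, b) for b in points] for a in points]


dmap = distance_map(coords)
for row in dmap:
    print("  ".join(f"{d:5.2f}" for d in row))
```

The table is symmetric with zeros on the diagonal. A system that predicts such a table from the amino-acid sequence alone has gone a long way toward pinning down the folded shape, because only certain 3D arrangements are consistent with a given set of distances.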

As a broad generalization, neural networks are trained with large sets of existing data. Connections that lead to greater accuracy in the output are strengthened, whereas those that fail to match the data are weakened. Although their inner workings defy easy explanation, they’re often able to discover subtle patterns in the fabric of nature that aren’t obvious to us. In AlphaFold’s case, the model was trained with the known structures of 170,000 proteins.
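The training idea can be sketched with a model far smaller than any real neural network: a single "neuron" with two adjustable connections, fitted by gradient descent. Everything here is a toy assumption (the data, the learning rate, the number of steps), and it has nothing to do with AlphaFold's real architecture. It only shows how repeated small adjustments pull a model toward the data.

```python
# Toy training data: inputs x with known answers y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]

weight, bias = 0.0, 0.0   # the model's two adjustable connections
learning_rate = 0.05      # how big each adjustment is (assumed value)


def mean_squared_error(w, b):
    """Average squared gap between the model's output and the known answers."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


initial_error = mean_squared_error(weight, bias)

for step in range(500):
    # Gradients say which direction each connection should move to cut the error.
    grad_w = sum(2 * (weight * x + bias - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (weight * x + bias - y) for x, y in data) / len(data)
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b

final_error = mean_squared_error(weight, bias)
print(f"weight={weight:.3f}, bias={bias:.3f}")
print(f"error went from {initial_error:.3f} to {final_error:.6f}")
```

The model is never told the rule behind the data. It recovers it, to within rounding, purely by being nudged toward better answers. Real networks do the same thing with millions of connections instead of two.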

In 2020, AlphaFold competed in the Critical Assessment of Structure Prediction, or CASP, a contest to measure progress in protein structure prediction. The entrants have to predict the shape of proteins whose structure has been experimentally determined, but not yet released to the public.

And it was no contest: AlphaFold won in a walk. Its predictions had an average error of approximately 1.6 angstroms, comparable to the width of a single atom. In fact, it was so accurate that it found errors in experimental data:

The first example comes from the group of Osnat Herzberg, who were studying a phage tail protein. After appreciating the excellent agreement of DeepMind’s model with their structure, they noticed that they had a different assignment for a cis-proline. Upon reviewing the analysis, they realised they had made a mistake in the interpretation and corrected it.

In another case, AlphaFold solved a protein structure in hours that researchers had been wrestling with for two years:

The second comes from the group of Henning Tidow, who was studying an integral membrane protein… Prof. Tidow’s group worked on this model for about two years, trying different methodologies to solve the crystal structure, including experimental phasing methods. When they were given the models from DeepMind’s prediction, they managed to solve the problem by molecular replacement in a matter of hours.
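As an aside on how accuracy is scored: an "average error" like the 1.6 angstroms quoted earlier is usually reported as a root-mean-square deviation (RMSD) between predicted and experimental atom positions. Here is a minimal sketch of that calculation. The coordinates are invented, and the sketch assumes the two structures are already superimposed, a step that real scoring pipelines handle separately.

```python
import math


def rmsd(predicted, actual):
    """Root-mean-square deviation between two equal-length lists of 3D points."""
    if len(predicted) != len(actual):
        raise ValueError("structures must have the same number of atoms")
    squared = [math.dist(p, a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical coordinates in angstroms, assumed to be already aligned.
actual = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
predicted = [(0.0, 0.0, 1.0), (3.8, 1.0, 0.0), (2.8, 3.8, 0.0)]

print(f"RMSD: {rmsd(predicted, actual):.2f} angstroms")
# prints RMSD: 1.00 angstroms
```

In this made-up example every atom is off by one angstrom, so the RMSD is one angstrom. Because the deviations are squared before averaging, a few badly misplaced atoms raise the score more than many slightly misplaced ones.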

Since their CASP triumph, the team behind AlphaFold hasn’t been idle. They’ve used it to find drugs to treat neglected tropical diseases. They also published its source code so that anyone can see how it works, tinker with it and try to improve it.

In 2021, they released predictions for nearly every protein in the human body. (Before then, only about 17% of human proteins had known structures.) In little more than a year, over a thousand scientific papers have cited this data.

But in 2022, they surpassed themselves with an even larger data dump. They released predicted structures for more than 200 million proteins, almost every protein known to science, in a database open to researchers. This is a gold mine of untapped potential unlike anything in scientific history, and it will almost certainly lead to a massive acceleration in the pace of research.

To be clear, AlphaFold isn’t a complete solution to the protein folding problem. For example, it still can’t predict the shape of proteins whose structure depends on interactions with other proteins. We also know of some cases where its predictions fail to match reality.

Nevertheless, it’s a quantum leap over all previous efforts, and the technology is only going to get better. There’s every reason to expect this data to yield revolutions in drug discovery, in synthetic biology, in medicine, and more. We’ve scarcely begun to imagine how it will transform the world in the years to come.