[This is a transcript of the video embedded below.]

Protein folding is one of the biggest, if not THE biggest, problems in biochemistry. It has become the holy grail of drug development. Some of you may even have folded proteins yourself, at least virtually, with the crowd-science app “Foldit”. But at the end of last year, the headlines announced that protein folding had been “solved” by artificial intelligence. Was it really solved? And if it was solved, what does that mean? And what was the protein folding problem again? That’s what we’ll talk about today.

Proteins are one of the main building blocks of living tissue, such as muscles, so you may be familiar with “proteins” as one of the most important nutrients in meat.

But proteins come in a bewildering number of variants and functions. They are found everywhere in biology and are of great importance: proteins can be antibodies that fight infection, proteins enable organs to communicate with one another, and proteins can repair damaged tissue. Some proteins can perform amazingly complex functions, for example pumping molecules into and out of cells, or carrying substances along using movements that resemble walking.

But what is a protein? Proteins are basically really big molecules. More specifically, proteins are chains of smaller molecules called amino acids. However, long and loose chains of amino acids are unstable, so proteins fold and curl until they reach a stable three-dimensional shape. What is the stable shape of a protein, or what are the stable shapes if there are several? This is the “protein folding problem”.

Understanding how proteins fold is important because a protein’s function depends on its shape. Some mutations can change the amino acid sequence of a protein, causing the protein to fold incorrectly. It can then no longer fulfill its function and the result can be a serious illness. There are many diseases caused by misfolded proteins, for example type 2 diabetes, Alzheimer’s, Parkinson’s and also ALS, which is the disease Stephen Hawking had.

So it is important to understand how proteins fold, in order to find out how these diseases arise and how they could potentially be cured. However, the benefit of understanding protein folding goes beyond that. In general, if we knew how proteins fold, it would be much easier to synthesize proteins with a desired function.

But protein folding is an awfully difficult problem. What makes it so difficult is that there are a multitude of ways that proteins can fold. The amino acid chains are long and can fold in many different directions, so the possibilities increase exponentially with the length of the chain.

Cyrus Levinthal estimated in the 1960s that a typical protein could fold in more than ten to the power of one hundred and forty ways. However, don’t take this number too seriously. The number of possible folds actually depends on the size of the protein. Small proteins may have as few as ten to the power of fifty, while some large ones can have a staggering ten to the power of three hundred possible folds. That is almost as many vacua as there are in string theory!

So it is clearly not possible to try all possible folds. We would never find out which is the most stable.
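To get a feeling for these numbers, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that every amino acid in the chain can independently take one of three local conformations and that one could test ten to the power of thirteen folds per second. Both figures, and the function names, are assumptions made for this example and are not taken from Levinthal's estimate.

```python
import math

SECONDS_PER_YEAR = 3600 * 24 * 365


def count_conformations(chain_length: int, states_per_residue: int = 3) -> int:
    """Folds available if each residue independently picks one of a few states.

    The default of three states per residue is an illustrative assumption.
    """
    return states_per_residue ** chain_length


def order_of_magnitude(n: int) -> int:
    """Power of ten of a positive integer, e.g. 1000 -> 3."""
    return len(str(n)) - 1


def log10_years_to_enumerate(chain_length: int,
                             states_per_residue: int = 3,
                             tries_per_second: float = 1e13) -> float:
    """log10 of the years needed to try every fold once.

    Works with logarithms so that very long chains do not overflow a float.
    The rate of 1e13 tries per second is an illustrative assumption.
    """
    log10_folds = chain_length * math.log10(states_per_residue)
    return log10_folds - math.log10(tries_per_second) - math.log10(SECONDS_PER_YEAR)


if __name__ == "__main__":
    for length in (100, 300, 600):
        folds = count_conformations(length)
        print(f"chain of {length}: about 10^{order_of_magnitude(folds)} folds, "
              f"about 10^{log10_years_to_enumerate(length):.0f} years to try them all")
```

Under these assumptions the count grows exponentially with chain length, which is why simply trying every fold is hopeless even for modest proteins.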

The problem is so difficult that you may think it is unsolvable. But it’s not all bad. Scientists found in the 1950s that when proteins fold under controlled conditions, such as in a test tube, the shape into which they fold is largely determined by the sequence of the amino acids. And even in a natural setting, rather than in a test tube, this is usually still the case.

In fact, the 1972 Nobel Prize in Chemistry was awarded for this discovery. Before that, one might have feared that proteins would have a large number of stable forms, but that does not appear to be the case. This is likely because natural selection favors large molecules that reliably fold in the same way.

There are a few exceptions. For example, prions, like those responsible for mad cow disease, have several stable forms. And proteins can change their shape when their environment changes, for example when they encounter certain substances in a cell. Most of the time, however, the amino acid sequence determines the shape of the protein.

So the protein folding problem is this: if you have the amino acid sequence, can you tell me which shape is the most stable?

How would you solve this problem? There are basically two options. For one thing, you can try to come up with a model for why proteins fold one way and not another. You probably won’t be surprised to hear that I have some physicist friends who have tried their hand at this. In physics we call this a “top-down” approach. The other thing you can do is what we call the “bottom-up” approach. This means that you observe how large numbers of proteins fold and hope to extract regularities from them.

To get anywhere with protein folding, you first need examples of what folded proteins look like. One of the most important methods for this is X-ray crystallography. To do this, one fires X-rays at crystallized proteins and measures how the rays are scattered. The resulting pattern depends on the position of the various atoms in the molecule, from which one can then infer the three-dimensional shape of the protein. Unfortunately, some proteins take months or even years to crystallize. But a new method has recently greatly improved the situation through the use of electron microscopy on frozen proteins. This so-called cryo-electron microscopy provides a much better resolution.

To track progress in predicting protein folding, researchers founded an initiative in 1994 called Critical Assessment of Protein Structure Prediction, or CASP for short. CASP is a competition between different research teams trying to predict the folding of proteins. The teams are given a series of amino acid sequences and are asked to indicate what shape they think the protein will fold into.

This competition takes place every two years. It uses protein structures that have been measured experimentally but not yet published, so the competing teams do not know the correct answer. The predictions are then compared to the actual shape of the protein and given a score based on how well they match. This method of comparing the predicted to the actual three-dimensional shape is called a global distance test, and the result is a percentage. 0% is a total failure, 100% is a perfect score. At the end, each team receives an overall score that is the average of the scores for all of its predictions.
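To make the scoring idea concrete, here is a minimal sketch of a score in the style of the global distance test. It assumes the predicted and measured structures have already been laid on top of each other, and it represents each structure by one point per amino acid. The real test also searches for the best way to superimpose the two structures, which this sketch leaves out. The cutoffs of 1, 2, 4 and 8 angstroms follow the commonly used GDT_TS variant; the function names and the toy coordinates are made up for illustration.

```python
import math
from statistics import mean

CUTOFFS = (1.0, 2.0, 4.0, 8.0)  # distance thresholds in angstroms


def gdt_like_score(predicted, experimental, cutoffs=CUTOFFS):
    """Percentage score for one prediction.

    For every cutoff, take the fraction of residues whose predicted position
    lies within that distance of the measured position, then average the
    fractions over all cutoffs. Assumes the structures are already aligned.
    """
    if not predicted or len(predicted) != len(experimental):
        raise ValueError("structures must be non-empty and of equal length")
    distances = [math.dist(p, q) for p, q in zip(predicted, experimental)]
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * mean(fractions)


def overall_score(scores):
    """A team's overall score: the plain average over all of its predictions."""
    return mean(scores)


if __name__ == "__main__":
    # Toy data: four residues on a line, with prediction errors of 0, 0.5, 3 and 10.
    measured = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    guess = [(0.0, 0.0, 0.0), (3.8, 0.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]
    print(gdt_like_score(guess, measured))      # 62.5
    print(gdt_like_score(measured, measured))   # 100.0
```

Averaging over several cutoffs means a prediction gets partial credit for residues that are roughly in the right place, rather than being judged all-or-nothing.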

The CASP competition saw only slow progress for the first 20 years. Then the researchers started putting artificial intelligence on the task. In fact, last year around half of the teams were using artificial intelligence, or more precisely deep learning. Deep learning uses neural networks. It is software that is trained on large amounts of data and learns to recognize patterns, from which it then extrapolates. I explained this in more detail in an earlier video.

Until a few years ago, nobody scored more than 40% in the CASP competition. But in the last two rounds of the competition, one team has achieved remarkable results. This is DeepMind, a UK company that was acquired by Google in 2014. It is the same company that is behind the computer program AlphaGo, which in 2015 was the first to defeat a professional Go player.

DeepMind’s protein folding program is called AlphaFold. In 2018, AlphaFold scored nearly 60% in the CASP competition, and in 2020 the update AlphaFold2 reached nearly 90%.

The news hit the headlines a few months ago. In fact, many news outlets claimed that AlphaFold2 solved the protein folding problem. But did it?

Critics have noted that 90% still leaves a significant failure rate and that some of the most interesting cases are those where AlphaFold2 did not do well, such as complexes of proteins called oligomers, in which multiple amino acid chains interact. There is also the general problem with artificial intelligence that it can only learn to extract patterns from data on which it has been trained. This means that the data must exist in the first place. If there are entirely new features that don’t appear in the dataset, they may go undetected.

But well. I sense a certain grumpiness here from people who fear that software will make them obsolete. It is certainly true that AlphaFold’s success in 2020 won’t be the end of the story. There is still a lot to be done, and of course, you still need data, that is, measurements, to train artificial intelligence.

Even so, I think this is a remarkable achievement and an amazing step forward. It means that in the future, protein folding predictions made by artificially intelligent software will save scientists many time-consuming and expensive experiments. This could help researchers develop proteins with specific functions. Some that are on the wish list include proteins that stimulate the immune system to fight cancer, a universal flu vaccine, or proteins that break down plastics.

