It took decades of painstaking research to map the structure of just 17 % of the proteins used within the human body, but less than a year for UK-based AI company DeepMind to raise that figure to 98.5 %. The company is making all of this data freely available, which could bring about rapid advances in the development of new drugs.
Determining the complex, crumpled shape of a protein from the sequence of amino acids that makes it up has been a huge scientific hurdle. Some amino acids are attracted to others, some are repelled by water, and the chains form intricate shapes that are hard to calculate accurately. Understanding these structures allows new, highly targeted drugs to be designed that bind to specific parts of proteins.
Genetic research has long made it possible to determine the sequence of a protein, but an efficient way of finding the shape, which is crucial to understanding its properties, has proven elusive. Although supercomputers and distributed computing projects have been applied to the problem, they failed to make significant progress.
DeepMind published research last year showing that AI can solve the problem quickly. Its AlphaFold neural network was trained on parts of previously solved protein shapes and learned to deduce the structure of new sequences, which were then checked against experimental data.
Since then, the company has been refining the technology and applying it to a large number of proteins, beginning with the human proteome, proteins relevant to covid-19 and others that will most benefit immediate research. It is now releasing the results in a database created in partnership with the European Molecular Biology Laboratory.
DeepMind has mapped the structure of 98.5 % of the roughly 20,000 proteins in the human body. For 35.7 per cent of these, the algorithm gave a confidence of over 90 per cent accuracy in predicting the shape.
The company has released more than 350,000 protein structure predictions in total, including those for 20 additional model organisms that are important for biological research, from Escherichia coli to yeast. The team hopes that within months it can add nearly every sequenced protein known to science, more than 100 million structures in total.
John Moult at the University of Maryland says the rise of AI in the area of protein folding has been a “profound surprise”.
“It’s revolutionary in a sense that’s hard to really get your head around,” he says. “If you’re working on some rare disease and you never had a structure, now you’ll be able to go and look at structural information that was basically very, very difficult or impossible to get before.”
Demis Hassabis, chief executive and co-founder of DeepMind, says that AlphaFold, which is composed of around 32 separate algorithms and has been made open source, is now solving protein shapes in minutes or, in some cases, seconds, using hardware no more sophisticated than a standard graphics card.
“It takes one [graphics processing unit] a few minutes to fold one protein, which of course would have taken years of experimental work,” he says. “We’re just going to put this treasure trove of data out there. It’s a bit mind-blowing in a way because going from the breakthrough of creating a system that can do that to actually producing all the data has only been a matter of months. We hope it’s going to become a kind of standard tool that biologists all over the world use.”
The team also added a confidence measure to every structure prediction, which Hassabis says he felt was essential given that the results would be the basis for research efforts. Hassabis believes that, for the share of human proteins whose predicted structures had lower confidence scores, the cause could be errors in the sequence or simply “something intrinsic about the biology”, such as proteins that are inherently disordered or unpredictable. The remaining 1.5 % of the human proteome, for which no structure has been published, consists of proteins with sequences longer than 2700 amino acids, which were excluded for the moment to minimise runtime.
Journal reference: Nature, DOI: https://www.nature.com/articles/s41586-021-03828-1