Using deep learning to annotate the protein universe
Nature Biotechnology 21 February 2022
10.1038/s41587-021-01179-w
Deep learning models can improve protein annotations and have helped expand the Pfam database
This article was last updated on 20 October 2022.
Our protein family database – Pfam – is used by a diverse range of researchers across the globe. Open access to the protein family data stored in Pfam has helped experimental biologists understand protein function, aided structural biologists’ insights into protein structure, given computational biologists rapid access to protein sequence information, and let evolutionary biologists trace the origins of proteins.
Pfam – now only accessible in the InterPro protein family database – gives researchers access to vital protein annotations, structures, and multiple sequence alignments. It is a resource widely used to classify protein sequences into phylogenies and identify domains – functional regions – to provide insights into protein function.
With help from new deep learning models, Pfam has increased the protein sequence annotation and function data available within the database by unprecedented amounts. Research published in the journal Nature Biotechnology demonstrates how deep learning methods developed by Google Research could be trained using data from Pfam to accurately annotate many previously undescribed protein domains, shedding light on potential protein function.
Since this initial work, the collaboration has continued and the Google Research team has helped expand the number of Pfam protein annotations available even further. Additionally, the annotations have been made available on the InterPro website, as part of the latest InterPro release.
“Initially I was rather sceptical about using deep learning to reproduce the protein families within Pfam. Then I started collaborating more closely with Lucy Colwell and her team at Google Research and my scepticism quickly changed to excitement for the potential of these methods to improve our ability to classify sequences into domains and families,” said Alex Bateman, Senior Team Leader of Protein Sequence Resources at EMBL-EBI. “These models exceed my expectations. They’re not just copying the data already in Pfam, they’re able to learn from the data and find new information that is yet to be discovered. What this gives us is the ability to expand the Pfam collection and potentially that of other resources using these same deep learning methods.”
By combining deep learning models with existing methods to add new data into Pfam, in 2021 the researchers were initially able to expand the database by almost 10%. This number is still increasing due to continued efforts from the Google Research team. This work exceeds all expansion efforts made to the database over the last decade. The deep learning methods were also able to predict the function for 360 human proteins that had no previous annotation data available in Pfam.
Using additional protein family predictions generated by the Google Research team’s neural networks – a series of algorithms that looks for underlying structure in the sequences of protein domains and families – the researchers created a supplement to Pfam called Pfam-N, where N stands for network. Initially, Pfam-N added nearly 1.8 million full-length protein sequences that were not previously found within Pfam. In the latest update, this has grown to 5.2 million full-length protein sequences, raising the share of UniProt Reference Proteomes sequences covered from ~75% to ~83%.
“We’re also now building on these established deep learning methods to expand the information in the database even further,” said Bateman. “We’re changing the way the existing deep learning model works so that we can call multiple protein domains at once.”
“My personal view is that there’s still a lot of scope to improve the deep learning models we’re currently using,” Bateman added. “We’re in the early days of this and I’m very hopeful for what it will mean for the future classification of protein families. This may even be something that will get solved in the next five years.”
Find out more
Find out more about Pfam’s collaboration with Google Research and more details about Pfam-N in this Pfam blog post.
This work is funded by the Wellcome Trust as part of a Biomedical Resources grant awarded to the Pfam database.