Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences
PLOS Biology 9 November 2021
10.1371/journal.pbio.3001421
Bacteria are the oldest and most abundant cellular organisms on the planet. They’re incredibly diverse and adaptable, able to survive in almost any environment, from the ocean depths to volcanic springs and even the desert. The human body itself is estimated to contain more bacterial cells than human cells.
Given this impressive diversity, microbiologists trying to understand how bacteria work and evolve have a long road ahead of them. Large volumes of bacterial DNA data are available in open repositories such as EMBL-EBI’s European Nucleotide Archive (ENA). However, many of these datasets are unprocessed, and the remainder have been assembled into genomes using different techniques over the years. Studying them together is a bit like trying to navigate while jumping between your car’s GPS, a paper map and Google – it mostly works, but it will lead you astray precisely when you need it the most.
In an effort to harmonise the data, researchers at EMBL-EBI and the Wellcome Sanger Institute have reviewed all the bacterial datasets available in the ENA and used them to assemble over 660,000 bacterial genomes. Features of interest – such as antimicrobial resistance genes – have been documented, and are now easy to find in the new dataset.
“I study genomic elements that are able to move between different bacteria,” explains Grace Blackwell, Postdoctoral Fellow at EMBL-EBI and the Wellcome Sanger Institute. “To do this, I need to search and analyse as many bacterial genomes as possible. But public data can be quite messy and needs to be processed uniformly, including quality control, before it can be used for analysis. So along with a few colleagues, we decided to ‘tidy up’ the data and make it easier for scientists to ask research questions.”
Grace and her colleagues spent many months working through all the bacterial data available in the ENA as of December 2018, characterising and assembling more than 660,000 bacterial genomes in the hope that the resulting resource will help researchers across the globe.
This unique dataset includes three different indices of the data and is now accessible via an FTP site. It integrates a range of search tools and distance estimates, enabling researchers to check whether a sequence, gene, mutation or plasmid of interest is present in any of the genomes, and to tell how closely related a set of genomes is.
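To give a feel for what "searching" and "distance" can mean for genome data, here is a toy sketch in Python. It is not the method used to build the dataset's indices, which this article does not describe. It assumes a simple k-mer approach (breaking each sequence into overlapping words of length k), and the function names and the default k are hypothetical, chosen only for illustration.

```python
# Toy illustration only: NOT the indexing or distance method used for the
# dataset described in this article. All names here are hypothetical.

def kmers(seq, k=5):
    """Return the set of all overlapping substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def query_containment(genome, query, k=5):
    """Fraction of the query's k-mers that also occur in the genome.

    A value of 1.0 means every k-mer of the query is present, which
    suggests (but does not prove) that the query occurs in the genome.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0  # query shorter than k: nothing to compare
    return len(query_kmers & kmers(genome, k)) / len(query_kmers)


def jaccard_distance(seq_a, seq_b, k=5):
    """1 minus the Jaccard similarity of the two k-mer sets.

    0.0 means identical k-mer content; 1.0 means no k-mers are shared.
    """
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:
        return 0.0  # both sequences shorter than k
    return 1.0 - len(a & b) / len(union)


if __name__ == "__main__":
    genome = "ACGTACGTGGA"
    print(query_containment(genome, "ACGTG"))
    print(jaccard_distance(genome, genome))
```

Real genome collections are far too large for exact set comparisons like this, which is why purpose-built indices and approximate distance estimates are needed at the scale of hundreds of thousands of genomes.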
While trawling through the data, the researchers were surprised to find that the majority of the data comes from just 20 species of bacteria. Notably, one third comes from Salmonella enterica, a bacterium that causes foodborne illnesses leading to hospitalisations and deaths worldwide.
“The exercise gave us a detailed overview of the bacteria sequenced over the last 30 years,” explains Zam Iqbal, Group Leader at EMBL-EBI, who was also involved in the project. “It confirms that researchers have been focusing on a small number of known pathogens. However, we know that antimicrobial resistance exists in a much wider range of contexts. This narrow sequencing focus is leaving us blind to both AMR genes and their vectors, a host of different mobile elements, which exist in other, less studied species. It shows that we need to widen the range of species we sequence, and to create better mechanisms for sharing the data with the community, so it’s useful to researchers and public health authorities alike.”
This press release was first published on the EMBL-EBI website.