
Sparking a data revolution

On the evolution of bioinformatics and computational biology at EMBL

Credit: Creative Team/EMBL

By Wolfgang Huber, Jan Korbel, Jo McEntyre, and Oliver Stegle

The roots of bioinformatics at EMBL lie in the world’s first nucleotide sequence database, the EMBL Nucleotide Sequence Data Library (now known as the European Nucleotide Archive), established in 1980 at EMBL Heidelberg. This database of nucleotide sequences generated and submitted by the research community has, from the very beginning, provided researchers with open access to nucleotide data without restrictions. This approach facilitated a paradigm shift in how research data were shared and used across the scientific community. Along with partners in the USA and Japan, EMBL-EBI is today a key part of the International Nucleotide Sequence Database Collaboration (INSDC).

The Data Library merged into the Biocomputing programme at EMBL, founded in 1986. In 1992, the EMBL Council voted to establish EMBL’s European Bioinformatics Institute (EMBL-EBI) at the Wellcome Trust Genome Campus in Hinxton, UK. In its three decades of existence, EMBL-EBI has played a major part in the bioinformatics revolution. Today, it provides a world-leading and comprehensive range of molecular biology databases and operates a training programme globally.

Over the last few decades, EMBL-EBI’s expanding portfolio of resources has reflected the growing discipline of bioinformatics and the increasingly data-driven life sciences. These resources include UniProt, the knowledge base of protein information, which now covers over 200,000,000 known and predicted proteins, and the Ensembl Project, originally focussed on data emerging from the Human Genome Project in the late 1990s. Today, Ensembl enables researchers to browse not only the reference human genome but also genomes across nearly every branch of the evolutionary tree, allowing sophisticated comparative genomics and analysis of the downstream effects of genetic variation. EMBL-EBI’s Protein Data Bank in Europe (PDBe), set up in 1996, was one of the founding members of the worldwide Protein Data Bank (wwPDB) in 2003. The wwPDB grew out of the original Protein Data Bank, which has been in operation since 1971.

EMBL-EBI also collects gene expression, proteomics, and metabolomics data, as well as curated collections covering protein families, molecular interactions, pathways, ontologies, and the scientific literature. Most recently, in 2019, the BioImage Archive was launched, opening new opportunities to combine image data with data from fields such as spatial transcriptomics and proteomics.

As of 2024, the EMBL-EBI websites receive over 100 million requests for data per day. The data are used in many ways, from straightforward information look-ups by biologists and non-experts, to supporting sophisticated algorithm development by computational biologists, to innovative product development in industry. Collecting the data and making them as easy as possible to reuse enables researchers to rapidly build thematic platforms such as the COVID-19 Data Portal. The portal was developed in a matter of weeks in response to the pandemic to enable effective SARS-CoV-2 data sharing, accelerating research and supporting the development of diagnostics, therapeutics, and effective vaccines.

In 2013, EMBL co-initiated the Pan-Cancer Analysis of Whole Genomes (PCAWG) project, a pioneering study for cancer genome and transcriptome data sharing, drawing participation from a global consortium of researchers. The project undertook a detailed analysis of cancer genome sequences from 2,800 patients, resulting in important insights regarding tumour evolution and progression. The project also addressed the ethical considerations surrounding patient data sharing, setting a precedent for future international studies.

EMBL-EBI initiatives are also playing a major role in understanding and conserving the rich biodiversity of our planet, through projects such as the Darwin Tree of Life (DToL) initiative (part of the Earth BioGenome Project), the African BioGenome Project (AfricaBP), and the Global Microbial Gene Catalog (GMGC).

Big data need reusable and open software and algorithms to analyse and interpret them. EMBL has played an organising and structuring role here, supporting developer platforms and code repositories such as Bioconductor (founded in 2001) to bring interoperability and sustainability into these distributed and diverse efforts.

With growing data also comes the need for standardisation, ensuring that data remain easily interpretable and reusable by scientists throughout the world. Here, EMBL-EBI has promoted standards that ensure data are not only openly available, but also Findable, Accessible, Interoperable, and Reusable (FAIR). It has also helped in developing international standards and guidelines for sharing, annotating, and archiving biological information, thus helping maintain the integrity and reliability of biological databases.

Perhaps one of the most exciting revolutions in biological data analysis has been the advent of AI-based methods. AlphaFold, an AI-powered system that can accurately predict millions of protein structures, was developed by Google DeepMind and required the publicly available protein structure data in PDBe for training. In 2021, EMBL-EBI and Google DeepMind together released the AlphaFold Protein Structure Database, an open platform where anyone can search, analyse, and download an AlphaFold prediction for every single known protein in UniProt. Beyond this, EMBL scientists have in recent years been actively developing and applying AI methods in molecular biology across a variety of fields and applications.

Given our world-class research and open data resources, it is exciting to think about what EMBL’s impact on bioinformatics will be over the next 50 years, whether at the level of a single cell or an entire ecosystem, especially in the face of the AI revolution. As we look towards the future, it is clear that EMBL will continue its leadership, not only in housing the world’s biological data resources, but also in enabling their widespread and responsible use.

Learn more about EMBL’s contribution to life science research and services in our 50th anniversary commemorative publication.

