Distinguishing coronavirus genome mutations from inadvertent errors
A large-scale analysis of SARS-CoV-2 genomes by EMBL scientists shows that sequencing data must be interpreted cautiously when identifying mutations in the virus’s genome.
When viruses produce copies of their genomes inside host cells, mutations – changes in their genome sequence – can occur. Mutations can affect the way viruses infect cells and replicate within them. They can lead to subtle changes in viral proteins, which can prevent existing antibodies in the immune system from recognising the virus. Mutations can also reduce the efficiency of antiviral treatments. It’s important to identify and catalogue mutations, to better understand how viruses – such as the SARS-CoV-2 coronavirus that causes COVID-19 – spread and evolve over time.
When scientists sequence a virus’s genome and it seems to show changes, these could be the result of actual mutations – or they could be due to inadvertent errors during the experiment, called technical artefacts. Such artefacts can be caused by different ways of preparing virus samples for analysis, of determining their genomic sequences, and of analysing the data.
Scientists working in the Goldman group at EMBL’s European Bioinformatics Institute (EMBL-EBI), together with colleagues in Vienna and Cambridge, systematically analysed more than 4,700 SARS-CoV-2 genome sequences from laboratories all over the world. They found that many of the most interesting changes in the SARS-CoV-2 genome reported so far are likely to be technical artefacts rather than biological mutations. Some changes were observed only in genome sequences submitted by particular laboratories, indicating that specific combinations of sample handling procedures, sequencing technology, and data analysis can cause recurrent errors. When the same mistake happens repeatedly in one lab, it can make it seem that the viruses studied there share an evolutionary origin that might not be real, or that the same mutation is happening repeatedly in just one part of the world.
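To illustrate the idea of a lab-specific recurrent change, here is a minimal Python sketch. It is not the method used in the study. The function name `lab_specific_variants`, the `min_count` threshold, the lab labels, and the variant labels are all hypothetical and chosen only for illustration. The sketch flags a change when it appears in several sequences but every one of those sequences comes from the same laboratory.

```python
from collections import defaultdict


def lab_specific_variants(records, min_count=2):
    """Return variants seen at least `min_count` times, all from a single lab.

    `records` is an iterable of (lab, variants) pairs, one pair per sequenced
    genome. Both the record layout and the threshold are assumptions made for
    this sketch.
    """
    labs_by_variant = defaultdict(set)   # variant -> labs that reported it
    count_by_variant = defaultdict(int)  # variant -> number of genomes carrying it

    for lab, variants in records:
        # Count each variant at most once per genome.
        for variant in set(variants):
            labs_by_variant[variant].add(lab)
            count_by_variant[variant] += 1

    return sorted(
        variant
        for variant, labs in labs_by_variant.items()
        if len(labs) == 1 and count_by_variant[variant] >= min_count
    )


# Made-up toy data: one entry per genome.
records = [
    ("lab_A", ["C100T", "G250A"]),
    ("lab_A", ["C100T"]),
    ("lab_B", ["G250A"]),
    ("lab_C", ["A300G"]),
]

print(lab_specific_variants(records))  # ['C100T']
```

A flagged change is only a candidate for closer inspection. A real mutation circulating locally would look the same in this simple count, so the output alone cannot distinguish an artefact from genuine regional spread.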
Based on their analysis, the EMBL scientists and their colleagues developed a set of recommendations for filtering and masking specific parts of the SARS-CoV-2 genome when analysing sequence data. These recommendations, which they hope will be further refined by the research community as more information is shared, will help other researchers to interpret SARS-CoV-2 genome sequences and avoid potential pitfalls. This will ensure the mutations they identify are real, helping to drive forward coronavirus research.
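Masking, in this context, generally means replacing the base at a suspect position with the ambiguity code N so that downstream analyses ignore it. The short Python sketch below shows the idea on a toy sequence. The function name `mask_sites` and the positions used are hypothetical, and they are not the sites recommended by the researchers.

```python
def mask_sites(sequence, positions, mask_char="N"):
    """Return a copy of `sequence` with the given 1-based positions masked.

    The positions passed in are assumed to come from an external list of
    problematic sites; none are hard-coded here.
    """
    bases = list(sequence)
    for pos in positions:
        if not 1 <= pos <= len(bases):
            raise ValueError(f"position {pos} is outside the sequence")
        bases[pos - 1] = mask_char  # convert 1-based position to 0-based index
    return "".join(bases)


# Toy example: mask the first two and last two positions of a 10-base sequence.
toy_sequence = "ACGTACGTAC"
masked = mask_sites(toy_sequence, [1, 2, 9, 10])
print(masked)  # NNGTACGTNN
```

Masking keeps the sequence length and coordinates unchanged, which is why it is often preferred over deleting the affected positions.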