The EMBO | EMBL Symposium ‘AI and biology’ took place last month, bringing together researchers working at the intersection of AI and biology to discuss theory, methods, new application areas, and dissemination strategies.
For the inaugural edition of this meeting, we had 340 people attending on-site and 266 virtual participants, including five fellowships provided by the EMBL Corporate Partnership Programme and EMBO. We held two poster sessions up on the helices of the Advanced Training Centre, during which the presenters could discuss their research. Attendees and speakers then voted for their favourite posters. Out of the 173 posters presented, five were awarded prizes during the meeting, and we are pleased to share them with you below.
Presenter: Larissa Heinrich
Abstract:
Denoising diffusion probabilistic models (DDPMs) have proven to be powerful generative models for image generation while being simple to train compared to alternatives such as generative adversarial networks. We are exploring the application of DDPMs in bioimage segmentation. In particular, we focus on the segmentation of subcellular structures in volume electron microscopy.
Obtaining specialised training data remains a significant and time-consuming barrier to fine-tuning networks on new datasets and to improving segmentation performance for underrepresented classes. Our work investigates the potential of using DDPMs to supplement training data for segmentation, ranging from additional data augmentation to generating examples with classifier-free guidance under various conditioning signals. Our approach is designed to leverage the abundant unlabelled data during the training process, thereby facilitating domain adaptation and class balancing.
While still in its early stages, our research aims to assess the utility of integrating generative models, specifically DDPMs, in the segmentation of subcellular structures in volume electron microscopy. This exploration offers a promising avenue for reducing the human annotation effort in bioimage segmentation.
Presenter: Niklas Gesmar Madsen
Abstract:
Antimicrobial peptides (AMPs) remain a staple in last-resort treatment against antibiotic-resistant organisms, yet state-of-the-art computational methods result in low success rates in vivo. The importance of careful embeddings and their associated symmetries when using AI/DL for biological tasks is stressed, as this explains the low success rates. Numerical representations of amino acid sequences are investigated to find correlations with antimicrobial activity. It is shown that state-of-the-art methods cannot discriminate a sequence from its shuffled permutation. Naturally, a shuffled amino acid sequence leads to differential activity in vivo. This failure mode is unavoidable, as most physicochemical descriptors are permutation invariant, which makes the task of classifying shuffled sequences impossible.
The integration of structure into a prediction is a natural way to break permutation invariance, as a permutation in sequence leads to a different three-dimensional structure. Geometric deep learning is implemented and is shown to discriminate antimicrobial peptides from inactive ones. The method respects the SO(3) group of rotations, a subset of the Euclidean isometries, and learns molecular surfaces by group-equivariant convolutions on the 2-sphere. The method is extended to include large language model embeddings, which are shown to break permutation invariance. The method indicates the importance of hydrogen bonding for antimicrobial activity, which contrasts with the canonical understanding of facial amphiphilicity and draws ties to self-assembling systems, which have been documented in the literature.
In conclusion, the method drastically reduces the experimental search space for antimicrobial peptides, as it can discriminate shuffled permutations. The method further shows encouraging preliminary results in predicting biological activity directly from the sequence, which has been largely elusive so far for AMPs.
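To make the permutation-invariance point in this abstract concrete, here is a minimal Python sketch. It is not the presenter's method: the `descriptors` function, the choice of descriptors (composition, a crude net charge that ignores histidine and the termini, and mean Kyte-Doolittle hydropathy), and the example sequence are all assumptions chosen for illustration. Reversing the sequence stands in for shuffling, since it is one simple, deterministic permutation.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def descriptors(seq: str) -> tuple:
    """Hypothetical helper: descriptors that depend only on residue counts."""
    # Amino acid composition, sorted so the result does not depend on order.
    composition = tuple(sorted(Counter(seq).items()))
    # Crude net charge: (K + R) - (D + E). Assumption: His and termini ignored.
    net_charge = sum(seq.count(a) for a in "KR") - sum(seq.count(a) for a in "DE")
    # math.fsum returns a correctly rounded sum, so summation order is irrelevant.
    mean_hydropathy = math.fsum(KD[a] for a in seq) / len(seq)
    return composition, net_charge, mean_hydropathy


peptide = "GIGKFLHSAKKFGKAFVGEIMNS"  # illustrative sequence only
permuted = peptide[::-1]             # reversal: a simple, deterministic permutation

print(peptide != permuted)                            # the sequences differ
print(descriptors(peptide) == descriptors(permuted))  # the descriptors do not
```

Both comparisons evaluate to `True`: the two sequences are different, yet any model that sees only these descriptors receives identical input for both, so it cannot assign them different activities.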
Poster Prize kindly sponsored by Digital Discovery, the Royal Society of Chemistry journal
Presenter: Eva Klimentová
Abstract:
Two decades ago, an intriguing phenomenon emerged: some proteins possess a knotted structure in their polypeptide chains, making it impossible to fully unravel them by simply pulling at their ends. Since this initial discovery, a variety of knotted proteins have been identified. Despite this, the mechanisms behind their knot formation and the specific functions these knots serve are still not well understood. This knowledge gap is further compounded by the limited diversity of knotted protein families found in nature.
With the boom of contemporary machine learning methods such as Stable Diffusion, one can now examine the knotting phenomenon from an alternative perspective. Using state-of-the-art computational tools for novel protein generation (represented by RFdiffusion combined with ProteinMPNN and ColabDesign; and EvoDiff with OmegaFold), we monitor the knotting status of the created proteins. In the unconditional generation of proteins, approximately 0.5 % of the proteins are entangled, predominantly featuring the simplest 3_1 type of knot, an observation consistent with empirical data from the Protein Data Bank (PDB).
In the subsequent phase, we ask whether the generated knotted proteins possess any shared sequence features (are separable from the unknotted ones) or if the knot in their backbone resulted from chance. In natural proteins, knotting is a phenomenon conserved within one protein family or subfamily, and the number of such families is very restricted, which makes it challenging to reveal potential knotting patterns. The ability to artificially generate knotted proteins that share common features but go beyond the patterns observed in real proteins can aid in uncovering a universal knotting pattern.
Presenter: Niklas Schmacke
Abstract:
Accelerated by developments in genetic engineering, it has become feasible to investigate the effects of millions of genetic perturbations on biological processes in an unbiased manner. An approach called forward genetic screening associates phenotypes with genotypes by randomly inducing mutations and then identifying those that result in phenotypic changes of interest. Cell-based screens for comparatively simple phenotypes such as cell death are now routinely conducted at the genome scale. However, screening for more complex phenotypes, such as subcellular spatial organisation as observed by microscopy, has not been possible at scale.
We have developed spatially resolved CRISPR screening (SPARCS), a platform for microscopy-based genetic screening for subcellular spatial phenotypes. SPARCS uses fully automated high-speed laser microdissection to physically isolate phenotypic variants in situ from virtually unlimited library sizes. We demonstrate the potential of SPARCS in a genome-wide CRISPR knockout screen on autophagosome formation in 40 million cells. In combination with a deep learning-based image classifier trained in a fully supervised manner, SPARCS recovered almost all known macroautophagy genes in a single experiment and discovered a new role for the ER-resident protein EI24 in autophagosome biogenesis.
An unsolved problem in these phenotypically complex screens is the identification of new, previously unseen phenotypes based on single-cell images. To learn representations of genome-wide phenotypic space and identify outliers, we now increasingly use diverse data sources to describe image-based phenotypes in a given biological model system, in combination with large, pre-trained computer vision models. Since SPARCS screens can be archived, it is possible to reanalyse past screens using more advanced models to discover new phenotypes as such models become available. Taken together, in combination with deep learning-based computational models, SPARCS enables the identification of genes underlying complex biological processes in an unbiased manner across the human genome.
Poster Prize kindly sponsored by Molecular Omics, the Royal Society of Chemistry journal
Presenter: Vikas Shukla
Abstract:
Nucleosomes, the octameric protein complexes consisting of histones H3, H4, H2A, and H2B, represent the basic layer of chromatin organization¹. Post-translational modifications of histone tails (mainly H3) and sequence variants of histones (mainly H2A, H3, and H2B) constitute epigenetic instructions relevant to enzymes that can remodel the chromatin environment. Theoretically, the combinations of modifications and variants on a given nucleosome could be almost innumerable. However, we have learned that very few of the combinations are logical and functionally relevant.
Chromatin states represent the most likely combinations of histone modifications and variants². Projecting these states on genes followed by unsupervised clustering gives rise to chromatin landscapes, the stereotypical assemblies of chromatin states over genes³. These chromatin landscapes represent broad classes of genes that are maintained in a similar epigenetic environment by the cell to modulate their transcriptional pattern in a specific manner.
The existence of chromatin landscapes suggests that a gene can be placed in a limited number of epigenetic environments and the gene’s expression pattern is a product of its presence in a particular chromatin landscape.
Comparing the chromatin landscapes of Arabidopsis thaliana and Marchantia polymorpha, we found that they were largely similar, suggesting a strong degree of conservation. These findings also highlight the predictive abilities of chromatin landscapes in connecting the epigenome to expression.
Find out more about the #EESAIBio meeting from the blog post written by Eva Klimentová, who participated as an event reporter!
The EMBO | EMBL Symposium ‘AI and biology’ took place from 12 – 15 March 2024 at EMBL Heidelberg and virtually.