AI for DNA, Chemicals and Microbiome

Models, Mechanisms and Applications

📅 May 27th, 2026
⏱️ 9:30 AM
📍 Natural History Museum, The Auditorium
Øster Voldgade 5, Copenhagen, Denmark

About the workshop

This workshop brings together researchers in machine learning, biology, and chemistry to advance AI methods for molecular and biological systems. It focuses on foundation models trained on genomic and molecular data, as well as multimodal approaches for integrating omics data.
We also explore the emerging field of mechanistic interpretability to understand how well models capture biological processes.
Microbiome data serves as a central application area, highlighting challenges in sparse, compositional, and heterogeneous data, and motivating generalizable modeling approaches.
The aim is to map the current state of the field, identify key challenges in modeling and evaluation, and foster new collaborations.

Speakers & Themes

Foundation models

Roman Bushuiev
PhD Student · Czech Institute of Informatics, Robotics and Cybernetics
Bio
Roman Bushuiev's research focuses on machine learning for the discovery of molecules from tandem mass spectra. He is the first author of DreaMS (Nature Biotechnology, 2025) and MassSpecGym (NeurIPS 2024, Spotlight), a Google PhD Fellow, and a member of the ELLIS PhD program.
Frederikke Isa Marin
Postdoctoral Researcher · University of Copenhagen
Bio
TBA

Mechanistic interpretability

Ihor Kendiukhov
Founder · Biodyn
Bio
Ihor is a researcher at the intersection of AI safety and AI for biology. He is the founder of BiodynAI, a project that applies mechanistic interpretability to biological foundation models. He is also a research lead at AI Safety Camp and SPAR.
Edir Sebastian Vidal Castro
Data Scientist · Biotechnology Research Institute for the Green Transition (BRIGHT)
Bio
Edir is a Bioinformatics MSc student at DTU and a Data Scientist at the Biotechnology Research Institute for the Green Transition (BRIGHT). His work focuses on developing generative models for omics data and improving the explainability of these methods. Currently, Edir is involved in several projects, including developing a diffusion model framework for cross-species translation in single-cell transcriptomics, improving horizontal integration of metagenomics studies, and building scalable pipelines using Nextflow.

Multimodality

Dewei Hu
PhD Student · University of Copenhagen
Bio
Dewei Hu's research focuses on developing machine learning methods for multi-omics integration, protein-protein interaction networks (STRING database), and single-cell biology. His current work includes contrastive learning frameworks for proteomics and metabolomics alignment (AugMent) and network-based approaches to disease protein mapping.

Benchmarking

Anton Bushuiev
PhD Student · Czech Institute of Informatics, Robotics and Cybernetics
Bio
Anton Bushuiev's research spans machine learning for molecular sciences, including small-molecule mass spectrometry and protein engineering. He is a co-first author of DreaMS (Nature Biotechnology, 2025) and MassSpecGym (NeurIPS 2024, Spotlight), and the first author of methods in protein modeling including ProteinTTT (ICLR 2024) and PPIformer (ICLR 2026).

Microbiome: data and applications

Shiraz Shah
Senior Researcher · Copenhagen Prospective Studies of Asthma in Childhood, Gentofte Hospital
Bio
Shiraz Shah has spent two decades mapping the protein dark matter within bacterial and viral genomes. This has led to the discovery of new enzymes used for genetic engineering, as well as the definition of hundreds of new viral families. Currently he is exploring the use of protein language models for tasks where he previously used sequence alignments.
Damian Rafal Plichta
Head of Data Science · Novonesis
Bio
TBA

Organizers

Svetlana Kutuzova
Assistant Professor · University of Copenhagen, Pioneer Centre for Artificial Intelligence
Alberto Santos Delgado
Director of Informatics Platform · Biotechnology Research Institute for the Green Transition (BRIGHT)

Agenda

09:30

Morning coffee

10:00

DreaMS: a Foundation Model for Tandem Mass Spectrometry

Roman Bushuiev, PhD Student at Czech Institute of Informatics, Robotics and Cybernetics

The vast majority of tandem mass spectra acquired in metabolomics experiments remain unannotated. We present DreaMS, a transformer-based neural network pre-trained in a self-supervised way on millions of unannotated mass spectra from public repositories. Through masked peak prediction and retention order learning, DreaMS acquires rich representations of molecular structures without relying on annotated spectral libraries or domain expertise. We show how fine-tuning DreaMS yields state-of-the-art performance across diverse tasks including spectral similarity, molecular fingerprint prediction, and fluorine detection. We then present five new methods built on DreaMS — targeting the discovery of plant fluorinated natural products, environmental PFAS screening, novel metabolite prioritization, billion-scale spectral search, and general-purpose structure identification. Finally, we introduce DreaMS Atlas, a molecular network of 200 million mass spectra connecting compounds across thousands of independent studies.
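The masked peak prediction objective mentioned above can be illustrated with a minimal sketch. This is not the DreaMS implementation: the function name `mask_peaks`, the 30% masking fraction, the uniform sampling of peaks, and the `-1.0` placeholder are all assumptions chosen for illustration. The sketch only shows how a self-supervised input and target pair can be built from an unannotated spectrum.

```python
import random

def mask_peaks(spectrum, mask_fraction=0.3, mask_value=-1.0, seed=0):
    """Hide the m/z value of a random subset of peaks (hypothetical helper).

    spectrum: list of (mz, intensity) tuples.
    Returns (masked_spectrum, targets), where targets maps each masked
    peak index to the true m/z value a model would be trained to predict.
    """
    if not spectrum:
        return [], {}
    rng = random.Random(seed)
    # Assumption: mask a fixed fraction of peaks, at least one.
    n_mask = max(1, round(mask_fraction * len(spectrum)))
    masked_idx = sorted(rng.sample(range(len(spectrum)), n_mask))

    masked = list(spectrum)
    targets = {}
    for i in masked_idx:
        mz, intensity = spectrum[i]
        targets[i] = mz                      # training target
        masked[i] = (mask_value, intensity)  # model input with m/z hidden
    return masked, targets

# Toy spectrum with ten peaks (values are made up).
toy_spectrum = [(50.0 + 10.0 * k, 0.1 * (k + 1)) for k in range(10)]
masked_spectrum, targets = mask_peaks(toy_spectrum)
```

No annotation is needed to build the targets, which is what lets this kind of objective use unannotated spectra from public repositories.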
10:30

DNA Foundation Models

Frederikke Isa Marin, Postdoctoral Researcher at University of Copenhagen

TBA
11:00

Proteome-Augmented Metabolomics Improves Disease Risk Prediction in Population Cohorts

Dewei Hu, PhD Student at University of Copenhagen

Proteomic profiles offer powerful predictors of disease risk but remain costly and scarce in population cohorts, limiting their translational impact. Metabolomic data, while widely available, carry substantial pre-analytical noise from fasting status, circadian timing, and recent dietary intake, which obscures stable disease-relevant signals. Here, we present AugMent, a contrastive alignment framework that uses the plasma proteome as a biological supervisory signal to distill the most stable and disease-informative components of the metabolome. While the framework supports various architectures, a CLIP-based contrastive implementation yielded the highest fidelity in the UK Biobank. When these distilled representations were used in standard Cox proportional hazards models, they improved prediction across 82 diseases. Mechanistic analysis reveals that the framework preferentially captures latent relationships involving lipoprotein remodeling and lipid transport, particularly pathways where proteins and metabolites are tightly coupled, adding predictive signals to cardiometabolic and related diseases. Together, our results demonstrate that proteome-guided distillation can unlock disease-relevant signals already present in the metabolome but otherwise masked by noise.
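For readers unfamiliar with CLIP-style contrastive alignment, the following is a minimal sketch of the symmetric contrastive loss such frameworks typically optimize. It is not the AugMent code: the function names, the temperature of 0.07, and the toy embeddings are assumptions. Row i of each input is assumed to come from the same individual, so matching pairs sit on the diagonal of the similarity matrix.

```python
import math

def l2_normalize(v):
    # Assumes a non-zero vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_style_loss(protein_emb, metabolite_emb, temperature=0.07):
    """Symmetric contrastive loss over paired embeddings (illustrative)."""
    p = [l2_normalize(v) for v in protein_emb]
    q = [l2_normalize(v) for v in metabolite_emb]
    n = len(p)
    # Cosine similarity between every protein/metabolite pair, scaled.
    logits = [[sum(a * b for a, b in zip(p[i], q[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Protein -> metabolite direction: the correct match is column i.
    loss_p2m = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Metabolite -> protein direction: same computation on the transpose.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_m2p = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_p2m + loss_m2p)

# Toy check: aligned pairs should give a lower loss than swapped pairs.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each individual's metabolite embedding toward their own protein embedding and away from those of other individuals in the batch.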
11:30

Lunch

12:30

Opening the Black Boxes of Biological AI: Mechanistic Interpretability of Single-Cell Foundation Models

Ihor Kendiukhov, Founder at Biodyn

Foundation models like Geneformer and scGPT have shown impressive performance on biological prediction tasks, but a prediction without a mechanism is just a correlation. Ihor will present work applying mechanistic interpretability, a discipline from technical AI safety research, to open the hood of these models and ask what they have actually learned about biology. We find that biological AI models build remarkably organized internal representations: protein-protein interaction networks, pathway membership, functional modules, and subcellular localization, all encoded in structured geometric arrangements that mirror known cellular organization. But we also find an important limitation: these models often encode co-expression and pathway structure, not causal regulatory logic. They know which genes go together, but not which gene controls which.
Ihor will also present the discovery and extraction of a compact developmental algorithm from inside scGPT: a representation of blood cell differentiation that we surgically removed from the model as a standalone tool roughly a thousand times smaller, faster, and competitive with established bioinformatics methods. This points toward a new way of creating bioinformatics algorithms, not by designing them, but by extracting them from foundation model internals. Ihor will close by discussing where biological foundation models can be trusted, where they should not be, and how interpretability bridges the gap between black-box AI and the mechanistic understanding that biology demands.
13:15

GEA: Graph Explainable Attribution for decomposing GNNs using Sparse Autoencoders

Edir Sebastian Vidal Castro, Data Scientist at Biotechnology Research Institute for the Green Transition (BRIGHT)

Graph Neural Networks (GNNs) are powerful black-box models for biological data, as they provide a robust way to model complex structural relationships, yet their clinical adoption is hindered by a lack of trustworthy explanations. Current GNN explainability methods provide instance-specific attributions but fail to reveal the global, recurring concepts a model has learned. We propose Graph Explainable Attribution (GEA): a new framework that reframes GNN explanation as a dictionary learning problem. By training a sparse autoencoder on a GNN's graph-, node-, and edge-level embeddings, we decompose dense representations into a sparse combination of interpretable features. Each feature in this learned dictionary corresponds to a fundamental biological motif the GNN uses for prediction. Unlike existing XAI methods that offer fragmented, local views, GEA provides a holistic, dataset-wide perspective that captures systemic model logic. We demonstrate the utility of GEA through use cases on Inflammatory Bowel Disease patient-specific gene co-expression networks, where GEA successfully recovers known biological motifs and uncovers the underlying importance of specific subgraphs, node configurations, and edge patterns that drive the model’s decisions. This approach provides a structured, decomposable view of the GNN's internal logic and a new paradigm for creating more reliable and scientifically valuable explanations.
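The dictionary learning idea above can be made concrete with a minimal sketch of a sparse autoencoder's forward pass and training objective. This is not the GEA implementation: the function names, the ReLU encoder, the L1 sparsity penalty with coefficient `1e-3`, and the hand-picked toy weights are assumptions for illustration, and no training loop is shown.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def sae_forward(embedding, W_enc, b_enc, W_dec, b_dec):
    """Encode a dense embedding into sparse codes, then reconstruct it.

    W_enc has one row per dictionary feature; W_dec has one row per
    embedding dimension (hypothetical layout chosen for this sketch).
    """
    codes = [relu(sum(w * x for w, x in zip(row, embedding)) + b)
             for row, b in zip(W_enc, b_enc)]
    recon = [sum(w * c for w, c in zip(row, codes)) + b
             for row, b in zip(W_dec, b_dec)]
    return codes, recon

def sae_loss(embedding, codes, recon, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages few active features.
    mse = sum((r - x) ** 2 for r, x in zip(recon, embedding)) / len(embedding)
    sparsity = sum(abs(c) for c in codes)
    return mse + l1_coeff * sparsity

# Toy example: a 2-dimensional embedding and a dictionary of 4 features.
x = [1.0, -2.0]
W_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

codes, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, codes, recon)
```

In this toy case only two of the four features are active and the reconstruction is exact, so the loss comes entirely from the sparsity term. In practice the weights would be learned over many embeddings, and each learned feature would then be inspected for the inputs that activate it.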
13:45

Afternoon coffee

14:15

MassSpecGym: Benchmarking the Discovery and Identification of Molecules from Mass Spectra

Anton Bushuiev, PhD Student at Czech Institute of Informatics, Robotics and Cybernetics

Despite growing interest in applying machine learning to tandem mass spectrometry, progress has been hindered by the lack of standardized datasets and evaluation protocols. We present MassSpecGym, the first comprehensive benchmark for molecular discovery from MS/MS data, providing 231,000 high-quality labeled spectra across 29,000 molecules and defining three challenges: de novo structure generation, molecule retrieval, and spectrum simulation. We discuss key design decisions, including a novel molecular edit distance-based data split that resolves data leakage issues pervasive in prior work. We also present the first systematic evaluation audit of the MassSpecGym ecosystem, identifying recurring failure modes in recent works—including data contamination, reward hacking, and metric implementation divergence—that materially affect benchmarking conclusions, and release MassSpecGym v1.5 with more robust evaluation conditions.
14:45

Data-driven Approaches to Understanding Asthma in Childhood

Shiraz Shah, Senior Researcher at Copenhagen Prospective Studies of Asthma in Childhood, Gentofte Hospital

Recently, we have discovered that the human body is full of bacteria and viruses that protect us from disease, and that such microbes vastly outnumber the ones that cause disease. Analysis of metagenomic data can help us detect bacteria and viruses that were not previously known, and if they protect from disease, they could be administered instead of drugs to prevent and treat diseases. Discovery of novel microbes in metagenomic data requires approaches that are able to detect microbial dark matter. We previously used protein sequence alignments to define families of viruses that were not found in public databases. These viral families were later found to protect children from developing asthma. However, protein alignments are not quite sensitive enough to bridge the huge diversity of viruses on Earth, so many viruses are still unclustered singletons. The protein language models that have been developed in recent years are extremely sensitive, but we still do not have the tools to integrate them into existing microbial classification tasks. Here, we applied protein language models to the classification of distant viral taxa, noting that they vastly outperformed traditional bioinformatics workflows and even manual curation. However, interpretability remains a challenge, and there is a huge need to assign biological meaning to embedding dimensions.
15:15

Foundation Models in Practice: Proteomics and Metabolomics

Damian Rafal Plichta, Head of Data Science at Novonesis

TBA
15:45-16:00

Concluding remarks and thanks for today!