Elixion announces launch of flagship integrated wet–dry laboratory at the heart of London’s Knowledge Quarter

Dec 3, 2024


To increase the rate of success within the drug development pipeline, the most effective strategy would be to improve the choice of nominated targets at the preclinical stages. Ironically, this is where machine learning has so far seen only modest application within biomedical science. Here, we outline an emerging strategy, advocated by us and others, that leverages diverse datasets and experiments across population and functional genetics, single-cell technologies, and structural biology.

The cost of drug discovery and development is driven primarily by failure, with only ~10% of clinical-stage drug candidates receiving approval. Around 75% of the substantial attrition in phase II is due to efficacy or safety reasons; in short, we do not sufficiently understand human biology. However, targets that survive the drug development process are enriched for genetic evidence and comprise two-thirds of 2021 new drug approvals1—suggesting a path forward.

Machine learning (ML) has thus far had little impact within target identification, but there is significant opportunity to leverage advances in molecular biology: single-cell and spatial multi-omic technologies can produce extraordinarily dense descriptive datasets pertaining to human tissue and advanced model systems. Moreover, the number of well-curated biobanks is increasing; these typically link clinical phenotypes to genetic background across thousands of patients. Despite this, as Sydney Brenner aptly said, “we are drowning in a sea of data and starving for knowledge.”

There are a number of characteristics that transform a potential gene of interest into a putative target: causality (are there genetic signals that implicate dysregulation of a pathway?); reversibility (will appropriate modulation restore a diseased model system to a healthy state?); and druggability (is efficacious and non-toxic in vivo manipulation of the target possible via a drug-like molecule?). However, there are a few prerequisites for unlocking this potential: crucially, we need an “isolatable pathology”, whereby one can use modern single-cell omic techniques to identify effector cells, build models of their regulatory mechanisms within a high-fidelity in vitro system, and only then design novel therapeutic entities for delivery to the necessary site of action.

We describe the substantial opportunity to use these modern technologies for target identification via genetics, ML-enabled wet lab/dry lab systems, and within structural biology—ultimately to transform the way in which biomedical science is performed.

CAUSALITY AND HUMAN GENETICS

Healthy and diseased phenotypes emerge from complex interactions between genetics and the environment. Mutations in exonic regions may cause trivial to severe changes in protein function; these account for ~10% of single nucleotide polymorphisms (SNPs) in genome-wide association studies (GWAS) comparing healthy to diseased cohorts. The remaining 90% of SNPs consist of non-coding mutations that reveal themselves via cis- and trans-regulatory effects, typically through changing levels of gene expression or creating splice variants of the mRNA molecule. The effects of these non-coding mutations are both cell type-specific and non-local in nature—the regulatory effect of a SNP might affect a gene millions of base pairs away. For example, the SNP rs1421085 within the FTO locus is associated with satiety and obesity—and subsequently FTO became the subject of much research—however, the true regulatory effect appears to manifest by increasing expression of IRX3 and IRX5 located 516 kb and 1,164 kb away, respectively.2

For the problem of predicting the effect of distal variants, large language models (LLMs) offer a streamlined approach to interpreting genetic sequences wherein motifs of nucleotides can be interpreted as sentences within a corpus of prose. For example, in quantitative trait loci (QTL) studies, utilizing paired genetics with other omic modalities from patient tissue, statistical relationships between genetic variants and omic measurements are inferred. Most commonly, transcriptomics are used to relate SNPs to changes in gene expression (i.e., an expression QTL or eQTL), or splicing (i.e., a splicing QTL or sQTL). When the same SNP is highlighted in both a GWAS and QTL study, then one has reason to believe that (some of) the mechanism by which the disease occurs relates to the QTL effect.
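The core statistical step of an eQTL analysis can be sketched with a simple regression: for each variant–gene pair, expression is regressed on genotype dosage and the slope (the per-allele effect size, "beta") is tested against zero. The data below are synthetic and purely illustrative, not drawn from any real cohort:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: genotype dosage (0/1/2 copies of the alternate
# allele) and one gene's expression level for 500 individuals.
n = 500
genotype = rng.choice([0, 1, 2], size=n, p=[0.49, 0.42, 0.09])
expression = 5.0 + 0.8 * genotype + rng.normal(0.0, 1.0, size=n)

# The classic eQTL test: regress expression on dosage and ask whether
# the slope differs significantly from zero.
result = stats.linregress(genotype, expression)
print(f"beta = {result.slope:.2f}, p = {result.pvalue:.2e}")
```

Real pipelines add covariates (ancestry principal components, technical factors) and correct for the millions of tests performed genome-wide, but the per-locus model is essentially this.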

Instead of fitting simple statistical models along the genome, one can train a “transformer” neural network to predict levels of transcription from genome sequence.3 This approach appears to be promising, and predicted statistical associations have been validated using orthogonal omic measurements and by genetic manipulation.

This is clearly only the beginning of LLMs being applied to nucleotide sequences, and many of the learnings from the “ChatGPT revolution” have not yet made their way into the biological domain. However, models are not yet faithful to biological reality: for example, human genomes are diploid, and gene copies may interact with each other in unexpected ways, via, say, X-inactivation.4

REVERSIBILITY AND FUNCTIONAL GENOMICS

Although genetics can be used to find causal statistical associations between variants, genes, and phenotype, these associations do not necessarily imply reversibility. To this end, a faithful cell model is required for interventional screening, typically in conjunction with CRISPR or small molecule modalities.

Historically, univariate outputs were used for simple interpretation of complex cellular phenotypes, for example, cell viability or protein expression. However, such readouts often hide the true nature of cell states—for example, there are many pathways that lead to cell death, including apoptosis, autophagic cell death, and necrosis. Moreover, a heterogeneous collection of cells will have a distribution of cell states present, and such screens only measure the average state. Single-cell technology offers an enticing route forward to understand how a perturbation can shift the distribution of cell states and uncover complex relationships in how genes are expressed. A modern advance has been the widespread utilization of pooled CRISPR screens, wherein thousands of gene perturbations can be studied in parallel and batch effects minimized.5
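The point that an averaged readout hides the distribution of cell states can be made concrete with a toy example: two samples with identical bulk means but entirely different single-cell behavior. All numbers here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy per-cell readout (e.g., a marker gene's expression).
# Sample A: a homogeneous population centred on 5.
# Sample B: a 50/50 mix of non-responding (~0) and strongly
# responding (~10) cells.
a = rng.normal(5.0, 0.5, size=2000)
b = np.concatenate([rng.normal(0.0, 0.5, 1000),
                    rng.normal(10.0, 0.5, 1000)])

# A bulk (averaged) readout cannot tell the two samples apart...
print(a.mean(), b.mean())   # both close to 5

# ...but the single-cell distributions are entirely different.
print(a.std(), b.std())     # roughly 0.5 vs roughly 5
```

A perturbation that shifts half of sample B's cells would move the mean only modestly, whereas the single-cell distribution would reveal exactly which subpopulation responded.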

Although immortalized cell lines and induced pluripotent stem cell-derived models can be cultured to millions of cells, it is not experimentally tractable to screen large numbers of gene perturbations in primary tissue. Furthermore, to truly elucidate gene–gene interactions, we may wish to perform combinatorial screens that cannot be evaluated exhaustively; for example, pairwise manipulation of the ~20,000 protein coding genes leads to ~200,000,000 unique combinations. When we consider the range of cell types, culture conditions, and emerging perturbation technologies available, including knockouts, activation, interference, base editing, and prime editing—we must have principled quantitative approaches to reduce the necessary amount of experimentation whilst building confidence in how a mechanism works.
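The combinatorial explosion above is easy to verify: the number of unordered pairs among ~20,000 genes is a simple binomial coefficient, and higher-order combinations grow far faster still.

```python
import math

# Unordered gene pairs among ~20,000 protein-coding genes:
pairs = math.comb(20_000, 2)
print(f"{pairs:,}")    # 199,990,000 — roughly 2 x 10^8

# Triple perturbations are even further out of experimental reach:
triples = math.comb(20_000, 3)
print(f"{triples:,}")  # over 10^12
```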

Two key problems exist: how to model the relationships between biological entities, and how to economically acquire new data from your experimental system of choice. Graph machine learning offers a natural framework to approach the first problem, wherein an (incomplete) regulatory network structure, including protein–protein and transcription factor interactions, can be incorporated as priors within a modeling framework, further discussed in Gaudelet et al.6 With regard to the second problem, we can use sequential model optimization (also called active learning or AL) for highly efficient experimental design. For example, Bertin et al.7 used AL to predict synergism within a large drug combination space. Through five sequential rounds of experimentation, selected drug combinations became enriched for synergism and a ~5–10× gain in efficiency was estimated when compared with less sophisticated approaches.
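The active learning loop can be sketched schematically: fit a surrogate model on the combinations measured so far, then spend the next experimental batch on the candidates the surrogate predicts to be most synergistic. Everything below is a minimal toy under invented assumptions (a linear ground truth, random feature vectors); real screens such as Bertin et al.7 use far richer models and acquisition functions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy combination space: 1,000 candidate pairs, each described by a
# 5-dimensional feature vector (stand-ins for, e.g., drug fingerprints).
X = rng.normal(size=(1000, 5))
true_w = np.array([1.5, -2.0, 0.0, 0.5, 1.0])          # hidden ground truth
synergy = X @ true_w + rng.normal(0.0, 0.1, 1000)      # "lab measurement"

measured = list(rng.choice(1000, size=20, replace=False))  # random seed round

for _ in range(5):
    # Fit a simple least-squares surrogate on the measurements so far.
    w, *_ = np.linalg.lstsq(X[measured], synergy[measured], rcond=None)
    # Acquire the unmeasured pairs predicted to be most synergistic.
    pool = [i for i in range(1000) if i not in measured]
    scores = X[pool] @ w
    picks = [pool[i] for i in np.argsort(scores)[-20:]]
    measured.extend(picks)  # "run the experiment" on the selected batch

# The measured set is now enriched for synergy relative to the full space.
print(synergy[measured].mean(), synergy.mean())
```

The greedy acquisition used here ignores model uncertainty; practical sequential designs typically trade off exploitation against exploration (e.g., via Bayesian acquisition functions) to avoid locking onto an early, biased surrogate.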

Through building predictive models of cellular regulation, we can build confidence that targets of interest are able to reverse a diseased cell state into a healthy state without inducing aberrant gene programs, for example, immunological stress responses. Longer term, there are many routes to integrate such models across a series of primary, secondary, and tertiary assays, for example, using transfer learning to move insights between experimental systems of increasing complexity (e.g., monoculture to multicellular systems).

DRUGGABILITY AND STRUCTURAL BIOLOGY

Druggability is a somewhat nebulous concept, broadly referring to the ability to modulate the activity of a biomolecule to achieve therapeutic benefit. The study of protein structures and conformations has been the focus when designing small molecules or larger therapeutic entities, such as antibodies. Historically, X-ray crystallography or cryogenic electron microscopy data pertaining to target protein structures were sought to help define an initial skeletal structure—an optimization problem over large chemical spaces to maximize efficacious drug–target interaction, while simultaneously avoiding off-target binding. Otherwise, less practical, structure-free approaches were employed, for example, via chemoproteomics.

ML-based approaches, such as AlphaFold2,8 have now successfully leveraged the Protein Data Bank to accurately predict protein structures from amino acid sequences (“protein folding”). Naturally, this is only the beginning of structure prediction, and we can also consider disordered proteins and larger protein complexes, for example, the nuclear pore complex.9 Generative ML techniques, such as diffusion models and continuous normalizing flows, are also being used for different tasks in drug design, including: docking and de novo molecular design; molecular property prediction; the linking of molecular fragments; and predicting synthesis reactions. The future of ML in structure-based drug design will therefore likely involve a tradeoff between the speed of innovation and the associated cost when compared with teams of medicinal chemists, who still hold a certain level of tacit knowledge pertaining to pharmacokinetic/pharmacodynamic (PK/PD) modeling.

Beyond structural biology, predicting on-target toxicity is still a key challenge. Key resources like Tabula Sapiens will aid in this task,10 wherein most of the cell types in the human body are characterized by single-cell RNA sequencing—suggesting where on-target toxicity could occur. Ultimately, however, the future lies in the integration of structure-based drug design, modality selection, and PK/PD modeling to accomplish safe therapeutic benefit.

CLOSING REMARKS

Although the potential impact of ML on the drug discovery process can be exaggerated at times, emerging evidence suggests reduced time and cost within biomolecular design. This will mean little if the subsequent drug candidate fails in later clinical stages because the initial insight used to select the target has not improved from generations past. Thus, important to any transformation is deploying multi-omic approaches with ML to gain better insights into causative biology. As we assess impact, a fundamental issue is that clinical stage failure is a lagging metric; we need nearer term proof of impact, for example, impact in translationally validated assays. Regardless, we are certain that drug discovery will still need smart scientists, translational thinkers, increased computational skills, and, of high importance, great experimental laboratory capabilities. At its core, increased levels of integration and interdisciplinarity will drive clinical success—as well as astute judgment and some luck!

FUNDING

No funding was received for this work.

CONFLICT OF INTEREST

All authors hold shares/options in Relation Therapeutics from employment/advisory roles. In addition, M.B. is an advisor and shareholder in Dreamfold and is a Chief Scientist in residence and holds shares/options in VantAI; and D.R. is a non-executive director and holds shares in Sosei Heptares.

© 2025 Copyright Elixion TechBio
