The Predictive Validity Problem

Why Better Biology Beats Bigger Models

An increasing number of companies are using AI to predict how drugs, targets, and biomarkers will behave in humans. The new bottleneck is our ability to validate those predictions.

AI in drug discovery is bottlenecked at two distinct layers. Here, we focus on the first. The training data that bio-AI models learn from is generated by experimental systems whose predictive validity, that is, their ability to forecast how a drug, target, or biomarker will actually behave in humans, is poor and largely unmeasured. AI models inherit that weakness.

Clean data, messy biology

The deep learning revolution in biology has a simple story attached to it. AlphaFold proved that a sufficiently capable model trained on a sufficiently large number of structures could predict protein structure with accuracy approaching experimental methods. The Protein Data Bank held roughly 170,000 curated structures by 2020, and the result was transformative.

Every subsequent AI-bio pitch invokes some version of this story: bigger models, more data, and transformer architectures applied to whatever biological problem the company is trying to solve, whether that is generative chemistry, target identification, protein-protein interaction prediction, or ADMET forecasting.

AlphaFold worked because the input data was clean, structured, and high-fidelity. Crystal structures are crystal structures, and X-ray diffraction tells you where the atoms are. The ground truth is well-defined, and the relationship between input and output is causal. If you train on enough of those, the model can learn the underlying physics.

But most of biology is not like this. The training data for AI-driven target discovery comes from cell line experiments, where the cell lines are decades-old immortalised populations that have drifted from the patient tissues they were derived from. The training data for AI-driven efficacy prediction comes from mouse models bred for genetic homogeneity, housed in pathogen-free facilities, and fed standardised diets. The training data for AI-driven toxicity prediction comes from acute high-dose animal studies whose relationship to chronic low-dose human exposure is largely uncertain.

This asymmetry shows up in the evidence base. A recent systematic review of 100 peer-reviewed studies found that reported generative AI efficiency gains are concentrated in early discovery, while preclinical applications remain limited, prospective validation is sparse, and there is not yet robust evidence that computational acceleration reduces attrition or improves regulatory progress [1].

This is not to say that generative AI has no value in discovery (it clearly does); the point is that discovery acceleration is not the same as development productivity. Simply having access to more data does not solve a measurement problem. The bottleneck is the experimental system itself.

Figure 1. Generative AI in pharma: where value is currently concentrated versus where biological fidelity becomes limiting. Based on Riemer & Freund’s 2026 systematic review of 100 peer-reviewed studies on generative AI in pharmaceutical development.

What predictive validity actually means

The term comes out of pharmacology and gets thrown around loosely. It is worth being precise about what it should mean for AI-bio.

Predictive validity is the question of whether a preclinical experimental system actually forecasts how a drug, target, or biomarker will behave in humans. It is a measurable property: you can correlate model output to clinical outcomes across drugs that have already been tested in humans, you can publish R-squared values, and you can benchmark organoid responses against the actual patients those organoids were derived from.
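
To make the "measurable property" point concrete, here is a minimal sketch of a retrospective benchmark, assuming a preclinical system that yields one response score per drug and a set of drugs with known clinical response rates. All numbers, thresholds, and function names below are hypothetical and chosen for illustration only; they do not describe any real drug, system, or published method.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def go_no_go_concordance(predicted_go, clinical_success):
    """Compare binary preclinical calls against known clinical outcomes."""
    pairs = list(zip(predicted_go, clinical_success))
    tp = sum(1 for p, o in pairs if p and o)          # green light, drug worked
    fp = sum(1 for p, o in pairs if p and not o)      # green light, drug failed
    tn = sum(1 for p, o in pairs if not p and not o)  # red light, drug failed
    fn = sum(1 for p, o in pairs if not p and o)      # red light, drug worked

    def ratio(num, den):
        # None signals "not estimable" when a denominator is empty.
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


# Hypothetical benchmark: made-up values, one entry per drug.
preclinical_score = [0.82, 0.35, 0.67, 0.15, 0.74, 0.48]
clinical_response_rate = [0.41, 0.12, 0.30, 0.08, 0.18, 0.22]

r = pearson_r(preclinical_score, clinical_response_rate)
# For a simple linear fit, R-squared equals the square of Pearson's r.
print(f"r = {r:.2f}, R^2 = {r * r:.2f}")

# Assumed cut-offs: score >= 0.5 is a "go"; response rate >= 0.25 is a success.
calls = [score >= 0.5 for score in preclinical_score]
successes = [rate >= 0.25 for rate in clinical_response_rate]
print(go_no_go_concordance(calls, successes))
```

The positive predictive value is the quantity closest to the green-light problem: of the candidates the system said would work, what fraction did. A real benchmark would also need confidence intervals, since a handful of drugs supports only a very loose estimate.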

Industry estimates put the failure rate from Phase 1 to drug approval at around 90% [2], with most attrition concentrated in Phase 2 and driven by efficacy failures [3]. The story often goes something like this: the model said the molecule worked, the animal data was compelling, the biomarker moved, but then humans did not respond. By definition this is a predictive validity failure. The preclinical system gave a green light that turned out to be wrong, and we discovered it after spending years and many millions of dollars, without a single new medicine reaching patients.

The conventional response is "drug discovery is hard, biology is complex, attrition is the price of doing business." That is true and is also a way of avoiding the question. The industry has a measurement failure. We use prediction tools whose calibration we do not measure, and we are surprised when the predictions miss.

A model trained on cell line data predicts cell line behaviour, not patient behaviour. The leap from one to the other is the entire translational problem.

The emerging stack

The technologies that generate higher-fidelity preclinical data are not new or exotic, but they are underused, under-capitalised relative to model-building, and rarely benchmarked against clinical outcomes with any rigour. These technologies can be grouped into four buckets.

Patient-derived tissue systems: Organoids, patient-derived xenografts, and ex vivo tumour slice cultures preserve patient biology directly.
Engineered human systems: iPSC-derived disease models and microphysiological systems (organ-on-chip) model human physiology in ways animal models cannot. The FDA Modernization Act 2.0 in 2022 opened regulatory pathways that do not require animal testing, the first material policy shift in this space in decades [4].
Humanised in vivo systems: Humanised mouse models reconstitute human immune systems in immunodeficient mice, which matters in immuno-oncology and infectious disease where the immune response is the mechanism of action.
Closing the loop: Multi-omics on existing patient samples and real-world evidence linkage to clinical outcomes turn one-shot experiments into a feedback system. If you do not measure whether your model predicts the clinic, you do not know whether your model predicts the clinic.

The right answer for any specific therapeutic question is usually a combination of these. The unifying point is that all of them generate higher-fidelity training data than the standard preclinical stack, and AI trained on higher-fidelity data is the part that matters.

The view from inside

I work at a precision oncology company that creates patient-derived microtissues to support preclinical development and translational research. The argument I am making applies to our work too, and I have seen it from both sides.

The unit economics of the data layer are brutal compared to model-building. A foundation model can train on existing public data and ship a benchmark improvement in months. A wet-lab system that generates higher-fidelity training data needs sample procurement, tissue handling protocols, growth optimisation, characterisation, and benchmarking before it produces a single useful data point. Timelines are measured in years, and capital requirements are much heavier. That is why I believe the data layer is under-capitalised.

The misallocation thesis

The market is recognising that AI can improve drug discovery, but it is still misreading where the scarce asset sits. The category leaders, in a field where clinical proof remains early and uneven, have largely converged on the same strategic direction: the truly valuable asset is the learning system around the model, as opposed to the model itself.

One example of this is Insilico Medicine: rentosertib has now produced Phase IIa human data [5] and the company has extended the programme into an inhaled formulation, while its business model has moved beyond software licensing into partnered and internal drug assets, including collaborations with Lilly and Sanofi [6]. NOETIK shows the same shift from the data side: its differentiation is a proprietary human tumour data architecture that aligns H&E, spatial protein, spatial transcriptomic and DNA data from curated clinical specimens, then trains self-supervised models on that substrate (which makes it more than just another cancer foundation model) [7]. GSK’s 2026 licensing deal for NOETIK’s oncology foundation models is therefore notable, less as another AI partnership than as evidence that pharma is starting to pay for the biological data layer underneath the model [8].

That is the point the market still struggles to price. Capital and attention gravitate to the part of AI-bio that looks most like software: the model, the benchmark, the demo, the generated molecule. The harder layer to build is the one underneath it: the experimental and clinical infrastructure that determines whether the prediction is actually true. It is slower, much more capital-intensive, and harder to pitch. But if predictive validity is the bottleneck, it is also precisely where more investment is needed to drive the industry forward.

Implications for the industry and for investors

First, the diligence question on every AI-bio deal should evolve. "What is your training data and how is its predictive validity established" should be the opening. The companies that have a good answer are the ones worth paying attention to.

Second, the under-capitalised opportunity is the data layer. Companies generating proprietary high-fidelity data, particularly those that close the loop to clinical outcomes, are likely undervalued. A platform that could credibly improve Phase 2 success rates by even a few percentage points would be a very valuable infrastructure asset in biopharma. The underwriting challenge is whether that platform can generate convincing clinical-concordance evidence, at sufficient scale, inside a venture timeframe. A further question is whether it can capture the value of the learning loop and build a compelling commercial model, rather than giving that value away through CRO-style service projects.

Third, model architectures will commoditise faster than data does. Open-source biomolecular foundation models (e.g. Boltz, OpenFold) are already narrowing the gap with proprietary systems in areas such as structure prediction and binding-affinity modelling. That does not make AI-bio platforms worthless, but it does change what has to be proprietary. The core advantage is less likely to be the model architecture alone, and more likely to be the data, workflow integration, experimental feedback loop, and clinical-concordance evidence wrapped around it.
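
The second implication above rests on simple arithmetic, sketched below under stated assumptions. The function names and the probabilities passed to them are illustrative placeholders, not industry estimates. The sketch treats each stage as an independent pass/fail step and asks how many Phase 2 entrants are needed, on average, per approved drug.

```python
def phase2_entrants_per_approval(p_phase2, p_downstream):
    """Expected number of Phase 2 entrants needed for one approval.

    p_phase2: probability a candidate passes Phase 2 (assumed input).
    p_downstream: probability a Phase 2 success goes on to approval
        (assumed input, covering Phase 3 and regulatory review).
    """
    for p in (p_phase2, p_downstream):
        if not 0 < p <= 1:
            raise ValueError("probabilities must be in (0, 1]")
    return 1.0 / (p_phase2 * p_downstream)


def relative_reduction(p2_before, p2_after, p_downstream):
    """Fractional drop in Phase 2 entrants needed per approval."""
    before = phase2_entrants_per_approval(p2_before, p_downstream)
    after = phase2_entrants_per_approval(p2_after, p_downstream)
    return 1.0 - after / before


# Placeholder inputs: a five-point gain in Phase 2 success, from 30% to 35%.
saving = relative_reduction(p2_before=0.30, p2_after=0.35, p_downstream=0.50)
print(f"Phase 2 entrants needed per approval fall by {saving:.1%}")
```

Under these assumptions the downstream probability cancels out, so the reduction depends only on the ratio of the two Phase 2 success rates. That is why a gain of a few percentage points on a low base rate compounds into a large change in the number of programmes, and hence the capital, needed per approved medicine.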

What I would look for

In a company pitching at this layer of the stack:

Proprietary high-fidelity biological data, not just an algorithmic moat.
Evidence of clinical concordance, ideally retrospective benchmarking against drugs with known clinical outcomes.
A scalable workflow.
A commercial model that lets the company own the compounding data asset, rather than monetising only as a service provider.
A credible path to pharma adoption.
Clear ownership of the feedback loop between model output and clinical signal.

The series

This piece is the first of three. The second turns to the downstream half of the same problem. Even if you improve predictive validity, the clinical system cannot automatically absorb more candidates; the bottleneck moves from candidate generation to first-in-human translation. The final piece asks where those learning loops can actually be built. China has industrialised the cheap and fast clinical loop. Europe’s opportunity is to build around the categories where it has a structural advantage: higher-fidelity biology, linked human data, regulator-defined evidence pathways, and access to American capital.