The Predictive Validity Problem

Why Better Biology Beats Bigger Models

AI-bio's data problem is not mainly data volume; it is biological fidelity. A diagnosis of the upstream bottleneck on AI drug discovery, and where I think investors are mispricing risk. Part 1 of Beyond the Model.


AI in drug discovery is bottlenecked twice, in two distinct layers, and capital still appears to be pricing both too lightly relative to the model layer. This piece is about the first. The training data that AI models in biology learn from is generated by experimental systems whose predictive validity, that is, their ability to forecast how a drug, target, or biomarker will actually behave in humans, is poor and largely unmeasured. AI does not fix this. AI inherits it.

Clean data, messy biology

The deep learning revolution in biology has a simple story attached to it. AlphaFold proved that a sufficiently capable model trained on a sufficiently large number of structures could predict protein structure with accuracy approaching experimental methods. The Protein Data Bank held roughly 170,000 curated structures by 2020, and the result was transformative.

Every subsequent AI-bio pitch invokes some version of this story. Bigger models, more data, transformer architectures applied to whatever biological problem the company is trying to solve. Generative chemistry, target identification, protein-protein interaction prediction, ADMET forecasting, you name it.

AlphaFold worked because the input data was clean, structured, and high-fidelity. Crystal structures are crystal structures. X-ray diffraction tells you where the atoms are, the noise is bounded, the ground truth is well-defined, and the relationship between input and output is causal. Train on enough of those, and the model learns physics.

But most of biology is not like this. The training data for AI-driven target discovery comes from cell line experiments, where the cell lines are decades-old immortalised populations that have drifted from the patient tissues they were derived from. The training data for AI-driven efficacy prediction comes from mouse models bred for genetic homogeneity, housed in pathogen-free facilities, and fed standardised diets. The training data for AI-driven toxicity prediction comes from acute high-dose animal studies whose relationship to chronic low-dose human exposure is largely uncertain.

This asymmetry shows up in the evidence base. A recent systematic review of 100 peer-reviewed studies found that reported generative AI efficiency gains are concentrated in early discovery, while preclinical applications remain limited, prospective validation is sparse, and there is not yet robust evidence that computational acceleration reduces attrition or improves regulatory progress [1].

The point is not that generative AI has no value in discovery. It clearly does. The point is that discovery acceleration is not the same as development productivity. More data does not solve a measurement problem. It scales it. The bottleneck is the experimental system itself.

Figure 1. Generative AI in pharma: where value is currently concentrated versus where biological fidelity becomes limiting. Based on Riemer & Freund’s 2026 systematic review of 100 peer-reviewed studies on generative AI in pharmaceutical development.

What predictive validity actually means

The term comes out of pharmacology and gets thrown around loosely. It is worth being precise about what it should mean for AI-bio.

Predictive validity is the question of whether a preclinical experimental system actually forecasts how a drug, target, or biomarker will behave in humans. It is a measurable property. You can correlate model output to clinical outcomes across drugs that have already been tested in humans, you can publish R-squared values, and you can benchmark organoid responses against the actual patients those organoids were derived from. The fact that we mostly do not do this is the point, not the counter-argument.
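To make the "measurable property" claim concrete, here is a minimal sketch of the simplest version of that measurement: correlating a preclinical readout with the clinical outcome across drugs that have already been tested in humans, and reporting R-squared. The function names and every number below are hypothetical, invented purely for illustration; a real benchmark would need far more drugs, a pre-specified endpoint, and confidence intervals.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)

def r_squared(xs, ys):
    """R-squared of a simple linear fit of ys on xs (equals Pearson r squared)."""
    return pearson_r(xs, ys) ** 2

# Hypothetical, made-up values (assumption, not real data):
# one preclinical response score per drug, and the response rate
# later observed for the same drug in patients.
preclinical_score = [0.10, 0.35, 0.50, 0.70, 0.90]
clinical_response = [0.05, 0.20, 0.15, 0.40, 0.45]

r2 = r_squared(preclinical_score, clinical_response)
print(f"R-squared across {len(preclinical_score)} drugs: {r2:.2f}")
```

The calculation is trivial. What is scarce is the paired dataset: the same drugs run through the preclinical system and through the clinic.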

Industry estimates put the failure rate from Phase 1 to drug approval at around 90% [2], with most attrition concentrated in Phase 2 and driven by efficacy failures [3]. The story often goes something like this: the model said the molecule worked, the animal data was compelling, the biomarker moved, but then humans did not respond. By definition this is a predictive validity failure. The preclinical system gave a green light that turned out to be wrong, and we discovered it after spending years and many millions of dollars, without a single new medicine reaching patients.

The conventional response is "drug discovery is hard, biology is complex, attrition is the price of doing business." That is true and is also a way of avoiding the question. The industry has a measurement failure. We use prediction tools whose calibration we do not measure, and we are surprised when the predictions miss.

A model trained on cell line data predicts cell line behaviour, not patient behaviour. The leap from one to the other is the entire translational problem.

The emerging stack

The technologies that generate higher-fidelity preclinical data are not new or exotic, but they are underused, under-capitalised relative to model-building, and rarely benchmarked against clinical outcomes with any rigour. These technologies can be grouped into four buckets.

  1. Patient-derived tissue systems. Organoids, patient-derived xenografts, and ex vivo tumour slice cultures preserve patient biology directly.
  2. Engineered human systems. iPSC-derived disease models and microphysiological systems (organ-on-chip) model human physiology in ways animal models cannot. The FDA Modernization Act 2.0 in 2022 opened regulatory pathways that do not require animal testing, the first material policy shift in this space in decades [4].
  3. Humanised in vivo systems. Humanised mouse models reconstitute human immune systems in immunodeficient mice, which matters in immuno-oncology and infectious disease where the immune response is the mechanism of action.
  4. Closing the loop. Multi-omics on existing patient samples and real-world evidence linkage to clinical outcomes turn one-shot experiments into a feedback system. If you do not measure whether your model predicts the clinic, you do not know whether your model predicts the clinic.

The right answer for any specific therapeutic question is usually a combination of these. The unifying point is that all of them generate higher-fidelity training data than the standard preclinical stack, and AI trained on higher-fidelity data is the part that matters.

The view from inside

I work at a precision oncology company that creates patient-derived microtissues to support preclinical development and translational research. The argument I am making applies to our work too, and I have seen it from both sides.

The unit economics of the data layer are brutal compared to model-building. A foundation model can train on existing public data and ship a benchmark improvement in months. A wet-lab system that generates higher-fidelity training data needs sample procurement, tissue handling protocols, growth optimisation, characterisation, and benchmarking before it produces a single useful data point. Timelines are years, not quarters, and capital requirements are much heavier. The defensibility story is harder to tell to a tech-fluent investor trained to look for software margins. That is why I believe the data layer is under-capitalised. It is not because the value is not there; it is because the value does not look like the value tech investors know how to underwrite.

The misallocation thesis

The market is recognising that AI can improve drug discovery, but it is still misreading where the scarce asset sits. The category leaders, in a field where clinical proof remains early and uneven, have largely converged on the same strategic direction: the asset is not the model in isolation, but the learning system around it.

Insilico provides a clear example: rentosertib has now produced Phase IIa human data [5] and the company has extended the programme into an inhaled formulation, while its business model has moved beyond software licensing into partnered and internal drug assets, including collaborations with Lilly and Sanofi [6]. NOETIK shows the same shift from the data side: its differentiation is not just a cancer foundation model, but a proprietary human tumour data architecture that aligns H&E, spatial protein, spatial transcriptomic and DNA data from curated clinical specimens, then trains self-supervised models on that substrate [7]. GSK’s 2026 licensing deal for NOETIK’s oncology foundation models is therefore notable, less as another AI partnership than as evidence that pharma is starting to pay for the biological data layer underneath the model [8].

That is the point the market still struggles to price. Capital and attention gravitate to the part of AI-bio that looks most like software: the model, the benchmark, the demo, the generated molecule. The harder layer to build is the one underneath it: the experimental and clinical infrastructure that determines whether the prediction is actually true. It is slower, more capital-intensive, and less elegant to pitch. But if predictive validity is the bottleneck, it is also where the marginal dollar should matter most.

What this means for the industry and for investors

Three implications.

First, the diligence question on every AI-bio deal should evolve. "What is your training data and how is its predictive validity established" should be the opening. The companies that have a good answer are the ones worth paying attention to.

Second, the under-capitalised opportunity is the data layer. Companies generating proprietary high-fidelity data, particularly those that close the loop to clinical outcomes, are systematically undervalued because the unit economics look unfamiliar to investors trained on software comparables. A platform that could credibly improve Phase 2 success rates by even a few percentage points would be a very valuable infrastructure asset in biopharma. The underwriting challenge is whether that platform can generate convincing clinical-concordance evidence, at sufficient scale, inside a venture timeframe. A further challenge is whether it can capture the value of the learning loop and create a compelling commercial model, rather than giving it away through CRO-style service projects.

Third, model architectures will commoditise faster than data does. Open-source biomolecular foundation models (e.g. Boltz, OpenFold) are already narrowing the gap with proprietary systems in areas such as structure prediction and binding-affinity modelling. That does not make AI-bio platforms worthless; it changes what has to be proprietary. The durable advantage is less likely to be the model architecture alone, and more likely to be the data, workflow integration, experimental feedback loop, and clinical-concordance evidence wrapped around it.
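The second implication's claim that a few percentage points of Phase 2 success matter can be sketched as simple arithmetic. Because overall success is the product of the per-phase probabilities, a small absolute gain at the weakest phase produces a larger relative gain in approvals. The per-phase probabilities below are hypothetical placeholders chosen only to land near the roughly 90% overall failure rate cited earlier; they are not taken from any of the sources.

```python
from math import prod

def overall_success(phase_probs):
    """Probability of approval as the product of per-phase success rates."""
    return prod(phase_probs)

# Hypothetical per-phase success probabilities (assumptions, not source data):
# Phase 1, Phase 2, Phase 3, regulatory review.
baseline = [0.50, 0.30, 0.60, 0.90]
improved = [0.50, 0.33, 0.60, 0.90]  # Phase 2 up by three percentage points

p_base = overall_success(baseline)
p_new = overall_success(improved)
relative_gain = p_new / p_base - 1

print(f"Baseline approval probability: {p_base:.3f}")
print(f"Improved approval probability: {p_new:.3f}")
print(f"Relative gain in approvals:    {relative_gain:.0%}")
print(f"Phase 1 entrants per approval: {1 / p_base:.1f} -> {1 / p_new:.1f}")
```

Under these assumed numbers, the same three-point gain applied to a phase that already succeeds most of the time would move the total much less, which is why Phase 2 is the lever.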

What I would look for

In a company pitching at this layer of the stack:

  1. Proprietary high-fidelity biological data, not just an algorithmic moat.
  2. Evidence of clinical concordance, ideally retrospective benchmarking against drugs with known clinical outcomes.
  3. A workflow that scales without destroying margins.
  4. A commercial model that lets the company own the compounding data asset, rather than monetising only as a service provider.
  5. A credible path to pharma adoption.
  6. Clear ownership of the feedback loop between model output and clinical signal.
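The retrospective benchmarking in item 2 can be sketched as a confusion matrix: for drugs with known clinical outcomes, compare the platform's go/no-go call with what actually happened in patients. This is a minimal illustration under stated assumptions. The function name, the binary framing, and the example calls are all hypothetical, and a real concordance study would also report sample sizes and uncertainty.

```python
def concordance_metrics(predicted, observed):
    """Compare binary preclinical calls with binary clinical outcomes.

    predicted[i] is True if the platform predicted drug i would work;
    observed[i] is True if drug i actually showed clinical benefit.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    pairs = list(zip(predicted, observed))
    tp = sum(1 for p, o in pairs if p and o)          # true positives
    fp = sum(1 for p, o in pairs if p and not o)      # false green lights
    fn = sum(1 for p, o in pairs if not p and o)      # missed responders
    tn = sum(1 for p, o in pairs if not p and not o)  # true negatives

    def ratio(num, den):
        # None when the denominator is empty, rather than dividing by zero.
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),  # how often a green light was right
        "npv": ratio(tn, tn + fn),
    }

# Hypothetical calls for eight historical drugs (made-up, for illustration).
predicted = [True, True, True, False, False, True, False, False]
observed = [True, False, True, False, False, False, True, False]

for name, value in concordance_metrics(predicted, observed).items():
    print(f"{name}: {value:.2f}")
```

For the Phase 2 problem described earlier, positive predictive value is the number to watch: it is the fraction of preclinical green lights that held up in humans.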

The series

This piece is the first of four. The second turns to the downstream half of the same problem. Even if you fix predictive validity, the clinical system cannot absorb more candidates without widening the throat of the funnel. The third argues that the layer at which AI-bio iteration happens (the clinic, not the bench) is the deciding strategic variable. The fourth applies all of it to Europe, where the right bet looks very different from a slower copy of Boston or Beijing.

Sources

  1. Generative artificial intelligence in drug discovery and development: a systematic review
  2. BIO, Informa Pharma Intelligence and QLS Advisors — Clinical Development Success Rates 2011–2020
  3. PharmaTimes — Phase II failures on the rise, CMR analysis finds
  4. FDA Modernization Act 2.0 / non-animal testing policy context
  5. Recent review on AI, translational validity, and drug discovery productivity
  6. Reuters — Eli Lilly extends partnership with Insilico Medicine
  7. NOETIK — Human data platform
  8. Business Wire — GSK licenses NOETIK's AI foundation models