May 6, 2026
Oncology drug development is built around a slow, expensive question: is this drug actually working? Overall survival, the gold-standard endpoint, can take years to mature, particularly in earlier-line settings, forcing sponsors to either spend heavily on compounds that may ultimately fail or delay access to effective therapies. Historically low clinical success rates, estimated at roughly 3–5% from Phase I to approval (Wong, Siah & Lo, 2019), compound this risk, while the global oncology market is projected to exceed USD 500 billion by the early 2030s. As a result, the cost of late go/no-go decisions continues to rise, pushing sponsors toward more integrated, data-driven approaches that can detect meaningful signals earlier, before survival endpoints fully mature.
In this Q&A, we speak with Mohamed Tarek, PhD, Senior Product Engineer at PumasAI, about a recent application of DeepPumas: a joint tumor growth dynamics (TGD) and overall survival (OS) model for non-small cell lung cancer (NSCLC) that lets sponsors classify ongoing trials as likely successful or unsuccessful using only early interim data, well before traditional OS endpoints would be readable. Built on the DeepNLME framework and developed in close collaboration with scientists at Bristol Myers Squibb (BMS), the model came together in two phases: Mohamed laid much of the groundwork in 2023 and presented the first phase at ACoP 2024; his colleague Lorenzo Contento led the second phase, which was recently accepted as an oral presentation at PAGE 2026 in Dubrovnik, Croatia.
Read on as Mohamed walks us through the process, the challenges, and the impact of working with DeepPumas in oncology drug development.
Question: What was the sponsor trying to learn?
Answer: The sponsor came to us with one big ask: they wanted a treatment-agnostic model. Something that could look at early data from a trial and tell them, with reasonable confidence, whether a new drug was actually outperforming the standard of care, without making them wait years for the survival data to mature.
The "treatment-agnostic" piece really mattered. They were not looking for a one-off model fit to a specific compound. They wanted a framework, calibrated on historical NSCLC data once, and then reusable. Point it at any new candidate drug, feed in whatever early data you have, and get a defensible go/no-go signal at the interim that could feed into the next decision in their development plan.
Needless to say, this is a hard problem! Tumor growth data, particularly the sum of longest diameters, or SLD, shows up much sooner than survival data, which is great. But the usual clinical endpoint, something like objective response rate, throws away most of what those trajectories actually contain: kinetics, depth of response, regrowth, a rich time course collapsed into a single category. So our job at PumasAI was to use the full shape of the trajectory and turn that into something a sponsor could actually trust to tell a working drug apart from the standard of care. The approach we took was joint TGD–OS modeling augmented with machine learning. Of course, TGD–OS is an established field, so the question was not whether to link tumor dynamics to survival, but how to extract features from the shape of the TGD profiles that traditional methods tend to miss, and use those features to predict survival. Traditional models do not capture unusual trajectory shapes well, and that is where a lot of the prognostic signal may actually live.
Question: What was your approach?
Answer: We used DeepPumas, our scientific machine learning platform, to build a joint model that ties tumor growth and overall survival together. The model is treatment-agnostic by design but tuned to the tumor type. At its core it is still a tumor growth model with survival as the outcome, but DeepPumas lets us layer machine learning into that, which is what we needed to actually capture how messy and nonlinear these trajectories can be in practice. A purely mechanistic model would have a hard time with that.
For the tumor growth side, we used what we call a universal differential equation, or UDE. The idea is straightforward: instead of writing down a closed-form ODE for how SLD evolves over time, we let a neural network learn the time derivative, but we anchor it with domain knowledge. We constrain the network to keep SLD positive, for example, because tumor diameter cannot go negative. That kind of guardrail matters a lot in practice.
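To make the UDE idea concrete, here is a minimal Python sketch; it is not DeepPumas code, and every name in it is hypothetical. The interview does not say how positivity is enforced, so the sketch assumes one common choice: the network predicts the derivative of log SLD, which keeps SLD positive whatever the network outputs. The weights are random and untrained, and the 0.01 scaling of the output is an arbitrary illustrative choice.

```python
import math
import random

def init_mlp(n_in, n_hidden, seed=0):
    """Random (untrained) weights for a one-hidden-layer network with scalar output."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.gauss(0.0, 0.5) for _ in range(n_hidden)]
    return w1, b1, w2

def mlp(params, x):
    """Evaluate the network: tanh hidden layer, linear scalar output."""
    w1, b1, w2 = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden))

def simulate_sld(params, sld0, n_days):
    """Integrate d(log SLD)/dt = 0.01 * NN(log SLD, scaled time) with forward Euler.

    Assumed guardrail: the state is log SLD, so SLD = exp(log SLD) stays positive.
    """
    log_sld = math.log(sld0)
    path = [sld0]
    for day in range(n_days):
        rate = 0.01 * mlp(params, [log_sld, day / n_days])  # relative growth per day
        log_sld += rate                                     # Euler step, dt = 1 day
        path.append(math.exp(log_sld))
    return path

params = init_mlp(n_in=2, n_hidden=4)
path = simulate_sld(params, sld0=60.0, n_days=180)
```

In a real fit the weights would be estimated from data; the point here is only that the constraint lives in the model structure rather than in the training procedure.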
Then, to make it a proper NLME model, we feed subject-specific random effects into that neural network as additional inputs. Intuitively, this takes the huge space of possible tumor trajectories and squeezes it down into a low-dimensional space, where each patient gets their own coordinates.
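A small Python sketch of what "random effects as additional inputs" can look like, under stated assumptions: the weights are fixed, hand-picked numbers rather than fitted values, the random effects are two-dimensional, and the log-SLD parameterization is an illustrative choice. None of this is DeepPumas code.

```python
import math

# Shared population-level weights (hypothetical, hand-picked for illustration).
# Network inputs are [log SLD, eta_1, eta_2]; there are three hidden units.
W1 = [[0.3, 1.0, 0.0],
      [-0.2, 0.0, 1.0],
      [0.1, 0.5, -0.5]]
W2 = [0.004, -0.006, 0.002]

def dlogsld_dt(log_sld, eta):
    """Time derivative of log SLD given the state and the patient's random effects."""
    x = [log_sld, *eta]
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sum(w * h for w, h in zip(W2, hidden))

def trajectory(sld0, eta, n_days=120):
    """Forward-Euler SLD trajectory (dt = 1 day) for one patient."""
    log_sld = math.log(sld0)
    path = [sld0]
    for _ in range(n_days):
        log_sld += dlogsld_dt(log_sld, eta)
        path.append(math.exp(log_sld))
    return path

# Same network, same baseline SLD, different coordinates in the eta space.
patient_a = trajectory(50.0, eta=(1.5, -1.0))   # this eta yields growth
patient_b = trajectory(50.0, eta=(-1.5, 1.0))   # this eta yields shrinkage
```

The two patients share every network weight; only their low-dimensional coordinates differ, and that alone moves one trajectory up and the other down.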
We then layer a second neural network on top that looks at a patient's covariates and predicts a distribution over their random effects. This captures covariate-specific between-subject variability, which is highly prevalent in oncology. Patients with identical covariates can have very different tumor trajectories, but the distribution of those trajectories is still shaped by the covariates, and that distribution is what the second network learns to predict. This pays off enormously when you only have two or three SLD scans for a patient, which is exactly the regime we care about for early decisions. Instead of being lost without enough data, the model leans on the covariates to make a reasonable guess.
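As a rough illustration of a covariate-dependent distribution over random effects, the Python sketch below maps covariates to a mean and standard deviation per random effect and then samples from it. A single linear layer stands in for the second neural network, the distribution is assumed to be an independent Gaussian, and the covariate names and weights are hypothetical; the interview does not specify any of these details.

```python
import math
import random

# Hypothetical weights: one (mean weights, log-sd weights) pair per random effect.
# Assumed covariate vector: [intercept, ECOG status, standardized baseline SLD].
WEIGHTS = [
    ((0.0, 0.4, 0.3), (-0.5, 0.2, 0.0)),
    ((0.1, -0.2, 0.5), (-1.0, 0.0, 0.1)),
]

def eta_distribution(covariates):
    """Return a (mean, sd) pair for each random effect, given the covariates."""
    out = []
    for w_mean, w_logsd in WEIGHTS:
        mean = sum(w * x for w, x in zip(w_mean, covariates))
        sd = math.exp(sum(w * x for w, x in zip(w_logsd, covariates)))  # exp keeps sd > 0
        out.append((mean, sd))
    return out

def sample_eta(covariates, rng):
    """Draw one patient's random effects from the covariate-specific distribution."""
    return [rng.gauss(mean, sd) for mean, sd in eta_distribution(covariates)]

rng = random.Random(0)
x = [1.0, 1.0, 0.8]            # two patients with identical covariates
patient_1 = sample_eta(x, rng)
patient_2 = sample_eta(x, rng)
```

Both patients share the same distribution but receive different draws, which mirrors the point that identical covariates can still produce different trajectories.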
For survival, we developed a modified log-logistic proportional hazards model with a covariate model that uses a third neural network. This third network takes the patient's covariates plus the SLD trajectory predicted by the TGD model, and derived quantities like how fast SLD is changing, to estimate the hazard for each patient and predict survival.
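The structure described here can be sketched in Python as a log-logistic baseline hazard multiplied by the exponential of a covariate term. This is an assumption-laden illustration, not the actual model: the "modified" part is not described in the interview and is not reproduced, a linear function stands in for the third neural network, and all parameter values are hypothetical.

```python
import math

def loglogistic_hazard(t, alpha, beta):
    """Baseline hazard of a log-logistic distribution (scale alpha, shape beta)."""
    return (beta / alpha) * (t / alpha) ** (beta - 1) / (1.0 + (t / alpha) ** beta)

def log_hazard_ratio(age_std, sld, dsld_dt, w=(0.1, 0.01, 0.5)):
    """Stand-in for the third network: linear in a covariate, SLD, and the SLD slope."""
    return w[0] * age_std + w[1] * (sld - 50.0) + w[2] * dsld_dt

def survival_curve(times, sld_path, age_std, alpha=400.0, beta=1.5):
    """S(t) = exp(-cumulative hazard), integrated with the trapezoidal rule."""
    surv, cum_hazard, prev_h = [1.0], 0.0, 0.0
    for i, t in enumerate(times):
        dt = times[i] - times[i - 1] if i > 0 else 0.0
        slope = (sld_path[i] - sld_path[i - 1]) / dt if i > 0 else 0.0
        h = loglogistic_hazard(t, alpha, beta) * math.exp(
            log_hazard_ratio(age_std, sld_path[i], slope))
        if i > 0:
            cum_hazard += 0.5 * (h + prev_h) * dt
            surv.append(math.exp(-cum_hazard))
        prev_h = h
    return surv

times = list(range(0, 721, 30))                        # days
shrinking = [50.0 * math.exp(-0.002 * t) for t in times]
growing = [50.0 * math.exp(0.002 * t) for t in times]
s_shrinking = survival_curve(times, shrinking, age_std=0.0)
s_growing = survival_curve(times, growing, age_std=0.0)
```

With these illustrative weights, the patient whose tumor burden grows ends up with the lower predicted survival, because both SLD and its rate of change feed the hazard.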
The treatment-agnostic piece is what makes this whole thing actually work in practice. We train on historical data once, and then when a new randomized trial comes along, we feed the early data in and ask the model to predict the population survival curve for each arm. Because both predictions come from the same model, any bias that is shared across arms cancels in the comparison. Not all bias is shared; some is arm-specific. But our evaluation suggests the residual is small enough that it rarely flips the go/no-go call. And since the trial is randomized, there is no systematic covariate imbalance between arms either. That means we can compare the two predicted curves directly without having to worry about whether the model is perfectly calibrated in an absolute sense. We just need it to be calibrated enough that the relative comparison is meaningful. The analogy I like to use is that we don't need to know exactly when two horses will finish a race to know that horse A is faster than horse B and will probably reach the finish line first. In the case of oncology, reaching the finish line first is not at all desirable of course, but the principle is the same. We just need to know which arm is likely to do better, not how well either arm will do in an absolute sense. That lowers the bar for what we need from the model, which is important, because predicting survival from early tumor growth data is a genuinely hard problem.
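The cancellation argument can be checked with a few lines of arithmetic. The Python sketch below assumes, purely for illustration, that prediction error is additive and splits into a part shared by both arms and an arm-specific part; the medians and bias values are made-up numbers, not results from the project.

```python
def predicted_median(true_median, shared_bias, arm_bias):
    """Model prediction under an assumed additive error decomposition (months)."""
    return true_median + shared_bias + arm_bias

def predicted_difference(shared_bias):
    """Predicted treatment-minus-control difference in median survival."""
    control = predicted_median(12.0, shared_bias, arm_bias=0.5)
    treatment = predicted_median(15.0, shared_bias, arm_bias=-0.5)
    return treatment - control

small = predicted_difference(shared_bias=-1.0)   # mildly miscalibrated model
large = predicted_difference(shared_bias=-4.0)   # badly miscalibrated model
```

Both calls give a difference of 2.0 months against a true difference of 3.0: the shared bias drops out entirely, and only the arm-specific residual (here 1.0 month) distorts the comparison.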
Evaluating the model was another big part of the project, and we had to be creative there. We did not have access to any new trials to test on, so we partially relied on simulation while still using real data for the model evaluation. We took two real NSCLC trials that were not in our training set, one that we knew was a winner and one that we knew had failed, and generated a thousand synthetic versions of each by simulating enrollment patterns that matched what those real trials actually looked like. Critically, we did not simulate the tumor trajectories themselves; we kept those as they were in the real data. We just simulated when subjects would enroll and how long they would be followed up, which is something we can do with a lot of confidence. That let us test the model under realistic data-scarce scenarios that mimic what happens in practice when you are trying to make an early decision. In each scenario, a different set of subjects from the original trial would be enrolled and each followed for a different length of time. The enrollment simulation would terminate when a desired number of subjects had reached a target duration of follow-up, which is exactly what you would do in practice if you were using the model to make an interim decision.
So you are basically asking the model: "if you had only seen the first chunk of this trial, what would you have decided, go or no-go?" Then you compare that to the actual outcome of the trial, which you know because you are using real data. This is a powerful way to evaluate the model in a way that is as close to reality as possible without having to wait for new trials to mature.
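The evaluation procedure above can be sketched in Python as follows. Only enrollment is simulated; each subject's scan schedule is treated as fixed real data and is merely truncated at the interim cutoff. The exponential gaps between enrollments, the function name, and the toy trial are assumptions made for illustration, not details of the actual study.

```python
import random

def simulate_interim(scan_days, n_required, min_followup, mean_gap, rng):
    """Simulate one interim data cut.

    scan_days maps each subject to scan days relative to their own enrollment
    (kept exactly as observed). Enrollment order and timing are simulated with
    exponential gaps, an assumed accrual model.
    """
    order = list(scan_days)
    rng.shuffle(order)                       # a different enrollment order each run
    enroll, day = {}, 0.0
    for subject in order:
        day += rng.expovariate(1.0 / mean_gap)
        enroll[subject] = day
    # Stop when the n_required-th enrollee has accumulated min_followup days.
    cutoff = sorted(enroll.values())[n_required - 1] + min_followup
    visible = {}
    for subject in order:
        if enroll[subject] <= cutoff:        # enrolled before the data cut
            visible[subject] = [d for d in scan_days[subject]
                                if enroll[subject] + d <= cutoff]
    return cutoff, visible

# Hypothetical trial: 60 subjects, scanned every 42 days for up to a year.
scan_days = {f"subj{i:02d}": list(range(0, 365, 42)) for i in range(60)}
cutoff, visible = simulate_interim(scan_days, n_required=5, min_followup=90,
                                   mean_gap=4.0, rng=random.Random(1))
```

Repeating the call with different random seeds yields many synthetic interim datasets from one real trial, each with a different subset of subjects and different follow-up lengths.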
Question: What did you discover?
Answer: Honestly, the results held up better than I expected. Before I get to the numbers, a quick note on the setup: we evaluated the model at two interim-analysis triggers, an earlier one fired when 5 patients had reached at least 90 days of follow-up, and a later one fired when 15 patients had reached at least 180 days. Those thresholds define when the interim analysis runs, not how many patients are in it. By the time the trigger fires, the trial population is much larger, and the model uses everyone enrolled by that point.
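The distinction between the trigger threshold and the size of the analysis set is easy to see with a toy accrual schedule. In the Python sketch below, the steady rate of one patient every 3 days and the total of 200 patients are hypothetical; only the two trigger definitions come from the interview.

```python
def trigger_day(enroll_days, n_required, min_followup):
    """Day on which the n_required-th earliest enrollee reaches min_followup days."""
    return sorted(enroll_days)[n_required - 1] + min_followup

def enrolled_by(enroll_days, day):
    """Number of patients enrolled on or before the given day."""
    return sum(1 for d in enroll_days if d <= day)

# Hypothetical accrual: one patient every 3 days, 200 patients in total.
enroll_days = [3 * i for i in range(200)]

early = trigger_day(enroll_days, n_required=5, min_followup=90)     # day 102
late = trigger_day(enroll_days, n_required=15, min_followup=180)    # day 222
n_early = enrolled_by(enroll_days, early)                           # 35 patients
n_late = enrolled_by(enroll_days, late)                             # 75 patients
```

Under this made-up schedule, the "5 patients at 90 days" trigger fires with 35 patients already enrolled, and the "15 patients at 180 days" trigger fires with 75, so the analysis population is several times larger than the number in the trigger definition.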
The simplest decision rule is to look at the difference in predicted median survival between the two arms and call it a "go" whenever that number comes out positive, and a "no-go" whenever it comes out negative. With that rule alone and without further threshold calibration, the model's performance was already pretty good. For the truly positive trial, the model made the right call in 79.2% of the 1,000 simulated interim analyses at the earlier trigger (5 patients with 90 days of follow-up), and 94.3% at the later one (15 patients with 180 days of follow-up). For the negative trial, false-positive rates topped out at 1.5% at the earlier trigger and dropped to 0.4% at the later one. Milestone survival told a similar story.
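The decision rule and the way the percentages are tallied can be written down in a few lines of Python. The predicted medians below are invented numbers for illustration, and treating an exactly zero difference as "no-go" is an assumption, since the interview only covers the positive and negative cases.

```python
def decide(median_treatment, median_control):
    """Go if the predicted median survival difference is positive, else no-go.

    A difference of exactly zero is treated as no-go (an assumption).
    """
    return "go" if median_treatment - median_control > 0 else "no-go"

def go_rate(predictions):
    """Fraction of simulated interim analyses that end in a 'go' call."""
    calls = [decide(treatment, control) for treatment, control in predictions]
    return calls.count("go") / len(calls)

# Hypothetical (treatment, control) predicted medians in months,
# one pair per simulated interim analysis.
predictions = [(14.2, 11.0), (12.5, 12.9), (15.1, 11.8), (13.0, 11.2)]
rate = go_rate(predictions)   # 3 of 4 calls are "go"
```

For a truly positive trial the go rate is the share of correct calls; for a truly negative trial the same quantity is the false-positive rate.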
The obvious caveat is that this rests on just two test trials. We are not claiming the model will perform identically on every future readout; establishing that requires many more held-out trials. What we can say is that under realistic perturbations of enrollment and follow-up, the model's call was stable across 1,000 simulated interim analyses for each trial, one positive and one negative. That is the regime sponsors actually face at decision time, and the error rates we observed make both kinds of call, go and no-go, defensible. Nobody wants to kill a working drug, and nobody wants to keep pouring resources into one that is not working. A tool that lets you make either call early, with this kind of error profile, could meaningfully shift the economics of oncology development.
Question: What made this project stand out?
Answer: For me, this project is a really good example of what hybrid modeling, mechanistic plus machine learning, can do for oncology drug development. Oncology data is almost never as clean or as abundant as you would like; that is just the reality of the field. But that is exactly the regime DeepPumas is built for, and it gave us what we needed to turn the available data into something actionable. The model we developed here is more empirical than mechanistic, but it still incorporates more traditional pharmacometric modeling principles than your average machine learning model: the non-negativity constraint on SLD, the proportional hazards structure for survival, and the causal relationship between tumor growth and survival. Those structural constraints made the model more data-efficient and made it possible to train in a reasonable amount of time and get good performance.
Beyond the general SciML philosophy, two things in particular set this project apart. The first is the treatment-agnostic design: you train the model once on historical data, and then you can point it at a new compound without having to refit anything. So you do not need a huge dataset on your specific drug for the framework to be useful. The second is the way the NLME and SciML pieces fit together. You get the expressiveness of neural networks, which lets you capture how varied these tumor trajectories really are, but you do not lose the statistical rigor of mixed-effects modeling.
I also have to give real credit where it is due. My colleague Lorenzo Contento did a stellar job leading the second phase of this project and contributing to the first. Amit Roy, our head of consulting at PumasAI, played a key role in shaping the project and providing strategic guidance. And our collaborators and coauthors at Bristol Myers Squibb were partners on this from start to finish; their clinical and scientific input genuinely shaped the framework. I'd encourage anyone going to PAGE this year who is interested in the details of this work to catch Lorenzo's presentation and to tackle him with questions wherever you spot him at the conference!
Question: What’s next for this kind of approach?
Answer: We are seeing a lot of interest in pulling this kind of approach into other areas: rare diseases, immuno-oncology, even real-world evidence analysis, anywhere you have sparse, multimodal data and you need to make a decision earlier than the data really wants to let you. The combination of traditional pharmacometrics and data-driven learning opens things up in concrete ways: earlier signal detection, smarter trial designs, better translation from preclinical to clinical. We are also looking at folding additional data modalities, like imaging-derived features, into the same DeepNLME structure, for example to capture the entire tumor microenvironment rather than just the SLD.
As the industry keeps moving toward more adaptive and more personalized therapies, I think tools like DeepPumas are going to be essential. You need to be able to capture patient-level nuance in a way that scientists, regulators, and clinicians can all actually trust, and that is what we are building toward.
To submit questions to Dr. Mohamed Tarek and the team, message us here.
To explore DeepPumas, request a call here.

© Pumas-AI Inc. 2026. All rights reserved.