(T-051) Efficient Generation of Plausible Virtual Populations Using Machine Learning Surrogate Classification Models

Tuesday, October 21, 2025

7:00 AM - 1:45 PM MDT

Location: Colorado A

Dennis Reddyhoff – Certara; Andrew Matteson – Certara; Andrzej Kierzek – Certara

Author(s)


Dennis Reddyhoff, PhD

Senior Scientist
Certara, United Kingdom

Disclosure(s):

Dennis Reddyhoff: No financial relationships to disclose

Objectives: We define a virtual patient as a set of input parameters that satisfies biologically informed criteria and can be simulated with an ODE model to predict clinical outputs. When clinical data are available, a plausible patient is one whose model outputs fall within clinically expected ranges. Because simulating complex quantitative systems pharmacology (QSP) models can be computationally expensive, plausible patients can be found through optimization or machine learning (ML) methods rather than grid search [1][2]. Simulated plausible populations can then be resampled to fit observed populations [1].

We aim to efficiently generate plausible patients for a QSP model of the Cancer Immunity Cycle (CIC) [3] by sampling parameter sets predicted to fall within clinical ranges. A surrogate ML classification model is trained on a small initial subset of inputs and outputs to classify virtual patients as plausible or not plausible. Batches of patients predicted to be plausible are then simulated, and the model is retrained at each step, until a suitable number of plausible patients has been generated.

These plausible patients are then resampled to reproduce observed clinical distributions of tumour growth (percentage change in the sum of longest diameters, %SLD change) and tumour microenvironment (TME) cell composition (percentages of CD8 cells (%CD8) and dendritic cells (%DC) in the TME) [4][5].

Methods: An initial input parameter set (N = 1000), generated by Sobol sampling between lower and upper bounds defined for each parameter, is used to simulate predicted outputs using an ODE-based model of the CIC, described in [3]. Input parameter sets are labelled positively if all outputs are within expected clinical ranges. This labelled dataset is used to train an XGBoost [6] classification model and predict labels for the next batch of N input parameters.

We generate 10,000 plausible patients and resample using inclusion probability as in [1] to fit observed distributions of clinical data. An additional baseline population of 10,000 patients, generated using Sobol sampling, is simulated to compare against the surrogate-based method.
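
As a rough illustration of resampling by inclusion probability, the sketch below accepts each plausible patient with probability proportional to the ratio of a target density to the density of the plausible population. It is a simplified one-dimensional example and not the procedure of [1]: the kernel density estimate, the normalisation by the maximum ratio, the uniform plausible population, and the normal target distribution are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm


def select_by_inclusion_probability(values, target_pdf, rng):
    """Return a boolean mask of patients accepted into the virtual population."""
    # Density of the plausible population, evaluated at each patient.
    population_density = gaussian_kde(values)(values)
    # Inclusion probability is proportional to target / population density.
    ratio = target_pdf(values) / population_density
    inclusion_prob = ratio / ratio.max()
    return rng.random(values.size) < inclusion_prob


rng = np.random.default_rng(0)
# Hypothetical plausible population: %SLD change spread evenly over its range.
plausible_sld = rng.uniform(-100.0, 100.0, size=5000)
# Hypothetical observed distribution of %SLD change.
target = norm(loc=-10.0, scale=30.0)

selected = select_by_inclusion_probability(plausible_sld, target.pdf, rng)
virtual_population = plausible_sld[selected]
print(f"kept {selected.sum()} of {plausible_sld.size} plausible patients")
print(f"mean %SLD change: {virtual_population.mean():.1f}")
```

Patients in regions where the plausible population is dense relative to the target are accepted less often, which pulls the accepted sample towards the target distribution.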

Results: The fitted virtual population is compared against observed waterfall plots of %SLD change from baseline using a two-sample Kolmogorov-Smirnov (KS) test, and against observed RECIST categories using the chi-squared test. We also compare against observed data for cell distributions in the TME. For %SLD change, the KS p-value of the virtual population is 1.0 and the chi-squared p-value is 0.88. The KS p-value is 0.79 for %CD8 in the TME and 0.23 for %DC in the TME.
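
The two comparisons above can be run with SciPy, as in the sketch below. The data are synthetic placeholders, not the study data. Two further assumptions: the chi-squared test is set up as a contingency-table test of homogeneity (the abstract does not say which variant was used), and complete and partial responses are folded into a single response category using the RECIST thresholds of a 30% decrease and a 20% increase in SLD.

```python
import numpy as np
from scipy.stats import chi2_contingency, ks_2samp

rng = np.random.default_rng(1)
# Synthetic %SLD change from baseline; a change cannot go below -100%.
observed_sld = np.clip(rng.normal(-10, 35, size=200), -100, None)
virtual_sld = np.clip(rng.normal(-10, 35, size=2000), -100, None)

# Two-sample KS test on the continuous %SLD distributions.
ks = ks_2samp(observed_sld, virtual_sld)


def response_counts(sld):
    """Counts per category: 0 = response, 1 = stable disease, 2 = progression."""
    category = np.select([sld <= -30, sld >= 20], [0, 2], default=1)
    return np.bincount(category, minlength=3)


# Chi-squared test on category counts (rows: observed, virtual).
table = np.vstack([response_counts(observed_sld), response_counts(virtual_sld)])
chi2_stat, chi2_p, dof, _ = chi2_contingency(table)

print(f"KS p-value:   {ks.pvalue:.2f}")
print(f"chi-squared p-value: {chi2_p:.2f} (dof = {dof})")
```

In both tests a high p-value means the test found no evidence that the virtual and observed distributions differ; it does not by itself prove that they match.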

The baseline sample produced ~2200 plausible patients from an initial sample of 10,000 (22% of sampled parameter sets), while the surrogate-based approach produced ~9000 plausible patients (90% of sampled parameter sets).

The final XGBoost classifier has a True Positive Rate of ~98% on a held-out validation set. That is, only ~2% of positively labelled patients fall outside clinical bounds, compared with 78% of parameter sets in the Sobol-sampled baseline, a 39-fold reduction (78% / 2%) in simulations that yield no plausible patient.

Conclusions: A machine learning-based surrogate model can be used to efficiently generate virtual populations, fitted to observed clinical distributions. This method will be included in the Certara IQ virtual population inference workflow, available in the IQ Analyze module.

Citations: [1] Allen, R J et al. “Efficient Generation and Selection of Virtual Populations in Quantitative Systems Pharmacology Models.” CPT: pharmacometrics & systems pharmacology vol. 5,3 (2016): 140-6. doi:10.1002/psp4.12063
[2] Myers, R C et al. “Using machine learning surrogate modeling for faster QSP VP cohort generation.” CPT: pharmacometrics & systems pharmacology vol. 12 (2023): 1047-1059. doi:10.1002/psp4.12999
[3] Lazarou, Georgia et al. “Integration of Omics Data Sources to Inform Mechanistic Modeling of Immune-Oncology Therapies: A Tutorial for Clinical Pharmacologists.” Clinical pharmacology and therapeutics vol. 107,4 (2020): 858-870. doi:10.1002/cpt.1786
[4] Chatterjee, M et al. “Systematic evaluation of pembrolizumab dosing in patients with advanced non-small-cell lung cancer.” Annals of oncology: official journal of the European Society for Medical Oncology vol. 27,7 (2016): 1291-8. doi:10.1093/annonc/mdw174
[5] Thorsson, Vésteinn et al. “The Immune Landscape of Cancer.” Immunity vol. 48,4 (2018): 812-830.e14. doi:10.1016/j.immuni.2018.03.023
[6] Chen, T, and Guestrin, C. “XGBoost: A Scalable Tree Boosting System.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016): 785-794. doi:10.1145/2939672.2939785

Keywords: machine learning, qsp, virtual population