(S-068) Machine learning for exploratory data analysis and model diagnosis in oncology

Sunday, October 19, 2025

7:00 AM - 5:00 PM MDT

Location: Colorado A

Lucas Pereira – Engineering – Pumas-AI; Mohamed Tarek – Engineering – Pumas-AI

Presenter(s)

Lorenzo Contento, PhD (he/him/his)

Product Engineer
Pumas-AI Inc., Italy

Author(s)

Lucas Pereira, MSc (he/him/his)

Product Engineer
Pumas-AI, Brazil

Disclosure(s):

Lorenzo Contento, PhD: No financial relationships to disclose

Lucas Pereira, MSc: No relevant disclosure to display

Objectives: In this work, data from five clinical trials investigating Atezolizumab [1], an immune checkpoint inhibitor, are analyzed using a novel machine-learning-based method for exploratory data analysis (EDA) and model diagnostics. In particular, the subjects' responses are clustered to achieve the following objectives: visualize the response patterns in a large dataset more easily; automatically identify sub-populations in the response (e.g., responders and non-responders); identify potential outliers; and stratify model diagnostics by response cluster to understand which sub-populations are fitted better than others [2], aiding the model development process.

Methods: We propose a method for clustering longitudinal and heterogeneous data, e.g., from PK-PD studies. Clustering is a well-known unsupervised machine learning task with a number of standard algorithms. To perform clustering, a pairwise dissimilarity matrix between the subjects' responses is first computed using the dynamic time warping (DTW) method. DTW quantifies the dissimilarity between heterogeneous, multi-dimensional longitudinal data from two subjects, even when the number of data points differs between them. After the pairwise dissimilarity matrix is computed, clustering is performed using the k-medoids algorithm to identify k clusters in the response. To demonstrate the value of stratifying model diagnostics by cluster, a subset of the data from five clinical trials investigating Atezolizumab was analyzed using a nonlinear mixed-effects (NLME) general Bertalanffy model [3]. The model was fitted to the data and the clusters were used to produce stratified diagnostics.
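The two steps described above (a pairwise DTW dissimilarity matrix followed by k-medoids) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a Euclidean local cost inside DTW, a greedy initialization followed by alternating assignment and medoid updates for k-medoids, and a small synthetic dataset invented purely for the example. The helper names (`dtw_distance`, `pairwise_dtw`, `k_medoids`) are hypothetical.

```python
import numpy as np


def _as_2d(x):
    """Return a series as an array of shape (n_points, n_dims)."""
    x = np.asarray(x, dtype=float)
    return x[:, None] if x.ndim == 1 else x


def dtw_distance(a, b):
    """DTW dissimilarity between two series that may differ in length."""
    a, b = _as_2d(a), _as_2d(b)
    n, m = len(a), len(b)
    # Local cost (assumption): Euclidean distance between each pair of points.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m]


def pairwise_dtw(series):
    """Symmetric matrix of DTW dissimilarities between all pairs of series."""
    n = len(series)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dtw_distance(series[i], series[j])
    return D


def k_medoids(D, k, n_iter=100):
    """Cluster from a precomputed dissimilarity matrix D into k clusters."""
    n = D.shape[0]
    # Greedy initialization: start from the most central point, then add
    # the point that reduces the total dissimilarity the most.
    medoids = [int(np.argmin(D.sum(axis=1)))]
    while len(medoids) < k:
        nearest = D[:, medoids].min(axis=1)
        candidates = [c for c in range(n) if c not in medoids]
        costs = [np.minimum(nearest, D[:, c]).sum() for c in candidates]
        medoids.append(candidates[int(np.argmin(costs))])
    medoids = np.array(medoids)
    for _ in range(n_iter):
        # Assign every subject to its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # New medoid: the member closest to all other members.
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels


# Synthetic example (not trial data): three shrinking and three growing
# trajectories, observed at different numbers of time points.
series = [
    [10, 8, 6, 4],
    [10, 9, 7, 5, 3],
    [11, 8, 5],
    [10, 12, 14, 17],
    [9, 12, 15],
    [10, 11, 13, 16, 18],
]
D = pairwise_dtw(series)
medoids, labels = k_medoids(D, k=2)
print(labels)
```

Because k-medoids only needs the dissimilarity matrix, the series never have to be resampled to a common time grid, which is what makes the DTW pairing convenient for sparsely and irregularly sampled responses.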

Results: The clustering procedure automatically revealed a number of response types in the data, aiding the visualization of a large dataset. The number of clusters k was tuned to produce visually consistent clusters. Clusters with the smallest number of subjects were flagged as potential outliers. An NLME Bertalanffy model was fitted to a subset of the data and the visual predictive check (VPC) was stratified by response cluster. The stratification highlighted a weakness of the model in fitting a particular response type that was masked by the full-population VPC.
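The idea of flagging small clusters and stratifying a diagnostic by cluster can be illustrated with a simplified sketch. It is not a VPC: as a stand-in, it compares a pooled root-mean-square residual with per-cluster values. The residuals, the cluster labels, and the `min_size` threshold are all made-up assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np


def flag_small_clusters(labels, min_size=2):
    """Cluster ids with fewer than min_size subjects (assumed threshold)."""
    ids, counts = np.unique(labels, return_counts=True)
    return {int(c) for c, n in zip(ids, counts) if n < min_size}


def rmse_by_cluster(residuals, labels):
    """Root-mean-square residual computed separately for each cluster."""
    residuals = np.asarray(residuals, dtype=float)
    labels = np.asarray(labels)
    return {
        int(c): float(np.sqrt(np.mean(residuals[labels == c] ** 2)))
        for c in np.unique(labels)
    }


# Synthetic values: one cluster label and one summary residual per subject.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 2])
residuals = np.array([0.1, -0.2, 0.1, 0.0, 1.5, -1.4, 1.6, 0.3])

flagged = flag_small_clusters(labels, min_size=2)
pooled = float(np.sqrt(np.mean(residuals ** 2)))
per_cluster = rmse_by_cluster(residuals, labels)

print("potential outlier clusters:", flagged)
print("pooled RMSE:", round(pooled, 3))
print("per-cluster RMSE:", {c: round(v, 3) for c, v in per_cluster.items()})
```

In this toy setting the pooled value sits between a well-fitted cluster and a poorly fitted one, which is the kind of masking that a stratified diagnostic is meant to expose.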

Conclusions: Unsupervised machine learning methods such as clustering can be used to enhance EDA and model diagnostics in any data analysis task, revealing non-obvious patterns and insights in both the data and the developed model. In this study, a combination of DTW and the k-medoids algorithm was used to demonstrate the feasibility of clustering heterogeneous longitudinal data from clinical trials and its value in an NLME analysis workflow.

Citations: [1] N. Ghaffari Laleh, C. M. L. Loeffler, J. Grajek, K. Staňková, A. T. Pearson, H. S. Muti, C. Trautwein, H. Enderling, J. Poleszczuk, and J. N. Kather. Classical mathematical models for prediction of response to chemotherapy and immunotherapy. PLOS Computational Biology, 18(2):1–18, 2022.
[2] P. L. Bonate. Pharmacokinetic-Pharmacodynamic Modeling and Simulation. Springer, 2011.
[3] A. D. Blaom and S. Okon. New tools for comparing classical and neural ODE models for tumor growth, 2025.

Keywords: Oncology, Machine learning, Dynamic time warping