(M-080) Harnessing Machine Learning and Real-World Data: Comparative Analysis of Cross-Sectional and Longitudinal Clustering in Patient Trajectories from Electronic Health Records
Monday, October 20, 2025
7:00 AM - 5:00 PM MDT
Location: Colorado A
Wes Anderson – Quantitative Medicine – Critical Path Institute; Nicholas Hensheid – Quantitative Medicine – Critical Path Institute; Smith Heavner – Critical Path Institute; Shu Chin Ma – Quantitative Medicine – Critical Path Institute; Jagdeep Podichetty – Quantitative Medicine – Critical Path Institute
Quantitative Medicine Scientist, Critical Path Institute, United States
Disclosure(s):
Wes Anderson, PhD: No financial relationships to disclose
Objectives: Disease presentation and progression vary widely across many disease areas, and this heterogeneity can be quantified by analyzing clinical features and disease trajectories. Clustering techniques can uncover patient subgroups within this heterogeneity, but the choice between cross-sectional and longitudinal data representations can substantially change which clusters are identified. Cross-sectional clustering offers a snapshot at a single point in time, while longitudinal clustering captures disease progression. This study compares subgroup identification between the two approaches in patients with COVID-19 and assesses how data structure affects clinical interpretation and decision-making.
Methods: Data were obtained from 92,457 adults hospitalized with acute COVID-19 across 8 U.S. health systems (March 2020 to March 2024). Patient-level data, including vitals, demographics, treatments, oxygen modalities, comorbidities, and labs, were harmonized to the Observational Medical Outcomes Partnership (OMOP) Common Data Model for consistency across institutions. Two cohorts were created: a cross-sectional set using covariates from the first 48 hours of admission, and a longitudinal set requiring ≥3 days of data. Clinically implausible values were removed, covariates with >50% missingness were excluded, and complete-case analysis was applied, resulting in a total of 29,452 patients. Factor Analysis of Mixed Data (FAMD)-based agglomerative hierarchical clustering was used for the cross-sectional data [1], while the longitudinal data were analyzed using tri-clustering [2]. Statistical tests were used to evaluate differences between clusters, and the clusters were validated using an XGBoost classification model.
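The missingness filter and complete-case step described in the Methods can be illustrated with a minimal, dependency-free sketch. It is not the study's pipeline: the function names (`drop_sparse_covariates`, `complete_cases`), the covariate names, and the four toy records are all hypothetical, and the removal of clinically implausible values is omitted because the abstract does not state the plausibility ranges used.

```python
from typing import Optional

# One patient record: covariate name -> value, with None marking a missing value.
Record = dict[str, Optional[float]]


def drop_sparse_covariates(records: list[Record], max_missing: float = 0.5) -> list[Record]:
    """Drop covariates whose fraction of missing values exceeds max_missing.

    Assumes every record carries the same covariate keys as the first record.
    """
    if not records:
        return []
    n = len(records)
    keep = [
        name
        for name in records[0]
        if sum(r.get(name) is None for r in records) / n <= max_missing
    ]
    return [{name: r.get(name) for name in keep} for r in records]


def complete_cases(records: list[Record]) -> list[Record]:
    """Keep only patients with no missing value in any remaining covariate."""
    return [r for r in records if all(v is not None for v in r.values())]


# Hypothetical toy data: "crp" is missing for 3 of 4 patients (75%),
# "spo2" is missing for 1 of 4 patients (25%).
records = [
    {"heart_rate": 88.0, "spo2": 94.0, "crp": None},
    {"heart_rate": 102.0, "spo2": None, "crp": None},
    {"heart_rate": 76.0, "spo2": 97.0, "crp": 12.5},
    {"heart_rate": 95.0, "spo2": 91.0, "crp": None},
]

filtered = drop_sparse_covariates(records, max_missing=0.5)
cohort = complete_cases(filtered)
print(len(cohort), sorted(cohort[0]))
```

In this toy example the order of the two steps matters: applying complete-case analysis first would discard every patient lacking the sparse covariate, whereas filtering sparse covariates first retains more patients.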
Results: Three distinct inpatient clusters were identified, each showing unique clinical profiles. Significant differences (p < 0.05) in mortality, hospital length of stay, and treatment patterns were observed within both cross-sectional and longitudinal cluster sets. Comparisons across clustering methods also revealed variation in outcomes. The XGBoost classification model achieved high performance in assigning patients to clusters (accuracy of 0.99 for the cross-sectional clusters and 0.8 for the longitudinal clusters), supporting the robustness of subgroup distinctions.
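The idea behind validating clusters with a classifier is that cluster labels are treated as prediction targets: if a model trained on part of the data can recover the labels of held-out patients, the clusters are separable in the feature space. The sketch below illustrates only that logic. It substitutes a simple nearest-centroid classifier for XGBoost to stay dependency-free and uses synthetic two-dimensional data, so its accuracy says nothing about the 0.99 and 0.8 figures reported here. All names and values in it are hypothetical.

```python
import random


def nearest_centroid_fit(X, y):
    """Compute the mean feature vector (centroid) of each cluster label."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def nearest_centroid_predict(centroids, X):
    """Assign each row to the label of its closest centroid (squared Euclidean)."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda c: sqdist(centroids[c], row)) for row in X]


def accuracy(y_true, y_pred):
    """Fraction of held-out patients assigned to their original cluster."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic stand-in for clustered patients: three well-separated groups.
rng = random.Random(0)
centers = {0: (0.0, 0.0), 1: (5.0, 5.0), 2: (10.0, 0.0)}
X, y = [], []
for label, (cx, cy) in centers.items():
    for _ in range(50):
        X.append([cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)])
        y.append(label)

# Hold out 20% of patients; fit on the rest, then score the held-out set.
idx = list(range(len(X)))
rng.shuffle(idx)
split = int(0.8 * len(idx))
train_idx, test_idx = idx[:split], idx[split:]

centroids = nearest_centroid_fit([X[i] for i in train_idx], [y[i] for i in train_idx])
y_pred = nearest_centroid_predict(centroids, [X[i] for i in test_idx])
held_out_accuracy = accuracy([y[i] for i in test_idx], y_pred)
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

High held-out accuracy indicates that the cluster boundaries are learnable from the covariates, which is a measure of separability and not, by itself, of clinical validity.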
Conclusions: By applying advanced clustering techniques to real-world data (RWD), this study provides insight into how cross-sectional and longitudinal clustering results differ. The resulting classification models may support more personalized, effective treatments that improve outcomes. Future work will focus on incorporating additional clinical features to further refine and explore patient trajectory patterns.
Citations: [1] Anderson, W et al. doi: 10.3389/fpubh.2025.1544904. [2] Amaral, D et al. doi: 10.1038/s41467-024-49954-y.
Keywords: real-world data, clustering analysis, classification, critical care