55èmes Journées de Statistique de la SFdS

sciencesconf.org:jds2024:527375

Recent advances in sequencing, mass spectrometry and cytometry technologies have enabled researchers to collect multiple ‘omics data types from a single sample. These large datasets have led to a growing consensus that a holistic approach is needed to identify new candidate biomarkers and unveil mechanisms underlying disease aetiology, a key to precision medicine. While many reviews and benchmarks have been conducted on unsupervised approaches (Bersanelli et al. 2016), their supervised counterparts have received less attention in the literature and no gold standard has emerged yet (Krassowski et al. 2020).

In this work, we present a thorough comparison of five methods selected to represent the main families of integrative approaches (matrix factorization, multiple kernel methods, ensemble learning and graph-based methods). As a non-integrative control, random forest was applied to the concatenated data types and to each data type separately. Methods were evaluated on both simulated and real-world datasets, the latter being carefully selected to cover different medical applications (infectious diseases, oncology and vaccines) and data modalities. A set of fifteen simulation scenarios was designed from the real-world datasets to explore a large and realistic parameter space (e.g. sample size, dimensionality, class imbalance, effect size).
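To make the idea of a simulation scenario concrete, here is a minimal sketch of how one scenario could be parameterized by sample size, dimensionality, class imbalance and effect size. It is not the authors' simulation design (which was derived from the real-world datasets): the function `simulate_multiomics`, its parameter names and its defaults are hypothetical, and the data are plain Gaussian noise with a mean shift on a few features.

```python
import numpy as np

def simulate_multiomics(n_samples=100, block_dims=(200, 50), n_informative=5,
                        effect_size=0.5, minority_fraction=0.3, seed=0):
    """Simulate a two-class dataset with one feature block per data type.

    All names and defaults are illustrative assumptions, not the
    settings used in the benchmark.
    """
    rng = np.random.default_rng(seed)

    # Class labels with the requested imbalance (1 = minority class).
    n_minority = int(round(n_samples * minority_fraction))
    y = np.zeros(n_samples, dtype=int)
    y[:n_minority] = 1
    rng.shuffle(y)

    blocks = []
    for dim in block_dims:
        # Standard normal noise for every feature in this block.
        X = rng.standard_normal((n_samples, dim))
        # Shift the mean of the first few features in the minority class.
        k = min(n_informative, dim)
        X[y == 1, :k] += effect_size
        blocks.append(X)
    return blocks, y

# A small grid over two of the parameters named above.
scenarios = [
    {"n_samples": n, "effect_size": d}
    for n in (50, 200)
    for d in (0.2, 1.0)
]
for params in scenarios:
    blocks, y = simulate_multiomics(**params)
    # Concatenated view, the input a non-integrative control would use.
    X_concat = np.hstack(blocks)
    print(params, X_concat.shape, int(y.sum()))
```

Each block plays the role of one data type; a non-integrative control would be fitted either on `X_concat` or on each element of `blocks` in turn, while an integrative method would receive the blocks jointly.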

Overall, integrative approaches showed comparable or higher performance on simulations and outperformed non-integrative methods on real-world data. More specifically, multiple kernel and matrix factorization methods demonstrated a strong ability to uncover modest effects in high-dimensional settings. The strengths and limitations of these methods will be discussed in detail, along with guidelines for future applications.

Type	:	oral
Topics	:	Multi-omics
Keywords	:	benchmark ; data integration ; multi-omics data ; prediction models ; supervised analysis

