Cellular heterogeneity in biological tissues reflects the progression of disease and is therefore useful for improved diagnosis and prognosis. The cellular composition of tissues is, however, difficult to assess from bulk molecular profiles, because all cells present in the tissue contribute to the recorded signals. Cell deconvolution is a common approach to unravel the heterogeneous molecular profiles observed in bulk tissues, by inferring the underlying relative abundance of individual cell types from one or more omic data types, such as RNA-seq gene expression or DNA methylation rates. So far, cellular deconvolution has assumed that bulk omic profiles result from weighted sums of so-called signature cell-type-specific omic profiles, the weights being the unknown proportions of those cell types. Consistently, most statistical methods used for cellular deconvolution are based on extensions of Ordinary Least Squares (OLS) estimation, under nonnegativity and sum-to-one constraints on those unknown mixing coefficients. Using OLS implicitly assumes independence, homoscedasticity and normality of the residual errors, conditions under which OLS estimation is optimal. In cellular deconvolution applied to bulk molecular profiles, all three assumptions are highly questionable. Indeed, strong violations of those assumptions may be due to the intrinsic nature of omics data (RNA-seq data are overdispersed read counts and DNA methylation rates are percentages, for example) or to the dependence structure induced by the gene regulatory network, some key genes being more influential on deconvolution accuracy than others. The goal of this work is to provide a well-defined statistical framework for deconvolution that respects the inherent characteristics of biological data, using multi-omic data.
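The constrained least-squares baseline described above can be illustrated with a minimal sketch. It assumes a simulated signature matrix and Gaussian noise; the function name `deconvolve_ols`, the simulated data and the choice of SciPy's SLSQP solver are illustrative assumptions, not the implementation used in this work.

```python
import numpy as np
from scipy.optimize import minimize


def deconvolve_ols(bulk, signatures):
    """Constrained least squares: proportions are nonnegative and sum to one.

    bulk       : (n_features,) bulk omic profile
    signatures : (n_features, n_cell_types) signature profiles (hypothetical input)
    """
    n_types = signatures.shape[1]

    def loss(props):
        resid = bulk - signatures @ props
        return 0.5 * resid @ resid

    def grad(props):
        return -signatures.T @ (bulk - signatures @ props)

    result = minimize(
        loss,
        x0=np.full(n_types, 1.0 / n_types),  # start from uniform proportions
        jac=grad,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_types,  # nonnegativity constraint
        constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],  # sum-to-one
    )
    return result.x


# Simulated example (assumed data, for illustration only)
rng = np.random.default_rng(0)
signatures = rng.gamma(shape=2.0, scale=1.0, size=(200, 3))
true_props = np.array([0.6, 0.3, 0.1])
bulk = signatures @ true_props + rng.normal(scale=0.05, size=200)

estimated = deconvolve_ols(bulk, signatures)
print(np.round(estimated, 3))
```

This sketch treats every feature as an independent, homoscedastic Gaussian observation, which is exactly the set of assumptions questioned above for count and rate data.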
Multi-omic data integration for cellular deconvolution aims at leveraging complementary viewpoints on cellular heterogeneity. The general statistical framework we propose is specifically designed to integrate the two omic data types mentioned above, both frequently used for cell deconvolution: RNA-seq gene expression data, for which a constrained negative binomial regression model is assumed, and DNA methylation rates, for which a constrained beta regression model is used. Several simultaneous optimization strategies are considered, based either on constrained and weighted maximum likelihood, with weights introduced to strengthen the influence of some genes according to their specific combination of signature expressions and DNA methylation rates, or on gene selection. An extensive comparative study of cell deconvolution performance against leading single-omic and multi-omic methods is conducted on benchmark data, using nine cell types commonly found in pancreatic ductal adenocarcinoma (PDAC). Results confirm the gain brought both by a multi-omic integration approach and by the use of ad hoc probability distributions for each -omic data type. Additional improvements based on dependence models between the approximation errors of the two -omic data types for each gene are finally discussed.
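To make the joint likelihood idea concrete, the following is a minimal sketch of a weighted maximum-likelihood fit that combines a negative binomial model for read counts with a beta model for methylation rates. Several choices are assumptions made for illustration only: the dispersion (`size`) and beta precision (`precision`) are treated as known, the per-gene `weights` are uniform, the simplex constraints are enforced through a softmax reparameterization rather than an explicit constrained solver, and all data are simulated. None of the names below come from this work.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax
from scipy.stats import beta, nbinom


def to_props(theta):
    """Map unconstrained parameters to proportions on the simplex."""
    return softmax(np.concatenate(([0.0], theta)))


def joint_neg_loglik(theta, counts, expr_sig, meth, meth_sig,
                     size, precision, weights):
    """Weighted negative log-likelihood over both omic data types."""
    props = to_props(theta)
    mu = expr_sig @ props  # expected read count per gene
    nu = np.clip(meth_sig @ props, 1e-6, 1 - 1e-6)  # expected methylation rate
    # Negative binomial with mean mu and dispersion parameter `size`
    ll_rna = nbinom.logpmf(counts, size, size / (size + mu))
    # Beta distribution with mean nu and precision `precision`
    ll_meth = beta.logpdf(meth, nu * precision, (1 - nu) * precision)
    return -np.sum(weights * (ll_rna + ll_meth))


# Simulated example (assumed data, for illustration only)
rng = np.random.default_rng(1)
n_genes, n_types = 300, 3
size, precision = 10.0, 50.0  # assumed known nuisance parameters
expr_sig = rng.gamma(shape=2.0, scale=100.0, size=(n_genes, n_types))
meth_sig = rng.uniform(0.05, 0.95, size=(n_genes, n_types))
true_props = np.array([0.5, 0.3, 0.2])

mu = expr_sig @ true_props
nu = meth_sig @ true_props
counts = rng.negative_binomial(size, size / (size + mu))
meth = np.clip(rng.beta(nu * precision, (1 - nu) * precision), 1e-6, 1 - 1e-6)
weights = np.ones(n_genes)  # uniform gene weights in this sketch

fit = minimize(
    joint_neg_loglik,
    x0=np.zeros(n_types - 1),
    args=(counts, expr_sig, meth, meth_sig, size, precision, weights),
    method="L-BFGS-B",
)
estimated_props = to_props(fit.x)
print(np.round(estimated_props, 3))
```

Non-uniform `weights` would let selected genes contribute more to the fit, and setting some weights to zero corresponds to a crude form of gene selection.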
- Poster