Recent years have seen great technological advances in the collection of high-quality '-omics' data and significant progress in analysing these data and associating them with biological function. Our goal here is to turn this abundance of available information into computational models that can accurately predict complex phenotypic traits in humans. To do this, we build on advances in non-parametric methods, where predictions for new individuals are made according to their similarity to other individuals from training cohorts, with the similarity (kernel) functions learned from data. Kernel methods are particularly well suited to inference on biological data, as they can be defined for a wide range of data types, including vectors, strings and graphs. These kernel constructions correspond to different notions of similarity and often use complementary sources of information, so a combination of kernels is potentially more powerful than any single kernel on its own. Recent advances in the machine learning community have led to Multiple Kernel Learning (MKL) algorithms that can automatically discover the most appropriate kernel combination for a prediction task, typically by applying L1- or L2-type penalties to the weights of the kernel combination.
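To make the idea of predicting from a weighted combination of kernels concrete, the following is a minimal sketch and not the method used in this work. It assumes toy genotype dosages, a linear and an RBF kernel, fixed equal weights (an MKL algorithm would learn these weights under an L1- or L2-type penalty), and kernel ridge regression as the predictor. All function names, the `gamma` and `lam` values, and the simulated data are illustrative assumptions.

```python
import numpy as np

def linear_kernel(A, B):
    # Similarity as the inner product between feature vectors.
    return A @ B.T

def rbf_kernel(A, B, gamma=0.1):
    # Gaussian similarity computed from squared Euclidean distances.
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def combine(kernels, weights):
    # Non-negative weighted sum of kernel matrices; this stays a valid kernel.
    return sum(w * K for w, K in zip(weights, kernels))

def fit_kernel_ridge(K_train, y, lam=1.0):
    # Solve (K + lam * I) alpha = y for the dual coefficients.
    n = K_train.shape[0]
    return np.linalg.solve(K_train + lam * np.eye(n), y)

# Toy data (assumption): 60 individuals, 20 variants coded as 0/1/2 dosages.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(60, 20)).astype(float)
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=60)
X_tr, X_te, y_tr, y_te = X[:40], X[40:], y[:40], y[40:]

# Fixed weights for illustration only; MKL would estimate them from data.
weights = [0.5, 0.5]
K_tr = combine([linear_kernel(X_tr, X_tr), rbf_kernel(X_tr, X_tr)], weights)
K_te = combine([linear_kernel(X_te, X_tr), rbf_kernel(X_te, X_tr)], weights)

# Centre the phenotype, fit, then predict new individuals from their
# similarity to the training individuals.
y_mean = y_tr.mean()
alpha = fit_kernel_ridge(K_tr, y_tr - y_mean)
y_hat = K_te @ alpha + y_mean
```

Each prediction in `y_hat` is a similarity-weighted sum over training individuals, which is the sense in which the approach is non-parametric; swapping in a different kernel changes the notion of similarity without changing the fitting code.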
In this work we first apply a number of standard kernels from the machine learning literature, as well as string kernels based on either a haplotype or a genotype representation of the data. We then construct novel kernels that utilise the wide range of available genomic annotations, such as GWAS meta-analysis hits and eQTLs, or that exploit domain knowledge by slicing the genome into new and old segregating variants. We present results from carefully designed cross-validation experiments that evaluate the performance of the MKL framework in predicting height, body mass index (BMI) and high-density lipoprotein (HDL) in a Croatian population cohort. We examine the merits of using multiple versus single kernels and perform a comparative analysis with three parametric models commonly used in the literature, namely ridge regression, the lasso and the elastic net.
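As an illustration of what a string kernel on a genotype representation can look like, the sketch below implements a k-spectrum kernel, which counts shared substrings of length k. The abstract does not say which string kernels were used, so the choice of the spectrum kernel, the encoding of each individual as a string of 0/1/2 genotype codes, and the names `kmer_counts`, `spectrum_kernel` and `gram_matrix` are all assumptions made for illustration.

```python
from collections import Counter
import numpy as np

def kmer_counts(s, k):
    # Count every contiguous substring of length k in s.
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=3):
    # Inner product of the two k-mer count vectors.
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(c * ct[kmer] for kmer, c in cs.items())

def gram_matrix(strings, k=3):
    # Symmetric matrix of pairwise kernel values.
    n = len(strings)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = spectrum_kernel(strings[i], strings[j], k)
    return K

# Toy genotype strings (assumption): one character per variant, coded 0/1/2.
genotypes = ["0120120", "0120121", "2222222"]
K = gram_matrix(genotypes, k=3)
```

Individuals that share many local genotype patterns receive a high kernel value, while the third toy string, which shares no 3-mers with the other two, receives a value of zero against both. A matrix like `K` could then be used as one of the kernels in a combination.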
|Title of host publication||European Mathematical Genetics Meeting, EMGM 2013, Leiden, The Netherlands|
|Publication status||Published - 25 Apr 2013|