
## Abstract

Modern molecular genetic techniques are providing data for an ever-increasing number of loci, yet the amount of usable information for genomic prediction seems to be saturating. In this contribution we summarize (i) our approach to measuring the limited dimensionality of genomic information, (ii) the use of this limited dimensionality for fast and scalable genomic prediction, and (iii) future research avenues.

(i) We conceptualized the dimensionality of genomic information using Stam’s number of independent (effective) chromosome segments (Me), which is related to Fisher’s theory of junctions. This metric has been “rediscovered” with the advent of genomic prediction, particularly as one of the key parameters for predicting the accuracy of genomic prediction. Stam derived the expectation of Me for a random-mating population of constant size and showed that it equals 4NeL, where Ne is the effective population size and L is the genome length in Morgans. However, Ne is generally unknown for a given population or dataset. We therefore proposed to estimate the dimensionality of genomic information for a given dataset as the number of non-negligible singular values of the genotype matrix or, equivalently, the number of non-negligible eigenvalues of the genomic relationship matrix (GRM). Using this approach, we estimated that Me is about 10,000 to 15,000 in cattle and about 4,000 in pigs and broilers, even though the underlying data comprised about 40,000 genome-wide loci for 16,000 to 81,000 genotyped animals. In these populations, the numbers of eigenvalues that explained 90, 95, and 98% of the variation in the GRM corresponded to NeL, 2NeL, and 4NeL, respectively.
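The eigenvalue-counting idea in (i) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the abstract does not say how the GRM was built, so the VanRaden-style scaling below is assumed, and the function names, the simulated genotypes, and the inputs Ne = 100 and L = 30 Morgans are hypothetical.

```python
import numpy as np

def expected_segments(ne, genome_length_morgans):
    """Stam's expectation quoted in the text: Me = 4 * Ne * L."""
    return 4.0 * ne * genome_length_morgans

def genomic_relationship_matrix(genotypes):
    """GRM from an (animals x loci) matrix of 0/1/2 allele counts.

    Assumption: VanRaden-style centring and scaling; the abstract does not
    specify which GRM definition was used.
    """
    p = genotypes.mean(axis=0) / 2.0            # allele frequency per locus
    z = genotypes - 2.0 * p                     # centre each locus
    scale = 2.0 * np.sum(p * (1.0 - p))
    return z @ z.T / scale

def dimensionality(grm, fraction):
    """Smallest number of leading eigenvalues whose sum reaches `fraction`
    of the total eigenvalue sum of the GRM."""
    eigvals = np.linalg.eigvalsh(grm)[::-1]     # sorted in descending order
    eigvals = np.clip(eigvals, 0.0, None)       # remove round-off negatives
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    count = np.searchsorted(cumulative, fraction) + 1
    return int(min(count, eigvals.size))

# Hypothetical toy data: independent loci, so no saturation is expected here.
rng = np.random.default_rng(1)
n_animals, n_loci = 200, 1000
freqs = rng.uniform(0.1, 0.9, size=n_loci)
genotypes = rng.binomial(2, freqs, size=(n_animals, n_loci)).astype(float)

grm = genomic_relationship_matrix(genotypes)
dims = {f: dimensionality(grm, f) for f in (0.90, 0.95, 0.98)}
print(dims)
print(expected_segments(100, 30.0))             # illustrative Ne and L only
```

Because the toy loci are simulated independently, the counts will sit close to the number of animals; the saturation described in the text arises from linkage disequilibrium in real populations, which this sketch does not model.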

(ii) We exploited the limited dimensionality of genomic information to speed up genomic prediction calculations. Specifically, we derived a specialized inverse of the GRM by dividing individuals into core and non-core groups, where non-core individuals are modeled as a function of core individuals. Limited dimensionality was exploited by minimizing the number of core individuals, which increases the sparsity of the inverse of the GRM and substantially speeds up calculations, in particular with large datasets (the core size can be constant). In a range of animal breeding populations, the accuracy of genomic predictions peaks when we use the dimensionality corresponding to 98 to 99% of the variation in the GRM, or around 4NeL. The number of core individuals therefore corresponded approximately to Me. This indicates either that 1 to 2% of the variation in the GRM is due to noise or that our datasets are too small to accurately estimate the effects of the eigenvectors associated with the smallest eigenvalues. Further, we observe that the accuracies are only slightly reduced at half the “optimum” dimensionality.
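A minimal sketch of a core/non-core inverse of the kind described in (ii) follows. The abstract gives no formula, so the block structure below is an assumption: each non-core individual is regressed on the core individuals and the non-core residuals are treated as independent, which makes the non-core block of the inverse diagonal. The function name and the synthetic matrix standing in for a GRM are hypothetical.

```python
import numpy as np

def core_noncore_inverse(grm, core):
    """Sparse-structured inverse of a GRM using a core / non-core split.

    Assumed model (not spelled out in the abstract): non-core individuals
    are linear functions of core individuals plus independent residuals.
    """
    n = grm.shape[0]
    core = np.asarray(core)
    noncore = np.setdiff1d(np.arange(n), core)

    g_cc_inv = np.linalg.inv(grm[np.ix_(core, core)])
    g_cn = grm[np.ix_(core, noncore)]
    p = g_cc_inv @ g_cn                          # regression of non-core on core

    # Residual variance of each non-core individual given the core.
    m = np.diag(grm)[noncore] - np.einsum("ij,ij->j", g_cn, p)
    m_inv = 1.0 / m

    inv = np.zeros((n, n))
    inv[np.ix_(core, core)] = g_cc_inv + (p * m_inv) @ p.T
    inv[np.ix_(core, noncore)] = -p * m_inv
    inv[np.ix_(noncore, core)] = (-p * m_inv).T
    inv[noncore, noncore] = m_inv                # diagonal non-core block
    return inv

# Hypothetical positive-definite stand-in for a GRM with low-rank structure.
rng = np.random.default_rng(7)
n, rank = 60, 10
factors = rng.standard_normal((n, rank))
grm = factors @ factors.T / rank + 0.05 * np.eye(n)

sparse_inv = core_noncore_inverse(grm, np.arange(20))   # 20 core individuals
full_inv = core_noncore_inverse(grm, np.arange(n))      # everyone is core
print(np.count_nonzero(sparse_inv), np.count_nonzero(full_inv))
```

With every individual in the core, the function reduces to the ordinary inverse. With a smaller core, only the core block needs a dense inversion, and the non-core block costs one division per individual.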

(iii) The results show that genomic information is limited, which provides a new viewpoint on the mechanisms behind genomic prediction. Open questions for future research include the relationship between the limited dimensionality and the underlying genetic processes that generate linkage disequilibrium, the choice of core individuals, the impact of limited dimensionality on genomic prediction and genome-wide association studies in homogeneous and structured populations, the optimal number of loci for genomic studies, and theoretical formulas for genomic prediction accuracy.


Original language | English |
---|---|
Publication status | Published - 13 Nov 2019 |
Event | A Century of Genetics: Celebrating 100 years of Genetics in Edinburgh & the Genetics Society in the UK |
Venue | Royal College of Physicians, Queen Street, Edinburgh, EH2 1JQ, United Kingdom |
Duration | 13 Nov 2019 → 15 Nov 2019 |
Event website | http://www.genetics.org.uk/events/a-century-of-genetics/ |

### Conference

Conference | A Century of Genetics |
---|---|
Country | United Kingdom |
City | Edinburgh |
Period | 13/11/19 → 15/11/19 |
Internet address | http://www.genetics.org.uk/events/a-century-of-genetics/ |

Title | Limited dimensionality of genomic information and implications for genomic prediction |