Edinburgh Research Explorer

Supplementary files for study on modeling DSB with random forests

Dataset

Publisher: Edinburgh DataShare
Date made available: 19 Jun 2021

Abstract

Structural variants (SVs) are known to play important roles in a variety of cancers, but their origins and functional consequences are still poorly understood. The nonrandom distributions of these variants across tumour genomes are often assumed to reflect selective processes, but, as with single-nucleotide variants, SV mutation rates often reflect the underlying chromatin and other features at a locus. Inferring which SVs may be under selection in tumourigenesis therefore remains challenging, though identifying such variants may lead to new diagnostic and therapeutic targets. Many SVs are thought to emerge via errors in the repair processes following DNA double-strand breaks (DSBs), and a variety of studies have experimentally measured DSB frequencies across the genome in cell lines. Using these data, we derive the first quantitative genome-wide models of DSB susceptibility, based upon underlying chromatin and sequence features. These models provide high predictive accuracy and novel insights into the mutational mechanisms generating DSBs. Models trained in one cell type can be successfully applied to others, but a substantial proportion of DSBs appear to reflect cell-type-specific processes. We also show that regions harboring unusually high tumour SV breakpoint frequencies occur within well-modeled regions of the genome but often display DSB frequencies inconsistent with DSB model predictions. When model predictions are used as a proxy for susceptibility to DSBs in tumours, many SV hotspots appear to be poorly explained by selectively neutral mutational bias alone. A substantial number of hotspots show unexpectedly high SV breakpoint frequencies given their predicted susceptibility to mutation, and are therefore credible targets of positive selection in tumours. These putatively positively selected hotspots are enriched for genes previously shown to be oncogenic.
In contrast, several hundred regions across the genome show unexpectedly low levels of SVs, given their relatively high susceptibility to mutation. These novel ‘coldspot’ regions appear to be subject to purifying selection in tumours and are enriched for active promoters and enhancers. We conclude that models of DSB susceptibility offer a rigorous approach to the inference of SVs putatively subject to selection in tumours.
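
The comparison described in the abstract, observed SV breakpoint frequency against model-predicted DSB susceptibility, can be illustrated with a minimal Python sketch. This is not the study's method or code: the Poisson neutral model, the function names `poisson_cdf` and `classify_regions`, the `alpha` threshold, and the toy numbers are all assumptions made for illustration. The sketch assumes that, under selectively neutral mutational bias, the expected breakpoint count in a region is proportional to its predicted susceptibility, and it flags regions whose observed counts fall far above (hotspot-like) or far below (coldspot-like) that expectation.

```python
import math


def poisson_cdf(k, mu):
    """Return P(X <= k) for X ~ Poisson(mu), summed term by term.

    Direct summation is adequate for the small counts used here; for large
    mu a statistics library would be a better choice (exp(-mu) underflows).
    """
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return min(total, 1.0)


def classify_regions(predicted, observed, alpha=0.01):
    """Label each region 'hotspot', 'coldspot', or 'neutral'.

    predicted: hypothetical model-predicted DSB susceptibility per region
    observed:  observed SV breakpoint count per region
    alpha:     illustrative tail-probability threshold (an assumption)
    """
    total_pred = sum(predicted)
    total_obs = sum(observed)
    labels = []
    for p, o in zip(predicted, observed):
        # Expected count if breakpoints were spread in proportion to
        # predicted susceptibility (assumed neutral model).
        mu = total_obs * p / total_pred
        p_high = 1.0 - poisson_cdf(o - 1, mu)  # P(X >= o)
        p_low = poisson_cdf(o, mu)             # P(X <= o)
        if p_high < alpha:
            labels.append("hotspot")
        elif p_low < alpha:
            labels.append("coldspot")
        else:
            labels.append("neutral")
    return labels


# Toy data: five regions with made-up susceptibility scores and counts.
predicted = [1.0, 1.0, 4.0, 1.0, 1.0]
observed = [5, 20, 0, 6, 4]
labels = classify_regions(predicted, observed)
```

With these toy inputs, the second region has many more breakpoints than its predicted susceptibility would suggest, and the third has none despite a high predicted susceptibility. A real analysis would need to correct for multiple testing and for the way extreme regions inflate the genome-wide total; neither is handled in this sketch.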

Data Citation

Ballinger, Tracy. (2018). Supplementary files for study on modeling DSB with random forests, 2017-2018 [dataset]. University of Edinburgh. Institute of Genetics and Molecular Medicine. http://dx.doi.org/10.7488/ds/2365

ID: 64402527