I have identified a number of structural variant breakpoints across multiple tumour-normal comparisons.
I want to ask the question "How enriched for a particular genomic feature is my set of breakpoints?"
For example, across all samples we find a total of 200 breakpoints, and 15 of these are found in the same class of genomic feature (e.g. exon).
If the genome is 137547960 bp long, and the fraction of the genome that is exonic is about 21.88% (30095000/137547960), then I would expect to find about 43.8 of the 200 breakpoints in exons ( 200*(total_exon_length/total_genome) ) across all my samples. That we find only 15 suggests that this feature class is underrepresented in our breakpoint set.
Is this the right way of going about this sort of test? What is an appropriate statistic to use here? Chi-Squared or Fisher's Exact?
You could try a binomial test with probability of success in a single trial = 0.218, number of trials = 200 and number of successes = 15.
In R :
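A minimal sketch of such a test, assuming base R's binom.test (stats package) and the numbers quoted in the question. The variable names are illustrative, and alternative = "less" is an assumption chosen here because the observed count is below the expected one:

```r
# Numbers taken from the question
n_breakpoints <- 200    # total breakpoints across all samples
n_in_exons    <- 15     # breakpoints that fall in exons
p_exon        <- 0.218  # exonic fraction used in the reply
                        # (30095000 / 137547960 is about 0.2188)

# Expected number of exonic breakpoints if they fell uniformly at random
expected <- n_breakpoints * p_exon

# Exact binomial test; alternative = "less" asks whether exons are
# under-represented among the breakpoints (an assumption for this sketch)
res <- binom.test(n_in_exons, n_breakpoints, p = p_exon, alternative = "less")
print(res)
```

The printed result reports the observed proportion (15/200), a confidence interval for it, and the p-value. The test treats each breakpoint as an independent draw that is equally likely to land anywhere in the genome, which is a simplification.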
OK thanks for the advice. How should I interpret the output from R?
If I want to test for enrichment of a particular class of feature, shouldn't I use alternative = "greater"?
For an enrichment you should indeed use "greater".
Yes, enrichment is "greater". "less" tests the opposite alternative, i.e. under-representation.
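To see how the choice of alternative changes the answer, here is a short sketch (again assuming base R's binom.test and the numbers from the question; the variable names are only illustrative) that runs the same data through each option and pulls out the p-value:

```r
x <- 15     # observed breakpoints in exons
n <- 200    # total breakpoints
p <- 0.218  # expected exonic fraction under the null

# Enrichment: is the observed proportion higher than expected?
p_greater <- binom.test(x, n, p = p, alternative = "greater")$p.value

# Depletion: is the observed proportion lower than expected?
p_less <- binom.test(x, n, p = p, alternative = "less")$p.value

# Two-sided (the default): different in either direction
p_two_sided <- binom.test(x, n, p = p)$p.value

print(c(greater = p_greater, less = p_less, two.sided = p_two_sided))
```

Because 15 is well below the expected count, the "greater" p-value comes out close to 1 (no evidence of enrichment) while the "less" p-value is very small. If the direction is not fixed before looking at the data, the two-sided test is the safer choice.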