I am trying to calculate the enrichment of structural variant (SV) breakpoints and SNV locations in genomic features (exon, intron, 5'UTR...) in a non-human genome. To know whether a feature is over- or underrepresented in an SV/SNV dataset, I first need to know the fraction of the genome that is feature X, for example the total length of all exons across the genome.
Is there an existing resource that has this sort of information (for annotated genomes like Drosophila)? If not, is there an R package that can calculate this?
Secondly, it's not clear to me how this is calculated in the first place. If a gene has 10 transcript variants, how do we calculate that gene's total exonic region? Total exon length / 10? Add up all the exons from the longest transcript variant? Take the longest possible transcript length (longest exon1 + longest exon2)? I'd love to know how this is usually calculated.
I've had a look at GenomicFeatures, but it's not clear to me how I would go about doing this using this package. Any suggestions/advice would be very welcome.
This looks like it will work for exons but not for other features (5'UTR, start codon, intron, etc.). Is this the case?
I've updated my answer with methods to get different features.
Thanks for taking the time to spell this out!
I have a follow-up question: I need to calculate the fraction of several features, c("intergenic","promoter","exon","intron"). How would you do this if the fractions have to sum to 1? With your proposed method the fractions sum to more than 1, because multiple features can be assigned to the same region. Thanks! Any advice appreciated!
Hello everyone,
I think that if we don't care about strand, we should call reduce() with ignore.strand=TRUE. If we don't, the result will be slightly higher, because positions where exons on opposite strands overlap are counted once per strand.
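To make the difference concrete, here is a minimal, language-neutral sketch in plain Python (not the R code from the answer). The `merge` helper only imitates what reduce() does by default, as far as I understand it (merging overlapping and adjacent ranges), and the exon coordinates are made up for illustration.

```python
from collections import defaultdict

def merge(intervals):
    """Merge overlapping or adjacent 1-based closed intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def covered_bases(intervals):
    """Number of positions covered by the union of the intervals."""
    return sum(end - start + 1 for start, end in merge(intervals))

# Hypothetical exons on one chromosome: (start, end, strand).
exons = [
    (100, 200, "+"),
    (150, 250, "+"),  # overlaps the first exon on the same strand
    (180, 300, "-"),  # overlaps both, but on the opposite strand
    (500, 600, "-"),
]

# Strand-aware merging: each strand is merged separately, then summed.
by_strand = defaultdict(list)
for start, end, strand in exons:
    by_strand[strand].append((start, end))
strand_aware = sum(covered_bases(ivs) for ivs in by_strand.values())

# Strand-ignoring merging: all exons are pooled before merging.
strand_ignored = covered_bases([(start, end) for start, end, _ in exons])

print(strand_aware)    # 373
print(strand_ignored)  # 302
```

Positions 180 to 250 carry exons on both strands, so the strand-aware total counts them twice (373 versus 302 bases).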
Also, if what we want to know is the percentage of nucleotides in a chromosome or in the entire genome, then we should keep the strands separate when merging exon ranges and double the length of the chromosome or genome, because every position has a nucleotide on each strand and we want the percentage of nucleotides covered by exons.
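As a rough sketch of that arithmetic, again in plain Python rather than R, the snippet below compares the two denominators. The exon coordinates, the chromosome length of 1000 and the function names are all hypothetical.

```python
def merged_length(intervals):
    """Total length of the union of 1-based closed intervals (start, end)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is not None and start <= cur_end + 1:
            # Overlapping or adjacent: extend the current merged block.
            cur_end = max(cur_end, end)
        else:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total

def fraction_ignoring_strand(exons, chrom_length):
    """Fraction of positions covered by an exon on either strand."""
    pooled = [(start, end) for start, end, _ in exons]
    return merged_length(pooled) / chrom_length

def fraction_of_nucleotides(exons, chrom_length):
    """Fraction of nucleotides covered, counting both strands of the chromosome."""
    covered = sum(
        merged_length([(s, e) for s, e, st in exons if st == strand])
        for strand in ("+", "-")
    )
    return covered / (2 * chrom_length)

# Hypothetical exons (start, end, strand) on a chromosome of length 1000.
exons = [(100, 200, "+"), (150, 250, "+"), (180, 300, "-"), (500, 600, "-")]
chrom_length = 1000

print(fraction_ignoring_strand(exons, chrom_length))  # 302 / 1000
print(fraction_of_nucleotides(exons, chrom_length))   # 373 / 2000
```

The two numbers answer different questions: the first is the share of genomic positions that are exonic on at least one strand, the second the share of all nucleotides on both strands that lie within an exon.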
The rest is the same as already described.
You can check: https://github.com/tmontserrat/proportion_exon_regions