I am trying to obtain the overall size of the exome, i.e. every single position that has coverage. As a control I need to find the theoretical size from the reference genome. I have a few questions:
- Is there an official size for the exome from the GRCh38 genome?
- If not, is there an efficient way to calculate it? For example, a script to merge/coalesce all the exons from the reference genome?
Thanks,
Salvo
It should be possible to convert the gene annotation into a GRanges object (like here). Once this is done, computing the exome size is easy if you know R (see here for instance).
Hi Salvo,
You'll get all covered bases of your alignment, whether these fall into exons or not. I think the best way to get your statistics is to generate or download an exon BED file (BioMart would be the easiest way). With this you can directly compute the sum of all feature lengths and intersect it with your coverage file.
You have to take care of alternative splicing within an exon, alternative polyadenylation site usage, etc., since overlapping exon annotations would otherwise be counted more than once. You should also think about whether to include or exclude the alternative scaffolds/patches (e.g. GL000194.1, KI270726.1, ...).
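As a rough illustration of the merge-then-sum idea, here is a minimal Python sketch. It assumes the exons are available as BED-style lines (chromosome, 0-based start, exclusive end), and the names `merged_exome_size` and `is_primary` are made up for this example. The set of "primary" chromosome names is also an assumption and should be adapted to the naming scheme of your annotation.

```python
from collections import defaultdict


def merged_exome_size(bed_lines, keep_chrom=None):
    """Sum the lengths of exon intervals after merging overlaps per chromosome.

    bed_lines: iterable of BED-style lines (chrom, 0-based start, exclusive end).
    keep_chrom: optional predicate; chromosomes for which it returns False are skipped.
    """
    by_chrom = defaultdict(list)
    for line in bed_lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        if keep_chrom is not None and not keep_chrom(chrom):
            continue
        by_chrom[chrom].append((int(start), int(end)))

    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or adjacent exon: extend the current block.
                cur_end = max(cur_end, end)
            else:
                # Gap found: close the current block and start a new one.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


def is_primary(chrom):
    """Assumed filter: keep only 1-22, X, Y and the mitochondrial chromosome."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    return name in {str(i) for i in range(1, 23)} | {"X", "Y", "MT", "M"}


# Toy data, not real annotation: two overlapping exons, one separate exon,
# and one exon on an unplaced scaffold.
example = [
    "1\t100\t200",
    "1\t150\t250",
    "1\t300\t400",
    "GL000194.1\t10\t60",
]
print(merged_exome_size(example))              # 300: all sequences included
print(merged_exome_size(example, is_primary))  # 250: scaffold excluded
```

For a real file you could pass an open file handle instead of the list. The two printed numbers show how much the scaffold/patch decision alone can shift the total.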
Cheers, Michael
[Update] This comment was written in response to an awk script that has since been removed from the original post.