Hi,
I downloaded the NCBI RefSeq curated track (Genes and Gene Predictions group) from the UCSC Table Browser for hg38, because I want to use the exon coordinates as a target file for calling variants on exome sequencing data.
However, I noticed that the exon coordinates cover approximately twice as much of the genome as the hg19 exon coordinates did (~80 million bp vs. ~40 million bp). Is it possible that the exome is really twice as large in hg38?
I do not want to call variants on all of these regions, since ~30% of them are not covered at all in my WES data and another ~10% are covered at <10x. I would definitely like to exclude these regions from the target file, but I do not fully understand what they are or why they were included in the first place.
Any help would be greatly appreciated.
No, exons should not vary that much from freeze to freeze. More importantly, if this is really about exome coverage, use the BED file that came with your capture kit. If you are only interested in coding variants, use CCDS.
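One thing worth checking before comparing the two totals: a gene prediction table lists exons per transcript, so the same bases can be counted several times if the exon lengths are simply summed, and an hg38 download may also include rows on alternate or fix contigs. Whether either of these explains your numbers is an assumption on my part. A minimal sketch of counting non-redundant bases, assuming BED-style 0-based half-open (chrom, start, end) tuples; the function names and the toy data are made up for illustration:

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start <= cur_end:
                # Overlaps (or touches) the current block: extend it.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        if cur_start is not None:
            merged.append((chrom, cur_start, cur_end))
    return merged


def total_bases(intervals):
    """Sum of interval lengths, assuming half-open coordinates."""
    return sum(end - start for _, start, end in intervals)


# Toy example (hypothetical coordinates): two transcripts sharing an exon.
exons = [
    ("chr1", 100, 200),
    ("chr1", 150, 250),
    ("chr1", 300, 400),
    ("chr2", 10, 20),
    ("chr1", 100, 200),
]
raw_total = total_bases(exons)
merged_total = total_bases(merge_intervals(exons))
print(raw_total, merged_total)
```

If the merged total for hg38 is close to the hg19 figure, the difference is just redundancy between transcripts. Either way, the kit BED file is the better target for coverage questions.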