I'm looking for some broad rules of thumb for how many variants to expect from WGS versus WES.
The idea is that, if I am given a VCF of unknown origin and coverage, what does the number of rows in the VCF tell me about whether it was derived from WES or WGS?
Or are there any other signals to look for to be able to easily determine this?
Well, you could check whether most of the variants fall in exonic regions or are spread all over the genome irrespective of exons. That could already give you an idea. Download a GTF for your organism, extract the exon regions, and intersect your variants with the extracted exons. The number depends heavily on the depth, but one can expect more variants from WGS than from WES.
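The exon-overlap check suggested above could be sketched roughly as follows. This is only an illustration using the Python standard library, not a replacement for a dedicated interval tool. The function names `load_exons` and `exonic_fraction` are made up for this example, and it assumes the GTF and the VCF use the same chromosome naming (e.g. both `chr1`, or both `1`).

```python
import bisect
from collections import defaultdict


def load_exons(gtf_lines):
    """Collect merged exon intervals (1-based, inclusive) per chromosome."""
    raw = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # GTF columns: seqname, source, feature, start, end, ...
        if len(fields) >= 5 and fields[2] == "exon":
            raw[fields[0]].append((int(fields[3]), int(fields[4])))
    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1] + 1:  # overlapping or adjacent: extend
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    return merged


def exonic_fraction(vcf_lines, exons):
    """Return (variants whose POS lies in an exon, total variants)."""
    inside = total = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t", 2)[:2]
        pos = int(pos)
        total += 1
        starts, ends = exons.get(chrom, ([], []))
        # Last merged exon that starts at or before POS, if any.
        i = bisect.bisect_right(starts, pos) - 1
        if i >= 0 and pos <= ends[i]:
            inside += 1
    return inside, total
```

With real files you would pass open file handles, e.g. `exonic_fraction(open("sample.vcf"), load_exons(open("genes.gtf")))` (both file names are placeholders). A high exonic share would point towards WES, while a share close to the exonic share of the genome would point towards WGS.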
I'm looking for something that doesn't require actually looking at the variants. The few WGS VCFs I have seen tend to have 4-5M rows, so I would have thought that WES, which reads ~1% of the genome, would yield on average around 40-50K variants. Can I not just count rows and draw a conclusion based on whether the count is closer to 50K or 5M? Is there a flaw in this thinking?
WES may have sufficient depth outside the baited areas to call variants. So, if you just restrict to variants within baited areas, the density should be similar. But generally, if I had a bunch of WES and WGS VCF files, I'd expect the WES ones to be much smaller. Maybe on the order of 1/100th the size. I'd be surprised to see one even 1/10th the size of a WGS VCF.
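For what it's worth, the row-count rule of thumb from this thread can be written down in a few lines. This is a rough sketch, not a validated classifier. The function names are invented for this example, the default of 4.5M rows is simply the middle of the 4-5M range quoted above, and the "more than half of a typical WGS" cutoff is my own assumption. The 1/10th cutoff comes from the remark above that a WES VCF even 1/10th the size of a WGS one would be surprising. It also assumes a single-sample VCF containing only variant records (a gVCF or a large multi-sample VCF would have far more rows).

```python
import gzip


def count_variant_rows(path):
    """Count non-header, non-empty rows in a plain or gzip-compressed VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle
                   if line.strip() and not line.startswith("#"))


def guess_assay(n_rows, wgs_typical=4_500_000):
    """Guess the assay from the row count alone (thresholds are assumptions)."""
    if n_rows < wgs_typical / 10:
        return "likely WES"
    if n_rows > wgs_typical / 2:
        return "likely WGS"
    return "ambiguous"
```

Usage would be something like `guess_assay(count_variant_rows("sample.vcf.gz"))`, where the file name is a placeholder. Anything landing in the "ambiguous" band is probably worth checking with the exon-overlap approach instead.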