Currently, when calling variants you are expected to call them on all samples simultaneously (joint calling), so that rare variant positions can be checked and genotyped even in individuals who do not carry the variant.
My question is: why wouldn't it make more sense to simply determine a call at every genomic position (~3.2 billion) per sample and be done with it? Such a dataset could easily be merged with others, and there would be no need for joint variant calling. It is very frustrating to see such an inefficiency become the default.
Storage is expensive.
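A rough back-of-envelope sketch of that claim (all figures below are illustrative assumptions, not measurements): emitting one record per position per sample scales with the full genome length, whereas a gVCF collapses confidently-reference stretches into blocks.

```python
# Back-of-envelope size estimate for "call every position" output.
# Every number here is an illustrative assumption, not a benchmark.

GENOME_POSITIONS = 3.2e9   # approximate haploid human genome length
BYTES_PER_RECORD = 40      # assumed compressed bytes per VCF record
SAMPLES = 1000

# Naive all-sites output: one record at every position, every sample.
all_sites_bytes = GENOME_POSITIONS * BYTES_PER_RECORD * SAMPLES
print(f"all-sites cohort: {all_sites_bytes / 1e12:.0f} TB")

# gVCF-style output: variant records plus reference-block records only.
VARIANT_SITES = 4.5e6      # typical SNV+indel count per genome
REF_BLOCKS = 50e6          # assumed number of reference blocks per sample
gvcf_bytes = (VARIANT_SITES + REF_BLOCKS) * BYTES_PER_RECORD * SAMPLES
print(f"gVCF-style cohort: {gvcf_bytes / 1e12:.1f} TB")
```

Under these assumptions the all-sites approach is roughly 60x larger for the same cohort; the real ratio depends on compression and record width, but the scaling argument holds.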
This is false in two ways: the current approach is itself extremely data-intensive, and it additionally requires re-doing the whole joint calling every time a new sample comes along. This is madness.
I've got more than 1000 WGS BAMs here. I'm frequently asked to re-genotype a subset of samples with the latest version of GATK or with a new genotyper. Storing each VCF, in a joint-genotyping world, is nonsense.
Is there a paper that addresses why this is nonsense? A VCF takes far less storage than a BAM, and supposedly contains all the information needed to do what joint calling allows. After all, a gVCF is exactly that same kind of VCF.
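One concrete difference between the two formats, sketched with hypothetical toy records: a plain per-sample VCF only has rows where a variant was called, so the absence of a row is ambiguous, while a gVCF also records reference blocks with their own confidence.

```python
# Toy illustration (hypothetical records, not real VCF parsing) of why a
# plain per-sample VCF loses information that joint genotyping needs.

# Sample A's plain VCF: only called variants appear.
vcf_a = {("chr1", 1000): {"alt": "T", "GQ": 60}}

# Is sample A homozygous reference at chr1:2000, or simply uncovered?
# The plain VCF cannot distinguish the two cases.
print(("chr1", 2000) in vcf_a)  # False either way -- can't tell

# A gVCF adds reference blocks with a confidence value, resolving this.
gvcf_a = {
    ("chr1", 1, 999):     {"kind": "ref_block", "min_GQ": 50},
    ("chr1", 1000, 1000): {"kind": "variant", "alt": "T", "GQ": 60},
    ("chr1", 1001, 5000): {"kind": "ref_block", "min_GQ": 45},
}

def evidence_at(gvcf, chrom, pos):
    """Return the record covering a position, making 'confident ref' explicit."""
    for (c, start, end), rec in gvcf.items():
        if c == chrom and start <= pos <= end:
            return rec
    return None  # truly no data at this position

print(evidence_at(gvcf_a, "chr1", 2000))  # ref_block, min_GQ 45
```

This is why merging already-called plain VCFs is not equivalent to joint genotyping from gVCFs: at a site where only one sample has a call, the other samples' reference confidence is simply missing.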
Doing this would mean you would have to sequence everything to higher coverage in order to confirm that rare variants are real and not artefacts. Joint calling means you can get away with lower coverage.
Can you elaborate on this? The only benefit seems to be better probability estimates. However, you can store probability information in the VCF and merge later with that in mind.
If other individuals in the cohort also have the same variant, it is less likely to be a sequencing artefact than if you were looking at a single individual in isolation.
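That intuition can be sketched with a toy likelihood calculation. The error rate and prior below are illustrative assumptions, and the model assumes sequencing errors are independent across samples; under those assumptions, seeing the same allele in several samples shifts the odds sharply toward a real variant.

```python
import math

# Toy model: odds that an allele seen in k of n samples is real.
# ERR and PRIOR are illustrative assumptions, not calibrated values,
# and errors are assumed independent across samples.
ERR = 1e-3    # per-sample chance an artefact mimics this exact allele
PRIOR = 1e-4  # prior probability the site is truly polymorphic

def log_odds_real(k):
    """Log10 odds the allele is real when k samples independently show it."""
    p_real = PRIOR                     # one true variant explains all k
    p_artefact = (1 - PRIOR) * ERR**k  # k independent errors at one site
    return math.log10(p_real / p_artefact)

for k in (1, 2, 3):
    print(f"seen in {k} sample(s): log10 odds = {log_odds_real(k):.1f}")
```

With these numbers a singleton is still odds-against being real, while two or three concordant samples flip the odds decisively; this is one way to see why joint calling rescues low-coverage sites.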
Unless the variant is associated with lower mapping or base quality; in that case, the fact that you see it recur within the same sequencing run makes it more likely to be an artefact.
But you can't get that same information from already-called VCFs, am I wrong?