Why isn't it common practice to call variants at every position in the human genome?
1
1
3.0 years ago

Currently, when calling variants, you have to call them on all samples simultaneously, which allows rare variant positions to be checked and genotyped in every person who does not carry the variant.

My question is, why wouldn't it make more sense to just determine the genotype at every genomic position (3.2 billion) and leave it at that? You could easily merge this dataset with others, and there would be no need for joint variant calling etc. It is very frustrating to see such an inefficiency become the default.

ngs • 1.9k views
0

My question is, why wouldn't it make more sense to just determine variant in every genomic position

storage is expensive.

0

This is false in two ways:

  1. Joint calling means you have to re-call all individuals every time a new individual becomes available. That is expensive. You also have to keep the BAM files for every individual, at ~100 GB each.
  2. For 100 samples you will have to store 100 × 100 GB = 10 TB of data. On the other hand, 100 VCFs of 3.2 billion sites each will take up only 10 GB (not TB).
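
To make the arithmetic above easy to check, here is a minimal sketch that uses only the figures quoted in the post (~100 GB per BAM, 100 samples, 10 GB for all 100 VCFs, 3.2 billion sites). Decimal units (1 GB = 10^9 bytes) and the variable names are assumptions for illustration, not measurements.

```python
# Back-of-the-envelope check of the storage figures quoted in the post.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 TB = 10**12 bytes).
GB = 10**9
TB = 10**12

n_samples = 100
bam_bytes_per_sample = 100 * GB   # ~100 GB per BAM, as stated in the post
total_vcf_bytes = 10 * GB         # claimed total for 100 all-sites VCFs
sites = 3_200_000_000             # 3.2 billion genomic positions

total_bam_bytes = n_samples * bam_bytes_per_sample
vcf_bytes_per_sample = total_vcf_bytes / n_samples
bytes_per_site = vcf_bytes_per_sample / sites

print(f"BAM total: {total_bam_bytes / TB:.0f} TB")
print(f"VCF per sample: {vcf_bytes_per_sample / GB:.1f} GB")
print(f"Implied bytes per site: {bytes_per_site:.5f}")
```

The BAM total does come out at 10 TB. The 10 GB VCF figure, taken at face value, implies about 0.03 bytes per site per sample, so it can only hold if runs of non-variant positions are collapsed or heavily compressed rather than written one record per site.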

This is madness. The current approach is extremely data intensive and additionally requires redoing the whole calling every time a new sample comes along.

0

I've got more than 1000 WGS BAMs here. I'm frequently asked to re-genotype a subset of samples using the latest version of GATK or a new genotyper. Storing each VCF, in a joint-genotyping world, is nonsense.

1

Is there a paper that addresses why this is nonsense? A VCF takes far less storage than a BAM and supposedly contains all the information needed to do what joint calling allows. A gVCF is essentially that kind of VCF.

0

Doing this would mean you would have to sequence everything to higher coverage in order to confirm the rare variants are true and not artefacts. Joint-calling means you can get away with lower coverage.

0

Can you elaborate on this? The only benefit is better probability estimates. However, you can save the probability information in the VCF and take it into account when merging later.
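
As an illustration of the probability information a VCF can carry: the VCF format defines a per-sample `PL` field holding Phred-scaled genotype likelihoods. The sketch below converts such a field back to normalized likelihoods. The function name `pl_to_probs` and the example values are hypothetical, and treating the normalized values as probabilities assumes a flat prior over genotypes.

```python
def pl_to_probs(pl_field):
    """Convert a PL string such as "0,30,300" into normalized genotype likelihoods.

    PL values are Phred-scaled, so likelihood = 10 ** (-PL / 10).
    Normalizing to sum to 1 assumes a flat prior over genotypes.
    """
    pls = [int(x) for x in pl_field.split(",")]
    raw = [10 ** (-pl / 10) for pl in pls]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical PL values for genotypes 0/0, 0/1, 1/1 at one site in one sample.
probs = pl_to_probs("0,30,300")
print(probs)
```

Here the 0/0 genotype takes almost all of the probability mass. A later merge step could, in principle, work from values like these instead of hard genotype calls.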

0

If other individuals in the cohort also have the same variant, it makes it less likely to be a sequencing artefact than if you are looking at a single individual.
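
A toy sketch of that reasoning, not how any particular caller implements it: treat each carrier's reads as a likelihood ratio in favour of "real variant" over "artefact" and multiply them into the prior odds. The function name, the prior of 0.001, and the likelihood ratio of 20 are all made-up illustrative values, and the calculation assumes errors are independent across samples.

```python
def posterior_real(prior, likelihood_ratios):
    """Posterior probability that a site is a real variant.

    prior: prior probability that the site is truly variant.
    likelihood_ratios: one value per sample showing the variant,
        P(reads | real variant) / P(reads | artefact).
    Assumes errors are independent between samples.
    """
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)


# Same weak per-sample evidence, seen in one sample versus three samples.
one_carrier = posterior_real(0.001, [20.0])
three_carriers = posterior_real(0.001, [20.0, 20.0, 20.0])
print(one_carrier, three_carriers)
```

With these made-up numbers, the posterior rises from roughly 2% with one carrier to roughly 89% with three, which is the sense in which a cohort lets you get away with weaker evidence per sample. The conclusion depends entirely on the independence assumption.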

1

Unless the variant is associated with lower mapping or read quality, in which case seeing it recur in the same sequencing run makes it more likely to be an artefact.

0

But you can't get the same information from already-called VCFs, am I wrong?

3
3.0 years ago

You are correct, but you have conflated two issues - variant calling and warehousing. First, I don't think you will see joint genotyping being done routinely in 5 years - single-sample calling with instrument-specific training sets is where things are headed. Second, the bigger groups are moving, or have already moved, to variant warehousing rather than VCFs - TileDB, Google Variant Transforms, GenomicsDB - and these want gVCFs as input.
