I want to improve the accuracy of my variant calls by using multiple variant-calling tools such as Varscan, GATK, Samtools (vcftools) and BreakDancer. Is there a pipeline that combines the results of the individual tools into a single output?
From my experience in the clinical genetics scene in the UK, sampling reads at 'random' from your BAM file (picard DownsampleSam) and then calling variants on each 'sub BAM' with samtools mpileup piped into bcftools call (and then obtaining a consensus listing of all variants) is enough to find all true-positives that can possibly be found. On many occasions, GATK and other tools will 'miss' variants, for whatever reason. It is erroneous to believe that simply running multiple tools on the same sample, or repeating the same sample in the lab, is enough.
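As a rough sketch of the workflow described above, the following Python only builds the command lines and does not run them. The file names (sample.bam, ref.fa, sub1.bam ...), the number of sub-BAMs and the 25% fraction are assumptions for illustration, and picard.jar is assumed to be in the working directory. The post names samtools mpileup; the sketch swaps in bcftools mpileup because, as far as I know, recent samtools releases no longer emit genotype likelihoods, so adjust to whatever versions you have installed.

```python
import shlex


def downsample_cmd(bam, out_bam, fraction, seed):
    """Picard DownsampleSam command; P is the probability of keeping a read."""
    return " ".join([
        "java", "-jar", "picard.jar", "DownsampleSam",
        f"I={bam}", f"O={out_bam}", f"P={fraction}", f"RANDOM_SEED={seed}",
    ])


def call_cmd(ref, bam, out_vcf):
    """Shell pipeline: pileup piped into bcftools call (variant sites only)."""
    return (
        f"bcftools mpileup -f {shlex.quote(ref)} {shlex.quote(bam)} "
        f"| bcftools call -mv -Ov -o {shlex.quote(out_vcf)}"
    )


def plan(bam, ref, n_subsets=4, fraction=0.25):
    """Return (downsample, call) command pairs, one per sub-BAM.

    A different random seed is used for each sub-BAM so that each one
    holds a different random selection of reads.
    """
    steps = []
    for i in range(1, n_subsets + 1):
        sub_bam = f"sub{i}.bam"   # hypothetical output names
        sub_vcf = f"sub{i}.vcf"
        steps.append((downsample_cmd(bam, sub_bam, fraction, seed=i),
                      call_cmd(ref, sub_bam, sub_vcf)))
    return steps
```

Calling plan("sample.bam", "ref.fa") gives four command pairs that can be inspected before being handed to a shell or a workflow manager.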
A loose benchmark: 1000 Genomes variants were identified by merging the calls from multiple variant callers. However, using the method above, it was very easy to find all variants in 1000 Genomes that had already been found (and there was even evidence that the consortium had missed variants that should have been reported).
Virtually all variant callers look at only a certain number of reads when calling a variant and ignore all other [reads]. Other callers may do this and/or apply a posterior probability that a variant is present.
By 'splitting' the BAM file into multiple BAM files of randomly-selected reads, the odds are shifted in favour of detecting a variant in at least one of the sub BAMs. It is like 'shuffling' the deck of cards.
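The 'shifted odds' can be illustrated with a small calculation. This is only a toy model: it assumes each sub-BAM detects the variant independently with the same probability, which is not strictly true because the sub-BAMs are drawn from the same pool of reads. The function name and the example probabilities are made up for illustration.

```python
def prob_detected_at_least_once(p_single, n_subsets):
    """Probability that at least one of n_subsets calls the variant.

    Assumes (unrealistically) that the sub-BAMs are independent and that
    each one detects the variant with probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_subsets < 1:
        raise ValueError("n_subsets must be at least 1")
    return 1.0 - (1.0 - p_single) ** n_subsets
```

Under these assumptions, a variant with a 50% chance of being called in any one sub-BAM has a 75% chance of turning up in at least one of two. The same arithmetic applies to a spurious call, so the per-call false-positive rate matters just as much.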
Hi Kevin, I'm also interested in, and puzzled by, the idea of subsampling:
"the odds are shifted in favour of detecting a variant in at least one of the sub BAMs."
Sure, but shouldn't this come at the expense of more false positives? Taken to the extreme, one would call a variant wherever a read mismatches the reference.
Hey, that's a good point, dariober. However, we controlled for this by never making a variant call below a read-depth of 18 (in any sub-BAM). Again, we had data to show that 18 was the absolute bare minimum at which anyone should be calling a variant. If a region of interest fell below 18, we had to send that region for Sanger seq.
Usually, targeted panels achieve 100x depth of coverage, so, even after sub-sampling reads to 25% (about 25x on average), most bases will still be above a read-depth of 18.
This is all for germline variants, of course.
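To make the consensus step with the depth cut-off concrete, here is a minimal sketch that merges the calls from several sub-BAM VCFs and ignores any call with a read depth below 18. It assumes uncompressed VCF text with the depth in the INFO field as DP=..., which is what bcftools writes by default; the function names are hypothetical, and multi-allelic ALT values are kept as a single string rather than split.

```python
MIN_DEPTH = 18  # the minimum read depth quoted in the post


def parse_vcf_lines(lines):
    """Yield ((chrom, pos, ref, alt), depth) for each record line."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, info = (
            fields[0], int(fields[1]), fields[3], fields[4], fields[7])
        depth = None
        for item in info.split(";"):
            if item.startswith("DP="):
                depth = int(item[3:])
        yield (chrom, pos, ref, alt), depth


def consensus(vcfs, min_depth=MIN_DEPTH):
    """Count, per variant, how many sub-BAMs call it at sufficient depth.

    vcfs is an iterable of VCFs, each given as an iterable of text lines.
    Records without a DP value are skipped.
    """
    support = {}
    for lines in vcfs:
        passed = set()
        for key, depth in parse_vcf_lines(lines):
            if depth is not None and depth >= min_depth:
                passed.add(key)
        for key in passed:
            support[key] = support.get(key, 0) + 1
    return support
```

The keys of the returned dictionary are the union of all calls that pass the depth filter, and the counts show how many sub-BAMs support each one, so a stricter rule can be applied afterwards if wanted. With real files, pass open file handles, for example consensus([open("sub1.vcf"), open("sub2.vcf")]).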
Hello atiqueulalam,
Each variant caller has its strengths and weaknesses. If you create a pipeline that says "only a variant found by x variant callers is a true variant", the price will be sensitivity: you will lose a bunch of true variants.
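The trade-off can be seen on a toy example. The sketch below keeps only variants reported by at least a given number of callers and measures sensitivity against a truth set. The caller names, variant labels and truth set in the usage note are invented purely for illustration.

```python
from collections import Counter


def variants_with_min_support(calls_by_caller, min_callers):
    """Return the variants reported by at least min_callers callers.

    calls_by_caller maps a caller name to the set of variants it reported.
    """
    counts = Counter()
    for calls in calls_by_caller.values():
        counts.update(set(calls))  # each caller votes once per variant
    return {variant for variant, n in counts.items() if n >= min_callers}


def sensitivity(called, truth):
    """Fraction of true variants that were called."""
    if not truth:
        return 0.0
    return len(set(called) & set(truth)) / len(truth)
```

For example, with calls = {"gatk": {"A", "B"}, "varscan": {"A", "C"}, "samtools": {"A", "B", "D"}} and a truth set {"A", "B", "C"}, requiring one caller recovers all three true variants (plus the false "D"), requiring two callers recovers only "A" and "B", and requiring all three recovers only "A".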
fin swimmer