I just want to check whether my understanding is correct.
For multi-sample SNP/indel calling with GATK, what I should do is:
1. Run BWA alignment and mark duplicates independently for each sample;
2. Realign and recalibrate each BAM file independently.
Then I get, say, A.recal.bam, B.recal.bam, C.recal.bam, ...
3. Then for the UnifiedGenotyper step (SNP calling), I can input all of A.recal.bam, B.recal.bam, C.recal.bam and call SNPs together, so that I eventually get one VCF file integrating the SNP calls across all samples.
Am I correct?
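For reference, step 3 can be sketched as a small Python helper that assembles a multi-sample UnifiedGenotyper command by repeating `-I` once per BAM. This is only a sketch: the jar name `GenomeAnalysisTK.jar`, the reference `ref.fasta`, and the output name `all_samples.vcf` are placeholders, not values from this thread, and it assumes the GATK 1.x to 3.x style command line with `-T`, `-R`, `-I`, and `-o`.

```python
import shlex


def unified_genotyper_cmd(reference, bams, out_vcf, gatk_jar="GenomeAnalysisTK.jar"):
    """Build a multi-sample UnifiedGenotyper command as an argument list.

    All file names passed in are placeholders chosen by the caller.
    """
    if not bams:
        raise ValueError("at least one BAM file is required")
    cmd = ["java", "-jar", gatk_jar, "-T", "UnifiedGenotyper", "-R", reference]
    # One -I per sample BAM, so all samples are genotyped in a single run.
    for bam in bams:
        cmd += ["-I", bam]
    cmd += ["-o", out_vcf]
    return cmd


cmd = unified_genotyper_cmd(
    "ref.fasta",  # hypothetical reference path
    ["A.recal.bam", "B.recal.bam", "C.recal.bam"],
    "all_samples.vcf",  # hypothetical output name
)
# Print a shell-safe command line for inspection; nothing is executed here.
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run` once the paths point at real files.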
Also, GATK recommends:
"Finally, if you really want to get the absolute best results, whatever the computational cost, then we recommend doing multiple sample realignment so that novel indels in one sample help to realign reads in other samples"
It seems it's best to merge all BAM files and realign them together so that indels in one sample can help realignment in the other samples. But in practice, especially when we have a very large number of exome samples, this becomes unrealistic because of the extremely high computational cost, right? thx
Edit:
I thought about it for a while, and I would say there's no problem with first getting independent recal.bam files. But after that we can proceed in different ways:
1. Merge all BAMs together and call SNPs. (This is impractical when the total sample number is very large, say 200.) So let's forget about this one.
2. Merge all BAMs in a trio together and call SNPs.
3. Call SNPs independently for each BAM file, then merge the VCFs of the members of each trio into one big VCF per trio.
I'm just curious: for option 2 and option 3, will the final result for each trio be different or the same?
thx
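One way options 2 and 3 can differ, shown as a toy Python sketch with made-up data (the sample names, sites, and genotypes below are invented for illustration, not real calls): when each sample is called on its own, its VCF only has records where that sample looked variant. After merging, a sample with no record at a site typically ends up as a missing genotype (`./.`), and you cannot tell "homozygous reference" from "not enough data". Joint calling on the trio instead emits a genotype for every member at every called site.

```python
# Toy per-sample call sets: site -> genotype. Invented data for illustration.
per_sample_calls = {
    "father": {"chr1:1000": "0/1", "chr1:2000": "1/1"},
    "mother": {"chr1:1000": "0/1"},
    "child": {"chr1:1000": "1/1", "chr1:3000": "0/1"},
}


def merge_independent_calls(calls):
    """Merge per-sample call sets into one table of site -> {sample: genotype}.

    A sample with no record at a site is filled in as './.' (missing),
    because independent calling left no evidence either way at that site.
    """
    sites = sorted({site for sample_calls in calls.values() for site in sample_calls})
    samples = list(calls)
    return {
        site: {sample: calls[sample].get(site, "./.") for sample in samples}
        for site in sites
    }


merged = merge_independent_calls(per_sample_calls)
for site, genotypes in merged.items():
    print(site, genotypes)
```

In this toy table the mother is `./.` at chr1:2000 even though the father is `1/1` there, which is exactly the kind of gap that makes Mendelian checks on an option 3 trio VCF harder than on an option 2 one.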
Thanks, Alex, but please see my edit. If I run the BAM files in a trio independently, will the results be different?
@bioscientist: Potentially, yes, your results may be different if analyzed separately. This is from GATK v3 best practices for variant detection:
"The problem is that the raw VCF will have many sites that aren't really genetic variants but are machine artifacts that make the site statistically non-reference."
(http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v3#Multi-sample_SNP_and_indel_calling)
In other words, there are crappy calls that are artifacts of sequencing that will make it into your VCF. Individually, a crap call in one .bam file may not get tossed statistically by GATK, and will make it into your VCF. If you exome sequenced a related trio on the same run, it is likely they will have the same crappy calls generated from that run, and if you group those .bam files together, you increase GATK's chances of tossing a whole lot more of them. Good luck!
thanks! that makes sense!