I am working on identifying SNPs in the genome of an insect that was sequenced in three samples. Following the GATK Best Practices pipeline, I have completed the following tasks:
1. Quality-filtered the reads.
2. Aligned the reads for each sample (S1, S2, S3).
3. Sorted and marked duplicates in each sample (S1, S2, S3).
4. Ran realigner target creation and indel realignment on each sample (S1, S2, S3).

Now I am stuck on the next step, base quality score recalibration (BQSR), which requires a database of known SNPs. I don't have one, so I am following the alternative steps that GATK describes for non-model organisms:

5. Called raw SNPs in GVCF mode for each sample, then performed joint calling to get a final combined VCF file.
6. Applied hard filters to this file and extracted the good-quality SNPs and indels to serve as the reference database for the BQSR step.
My question: I now have three realigned files (S1_realigned.bam, S2_realigned.bam, S3_realigned.bam) from step 4 and a reference database of SNPs and indels (if I am right) from step 6. How should I proceed? Should I run the recalibration on each sample separately, or build one combined recalibration table by providing all three files in the same command?
Thanks.
Dr. Deepak
Hello deepkumar1983!
It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=82707
This is typically not recommended as it runs the risk of annoying people in both communities.
If you are using GATK4, you don't need to do indel realignment, since HaplotypeCaller performs its own local realignment (haplotype reassembly) around variant sites. Check out this post from GATK for more details.
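
On the recalibration question itself, one common approach is to run BQSR on each sample separately, passing the same bootstrapped known-sites file to every run. As far as I know, BaseRecalibrator models errors per read group, so a per-sample run should be reasonable as long as each BAM carries its own read group. Below is a minimal sketch in Python that only builds the GATK4 command lines for the three BAMs named in the question. The reference name `reference.fasta`, the known-sites name `filtered_snps_indels.vcf.gz`, the output names, and the helper `bqsr_commands` are all assumptions made for illustration, not something taken from the original post.

```python
import shlex

# Hypothetical file names: only the S*_realigned.bam names come from the question.
REFERENCE = "reference.fasta"
KNOWN_SITES = "filtered_snps_indels.vcf.gz"
SAMPLES = ["S1", "S2", "S3"]


def bqsr_commands(sample, reference=REFERENCE, known_sites=KNOWN_SITES):
    """Return the two GATK4 command lines (as argument lists) for one sample."""
    bam = f"{sample}_realigned.bam"
    table = f"{sample}_recal.table"      # assumed output name
    recal_bam = f"{sample}_recal.bam"    # assumed output name

    # Step 1: build the recalibration table from the bootstrapped known sites.
    base_recalibrator = [
        "gatk", "BaseRecalibrator",
        "-R", reference,
        "-I", bam,
        "--known-sites", known_sites,
        "-O", table,
    ]
    # Step 2: apply that table to the same BAM to write recalibrated qualities.
    apply_bqsr = [
        "gatk", "ApplyBQSR",
        "-R", reference,
        "-I", bam,
        "--bqsr-recal-file", table,
        "-O", recal_bam,
    ]
    return [base_recalibrator, apply_bqsr]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for sample in SAMPLES:
        for cmd in bqsr_commands(sample):
            print(shlex.join(cmd))
```

The script prints the commands rather than running them, so they can be checked against your GATK version first. With bootstrapped known sites, the call, filter, and recalibrate cycle is usually repeated on the recalibrated BAMs until the results stop changing much.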