Question

GATK SNPs calling multiple sample without any known SNPs database

1

6.5 years ago

deepkumar1983 ▴ 70

I am working on identifying SNPs in the genome of an insect that was sequenced in three samples. Following the GATK Best Practices pipeline, I have completed the following steps:

1. Quality-filtered the reads.
2. Aligned the reads for each sample (S1, S2, S3).
3. Sorted and marked duplicates in each sample (S1, S2, S3).
4. Ran realigner target creation and indel realignment on each sample (S1, S2, S3).

Now I am stuck on the next step, base quality score recalibration (BQSR), which requires a database of known SNPs. I don't have one, so I am following the alternative steps GATK describes for non-model organisms:

5. I called the raw SNPs in GVCF mode for each sample and then did joint genotyping to get a final combined VCF file.
6. Next, I applied hard filters to this file and extracted the good-quality SNPs and indels to use as the reference database for the BQSR step.

Now my question: I have three realigned files (S1_realigned.bam, S2_realigned.bam, S3_realigned.bam) from step 4 and a reference database of SNPs and indels (if I am right) from step 6. How should I proceed from here? Should I run the recalibration on each sample separately, or build a combined recalibration table by providing all three files in the same command?

Thanks, Dr. Deepak

GATK SNP BSQR Multiple-sample • 3.0k views

updated 6.5 years ago by BioinfGuru ★ 2.1k • written 6.5 years ago by deepkumar1983 ▴ 70

0

Hello deepkumar1983!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=82707

This is typically not recommended as it runs the risk of annoying people in both communities.

6.5 years ago by Pierre Lindenbaum 164k

0

If you are using GATK4 you don't need to do indel realignment (the realignment tools were not carried over to GATK4), since HaplotypeCaller performs its own local reassembly of reads in regions that show variation. Check out this post from GATK for more details.

6.4 years ago by James Reeve ▴ 130

score 0 · Answer 1 · 2018-06-04

You need to re-create the BAM files by running BQSR on the realigned BAM files from step 4, using the SNP database (VCF file) you just created in step 6 as the argument for the --known-sites option.
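For what it's worth, here is a minimal sketch of what that could look like per sample, assuming GATK4 (where the option is spelled --known-sites; GATK3, which the indel realignment steps suggest you may be on, uses -T BaseRecalibrator with -knownSites and -T PrintReads with -BQSR instead). The reference name reference.fasta, the known-sites file names and the output names are placeholders, not anything from your setup. The script only builds and prints the commands, so you can inspect them before running anything.

```python
import shlex
from typing import List


def bqsr_commands(sample: str, reference: str, known_sites: List[str]) -> List[List[str]]:
    """Build the two GATK4 commands that recalibrate one sample.

    Input BAM naming (<sample>_realigned.bam) follows the question;
    the output names are hypothetical.
    """
    in_bam = f"{sample}_realigned.bam"
    table = f"{sample}_recal.table"   # hypothetical output name
    out_bam = f"{sample}_recal.bam"   # hypothetical output name

    # Step 1: model the base quality errors, masking the bootstrapped variants.
    base_recal = ["gatk", "BaseRecalibrator", "-R", reference, "-I", in_bam]
    for vcf in known_sites:
        base_recal += ["--known-sites", vcf]
    base_recal += ["-O", table]

    # Step 2: write a new BAM with the adjusted base qualities.
    apply_bqsr = [
        "gatk", "ApplyBQSR",
        "-R", reference,
        "-I", in_bam,
        "--bqsr-recal-file", table,
        "-O", out_bam,
    ]
    return [base_recal, apply_bqsr]


if __name__ == "__main__":
    # Placeholder file names: substitute your own reference and step 6 VCFs.
    known = ["known_snps.vcf.gz", "known_indels.vcf.gz"]
    for sample in ("S1", "S2", "S3"):
        for cmd in bqsr_commands(sample, "reference.fasta", known):
            print(shlex.join(cmd))
```

Running each sample on its own, as sketched here, is one reasonable choice: read group is one of the covariates in the recalibration model, so reads are modelled per read group either way.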

This is from GATK BQSR GUIDELINES:

You should almost always perform recalibration on your sequencing data. In human data, given the exhaustive databases of variation we have available, almost all of the remaining mismatches -- even in cancer -- will be errors, so it's super easy to ascertain an accurate error model for your data, which is essential for downstream analysis. For non-human data it can be a little bit more work since you may need to bootstrap your own set of variants if there are no such resources already available for your organism, but it's worth it.

Here's how you would bootstrap a set of known variants:

First do an initial round of variant calling on your original, unrecalibrated data. Then take the variants that you have the highest confidence in and use that set as the database of known variants by feeding it as a VCF file to the BaseRecalibrator. Finally, do a real round of variant calling with the recalibrated data. These steps could be repeated several times until convergence. The main case figure where you really might need to skip BQSR is when you have too little data (some small gene panels have that problem), or you're working with a really weird organism that displays insane amounts of variation.
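The first part of that bootstrap (call variants on the unrecalibrated data, then keep only the high-confidence ones) might be sketched as follows, again assuming GATK4 tool and flag names. The filter thresholds are GATK's generic hard-filtering starting points for SNPs, not values tuned for this insect, and every file name other than the S*_realigned.bam inputs is hypothetical. As before, the script only assembles the command lines.

```python
import shlex
from typing import List

# GATK's generic starting-point hard filters for SNPs; an assumption here,
# they may need adjusting after looking at the annotation distributions.
SNP_FILTER = (
    "QD < 2.0 || FS > 60.0 || MQ < 40.0 || "
    "MQRankSum < -12.5 || ReadPosRankSum < -8.0"
)


def bootstrap_commands(samples: List[str], reference: str,
                       prefix: str = "bootstrap") -> List[List[str]]:
    """Commands for one bootstrap round that ends in a known-sites SNP VCF."""
    cmds: List[List[str]] = []

    # 1. Per-sample calling in GVCF mode on the unrecalibrated BAMs.
    gvcfs = []
    for s in samples:
        gvcf = f"{s}.g.vcf.gz"
        gvcfs.append(gvcf)
        cmds.append(["gatk", "HaplotypeCaller", "-R", reference,
                     "-I", f"{s}_realigned.bam", "-O", gvcf, "-ERC", "GVCF"])

    # 2. Combine the per-sample GVCFs and genotype them jointly.
    combined = f"{prefix}.g.vcf.gz"
    combine = ["gatk", "CombineGVCFs", "-R", reference]
    for g in gvcfs:
        combine += ["-V", g]
    cmds.append(combine + ["-O", combined])
    raw = f"{prefix}.raw.vcf.gz"
    cmds.append(["gatk", "GenotypeGVCFs", "-R", reference,
                 "-V", combined, "-O", raw])

    # 3. Extract SNPs, flag the ones failing the hard filters, keep the rest.
    raw_snps = f"{prefix}.raw_snps.vcf.gz"
    flagged = f"{prefix}.flagged_snps.vcf.gz"
    known = f"{prefix}.known_snps.vcf.gz"
    cmds.append(["gatk", "SelectVariants", "-R", reference, "-V", raw,
                 "--select-type-to-include", "SNP", "-O", raw_snps])
    cmds.append(["gatk", "VariantFiltration", "-R", reference, "-V", raw_snps,
                 "--filter-name", "snp_hard_filter",
                 "--filter-expression", SNP_FILTER, "-O", flagged])
    cmds.append(["gatk", "SelectVariants", "-R", reference, "-V", flagged,
                 "--exclude-filtered", "-O", known])
    return cmds


if __name__ == "__main__":
    for cmd in bootstrap_commands(["S1", "S2", "S3"], "reference.fasta"):
        print(shlex.join(cmd))
```

The final VCF from this round is what would be passed to BaseRecalibrator as the known-sites file. Indels would follow the same pattern with their own selection and filter expression.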