I have a question about filtering NGS data with a GATK workflow. My data come from Illumina sequencing of a targeted gene panel (Sureselect kit, about 112 genes). We usually analyze it with two programs, SureCall and Nextgene, with default settings.
The output of these analyses is a VCF file with about 500 variants.
My GATK workflow is basically this: a BAM created by bwa-mem, then I remove duplicates and do base recalibration. Then I use HaplotypeCaller to create a gVCF and finally a VCF.
The last step I do is hard filtering the raw SNP VCF with the filters from GATK -
The problem is that after this step my VCF has about 40,000 variants passing the filter. Do you have any idea which parameters I need to change to get about the same number of variants as software like SureCall reports?
Do you specify a .bed file in SureCall and Nextgene? I'd bet the abundance of variants from your pipeline comes from regions outside your target region. You can restrict your filtered VCF to the target region using the .bed file for your Sureselect kit.
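To illustrate the idea, here is a minimal Python sketch of restricting a VCF to the regions in a BED file. The function names (`load_bed`, `in_targets`, `filter_vcf`) are made up for this example and are not part of GATK or any other tool. It assumes an uncompressed VCF, that chromosome names match between the two files (e.g. both use `chr1` or both use `1`), and it only checks each record's POS, so an indel that starts just before a target and extends into it would be dropped. In practice you would more likely use an existing tool for this, or pass the BED file to GATK with `-L` so that calling is limited to the targets in the first place.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: sorted, merged [(start, end), ...]}.

    BED coordinates are 0-based and half-open: [start, end).
    """
    raw = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_targets(targets, chrom, pos):
    """Return True if 1-based VCF position `pos` lies inside a target."""
    intervals = targets.get(chrom)
    if not intervals:
        return False
    p = pos - 1  # convert 1-based VCF POS to 0-based BED coordinate
    # Index of the last interval whose start is <= p.
    i = bisect.bisect_right(intervals, (p, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= p < intervals[i][1]


def filter_vcf(vcf_lines, targets):
    """Yield header lines plus the records whose POS is on target."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if in_targets(targets, chrom, int(pos)):
            yield line
```

With real files this could be used roughly like `targets = load_bed(open("targets.bed"))` followed by writing out `filter_vcf(open("filtered.vcf"), targets)`, where both file names are placeholders. Comparing the record count before and after should show quickly whether off-target calls explain the 40,000 variants.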
Hello,
40,000 variants in 112 genes. That's far too many. Can you post the commands you use for the alignment and variant calling? The first few entries of your VCF file would also be interesting.
fin swimmer