Question

GATK - gene panel VCF - too many variants - how to filter out

3

Entering edit mode

8.3 years ago

stanedav ▴ 50

Hello guys,

I have question about filtering NGS data with GATK workflow. I have data from NGS Illumina (Sureselect kit), it is targeted gene panel (about 112 genes), usually we are using two softwares - Surecall and Nextgene with default settings. Output of this analyses is VCF file with about 500 variants.

Now I am trying to implement GATK workflow from fastq to VCF. (this workflow: https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS)

Basically it is - bam created by bwa-mem, then I remove duplicates and do base recalibration. Then I use Haplotype Caller for creating gvcf and finally VCF.

Last step that I do, is hard filtering raw snps VCF with filters from GATK -

--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"

Problem is, that after this step my vcf has about 40 000 variants passing the filter - do you guys have any idea, where I need to change parameters to get about same amount of variants as running by softwares like SureCall?

Thanks for any tips :)

NGS GATK VCF • 5.0k views

ADD COMMENT • link updated 6.4 years ago by Biostar 20 • written 8.3 years ago by stanedav ▴ 50

0

Entering edit mode

Hello,

40 000 variants in 112 genes. That's much to much. Can you post the commands you use for the alignment and variant calling? And also the first few entrys of your vcf file are interessting.

fin swimmer

ADD REPLY • link 8.3 years ago by finswimmer 16k

score 0 · Answer 1 · 2017-05-11

Here is my complete workflow. fastq data are from Sureselect kit:

./bwa mem -M -R \
    @RG\\tID:ID_1234\\tSM:SM_1234\\tLB:LB_1234\\tPL:Illumina\\tPI:150 '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    '/1234/1234_1.fastq' '/1234/1234_2.fastq' > '/1234/1234.sam'  

samtools sort -@ 8 \
    -T /path_to_temp/ \
    -o /1234/1234.bam

java -jar build/libs/picard.jar MarkDuplicates \
    INPUT='/1234/1234.bam' \
    OUTPUT='/1234/1234_dedup.bam' \
    METRICS_FILE='/1234/1234_metrics'

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -I '/1234/1234_remdup.bam' \
    -knownSites '/home/dee/bioinformatics/_reffiles/hg19/dbsnp_138.hg19.vcf' \
    -knownSites '/home/dee/bioinformatics/_reffiles/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf' \
    -knownSites '/home/dee/bioinformatics/_reffiles/hg19/1000G_phase1.indels.hg19.vcf' \
    -o '/1234/1234_recal.table'

java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -I '/1234/1234_remdup.bam' \
    -o '/1234/1234_remdup_recal.bam' \
    -BQSR '/1234/1234_recal.table'

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -I '/1234/1234_remdup_recal.bam' \
    -o '/1234/1234_remdup_recal.g.vcf' \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR \
    --variant_index_parameter 128000

java -jar GenotypeAnalysisTK.jar -T GenotypeGVCFs  \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -o '/1234/1234.vcf' \
    --variant '/1234/1234_remdup_recal.g.vcf' \

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -V '/1234/1234.vcf' \
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "GATKdef" \
    -o '/1234/1234_filtered.vcf'

score 0 · Answer 2 · 2017-05-11

0

Entering edit mode

8.3 years ago

Robert Sicko ▴ 640

targeted gene panel (about 112 genes)

Do you specify a .bed file in SureCall and Nextgene? I'd bet your abundance of variants in your pipeline come from regions outside your target region. You can filter your filtered vcf for your target region using the bed file for your Sureselect kit.

ADD COMMENT • link 8.3 years ago by Robert Sicko ▴ 640

score 0 · Answer 3 · 2017-05-22

0

Entering edit mode

8.3 years ago

stanedav ▴ 50

Bed file solved the problem! You are best guys!

ADD COMMENT • link 8.3 years ago by stanedav ▴ 50