Entering edit mode
3.3 years ago
Mehmet
▴
820
Dear All,
I have done variant calling in Germline data that has single sample of each individual and two genes.
I did following steps, but after checking results I found too many variants.
After Haplotypecaller (the step 6) I found 140900 known variants, and the steps 7 and 8 didn't filter any variants.
What should be causing this result?
1. index the reference fasta (hg19.fa)
bwa index hg19.fa
2. mapping to the reference genome (hg19.fa)
I randomly gave read group information.
bwa mem -t 16 -R '@RG\tID:21002\tSM:Sample_21002' hg19.fa ../../Trim_final/21002.1.FASTP.fastq.gz ../../Trim_final/21002.2.FASTP.fastq.gz > 21002.bam
3. picard sort bam file by queryname
java -jar ~/applications/picard-tools-2.23.0/picard.jar SortSam I=21002.bam O=21002.bam.PICARD.output.sorted.queryname.bam SORT_ORDER=queryname
4. mark duplicates by MarkDuplicatesSpark
gatk --java-options "-Xmx5g" MarkDuplicatesSpark -I 21002.bam.PICARD.output.sorted.queryname.bam -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam --tmp-dir ./Test/Tmps/
5. Base (Quality Score) Recalibration
Tools involved: BaseRecalibrator, Apply Recalibration
##first run BaseRecalibrator and later run ApplyBQSR using a table produced by BaseRecalibrator and a bam file produced at the step4.
###make an index of dbsnp_138.hg19.vcf
gatk IndexFeatureFile -F dbsnp_138.hg19.vcf ## this will generate dbsnp_138.hg19.vcf.idx
##### add read group and platform to bam files;
java -jar ~/applications/picard-tools-2.23.0/picard.jar AddOrReplaceReadGroups I=21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam O=21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam RGID=21002 RGLB=21002lib1 RGPL=ILLUMINA RGPU=21002 RGSM=20
#####BaseRecalibrator
gatk BaseRecalibrator -R hg19.fa -I 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam --known-sites dbsnp_138.hg19.vcf --known-sites ../Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.EDITED.gz -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.recal_data.table
#####ApplyBQSR
gatk ApplyBQSR -R hg19.fa -I 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam --bqsr-recal-file 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.recal_data.table -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam
6. HaplotypeCaller (single sample mode)
gatk HaplotypeCaller --bam-output 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.HaplotypeCaller.OUTPUT.bam --native-pair-hmm-threads 64 -R hg19.fa -I 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz -D dbsnp_138.hg19.vcf
7. CNNScoreVariants
gatk CNNScoreVariants -R hg19.fa -V 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz.CNNScoreVariants.OUTPUT.vcf
8. FilterVariantTranches
gatk --java-options "-Xmx7g" FilterVariantTranches -V 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz.CNNScoreVariants.OUTPUT.vcf --resource ../Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.EDITED.gz --resource ../dbsnp_138.hg19.vcf --info-key CNN_1D --snp-tranche 99.95 --indel-tranche 99.4 -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz.CNNScoreVariants.OUTPUT.vcf.FilterVariantTranches.OUTPUT.vcf
OK. I have figured it out. If we specify intervals of gene or genes of interest at
Haplotypecaller
or atVariantFiltration
run like-L chr4:100,232,323-101,343,433 -L chr15:104,543,211-104,403,129
(randomly typed intervals here for example), number of final variants are greatly reduced because we keep variants that we want to search at gene or genes of interest.this post is also a solution to set intervals at
Haplotypecaller
orVariantFiltration
steps that were asked and probably will be asked for other people.