Getting A Large Vcf File On Variant Calling Through Gatk
1
1
Entering edit mode
12.1 years ago

HI

I am using the following set of commands on GATK2.1.13 to generate a VCF file

               echo java -Xmx20g -jar  /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T RealignerTargetCreator -o my.intervals -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key 
               echo "Realignment Done at date" echo "Starting IndelRealigner at date"

              echo java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T IndelRealigner -targetIntervals my.intervals -o myrealignedBam.bam -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key 
             echo "Realignment done at date" 
             echo "Starting UnifiedGenotyper at date" echo java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -l INFO -R human_g1k_v37.fasta -T UnifiedGenotyper -I myrealignedBam.bam -o mygatk_vcf.vcf --output_mode EMIT_ALL_SITES -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key

When i do a 'mpileup' for B2_with_ReadGroup.ddup.sorted.bam , I get a devcent 10 MB VCF file. But on the last ste of the above pipeline, my " mygatk_vcf.vcf " is goinging into 81GBs !!

Do you know what is wrong ?

gatk variant vcf samtools • 3.5k views
ADD COMMENT
1
Entering edit mode
12.1 years ago

Your UnifiedGenotyper command line includes the option --output_mode EMIT_ALL_SITES, which will write a call for every site in the genome. This is creating the large file size. The default is to only provide calls at variant sites that differ from the reference, which be a more reasonable size. There is lots of useful documentation to help in choosing the right options:

ADD COMMENT
1
Entering edit mode

I guess that you are looking for both SNP and indels, so instead of using "--output_mode EMIT_ALL_SITES" you should use "-glm BOTH" (edited). if you are looking for something else then you'll definitely have to browse through Brad's suggestions (I would recommend it anyway, since GATK's docs are constantly changing, and it's good to have always a general plus a more straight vision of what GATK is really doing).

ADD REPLY
1
Entering edit mode

There is some confusion here. In case you need to call for both SNPs and Indels one should use genotypelikelihoodsmodel = BOTH. I dont think --output_mode has "BOTH" option.

ADD REPLY
0
Entering edit mode

yes, I probably wrote too fast my comment. it should be "-glm BOTH". the output mode option should be used when you want to get "all confident sites" or "all sites" instead of default "variants only". thanks for pointing it out, I'll edit my previous comment.

ADD REPLY

Login before adding your answer.

Traffic: 2302 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6