HI
I am using the following set of commands on GATK2.1.13 to generate a VCF file
echo java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T RealignerTargetCreator -o my.intervals -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key
echo "Realignment Done at date" echo "Starting IndelRealigner at date"
echo java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T IndelRealigner -targetIntervals my.intervals -o myrealignedBam.bam -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key
echo "Realignment done at date"
echo "Starting UnifiedGenotyper at date" echo java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -l INFO -R human_g1k_v37.fasta -T UnifiedGenotyper -I myrealignedBam.bam -o mygatk_vcf.vcf --output_mode EMIT_ALL_SITES -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key
When i do a 'mpileup' for B2_with_ReadGroup.ddup.sorted.bam , I get a devcent 10 MB VCF file. But on the last ste of the above pipeline, my " mygatk_vcf.vcf " is goinging into 81GBs !!
Do you know what is wrong ?
I guess that you are looking for both SNP and indels, so instead of using "--output_mode EMIT_ALL_SITES" you should use "-glm BOTH" (edited). if you are looking for something else then you'll definitely have to browse through Brad's suggestions (I would recommend it anyway, since GATK's docs are constantly changing, and it's good to have always a general plus a more straight vision of what GATK is really doing).
There is some confusion here. In case you need to call for both SNPs and Indels one should use genotypelikelihoodsmodel = BOTH. I dont think --output_mode has "BOTH" option.
yes, I probably wrote too fast my comment. it should be "-glm BOTH". the output mode option should be used when you want to get "all confident sites" or "all sites" instead of default "variants only". thanks for pointing it out, I'll edit my previous comment.