Question

Question about Variant Calling: Whole file VS multiple smaller files

0

Entering edit mode

9.4 years ago

Hatem Elshazly ▴ 60

Hello Everyone,

I wanted to test if the resulting VCF would be the same if I split a BAM file before variant calling against calling variants on the whole BAM as input.
I made 2 different runs: The 1st run was using the splitted BAMs as inputs to HaplotypeCaller which will output 4 VCF files for each splitted BAM Then calling GenotypeGVCFs on all of them using -V option (i.e. -V file1.vcf -V file2.vcf -V file3.vcf -V file4.vcf) the output of that step is one VCF. The 2nd run was using the whole original BAM as input to HaplotypeCaller which will output one VCF <whole_vcf> then calling GenotypeGVCFs.

Run1 Commands:
HaplotypeCaller, for each splitted file (chunk1.bam, chunk2.bam, chunk3.bam, chunk4.bam):
java -jar gatk3.jar -T HaplotypeCaller -R human_g1k_v37.fasta -D dbsnp_137.b37.excluding_sites_after_129.vcf -o chunk1.vcf -pairHMM VECTOR_LOGLESS_CACHING --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -A DepthPerAlleleBySample -stand_call_conf 30 -stand_emit_conf 10 -I chunk1.bam

GenotypeGVCFs: java -jar gatk3.jar -T GenotypeGVCFs -R human_g1k_v37.fasta -D dbsnp_137.b37.excluding_sites_after_129.vcf -o out_from_splits.vcf -A Coverage -A GCContent -A HaplotypeScore -A MappingQualityRankSumTest -A InbreedingCoeff -A FisherStrand -A QualByDepth -A ChromosomeCounts -V chunk1.vcf -V chunk2.vcf -V chunk3.vcf -V chunk4.vcf

Run2 commands are the same but they are called on a the whole original BAM instead of multiple chunks.

The resulting VCF from GenotypeGVCFs in the 2nd run <whole_vcf> was 33M whereas resulting VCF from the chunks of the 1st run was 7M.

I tried to check what went wrong so I merged the 4 VCFs from HaplotypeCaller using vcf-merge from vcftools then vcf-compare to compare between the merged VCF and the whole VCF (whole_vcf), the number of variants was identical and both files matched expected 3 mismatches detected.

Can anybody tell me what went wrong? Why are the outputs of GenotypeGVCF from the two experiments different? Should I merge the VCFs from the 1st run before Genotyping to get same results?

Thanks in advance,
Shazly

vcf genotyper variants splitting gatk • 3.7k views

ADD COMMENT • link updated 9.4 years ago by Carlos Borroto ★ 2.1k • written 9.4 years ago by Hatem Elshazly ▴ 60

0

Entering edit mode

I think you are not giving enough information. For example when you say you split the BAM file, do you mean you split a multi sample BAM file into BAM files with only data from one sample?, or do you split a single sample BAM file into BAM files with limited regions?

Also, can you share the exact commands you used in each step? The reason for the disagreement could be as simple as a missing argument in one command.

ADD REPLY • link 9.4 years ago by Carlos Borroto ★ 2.1k

0

Entering edit mode

Thanks for your reply, I updated the post with commands.

I split a single sample BAM into 4 BAM files with regions, I made sure manually the 4 chunks contained the reads from the original file.

ADD REPLY • link 9.4 years ago by Hatem Elshazly ▴ 60

score 2 · Accepted Answer · 2016-03-25

Okay, thanks for the extra info. A couple of notes and a possible improvement to what you are doing.

First, GenotypeGVCFs is not meant to genotype multiple partial VCF files of the same sample. It is meant for multiple full VCF files of different samples. I'm not sure this is why you are getting the wrong output, but I wouldn't be surprised if that's the case. Second, GenotypeGVCFs value is when run on data from multiple samples. In your case I don't expect you will gain anything over if you just run HaplotypeCaller.

Assuming you are getting your pipeline ready for multiple samples to be processed together. This is what I did recently for scather-gather in the HaplotypeCaller step.

I didn't split my BAM files, I just ran HaplotypeCaller with --intervals. I split my BED intervals file in multiple pieces and past one piece at a time. With BAM indexing there should be no penalty for running gatk on the full BAM but only process part of it.
I then used bcftools concat to concatenate all the partial VCFs for a single sample
I then ran GenotypeGVCFs in all my concatenated individual sample VCFs.

Hope it helps, Carlos