I've been using HaplotypeCaller for SNP calling against the wheat reference. I split the wheat reference by chromosome (1A-7D), but I misjudged how large wheat chromosomes are: most of the resulting files are still too big to be indexed. Apparently I'm not the only one (https://github.com/broadinstitute/gatk/issues/8192), and this is a known issue (tbi indexing is limited to coordinates below 2^29, roughly 537 Mb).
GATK's GenotypeGVCFs requires an index to run.
However, instead of splitting the chromosomes, re-aligning my samples (I have >300 samples) and then SNP calling again, I was wondering whether it's possible to divide each VCF file in two based on coordinates and then index the pieces after manually adding the headers.
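To make the idea concrete, here is a minimal Python sketch of what such a split would involve. Simply cutting the file at a coordinate would leave the second half with positions above 2^29, so the sketch also renames the contig and shifts the positions. The function name `split_block`, the `_part2` suffix and the `ref_base_at` lookup are placeholders of mine, not anything GATK or bcftools provides. It only handles <NON_REF> reference blocks (not variant records that span the boundary), and it assumes a reference split at the same boundary would also be needed at the genotyping step.

```python
def _get_end(info, pos):
    # Return the END= value from an INFO string, or pos if there is none.
    for item in info.split(";"):
        if item.startswith("END="):
            return int(item[4:])
    return pos


def _with_end(info, new_end):
    # Replace the END= entry in an INFO string, leaving other keys untouched.
    return ";".join(
        f"END={new_end}" if item.startswith("END=") else item
        for item in info.split(";")
    )


def split_block(line, boundary, ref_base_at, suffix="_part2"):
    """Split one tab-separated gVCF record at `boundary`.

    boundary    : last 1-based position kept on the original contig.
    ref_base_at : hypothetical callable (chrom, pos) -> reference base,
                  needed because the second half needs its own REF base.
    Returns a list with one or two record strings.
    """
    f = line.rstrip("\n").split("\t")
    chrom, pos, info = f[0], int(f[1]), f[7]
    end = _get_end(info, pos)

    if end <= boundary:
        # Entirely in the first half: keep as is.
        return ["\t".join(f)]

    if pos > boundary:
        # Entirely in the second half: rename contig and shift coordinates.
        g = f[:]
        g[0], g[1] = chrom + suffix, str(pos - boundary)
        g[7] = _with_end(info, end - boundary)
        return ["\t".join(g)]

    # Block straddles the boundary: cut it into two blocks.
    first, second = f[:], f[:]
    first[7] = _with_end(info, boundary)
    second[0], second[1] = chrom + suffix, "1"
    second[3] = ref_base_at(chrom, boundary + 1)
    second[7] = _with_end(info, end - boundary)
    return ["\t".join(first), "\t".join(second)]
```

The header would need a matching `##contig` line for the renamed contig, which is the "manually adding the headers" part.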
Here are a few lines from one of my VCFs.
Chr1A 536331276 . A <NON_REF> . . END=536657172 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
Chr1A 536657173 . G <NON_REF> . . END=536657190 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,36
Chr1A 536657191 . C <NON_REF> . . END=537008707 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
Is this at all possible?
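For reference, a quick Python check of the lines above against the 2^29 limit. This is only a sketch: it assumes the index has to cover a block's END as well as its POS, and the exact off-by-one behaviour at the boundary is my assumption (these values are far enough from it not to matter).

```python
TBI_MAX_POS = 2**29  # 536,870,912

# The three records pasted above (whitespace-separated as shown here).
records = """\
Chr1A 536331276 . A <NON_REF> . . END=536657172 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
Chr1A 536657173 . G <NON_REF> . . END=536657190 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,36
Chr1A 536657191 . C <NON_REF> . . END=537008707 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
"""


def exceeds_tbi(line):
    # True if the record's POS or END lies beyond the tbi coordinate limit.
    fields = line.split()
    end = int(fields[1])
    for item in fields[7].split(";"):
        if item.startswith("END="):
            end = int(item[4:])
    return end > TBI_MAX_POS


flags = [exceeds_tbi(line) for line in records.splitlines()]
print(flags)  # [False, False, True]
```

So the third block is the one that runs past the limit, which is why a split would have to cut through reference blocks rather than just between records.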
From the issue tracker:
Maybe you can create an index with bcftools:
bcftools index
I think the default is a csi index (or use option -c). As I understand it, though, the author wasn't sure whether it works for VCF files. You would have to try it on your data, or test on a smaller file first: index it with bcftools and see whether that is accepted (I have never tried that myself). Or you could try GenomicsDBImport instead of GenotypeGVCFs.
Hi there, I just checked and it doesn't work with a csi index, and unfortunately they have no intention of adding that functionality.
I tried GenomicsDBImport, but it still requires an index (.idx file) generated by IndexFeatureFile, which in turn still requires a .tbi. I'll try with bcftools, but I may have to run it again, or use bcftools to call my files, if I don't want to realign everything.
Thanks for your help.