Question

Split vcf file to fit tbi requirements

8 months ago

cassandriatayfernandez • 0

I've been using HaplotypeCaller for SNP calling against the wheat reference. I split the wheat genome by chromosome (1A-7D), but I misjudged how large wheat chromosomes are: most of the resulting files are still too big to be indexed. Apparently I'm not the only one (https://github.com/broadinstitute/gatk/issues/8192), and this is a known issue (tbi indexing is limited to coordinates up to 2^29, i.e. about 512 Mbp).
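To make the limit concrete, here is a minimal Python sketch of the arithmetic. The function name `fits_tbi` is made up for illustration, and treating 2^29 as an inclusive bound is an assumption (the exact off-by-one behaviour depends on the tool). The positions are the END values from the sample records further down.

```python
# 2^29 = 536,870,912, the "512 M" limit quoted for tbi/bai indexes.
TBI_MAX_COORD = 2 ** 29


def fits_tbi(coordinate):
    """Return True if a 1-based coordinate is within the tbi limit.

    Assumption: the bound is treated as inclusive here.
    """
    return coordinate <= TBI_MAX_COORD


# END values taken from the three sample gVCF records in this post.
for end in (536657172, 536657190, 537008707):
    print(end, fits_tbi(end))
```

The last reference block ends at 537008707, which is past 2^29, so the sample records sit right where the index limit is crossed.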

GATK's GenotypeGVCFs requires an index to run.

However, instead of splitting the chromosomes, re-aligning my samples (I have >300 samples) and then calling SNPs again, I was wondering whether it's possible to divide each VCF file in two based on coordinates and then index the halves after manually adding the headers.

Here are a few lines from one of my VCFs.

Chr1A   536331276       .       A       <NON_REF>       .       .       END=536657172   GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
Chr1A   536657173       .       G       <NON_REF>       .       .       END=536657190   GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,36
Chr1A   536657191       .       C       <NON_REF>       .       .       END=537008707   GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0

Is this at all possible?
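A rough Python sketch of what splitting by coordinate could look like for gVCF records like the ones above. Everything here is an assumption rather than a tested GATK workflow: the function name `split_gvcf_record`, the contig names `Chr1A_part1` and `Chr1A_part2`, and the choice to shift part 2 coordinates so they start at 1 (without a shift, positions in the second half would still exceed 2^29). A reference block that straddles the split point is cut in two, and since the true reference base at the start of the second half is not in the VCF, the sketch writes `N` as a placeholder; a real pipeline would look it up in the reference FASTA. The header (##contig lines) and the reference used downstream would also have to match the new contig names and lengths.

```python
import re


def split_gvcf_record(line, split_pos,
                      part1="Chr1A_part1", part2="Chr1A_part2"):
    """Route one tab-delimited gVCF data line to part 1 and/or part 2.

    split_pos is the last 1-based position kept in part 1.
    Returns one line, or two if a reference block straddles split_pos.
    Contig names are hypothetical.
    """
    f = line.rstrip("\n").split("\t")
    pos = int(f[1])
    m = re.search(r"(?:^|;)END=(\d+)", f[7])
    # Without END, the record covers the length of the REF allele.
    end = int(m.group(1)) if m else pos + len(f[3]) - 1

    def with_end(info, new_end):
        return re.sub(r"(^|;)END=\d+",
                      lambda mm: f"{mm.group(1)}END={new_end}", info)

    out = []
    if pos <= split_pos:
        first = f.copy()
        first[0] = part1
        if end > split_pos:
            if not m:
                raise ValueError("variant record spans the split point")
            first[7] = with_end(f[7], split_pos)  # truncate the block
        out.append("\t".join(first))
    if end > split_pos:
        second = f.copy()
        second[0] = part2
        if pos <= split_pos:
            second[1] = "1"   # block continues at the start of part 2
            second[3] = "N"   # placeholder: real base must come from the FASTA
        else:
            second[1] = str(pos - split_pos)  # shift into part 2 coordinates
        if m:
            second[7] = with_end(f[7], end - split_pos)
        out.append("\t".join(second))
    return out


record = "\t".join(["Chr1A", "536657191", ".", "C", "<NON_REF>", ".", ".",
                    "END=537008707", "GT:DP:GQ:MIN_DP:PL", "0/0:0:0:0:0,0,0"])
for piece in split_gvcf_record(record, split_pos=536800000):
    print(piece)
```

The split position of 536800000 is only there to show the straddling case on the third sample record; in practice the split point would be chosen so that both halves stay below the tbi limit.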

snp HaplotypeCaller vcf gatk • 658 views

ADD COMMENT • link 7 months ago by cassandriatayfernandez • 0

From the issue tracker:

Hi, we don't currently support indexes that long. We use a bai index for bams and tabix for vcf which only support up to 512 M. You need to use a CSI index for references that large but we don't support writing those. (Reading them is weird, I think we can read BAM csi indexes but not VCF ones).

Maybe you can create an index with bcftools (bcftools index); I think a csi index is the default, or use option -c. But as I understand it, the author wasn't sure whether that works for VCF files.

You'll have to try whether that works on your data, or maybe test on a smaller file first: index it with bcftools and see if that is accepted (I've never tried that myself). Or you could try GenomicsDBImport instead of GenotypeGVCFs.
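For trying this on a small file, a minimal Python wrapper might look like the sketch below. The helper names and the file name `Chr1A.g.vcf.gz` are hypothetical; the input is assumed to be bgzip-compressed, and `bcftools` is assumed to be on the PATH. It only builds the CSI index; whether GATK then accepts it is a separate question.

```python
import shutil
import subprocess


def csi_index_command(vcf_gz):
    """Build the bcftools command for a CSI index (--csi is the long form of -c)."""
    return ["bcftools", "index", "--csi", vcf_gz]


def build_csi_index(vcf_gz):
    """Run bcftools index on a bgzip-compressed VCF, writing <file>.csi."""
    if shutil.which("bcftools") is None:
        raise RuntimeError("bcftools not found on PATH")
    subprocess.run(csi_index_command(vcf_gz), check=True)


# Hypothetical file name, shown without running anything.
print(" ".join(csi_index_command("Chr1A.g.vcf.gz")))
```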

ADD REPLY • link 8 months ago by Michael 55k

Hi there, I just checked: it doesn't work with a csi index, and unfortunately they have no intention of adding that functionality.

I tried GenomicsDBImport, but it still requires an index (.idx file) generated by IndexFeatureFile, which in turn still requires a .tbi. I'll try with bcftools, but I may have to run it again, or use bcftools to call my files if I don't want to realign everything.

Thanks for your help.

ADD REPLY • link 7 months ago by cassandriatayfernandez • 0