Split vcf file to fit tbi requirements
10 weeks ago

I've been using HaplotypeCaller for SNP calling against the wheat reference. I split the wheat genome by chromosome (1A-7D), but I misjudged how large wheat chromosomes are: most of the resulting files are still too big to be indexed. Apparently I'm not the only one (https://github.com/broadinstitute/gatk/issues/8192), and this is a known issue (tbi indexing is limited to coordinates below 2^29, i.e. about 512 Mbp).

GATK's GenotypeGVCFs requires an index to run.

However, instead of splitting the chromosomes, re-aligning my samples (I have >300 samples) and then calling SNPs again, I was wondering if it's possible to divide each VCF file in two based on coordinates and then index the pieces after manually adding the headers.

Here are a few lines from one of my VCFs.

Chr1A   536331276       .       A       <NON_REF>       .       .       END=536657172   GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
Chr1A   536657173       .       G       <NON_REF>       .       .       END=536657190   GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,36
Chr1A   536657191       .       C       <NON_REF>       .       .       END=537008707   GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
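
To make the limit concrete, here is a minimal Python sketch that checks each of the three records above against 2^29 = 536,870,912. The helper names (`block_end`, `crosses_tbi_limit`) are made up for illustration, and the exact off-by-one boundary used here (a record is a problem once its END goes past 2^29) is an assumption rather than something confirmed in the thread.

```python
# Sketch: flag gVCF records whose span goes past the tbi coordinate limit.
# TBI_MAX and the helper names are illustrative, not part of any tool.
TBI_MAX = 2 ** 29  # 536,870,912

SAMPLE = [
    "Chr1A 536331276 . A <NON_REF> . . END=536657172 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0",
    "Chr1A 536657173 . G <NON_REF> . . END=536657190 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,36",
    "Chr1A 536657191 . C <NON_REF> . . END=537008707 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0",
]


def block_end(line):
    """Return the last position covered by a record (END= if present, else POS)."""
    fields = line.split()  # VCF fields contain no whitespace, so this is safe here
    end = int(fields[1])
    for item in fields[7].split(";"):
        if item.startswith("END="):
            end = int(item[4:])
    return end


def crosses_tbi_limit(line, limit=TBI_MAX):
    """True if the record extends beyond the assumed tbi limit."""
    return block_end(line) > limit


for record in SAMPLE:
    print(record.split()[1], block_end(record), crosses_tbi_limit(record))
```

Of the three records, only the last one is flagged: its reference block starts below 2^29 but its END (537,008,707) lies beyond it.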

Is this at all possible?
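
One thing to keep in mind with this idea: cutting the file in two does not change the POS values, so the second half would still contain coordinates above 2^29. The pieces would have to be renamed and shifted as well. Below is a rough Python sketch of what that could look like per record. Everything in it is an assumption for illustration: the split point, the `_part1`/`_part2` contig names, and the `N` placeholder REF for a reference block cut at the boundary (it would have to be replaced with the real reference base). The header `##contig` lines and the reference FASTA would need the same split, and variants with long REF alleles near the boundary are not handled.

```python
# Sketch: split gVCF records at a coordinate and shift the second half to start at 1.
SPLIT = 400_000_000  # hypothetical split point: last 1-based position of part 1


def get_end(info, default):
    """Return END= from an INFO string, or default if absent."""
    for item in info.split(";"):
        if item.startswith("END="):
            return int(item[4:])
    return default


def set_end(info, value):
    """Return the INFO string with its END= entry replaced by value."""
    return ";".join(
        "END=" + str(value) if item.startswith("END=") else item
        for item in info.split(";")
    )


def rebuild(fields, chrom, pos, end, offset, ref=None):
    """Return one tab-joined record, renamed and shifted down by offset."""
    out = list(fields)
    out[0] = chrom
    out[1] = str(pos - offset)
    if ref is not None:
        out[3] = ref
    out[7] = set_end(fields[7], end - offset)
    return "\t".join(out)


def split_record(line, split=SPLIT):
    """Return the record(s) that replace line after splitting at split."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    end = get_end(fields[7], pos)
    if end <= split:  # entirely in part 1, coordinates unchanged
        return [rebuild(fields, chrom + "_part1", pos, end, 0)]
    if pos > split:   # entirely in part 2, shifted to start at 1
        return [rebuild(fields, chrom + "_part2", pos, end, split)]
    # Reference block straddles the split point: cut it in two.
    return [
        rebuild(fields, chrom + "_part1", pos, split, 0),
        # "N" is a placeholder; the true reference base at split + 1 is needed.
        rebuild(fields, chrom + "_part2", split + 1, end, split, ref="N"),
    ]
```

With this split point, the last record shown above (POS 536657191, END=537008707) would become a `Chr1A_part2` record at POS 136657191 with END=137008707, which is well below the limit.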

snp HaplotypeCaller vcf gatk • 379 views

From the issue tracker:

Hi, we don't currently support indexes that long. We use a bai index for bams and tabix for vcf which only support up to 512 M. You need to use a CSI index for references that large but we don't support writing those. (Reading them is weird, I think we can read BAM csi indexes but not VCF ones).

Maybe you can create an index with bcftools (bcftools index). I think a CSI index is the default, or you can request it explicitly with option -c. As I understand it, though, the author wasn't sure whether that works for VCF files.
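
As a small sketch of that suggestion, the Python helper below only builds the bcftools command line, choosing `--csi` when the longest contig exceeds 2^29 and `--tbi` otherwise. The function name, the file name and the contig length are hypothetical; `bcftools index` expects a bgzip-compressed file, and whether GATK accepts the resulting index is a separate question that this does not answer.

```python
# Sketch: pick the bcftools index format from the longest contig length.
import subprocess  # only needed if the command is actually run

TBI_MAX = 2 ** 29  # 536,870,912


def bcftools_index_cmd(vcf_gz, max_contig_len):
    """Return the argv list for indexing a bgzipped VCF with bcftools."""
    flag = "--tbi" if max_contig_len <= TBI_MAX else "--csi"
    return ["bcftools", "index", flag, vcf_gz]


# Hypothetical file name and contig length, for illustration only.
cmd = bcftools_index_cmd("sample.Chr1A.g.vcf.gz", 600_000_000)
print(" ".join(cmd))
# To run it for real (requires bcftools on PATH):
# subprocess.run(cmd, check=True)
```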

You will have to try whether that works on your data, or maybe index a smaller file with bcftools first and see if it is accepted (I have never tried that myself). Or you could try GenomicsDBImport instead of GenotypeGVCFs.


Hi there, I just checked: it doesn't work with a csi index, and unfortunately they have no intention of adding that functionality.

I tried GenomicsDBImport, but it still requires an index (.idx file) generated by IndexFeatureFile, which in turn still requires a .tbi. I'll try with bcftools, but I may have to run the calling again, or call variants with bcftools instead, if I don't want to realign everything.

Thanks for your help.
