Hi there,
I have three VCF files for a non-model species that I'd like to merge into a single VCF file. Each file corresponds with a unique set of individuals and a subset of SNPs that is overlapping among files.
It seems based on other posts that a popular tool for performing this task is bcftools merge
. I first compress my VCF files in bgzip format with
bgzip -ci infileA.vcf > outfileA.vcf.gz
This also creates an index file outfileA.vcf.gz.gzi
I next attempt to merge the files with
bcftools merge outfileA.vcf.gz outfileB.vcf.gz outfileC.vcf.gz
But I get an error message that the index files cannot be found, even though they are in the same directory: Failed to open outfileA.vcf.gz: could not load index
I've seen posts that discuss this error message when attempting to use other tools in bcftools...some of these posts describe the issue as resulting from the system attempting to open too many files at once. But the system I am working on allows over 4,000 files to be open at once, and either way I am working with only six files (including the compressed VCF files and index files).
When I check the format of my files with htsfile
I get the following:
htsfile outfileA.vcf.gz
outfileA.vcf.gz: VCF version 4.2 BGZF-compressed variant calling data
htsfile outfileA.vcf.gz.gzi
outfileA.vcf.gz.gzi: unknown data
This suggests there may be some problem with the index file?
I've also attempted to generate index files outside fo the bgzip command, using bcftools index
as follows:
bcftools index -t outfileA.vcf.gz -o outfileA.vcf.gz.tbi
But get an error about sorting: [E::hts_idx_push] Unsorted positions on sequence #1
. However, when I then attempt to sort prior to indexing:
bcftools sort outfileA.vcf.gz -o outfileA.sort.vcf.gz
I get another error message: [W::vcf_parse] Contig 'NW_083863.1' is not defined in the header. (Quick workaround: index the file with tabix.)
When I then attempt to index with tabix:
tabix outfileA.vcf.gz
I get another error, which seems to hint at a need for sorting: [ti_index_core] the file out of order at line 45
My main issue is getting bcftools merge
(or an analogous tool) to work for merging my VCF files. Has anyone else run into this same issue? Is there something obvious I'm doing wrong? Thanks for any tips!
Can you try:
If you get the same error, you might need to create some sort of dummy reference genome for your non-model organism.
Thanks for your quick reply @_r_am! I tried the command you suggested, which returned the error
[ti_index_core] the file out of order at line 45
Does your VCF header list all the contigs in the VCF?
Try:
The two files above should be identical.
Thanks @_r_am! I tried the code you suggested on one of my vcf.gz files (the same file for each). The task doesn't run to completion, even after several hours. Could be because my vcf is so large (~22Gb file size)? Thanks for your help.
Several hours is a lot of time. The first one should have run created some output almost immediately. Did that happen?