Hi all,
i am trying to merge two VCF files using bcftools merge.
However, my command bcftools merge -m id VCF_d.vcf.gz VCF_p.vcf.gz -o merged.vcf.gz --force-samples
returns the following
The REF prefixes differ: TG vs GA (2,2) Failed to merge alleles at 18:786377 in VCF_d.vcf.gz
These are the entries in the above mentioned files.
VCF_d 18 786377 AX-120223469 TG T . . AC=0;AN=4144
VCF_p 18 786377 AX-120223469 GA T . . AN=0;AC=0
the header info for VCF_d is
fileformat=VCFv4.1
FILTER=<ID=PASS,Description="All filters passed" file
Date=20200421
source=apt-format-result:1.20.0(2.8.0)
reference=NCBI37
phasing=none
FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype" calls-file=./AxiomGT1.calls.txt annotation-file=/home/ubuntu/projects/data_release/Axiom_PMRA.na35.annot.db/Axiom_PMRA.na35.annot.db notes="Reporting for indels: For insertions, the reported chromosome position will be the location of first base that was inserted. For deletions, it will be the location of the preceding base even for a deletion within a nucleotide repeat (e.g., TTT[-/T]TTTTT)."
contig=<ID=1,length=249250621>
contig=<ID=2,length=243199373>
contig=<ID=3,length=198022430>
contig=<ID=4,length=191154276>
contig=<ID=5,length=180915260>
contig=<ID=6,length=171115067>
contig=<ID=7,length=159138663>
contig=<ID=8,length=146364022>
contig=<ID=9,length=141213431>
contig=<ID=10,length=135534747>
contig=<ID=11,length=135006516>
contig=<ID=12,length=133851895>
contig=<ID=13,length=115169878>
contig=<ID=14,length=107349540>
contig=<ID=15,length=102531392>
contig=<ID=16,length=90354753>
contig=<ID=17,length=81195210>
contig=<ID=18,length=78077248>
contig=<ID=19,length=59128983>
contig=<ID=20,length=63025520>
contig=<ID=21,length=48129895>
contig=<ID=22,length=51304566>
contig=<ID=MT,length=16569>
contig=<ID=X,length=155270560>
contig=<ID=Y,length=59373566>
export-vcf-file=VCF_ALL.vcf INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes"> INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> bcftools_viewVersion=1.6+htslib-1.6
VCF_p is
fileformat=VCFv4.1
FILTER=<ID=PASS,Description="All filters passed"> fileDate=20180409 source=apt-format-result:1.20.0(2.8.0) phasing=none
FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> batch-folder=L:\Genotyping\other\AxiomOutputLogs\GT_17_003 annotation-file=C:\Users\Public\Documents\Library\Axiom_PMRA.r2\Axiom_PMRA.na35.annot.db notes="Reporting for indels: For insertions, the reported chromosome position will be the location of first base that was inserted. For deletions, it will be the location of the preceding base even for a deletion within a nucleotide repeat (e.g., TTT[-/T]TTTTT)." export-vcf-file=L:\Genotyping\GenetitanResults\VCFs\rawVCFs\GT_17_003_VCF_ALL.vcf
contig=<ID=1,length=249250621>
contig=<ID=2,length=243199373>
contig=<ID=3,length=198022430>
contig=<ID=4,length=191154276>
contig=<ID=5,length=180915260>
contig=<ID=6,length=171115067>
contig=<ID=7,length=159138663>
contig=<ID=8,length=146364022>
contig=<ID=9,length=141213431>
contig=<ID=10,length=135534747>
contig=<ID=11,length=135006516>
contig=<ID=12,length=133851895>
contig=<ID=13,length=115169878>
contig=<ID=14,length=107349540>
contig=<ID=15,length=102531392>
contig=<ID=16,length=90354753>
contig=<ID=17,length=81195210>
contig=<ID=18,length=78077248>
contig=<ID=19,length=59128983>
contig=<ID=20,length=63025520>
contig=<ID=21,length=48129895>
contig=<ID=22,length=51304566>
contig=<ID=MT,length=16569>
contig=<ID=X,length=155270560>
contig=<ID=Y,length=59373566>
liftOverProgram=CrossMap(https://sourceforge.net/projects/crossmap/) liftOverFile=/home/ubuntu/projects/uploads/genotyping/scripts/chains/GRCh37_to_GRCh38.chain.gz new_reference_genome=/home/ubuntu/projects/uploads/genotyping/scripts/ref/Homo_sapiens.GRCh38.dna.toplevel.fa
liftOverTime=December10,2018
The contig info length is the same in both VCF files. Both VCF files were sent to me from a company. Surely they would not use different reference genomes as the VCF contain SNPs from one cohort so will need to merge together before analysing.
I would appreciate if anyone could help me out!
I can't find information with respect to the reference genome used to call the variants in
VCF_d
, but inVCF_p
it seems they used GRCh37 as reference, and then they liftover the variant positions into GRCh38 using Crossmap liftover tool. Can you check if the liftover process was also done forVCF_d
, or if directly the reference genome used was GRCh38?I don't think the liftover process was done for VCF_d but it does say reference=NCBI37. On the NCBI link here (https://www.ncbi.nlm.nih.gov/snp/?term=rs201777329) the particular SNP causing me the issue seems to be at two different positions 18:786377 (GRCh38) and 18:786378 (GRCh37).
Then it looks like one VCF is based on GRCh37 coordinates and the other one on GRCh38. In order to merge, you need both VCFs to be based on the same reference genome. I guess, since a company did this, the best is to contact them, so that they can liftover the variants to GRCh38.
Thanks for your help. I will do that. Just to update, I downloaded the reference fasta used by the company. I then used CrossMap and lifted it over to GRCh38.
It seems to have merged successfully.
Edit your post and make the following changes:
code
option) to present your post better. You can use backticks for inline code (`text` becomestext
), or select a chunk of text and use the highlighted button to format it as a code block. You used the double quote option, which is for quoting plain text from a source, not for formatting code as it ends up messing with code elements. Paste the VCF headers again and format them properly as it is not fixable in its current form.merge
tag -bcftools
andVCF
are subject matter relevant tag,merge
is not a tag, it's merely an operation you want to perform.