Hi all,
I'm trying to merge 2 whole-genome vcf files with different individuals with vcf-merge
.
The program generates seems to merge both files but, surprisingly, it puts a ./.
in all the SNPs of the second file (as if they hadn't been genotyped, but they have!)
I've also tried to merge both files using bcftools merge
but my command returns the following:
The REF prefixes differ: GC vs C (2,1)
Failed to merge alleles at chrM:16375063 in myfile.vcf.gz
I've read in this post that I can silence this error by using bcftools +fixref
plugin, but I don't actually understand what it does, as it asks you to give a reference data file to obtain missing data from there.
If anyone could help with any of both options, I'd be really grateful! Thank you very much in advanced!
Hello sonia.olaechea ,
have you created the
vcf
files by your own or have you obtain them from somewhere? Could you please show the header of both files and the first few variants?fin swimmer
Hi @finswimmer, one of my vcfs is kind of long (I have more than 1000 individuals), so I will omit their IDs and most of their genotypes okey? Also, both vcfs have a lot of header lines describing the contigs length, but I have removed them as well, else the comment will be too long. I've done this just to make the comment smaller... but tell me if you need something else! I've just received both vcfs so I'm not quite sure whether someone has been doing something else with them before...
myfile1.vcf.gz
myfile2.vcf.gz
Hello sonia.olaechea ,
thanks for posting it. The header lines describing the contig length would be the most interesting part :) Because here one can see if the same reference genome was used.
But what you can see in the example you have posted, is that the chromosome names are different.
myfile1.vcf.gz
uses1
andmyfile2.vcf.gz
useschr1
. This can be fixed quite easily. But before let us check, if the same reference genome was used by comparing the contig length.fin swimmer
Ouch! Newbies mistake hehe sorry! And I'll start by changing the chromosomes names, thank you very much! Here are the contig lengths (lengths should be equal in both files right?) :
myfile1.vcf
myfile2.vcf
Houston, we have a problem :)
Yes, the contig length must be the same. As you can see, they aren't.
myfile2.vcf
is clearly based on GRCh37 (hg19). Compare the contig length with the length for assembled molecules. But I have no idea what reference genome is used formyfile1.vcf
. Have a look in the header if there is a line with##reference
and/or##command
. Maybe the filenames given there can give us a hint.Where are these files from?
Oh, yes, we have a problem! The only thing I can find in the header that could give us a clue is
##source:PLINKv1.90
and##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
, what is actually not very informative, right'By the way, I've tried changing the chromosomes names in the first column and it still gives an error, so I guess the main thing is the reference issue... I'll try asking someone and I'll let you know if I get something else! Thank you sooo much for everything :)
Hi again! It turns out data
myfile1.vcf
comes from a chip andmyfile2.vcf
from a sequencing process (and it has been aligned afterwards)!! Both vcfs have as a reference hg19, but `myfile2.vcf' has more SNPs than the other file!So I guess I will generate a list of the SNPs in
myfile1.vcf
, extract them frommyfile2.vcf
and generate a new file containing all the individuals! I hope this works out!Thank you very much for your time and advices :)
I'm afraid this will not work. To compare variants or merge variant information is absolut mandatory that the coordinates (chromosome and position) depends on the same reference genome. Your contig length differ. It cannot be the same reference for both file. You cannot be sure that a
chr1 69899
in the one file, represents the same region in the other.There are ways to liftover positions from one reference genome to the other. But for this you must know your origin and your destination genome.
Hi @finswimmer! Yes, you were right, it hasn't worked :( So, right now I'm trying to fix the REF alleles using
bcftools
plugin+fixref
. I've launched this command:And it returns this output:
Do you know what is meant by TOP and BOT-compatible and do you know if TOP 0 means that the process has run correctly? I've read this post, but I don't achieve to totally understand it...