Question

vcf-merge outputting all genotypes as ./.

0

Entering edit mode

6.2 years ago

biosol ▴ 170

Hi all,

I'm trying to merge 2 whole-genome vcf files with different individuals with vcf-merge. The program generates seems to merge both files but, surprisingly, it puts a ./. in all the SNPs of the second file (as if they hadn't been genotyped, but they have!)

I've also tried to merge both files using bcftools merge but my command returns the following: The REF prefixes differ: GC vs C (2,1) Failed to merge alleles at chrM:16375063 in myfile.vcf.gz

I've read in this post that I can silence this error by using bcftools +fixrefplugin, but I don't actually understand what it does, as it asks you to give a reference data file to obtain missing data from there.

If anyone could help with any of both options, I'd be really grateful! Thank you very much in advanced!

vcf-merge vcf • 2.5k views

ADD COMMENT • link 6.2 years ago by biosol ▴ 170

0

Entering edit mode

Hello sonia.olaechea ,

have you created the vcf files by your own or have you obtain them from somewhere? Could you please show the header of both files and the first few variants?

fin swimmer

ADD REPLY • link 6.2 years ago by finswimmer 16k

0

Entering edit mode

Hi @finswimmer, one of my vcfs is kind of long (I have more than 1000 individuals), so I will omit their IDs and most of their genotypes okey? Also, both vcfs have a lot of header lines describing the contigs length, but I have removed them as well, else the comment will be too long. I've done this just to make the comment smaller... but tell me if you need something else! I've just received both vcfs so I'm not quite sure whether someone has been doing something else with them before...

myfile1.vcf.gz

##fileformat=VCFv4.2
##fileDate=20181023
##source=PLINKv1.90
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Unknown_BNADN_74075_3   Unknown_BNADN_74075_28  Unknown_BNADN_74075_33  
Unknown_BNADN_74075_49  Unknown_BNADN_74075_70 ...
1       752721  rs3131972       G       A       .       .       PR      GT      0/0     0/0     0/0     0/0     0/0     0/1     0/1     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0    0/1      0/1     
0/0     0/1     0/1     0/0     0/0     0/0     0/1     0/0     0/0     0/1     0/1     0/1     0/0 ...

myfile2.vcf.gz

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Unphased genotypes">
##fileDate=2018-08-15
##bcftools_viewVersion=1.3.1+htslib-1.3.1
##bcftools_viewCommand=view -m2 -M2 -v snps,indels -Oz -o CDC0040.platypus.biallelic_SNPs_INDEL.vcf.gz CDC0040.platypus.vcf.gz
##bcftools_annotateVersion=1.3.1+htslib-1.3.1
##bcftools_annotateCommand=annotate -x FORMAT CDC0040.platypus.biallelic_SNPs_INDEL.vcf.gz
##bcftools_mergeVersion=1.3.1+htslib-1.3.1
##bcftools_annotateCommand=annotate -x QUAL,FILTER,INFO 
##/data/d2/genomics/Santos/vcf/WGS_CDC.platypus.46ind.vcf.gz
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CDC0040 CDC0061 CDC0091 CDC0094...
chr1    69899   .       T       A       .       .       .       GT      0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     ./.     ./.     ./.     0/0     0/0     0/0     0/0     0/0     0/0     ./.     ./.     0/0     ./. ./.      0/0     ./.     ./.     0/0     ./.     0/0     ./.     0/0     0/0     ./.     0/0     ./.     ./.     ./.     ./.     0/0     0/0     ./.     0/0     ./.     0/0     ./.     ./.     ./.
chr1    727841  .       G       A       .       .       .       GT      0/0     1/0     0/0     1/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     1/0     0/0     0/0     0/0     0/0 0/0      0/0     0/0     0/0     1/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     1/0     0/0     0/0     0/0

ADD REPLY • link 6.2 years ago by biosol ▴ 170

0

Entering edit mode

Hello sonia.olaechea ,

thanks for posting it. The header lines describing the contig length would be the most interesting part :) Because here one can see if the same reference genome was used.

But what you can see in the example you have posted, is that the chromosome names are different. myfile1.vcf.gz uses 1 and myfile2.vcf.gz uses chr1. This can be fixed quite easily. But before let us check, if the same reference genome was used by comparing the contig length.

fin swimmer

ADD REPLY • link 6.2 years ago by finswimmer 16k

0

Entering edit mode

Ouch! Newbies mistake hehe sorry! And I'll start by changing the chromosomes names, thank you very much! Here are the contig lengths (lengths should be equal in both files right?) :

myfile1.vcf

`##contig=<ID=1,length=249222528>
 ##contig=<ID=2,length=243041412>
 ##contig=<ID=3,length=197848557>
 ##contig=<ID=4,length=190904951>
 ##contig=<ID=5,length=180698414>
 ##contig=<ID=6,length=170907735>
 ##contig=<ID=7,length=159124174>
 ##contig=<ID=8,length=146296405>
 ##contig=<ID=9,length=141077353>
 ##contig=<ID=10,length=135440227>
 ##contig=<ID=11,length=134945121>
 ##contig=<ID=12,length=133831320>
 ##contig=<ID=13,length=115103151>
 ##contig=<ID=14,length=107287664>
 ##contig=<ID=15,length=102397318>
 ##contig=<ID=16,length=90170096>
 ##contig=<ID=17,length=81103683>
 ##contig=<ID=18,length=78015181>
 ##contig=<ID=19,length=59097934>
 ##contig=<ID=20,length=62960293>
 ##contig=<ID=21,length=48099611>
 ##contig=<ID=22,length=51217955>
 ##contig=<ID=23,length=155232839>
 ##contig=<ID=24,length=24444623>
 ##contig=<ID=25,length=2699247>
 ##contig=<ID=26,length=16392>
 ##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">`

myfile2.vcf

 ##contig=<ID=chrM,length=16571>
 ##contig=<ID=chr1,length=249250621>
 ##contig=<ID=chr2,length=243199373>
 ##contig=<ID=chr3,length=198022430>
 ##contig=<ID=chr4,length=191154276>
 ##contig=<ID=chr5,length=180915260>
 ##contig=<ID=chr6,length=171115067>
 ##contig=<ID=chr7,length=159138663>
 ##contig=<ID=chr8,length=146364022>
 ##contig=<ID=chr9,length=141213431>
 ##contig=<ID=chr10,length=135534747>
 ##contig=<ID=chr11,length=135006516>
 ##contig=<ID=chr12,length=133851895>
 ##contig=<ID=chr13,length=115169878>
 ##contig=<ID=chr14,length=107349540>
 ##contig=<ID=chr15,length=102531392>
 ##contig=<ID=chr16,length=90354753>
 ##contig=<ID=chr17,length=81195210>
 ##contig=<ID=chr18,length=78077248>
 ##contig=<ID=chr19,length=59128983>
 ##contig=<ID=chr20,length=63025520>
 ##contig=<ID=chr21,length=48129895>
 ##contig=<ID=chr22,length=51304566>
 ##contig=<ID=chrX,length=155270560>
 ##contig=<ID=chrY,length=59373566>
 ##fileDate=2018-08-15

ADD REPLY • link 6.2 years ago by biosol ▴ 170

0

Entering edit mode

Houston, we have a problem :)

Yes, the contig length must be the same. As you can see, they aren't. myfile2.vcf is clearly based on GRCh37 (hg19). Compare the contig length with the length for assembled molecules. But I have no idea what reference genome is used for myfile1.vcf. Have a look in the header if there is a line with ##reference and/or ##command. Maybe the filenames given there can give us a hint.

Where are these files from?

ADD REPLY • link 6.2 years ago by finswimmer 16k

0

Entering edit mode

Oh, yes, we have a problem! The only thing I can find in the header that could give us a clue is ##source:PLINKv1.90 and ##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome"> , what is actually not very informative, right'

By the way, I've tried changing the chromosomes names in the first column and it still gives an error, so I guess the main thing is the reference issue... I'll try asking someone and I'll let you know if I get something else! Thank you sooo much for everything :)

ADD REPLY • link 6.2 years ago by biosol ▴ 170

0

Entering edit mode

Hi again! It turns out data myfile1.vcf comes from a chip and myfile2.vcf from a sequencing process (and it has been aligned afterwards)!! Both vcfs have as a reference hg19, but `myfile2.vcf' has more SNPs than the other file!

So I guess I will generate a list of the SNPs in myfile1.vcf, extract them from myfile2.vcfand generate a new file containing all the individuals! I hope this works out!

Thank you very much for your time and advices :)

ADD REPLY • link 6.2 years ago by biosol ▴ 170

0

Entering edit mode

I'm afraid this will not work. To compare variants or merge variant information is absolut mandatory that the coordinates (chromosome and position) depends on the same reference genome. Your contig length differ. It cannot be the same reference for both file. You cannot be sure that a chr1 69899 in the one file, represents the same region in the other.

There are ways to liftover positions from one reference genome to the other. But for this you must know your origin and your destination genome.

ADD REPLY • link 6.2 years ago by finswimmer 16k

0

Entering edit mode

Hi @finswimmer! Yes, you were right, it hasn't worked :( So, right now I'm trying to fix the REF alleles using bcftools plugin +fixref. I've launched this command:

bcftools +fixref myfile1.vcf.gz -- -f hg37.all.fa.gz

And it returns this output:

# SC, guessed strand convention
SC      TOP-compatible  0
SC      BOT-compatible  0
# ST, substitution types
ST      A>C     26078   3.9%
ST      A>G     108869  16.4%
ST      A>T     3872    0.6%
ST      C>A     29644   4.5%
ST      C>G     15605   2.4%
ST      C>T     146525  22.1%
ST      G>A     146363  22.1%
ST      G>C     17303   2.6%
ST      G>T     29676   4.5%
ST      T>A     4394    0.7%
ST      T>C     108839  16.4%
ST      T>G     25933   3.9%
# NS, Number of sites:
NS      total           681862
NS      ref match       561656  84.7%
NS      ref mismatch    101449  15.3%
NS      flipped         2775    0.4%
NS      swapped         98670   14.9%
NS      flip+swap       17817   2.7%
NS      unresolved      0       0.0%
NS      fixed pos       0       0.0%
NS      skipped         18757
NS      non-ACGT        1458
NS      non-SNP         17299
NS      non-biallelic   0

Do you know what is meant by TOP and BOT-compatible and do you know if TOP 0 means that the process has run correctly? I've read this post, but I don't achieve to totally understand it...

ADD REPLY • link 6.2 years ago by biosol ▴ 170