Merge two phased vcf files from 2 pops keeping same ref allele in merged file
1
0
Entering edit mode
6.2 years ago
Mr Locuace ▴ 180

Hello, I want to merge 2 vcf files from 2 different human populations. For each population, this is the pipeline that I am using starting with Plink files:

  1. Flip strands according to reference genome using snpflip: https://github.com/mdshw5/snp-flip
  2. Get vcf files for each chromosome using Plink
  3. Phasing with Beagle v4.1
  4. Compress files with tabix -bgzip
  5. Concatenate chromosome files using bcftools -concat
  6. Merge 2 pops with bcftools -merge??

Even though I flip ref alleles according to the reference genome (hg19) in each population, at some SNPs the REF alleles are still flipped between the populations. When I try to merge the files with bcftools -merge, I get this message:

#The REF prefixes differ: T vs C (1,1)
#Failed to merge alleles at 10:252693 in ...

This is not a phasing Beagle effect, since the files immediately before and after phasing have the same REF alleles. I thought about just flipping the REF and ALT columns accordingly using a custom-made R script, but then my phased genotypes at these SNPs will not correspond with the flipped alleles. I would like the phased genotypes to reflect the corresponding REF and ALT alleles.

Does anyone know how to handle this with bcftools of with other program?

Below find the headers of my vcf files with only 3 individuals per population.

Any help or suggestion would be very welcome !!

pop1.vcf.gz:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##filedate=20180903
##source="beagle.jar (r1399)"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated correlation between most probable ALT dose and true ALT dose">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated correlation between estimated ALT dose [P(RA) + 2*P(AA)] and true ALT dose">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=10>
##bcftools_annotateVersion=1.9+htslib-1.9
##bcftools_annotateCommand=annotate -x FORMAT/DS,FORMAT/GP -o /path/pop1_195_10_GT.vcf /path/pop1_195_10_phased.vcf.gz; Date=Tue Sep 4 15:50:41 2018
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=1>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##bcftools_concatVersion=1.9+htslib-1.9
##bcftools_concatCommand=concat path/pop1_195_10_GT.vcf.gz /path/pop1_195_11_GT.vcf.gz /path/pop1_195_12_GT.vcf.gz /path/pop1_195_13_GT.vcf.gz /path/pop1_195_14_GT.vcf.gz /path/pop1_195_15_GT.vcf.gz /path/pop1_195_16_GT.vcf.gz /path/pop1_195_17_GT.vcf.gz /path/pop1_195_18_GT.vcf.gz /path/pop1_195_19_GT.vcf.gz /path/pop1_195_1_GT.vcf.gz /path/pop1_195_20_GT.vcf.gz /path/pop1_195_21_GT.vcf.gz /path/pop1_195_22_GT.vcf.gz /path/pop1_195_2_GT.vcf.gz /path/pop1_195_3_GT.vcf.gz //path/pop1_195_4_GT.vcf.gz /path/pop1_195_5_GT.vcf.gz /path/pop1_195_6_GT.vcf.gz /path/pop1_195_7_GT.vcf.gz /path/pop1_195_8_GT.vcf.gz /path/pop1_195_9_GT.vcf.gz; Date=Tue Sep 4 16:13:54 2018
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  s_1 s_2 s_3
10  144847  rs11253478  C   T   .   PASS    AR2=1;DR2=1;AF=0.082    GT  0|0 0|0 0|1
10  244561  rs2448366   G   A   .   PASS    AR2=1;DR2=1;AF=0.446    GT  1|1 0|1 0|0
10  252693  rs2379078   T   C   .   PASS    AR2=1;DR2=1;AF=0.238    GT  0|0 0|0 0|1

pop2.vcf.gz:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##filedate=20180830
##source="beagle.jar (r1399)"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated correlation between most probable ALT dose and true ALT dose">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated correlation between estimated ALT dose [P(RA) + 2*P(AA)] and true ALT dose">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=10>
##bcftools_annotateVersion=1.9+htslib-1.9
##bcftools_annotateCommand=annotate -x FORMAT/DS,FORMAT/GP -o /path/pop2_10_GT.vcf /path/pop2_10_phased.vcf.gz ' '; Date=Tue Sep 4 12:38:31 2018
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=1>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##bcftools_concatVersion=1.9+htslib-1.9
##bcftools_concatCommand=concat /path/pop2_10_GT.vcf.gz /path/pop2_11_GT.vcf.gz /path/pop2_12_GT.vcf.gz /path/pop2_13_GT.vcf.gz /path/pop2_14_GT.vcf.gz /path/pop2_15_GT.vcf.gz /path/pop2_16_GT.vcf.gz /path/pop2_17_GT.vcf.gz /path/pop2_18_GT.vcf.gz /path/pop2_19_GT.vcf.gz /path/pop2_1_GT.vcf.gz /path/pop2_20_GT.vcf.gz /path/pop2_21_GT.vcf.gz /path/pop2_22_GT.vcf.gz /path/pop2_2_GT.vcf.gz /path/pop2_3_GT.vcf.gz /path/pop2_4_GT.vcf.gz /path/pop2_5_GT.vcf.gz /path/pop2_6_GT.vcf.gz /path/pop2_7_GT.vcf.gz /path/pop2_8_GT.vcf.gz /path/pop2_9_GT.vcf.gz; Date=Tue Sep 4 16:14:14 2018
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ind_1   ind_2   ind_3
10  144847  rs11253478  C   T   .   PASS    AR2=1;DR2=1;AF=0.474    GT  0|0 0|1 0|0
10  244561  rs2448366   G   A   .   PASS    AR2=1;DR2=1;AF=0.321    GT  1|1 1|0 0|1
10  252693  rs2379078   C   T   .   PASS    AR2=1;DR2=1;AF=0.359    GT  1|1 1|0 0|1
bcftools beagle plink • 6.8k views
ADD COMMENT
3
Entering edit mode

Hello Mr Locuace ,

  1. Flip strands according to reference genome using snpflip: https://github.com/mdshw5/snp-flip

I don't know snp-flip. But the manual says "outputs each SNP on the reverse strand" and not in the reference strand. To make sure that REF and ALT are set according the reference genome I would suggest the bcftools plugin fixref.

$ bcftools plugin fixref -h                                                                                                                     

About: This tool helps to determine and fix strand orientation.
       Currently the following modes are recognised:
           flip  .. flips non-ambiguous SNPs and ignores the rest
           id    .. swap REF/ALT and GTs using the ID column to determine the REF allele
           stats .. collect and print stats
           top   .. converts from Illumina TOP strand to fwd

       WARNING: Do not use the program blindly, make an effort to
       understand what strand convention your data uses! Make sure
       the reason for mismatching REF alleles is not a different
       reference build!!

Usage: bcftools +fixref [General Options] -- [Plugin Options]
Options:
   run "bcftools plugin" for a list of common options

Plugin options:
   -d, --discard               Discard sites which could not be resolved
   -f, --fasta-ref <file.fa>   Reference sequence
   -i, --use-id <file.vcf>     Swap REF/ALT using the ID column to determine the REF allele, implies -m id.
                               Download the dbSNP file from
                                   https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf
   -m, --mode <string>         Collect stats ("stats") or convert ("flip", "id", "top") [stats]

Examples:
   # run stats
   bcftools +fixref file.bcf -- -f ref.fa

   # convert from TOP to fwd
   bcftools +fixref file.bcf -Ob -o out.bcf -- -f ref.fa -m top

   # match the REF/ALT alleles based on the ID column, discard unknown sites
   bcftools +fixref file.bcf -Ob -o out.bcf -- -d -f ref.fa -i All_20151104.vcf.gz

   # assuming the reference build is correct, just flip to fwd, discarding the rest
   bcftools +fixref file.bcf -Ob -o out.bcf -- -d -f ref.fa -m flip

fin swimmer

ADD REPLY
0
Entering edit mode

Thank you very much @finswimmer !!

ADD REPLY
0
Entering edit mode
3.8 years ago

Thanks ! bcftools norm isnt the same ??

ADD COMMENT

Login before adding your answer.

Traffic: 2524 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6