Hello, I want to merge 2 vcf files from 2 different human populations. For each population, this is the pipeline that I am using starting with Plink files:
- Flip strands according to reference genome using snpflip: https://github.com/mdshw5/snp-flip
- Get vcf files for each chromosome using Plink
- Phasing with Beagle v4.1
- Compress files with tabix -bgzip
- Concatenate chromosome files using bcftools -concat
- Merge 2 pops with bcftools -merge??
Even though I flip ref alleles according to the reference genome (hg19) in each population, at some SNPs the REF alleles are still flipped between the populations. When I try to merge the files with bcftools -merge, I get this message:
#The REF prefixes differ: T vs C (1,1)
#Failed to merge alleles at 10:252693 in ...
This is not a phasing Beagle effect, since the files immediately before and after phasing have the same REF alleles. I thought about just flipping the REF and ALT columns accordingly using a custom-made R script, but then my phased genotypes at these SNPs will not correspond with the flipped alleles. I would like the phased genotypes to reflect the corresponding REF and ALT alleles.
Does anyone know how to handle this with bcftools of with other program?
Below find the headers of my vcf files with only 3 individuals per population.
Any help or suggestion would be very welcome !!
pop1.vcf.gz:
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##filedate=20180903
##source="beagle.jar (r1399)"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated correlation between most probable ALT dose and true ALT dose">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated correlation between estimated ALT dose [P(RA) + 2*P(AA)] and true ALT dose">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=10>
##bcftools_annotateVersion=1.9+htslib-1.9
##bcftools_annotateCommand=annotate -x FORMAT/DS,FORMAT/GP -o /path/pop1_195_10_GT.vcf /path/pop1_195_10_phased.vcf.gz; Date=Tue Sep 4 15:50:41 2018
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=1>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##bcftools_concatVersion=1.9+htslib-1.9
##bcftools_concatCommand=concat path/pop1_195_10_GT.vcf.gz /path/pop1_195_11_GT.vcf.gz /path/pop1_195_12_GT.vcf.gz /path/pop1_195_13_GT.vcf.gz /path/pop1_195_14_GT.vcf.gz /path/pop1_195_15_GT.vcf.gz /path/pop1_195_16_GT.vcf.gz /path/pop1_195_17_GT.vcf.gz /path/pop1_195_18_GT.vcf.gz /path/pop1_195_19_GT.vcf.gz /path/pop1_195_1_GT.vcf.gz /path/pop1_195_20_GT.vcf.gz /path/pop1_195_21_GT.vcf.gz /path/pop1_195_22_GT.vcf.gz /path/pop1_195_2_GT.vcf.gz /path/pop1_195_3_GT.vcf.gz //path/pop1_195_4_GT.vcf.gz /path/pop1_195_5_GT.vcf.gz /path/pop1_195_6_GT.vcf.gz /path/pop1_195_7_GT.vcf.gz /path/pop1_195_8_GT.vcf.gz /path/pop1_195_9_GT.vcf.gz; Date=Tue Sep 4 16:13:54 2018
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT s_1 s_2 s_3
10 144847 rs11253478 C T . PASS AR2=1;DR2=1;AF=0.082 GT 0|0 0|0 0|1
10 244561 rs2448366 G A . PASS AR2=1;DR2=1;AF=0.446 GT 1|1 0|1 0|0
10 252693 rs2379078 T C . PASS AR2=1;DR2=1;AF=0.238 GT 0|0 0|0 0|1
pop2.vcf.gz:
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##filedate=20180830
##source="beagle.jar (r1399)"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated correlation between most probable ALT dose and true ALT dose">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated correlation between estimated ALT dose [P(RA) + 2*P(AA)] and true ALT dose">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=10>
##bcftools_annotateVersion=1.9+htslib-1.9
##bcftools_annotateCommand=annotate -x FORMAT/DS,FORMAT/GP -o /path/pop2_10_GT.vcf /path/pop2_10_phased.vcf.gz ' '; Date=Tue Sep 4 12:38:31 2018
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=1>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##bcftools_concatVersion=1.9+htslib-1.9
##bcftools_concatCommand=concat /path/pop2_10_GT.vcf.gz /path/pop2_11_GT.vcf.gz /path/pop2_12_GT.vcf.gz /path/pop2_13_GT.vcf.gz /path/pop2_14_GT.vcf.gz /path/pop2_15_GT.vcf.gz /path/pop2_16_GT.vcf.gz /path/pop2_17_GT.vcf.gz /path/pop2_18_GT.vcf.gz /path/pop2_19_GT.vcf.gz /path/pop2_1_GT.vcf.gz /path/pop2_20_GT.vcf.gz /path/pop2_21_GT.vcf.gz /path/pop2_22_GT.vcf.gz /path/pop2_2_GT.vcf.gz /path/pop2_3_GT.vcf.gz /path/pop2_4_GT.vcf.gz /path/pop2_5_GT.vcf.gz /path/pop2_6_GT.vcf.gz /path/pop2_7_GT.vcf.gz /path/pop2_8_GT.vcf.gz /path/pop2_9_GT.vcf.gz; Date=Tue Sep 4 16:14:14 2018
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind_1 ind_2 ind_3
10 144847 rs11253478 C T . PASS AR2=1;DR2=1;AF=0.474 GT 0|0 0|1 0|0
10 244561 rs2448366 G A . PASS AR2=1;DR2=1;AF=0.321 GT 1|1 1|0 0|1
10 252693 rs2379078 C T . PASS AR2=1;DR2=1;AF=0.359 GT 1|1 1|0 0|1
Hello Mr Locuace ,
I don't know snp-flip. But the manual says "outputs each SNP on the reverse strand" and not in the reference strand. To make sure that REF and ALT are set according the reference genome I would suggest the
bcftools
pluginfixref
.fin swimmer
Thank you very much @finswimmer !!