Question

How to merge variants for the same samples from different vcf files?

2

Entering edit mode

3.6 years ago

steve ★ 3.5k

I have two vcf files, each with the same samples and the same variants, but with different INFO and FORMAT fields. I need to merge them together into a single vcf file.

It seems like the obvious answer would be something like bcftools concat or GATK MergeVcfs, however neither of them are working correctly for this.

The vcf files look like this;

$ cat 1.vcf
##fileformat=VCFv4.2
##FORMAT=<ID=FL_AD,Number=1,Type=Integer,Description="Depth matching alternate (ALT) allele">
##FORMAT=<ID=FL_ADN,Number=1,Type=Integer,Description="Alternate depth on negative strand">
##FORMAT=<ID=FL_ADP,Number=1,Type=Integer,Description="Alternate depth on postitive strand">
##FORMAT=<ID=FL_DP,Number=1,Type=Integer,Description="Total depth">
##FORMAT=<ID=FL_DPN,Number=1,Type=Integer,Description="Depth on negative strand">
##FORMAT=<ID=FL_DPP,Number=1,Type=Integer,Description="Depth on postitive strand">
##FORMAT=<ID=FL_RD,Number=1,Type=Integer,Description="Depth matching reference (REF) allele">
##FORMAT=<ID=FL_RDN,Number=1,Type=Integer,Description="Reference depth on negative strand">
##FORMAT=<ID=FL_RDP,Number=1,Type=Integer,Description="Reference depth on postitive strand">
##FORMAT=<ID=FL_VF,Number=1,Type=Float,Description="Variant frequence (AD/DP)">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 Sample3
1       43804359        .       C       T       .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     603:374:229:898:546:352:295:172:123:0.671492     222:131:91:390:228:162:168:97:71:0.569231       243:123:120:972:502:470:727:378:349:0.25
2       29551282        .       C       T       .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     616:322:294:1346:702:644:729:380:349:0.457652    242:112:130:705:333:372:462:221:241:0.343262    237:124:113:1360:673:687:1120:547:573:0.174265
2       29551282        .       CCT     TCA     .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     593:307:286:1331:699:632:679:364:315:0.44553     236:107:129:709:337:372:447:218:229:0.332863    230:119:111:1373:688:685:1085:538:547:0.167516

$ cat 2.vcf
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic Depths of REF and ALT(s) in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 Sample3
1       43804359        .       C       T       .       .       AC=3;AN=6       GT:AD:DP        0/1:284,559:845 0/1:167,219:387 0/1:687,323
2       29551282        .       C       T       .       .       AC=2;AN=4       GT:AD:DP        0/1:651,558:1211        0/1:459,241:702 ./.
2       29551282        .       CCT     TCA     .       .       AC=1;AN=2       GT:AD   ./.     ./.     0/1:825,306

notably, 2.vcf has some empty FORMAT fields for some variants for some samples

The results of standard merge operations look like this:

bcftools concat;

$ bcftools concat -a 1.vcf.gz 2.vcf.gz --output bcftools_concat.vcf --output-type v

$ cat bcftools_concat.vcf
##fileformat=VCFv4.2
## ...
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 Sample3
1       43804359        .       C       T       .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     603:374:229:898:546:352:295:172:123:0.671492     222:131:91:390:228:162:168:97:71:0.569231       243:123:120:972:502:470:727:378:349:0.25
1       43804359        .       C       T       .       .       AC=3;AN=6       GT:AD:DP        0/1:284,559:845 0/1:167,219:387 0/1:687,323:.
2       29551282        .       C       T       .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     616:322:294:1346:702:644:729:380:349:0.457652    242:112:130:705:333:372:462:221:241:0.343262    237:124:113:1360:673:687:1120:547:573:0.174265
2       29551282        .       C       T       .       .       AC=2;AN=4       GT:AD:DP        0/1:651,558:1211        0/1:459,241:702 ./.:.:.
2       29551282        .       CCT     TCA     .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     593:307:286:1331:699:632:679:364:315:0.44553     236:107:129:709:337:372:447:218:229:0.332863    230:119:111:1373:688:685:1085:538:547:0.167516
2       29551282        .       CCT     TCA     .       .       AC=1;AN=2       GT:AD   ./.:.   ./.:.   0/1:825,306

GATK;

$ gatk MergeVcfs -I 1.vcf -I 2.vcf -O gatk_concat.vcf

$ cat gatk_concat.vcf
##fileformat=VCFv4.2
## ...
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 Sample3
1       43804359        .       C       T       .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     603:374:229:898:546:352:295:172:123:0.671492     222:131:91:390:228:162:168:97:71:0.569231       243:123:120:972:502:470:727:378:349:0.25
1       43804359        .       C       T       .       .       AC=3;AN=6       GT:AD:DP        0/1:284,559:845 0/1:167,219:387 0/1:687,323
2       29551282        .       C       T       .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     616:322:294:1346:702:644:729:380:349:0.457652    242:112:130:705:333:372:462:221:241:0.343262    237:124:113:1360:673:687:1120:547:573:0.174265
2       29551282        .       C       T       .       .       AC=2;AN=4       GT:AD:DP        0/1:651,558:1211        0/1:459,241:702 ./.
2       29551282        .       CCT     TCA     .       .       .       FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF     593:307:286:1331:699:632:679:364:315:0.44553     236:107:129:709:337:372:447:218:229:0.332863    230:119:111:1373:688:685:1085:538:547:0.167516
2       29551282        .       CCT     TCA     .       .       AC=1;AN=2       GT:AD   ./.     ./.     0/1:825,306

In both cases, the lines are duplicated, not merged together.

Trying instead with bcftools merge does not seem to work right either;

$ bcftools merge --force-samples 1.vcf.gz 2.vcf.gz --output bcftools_merge.vcf

$ cat bcftools_merge.vcf
##fileformat=VCFv4.2
## ...
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 Sample3 2:Sample1       2:Sample2       2:Sample3
1       43804359        .       C       T       .       .       AN=6;AC=3       GT:FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:AD:DP     ./.:603:374:229:898:546:352:295:172:123:0.671492:.:.    ./.:222:131:91:390:228:162:168:97:71:0.569231:.:.       ./.:243:123:120:972:502:470:727:378:349:0.25:.:.       0/1:.:.:.:.:.:.:.:.:.:.:284,559:845     0/1:.:.:.:.:.:.:.:.:.:.:167,219:387     0/1:.:.:.:.:.:.:.:.:.:.:687,323:.
2       29551282        .       C       T       .       .       AN=4;AC=2       GT:FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:AD:DP     ./.:616:322:294:1346:702:644:729:380:349:0.457652:.:.   ./.:242:112:130:705:333:372:462:221:241:0.343262:.:.    ./.:237:124:113:1360:673:687:1120:547:573:0.174265:.:. 0/1:.:.:.:.:.:.:.:.:.:.:651,558:1211    0/1:.:.:.:.:.:.:.:.:.:.:459,241:702     ./.:.:.:.:.:.:.:.:.:.:.:.:.
2       29551282        .       CCT     TCA     .       .       AN=2;AC=1       GT:FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:AD        ./.:593:307:286:1331:699:632:679:364:315:0.44553:.      ./.:236:107:129:709:337:372:447:218:229:0.332863:.      ./.:230:119:111:1373:688:685:1085:538:547:0.167516:.   ./.:.:.:.:.:.:.:.:.:.:.:.       ./.:.:.:.:.:.:.:.:.:.:.:.       0/1:.:.:.:.:.:.:.:.:.:.:825,306

This time, the variants are not duplicated, but the samples are duplicated instead of being merged.

Ultimately, what I want is something like this;

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 Sample3
1   43804359    .   C   T   .   .   AC=3;AN=6   FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:GT:AD:DP  603:374:229:898:546:352:295:172:123:0.671492:0/1:284,559:845    222:131:91:390:228:162:168:97:71:0.569231:0/1:167,219:387   243:123:120:972:502:470:727:378:349:0.25:0/1:687,323
2   29551282    .   C   T   .   .   AC=2;AN=4   FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:GT:AD:DP  616:322:294:1346:702:644:729:380:349:0.457652:0/1:651,558:1211  242:112:130:705:333:372:462:221:241:0.343262:0/1:459,241:702    237:124:113:1360:673:687:1120:547:573:0.174265:./.
2   29551282    .   CCT TCA .   .   AC=1;AN=2   FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:GT:AD 593:307:286:1331:699:632:679:364:315:0.44553:./.    236:107:129:709:337:372:447:218:229:0.332863:./.    230:119:111:1373:688:685:1085:538:547:0.167516:0/1:825,306

Is there a way to get this output from either of these tools, or some other tool? I was trying to avoid having to resort to sed/awk/etc. if there is something already available that can do this correctly.

vcf • 2.0k views

ADD COMMENT • link 3.5 years ago by steve ★ 3.5k

score 0 · Answer 1 · 2021-05-07

0

Entering edit mode

3.6 years ago

Pierre Lindenbaum 164k

you want bcftools concat with --remove-duplicates

ADD COMMENT • link 3.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

thanks for the suggestion, but that did not work. It deleted the extra rows instead of merging their INFO and FORMAT values together. So the output ends up being basically a copy of 1.vcf and all data from 2.vcf is lost.

ADD REPLY • link 3.6 years ago by steve ★ 3.5k

1

Entering edit mode

I'm afraid you need to create and index a tsv file with bcftools query containing the annotations of one of the two files and then use this file with bcftools annotate with the other vcf file.

ADD REPLY • link 3.6 years ago by Pierre Lindenbaum 164k

score 0 · Answer 2 · 2021-05-10

As suggested by Pierre, looks like bcftools annotate will do this;

# get the header lines
grep -E '##FORMAT|##INFO' 2.vcf > 2.header.txt
# annotate the first vcf with the second vcf fields
bcftools annotate --header-lines 2.header.txt --annotations 2.vcf.gz  --columns 'FORMAT/GT,FORMAT/AD,FORMAT/DP,INFO/AC,INFO/AN' --output merge.vcf --output-type v 1.vcf.gz

https://samtools.github.io/bcftools/howtos/annotate.html