I have two vcf files, each with the same samples and the same variants, but with different INFO and FORMAT fields. I need to merge them together into a single vcf file.
It seems like the obvious answer would be something like bcftools concat
or GATK MergeVcfs
, however neither of them are working correctly for this.
The vcf files look like this;
$ cat 1.vcf
##fileformat=VCFv4.2
##FORMAT=<ID=FL_AD,Number=1,Type=Integer,Description="Depth matching alternate (ALT) allele">
##FORMAT=<ID=FL_ADN,Number=1,Type=Integer,Description="Alternate depth on negative strand">
##FORMAT=<ID=FL_ADP,Number=1,Type=Integer,Description="Alternate depth on postitive strand">
##FORMAT=<ID=FL_DP,Number=1,Type=Integer,Description="Total depth">
##FORMAT=<ID=FL_DPN,Number=1,Type=Integer,Description="Depth on negative strand">
##FORMAT=<ID=FL_DPP,Number=1,Type=Integer,Description="Depth on postitive strand">
##FORMAT=<ID=FL_RD,Number=1,Type=Integer,Description="Depth matching reference (REF) allele">
##FORMAT=<ID=FL_RDN,Number=1,Type=Integer,Description="Reference depth on negative strand">
##FORMAT=<ID=FL_RDP,Number=1,Type=Integer,Description="Reference depth on postitive strand">
##FORMAT=<ID=FL_VF,Number=1,Type=Float,Description="Variant frequence (AD/DP)">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3
1 43804359 . C T . . . FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF 603:374:229:898:546:352:295:172:123:0.671492 222:131:91:390:228:162:168:97:71:0.569231 243:123:120:972:502:470:727:378:349:0.25
2 29551282 . C T . . . FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF 616:322:294:1346:702:644:729:380:349:0.457652 242:112:130:705:333:372:462:221:241:0.343262 237:124:113:1360:673:687:1120:547:573:0.174265
2 29551282 . CCT TCA . . . FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF 593:307:286:1331:699:632:679:364:315:0.44553 236:107:129:709:337:372:447:218:229:0.332863 230:119:111:1373:688:685:1085:538:547:0.167516
$ cat 2.vcf
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic Depths of REF and ALT(s) in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3
1 43804359 . C T . . AC=3;AN=6 GT:AD:DP 0/1:284,559:845 0/1:167,219:387 0/1:687,323
2 29551282 . C T . . AC=2;AN=4 GT:AD:DP 0/1:651,558:1211 0/1:459,241:702 ./.
2 29551282 . CCT TCA . . AC=1;AN=2 GT:AD ./. ./. 0/1:825,306
- notably,
2.vcf
has some empty FORMAT fields for some variants for some samples
The results of standard merge operations look like this:
bcftools concat
;
$ bcftools concat -a 1.vcf.gz 2.vcf.gz --output bcftools_concat.vcf --output-type v
$ cat bcftools_concat.vcf
##fileformat=VCFv4.2
## ...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3
1 43804359 . C T . . . FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF 603:374:229:898:546:352:295:172:123:0.671492 222:131:91:390:228:162:168:97:71:0.569231 243:123:120:972:502:470:727:378:349:0.25
1 43804359 . C T . . AC=3;AN=6 GT:AD:DP 0/1:284,559:845 0/1:167,219:387 0/1:687,323:.
2 29551282 . C T . . . FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF 616:322:294:1346:702:644:729:380:349:0.457652 242:112:130:705:333:372:462:221:241:0.343262 237:124:113:1360:673:687:1120:547:573:0.174265
2 29551282 . C T . . AC=2;AN=4 GT:AD:DP 0/1:651,558:1211 0/1:459,241:702 ./.:.:.
2 29551282 . CCT TCA . . . FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF 593:307:286:1331:699:632:679:364:315:0.44553 236:107:129:709:337:372:447:218:229:0.332863 230:119:111:1373:688:685:1085:538:547:0.167516
2 29551282 . CCT TCA . . AC=1;AN=2 GT:AD ./.:. ./.:. 0/1:825,306
GATK;
$ gatk MergeVcfs -I 1.vcf -I 2.vcf -O gatk_concat.vcf
$ cat gatk_concat.vcf
##fileformat=VCFv4.2
## ...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3
1 43804359 . C T . . . FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF 603:374:229:898:546:352:295:172:123:0.671492 222:131:91:390:228:162:168:97:71:0.569231 243:123:120:972:502:470:727:378:349:0.25
1 43804359 . C T . . AC=3;AN=6 GT:AD:DP 0/1:284,559:845 0/1:167,219:387 0/1:687,323
2 29551282 . C T . . . FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF 616:322:294:1346:702:644:729:380:349:0.457652 242:112:130:705:333:372:462:221:241:0.343262 237:124:113:1360:673:687:1120:547:573:0.174265
2 29551282 . C T . . AC=2;AN=4 GT:AD:DP 0/1:651,558:1211 0/1:459,241:702 ./.
2 29551282 . CCT TCA . . . FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF 593:307:286:1331:699:632:679:364:315:0.44553 236:107:129:709:337:372:447:218:229:0.332863 230:119:111:1373:688:685:1085:538:547:0.167516
2 29551282 . CCT TCA . . AC=1;AN=2 GT:AD ./. ./. 0/1:825,306
In both cases, the lines are duplicated, not merged together.
Trying instead with bcftools merge
does not seem to work right either;
$ bcftools merge --force-samples 1.vcf.gz 2.vcf.gz --output bcftools_merge.vcf
$ cat bcftools_merge.vcf
##fileformat=VCFv4.2
## ...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 2:Sample1 2:Sample2 2:Sample3
1 43804359 . C T . . AN=6;AC=3 GT:FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:AD:DP ./.:603:374:229:898:546:352:295:172:123:0.671492:.:. ./.:222:131:91:390:228:162:168:97:71:0.569231:.:. ./.:243:123:120:972:502:470:727:378:349:0.25:.:. 0/1:.:.:.:.:.:.:.:.:.:.:284,559:845 0/1:.:.:.:.:.:.:.:.:.:.:167,219:387 0/1:.:.:.:.:.:.:.:.:.:.:687,323:.
2 29551282 . C T . . AN=4;AC=2 GT:FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:AD:DP ./.:616:322:294:1346:702:644:729:380:349:0.457652:.:. ./.:242:112:130:705:333:372:462:221:241:0.343262:.:. ./.:237:124:113:1360:673:687:1120:547:573:0.174265:.:. 0/1:.:.:.:.:.:.:.:.:.:.:651,558:1211 0/1:.:.:.:.:.:.:.:.:.:.:459,241:702 ./.:.:.:.:.:.:.:.:.:.:.:.:.
2 29551282 . CCT TCA . . AN=2;AC=1 GT:FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:AD ./.:593:307:286:1331:699:632:679:364:315:0.44553:. ./.:236:107:129:709:337:372:447:218:229:0.332863:. ./.:230:119:111:1373:688:685:1085:538:547:0.167516:. ./.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:. 0/1:.:.:.:.:.:.:.:.:.:.:825,306
This time, the variants are not duplicated, but the samples are duplicated instead of being merged.
Ultimately, what I want is something like this;
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3
1 43804359 . C T . . AC=3;AN=6 FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:GT:AD:DP 603:374:229:898:546:352:295:172:123:0.671492:0/1:284,559:845 222:131:91:390:228:162:168:97:71:0.569231:0/1:167,219:387 243:123:120:972:502:470:727:378:349:0.25:0/1:687,323
2 29551282 . C T . . AC=2;AN=4 FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:GT:AD:DP 616:322:294:1346:702:644:729:380:349:0.457652:0/1:651,558:1211 242:112:130:705:333:372:462:221:241:0.343262:0/1:459,241:702 237:124:113:1360:673:687:1120:547:573:0.174265:./.
2 29551282 . CCT TCA . . AC=1;AN=2 FL_AD:FL_ADN:FL_ADP:FL_DP:FL_DPN:FL_DPP:FL_RD:FL_RDN:FL_RDP:FL_VF:GT:AD 593:307:286:1331:699:632:679:364:315:0.44553:./. 236:107:129:709:337:372:447:218:229:0.332863:./. 230:119:111:1373:688:685:1085:538:547:0.167516:0/1:825,306
Is there a way to get this output from either of these tools, or some other tool? I was trying to avoid having to resort to sed
/awk
/etc. if there is something already available that can do this correctly.
thanks for the suggestion, but that did not work. It deleted the extra rows instead of merging their INFO and FORMAT values together. So the output ends up being basically a copy of
1.vcf
and all data from2.vcf
is lost.I'm afraid you need to create and index a tsv file with
bcftools query
containing the annotations of one of the two files and then use this file withbcftools annotate
with the other vcf file.