Hello all!
I have several VCF files from the same patient (they are not identical). The goal is to combine them into a single VCF, keeping only the variants that are present in at least 2 of the files. I have tried bcftools and also findCommonVariants from the Rsubread library, but they only give me the variants common to all files, or I do not know how to obtain the intersections I need.
My VCF files were generated from normal-tumor paired FASTQ files. Here is a small example (I will have n VCF files):
VCF1:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr1 1013466 . T A . PASS SOMATIC;QSS=49;TQSS=1;NT=ref;QSS_NT=49;TQSS_NT=1;SGT=TT->AT;DP=133;MQ=60.00;MQ0=0;ReadPosRankSum=-2.44;SNVSB=0.00;SomaticEVS=7.18 DP:FDP:SDP:SUBDP:AU:CU:GU:TU 32:0:0:0:0,0:0,0:0,0:32,32 100:0:0:0:5,5:0,0:0,0:95,96
chr1 1264801 . A G . PASS SOMATIC;QSS=50;TQSS=1;NT=ref;QSS_NT=50;TQSS_NT=1;SGT=AA->AG;DP=132;MQ=60.00;MQ0=0;ReadPosRankSum=-2.09;SNVSB=0.00;SomaticEVS=10.57 DP:FDP:SDP:SUBDP:AU:CU:GU:TU 50:0:0:0:50,50:0,0:0,0:0,0 80:2:0:0:73,76:0,0:5,6:0,0
chr1 2312653 . G A . PASS SOMATIC;QSS=16;TQSS=1;NT=ref;QSS_NT=16;TQSS_NT=1;SGT=GG->AG;DP=14;MQ=60.00;MQ0=0;ReadPosRankSum=-1.55;SNVSB=0.00;SomaticEVS=8.28 DP:FDP:SDP:SUBDP:AU:CU:GU:TU 10:0:0:0:0,0:0,0:10,10:0,0 4:0:0:0:2,2:0,0:2,2:0,0
VCF2:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 26-M 7-M
chr1 1013466 . A G . PASS AS_FilterStatus=SITE;AS_SB_TABLE=9,8|5,0;DP=23;ECNT=1;GERMQ=21;MBQ=33,30;MFRL=352,345;MMQ=23,40;MPOS=38;NALOD=0.778;NLOD=3.00;POPAF=6.00;TLOD=10.60 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:10,0:0.083:10:7,0:3,0:2,8,0,0 0/1:7,5:0.428:12:3,5:4,0:7,0,5,0
chr1 1738127 . C T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=38,10|6,5;DP=61;ECNT=1;GERMQ=56;MBQ=35,20;MFRL=313,238;MMQ=60,45;MPOS=50;NALOD=1.15;NLOD=3.91;POPAF=6.00;TLOD=18.09 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:14,0:0.067:14:6,0:8,0:12,2,0,0 0/1:34,11:0.194:45:17,6:17,4:26,8,6,5
chr1 2312653 . G A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=17,13|3,2;DP=35;ECNT=1;GERMQ=33;MBQ=20,20;MFRL=179,170;MMQ=58,60;MPOS=53;NALOD=0.997;NLOD=2.66;POPAF=6.00;TLOD=9.52 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:13,0:0.092:13:4,0:8,0:7,6,0,0 0/1:17,5:0.212:22:9,4:7,1:10,7,3,2
VCF3:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT N T
chr1 1013466 . T A . PASS SOMATIC;QSS=49;TQSS=1;NT=ref;QSS_NT=49;TQSS_NT=1;SGT=TT->AT;DP=133;MQ=60.00;MQ0=0;ReadPosRankSum=-2.44;SNVSB=0.00;SomaticEVS=7.18 DP:FDP:SDP:SUBDP:AU:CU:GU:TU 32:0:0:0:0,0:0,0:0,0:32,32 100:0:0:0:5,5:0,0:0,0:95,96
chr1 1738127 . C T . PASS SOMATIC;QSS=50;TQSS=1;NT=ref;QSS_NT=50;TQSS_NT=1;SGT=AA->AG;DP=132;MQ=60.00;MQ0=0;ReadPosRankSum=-2.09;SNVSB=0.00;SomaticEVS=10.57 DP:FDP:SDP:SUBDP:AU:CU:GU:TU 50:0:0:0:50,50:0,0:0,0:0,0 80:2:0:0:73,76:0,0:5,6:0,0
chr1 2312847 . C T . PASS SOMATIC;QSS=16;TQSS=1;NT=ref;QSS_NT=16;TQSS_NT=1;SGT=GG->AG;DP=14;MQ=60.00;MQ0=0;ReadPosRankSum=-1.55;SNVSB=0.00;SomaticEVS=8.28 DP:FDP:SDP:SUBDP:AU:CU:GU:TU 10:0:0:0:0,0:0,0:10,10:0,0 4:0:0:0:2,2:0,0:2,2:0,0
Since the contents of the last 4 columns differ between the VCF files, I think it could be done like this: if columns 1-7 of a record in VCF1 match a record in at least one of the other VCFs, copy columns 8-11 from the matching VCF into VCF1.
VCFcombined: Expected output (it would be great if a column named FILE could be added, listing the files in which each variant is present):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR FILTER INFO FORMAT 26-M 7-M N T FILE
chr1 1013466 . T A . PASS SOMATIC;QSS=49;TQSS=1;NT=ref;QSS_NT=49;TQSS_NT=1;SGT=TT->AT;DP=133;MQ=60.00;MQ0=0;ReadPosRankSum=-2.44;SNVSB=0.00;SomaticEVS=7.18 DP:FDP:SDP:SUBDP:AU:CU:GU:TU 32:0:0:0:0,0:0,0:0,0:32,32 100:0:0:0:5,5:0,0:0,0:95,96 AS_FilterStatus=SITE;AS_SB_TABLE=9,8|5,0;DP=23;ECNT=1;GERMQ=21;MBQ=33,30;MFRL=352,345;MMQ=23,40;MPOS=38;NALOD=0.778;NLOD=3.00;POPAF=6.00;TLOD=10.60 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:10,0:0.083:10:7,0:3,0:2,8,0,0 0/1:7,5:0.428:12:3,5:4,0:7,0,5,0 VCF1,VCF2,VCF3
chr1 2312653 . G A . PASS SOMATIC;QSS=16;TQSS=1;NT=ref;QSS_NT=16;TQSS_NT=1;SGT=GG->AG;DP=14;MQ=60.00;MQ0=0;ReadPosRankSum=-1.55;SNVSB=0.00;SomaticEVS=8.28 DP:FDP:SDP:SUBDP:AU:CU:GU:TU 10:0:0:0:0,0:0,0:10,10:0,0 4:0:0:0:2,2:0,0:2,2:0,0 AS_FilterStatus=SITE;AS_SB_TABLE=17,13|3,2;DP=35;ECNT=1;GERMQ=33;MBQ=20,20;MFRL=179,170;MMQ=58,60;MPOS=53;NALOD=0.997;NLOD=2.66;POPAF=6.00;TLOD=9.52 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:13,0:0.092:13:4,0:8,0:7,6,0,0 0/1:17,5:0.212:22:9,4:7,1:10,7,3,2 VCF1,VCF2
chr1 1738127 . C T . PASS . . AS_FilterStatus=SITE;AS_SB_TABLE=38,10|6,5;DP=61;ECNT=1;GERMQ=56;MBQ=35,20;MFRL=313,238;MMQ=60,45;MPOS=50;NALOD=1.15;NLOD=3.91;POPAF=6.00;TLOD=18.09 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:14,0:0.067:14:6,0:8,0:12,2,0,0 0/1:34,11:0.194:45:17,6:17,4:26,8,6,5 VCF2,VCF3
Any help is more than welcome! Thanks!!
Hi!
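I'm sure R can do this as well, but the pandas module has a merge function that joins two tables on their shared columns, which gives you the variants common to a pair of files.
You can do the same with the other combinations and, at the end, concat the results and drop duplicates. You may want to add an extra column with the filenames beforehand.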
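The pairwise merge, concat and drop-duplicates idea above could be sketched roughly as follows. This is only an illustration under some assumptions: each VCF body has already been loaded into a DataFrame, a variant is identified by CHROM, POS, REF and ALT, and the helper names (`shared_by_pairs`, `with_file_column`) and the toy positions are made up for the example.

```python
from itertools import combinations

import pandas as pd

# Columns assumed to identify a variant (an assumption; adjust as needed).
KEY = ["CHROM", "POS", "REF", "ALT"]


def shared_by_pairs(tables):
    """Return the KEY rows found in at least two of the given tables.

    `tables` maps a file label to a DataFrame that contains the KEY columns.
    """
    pair_hits = []
    for (_, df_a), (_, df_b) in combinations(tables.items(), 2):
        # An inner merge keeps only the variants present in both tables.
        pair_hits.append(pd.merge(df_a[KEY], df_b[KEY], on=KEY, how="inner"))
    if not pair_hits:  # fewer than two tables: nothing can be shared
        return pd.DataFrame(columns=KEY)
    # A variant found in several pairs appears several times; keep one copy.
    shared = pd.concat(pair_hits, ignore_index=True).drop_duplicates()
    # CHROM is sorted as text here, not in karyotype order.
    return shared.sort_values(["CHROM", "POS"]).reset_index(drop=True)


def with_file_column(tables, shared):
    """Add a FILE column listing every table in which each shared variant occurs."""
    tagged = pd.concat(
        [df[KEY].drop_duplicates().assign(FILE=name) for name, df in tables.items()],
        ignore_index=True,
    )
    files = tagged.groupby(KEY, sort=False)["FILE"].agg(",".join).reset_index()
    return shared.merge(files, on=KEY, how="left")


def toy(rows):
    """Build a small stand-in for a loaded VCF (hypothetical data)."""
    return pd.DataFrame(rows, columns=KEY)


tables = {
    "VCF1": toy([("chr1", 100, "T", "A"), ("chr1", 200, "A", "G"), ("chr1", 300, "G", "A")]),
    "VCF2": toy([("chr1", 100, "T", "A"), ("chr1", 250, "C", "T"), ("chr1", 300, "G", "A")]),
    "VCF3": toy([("chr1", 100, "T", "A"), ("chr1", 250, "C", "T"), ("chr1", 400, "C", "T")]),
}

result = with_file_column(tables, shared_by_pairs(tables))
print(result.to_string(index=False))
```

The per-file INFO, FORMAT and sample columns can then be joined back onto `result` with further left merges on the same key columns, one per input file.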
bcftools isec should be able to give you the variants "present in n files" with the -n parameter (for example, -n+2 selects sites present in two or more files). Ensure that your VCFs are decomposed and normalized (left-aligned, parsimonious representation) before you do this, though: multi-allelic variants and non-normalized entries can mess up the comparison.
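To make the "present in n files" idea concrete, here is a small pure-Python sketch that counts in how many inputs each site occurs. It is not how bcftools works internally, only an illustration; it assumes tab-separated, already normalized and decomposed records (one ALT per line), and the function names `site_keys` and `sites_in_at_least` are invented for the example.

```python
from collections import defaultdict


def site_keys(lines):
    """Yield (CHROM, POS, REF, ALT) for every data line of a VCF."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta lines, the column header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        yield chrom, int(pos), ref, alt


def sites_in_at_least(vcfs, min_files=2):
    """Map each site seen in >= min_files inputs to the labels that contain it.

    `vcfs` maps a label to an iterable of VCF lines (an open file works too).
    """
    seen = defaultdict(list)
    for label, lines in vcfs.items():
        # set() makes sure a file counts only once per site.
        for key in set(site_keys(lines)):
            seen[key].append(label)
    return {
        key: labels
        for key, labels in sorted(seen.items())
        if len(labels) >= min_files
    }


# Usage with real files (paths are placeholders):
# with open("a.vcf") as a, open("b.vcf") as b, open("c.vcf") as c:
#     shared = sites_in_at_least({"a.vcf": a, "b.vcf": b, "c.vcf": c})
```

Because the key includes REF and ALT, two records at the same position with different alleles are treated as different sites, which is why normalizing first matters.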