VCF combination for common variants
3.3 years ago
windsur ▴ 20

Hello all!

I have several VCF files from the same patient (the files are not identical). The goal is to combine them into a single VCF, keeping only the variants that are present in at least 2 of the files. I have tried bcftools and also findCommonVariants from the Rsubread library, but they only give me the variants common to all files, or I do not know how to obtain the intersections.

My VCF files were generated from normal-tumor paired FASTQ files. Here is a small example (I will have n VCF files):

VCF1:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR
chr1    1013466 .   T   A   .   PASS    SOMATIC;QSS=49;TQSS=1;NT=ref;QSS_NT=49;TQSS_NT=1;SGT=TT->AT;DP=133;MQ=60.00;MQ0=0;ReadPosRankSum=-2.44;SNVSB=0.00;SomaticEVS=7.18   DP:FDP:SDP:SUBDP:AU:CU:GU:TU    32:0:0:0:0,0:0,0:0,0:32,32  100:0:0:0:5,5:0,0:0,0:95,96
chr1    1264801 .   A   G   .   PASS    SOMATIC;QSS=50;TQSS=1;NT=ref;QSS_NT=50;TQSS_NT=1;SGT=AA->AG;DP=132;MQ=60.00;MQ0=0;ReadPosRankSum=-2.09;SNVSB=0.00;SomaticEVS=10.57  DP:FDP:SDP:SUBDP:AU:CU:GU:TU    50:0:0:0:50,50:0,0:0,0:0,0  80:2:0:0:73,76:0,0:5,6:0,0
chr1    2312653 .   G   A   .   PASS    SOMATIC;QSS=16;TQSS=1;NT=ref;QSS_NT=16;TQSS_NT=1;SGT=GG->AG;DP=14;MQ=60.00;MQ0=0;ReadPosRankSum=-1.55;SNVSB=0.00;SomaticEVS=8.28    DP:FDP:SDP:SUBDP:AU:CU:GU:TU    10:0:0:0:0,0:0,0:10,10:0,0  4:0:0:0:2,2:0,0:2,2:0,0

VCF2:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  26-M    7-M
chr1    1013466 .   A   G   .   PASS    AS_FilterStatus=SITE;AS_SB_TABLE=9,8|5,0;DP=23;ECNT=1;GERMQ=21;MBQ=33,30;MFRL=352,345;MMQ=23,40;MPOS=38;NALOD=0.778;NLOD=3.00;POPAF=6.00;TLOD=10.60 GT:AD:AF:DP:F1R2:F2R1:SB    0/0:10,0:0.083:10:7,0:3,0:2,8,0,0   0/1:7,5:0.428:12:3,5:4,0:7,0,5,0
chr1    1738127 .   C   T   .   PASS    AS_FilterStatus=SITE;AS_SB_TABLE=38,10|6,5;DP=61;ECNT=1;GERMQ=56;MBQ=35,20;MFRL=313,238;MMQ=60,45;MPOS=50;NALOD=1.15;NLOD=3.91;POPAF=6.00;TLOD=18.09    GT:AD:AF:DP:F1R2:F2R1:SB    0/0:14,0:0.067:14:6,0:8,0:12,2,0,0  0/1:34,11:0.194:45:17,6:17,4:26,8,6,5
chr1    2312653 .   G   A   .   PASS    AS_FilterStatus=SITE;AS_SB_TABLE=17,13|3,2;DP=35;ECNT=1;GERMQ=33;MBQ=20,20;MFRL=179,170;MMQ=58,60;MPOS=53;NALOD=0.997;NLOD=2.66;POPAF=6.00;TLOD=9.52    GT:AD:AF:DP:F1R2:F2R1:SB    0/0:13,0:0.092:13:4,0:8,0:7,6,0,0   0/1:17,5:0.212:22:9,4:7,1:10,7,3,2

VCF3:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  N   T
chr1    1013466 .   T   A   .   PASS    SOMATIC;QSS=49;TQSS=1;NT=ref;QSS_NT=49;TQSS_NT=1;SGT=TT->AT;DP=133;MQ=60.00;MQ0=0;ReadPosRankSum=-2.44;SNVSB=0.00;SomaticEVS=7.18   DP:FDP:SDP:SUBDP:AU:CU:GU:TU    32:0:0:0:0,0:0,0:0,0:32,32  100:0:0:0:5,5:0,0:0,0:95,96
chr1    1738127 .   C   T   .   PASS    SOMATIC;QSS=50;TQSS=1;NT=ref;QSS_NT=50;TQSS_NT=1;SGT=AA->AG;DP=132;MQ=60.00;MQ0=0;ReadPosRankSum=-2.09;SNVSB=0.00;SomaticEVS=10.57  DP:FDP:SDP:SUBDP:AU:CU:GU:TU    50:0:0:0:50,50:0,0:0,0:0,0  80:2:0:0:73,76:0,0:5,6:0,0
chr1    2312847 .   C   T   .   PASS    SOMATIC;QSS=16;TQSS=1;NT=ref;QSS_NT=16;TQSS_NT=1;SGT=GG->AG;DP=14;MQ=60.00;MQ0=0;ReadPosRankSum=-1.55;SNVSB=0.00;SomaticEVS=8.28    DP:FDP:SDP:SUBDP:AU:CU:GU:TU    10:0:0:0:0,0:0,0:10,10:0,0  4:0:0:0:2,2:0,0:2,2:0,0

Because the contents of the last 4 columns differ between the VCF files, I think it could be done like this: if columns 1-7 of a record in VCF1 are the same in at least one of the other VCFs, copy columns 8-11 of the matching VCF into VCF1.

VCFcombined, the expected output (it would be great if a column named FILE could be added, listing the files in which each variant is present):

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR   FILTER  INFO    FORMAT  26-M    7-M N   T   FILE
chr1    1013466 .   T   A   .   PASS    SOMATIC;QSS=49;TQSS=1;NT=ref;QSS_NT=49;TQSS_NT=1;SGT=TT->AT;DP=133;MQ=60.00;MQ0=0;ReadPosRankSum=-2.44;SNVSB=0.00;SomaticEVS=7.18   DP:FDP:SDP:SUBDP:AU:CU:GU:TU    32:0:0:0:0,0:0,0:0,0:32,32  100:0:0:0:5,5:0,0:0,0:95,96 AS_FilterStatus=SITE;AS_SB_TABLE=9,8|5,0;DP=23;ECNT=1;GERMQ=21;MBQ=33,30;MFRL=352,345;MMQ=23,40;MPOS=38;NALOD=0.778;NLOD=3.00;POPAF=6.00;TLOD=10.60 GT:AD:AF:DP:F1R2:F2R1:SB    0/0:10,0:0.083:10:7,0:3,0:2,8,0,0   0/1:7,5:0.428:12:3,5:4,0:7,0,5,0    VCF1,VCF2,VCF3
chr1    2312653 .   G   A   .   PASS    SOMATIC;QSS=16;TQSS=1;NT=ref;QSS_NT=16;TQSS_NT=1;SGT=GG->AG;DP=14;MQ=60.00;MQ0=0;ReadPosRankSum=-1.55;SNVSB=0.00;SomaticEVS=8.28    DP:FDP:SDP:SUBDP:AU:CU:GU:TU    10:0:0:0:0,0:0,0:10,10:0,0  4:0:0:0:2,2:0,0:2,2:0,0 AS_FilterStatus=SITE;AS_SB_TABLE=17,13|3,2;DP=35;ECNT=1;GERMQ=33;MBQ=20,20;MFRL=179,170;MMQ=58,60;MPOS=53;NALOD=0.997;NLOD=2.66;POPAF=6.00;TLOD=9.52    GT:AD:AF:DP:F1R2:F2R1:SB    0/0:13,0:0.092:13:4,0:8,0:7,6,0,0   0/1:17,5:0.212:22:9,4:7,1:10,7,3,2  VCF1,VCF2
chr1    1738127 .   C   T   .   PASS        .        .  AS_FilterStatus=SITE;AS_SB_TABLE=38,10|6,5;DP=61;ECNT=1;GERMQ=56;MBQ=35,20;MFRL=313,238;MMQ=60,45;MPOS=50;NALOD=1.15;NLOD=3.91;POPAF=6.00;TLOD=18.09    GT:AD:AF:DP:F1R2:F2R1:SB    0/0:14,0:0.067:14:6,0:8,0:12,2,0,0  0/1:34,11:0.194:45:17,6:17,4:26,8,6,5   VCF2,VCF3

Any help is more than welcome! Thanks!!


Hi!

I'm sure R can do this as well, but the pandas module has a merge function that can be used like this:

df_merged = pd.merge(df_vcf1, df_vcf2, on=['Chrom', 'Pos', ...], how='inner')

You can do the same for the other pairwise combinations and, at the end, concatenate the results and drop duplicates. You could also add an extra column with the filename to each DataFrame beforehand.
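As an alternative to merging every pair, the same idea can be sketched by stacking the variant keys of all files and counting in how many files each key occurs. This is only a minimal sketch under some assumptions: the helper names `read_vcf` and `common_variants` are made up for illustration, the files are plain-text (uncompressed) VCFs, and a variant is identified by `#CHROM`, `POS`, `REF` and `ALT`.

```python
import io
import pandas as pd

# Columns assumed to identify a variant; adjust if you want to match differently.
KEY = ["#CHROM", "POS", "REF", "ALT"]

def read_vcf(path):
    """Read an uncompressed VCF into a DataFrame, skipping the '##' meta lines."""
    with open(path) as handle:
        body = [line for line in handle if not line.startswith("##")]
    # Everything is kept as strings so that keys compare exactly.
    return pd.read_csv(io.StringIO("".join(body)), sep="\t", dtype=str)

def common_variants(frames, min_files=2):
    """Return the variant keys found in at least `min_files` of the inputs.

    `frames` maps a label (e.g. the filename) to a DataFrame that
    contains the KEY columns. The result has the KEY columns plus
    FILE (comma-separated labels) and N_FILES.
    """
    # One row per (variant, file); duplicates inside a file count once.
    keys = pd.concat(
        [df[KEY].drop_duplicates().assign(FILE=label) for label, df in frames.items()],
        ignore_index=True,
    )
    # Collect the labels of every file in which each variant was seen.
    summary = keys.groupby(KEY, sort=False)["FILE"].agg(list).reset_index()
    summary["N_FILES"] = summary["FILE"].str.len()
    summary["FILE"] = summary["FILE"].str.join(",")
    return summary[summary["N_FILES"] >= min_files].reset_index(drop=True)
```

The result can then be joined back onto any one file with `pd.merge(df_vcf1, result, on=KEY, how="inner")` to recover that file's INFO, FORMAT and sample columns. Note that with `REF` and `ALT` in the key, the VCF2 record at chr1:1013466 in the question (A>G) would not match the T>A record in VCF1 and VCF3; drop those two columns from `KEY` if matching by position alone is what you want.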


bcftools isec can report variants by the number of files they appear in, using the -n parameter. For example, -n+2 keeps sites that are present in at least 2 of the input files:

bcftools isec -n+2 -p output_dir *.vcf.gz

Make sure your VCFs are decomposed and normalized (left-aligned, parsimonious representation) before you do this, though: multi-allelic variants and non-normalized entries can break the comparison.
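To get something like the FILE column asked for in the question, the `sites.txt` file that bcftools isec writes into the output directory could be parsed. The sketch below assumes that file is tab-separated with the columns CHROM, POS, REF, ALT and a string of 0/1 flags, one character per input file in command-line order; check this against your bcftools version. The function name `read_isec_sites` and the labels are hypothetical.

```python
import io
import pandas as pd

def read_isec_sites(source, file_labels):
    """Parse an isec sites.txt (path or file-like) and add a FILE column.

    `file_labels` must be in the same order as the files given to
    bcftools isec (a shell glob such as *.vcf.gz expands in sorted order).
    """
    sites = pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["CHROM", "POS", "REF", "ALT", "PRESENCE"],
        dtype=str,  # keep flags such as "011" from being read as the number 11
    )
    # Translate each 0/1 flag string into the labels of the files that have the site.
    sites["FILE"] = sites["PRESENCE"].apply(
        lambda bits: ",".join(
            label for label, bit in zip(file_labels, bits) if bit == "1"
        )
    )
    return sites

# Small in-memory example standing in for output_dir/sites.txt.
example = io.StringIO("chr1\t2312653\tG\tA\t110\nchr1\t1738127\tC\tT\t011\n")
print(read_isec_sites(example, ["VCF1", "VCF2", "VCF3"]))
```

In practice `source` would be the path `output_dir/sites.txt` from the command above, and the labels would be your filenames.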
