Hi,
I recently obtained TCGA VCF files to search for germline variants. The variants were called by Washington University using several callers, i.e., Samtools, Sniper, Varscan, and Strelka, and the calls from each caller were then merged into a single VCF file, with a separate sample column per caller. Upon checking the files, I found that most of the calls from every caller except Varscan are uninformative, so I can only annotate variants that were called by Varscan.
This is what the VCF column header line looks like:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TCGA-PG-A914-01A-11D-A37N-09 TCGA-PG-A914-01A-11D-A37N-09-[Samtools] TCGA-PG-A914-10A-01D-A37N-09 TCGA-PG-A914-10A-01D-A37N-09-[Sniper] TCGA-PG-A914-01A-11D-A37N-09-[Sniper] TCGA-PG-A914-10A-01D-A37N-09-[VarscanSomatic] TCGA-PG-A914-01A-11D-A37N-09-[VarscanSomatic] TCGA-PG-A914-10A-01D-A37N-09-[Strelka] TCGA-PG-A914-01A-11D-A37N-09-[Strelka]
The problem is that the FORMAT column is not consistent from record to record. These are all the FORMAT strings that appear in the VCF files:
GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:FA:AD:FDP:SDP:SUBDP:AU:CU:GU:TU
GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:IGT:DP4:BCOUNT:JGQ:AMQ:SSC:FDP:SDP:SUBDP:AU:CU:GU:TU
GT:DP:DP4:BQ:FA:VAQ:SS:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU:GQ:MQ:IGT:BCOUNT:JGQ:AMQ:SSC
GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:DP4:FDP:SDP:SUBDP:AU:CU:GU:TU
GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU:FA
GT:DP:DP4:BQ:FA:VAQ:SS:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU:GQ:MQ
GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:IGT:DP4:BCOUNT:JGQ:AMQ:SSC
GT:DP:DP4:BQ:FA:VAQ:SS:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU
GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:FDP:SDP:SUBDP:AU:CU:GU:TU:DP4
GT:AD:BQ:SS:DP:FDP:SDP:SUBDP:AU:CU:GU:TU:FT:DP4:FA:VAQ:GQ:MQ:IGT:BCOUNT:JGQ:AMQ:SSC
GT:DP:DP4:BQ:FA:VAQ:SS:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU:IGT:BCOUNT:GQ:JGQ:MQ:AMQ:SSC
GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:AD:FA:FDP:SDP:SUBDP:AU:CU:GU:TU
GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:FA
GT:AD:BQ:SS:DP:FDP:SDP:SUBDP:AU:CU:GU:TU:FT:DP4:FA:VAQ
GT:DP:DP4:BQ:FA:VAQ:SS:FT:GQ:MQ:AD:IGT:BCOUNT:JGQ:AMQ:SSC
GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:AD:FA
GT:AD:BQ:SS:DP:FDP:SDP:SUBDP:AU:CU:GU:TU:FT:DP4:FA:VAQ:GQ:MQ
GT:DP:DP4:BQ:FA:VAQ:SS:FT:IGT:BCOUNT:GQ:JGQ:MQ:AMQ:SSC
GT:AD:BQ:SS:DP:FDP:SDP:SUBDP:AU:CU:GU:TU:FT:DP4:FA:VAQ:IGT:BCOUNT:GQ:JGQ:MQ:AMQ:SSC
GT:DP:DP4:BQ:FA:VAQ:SS:FT:IGT:BCOUNT:GQ:JGQ:MQ:AMQ:SSC:AD
GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:DP4
GT:DP:DP4:BQ:FA:VAQ:SS:FT
GT:DP:DP4:BQ:FA:VAQ:SS:FT:GQ:MQ:AD
I'm not sure what the right strategy is for annotating this kind of VCF file, and I would really appreciate any help from anyone who has encountered this kind of VCF formatting.
Disclaimers: 1) I have the right authorization to use the data 2) I have emailed TCGA regarding this issue and no solution was given 3) I have emailed Washington University a few weeks ago and haven't received any reply
How do you want to annotate the VCF: with another program like SnpEff or VEP, or with custom scripts? The former should not be a problem as long as the VCF is valid; for the latter, try a VCF parsing library like PyVCF, which will keep track of the FORMAT tags for you.
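If you go the custom-script route, the key point is that FORMAT is defined per record, so each sample column has to be decoded against the FORMAT string on its own line. Below is a minimal plain-Python sketch of that idea (it is not PyVCF's API, just an illustration of what such a library does for you). It assumes the file is tab-delimited, that the column header line starts with `CHROM` or `#CHROM`, and that the Varscan columns can be recognised by the `[VarscanSomatic]` suffix seen in your header. The function name `parse_varscan_samples` and the demo records with sample names `N-[Sniper]` and `N-[VarscanSomatic]` are made up for illustration.

```python
import io


def parse_varscan_samples(handle, caller_tag="[VarscanSomatic]"):
    """Yield (chrom, pos, ref, alt, samples) for each record.

    `samples` maps each sample column whose name ends with `caller_tag`
    to a dict of {FORMAT tag: value}, built from that record's own
    FORMAT string.
    """
    sample_names = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if fields[0].lstrip("#") == "CHROM":
            # Column header line: everything after FORMAT is a sample name.
            sample_names = fields[9:]
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        format_keys = fields[8].split(":")  # FORMAT of *this* record
        samples = {}
        for name, column in zip(sample_names, fields[9:]):
            if not name.endswith(caller_tag):
                continue  # ignore columns from the other callers
            # Pair each tag with its value; a bare "." yields {"GT": "."}.
            samples[name] = dict(zip(format_keys, column.split(":")))
        yield chrom, int(pos), ref, alt, samples


# Hypothetical two-record example with different FORMAT orderings.
demo = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tN-[Sniper]\tN-[VarscanSomatic]\n"
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT:DP:FT\t.\t0/1:30:PASS\n"
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT:FT:DP\t.\t0/0:PASS:12\n"
)
for chrom, pos, ref, alt, samples in parse_varscan_samples(demo):
    print(chrom, pos, ref, alt, samples)
```

Because the tags are looked up by name rather than by position, the order of tags in FORMAT no longer matters; a tag that is absent from a given record is simply missing from that record's dict, so downstream code should use `.get()` rather than direct indexing.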