Hello, I'm looking for a quick way to remove PGT, PID, and PS from the FORMAT field of a VCF file output by GATK. Currently, some sites have these tags while others don't (see example below). For downstream processing in another program, I need all sites to have the same FORMAT tags, and phasing information doesn't matter, so the easiest way to achieve this would be to remove the phasing data entirely from the VCF file. Do you have any suggestions for how to do this? It feels like it should be easy enough to do using sed or awk, but I can't figure it out.
chr1 1 . G A 100 . DP=10 GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0
chr2 4 . C T 100 . DP=10 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:1,0:1:3:.:.:0,3,45:.
chr3 2 . A T 100 . DP=10 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:1,2:3:36:0|1:153_C_T:81,0,36:153
I can use sed -e 's/:\.:\.:/:/g' | sed -e 's/:\.\t/\t/g' | sed -e 's/GT:AD:DP:GQ:PGT:PID:PL:PS/GT:AD:DP:GQ:PL/g'
to remove the empty phasing placeholders (result shown below), but I can't figure out how to deal with the third case, where there is actual phased data.
chr1 1 . G A 100 . DP=10 GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0
chr2 4 . C T 100 . DP=10 GT:AD:DP:GQ:PL 0/0:1,0:1:3:0,3,45
chr3 2 . A T 100 . DP=10 GT:AD:DP:GQ:PL 0|1:1,2:3:36:0|1:153_C_T:81,0,36:153
This is what I would like as the final result. Transforming 0|1 into 0/1 is straightforward, but I'm having a difficult time figuring out how to remove the values in the PGT, PID, and PS fields when they aren't present consistently across sites.
chr1 1 . G A 100 . DP=10 GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0
chr2 4 . C T 100 . DP=10 GT:AD:DP:GQ:PL 0/0:1,0:1:3:0,3,45
chr3 2 . A T 100 . DP=10 GT:AD:DP:GQ:PL 0/1:1,2:3:36:81,0,36
Thank you in advance!
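For reference, here is a minimal awk sketch of one way to get the result above. It assumes the real file is tab-separated with FORMAT in column 9 and samples from column 10 onward, as in a standard VCF. The function name strip_phasing and the drop variable are made up for illustration. Instead of pattern-matching the values, it looks up where each unwanted tag sits in FORMAT for that line and drops the value at the same position in every sample, so it doesn't matter whether a site has the tags or not.

```shell
#!/bin/sh
# Sketch only: strip_phasing is a hypothetical helper, not part of GATK or bcftools.
# Assumes tab-separated VCF, FORMAT in column 9, samples in columns 10..NF.
strip_phasing() {
  awk -v drop="PGT,PID,PS" '
    BEGIN {
      FS = OFS = "\t"
      n = split(drop, names, ",")
      for (i = 1; i <= n; i++) unwanted[names[i]] = 1
    }
    # Header lines pass through unchanged (the ##FORMAT definitions stay in place).
    /^#/ { print; next }
    {
      # Rebuild FORMAT without the unwanted tags.
      nkeys = split($9, keys, ":")
      fmt = ""; sep = ""
      for (k = 1; k <= nkeys; k++) {
        if (keys[k] in unwanted) continue
        fmt = fmt sep keys[k]; sep = ":"
      }
      $9 = fmt
      # Drop the values at the same positions in every sample column.
      for (s = 10; s <= NF; s++) {
        nvals = split($s, vals, ":")
        out = ""; sep = ""
        for (k = 1; k <= nvals; k++) {
          if (keys[k] in unwanted) continue
          # Unphase the genotype: 0|1 becomes 0/1.
          if (keys[k] == "GT") gsub(/\|/, "/", vals[k])
          out = out sep vals[k]; sep = ":"
        }
        $s = out
      }
      print
    }
  ' "$@"
}
```

Usage would be something like strip_phasing in.vcf > out.vcf (file names are placeholders). If bcftools is available, bcftools annotate -x FORMAT/PGT,FORMAT/PID,FORMAT/PS should remove the same tags, though as far as I know it leaves the phased GT separator untouched.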
Awesome, thank you! I didn't know bcftools could get rid of tags in addition to writing new ones.