Removing rows with genotype "./." from a VCF file with awk
10.4 years ago
ivivek_ngs ★ 5.2k

Hi,

I need help with awk. My VCF file has 4 samples, so fields $10, $11, $12 and $13 hold the genotype for each row. I want to remove every row in which at least one sample has the genotype ./. and print the remaining rows to another VCF file. Can this be done? I am not very familiar with awk's substr. Any assistance? Below is an example of my VCF file; it does not have a header.

chr3    75787186    rs150410646    C    T    53.89    .    AC=4;AF=0.500;AN=8;BaseQRankSum=-4.341;DB;DP=424;Dels=0.00;FS=0.000;HaplotypeScore=2.2684;MLEAC=4;MLEAF=0.500;MQ=6.41;MQ0=371;MQRankSum=-3.553;QD=0.13;ReadPosRankSum=-1.007    GT:AD:DP:GQ:PL    0/1:63,21:80:48:48,0,127    0/1:25,5:29:21:21,0,64    0/1:142,41:174:10:10,0,94    0/1:95,31:120:6:6,0,120
chr3    75787576    rs141348932    A    G    61.87    .    AC=2;AF=1.00;AN=2;DB;DP=195;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=4.17;MQ0=189;QD=0.69    GT:AD:DP:GQ:PL    ./.    ./.    1/1:68,22:86:9:87,9,0    ./.
chr3    75787583    rs144348996    A    G    100.62    .    AC=2;AF=1.00;AN=2;DB;DP=203;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=4.33;MQ0=197;QD=1.12    GT:AD:DP:GQ:PL    ./.    ./.    1/1:65,25:86:12:126,12,0    ./.
chr3    75787584    rs151027881    C    A    93.62    .    AC=2;AF=1.00;AN=2;DB;DP=203;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=4.33;MQ0=197;QD=1.04    GT:AD:DP:GQ:PL    ./.    ./.    1/1:64,26:86:12:119,12,0    ./.
chr3    75787620    rs145606249    T    C    153.42    .    AC=2;AF=1.00;AN=2;DB;DP=224;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=4.38;MQ0=217;QD=1.70    GT:AD:DP:GQ:PL    ./.    ./.    1/1:52,38:86:18:179,18,0    ./.
chr3    75787728    rs111389701    C    T    643.34    .    AC=8;AF=1.00;AN=8;DB;DP=186;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=8;MLEAF=1.00;MQ=10.21;MQ0=140;QD=3.46    GT:AD:DP:GQ:PL    1/1:0,32:32:3:28,3,0    1/1:0,23:23:9:82,9,0    1/1:0,82:82:51:503,51,0    1/1:0,49:49:6:55,6,0

In other words, if any of columns $10, $11, $12 or $13 has the missing genotype ./., I want to drop that row. Sorry for the formatting, I could not get it to display correctly. Any suggestions?

vcftools snp vcf • 5.5k views

Why don't you use vcftools with the --phase option?


Thanks a lot,

I have figured it out with the command below:

sed '/\.\/\./d' input.vcf > out.vcf

grep -vw with appropriate escaping would do it as well
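
A minimal sketch of the grep route, assuming a POSIX grep. Instead of escaping the dots, it uses -F so that "./." is matched as a fixed string, and -v to keep only the non-matching lines. The file names and the three sample rows are made up for illustration. Like the sed one-liner, it drops a line if "./." appears anywhere on it, not only in the genotype columns.

```shell
# Build a tiny tab-separated sample (hypothetical data, two samples only).
printf 'chr3\t100\t.\tC\tT\t50\t.\tDP=10\tGT:DP\t0/1:5\t0/1:5\n'  > demo.vcf
printf 'chr3\t200\t.\tA\tG\t60\t.\tDP=10\tGT:DP\t./.\t1/1:5\n'   >> demo.vcf
printf 'chr3\t300\t.\tC\tT\t70\t.\tDP=10\tGT:DP\t1/1:5\t1/1:5\n' >> demo.vcf

# -v: print only lines that do NOT match
# -F: treat "./." as a literal string, so no regex escaping is needed
grep -v -F './.' demo.vcf > demo.nomissing.vcf

cat demo.nomissing.vcf
```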

10.2 years ago

You might take a look at vawk, which is a tool from Aaron Quinlan's group and acts as an intelligent wrapper for awk on VCF.


@Matt Shirley, that seems like quite an amazing tool. However, I worked out how to do this with my VCF file and have already posted it as an answer above, though others seem to have missed it. It is a one-liner that removes the rows with missing genotypes. Thanks, everyone, for the other smart approaches, and Matt, thanks for the tool; it looks useful for other things I am interested in.

10.2 years ago
axelwilhelm ▴ 120

Something like

awk 'substr($10,1,3)!="./." && substr($11,1,3)!="./." && substr($12,1,3)!="./." && substr($13,1,3)!="./."' input.vcf > out.vcf
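
For files with a different number of samples, the same kind of test can be written as a loop over every column from 10 onward. This is only a sketch under stated assumptions: tab-separated input, GT as the first FORMAT subfield, and missing calls written exactly as "./." (a phased ".|." would not be caught). The file names and sample rows are invented for the demo, and the header pass-through is an addition, since the file in the question has no header.

```shell
# Build a tiny tab-separated sample (hypothetical data, three samples).
printf 'chr3\t100\t.\tC\tT\t50\t.\tDP=10\tGT:DP\t0/1:5\t0/1:5\t0/1:7\n'   > samples.vcf
printf 'chr3\t200\t.\tA\tG\t60\t.\tDP=10\tGT:DP\t1/1:5\t./.:0\t1/1:5\n'  >> samples.vcf
printf 'chr3\t300\t.\tC\tT\t70\t.\tDP=10\tGT:DP\t1/1:5\t1/1:5\t1/1:9\n'  >> samples.vcf

awk -F'\t' '
/^#/ { print; next }                  # keep header lines, if the file has any
{
    # Columns 10..NF are the per-sample fields; GT is assumed to come first.
    for (i = 10; i <= NF; i++)
        if (substr($i, 1, 3) == "./.")
            next                      # one missing call is enough to drop the row
    print
}' samples.vcf > samples.nomissing.vcf

cat samples.nomissing.vcf
```

Unlike the sed and grep one-liners, this only inspects the genotype columns, so a "./." elsewhere on the line would not cause a row to be dropped.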