3.4 years ago
HL
Hi, I have a VCF file with about 60 000 columns. Here is an example of the first three lines:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 10022-20416-17 10024-34469-18A 10025-34469-18B 10034-31625-18A 10035-31625-18B 10036-31625-18C 10042-29083-18 10044-34485-18A 10045-34485-18B 10046-34485-18C 10069-33802-18 10070-20895-17 10072-20901-17 10074-20904-17 10080-20908-17 10109-34224-18 1011-22957-18 10118
2 179391728 . C T 1109.77 PASS BaseQRankSum=-2.601;ClippingRankSum=0;ExcessHet=3.0103;FS=0;MQ=60;MQRankSum=0;QD=11.81;ReadPosRankSum=0.626;SOR=0.76;DP=95;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PL ./.:.:.:.:. ./.:.:.:.:. 0/1:44,47:91:99:1053,0,1069 ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:.
2 179391738 . C G 2090.77 PASS BaseQRankSum=0.25;ClippingRankSum=0;ExcessHet=3.0103;FS=2.282;MQ=60;MQRankSum=0;QD=14.32;ReadPosRankSum=0.857;SOR=0.953;DP=370;AF=0.5;MLEAC=1;MLEAF=0.5;AN=6;AC=3 GT:AD:DP:GQ:PL ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. 0/1:88,68:156:99:2586,0,4687 ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:.
So the columns are many different sample IDs, and for each variant only some of the sample columns contain any information. I would like the output to show, for every line, only the column that contains information, like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 10025-34469-18B
2 179391728 . C T 1109.77 PASS BaseQRankSum=-2.601;ClippingRankSum=0;ExcessHet=3.0103;FS=0;MQ=60;MQRankSum=0;QD=11.81;ReadPosRankSum=0.626;SOR=0.76;DP=95;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PL 0/1:44,47:91:99:1053,0,1069
It would also be important to see in the header the sample ID that this GT:AD:DP:GQ:PL info belongs to. I think this should be possible with awk, but I just don't know how. It would be really good if this could be done with Unix tools.
I don't understand the difference between the two examples.
At the end of every line, all the columns with empty genotype information have been removed.
I don't know if I correctly understood your question... Do you want to output lines where all your 60 000 patients have been genotyped?
No, those I can already print. For every line I would like to print just the one column that contains the genotype information, not the other 60 000 empty columns. Right now the end of every line is mostly "./.:.:.:.:." entries,
and I want to print only the column that contains this:
0/1:44,47:91:99:1053,0,1069
Hmm... sounds like an XY problem. What do you want to do in the end?
I would like to have a file without these "./.:.:.:.:." empty columns, where every variant has only its own genotype information printed at the end of the line, as in the example above.
At the end of the header it's okay to have all the different samples, but it's not necessary.
So basically, if a column contains "./.:.:.:.:." it should not be printed.
Yes, but WHY???!!! You could just split the VCF per sample (see "Splitting vcf files to individual samples") and then use "bcftools view --exclude-uncalled" to keep only the called genotypes.
If I split the file by sample and every sample produces a new file, I would end up with over 60 000 different files, which does not sound very nice to go through.
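For what it's worth, the rule described in this thread (skip any sample column whose genotype is "./.:.:.:.:.") can be sketched in a few lines of Python without splitting the file. This is only a minimal sketch, not a tested tool. It assumes a tab-delimited VCF with the usual 9 fixed columns before the samples. The function names and the CALLED_SAMPLES header label are made up for illustration. Because each line can keep a different sample, each retained genotype is written as sample=genotype, so the result is no longer a valid VCF, just a readable table.

```python
FIXED = 9  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT


def is_called(field):
    """Return True if the GT part of a sample field has at least one called allele."""
    gt = field.split(":", 1)[0]
    alleles = gt.replace("|", "/").split("/")
    return any(allele != "." for allele in alleles)


def compact_vcf_lines(lines):
    """Yield lines with uncalled sample columns dropped.

    Kept genotypes are labelled as sample=genotype (hypothetical output format).
    """
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
        elif line.startswith("#CHROM"):
            cols = line.split("\t")
            samples = cols[FIXED:]  # remember sample IDs for labelling
            yield "\t".join(cols[:FIXED] + ["CALLED_SAMPLES"])
        elif line:
            cols = line.split("\t")
            called = [
                f"{sample}={field}"
                for sample, field in zip(samples, cols[FIXED:])
                if is_called(field)
            ]
            yield "\t".join(cols[:FIXED] + called)


# Small in-memory demo with made-up sample IDs S1..S3
demo = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "2\t179391728\t.\tC\tT\t1109.77\tPASS\tDP=95\tGT:AD:DP:GQ:PL"
    "\t./.:.:.:.:.\t./.:.:.:.:.\t0/1:44,47:91:99:1053,0,1069",
]
for out_line in compact_vcf_lines(demo):
    print(out_line)
```

To run this on a real file, pass an open file handle to compact_vcf_lines instead of the demo list. It reads line by line, so the 60 000 columns are only held in memory one record at a time.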