Hello,
I am relatively new to bioinformatics and am currently working on a small program that should, among other things, filter a multisample VCF file for all genotypes except one. Seven genotypes have been sampled, and all variants that belong to one of those genotypes are to be "erased" (in other words, every other variant should be copied to a new file).
A few lines from my file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Dom HOR2932 HOR3036 HOR3726 KWSBambina Rec S42IL_124
chr1H 58025 . A G 387.19 . AC=6;AF=0.429;AN=14;DP=114;ExcessHet=8.2628;MLEAC=6;MLEAF=0.429;QD=29.78 GT:AD:AF:DP:GQ:PL 0/1:4,9:0.6000:15:27:34,0,27 0/1:24,0:.:24:47:47,0,527 0/1:12,0:.:12:23:23,0,263 0/1:8,0:.:8:65:65,0,125 0/1:49,0:.:49:99:212,0,962 0/1:5,0:.:5:14:14,0,104 0/0:1,0:.:1:3:0,3,29
chr1H 58051 . T C 82.02 . AC=4;AF=0.286;AN=14;DP=109;ExcessHet=0.0921;MLEAC=4;MLEAF=0.286;QD=2.93 GT:AD:AF:DP:GQ:PL 1/1:1,17:0.8947:19:33:77,33,0 0/0:26,0:.:26:18:0,18,659 0/0:12,0:.:12:6:0,6,299 0/1:2,3:0.4286:7:3:8,0,3 0/0:39,0:.:39:99:0,117,1169 0/1:2,3:0.6000:5:16:16,0,41 0/0:1,0:.:1:3:0,3,29
chr1H 58057 . T C 89.43 . AC=3;AF=0.214;AN=14;DP=112;ExcessHet=1.1394;MLEAC=3;MLEAF=0.214;QD=17.89 GT:AD:AF:DP:GQ:PL 0/0:19,0:.:19:57:0,57,569 0/0:26,0:.:26:51:0,51,749 0/0:12,0:.:12:6:0,6,299 0/1:7,0:.:7:8:8,0,158 0/1:42,0:.:42:83:83,0,923 0/1:3,2:0.4000:5:13:13,0,46 0/0:1,0:.:1:3:0,3,29
What I have gathered from my research so far is that the QUAL column doesn't help, since I have a multisample VCF.
I thought of filtering on the Phred score of each genotype. There are also a lot of posts about bcftools, which I have never used, so I don't know whether it would be the right tool.
I don't expect code or anything; I just need an idea to get on the right track.
Thanks!
Filter for what?
I think OP wants to remove one genotype for all samples from a multisample VCF.
The seven genotypes being 58025AA, 58025AG, 58051TT, 58051TC, 58051CC, 58057TT, 58057TC
I probably got some of the vocabulary wrong. I thought that "Dom HOR2932 HOR3036 HOR3726 KWSBambina Rec S42IL_124" represented my genotypes.
Anyway, what I want to remove from my file is all variants that "belong" to KWSBambina.
Is that possible? How do I identify which of the variants belong to Bambina?
Those are samples. If a sample has a 0/1 or a 1/1 genotype for that variant, they have the variant.
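To make that rule concrete, here is a minimal Python sketch that lists the samples carrying the alternate allele for one record. The function name `carriers` is hypothetical, and the code assumes whitespace-separated fields as pasted above (real VCF files are tab-delimited, which `split()` also handles) and that a missing genotype (`./.`) does not count as carrying the variant.

```python
def carriers(header_line, record_line):
    """Return the names of samples whose GT contains a non-reference allele.

    Assumption: fields are separated by whitespace and GT is listed in FORMAT.
    """
    # The first nine columns are fixed; sample names start at column 10.
    samples = header_line.lstrip("#").split()[9:]
    fields = record_line.split()
    gt_index = fields[8].split(":").index("GT")

    result = []
    for name, entry in zip(samples, fields[9:]):
        gt = entry.split(":")[gt_index]
        # Accept both unphased (0/1) and phased (0|1) genotypes.
        alleles = gt.replace("|", "/").split("/")
        # Any allele other than reference ("0") or missing (".") is a variant allele.
        if any(a not in ("0", ".") for a in alleles):
            result.append(name)
    return result


header = ("#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT "
          "Dom HOR2932 HOR3036 HOR3726 KWSBambina Rec S42IL_124")
# The chr1H:58051 record from the question.
record = ("chr1H 58051 . T C 82.02 . "
          "AC=4;AF=0.286;AN=14;DP=109;ExcessHet=0.0921;MLEAC=4;MLEAF=0.286;QD=2.93 "
          "GT:AD:AF:DP:GQ:PL 1/1:1,17:0.8947:19:33:77,33,0 0/0:26,0:.:26:18:0,18,659 "
          "0/0:12,0:.:12:6:0,6,299 0/1:2,3:0.4286:7:3:8,0,3 0/0:39,0:.:39:99:0,117,1169 "
          "0/1:2,3:0.6000:5:16:16,0,41 0/0:1,0:.:1:3:0,3,29")

print(carriers(header, record))  # ['Dom', 'HOR3726', 'Rec']
```

For this record KWSBambina is 0/0, so it does not carry the variant.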
Your question is ambiguous because you haven't provided an example of what the end result would look like.
You want to remove a sample?
You want to remove a variant that is unique to a certain sample?
You want to remove any variant for which a particular sample is a carrier?
Show us what the inputs and outputs are for a given example.
Ok so I definitely didn't understand at first what my goal was.
The goal is to remove all the variants which are unique to the KWSBambina sample.
The input is my normal VCF, output should only be the variants which are not unique to KWSBambina, copied to a new file.
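One way to approach this, sketched in plain Python without any VCF library: a variant counts as unique to KWSBambina when that sample carries a non-reference allele and no other sample does. The names `has_alt` and `filter_unique_to_sample` are hypothetical, the file names in the usage comment are placeholders, and missing genotypes (`./.`) are assumed not to carry the variant. Genotype quality (GQ) is ignored here, so low-confidence calls are taken at face value.

```python
def has_alt(gt):
    """True if a GT string such as '0/1' or '1|1' contains a non-reference allele."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


def filter_unique_to_sample(lines, target):
    """Yield header lines and every record that is NOT unique to `target`.

    Assumption: a record is "unique" when `target` is the only sample
    carrying a non-reference allele.
    """
    target_col = None
    for line in lines:
        if line.startswith("##"):
            yield line  # meta-information lines are copied unchanged
            continue
        if line.startswith("#CHROM"):
            samples = line.split()[9:]
            target_col = samples.index(target)  # raises ValueError if absent
            yield line
            continue
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if target_col is None:
            raise ValueError("record found before the #CHROM header line")
        gt_index = fields[8].split(":").index("GT")
        flags = [has_alt(entry.split(":")[gt_index]) for entry in fields[9:]]
        unique_to_target = flags[target_col] and sum(flags) == 1
        if not unique_to_target:
            yield line


# Usage sketch (placeholder file names):
# with open("input.vcf") as src, open("filtered.vcf", "w") as dst:
#     dst.writelines(filter_unique_to_sample(src, "KWSBambina"))
```

None of the three records shown in the question would be dropped by this rule, since in each of them either KWSBambina is 0/0 or at least one other sample also carries the variant.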