Hi Team ...
I want to use bcftools, or any other non-hail way, to achieve below filtering on my vcf files ...
#filter for samples that are > 95% call rate
hail way
mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.95)
I'm very new to the bioinformatics world and have used bcftools till now to do following tasks ...
- normalization of vcf files ( via norm option of bcftools )
- filtering samples from vcf file ( by passing -s and -S flags to view option of bcftools )
- basic column based filtering like, via filter -i 'FILTER=="PASS" || FILTER=="MONOALLELIC"'
- concatenating blocks of vcf files ( via concat option of bcftools )
Can anyone please help me to achieve filtering for samples that have 95% of call rate?
or at least break down the steps required to achieve it and I will try to decode the manual of bcftools to achieve those steps ..
I tried finding solution in already answered question but didn't find a solution ... if someone has bookmarked the solution then please share that ...
May be this would help others too who don't want to use hail for basic filtering ...
I would appreciate any help, any pointer ...
thanks team ...
ps : if someone know a good video lecture or course that walks through some example scenarios with bcftools then please do share ...
What is "call rate" and how is it measured? I have a hunch that hail does QC while converting VCF to the custom format it uses (VDS?) and that's how you're getting ready access to this metric. If hail calculates it on the fly, maybe we can figure out a way to do it using bcftools.
IMHO , I think call rate is as defined by @Kevin Blighe in below thread !!!
What is SNP call rate?
Though he talks only about SNP ... in VCF context we can consider all the variants like indels, SNPs, etc
Much Thanks
So you're looking for a fraction of samples that have a called GT. I took a quick look at the bcftools manual, and
F_MISSING
seems like the computed metric you're looking for.So,
Thanks a ton @_r_am, I will try to dig into this aspect of bcftools ... this tool is so so extensive !!! I would also think through if this metrics matches with the calculation done by hail I had added some code and comment from the hail documentation( you added response before I could read through the hail code, I don't know how you do what you do, I had seen you also moderated my original question to make portion of it as "code" ... much respect !!!)...
I again thank you for your time and energy @_r_ram ... muchas gracias !!!
One more quick question _r_am ...
if I want to filter my sample who have mean of DP i.e. Approximate read depth, higher than a particular value say 20 .... then can I use AVG(DP)>20 directly like below?
asking as there is another function in hail which I would like to use, and your answer above has probably nudged me in right direction ... just want to confirm, as you know your kind response would really help !!!
thanks ... much respect !!!
I don't know, does it work? I cannot predict if bcftools' output will be concordant to your expectations, I can only point you to a feature. Why don't you run a few tests and see if the hail output corresponds to bcftools output?
thanks for the kind response _r_am ... hail requires cluster enviornment and thus I wanted to skip it altogether, as I don't want to add more complexity to my pipeline ...
I didn't mean to ask if they will give same response, but if "logically" AVG(DP)> {INTEGER} would do filtering of samples which have average DP more than the integer values...
I again thank you for taking out time to respond to beginners like me ...
Much Thanks ...
I mean, I think it would, but I can't be sure. Plainly speaking, I don't want you to take my word for it as I might be wrong. Test it yourself.
From the documentation of the HAIL
this is how call_rate is calculated
where,
( which I think would translate to total number of variants in vcf file)
They also mention that n_called , call_rate etc are calculated based from GT ...
digging further in source code of hail provides below info
where
comments in code