Hello all,
I am a beginner in this field and am trying to identify de novo variants in a trio vcf file (parents unaffected, child affected). First, I used the PhaseByTransmission tool and then created a new .vcf file consisting of only unphased variants ("/" instead of "|"). To my knowledge, de novo variants cannot be phased by the tool because they are not transmitted from the parents.
Here is my question: what are the next steps to identify de novo variants more accurately? There are almost 500k unphased variants in the final vcf file, and I don't think it is possible that all of these candidates are true de novo variants.
Thank you so much for your help!
Best regards,
Unless I'm wrong, de novo variants are not related to the phasing information.
True, but you could just filter for variants which are found in the child and not in the parents. Phasing might work to filter a bit, but why would you?
There are of course also other reasons why a variant didn't get phased. So, no, these are not all de novo.
So basically you mean that I should filter for 0/1 or 1/1 in the child and 0/0 in both parents. I also applied the Genotype Refinement Workflow of GATK to the .vcf file, and the output vcf file consists of only 0/1 or 1/1 for the child and hom ref for both parents, as expected. But what should I do about other possibilities, for instance 1/1 for the child, 0/1 for the mother and 0/0 for the father? Is there a specific term for this kind of situation?
Essentially, you want to filter for lines in which the number of alternative alleles in the child is higher than the sum of the alternative alleles in the parents.
But I'd say that a scenario where you have 1/0 and 0/0 parents and a 1/1 child is extremely unlikely. Also, you are probably looking for highly penetrant mutations, which you would expect to be heterozygous.
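The allele-counting rule described above can be sketched in Python. This is only a rough illustration working on plain GT strings; the function names `alt_count` and `exceeds_parents` are made up for this example, and it treats every non-reference allele alike, so multiallelic sites would need more care.

```python
import re

def alt_count(gt):
    """Count non-reference alleles in a GT string such as '0/1' or '1|1'.

    Returns None if any allele is missing ('.').
    """
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return None
    return sum(a != "0" for a in alleles)

def exceeds_parents(child_gt, mother_gt, father_gt):
    """True if the child carries more ALT alleles than both parents combined."""
    counts = [alt_count(g) for g in (child_gt, mother_gt, father_gt)]
    if None in counts:
        # A missing genotype cannot support a de novo call.
        return False
    child, mother, father = counts
    return child > mother + father

# Classic de novo candidate: heterozygous child, hom ref parents.
print(exceeds_parents("0/1", "0/0", "0/0"))
# The 1/1 child with 0/1 mother and 0/0 father case asked about above.
print(exceeds_parents("1/1", "0/1", "0/0"))
# Ordinary transmission from the mother.
print(exceeds_parents("0/1", "0/1", "0/0"))
```

Candidates picked this way would still need the usual checks (depth, genotype quality, a look at the reads), since genotyping errors in any of the three samples produce the same pattern.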
You could try to identify genotypes that violate Mendelian rules.
If you have a multisample vcf file, this can be done quite easily with bcftools.
fin swimmer
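To make the idea of a Mendelian violation concrete, here is a small Python sketch. It is not how bcftools does it, just the underlying logic: a diploid child genotype is consistent if it can be built from one maternal and one paternal allele. The helper names and the sample order (child, mother, father) in the example line are assumptions for illustration.

```python
import re
from itertools import product

def genotypes(vcf_line):
    """Return the GT string of every sample on one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    gt_idx = fields[8].split(":").index("GT")  # column 9 is FORMAT
    return [sample.split(":")[gt_idx] for sample in fields[9:]]

def is_mendelian_violation(child_gt, mother_gt, father_gt):
    """True if the child genotype cannot be inherited from these parents."""
    trio = [re.split(r"[/|]", gt) for gt in (child_gt, mother_gt, father_gt)]
    if any("." in g or len(g) != 2 for g in trio):
        # Skip missing or non-diploid genotypes.
        return False
    child, mother, father = trio
    # Every unordered pair made of one maternal and one paternal allele.
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) not in possible

# Hypothetical line; samples assumed to be in the order child, mother, father.
line = "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30\t0/0:28\t0/0:31"
child, mother, father = genotypes(line)
print(is_mendelian_violation(child, mother, father))
```

Unlike simple ALT-allele counting, this also flags cases such as a 0/0 child with two 1/1 parents.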