I have raw whole-genome sequencing data for a fish trio: father, mother and offspring. I would like to know how many SNV loci are present in the child but in neither parent (i.e. de novo SNV loci).
It's possible, but with trio data it is far better to call variants on all three samples together (make sure each BAM file has an @RG line in its header with the appropriate SM sample ID), as joint calling significantly affects the non-reference likelihood in the offspring, even at moderate-coverage (~15x) sites.
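As a quick sanity check before joint calling, here is a minimal Python sketch that pulls the SM values out of the @RG lines of a SAM/BAM header. It assumes you have already dumped the header as text (for example with samtools view -H); the function name read_group_samples and the example header are made up for illustration.

```python
def read_group_samples(header_text):
    """Return the set of SM values found on @RG lines of a SAM/BAM header."""
    samples = set()
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # @RG fields are tab-separated TAG:VALUE pairs after the record type
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                samples.add(field[3:])
    return samples


# Hypothetical header text for one BAM file
header = "@HD\tVN:1.6\tSO:coordinate\n@RG\tID:lane1\tSM:child\tPL:ILLUMINA\n"
print(read_group_samples(header))
```

Each of the three BAMs should report exactly one, distinct sample name; an empty set means the @RG line or its SM tag is missing.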
VCFtools makes this a pain to filter. Most of the sites can be obtained by splitting the calls into per-individual VCFs, filtering with --max-non-ref-ac 0 on the parents to obtain hom-ref sites, filtering with --non-ref-ac 1 --max-non-ref-ac 1 on the child to obtain heterozygous sites (--max-non-ref-ac 1 alone would also keep the child's hom-ref sites), and then intersecting with vcf-isec. This gives sites of 0|0 0|0 1|0 (father, mother, child), which should be the vast majority of de novos.
You have to do this again with --non-ref-ac 2 on the parents to get 1|1 1|1 1|0 sites, which should be the minority. Note that 1|1 2|2 1|2 multi-allelic sites are excluded by the requirement that the child have one reference allele, which is correct since that genotype is consistent with inheritance. However, you need to do this yet another time to find any 1|1 1|1 1|2 sites (de novo compound het).
Generally I just use JEXL expressions in GATK (or a simple Python script) to do this.
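For the "simple Python script" route, here is a minimal sketch using only the standard library. It is not the script I actually use, just an illustration under these assumptions: the VCF is jointly called, diploid, has GT as the first FORMAT field, and the sample names (father, mother, child here) are whatever your SM tags were. It flags SNV sites where the child carries an allele seen in neither parent, which covers the 0|0 0|0 1|0, 1|1 1|1 1|0 and 1|1 1|1 1|2 cases above; it does not catch other Mendelian violations and applies no depth or genotype-quality filter, which you would want on real data.

```python
import re


def parse_gt(sample_field):
    """Return the genotype as a tuple of allele indices, or None if missing or not diploid."""
    gt = sample_field.split(":")[0]  # assumes GT is the first FORMAT key
    parts = re.split(r"[/|]", gt)
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return None
    return tuple(int(p) for p in parts)


def is_de_novo(father, mother, child):
    """True if the child carries an allele absent from both parents."""
    parental = set(father) | set(mother)
    return any(allele not in parental for allele in child)


def de_novo_sites(vcf_lines, father_id, mother_id, child_id):
    """Return (chrom, pos, ref, alt) for candidate de novo SNV sites."""
    cols = None
    hits = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            cols = {name: i for i, name in enumerate(fields)}
            continue
        if cols is None:
            raise ValueError("no #CHROM header line before the first record")
        ref, alt = fields[3].upper(), fields[4].upper()
        # Keep SNVs only: every allele must be a single A/C/G/T base
        bases = [ref] + alt.split(",")
        if any(len(b) != 1 or b not in "ACGT" for b in bases):
            continue
        gts = [parse_gt(fields[cols[s]]) for s in (father_id, mother_id, child_id)]
        if None in gts:
            continue
        if is_de_novo(*gts):
            hits.append((fields[0], int(fields[1]), ref, alt))
    return hits


# Made-up records for illustration
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tfather\tmother\tchild",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/0\t0/1",    # new alt allele in child
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/1\t0/0\t0/1",    # inherited from father
    "chr1\t300\t.\tG\tA,C\t50\tPASS\t.\tGT\t1/1\t1/1\t1/2",  # new second alt allele
    "chr1\t400\t.\tT\tC,G\t50\tPASS\t.\tGT\t1/1\t2/2\t1/2",  # consistent with inheritance
]
sites = de_novo_sites(example, "father", "mother", "child")
print(len(sites), sites)
```

For a real bgzipped file you could pass gzip.open(path, "rt") as vcf_lines, since the function only needs an iterable of text lines.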
1) How is this different from your previous question, "How to detect de novo variants with trio-data"?
2) vcftools is deprecated.
I asked this question because I thought this method could also be used to select de novo SNVs. However, I now understand that it is not appropriate. Thank you.
I have one more question: how does the mendelian plugin differ from the trio-dnm2 plugin in bcftools?