Identifying De Novo Variants In Trio Data
12.4 years ago
Vivek ★ 2.7k

I have trio datasets that I have phased using GATK's PhaseByTransmission and ReadBackedPhasing walkers.

My goal is to identify de novo mutations in this data.

I'm building a set of candidate de novo mutations by selecting variants that are present in the offspring but in neither parent, and by looking for variant sites with Mendelian violations.
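The two selection criteria above can be sketched roughly as follows. This is a minimal sketch, assuming autosomal sites with no missing calls and genotypes already parsed into allele-index tuples such as `(0, 1)`; the function names are illustrative and are not part of GATK.

```python
from itertools import product

def is_mendelian_violation(child, father, mother):
    """True if no pairing of one paternal and one maternal allele gives the child's genotype."""
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) not in possible

def is_candidate_de_novo(child, father, mother):
    """True if the child carries a non-reference allele that neither parent carries."""
    parental = set(father) | set(mother)
    return any(allele != 0 and allele not in parental for allele in child)

# Genotypes are allele indices: 0 = reference, 1 = first alternate allele.
trio = {"child": (0, 1), "father": (0, 0), "mother": (0, 0)}
print(is_mendelian_violation(**trio), is_candidate_de_novo(**trio))
```

Note that the two tests are not equivalent: a `(1, 1)` child with `(0, 1)` and `(0, 0)` parents is a Mendelian violation, but the alternate allele is present in a parent, so it is not flagged by the second function.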

I'd like to know how to filter this candidate set so that I can confidently separate true de novo variants from the rest.

I'd appreciate any input or ideas on a methodology for this analysis.

variant gatk differential-expression • 9.3k views

Hi:

Have you managed to identify de novo variants in your trio data? I am working on the same problem now. Could you please share your workflow or some scripts with me?

Thanks in advance!

12.4 years ago

Your workflow might look something like this:

Generate VCF files of SNPs and indels for your trios with GATK, then annotate them with ANNOVAR or SeattleSeq. Also use GATK to calculate depth of coverage over your target regions.

Start by filtering out SNPs that are in dbSNP -- these are not likely to be pathogenic variants (but could be rare disease alleles, so be careful, you may need to go back and reanalyze...)

If you have scripting skills in something nice like Perl or Python, write a couple of scripts to pull out nonsynonymous (nonsense, missense) variants that obey your hypotheses (you mention de novo/sporadic). This gives you a shortened list of potential disease-causing variants.

Your depth-of-coverage data should let you weed out further variants in low-coverage regions, where calls may be crap. Then again, be careful: they might be real, and you may need to go back and reanalyze...

Run your shorter list of variants against the Exome Variant Server to kick out the variants seen there; these are likely to be not-so-rare alleles that do not cause disease.

Mix well, and repeat steps as needed. Remember, you may need to alter key parameters at each step and reanalyze... If you are unlucky, you may need to pull in gene ontology data or data about gene function in other organisms to help you rank variants...

Finally, any variants you identify need to be validated with Sanger... and then the fun begins. You need to validate further by sequencing in larger cohorts or do some functional wet-lab experiments to generate biologically relevant data.

Good luck!


Thanks for the input. I already annotate my VCFs with data from dbSNP, 1000 Genomes and ESP, so I can remove the variants with relatively high allele frequencies in these databases.
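A frequency filter of this kind might look like the sketch below. It assumes each variant has already been parsed into a dict; the keys `dbsnp_af`, `kg_af` and `esp_af` and the 0.001 cutoff are hypothetical choices for illustration, not fields or thresholds defined by any of these databases.

```python
def is_rare(variant, max_af=0.001, af_keys=("dbsnp_af", "kg_af", "esp_af")):
    """Keep a variant only if every available population frequency is <= max_af.

    A missing frequency (None or absent key) is treated as 'not observed'.
    """
    for key in af_keys:
        af = variant.get(key)
        if af is not None and af > max_af:
            return False
    return True

# Made-up records, for illustration only.
variants = [
    {"id": "v1", "dbsnp_af": None, "kg_af": None, "esp_af": None},
    {"id": "v2", "dbsnp_af": 0.12, "kg_af": 0.10, "esp_af": None},
    {"id": "v3", "dbsnp_af": None, "kg_af": 0.0004, "esp_af": None},
]
rare = [v["id"] for v in variants if is_rare(v)]
print(rare)
```

For de novo discovery, a stricter rule (drop anything seen at all in a population database) is also common; that corresponds to `max_af=0` here.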

With a quick pass in Perl I still end up with quite a high number of candidates, so I will likely need further filtering criteria.

The read coverage for my data is around 90x, which should be high enough to expect good-quality variant calls.


Yes, that can be the way it goes with sporadics: there can be more than enough sporadic variants. Do your trios have the same phenotype? If so, look for de novo nonsynonymous variants in genes shared among your probands. The same phenotype can also be caused by mutations in different genes in the same pathway, so some pathway analysis may help you. Are there known genes that cause a phenotype similar to the one you are studying? If so, look for variants in genes in the same pathways (assuming any of these data are known... often they are not...)
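Looking for genes hit in more than one proband could be sketched like this, assuming you already have, per trio, a list of the genes carrying candidate de novo variants. The trio IDs and gene names are placeholders.

```python
from collections import Counter

def recurrent_genes(denovo_by_proband, min_probands=2):
    """Return genes with candidate de novo variants in at least min_probands probands."""
    counts = Counter()
    for genes in denovo_by_proband.values():
        counts.update(set(genes))  # count each gene once per proband
    return sorted(gene for gene, n in counts.items() if n >= min_probands)

# Placeholder input: proband ID -> genes with candidate de novo variants.
calls = {
    "trio1": ["GENE_A", "GENE_B"],
    "trio2": ["GENE_B", "GENE_B", "GENE_C"],
    "trio3": ["GENE_C", "GENE_B"],
}
print(recurrent_genes(calls))
```

The same counting works at the pathway level if you first map each gene to its pathway(s).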


I need to estimate the de novo mutation rate as well, so I don't think I can confine myself to nonsynonymous mutations. However, filtering on read depth at the candidate positions seems like a promising option.
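For the rate itself, a commonly used per-base, per-generation estimate divides the number of de novo variants by twice the number of callable sites (twice because the child inherits two haploid genomes). A minimal sketch, with made-up input numbers and an illustrative function name:

```python
def de_novo_rate(n_de_novo, callable_sites):
    """Per-base, per-generation rate: de novo count / (2 * callable haploid sites)."""
    if callable_sites <= 0:
        raise ValueError("callable_sites must be positive")
    return n_de_novo / (2 * callable_sites)

# Illustrative numbers only, not results from any dataset.
rate = de_novo_rate(n_de_novo=60, callable_sites=2_500_000_000)
print(f"{rate:.2e}")
```

The denominator should only count positions where a de novo call could actually have been made in all three samples, which is why the depth-based filtering matters for the rate as well as for the candidate list.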

I could remove sites that have a low number of reads supporting the variant call in any of the trio samples.
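One way to express such a read-support filter is sketched below. It assumes per-sample allelic depths have been pulled out as `(ref_reads, alt_reads)` pairs (for example from the VCF `AD` field); all thresholds are arbitrary starting points to tune, not recommended values.

```python
def passes_read_support(trio_ad, min_depth=20, min_child_alt=5,
                        min_child_frac=0.25, max_parent_alt=0):
    """trio_ad maps 'child', 'father', 'mother' to (ref_reads, alt_reads)."""
    # Every sample needs enough total reads for its genotype to be trusted.
    for ref, alt in trio_ad.values():
        if ref + alt < min_depth:
            return False
    c_ref, c_alt = trio_ad["child"]
    depth = c_ref + c_alt
    if depth == 0 or c_alt < min_child_alt or c_alt / depth < min_child_frac:
        return False
    # A de novo allele should have (almost) no supporting reads in the parents.
    return all(trio_ad[p][1] <= max_parent_alt for p in ("father", "mother"))

good = {"child": (45, 40), "father": (88, 0), "mother": (92, 0)}
bad = {"child": (45, 40), "father": (80, 3), "mother": (92, 0)}
print(passes_read_support(good), passes_read_support(bad))
```

The parental alt-read check is what removes sites where a parent is really heterozygous but was called homozygous reference.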

12.4 years ago
JC 13k

You will definitely need strong coverage and call-quality statistics at each candidate position, because a large portion of the candidates will be sequencing artefacts (each platform has its own biases). I also double-check with other SNP callers (samtools, VarScan, ...).


The variants themselves are the intersection of the calls from the GATK and samtools callers, but the phasing was done using GATK walkers. I'm also looking through relevant publications for existing methods.
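A caller intersection of this kind can be sketched as a set operation on normalized variant keys. This assumes both call sets were parsed into dicts with `chrom`, `pos`, `ref` and `alt` fields and that indels are represented identically by both callers; the records shown are placeholders.

```python
def variant_keys(records):
    """Reduce parsed VCF records to hashable (chrom, pos, ref, alt) keys."""
    return {(r["chrom"], r["pos"], r["ref"], r["alt"]) for r in records}

# Placeholder records standing in for parsed GATK and samtools call sets.
gatk_calls = [
    {"chrom": "1", "pos": 1000, "ref": "A", "alt": "G"},
    {"chrom": "1", "pos": 2000, "ref": "C", "alt": "T"},
]
samtools_calls = [
    {"chrom": "1", "pos": 1000, "ref": "A", "alt": "G"},
    {"chrom": "2", "pos": 500, "ref": "G", "alt": "A"},
]
shared = variant_keys(gatk_calls) & variant_keys(samtools_calls)
print(sorted(shared))
```

Matching on all four fields rather than position alone avoids treating two different alleles at the same site as agreement.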


This is really true if coverage is low. With good-quality coverage (~90-100x), however, my experience is that the proportion of sequencing artifacts remaining after running through BWA and samtools/Picard is rather low. By the time you are done with GATK and have generated VCF files, you should be dealing with mostly good-quality calls.

8.2 years ago
daniel ▴ 30

Just thought I'd shamelessly give our new haplotype-based variant caller, octopus, a mention here. It has a built-in trio model that can classify called variants as de novo. There is no need for read pre-processing or messy post-hoc VCF intersections. Calls are phased by default.

We are in an early alpha release right now but are eager to get feedback, especially on the de novo calling (octopus also has standard germline calling, and a somatic caller built in).


I'm interested in using your variant caller, but the link you provided isn't working. Could you please provide updated information?


Octopus is now back online - the link should now work.

