We have SNP/indel variants from 97 patient exomes in a SQL database (a mix of family members showing autosomal dominant inheritance and unrelated individuals with autosomal dominant traits).
Variants are annotated using SeattleSeq Annotation plus in-house Perl scripts, giving the gene, consequence (i.e. missense, nonsense, etc.), frequency in 1000 Genomes, frequency in the 6500 exomes from the NHLBI Exome Sequencing Project, and frequency within our own cohort. We have also used Illumina's BodyMap 2.0 RNA-Seq data to crudely rank each gene's expression level in the heart (we are a molecular cardiology group).
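For illustration, the expression-rank annotation amounts to ranking a per-tissue expression table and mapping the rank onto the variant table. A minimal sketch; the file names, column names, and FPKM units are assumptions, not our actual pipeline:

```python
import pandas as pd

# Hypothetical per-tissue expression table (genes x tissues, e.g. FPKM values)
# derived from Illumina BodyMap 2.0; file and column names are assumptions.
expr = pd.read_csv("bodymap2_fpkm.tsv", sep="\t", index_col="gene")

# Rank genes by heart expression (rank 1 = most highly expressed in heart).
heart_rank = expr["heart"].rank(ascending=False, method="min").astype(int)

# Annotate a variant table (with a 'gene' column) with that rank.
variants = pd.read_csv("annotated_variants.tsv", sep="\t")
variants["heart_expression_rank"] = variants["gene"].map(heart_rank)
```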
For a family study, we begin by querying for heterozygous variants that are present in all affected members of a family, have a frequency below 1% in both 1000 Genomes and the ESP6500 exomes, and are annotated as missense, nonsense, or splice-site. Of the variants retrieved, we can prioritise based on the gene's plausibility for the disease, cardiac RNA-Seq expression level, and conservation of the nucleotide and amino acid, and then genotype the remaining family members to look for co-segregation with the disease. We can also use the SQL database to query for other rare variants in the same gene in additional unrelated cases with the same disease.
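A minimal sketch of that first-pass query against a SQLite version of such a database; the table names, column names, and sample IDs here are hypothetical, not our actual schema:

```python
import sqlite3

con = sqlite3.connect("exome_variants.db")  # hypothetical database file

# Hypothetical schema: genotypes(variant_id, sample_id, genotype),
# variants(variant_id, gene, consequence, kg_freq, esp6500_freq).
affected = ["FAM1_II-1", "FAM1_II-3", "FAM1_III-2"]  # affected members of one family

query = f"""
SELECT v.variant_id, v.gene, v.consequence, v.kg_freq, v.esp6500_freq
FROM variants v
JOIN genotypes g ON g.variant_id = v.variant_id
WHERE g.sample_id IN ({",".join("?" * len(affected))})
  AND g.genotype = 'het'
  AND IFNULL(v.kg_freq, 0) < 0.01
  AND IFNULL(v.esp6500_freq, 0) < 0.01
  AND v.consequence IN ('missense', 'nonsense', 'splice-site')
GROUP BY v.variant_id
HAVING COUNT(DISTINCT g.sample_id) = ?   -- heterozygous in every affected member
"""
rows = con.execute(query, affected + [len(affected)]).fetchall()
```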
For unrelated cases with the same disease, we tend first to look for rare missense, nonsense, and splice-site variants in known causal genes (which explain roughly 50% of cases), and then expand the search to other candidate genes, perhaps genes in the same (KEGG) pathway, genes with high expression in the heart, or genes previously implicated in the disease.
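That two-tier search is essentially gene-list membership filtering; a hedged sketch, where the gene-list files and column names are placeholders:

```python
import pandas as pd

# Rare, protein-altering variants from the unrelated cases (file name is a placeholder).
variants = pd.read_csv("unrelated_cases_rare_functional.tsv", sep="\t")

# Hypothetical gene lists: known causal genes, then a wider candidate set
# (e.g. genes in the same KEGG pathway or highly expressed in heart).
known = {g.strip() for g in open("known_causal_genes.txt")}
candidates = {g.strip() for g in open("candidate_genes.txt")}

tier1 = variants[variants["gene"].isin(known)]               # known causal genes first
tier2 = variants[variants["gene"].isin(candidates - known)]  # then the expanded search
```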
Other options include VAAST to rank high-priority variants, and PLINK/SEQ or the EPACTS software for a gene burden test comparing cases with controls, to look for genes with an over-representation of rare variants (we have not yet tried either).
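For context, the simplest form of such a burden test is a collapsing (CAST-style) test: count carriers of rare functional variants per gene in cases versus controls and test each gene with Fisher's exact test. This is only a minimal sketch of the idea, not what EPACTS or PLINK/SEQ actually run, and the input format is assumed:

```python
from scipy.stats import fisher_exact

def collapsing_burden_test(carriers, n_cases, n_controls):
    """carriers: dict mapping gene -> (case carriers, control carriers) of rare variants."""
    results = {}
    for gene, (case_c, ctrl_c) in carriers.items():
        table = [[case_c, n_cases - case_c],
                 [ctrl_c, n_controls - ctrl_c]]
        odds_ratio, p = fisher_exact(table, alternative="greater")
        results[gene] = (odds_ratio, p)
    return results

# Toy numbers only: 100 cases vs 500 controls.
print(collapsing_burden_test({"GENE_A": (12, 8), "GENE_B": (30, 140)}, 100, 500))
```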
You might want to look into the terms "burden test" and "non-burden test"; these come up often in sequencing-based association studies (GWAS-style analyses of NGS data), and I don't see why you would need new controls for such a study. Also brush up on models of inheritance, such as compound heterozygotes. I don't know of any libraries that deal specifically with filtering for these events, since the operations are fairly trivial.
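For example, a compound-heterozygote filter is only a few lines once variants are annotated with gene and genotype: flag genes carrying two or more rare heterozygous variants in the same individual (with trio data you would additionally require one variant inherited from each parent). A rough sketch with assumed file and column names:

```python
import pandas as pd

# Rare functional variants, one row per call; column names are assumptions.
variants = pd.read_csv("rare_functional_variants.tsv", sep="\t")

# Keep heterozygous calls, then count distinct rare het variants per sample per gene.
hets = variants[variants["genotype"] == "het"]
counts = hets.groupby(["sample_id", "gene"])["variant_id"].nunique()

# Candidate compound-het genes: >= 2 rare heterozygous variants in one individual.
compound_het_candidates = counts[counts >= 2].reset_index()
print(compound_het_candidates)
```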
There is precedent for using exomes from another project as controls (for example, the NHLBI EVS is used this way, as are other exomes generated by the same lab for other projects). If you are dealing with a rare disease, one can argue that it is unlikely the causative variant(s) will also be found in exomes from individuals with different phenotypes.
Is it a common or a rare disease?
Let's say about 40 cases per 100K, though I'm interested in how this would change your recommendations.
Common disease: at first glance, you cannot simply remove all the known SNPs from dbSNP, 1000 Genomes, etc.
Thanks for posting this question. I have a similar situation with 100 case-only exomes (the individuals are unrelated). Besides a standard filtering-based approach like ANNOVAR, what other approaches have you used to analyse your case-only data?
I have tried VAAST (1.0.4) using the supplied 1000 Genomes data as controls, but ended up with too many false positives, even after restricting the analysis to regions covered at >10x in both our cases and the 1000G controls.
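(On the coverage restriction: one way to build that shared callable region set is to intersect per-dataset >=10x intervals, e.g. with `bedtools intersect -a cases_ge10x.bed -b 1000g_ge10x.bed`, or with a small interval sweep like the sketch below; the BED file names are placeholders.)

```python
def read_bed(path):
    """Read a simple 3-column BED file into (chrom, start, end) tuples."""
    with open(path) as fh:
        return [(f[0], int(f[1]), int(f[2])) for f in (line.split() for line in fh)]

def intersect(a, b):
    """Naive pairwise interval intersection (fine as a sketch for modest region counts)."""
    out = []
    for chrom, start, end in a:
        for c2, s2, e2 in b:
            if c2 == chrom and s2 < end and start < e2:
                out.append((chrom, max(start, s2), min(end, e2)))
    return out

callable_regions = intersect(read_bed("cases_ge10x.bed"), read_bed("1000g_ge10x.bed"))
```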
Not sure whether I should post this as a separate question, but I was wondering whether you have heard of anyone successfully using public NGS data for more than just allele frequencies (e.g. as controls in a gene burden test)?
Most people I have talked to were against using public data directly as controls for gene burden test analysis because of the large biases/differences between our data and the public data.