We have SNP/indel variants from 97 patient exomes in a SQL database (a mix of family members showing autosomal dominant inheritance and unrelated individuals with autosomal dominant traits).
Variants are annotated using SeattleSeq Annotation plus in-house Perl scripts, giving the gene, consequence (i.e. missense, nonsense, etc.), frequency in 1000 Genomes, frequency in the 6500 exomes from the NHLBI Exome Sequencing Project, and frequency within our own cohort. We have also used Illumina's BodyMap 2.0 RNA-Seq data to crudely rank each gene's expression level in the heart (we are a molecular cardiology group).
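For illustration, the expression-rank annotation amounts to ranking a per-tissue expression table and mapping the rank onto the variant table. A minimal sketch; the file names, column names, and FPKM units are assumptions, not our actual pipeline:

```python
import pandas as pd

# Hypothetical per-tissue expression table (genes x tissues, e.g. FPKM values)
# derived from Illumina BodyMap 2.0; file and column names are assumptions.
expr = pd.read_csv("bodymap2_fpkm.tsv", sep="\t", index_col="gene")

# Rank genes by heart expression (rank 1 = most highly expressed in heart).
heart_rank = expr["heart"].rank(ascending=False, method="min").astype(int)

# Annotate a variant table (with a 'gene' column) with that rank.
variants = pd.read_csv("annotated_variants.tsv", sep="\t")
variants["heart_expression_rank"] = variants["gene"].map(heart_rank)
```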
For a family study, we begin by querying for heterozygous variants that are present in all affected members of a family, have a frequency below 1% in both 1000 Genomes and the ESP6500 exomes, and are annotated as missense, nonsense, or splice-site. Of the variants retrieved, we can prioritise based on the gene's plausibility for the disease, cardiac RNA-Seq expression level, and conservation of the nucleotide and amino acid, and then genotype the remaining family members to look for co-segregation with the disease. We can also use the SQL database to query for other rare variants in the same gene in additional unrelated cases with the same disease.
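A minimal sketch of that first-pass query against a SQLite version of such a database; the table names, column names, and sample IDs here are hypothetical, not our actual schema:

```python
import sqlite3

con = sqlite3.connect("exome_variants.db")  # hypothetical database file

# Hypothetical schema: genotypes(variant_id, sample_id, genotype),
# variants(variant_id, gene, consequence, kg_freq, esp6500_freq).
affected = ["FAM1_II-1", "FAM1_II-3", "FAM1_III-2"]  # affected members of one family

query = f"""
SELECT v.variant_id, v.gene, v.consequence, v.kg_freq, v.esp6500_freq
FROM variants v
JOIN genotypes g ON g.variant_id = v.variant_id
WHERE g.sample_id IN ({",".join("?" * len(affected))})
  AND g.genotype = 'het'
  AND IFNULL(v.kg_freq, 0) < 0.01
  AND IFNULL(v.esp6500_freq, 0) < 0.01
  AND v.consequence IN ('missense', 'nonsense', 'splice-site')
GROUP BY v.variant_id
HAVING COUNT(DISTINCT g.sample_id) = ?   -- heterozygous in every affected member
"""
rows = con.execute(query, affected + [len(affected)]).fetchall()
```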
For unrelated cases with the same disease, we tend first to look for rare missense, nonsense, and splice-site variants in known causal genes (which explain roughly 50% of cases), and then expand the search to other candidate genes, perhaps genes in the same (KEGG) pathway, genes with high expression in the heart, or genes previously implicated in the disease.
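That two-tier search is essentially gene-list membership filtering; a hedged sketch, where the gene-list files and column names are placeholders:

```python
import pandas as pd

# Rare, protein-altering variants from the unrelated cases (file name is a placeholder).
variants = pd.read_csv("unrelated_cases_rare_functional.tsv", sep="\t")

# Hypothetical gene lists: known causal genes, then a wider candidate set
# (e.g. genes in the same KEGG pathway or highly expressed in heart).
known = {g.strip() for g in open("known_causal_genes.txt")}
candidates = {g.strip() for g in open("candidate_genes.txt")}

tier1 = variants[variants["gene"].isin(known)]               # known causal genes first
tier2 = variants[variants["gene"].isin(candidates - known)]  # then the expanded search
```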
Other options include VAAST to rank high-priority variants, and PLINK/SEQ or the EPACTS software for a gene burden test comparing cases with controls, to look for genes with an over-representation of rare variants (we have not yet tried either).
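For context, the simplest form of such a burden test is a collapsing (CAST-style) test: count carriers of rare functional variants per gene in cases versus controls and test each gene with Fisher's exact test. This is only a minimal sketch of the idea, not what EPACTS or PLINK/SEQ actually run, and the input format is assumed:

```python
from scipy.stats import fisher_exact

def collapsing_burden_test(carriers, n_cases, n_controls):
    """carriers: dict mapping gene -> (case carriers, control carriers) of rare variants."""
    results = {}
    for gene, (case_c, ctrl_c) in carriers.items():
        table = [[case_c, n_cases - case_c],
                 [ctrl_c, n_controls - ctrl_c]]
        odds_ratio, p = fisher_exact(table, alternative="greater")
        results[gene] = (odds_ratio, p)
    return results

# Toy numbers only: 100 cases vs 500 controls.
print(collapsing_burden_test({"GENE_A": (12, 8), "GENE_B": (30, 140)}, 100, 500))
```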
You might want to look into the terms "burden test" and "non-burden test"; these come up often in sequencing-based association studies (GWAS-style analyses of NGS data), and I don't see why you would need new controls for such a study. Also brush up on models of inheritance, such as compound heterozygotes. I don't know of any libraries that deal specifically with filtering for these events, since the operations are fairly trivial.
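For example, a compound-heterozygote filter is only a few lines once variants are annotated with gene and genotype: flag genes carrying two or more rare heterozygous variants in the same individual (with trio data you would additionally require one variant inherited from each parent). A rough sketch with assumed file and column names:

```python
import pandas as pd

# Rare functional variants, one row per call; column names are assumptions.
variants = pd.read_csv("rare_functional_variants.tsv", sep="\t")

# Keep heterozygous calls, then count distinct rare het variants per sample per gene.
hets = variants[variants["genotype"] == "het"]
counts = hets.groupby(["sample_id", "gene"])["variant_id"].nunique()

# Candidate compound-het genes: >= 2 rare heterozygous variants in one individual.
compound_het_candidates = counts[counts >= 2].reset_index()
print(compound_het_candidates)
```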
There is precedent for using exomes from another project as controls (for example, the NHLBI EVS is used this way, as are other exomes generated by the same lab for other projects). If you are dealing with a rare disease, one can argue that it is unlikely the causative variant(s) will also be found in exomes from individuals with different phenotypes.
Is it a common or a rare disease?
Let's say about 40 cases per 100K, though I'm interested in how this would change your recommendations.
Common disease: at first glance, you cannot simply remove all the known SNPs from dbSNP, 1000 Genomes, etc.
Thanks for posting this question. I have a similar situation with 100 case-only exomes (the individuals are unrelated). Besides a standard filtering-based approach like ANNOVAR, what other approaches have you used to analyse your case-only data?
I have tried VAAST (1.0.4) using the supplied 1000 Genomes data as controls, but ended up with too many false positives, even after restricting the analysis to regions covered at >10x in both our cases and the 1000G controls.
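(On the coverage restriction: one way to build that shared callable region set is to intersect per-dataset >=10x intervals, e.g. with `bedtools intersect -a cases_ge10x.bed -b 1000g_ge10x.bed`, or with a small interval sweep like the sketch below; the BED file names are placeholders.)

```python
def read_bed(path):
    """Read a simple 3-column BED file into (chrom, start, end) tuples."""
    with open(path) as fh:
        return [(f[0], int(f[1]), int(f[2])) for f in (line.split() for line in fh)]

def intersect(a, b):
    """Naive pairwise interval intersection (fine as a sketch for modest region counts)."""
    out = []
    for chrom, start, end in a:
        for c2, s2, e2 in b:
            if c2 == chrom and s2 < end and start < e2:
                out.append((chrom, max(start, s2), min(end, e2)))
    return out

callable_regions = intersect(read_bed("cases_ge10x.bed"), read_bed("1000g_ge10x.bed"))
```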
Not sure whether I should post this as a separate question, but I was wondering whether you have heard of anyone successfully using public NGS data for more than just allele frequencies (e.g. as controls in a gene burden test)?
Most people I have talked to were against using public data directly as controls for gene burden test analysis because of the large biases/differences between our data and the public data.