Is anybody aware of a published software tool for pathway enrichment analysis of NGS data in families (in particular, exome sequencing data)? My understanding is that the common enrichment analysis techniques developed for microarrays or SNP arrays cannot be applied here because of biases specific to sequencing data. I would also be interested in pathway analysis tools for case/control studies, but family sequencing is the main area of interest.
Many thanks in advance.
Most pathway enrichment software just takes lists of genes as input, and frankly, most of the stats aren't operating off great models anyway, so I wouldn't be that concerned about the biases introduced by exome sequencing. The more important point is what question you are asking. Are you planning on compiling lists of genes with variants in them and looking for enrichment? I'm not exactly sure what sorts of questions you plan to ask with exome sequencing data in families, as opposed to gene expression data, which is where these sorts of tools are typically applied.
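To make the "gene list in, enrichment out" idea concrete, here is a minimal sketch of the kind of over-representation test many such tools are built around: a one-sided hypergeometric test. The function name and arguments are made up for illustration and are not taken from any particular package. Note that it treats every gene in the universe as equally likely to appear in the hit list, which is exactly the kind of simple model mentioned above.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_pathway, n_hits, n_overlap):
    """One-sided over-representation p-value, P(X >= n_overlap).

    n_universe: number of genes that could have been hits (background)
    n_pathway:  number of background genes annotated to the pathway
    n_hits:     size of the submitted gene list
    n_overlap:  number of hit genes that fall in the pathway

    Assumes every background gene is equally likely to be a hit
    (no correction for gene length, LD, or relatedness).
    """
    max_k = min(n_pathway, n_hits)
    total = comb(n_universe, n_hits)
    # Sum the hypergeometric probabilities over the upper tail.
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_hits - k)
        for k in range(n_overlap, max_k + 1)
    )
    return tail / total

# Hypothetical numbers: 20000 background genes, 100 in the pathway,
# 200 genes carrying candidate variants, 5 of them in the pathway.
p = hypergeom_enrichment_p(20000, 100, 200, 5)
print(p)
```

The equal-probability assumption is the part that sequencing-specific biases would violate, so this is best read as a baseline rather than a recommendation.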
We are looking for combinatorial effects of variants that explain complex polygenic disease phenotypes, using family exome sequencing data. Since I expect the variant data to be very noisy (or to have many random effects), a pathway analysis could in theory help to identify robust and interpretable deregulation of specific cellular processes. However, I think the biases cannot be ignored: for example, longer genes accumulate more SNPs simply because of their length, and there is linkage disequilibrium between genes. What I am looking for is a proper permutation-based p-value estimation approach for pathway enrichment analysis in this setting.
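For illustration, the kind of permutation scheme described above could look roughly like the sketch below. It is only a sketch under stated assumptions, not a published method: the null gene lists are drawn with probability proportional to gene length, so long genes show up in the null as often as they would by chance alone. All names (`length_weighted_sample`, `permutation_p`, the `lengths` dictionary) are hypothetical. It corrects for gene length only and does nothing about linkage disequilibrium between genes or relatedness within families.

```python
import random

def length_weighted_sample(genes, lengths, k, rng):
    """Draw k distinct genes with probability proportional to length.

    Uses Efraimidis-Spirakis keys (u ** (1 / weight)) for weighted
    sampling without replacement. Lengths must be positive.
    """
    keyed = [(rng.random() ** (1.0 / lengths[g]), g) for g in genes]
    keyed.sort(reverse=True)
    return [g for _, g in keyed[:k]]

def permutation_p(hit_genes, pathway, lengths, n_perm=1000, seed=0):
    """Empirical p-value for pathway overlap under a length-weighted null.

    hit_genes: genes carrying candidate variants
    pathway:   genes annotated to the pathway
    lengths:   dict mapping every background gene to its length
    """
    rng = random.Random(seed)
    genes = sorted(lengths)
    pathway = set(pathway)
    hits = set(hit_genes)
    observed = len(hits & pathway)
    at_least = 0
    for _ in range(n_perm):
        null_hits = length_weighted_sample(genes, lengths, len(hits), rng)
        if len(pathway.intersection(null_hits)) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (n_perm + 1)

# Toy background: 50 genes of equal (made-up) length, 5 in the pathway.
lengths = {f"G{i}": 1500 for i in range(50)}
pathway = ["G0", "G1", "G2", "G3", "G4"]
print(permutation_p(["G0", "G1", "G2", "G7"], pathway, lengths))
```

A version suitable for family data would additionally need a null that respects pedigree structure and LD, which this toy example deliberately leaves out.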
I hope you did a power analysis before you started, to determine the sample size you will need to detect these combinatorial effects. This was a major (as in, major) weakness of the previous generation of GWAS studies. Also, I agree with Dan about the currently available pathway analysis tools. Most publicly available ones do not integrate enough context-specific data (primarily the timing and cell type of expression), so their results are heavy on false positives.
We are still in the planning phase, so the first question is which analysis tools are available for this purpose (if any) and what their strengths and limitations are. Sample size estimation will of course follow, but first we need to know which analysis methods are available at all.