Hello,
I am trying to carry out a SNP genomic enrichment analysis and I was hoping you could help.
Basically, I have the following two sets of SNPs:
set_A: 1,695 foreground SNPs. These are 1000g variants which, in addition, are QTLs for a trait I'm interested in. They are all within ChIP-seq peak intervals for a TF.
set_B: 116,000 background SNPs. These are a superset of set_A and all are within ChIP-seq peaks for the same TF above. These represent all the SNPs I had tested for the QTL property above.
I want to determine whether set_A is enriched in a particular annotation relative to set_B, i.e. relative to all the SNPs in my ChIP-seq peaks that I tested for QTL status. For example, the annotation might be the strong-LD intervals around genome-wide significant SNPs from the GWAS catalog. So the question I want to ask is:
"Are my set_A variants more likely to be in GWAS LD blocks for some disease/trait compared to the background set of SNPs?"
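To make the question concrete, this is roughly the naive test I have in mind, as a minimal Python sketch. Because set_A is a subset of set_B, the overlap count can be compared against a hypergeometric null. The function names and the two overlap counts are placeholders I made up for illustration, not real results, and the test treats every SNP as independent, so it ignores LD entirely.

```python
from math import lgamma, exp

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n SNPs are drawn without replacement from N background
    SNPs, K of which overlap the annotation."""
    lo = max(k, 0, n - (N - K))   # smallest overlap that is possible and >= k
    hi = min(n, K)                # largest possible overlap
    if lo > hi:
        return 0.0
    log_denom = log_choose(N, n)
    terms = [log_choose(K, i) + log_choose(N - K, n - i) - log_denom
             for i in range(lo, hi + 1)]
    m = max(terms)                # log-sum-exp to avoid underflow
    return min(1.0, exp(m) * sum(exp(t - m) for t in terms))

N = 116_000  # size of set_B (from the post)
n = 1_695    # size of set_A (from the post)
K = 5_800    # HYPOTHETICAL: set_B SNPs inside GWAS LD blocks
k = 120      # HYPOTHETICAL: set_A SNPs inside GWAS LD blocks

p_value = hypergeom_upper_tail(k, N, K, n)
fold = (k / n) / (K / N)
print(f"fold enrichment = {fold:.2f}, one-sided p = {p_value:.3g}")
```

If SNPs within a set are correlated through LD, the effective number of independent draws is smaller than n, so a p-value from this kind of test would be too optimistic.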
I have already checked that set_A is MAF-matched to set_B (bootstrapped KS test of the two MAF distributions), so this should not be a problem. I ran GAT, the simulation-based enrichment tool (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3722528/), which works fine and has returned enrichment results. However, I believe my foreground and background sets need more pre-processing: there is LD structure within both set_A and set_B, i.e. some SNPs in A are in LD with each other, and likewise some SNPs in B. I believe I need to correct for this too, to avoid inflating the enrichment. I would probably need to LD-match set_A and set_B, or maybe pool or subsample only independent SNPs from set_A and set_B. GAT, which is designed to compute simple interval enrichments, cannot do this.
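By "subsample independent SNPs" I mean something like the greedy pruning below. This is only a sketch under assumptions: I assume pairwise r2 values are already available in a dictionary (the `r2` lookup, the SNP IDs and the 0.8 threshold are all hypothetical), and that pairs missing from the lookup are not in LD.

```python
def greedy_ld_prune(snps, r2, threshold=0.8):
    """Walk through `snps` in priority order and keep a SNP only if it is
    not in LD (r2 >= threshold) with any SNP that was already kept.

    snps: list of SNP IDs, most important first
    r2:   dict mapping frozenset({snp_a, snp_b}) -> r2 value;
          pairs absent from the dict are treated as r2 = 0 (assumption)
    """
    kept = []
    for snp in snps:
        in_ld = any(r2.get(frozenset((snp, other)), 0.0) >= threshold
                    for other in kept)
        if not in_ld:
            kept.append(snp)
    return kept

# Toy example with made-up IDs and r2 values
toy_snps = ["rs1", "rs2", "rs3", "rs4"]
toy_r2 = {
    frozenset(("rs1", "rs2")): 0.95,  # rs2 is tagged by rs1
    frozenset(("rs2", "rs3")): 0.90,  # irrelevant once rs2 is dropped
    frozenset(("rs3", "rs4")): 0.20,  # weak LD, both can stay
}
pruned = greedy_ld_prune(toy_snps, toy_r2)
print(pruned)
```

The result depends on the input order, so for set_B one option would be to list the set_A SNPs first so that foreground SNPs are not discarded in favour of background SNPs tagging them. The all-pairs check is quadratic, so for 116,000 SNPs it would have to be restricted to nearby SNPs on the same chromosome.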
There is a tool from the Broad Institute that might be able to help me, called SNPsnap: http://www.broadinstitute.org/mpg/snpsnap/
Interestingly, SNPsnap should be able to carry out LD clumping of the foreground SNPs, so it can correct for LD-derived inflation of enrichments. However, SNPsnap only returns a frequency-matched background of (at most) 20,000 SNPs. I don't need this, because I believe I already have the most suitable background set (set_B), and in any case I need my background SNPs to be in the ChIP-seq peaks.
Additionally, SNPsnap seems quite experimental (about 80% of my runs have failed) and my emails to the authors have gone unanswered, so I believe the program is not really supported.
Therefore I was hoping someone here might have ideas on how to do this:
LD clumping: what if I mapped my set_A SNPs to strong-LD intervals and computed, instead of the enrichment of set_A SNPs in GWAS LD blocks, the enrichment of set_A LD blocks in GWAS LD blocks?
Alternatively, for each LD block containing more than one set_A SNP, I could select the "best" SNP according to some metric.
Any other ideas or suitable tools?
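For the "best SNP per LD block" idea, this is the kind of collapsing step I am picturing, again only as a sketch. The block assignment (`snp_block`) and the score (`snp_score`, for example -log10 of the QTL p-value) are hypothetical inputs that I would still have to produce, and the IDs and values are made up.

```python
def best_snp_per_block(snp_block, snp_score):
    """Return {block_id: snp_id}, keeping the highest-scoring SNP per block.

    snp_block: dict SNP ID -> LD block ID (hypothetical lookup)
    snp_score: dict SNP ID -> numeric score, higher is better (hypothetical)
    Ties keep whichever SNP was seen first.
    """
    best = {}
    for snp, block in snp_block.items():
        current = best.get(block)
        if current is None or snp_score[snp] > snp_score[current]:
            best[block] = snp
    return best

# Toy example with made-up IDs, blocks and scores
toy_blocks = {"rs1": "block_1", "rs2": "block_1", "rs3": "block_2"}
toy_scores = {"rs1": 3.0, "rs2": 7.5, "rs3": 1.2}

representatives = best_snp_per_block(toy_blocks, toy_scores)
print(representatives)
```

After this step each block contributes exactly one SNP, so the foreground count becomes the number of blocks rather than the number of SNPs. Presumably set_B would need the same collapsing for the comparison to stay fair, although there is no obvious "best" metric for non-QTL SNPs.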
Thanks for any suggestions you might be willing to share.