We have some data that's similar to typical GWAS data where we have a p-value for association
with a given trait at millions of SNPs across a genome. When viewing these as a manhattan plot, we see peaks.
My task is to find the extent of the peaks, not just the highest point--since significant peaks tend to occur near each other. In the past, I've used some ad-hoc windowing method, but my question is: are there standard methods (and implementations) in GWAS for this type of thing?
Note: After reading these biostar threads, I'm aware of the ChIP-seq tools but I'm looking for something more targeted to regions rather than single nucleotides.
The difficulty in the problem is that significant SNP's are in the same region, but may be separated by non-signficiant SNP's.
With most GWAS result sets, you can count the genome-wide significant SNPs on one hand. Presumably your results are more complex. Can you provide details about the rough shape of the plot: many peaks or few, focal or broad, close enough that you can't guess whether one peak is really two, etc? What are the factors that make this hard?
I have added a bit of clarification. As of now, I have a script that seems to do this ok. It takes a seed parameter and extends out from peaks allowing to skip a certain number of lower values. I'm still interested to hear more traditional approaches.
I am not sure how a peak can be defined easily just with SNP p-value result. If you are looking for a method that could give a correspndance pb-position/probability of location of the functional sNP, I'd advise running some hapltype clustering test that will translate the fuzzy peak shape into a linkage like plot (HapCluster from Balding - extensively coded by Mailund, or ancesHC by Coin). Gives you a credibility interval.
But you need first to ensure that your significant SNPs scattered throuhg the region represent one single association signal - maybe using nested likelihood tests like Thesias.
Finally, some people use imputation. However, this is not very different in essence from hapl clustering and make the assumption that all the SNPs are present in the panel.
Finally, you can list all you "significant" SNPs - need to choose a p-value threshold- , list the blocks they are lying in (block limits are to be found on IMPUTE software page) and take the union of these blocks as the area of the peak (because the functional SNP is post likely in high LD with your observed significant SNPs and can only be found within the same block). Beware that all this is under the assumption that your association is driven by a common variant.
If your goal is not fine-mapping but providing an additional evidence for true signal based on the "length" of a peak (something similar to what was proposed for linkage in a 90s Terwilliger paper), then haplotype clustering is not bad.
However, I am not sure here the length of a peak is really reflecting the true positiveness of an association as it is highly dependent on the recombination pattern of the region - a peak in a short recombination block can show very significant p-values (and be a true positive) but association (hence peak length) would fade very quickly once the recombinaiton hot-spots are met.
There are standard peak ID and measurement techniques, e.g. using Matlab, which essentially boil down to:
clip region interval from genome
smooth the -log10(pval) signal
find zero-crossing(s) in 1st derivative of (2) -> these are the tips of the peaks
Additionally, you can find the boundaries of the peak(s) by determining the zero-crossings in the 2nd derivative of (2) which is also the 1st derivative of (3). I imagine this should work in Octave as well without any toolboxes.
Note of caution: you may be trying to identify a "region of association" this way, but LD will confound any interpretation of this peak. You should still probably perform haplotype and LD analysis as already suggested.
These are good standard techniques but I am not sure whethere one would have an interpretation that could make sense in the context of association - although I might be wrong. Probably, in association context , the haplotypes will be the thing to do.
Again, this also depends on what you exactly want out of your smoothing.
With most GWAS result sets, you can count the genome-wide significant SNPs on one hand. Presumably your results are more complex. Can you provide details about the rough shape of the plot: many peaks or few, focal or broad, close enough that you can't guess whether one peak is really two, etc? What are the factors that make this hard?
I have added a bit of clarification. As of now, I have a script that seems to do this ok. It takes a seed parameter and extends out from peaks allowing to skip a certain number of lower values. I'm still interested to hear more traditional approaches.