Not quite sure if this answers your question entirely, but this is a start. You can do this in R
with the help of the bioconductor
package IRanges
or even better GenomicRanges
(it uses IRanges
).
I'll generate some random data of SNPs that are significant. I'll assume you've 5 chromosomes
and a total of 150 significant SNPs
# 1) dummy significant snp data:
set.seed(45)
chr <- rep(paste0("chr", 1:5), (1:5)*10)
pos <- unlist(lapply(1:5, function(x) sort(sample(1:1e4, 10*x))))
pval <- sample(seq(1e-5, 1e-3, length.out=1000), 150, replace=T)
# significant snps
sig.snps <- data.frame(chr, pos, pval)
# 2) dummy other snps data:
# no: of insignificant SNPs according to your threshold will be generated randomly.
pos <- lapply(1:5, function(x) sample(setdiff(1:1e4, sig.snps$pos[sig.snps$chr == paste0("chr", x)]), sample(1000:6000, 1)))
chr <- rep(paste0("chr", 1:5), sapply(pos, length))
pos <- unlist(pos)
pval <- sample(seq(.0011, 0.5, length.out=1e4), length(pos), replace=T)
insig.snps <- data.frame(chr, pos, pval)
Actual answer starts here: Now using GenomicRanges
to get non-significant SNPs that lie close to each of the significant SNPs within a given threshold:
require(GenomicRanges)
sig.gr <- GRanges(Rle(sig.snps$chr), IRanges(sig.snps$pos, width=1))
insig.gr <- GRanges(Rle(insig.snps$chr), IRanges(insig.snps$pos, width=1))
threshold <- 100 # find all insignificant snps for each significant SNP that lies +/- 100 bases apart
olaps <- findOverlapssig.gr, insig.gr, maxgap=threshold, type="any")
# all insignificant snps that overlap with any significant snp within given threshold
out <- insig.snps[subjectHits(olaps), ]
# contains a list of data.frames equal to the length of significant snps
# each element of the list contains all insignificant snps that overlap with that significant snp
out.l <- split(out, sig.snps$pos[queryHits(olaps)])
require(plyr)
out.f <- ldply(out.l)
> head(out.f)
.id chr pos pval
1 49 chr5 15 0.4576392
2 49 chr5 16 0.4798923
3 49 chr5 35 0.2895429
4 49 chr5 133 0.3823476
5 49 chr5 98 0.3736160
6 49 chr5 117 0.2939836
.id
gives the position of significant SNP and the rest of the column are the corresponding insignificant SNPs that fall within your threshold
size.
From here, you can filter based on pval
as well, if you wish. I'll leave it for you to work it out.
Please add some context. What biological problem are you working on? Manhattan plots can represent many different types of data.
Also, how do you generate the Manhattan plots? Are you talking about Manhattan plots that you have generated by yourself, or plots generated by somebody else?
I have edited my question...I hope this explains more clearly what I'd like to do.
When you mean peaks, these are the SNPs for which you get a significant p-value? And when you mean on both sides, you mean, +/- 3 SNP positions? And what do you mean by detect 2-3 annotated SNPs? If you get a peak (and you know its location), then you already can calculate +/- 3 positions... right?
@Arun - yes I can detect the ones located near the genome wide significant ones because they are generally a few. I dont know what can I do with the other observed peaks.. Well some of the SNPs might be novel, which is why I would like to validate them somehow by using other already known ones located +/- 3 positions near the "interesting" SNP. Also I am not sure if this is the way to do it...
So, basically you have a list of positions on the genome (corresponding to the SNPs that are significant in your study), and you want to know if there are other SNPs in the proximity of these positions. Is this correct?
yes, for the genome wide significant ones; I don't know how to detect the peaks with a lower significance(still lower than 0.001)...how to distinguish them from the non relevant ones...setting thresholds is not a solution I think because it would return too many...
LD, haplotype analysis?
I will do that after I will be able to identify my "peaks"