Question

Detection Of Snps In Proximity Of Manhattan Plot Peaks

1

Entering edit mode

11.8 years ago

Agatha ▴ 350

I am currently interested in finding the SNPs that are in the proximity of all the peaks in a Manhattan plot (not only the genome wide significant ones) Basically for each peak, I would like to detect 2, 3 annotated SNPs on both sides(+/- 3 SNP positions) . I have also been looking Find The Extent Of A Peak In A Manhattan Plot which is partially what I am trying to achieve as well.

I have a list of SNPs (200 000)(defined by chromosome and position) and their associated P-value. I have used the data to generate the Manhattan plots. The data is resulted after a Chip experiment with DNA from cancer patients and healthy controls. I have detected the significant variants wich might be associated with cancer, but I am also trying to see if in the vicinity of those SNPs there are other variants with a known association to cancer. I would like to do this for the detected genome wide significant SNPs and for the other observed peaks.

snp • 5.8k views

ADD COMMENT • link updated 11.8 years ago by Arun 2.4k • written 11.8 years ago by Agatha ▴ 350

0

Entering edit mode

Please add some context. What biological problem are you working on? Manhattan plots can represent many different types of data.

ADD REPLY • link 11.8 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Also, how do you generate the Manhattan plots? Are you talking about Manhattan plots that you have generated by yourself, or plots generated by somebody else?

ADD REPLY • link 11.8 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

I have edited my question...I hope this explains more clearly what I'd like to do.

ADD REPLY • link 11.8 years ago by Agatha ▴ 350

0

Entering edit mode

When you mean peaks, these are the SNPs for which you get a significant p-value? And when you mean on both sides, you mean, +/- 3 SNP positions? And what do you mean by detect 2-3 annotated SNPs? If you get a peak (and you know its location), then you already can calculate +/- 3 positions... right?

ADD REPLY • link 11.8 years ago by Arun 2.4k

0

Entering edit mode

@Arun - yes I can detect the ones located near the genome wide significant ones because they are generally a few. I dont know what can I do with the other observed peaks.. Well some of the SNPs might be novel, which is why I would like to validate them somehow by using other already known ones located +/- 3 positions near the "interesting" SNP. Also I am not sure if this is the way to do it...

ADD REPLY • link 11.8 years ago by Agatha ▴ 350

0

Entering edit mode

So, basically you have a list of positions on the genome (corresponding to the SNPs that are significant in your study), and you want to know if there are other SNPs in the proximity of these positions. Is this correct?

ADD REPLY • link 11.8 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

yes, for the genome wide significant ones; I don't know how to detect the peaks with a lower significance(still lower than 0.001)...how to distinguish them from the non relevant ones...setting thresholds is not a solution I think because it would return too many...

ADD REPLY • link 11.8 years ago by Agatha ▴ 350

0

Entering edit mode

LD, haplotype analysis?

ADD REPLY • link 11.8 years ago by zx8754 12k

1

Entering edit mode

I will do that after I will be able to identify my "peaks"

ADD REPLY • link 11.8 years ago by Agatha ▴ 350

score 2 · Answer 1 · 2013-02-21

Not quite sure if this answers your question entirely, but this is a start. You can do this in R with the help of the bioconductor package IRanges or even better GenomicRanges (it uses IRanges).

I'll generate some random data of SNPs that are significant. I'll assume you've 5 chromosomes and a total of 150 significant SNPs

# 1) dummy significant snp data:
set.seed(45)
chr <- rep(paste0("chr", 1:5), (1:5)*10)
pos <- unlist(lapply(1:5, function(x) sort(sample(1:1e4, 10*x))))
pval <- sample(seq(1e-5, 1e-3, length.out=1000), 150, replace=T)

# significant snps
sig.snps <- data.frame(chr, pos, pval)

# 2) dummy other snps data:
# no: of insignificant SNPs according to your threshold will be generated randomly.
pos <- lapply(1:5, function(x) sample(setdiff(1:1e4, sig.snps$pos[sig.snps$chr == paste0("chr", x)]), sample(1000:6000, 1)))
chr <- rep(paste0("chr", 1:5), sapply(pos, length))
pos <- unlist(pos)
pval <- sample(seq(.0011, 0.5, length.out=1e4), length(pos), replace=T)
insig.snps <- data.frame(chr, pos, pval)

Actual answer starts here: Now using GenomicRanges to get non-significant SNPs that lie close to each of the significant SNPs within a given threshold:

require(GenomicRanges)
sig.gr <- GRanges(Rle(sig.snps$chr), IRanges(sig.snps$pos, width=1))
insig.gr <- GRanges(Rle(insig.snps$chr), IRanges(insig.snps$pos, width=1))

threshold <- 100 # find all insignificant snps for each significant SNP that lies +/- 100 bases apart

olaps <- findOverlapssig.gr, insig.gr, maxgap=threshold, type="any")

# all insignificant snps that overlap with any significant snp within given threshold    
out <- insig.snps[subjectHits(olaps), ]

# contains a list of data.frames equal to the length of significant snps
# each element of the list contains all insignificant snps that overlap with that significant snp
out.l <- split(out, sig.snps$pos[queryHits(olaps)])

require(plyr)
out.f <- ldply(out.l)

> head(out.f)
  .id  chr pos      pval
1  49 chr5  15 0.4576392
2  49 chr5  16 0.4798923
3  49 chr5  35 0.2895429
4  49 chr5 133 0.3823476
5  49 chr5  98 0.3736160
6  49 chr5 117 0.2939836

.id gives the position of significant SNP and the rest of the column are the corresponding insignificant SNPs that fall within your threshold size.

From here, you can filter based on pval as well, if you wish. I'll leave it for you to work it out.