Hi,
I would like to find overlapping sequences from differentially methylated regions from three different samples. Is there an easy way to do it for a non-bioinformatician?
Thank you in advance,
Matea.
Hi,
I would like to find overlapping sequences from differentially methylated regions from three different samples. Is there an easy way to do it for a non-bioinformatician?
Thank you in advance,
Matea.
Assuming you have the coordinates for DMRs in each of the samples and can get them into R it's fairly simple using GenomicRanges.
Here is a simple example of how you could try to get it done starting with e.g. BED-like coordinates for your peaks. Might have some bugs but if you're lucky it works straight away by copy-pasteing into R but can hopefully get you started..
In general you get data into R by providing the path to your file on interest and using one of the built-in functions like 'read.table'.
data<-read.table(file="c:/path/to/my/data/mydata.txt",sep="\t")
head(data) #for peak at top 6 rows
Load/install required package (in R)
if('GenomicRanges' %in% rownames(installed.packages() ) ) { library(GenomicRanges) } else { source("https://bioconductor.org/biocLite.R") ; biocLite("GenomicRanges") ; library(GenomicRanges) }
Create mock DMRs (substitute with own coordinates by reading in your file so it looks something like this)
DMRs1<-data.frame(chr=c("chr1","chr1","chr2","chr2"),start=c(100,500,1000,2000),end=c(100,500,1000,2000)+100,stringsAsFactors=F)
DMRs2<-data.frame(chr=c("chr1","chr1","chr2","chr2"),start=c(300,525,1000,1800),end=c(300,525,1000,1800)+100,stringsAsFactors=F)
DMRs3<-data.frame(chr=c("chr1","chr1","chr2","chr2"),start=c(350,500,1000,2200),end=c(350,500,1000,2200)+100,stringsAsFactors=F)
head(DMRs1) ##check object, your files should look something like this once read in..
Make GRanges objects
r1<-GRanges(seqnames=DMRs1[,"chr"],ranges=IRanges(start=DMRs1[,"start"],end=DMRs1[,"end"]))
r2<-GRanges(seqnames=DMRs2[,"chr"],ranges=IRanges(start=DMRs2[,"start"],end=DMRs2[,"end"]))
r3<-GRanges(seqnames=DMRs3[,"chr"],ranges=IRanges(start=DMRs3[,"start"],end=DMRs3[,"end"]))
Get DMR overlaps using coverage and convert back to GRanges
myRanges<-coverage(c(r1,r2,r3))
myRanges<-as(myRanges,"GRanges")
Only keep regions where at least 2 DMRs have some degree of overlap
myRanges<-myRanges[score(myRanges)>1] # >2 would require all three experiments to have overlapping DMR
Merge contiguous regions with variable number of overlapping peaks
#myRanges<-myRanges[width(myRanges)>50] #only keep overlaps larger than e.g. 50 bp, remove first '#' to use
myRanges<-reduce(myRanges) #merge contiguous regions
Make container for map-back
elementMetadata(myRanges)$consensusRegions2dmr1Row<-NA
elementMetadata(myRanges)$consensusRegions2dmr2Row<-NA
elementMetadata(myRanges)$consensusRegions2dmr3Row<-NA
Map back to original peaks
myMatches<-as.matrix(findOverlaps(myRanges,r1))
elementMetadata(myRanges)$consensusRegions2dmr1Row[myMatches[,1]]<-myMatches[,2]
myMatches<-as.matrix(findOverlaps(myRanges,r2))
elementMetadata(myRanges)$consensusRegions2dmr2Row[myMatches[,1]]<-myMatches[,2]
myMatches<-as.matrix(findOverlaps(myRanges,r3))
elementMetadata(myRanges)$consensusRegions2dmr3Row[myMatches[,1]]<-myMatches[,2]
Output results matrix
out<-data.frame(chr=as.character(seqnames(myRanges)),start=start(myRanges),end=end(myRanges),as(elementMetadata(myRanges),"matrix"),stringsAsFactors=F)
write.table(out,file="peaksOverlapsOut.txt",sep="\t",quote=F,row.names=F) #open in e.g. excel and check
Hi Mattias Aine, thanks for sharing the code. It works perfectly until the last step, where I get the following:
Error in as.vector(data) : no method for coercing this S4 class to a vector
Any ideas on how to solve this problem? I've tried adding up the following into the NAMESPACE file, but that didn't help.
importMethodsFrom(GenomicRanges, as.data.frame)
Loading required package: GenomicRanges Loading required package: BiocGenerics Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl,
intersect, is.unsorted, lapply, lengths, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unlist, unsplit
Loading required package: S4Vectors Loading required package: stats4 Loading required package: IRanges Loading required package: GenomeInfoDb
data<-read.table(file="c:/path/to/my/data/mydata.bed",sep="\t")
Warning message: package ‘BiocGenerics’ was built under R version 3.2.2
data1<-read.table(file="c:/path/to/my/data/mydata.bed",sep="\t")
head(data)
DMRs1<-data.frame(chr=data[,1],start=data[,2],end=data[,3],stringsAsFactors=F)
DMRs2<-data.frame(chr=data1[,1],start=data1[,2],end=data1[,3],stringsAsFactors=F)
head(DMRs1)
r1<-GRanges(seqnames=DMRs1[,"chr"],ranges=IRanges(start=DMRs1[,"start"],end=DMRs1[,"end"]))
r2<-GRanges(seqnames=DMRs2[,"chr"],ranges=IRanges(start=DMRs2[,"start"],end=DMRs2[,"end"]))
myRanges<-coverage(c(r1,r2))
myRanges<-as(myRanges,"GRanges")
myRanges<-myRanges[score(myRanges)>1]
myRanges<-reduce(myRanges)
elementMetadata(myRanges)$consensusRegions2dmr1Row<-NA
elementMetadata(myRanges)$consensusRegions2dmr2Row<-NA
myMatches<-as.matrix(findOverlaps(myRanges,r1))
elementMetadata(myRanges)$consensusRegions2dmr1Row[myMatches[,1]]<-myMatches[,2]
myMatches<-as.matrix(findOverlaps(myRanges,r2))
elementMetadata(myRanges)$consensusRegions2dmr2Row[myMatches[,1]]<-myMatches[,2]
out<-data.frame(chr=as.character(seqnames(myRanges)),start=start(myRanges),end=end(myRanges),as(elementMetadata(myRanges),"matrix"),stringsAsFactors=F)
Error in as.vector(data) : no method for coercing this S4 class to a vector
This is how it looks like
GRanges object with 422 ranges and 2 metadata columns:
seqnames ranges strand | consensusRegions2dmr1Row
<Rle> <IRanges> <Rle> | <integer>
[1] 1 [ 949285, 949737] * | 1
[2] 1 [ 976705, 976708] * | 2
[3] 1 [1244840, 1245077] * | 7
[4] 1 [1365792, 1366320] * | 10
[5] 1 [2144696, 2145231] * | 20
... ... ... ... ... ...
[418] 9 [140175619, 140175972] * | 1819
[419] 9 [140199579, 140199944] * | 1821
[420] 9 [140311775, 140312125] * | 1824
[421] X [ 9984266, 9984622] * | 3858
[422] X [ 49090608, 49090773] * | 3872
consensusRegions2dmr2Row
<integer>
[1] 1
[2] 2
[3] 7
[4] 9
[5] 12
... ...
[418] 942
[419] 943
[420] 945
[421] 2023
[422] 2034
-------
seqinfo: 24 sequences from an unspecified genome
ok, think it should work but then which one of these fails?
a<-as.vector(seqnames(myRanges))
b<-start(myRanges)
c<-end(myRanges)
d<-as(elementMetadata(myRanges),"matrix")
If none then check that they are all same length or dimensions
length(a)
dim(d) #number of rows should be same as length(a)
out<-cbind(chr=a,start=b,end=c,d) #then print to file..
If you know how to use Linux, you can install BEDTools and use intersectBed to accomplish your task.
If you provide a gft file (or a gff file converted to a gtf), and a BED file, a nice option is to use rgmatch.py
This program only needs a simple command lane documented in the web page, and gives you the position of the DMR (or any other feature) in the many places you choose, such as Upstream, Promoter, TSS, 1st_exon, introns, Gene_body, TTS and Downstream. The lenght of these features can be selected by the final user
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
In what format are your DMRs now, is it "chr-start-end" for all three samples or something more raw?
Also how non-bioinformatician are you? If you got a basic R-script to do this, could you modify it for your needs?
Are you actually interested in the sequences themself or just where they are and between what samples the DMRs overlap?
Hi Mattias Aine,
unfortunately I am very non-bioinformatician. I have my samples in "chr-start-end" format. I need the sequences to design the primers.
Ok but what you want to know is all the places in the genome where S1+S2, S1+S3, S2+S3 or S1-3 have overlapping DMRs so you can go on and design primers?
This is quite easy to do in R but would require that you can get to a point where you can get your data into R, install one bioconductor package and run a sequence of commands. Up for it?
I am always in for adventures. I'll install R. Do you have any hints how to begin?