Yes, you can use the GenomicRanges library in R to do this. Here is an example:
library(GenomicRanges)
# generate 10 random segments
regions.gr <- GRanges(seqnames=Rle(rep(1,10)), ranges=IRanges(start=sample(1:1000,10), width=100), strand=Rle(rep("*",10)))
# generate 5 random genes
fiveGeneNames <- paste(sample(letters,5), sample(1:100,5), sep="")
genes.gr <- GRanges(seqnames=Rle(rep(1,5)), ranges=IRanges(start=sample(1:1000,5), width=100, names=fiveGeneNames), strand=Rle(sample(c("+","-"),5, replace=T)))
# what regions overlap what genes?
overlapGenes <- findOverlaps(regions.gr, genes.gr)
# Return any genes with an overlap.
# Convert the resulting "Hits" object to a data frame
# and use it as an index
overlapGenes.df <- as.data.frame(overlapGenes)
names(genes.gr[overlapGenes.df$subjectHits])
# extract the regions that hit genes
regionsWithHits <- as.data.frame(regions.gr[overlapGenes.df$queryHits])
# add the names of the genes
regionsWithHits$genes <- names(genes.gr)[overlapGenes.df$subjectHits]
# Some segments might not overlap any gene.
# Find the distance to the nearest gene
distanceToNearest(regions.gr, genes.gr)
# if your genes and segments were in a bed file you could easily import them
library(rtracklayer)
regions <- import.bed("myRegions.bed",asRangedData=FALSE)
genes <- import.bed("myGenes.bed",asRangedData=FALSE)
If you can use the command line, you can do this in a simple one-liner with BEDOPS bedmap:
The file
answer.bed
contains results in the format you expect, except that gene names are delimited with semi-colons. You can change this to commas, if you like, by specifying the--multidelim
option.Just make sure first that
segmented_file
andreference_file
are sorted BED files:Etc.