You can use R
and Bioconductor's GenomicRanges package as follows:
First, represent your regions of interest as genomic ranges. For real data, you may
load a BED file with the function rtracklayer::import.bed()
. Here, we create
by hand the three regions that you gave as example. The output of the commands is
displayed in italics.
# In R, load the library.
library(GenomicRanges)
# Create an GRanges object (here called, 'r'). that contains the regions of interest.
r <- c( GRanges("chr1", IRanges(1013574,1013576))
, GRanges("chr1", IRanges(1013984,1014478))
, GRanges("chr1", IRanges(1020163,1020383)))
r
Output:
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [1013574, 1013576] *
[2] chr1 [1013984, 1014478] *
[3] chr1 [1020163, 1020383] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Then, load genomic ranges for an annotation, for instance GENCODE.
# Get GENCODE from the Bioconductor's Annotation Hub.
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, c("Gencode", "gff", "human"))
Output:
AnnotationHub with 9 records
# snapshotDate(): 2016-12-29
# $dataprovider: Gencode
# $species: Homo sapiens
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, tags,
# sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH49554"]]'
title
AH49554 | gencode.v23.2wayconspseudos.gff3.gz
AH49555 | gencode.v23.annotation.gff3.gz
AH49556 | gencode.v23.basic.annotation.gff3.gz
AH49557 | gencode.v23.chr_patch_hapl_scaff.annotation.gff3.gz
AH49558 | gencode.v23.chr_patch_hapl_scaff.basic.annotation.gff3.gz
AH49559 | gencode.v23.long_noncoding_RNAs.gff3.gz
AH49560 | gencode.v23.polyAs.gff3.gz
AH49561 | gencode.v23.primary_assembly.annotation.gff3.gz
AH49562 | gencode.v23.tRNAs.gff3.gz
The command above returned that the ID for GENCODE 23 is AH49556
and indicated which command to run next.
ah["AH49556"]
gc <- ah[["AH49556"]]
Then, intersect your regions of interest with the annotation.
overlaps <- findOverlaps(r, gc)
This returns an object telling which region overlaps with which annotation. The last
step is to extract the gene names, collate them, and prepare your final output.
genes <- extractList(gc$gene_name, as(overlaps, "List"))
genes <- unstrsplit(unique(genes), ";") # Needed in case more than one gene overlaps.
paste(as.character(r), genes)
[1] "chr1:1013574-1013576 ISG15" "chr1:1013984-1014478 ISG15"
[3] "chr1:1020163-1020383 AGRN"
You can find other examples on how to use findOverlaps()
in other Biostars posts or in Bioconductor's support forum.
Thank you all for the great solutions :).