Hi,
So I've got a list of chromosome segments amplified and deleted by sample with the chromosome, start and end co-ordinates. How can I, using R, get hold of identifiers for all the genes that fall within those segments?
Cheers.
Hi,
So I've got a list of chromosome segments amplified and deleted by sample with the chromosome, start and end co-ordinates. How can I, using R, get hold of identifiers for all the genes that fall within those segments?
Cheers.
I'll assume for now that this is human data.
Short answer: biomaRt.
A brief example:
library(biomaRt)
mart.hs <- useMart("ensembl", "hsapiens_gene_ensembl")
# example: chromosome 22, start = 20000000, end = 20100000
genes <- getBM(attributes = c("hgnc_symbol", "chromosome_name", "start_position", "end_position"),
filters = c("chromosomal_region"),
values = c("22:20000000:20100000"),
mart = mart.hs)
genes
# hgnc_symbol chromosome_name start_position end_position
#1 ARVCF 22 19957419 20004331
#2 TANGO2 22 20004537 20053449
#3 DGCR8 22 20067755 20099400
#4 TRMT2A 22 20099389 20104915
#5 MIR185 22 20020662 20020743
#6 22 20050503 20058045
#7 22 20052075 20053228
#8 MIR3618 22 20073269 20073356
#9 MIR1306 22 20073581 20073665
#10 22 20098344 20099398
Perhaps instead of using R to do set operations, export the gene table for whole chromosomes to a file that you can do lookups on. Use a dedicated set operation tool like bedmap
to map genes to your segments of interest. More about bedmap
over here.
The basic outline would look like this:
GRanges
object (from the GenomicRanges
package) named gr
. You can annotate each range in this object with the sampleID it comes from, if this is necessary.TranscriptDb
object (GenomicFeatures
package) for your organism and reference, or create your own from a set of annotations you care about (read the vignette for this package to learn how to do so).txdb
objects, and subsetByOverlaps(features, gr)
.More concretely, assuming your working in human:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
tx <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)
interesting <- subsetByOverlaps(tx, gr)
Now interesting
has all of the transcripts that overlap (in any way) with the segments in gr
. Read through the documentation available for GenomicRanges
(and likely IRanges
) to tune how you want to consider overlaps, or how to use the lower level findOverlaps
methods to better tune the results/output from these queries.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Which organism?