Consider this code:
library(microbenchmark)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
microbenchmark::microbenchmark(
# Option 1:
GenomicFeatures::exons(
txdb,
vals = list(tx_name = c("uc001aaa.3", "uc010nxq.1")),
columns = list("EXONNAME", "TXNAME", "GENEID")),
# Option 2:
GenomicFeatures::exonsBy(txdb, by = "tx", use.names = TRUE)[c("uc001aaa.3", "uc010nxq.1")],
times = 10
)
# Option 1 takes an average of 1.6 seconds
# Option 2 takes an average of 6.5 seconds
These two options gets me the same data, but option1 is a lot faster. Is there an easy way I can use option1 and still get the structure that is option2? Or is there a way to make exonsBy faster by only extracting the transcripts I'm interested in?
My end goal is to get a GRangesList object with one GRanges object per transcript, OR a single GRanges object with duplicate entries for exons that appear in multiple transcripts (if that's possible). To begin with I only have the transcript names and the TxDb object.
_Edit_
I started working on an example to help explain what I want to accomplish. Below I'm trying two approaches, using exonsBy() (which succeeds but is slow) and using exons() (which fails but retrieves data faster):
library(Gviz)
library(GenomicFeatures)
options(ucscChromosomeNames=FALSE)
txdbhg19 <- GenomicFeatures::makeTxDbFromBiomart(
biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl",
host = "dec2013.archive.ensembl.org")
transcriptNames <- c("ENST00000370032", "ENST00000420055")
# ----- APPROACH 1
# Get GrangesList
tst <- exonsBy(txdbhg19, by = "tx", use.names=TRUE)[transcriptNames]
# Unlist to get GRanges
tst <- unlist(tst)
# Add transcript metadata column to make Gviz group correctly
elementMetadata(tst)$transcript <- names(tst)
# Create track and plot
trTrack <- Gviz::GeneRegionTrack(tst)
Gviz::plotTracks(trTrack)
# This is the plot I want to produce
# ----- APPROACH 2
tst2 <- GenomicFeatures::exons(
txdbhg19,
vals = list(tx_name = transcriptNames),
columns = list("EXONNAME", "TXNAME", "GENEID"))
# Problem 1: The TXNAME metadata field now contains transcripts that are not in my original transcript list:
unique(unlist(elementMetadata(tst2)$TXNAME[!elementMetadata(tst2)$TXNAME %in% transcriptNames]))
# [1] "ENST00000370031" "ENST00000402983"
# I can hack this out, but it feels unclean to do this:
elementMetadata(tst2)$TXNAME <- elementMetadata(tst2)$TXNAME[elementMetadata(tst2)$TXNAME %in% transcriptNames]
# I would much rather have it so that I only get the transcript names that I originally wanted. Anyhow, let's continue..
# Each entry in elementMetadata(tst2)$TXNAME now has 1 or more transcript names in it. This is not what I want. I want to somehow "expand" this GRanges object, so that each row only have one TXNAME value. Here's a very naive way to do it:
gr <- GRanges()
for (i in 1:length(tst2)) {
trNames <- elementMetadata(tst2[i])$TXNAME[[1]]
trNames <- strsplit(trNames, " ")
for (j in 1:length(trNames)) {
newGr <- tst2[i]
elementMetadata(newGr)$TXNAME <- trNames[[j]]
gr <- append(gr, newGr)
print(paste(elementMetadata(tst2[i])$GENEID, trNames[[j]]))
}
}
elementMetadata(gr)$transcript <- elementMetadata(gr)$TXNAME
trTrack <- Gviz::GeneRegionTrack(gr)
Gviz::plotTracks(gr)
# Error in (function (classes, fdef, mtable) :
# unable to find an inherited method for function ‘displayPars<-’ for signature ‘"GRanges", "list"’
# Why do I get this error? I was hoping that this gr structure would be similar to the tst structure in approach 1
The goal is to get a structure similar to the one I get with exonsBy() -> unlist(). Any ideas?