Question

GenomicFeatures: Select exons by transcript. Any faster option than exonsBy?

0

Entering edit mode

8.7 years ago

stianlagstad ★ 1.1k

Consider this code:

library(microbenchmark)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

microbenchmark::microbenchmark(
  # Option 1:
  GenomicFeatures::exons(
    txdb,
    vals = list(tx_name = c("uc001aaa.3", "uc010nxq.1")),
    columns = list("EXONNAME", "TXNAME", "GENEID")),
  # Option 2:
  GenomicFeatures::exonsBy(txdb, by = "tx", use.names = TRUE)[c("uc001aaa.3", "uc010nxq.1")],
  times = 10
)

# Option 1 takes an average of 1.6 seconds
# Option 2 takes an average of 6.5 seconds

These two options gets me the same data, but option1 is a lot faster. Is there an easy way I can use option1 and still get the structure that is option2? Or is there a way to make exonsBy faster by only extracting the transcripts I'm interested in?

My end goal is to get a GRangesList object with one GRanges object per transcript, OR a single GRanges object with duplicate entries for exons that appear in multiple transcripts (if that's possible). To begin with I only have the transcript names and the TxDb object.

_Edit_

I started working on an example to help explain what I want to accomplish. Below I'm trying two approaches, using exonsBy() (which succeeds but is slow) and using exons() (which fails but retrieves data faster):

library(Gviz)
library(GenomicFeatures)

options(ucscChromosomeNames=FALSE)
txdbhg19 <- GenomicFeatures::makeTxDbFromBiomart(
  biomart = "ENSEMBL_MART_ENSEMBL",
  dataset = "hsapiens_gene_ensembl",
  host = "dec2013.archive.ensembl.org")

transcriptNames <- c("ENST00000370032", "ENST00000420055")

# ----- APPROACH 1

# Get GrangesList
tst <- exonsBy(txdbhg19, by = "tx", use.names=TRUE)[transcriptNames]
# Unlist to get GRanges
tst <- unlist(tst)
# Add transcript metadata column to make Gviz group correctly
elementMetadata(tst)$transcript <- names(tst)
# Create track and plot
trTrack <- Gviz::GeneRegionTrack(tst)
Gviz::plotTracks(trTrack)
# This is the plot I want to produce

# ----- APPROACH 2

tst2 <- GenomicFeatures::exons(
  txdbhg19,
  vals = list(tx_name = transcriptNames),
  columns = list("EXONNAME", "TXNAME", "GENEID"))

# Problem 1: The TXNAME metadata field now contains transcripts that are not in my original transcript list:
unique(unlist(elementMetadata(tst2)$TXNAME[!elementMetadata(tst2)$TXNAME %in% transcriptNames]))
# [1] "ENST00000370031" "ENST00000402983"
# I can hack this out, but it feels unclean to do this:
elementMetadata(tst2)$TXNAME <- elementMetadata(tst2)$TXNAME[elementMetadata(tst2)$TXNAME %in% transcriptNames]
# I would much rather have it so that I only get the transcript names that I originally wanted. Anyhow, let's continue..

# Each entry in elementMetadata(tst2)$TXNAME now has 1 or more transcript names in it. This is not what I want. I want to somehow "expand" this GRanges object, so that each row only have one TXNAME value. Here's a very naive way to do it:

gr <- GRanges()
for (i in 1:length(tst2)) {
  trNames <- elementMetadata(tst2[i])$TXNAME[[1]]
  trNames <- strsplit(trNames, " ")
  for (j in 1:length(trNames)) {
    newGr <- tst2[i]
    elementMetadata(newGr)$TXNAME <- trNames[[j]]
    gr <- append(gr, newGr)
    print(paste(elementMetadata(tst2[i])$GENEID, trNames[[j]]))
  }
}
elementMetadata(gr)$transcript <- elementMetadata(gr)$TXNAME

trTrack <- Gviz::GeneRegionTrack(gr)
Gviz::plotTracks(gr)
# Error in (function (classes, fdef, mtable)  :
# unable to find an inherited method for function ‘displayPars<-’ for signature ‘"GRanges", "list"’

# Why do I get this error? I was hoping that this gr structure would be similar to the tst structure in approach 1

The goal is to get a structure similar to the one I get with exonsBy() -> unlist(). Any ideas?

GenomicFeatures • 3.4k views

ADD COMMENT • link 8.7 years ago by stianlagstad ★ 1.1k