I'd like to report, as nucleotide sequences, the portions of some genes that are covered by RNAseq data. I've written the following, but it seems slow and awkward. I have a feeling I shouldn't have to convert form one kind of range to a dataframe and then to another kind of range, or convert the sequence view to a dataframe in order to access the sequences themselves.
I'm not asking for a code review, but I would be grateful if someone could suggest how to work with the various range objects more elegantly.
These (genomic) BAMs were constructed with bowtie2 and rsem using Ensembl 74, so I've stuck with those gene models. I also have wig and bigwig files.
thanks, Mark
library(GenomicAlignments)
library(GenomicFeatures)
library(BSgenome.Mmusculus.UCSC.mm10)
# first check gene's distribution of coverage first and set to mean - 2 * SD?
coverage.thresh <- 3
# make this a list and wrap much of the script in lapply
current.gene <- "ENSG00000081237"
setwd("/bam/path/")
#have five files with three experimental conditions... pool or union coverage? take max?
bamfile <- "pe_rnaseq.bam"
txdb <- makeTxDbFromBiomart(biomart = "ENSEMBL_MART_ENSEMBL",
host = "dec2013.archive.ensembl.org",
dataset = "mmusculus_gene_ensembl")
rnaseq.dat <- readGAlignmentPairs(bamfile)
cvg1 <- coverage(rnaseq.dat)
# slow steps above OK, wouldn't include in loop
current.chrom <- select(txdb,
keys = current.gene,
columns = "TXCHROM",
keytype = "GENEID")
current.chrom <- current.chrom$TXCHROM
peaks <- slice(cvg1[[current.chrom]], lower = coverage.thresh)
# too reliant on dataframes form here on down?
genedf <- as.data.frame(genes(txdb))
ir <- IRanges(start = genedf$start[genedf$gene_id == current.gene],
end = genedf$end[genedf$gene_id == current.gene])
intersect.frame <- as.data.frame(intersect(peaks, ir))
intersect.frame$seqnames <- paste0("chr", current.chrom)
coverage.gr <-
makeGRangesFromDataFrame(df = intersect.frame, seqnames.field = "seqnames")
seq.cov.view <- Views(Mmusculus, coverage.gr)
seq.cov.frame <- as.data.frame(seq.cov.view)
# still need to ensure correct orientation (rev/comp?)
# orientation available in txdb but lost upon converion to irange?
print(seq.cov.frame$dna)