extract gene subsequences that have supporting RNASeq evidence
0
0
Entering edit mode
6.0 years ago
mamillerpa ▴ 40

I'd like to report, as nucleotide sequences, the portions of some genes that are covered by RNAseq data. I've written the following, but it seems slow and awkward. I have a feeling I shouldn't have to convert form one kind of range to a dataframe and then to another kind of range, or convert the sequence view to a dataframe in order to access the sequences themselves.

I'm not asking for a code review, but I would be grateful if someone could suggest how to work with the various range objects more elegantly.

These (genomic) BAMs were constructed with bowtie2 and rsem using Ensembl 74, so I've stuck with those gene models. I also have wig and bigwig files.

thanks, Mark

library(GenomicAlignments)
library(GenomicFeatures)
library(BSgenome.Mmusculus.UCSC.mm10)

# first check gene's distribution of coverage first and set to mean - 2 * SD?
coverage.thresh <- 3

# make this a list and wrap much of the script in lapply
current.gene <- "ENSG00000081237"

setwd("/bam/path/")

#have five files with three experimental conditions... pool or union coverage?  take max?
bamfile <- "pe_rnaseq.bam"

txdb <- makeTxDbFromBiomart(biomart = "ENSEMBL_MART_ENSEMBL",
                                host = "dec2013.archive.ensembl.org",
                                dataset = "mmusculus_gene_ensembl")

rnaseq.dat <- readGAlignmentPairs(bamfile)

cvg1 <- coverage(rnaseq.dat)

#  slow steps above OK, wouldn't include in loop

current.chrom <- select(txdb,
                        keys = current.gene,
                        columns = "TXCHROM",
                        keytype = "GENEID")

current.chrom <- current.chrom$TXCHROM

peaks <- slice(cvg1[[current.chrom]], lower = coverage.thresh)

# too reliant on dataframes form here on down?
genedf <- as.data.frame(genes(txdb))

ir <- IRanges(start = genedf$start[genedf$gene_id == current.gene],
              end = genedf$end[genedf$gene_id == current.gene])

intersect.frame <- as.data.frame(intersect(peaks, ir))

intersect.frame$seqnames <- paste0("chr", current.chrom)

coverage.gr <-
  makeGRangesFromDataFrame(df = intersect.frame, seqnames.field = "seqnames")

seq.cov.view <- Views(Mmusculus, coverage.gr)

seq.cov.frame <- as.data.frame(seq.cov.view)

# still need to ensure correct orientation (rev/comp?)
# orientation available in txdb but lost upon converion to irange?
print(seq.cov.frame$dna)
RNA-Seq • 854 views
ADD COMMENT

Login before adding your answer.

Traffic: 1517 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6