Question

Rsamtools-Countoverlaps

1

Entering edit mode

13.3 years ago

Cdiez ▴ 150

Hi, I'm a beginner analyzing genomic data and using R and I have a doubt that I don't know how to solve. I'm using Rsamtools to analyze mRNAseq_data but once I have the reads covering each gene I would like to add the information about their chromosome, start and stop of each gene. I have downloaded the gene information from the UCSC browser as a RangeList following this command line:

library(GenomicFeatures)
txdb = makeTranscriptDbFromUCSC(genome='hg19',tablename='ensGene')

I have tried to coerce the RangeList into a data frame in order to get this "extra data" but the number of observations changes so I don't know how to extract from the RangeList the information I need of each gene.

Right now I have this data frame:

Gene          Reads

AGR37373.F06    5

AGR37273.F04    0

And I would like to have this:

Gene          Chr     Start    Stop   Reads

AGR37373.F06    1      20       150     5

AGR37273.F04    1      300      410     0

I'd be very grateful if someone could give me any advice about how to solve this problem.

Thanks!

mrna r samtools • 5.2k views

ADD COMMENT • link updated 13.3 years ago by Michael 55k • written 13.3 years ago by Cdiez ▴ 150

0

Entering edit mode

Did you try to merge the two dataframes? read ?merge and merge on the the geneid.

ADD REPLY • link 13.3 years ago by Michael 55k

Ram · Answer 1 · 2011-08-09

My guess is you do something like:

library(GenomicFeatures)
txdb <- makeTranscriptDbFromUCSC(genome='hg19',tablename='ensGene')
aligns <- readBamGappedAlignments("my.bam")
counts <- countOverlaps(transcripts(txdb), aligns)
names(counts) <- elementMetadata(tx)[,"tx_name"]

so change the last line to and following:

elementMetadata(txdb)[,"my.counts"] <- counts
df <- as.data.frame(txdb)

and voila you have a data frame with all genomic positions and metadata.

However I would suggest you to look at htseq-count because it has multiple resolution rules for situations that a simple count does not answer. call it from R with system("htseq-count ...") and read in the resulting data.frame.

Ram · Answer 2 · 2011-08-10

Hi! Thanks a lot for your useful comments!

I have tried the two suggested methods but I have not had any successful result :(

Yes, you are right (Ido Tamir), I was trying this command line:

library(GenomicFeatures) txdb <- makeTranscriptDbFromUCSC(genome='hg19',tablename='ensGene') aligns <- readBamGappedAlignments("my.bam") counts <- countOverlaps(transcripts(txdb), aligns) counts.dframe <- data.frame(counts, stringsAsFactors=FALSE) rownames(counts.dframe) <- names(txdb)

Following this command line I obtained the number of reads overlapping each gene:

CC209858.4_FG002 214 CC209858.4_FG003 0 CC209858.4_FG004 17 CC209859.2_FG006 0 CC209859.2_FG008 0

Now I wanted to add an extra column with the information stored in the Metadata(txdb). I tried the two suggested options suggested command lines but I got an error:

rownames(counts.dframe) <- elementMetadata(txbd)[,"tx_name"]

Error in elementMetadata(txbd)[, "tx_name"] : selecting cols: cannot subset by character when names are NULL

elementMetadata(txbd)[,"my.counts"] <- count.dframe

Error in [<-(*tmp*, , "my.counts", value = list(counts = c(1L, : replacing cols: cannot subset by character when names are NULL

About trying to merge both files. The problem to merge the two files(counts and genes) using the gene ID is that I have to coerce into data.frame both files, and when I do it the number of rows of the txbd.dataframe doesn't correspond with the number of observations this file has as GRangeList (I don't get why..). Then, I can not merge the two datafrmes.

i.e:

Original files Files coerced into data frame

counts (interger[110185]) counts.dframe (110185 obs. of 1 variables) txbd (GRangesList of length 110185) txbd.dframe (392978 obs. of 8 variables)

Then, I don't know how I can access to the information store in the metaData and use it to complete the information about the reads overlapping each gene. I just thinking about to export the files and write and script in Perl to do it, but should be a way to do it in R (I guess...).

Thanks!

Ram · Answer 3 · 2011-08-10

0

Entering edit mode

13.3 years ago

Michael 55k

The number of rows doesn't have to be equal

Try:

merge(counts, meta, all=TRUE, by.x=1, by.y=1) # set by.x by.y to the column number with the Geneids

That way you will get a full outer join like in SQL with NAs inserted in rows that don't in exist in one of the files.

ADD COMMENT • link updated 5.5 years ago by Ram 44k • written 13.3 years ago by Michael 55k