Question

Rpkm Computation


11.4 years ago

fusion.slope ▴ 250

Hello,

I am computing RPKMs and now want to calculate the value "L = length of the feature in kb" (i.e. the length of the transcribed gene) from the unique exon lengths.

For example: from BioMart I retrieved the length of each exon of each gene, based on its start and end positions. Since each gene has several isoforms, the same exon appears more than once (once for each transcript isoform that contains it). I want to calculate the length of the transcribed gene by counting each unique exon only once.

This is an example for one gene:

FBgn0000008    37
FBgn0000008    251
FBgn0000008    41
FBgn0000008    789
FBgn0000008    212
FBgn0000008    1473
FBgn0000008    1207
FBgn0000008    170
FBgn0000008    117
FBgn0000008    337
FBgn0000008    251
FBgn0000008    41
FBgn0000008    789
FBgn0000008    212
FBgn0000008    1473
FBgn0000008    1207
FBgn0000008    170
FBgn0000008    117
FBgn0000008    217
FBgn0000008    344
FBgn0000008    41
FBgn0000008    789
FBgn0000008    212
FBgn0000008    1473
FBgn0000008    1207
FBgn0000008    170
FBgn0000008    117
FBgn0000008    344
FBgn0000008    818

I have to obtain this value:

37 + 251 + 41 + 789 + 212 + 1473 + 1207 + 170 + 117 + 337 + 217 + 344 + 818 = 6013
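
The sum of unique lengths can be reproduced in base R. This is only a minimal sketch: it assumes the two columns have been read into a data frame, and the column names `gene` and `len` are made up for illustration. Deduplicating on length alone would also collapse two different exons that happen to have the same length, and it does nothing about partially overlapping exons, so it approximates the merged exon length rather than computing it exactly.

```r
# Exon lengths for FBgn0000008 as listed above (one row per exon per transcript).
len <- c(37, 251, 41, 789, 212, 1473, 1207, 170, 117, 337,
         251, 41, 789, 212, 1473, 1207, 170, 117, 217, 344,
         41, 789, 212, 1473, 1207, 170, 117, 344, 818)
exons <- data.frame(gene = "FBgn0000008", len = len)

# Sum each distinct length once per gene.
unique_len <- tapply(exons$len, exons$gene, function(x) sum(unique(x)))
unique_len
```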

Any ideas are appreciated! Thanks

rpkm • 3.3k views

updated 11.4 years ago by Devon Ryan 105k • written 11.4 years ago by fusion.slope ▴ 250


Are you looking for the cDNA length, the coding sequence length or the genomic length of the gene?

11.4 years ago by Emily 24k


Hi,

Thanks for your reply. I am looking for a way to avoid overcounting the length of overlapping exons. Each gene has multiple transcripts, and exons from different transcripts can overlap one another, so I am looking for a way to:

not count the same exon length more than once
not count the overlapping portion of two exons twice

I am trying the reduce() function from the IRanges Bioconductor package, but I still have some problems.
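
For reference, the idea of merging overlapping intervals before summing their widths (similar in spirit to what reduce() does) can be sketched in base R. The function name `merged_length` and the coordinates below are hypothetical. The sketch assumes closed, 1-based intervals on a single chromosome and strand, and at least one interval as input.

```r
# Total number of bases covered by a set of closed, 1-based intervals.
merged_length <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  total <- 0
  cur_start <- start[1]
  cur_end <- end[1]
  for (i in seq_along(start)[-1]) {
    if (start[i] <= cur_end + 1) {
      # Overlapping or directly adjacent: extend the current block.
      cur_end <- max(cur_end, end[i])
    } else {
      # Gap: close the current block and start a new one.
      total <- total + (cur_end - cur_start + 1)
      cur_start <- start[i]
      cur_end <- end[i]
    }
  }
  total + (cur_end - cur_start + 1)
}

# Hypothetical exons from three transcripts of one gene.
start <- c(100, 100, 150, 400, 400)
end   <- c(200, 200, 250, 500, 450)
merged_length(start, end)  # 252 (merged blocks 100-250 and 400-500)
```

Identical exons and partial overlaps are handled by the same rule, so both requirements listed above are covered by one pass over the sorted intervals.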

updated 5.6 years ago by Ram 45k • written 11.3 years ago by fusion.slope ▴ 250


I described the problem in a reply to Devon's answer, to make it easier to understand.

Cheers,
Tommi

updated 5.6 years ago by Ram 45k • written 11.3 years ago by fusion.slope ▴ 250

score 2 · Answer 1 · 2014-04-09

Assuming you have a GTF/GFF file, something like the following in R will work. This also computes the GC content of each gene, so you can simplify things from this example rather considerably. The important part is split(), which breaks things up by gene, followed by reduce(), which merges overlapping exons so that nothing is double counted. After that, it's just a matter of lapply()ing a function to sum() the widths (which I actually did in a redundant way below).

N.B., you'll obviously need to change the GTFfile and FASTAfile definitions as well.

#!/usr/bin/env Rscript
library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)

GTFfile = "~/Documents/Misc/Mus_musculus/Ensembl/GRCm38.71/Annotation/Mus_musculus.GRCm38.71.gtf"
FASTAfile = "~/Documents/Misc/Mus_musculus/Ensembl/GRCm38.71/Sequence/Mus_musculus.GRCm38.71.fa"

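#Load the annotation and reduce it
#(asRangedData is omitted: recent rtracklayer releases return a GRanges by default)
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", feature.type="exon")
#Split exons by gene, then merge overlapping exons within each gene
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=TRUE)
#elementNROWS() is the current name for the older elementLengths()
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementNROWS(grl))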

#Open the fasta file
FASTA <- FaFile(FASTAfile)
open(FASTA)

#Add the GC numbers
elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
elementMetadata(reducedGTF)$widths <- width(reducedGTF)

#Create a list of the ensembl_id/GC/length
calc_GC_length <- function(x) {
    nGCs = sum(elementMetadata(x)$nGCs)
    width = sum(elementMetadata(x)$widths)
    c(width, nGCs/width)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length))
colnames(output) <- c("Length", "GC")

write.table(output, file="GC_lengths.tsv", sep="\t")
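
Once a per-gene length is available, it plugs into the usual RPKM definition: reads divided by the feature length in kb and by the total mapped reads in millions. The following is a minimal sketch; the function name `rpkm` and all numbers are made up for illustration.

```r
# RPKM = reads / (length in kb) / (total mapped reads in millions)
rpkm <- function(counts, length_bp, total_mapped) {
  counts / (length_bp / 1e3) / (total_mapped / 1e6)
}

# Made-up numbers for illustration only.
counts       <- c(geneA = 500, geneB = 1200)
length_bp    <- c(geneA = 2000, geneB = 6000)
total_mapped <- 2e7

rpkm(counts, length_bp, total_mapped)
```

The Length column written above is a sum of widths in bp, which is why the sketch divides by 1e3 to get kb.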

score 1 · Answer 2 · 2014-04-09


11.4 years ago

Zuguang Gu ▴ 220

Since genes may have multiple transcripts, I would merge all transcripts into one transcript for the same gene.

If you use R, you can do it like this: for each gene, store the exon positions from the different transcripts as a list of GRanges/IRanges objects and take their union.

Finally, use the total length of these merged exons.
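
One way to read the union idea literally, without Bioconductor, is to take the union of the base positions covered by each transcript's exons and count them. This is a rough base R sketch with hypothetical coordinates and column names. It enumerates every position, so it is only practical for small inputs; GRanges/IRanges objects are the scalable route.

```r
# Hypothetical exon table: one row per exon per transcript.
exons <- data.frame(
  gene       = c("g1", "g1", "g1", "g1", "g2"),
  transcript = c("t1", "t1", "t2", "t2", "t3"),
  start      = c(100, 400, 150, 400, 10),
  end        = c(200, 500, 250, 450, 19)
)

# Positions covered by a set of exons (closed, 1-based intervals).
covered <- function(df) unique(unlist(Map(seq, df$start, df$end)))

# Per gene: union the positions covered by each transcript, then count them.
gene_length <- sapply(split(exons, exons$gene), function(g) {
  per_tx <- lapply(split(g, g$transcript), covered)
  length(Reduce(union, per_tx))
})
gene_length
```

Because the union works on positions, exons that overlap between transcripts of the same gene contribute each shared base only once.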



Will this strategy take into account exons that overlap between different transcripts of the same gene?

11.4 years ago by Varun Gupta ★ 1.3k