Rpkm Computation
2
1
Entering edit mode
10.8 years ago
fusion.slope ▴ 250

Hello,

I am computing RPKMs and now I would calculate the value "L=length of the feature in Kb" (i.e. the length of the transcribed gene) on the basis of exon unique length count.

For example: from biomart I retrieved for each gene the length of the exons on the basis of the start and end position. Since there are different isoform for each gene I have the same exon more than one time (one for each isoform of transcript). I want to calculate the length of the transcribed gene on the basis of exon unique count.

This is an Example for one Gene:

FBgn0000008    37
FBgn0000008    251
FBgn0000008    41
FBgn0000008    789
FBgn0000008    212
FBgn0000008    1473
FBgn0000008    1207
FBgn0000008    170
FBgn0000008    117
FBgn0000008    337
FBgn0000008    251
FBgn0000008    41
FBgn0000008    789
FBgn0000008    212
FBgn0000008    1473
FBgn0000008    1207
FBgn0000008    170
FBgn0000008    117
FBgn0000008    217
FBgn0000008    344
FBgn0000008    41
FBgn0000008    789
FBgn0000008    212
FBgn0000008    1473
FBgn0000008    1207
FBgn0000008    170
FBgn0000008    117
FBgn0000008    344
FBgn0000008    818

I have to obtain this value:

37+251+ 41+789+212+1473+1207+170 +117 +337 +217+344 +818=5796

Any kind of Idea is appreciated!! Thanks

rpkm • 2.9k views
ADD COMMENT
0
Entering edit mode

Are you looking for the cDNA length, the coding sequence length or the genomic length of the gene?

ADD REPLY
0
Entering edit mode

Hi,

thanks for you reply. I am searching a way to not overcount the length of overlapping exons. Since for each gene there are multiple transcript and for all the transcript there are exons that overlap one with each other, I am searching a way to:

  1. not count the same exon length each time
  2. not count for two time the length of overlapping exons

This is my problem, I am trying with reduce() function on the IRanged bioconductor the package but I still have some problems..

ADD REPLY
0
Entering edit mode

I put the problem in Devon reply. Just to better understand.

Cheers,
Tommi

ADD REPLY
2
Entering edit mode
10.8 years ago

Assuming you have a GTF/GFF file, something like the following in R will work. This also computes the GC content of each gene, so you can simplify things from this example rather considerably. The important part is split(), which breaks things up be gene, followed by reduce(), which merges overlapping exons such that you don't double count anything. After that, it's just a matter of lapply()ing a function to sum() the width (which I actually did in a redundant way below).

N.B., you'll obviously need to change the GTFfile and FASTAfile definitions as well.

#!/usr/bin/env Rscript
library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)

GTFfile = "~/Documents/Misc/Mus_musculus/Ensembl/GRCm38.71/Annotation/Mus_musculus.GRCm38.71.gtf"
FASTAfile = "~/Documents/Misc/Mus_musculus/Ensembl/GRCm38.71/Sequence/Mus_musculus.GRCm38.71.fa"

#Load the annotation and reduce it
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))

#Open the fasta file
FASTA <- FaFile(FASTAfile)
open(FASTA)

#Add the GC numbers
elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
elementMetadata(reducedGTF)$widths <- width(reducedGTF)

#Create a list of the ensembl_id/GC/length
calc_GC_length <- function(x) {
    nGCs = sum(elementMetadata(x)$nGCs)
    width = sum(elementMetadata(x)$widths)
    c(width, nGCs/width)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length))
colnames(output) <- c("Length", "GC")

write.table(output, file="GC_lengths.tsv", sep="\t")
ADD COMMENT
0
Entering edit mode

Really thanks!

ADD REPLY
1
Entering edit mode
10.8 years ago
Zuguang Gu ▴ 220

Since genes may have multiple transcripts, I would merge all transcripts into one transcript for the same gene.

If you use R, you can do like this: For each gene, save positions of exons from different transcripts as a list of GRanges/IRanges objects and do union operation.

And finally use the length of these merged exons.

ADD COMMENT
0
Entering edit mode

Will this strategy takes into account exons which overlap between different transcripts of same gene??

ADD REPLY

Login before adding your answer.

Traffic: 3620 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6