How to define the gene length for RPKM calculation
4
3
Entering edit mode
8.6 years ago
Bnf83 ▴ 150

Hi guys,

I would like to calculate the RPKM of my RNA seq experiment. To do this, as from the formula, I need to know the gene length. My starting point are the row reads (single end) counts resulting from:

htseq-count -t exon -i gene_id my_file.sam gencode_v19.gtf > my_count.txt

To calculate the gene length I used the .gtf file from http://www.gencodegenes.org/releases/19.html.

The .gtf file contains start and end positions of genes annotated as: exons, transcript, CDS, start codon etc. I think I have to consider only exons for each gene in order to calculate the length of a gene but I'm not sure. My final purpose is to calculate the gene-level expression of my samples. I'm not interested in splicing variants for example.

Is it right to take into account exons to calculate the gen length for my purpose?

Thank you in advance and kind regards

B

gtf RNA-Seq RPKM • 23k views
ADD COMMENT
4
Entering edit mode

If you use featureCounts from subread (http://subread.sourceforge.net/) instead of htseq for your quantification, you will get the gene length automatically. Just a suggestion.

ADD REPLY
0
Entering edit mode

The length in featureCounts is wrong. Because it includes the length of intron!

ADD REPLY
3
Entering edit mode
6.8 years ago
lhdcsu ▴ 50

Different types of gene lengths can be easily calculated using GTFtools:http://www.genemine.org/gtftools.php

type 1: mean of lengths of isoforms.

type 2: median of lengths of isoforms.

type 3: max of lengths of isoforms.

type 4: length of merged exons of isoforms.

ADD COMMENT
1
Entering edit mode
8.6 years ago
Nathaniel ▴ 120

If you decide to ignore the existence of different transcripts and work only on the gene metalevel, then do as you suggest: overlap all exons for each gene and sum the length of each exon to obtain an estimate of the gene length

However, as Michael mentions, RPKMs are not the best choice to do differential analysis in gene expression. Take a look at the following video: www.youtube.com/watch?v=TTUrtCY2k-w

P.D. I assume your RNAseq protocol contained the polyA extraction step, to obtain only mature transcripts

ADD COMMENT
1
Entering edit mode
8.6 years ago

Since your final purpose is to calculate the gene-level expression I would suggest to go with these counts straight to DESeq2 (or edgeR, but I'm more familiar with DESeq2). You can export normalized counts using far more sophisticated models than RPKM.

ADD COMMENT
0
Entering edit mode
8.6 years ago
Michael 55k

There have been several post showing that RPKM can be detrimental instead of doing any good for gene based analysis. For gene DE you wouldn't need it anyway, because gene length doesn't change. Most packages that compute some aggregate (TPM, CPM, FPKM) will compute the gene length for you based on the annotation.

If you want to calculate it yourself approximately, you have already committed to 'exon length' by the use of the parameter. Any read not overlapping exons has no chance of being counted. Some exons might overlap, so it is not good enough to sum all exon lengths, because positions might count multiple times. I would try something like non-overlapping exon length, that means: compute the Union of all exon intervals and sum the length.

ADD COMMENT
2
Entering edit mode

Dear Michael,

I have to calculate rpkm from counts data which I got from STAR. I'm using edgeR for this and it needs gene_length. I'm doing this rpkm calculation not for DE analysis. Could you please tell me how can I calculate rpkm from counts.

Do you think this is the right way to calculate gene_length?

library(GenomicFeatures)
hg19.ens <- makeTxDbFromUCSC(genome="hg19", tablename="ensGene")
exonic <- exonsBy(hg19.ens, by="gene")
red.exonic <- reduce(exonic)
exon.lengths <- sum(width(red.exonic))
ADD REPLY
5
Entering edit mode

Almost. The last part needs a bit of work in order to work:

library(GenomicFeatures)
hg19.ens <- makeTxDbFromUCSC(genome="hg19", tablename="ensGene")
exonic <- exonsBy(hg19.ens, by="gene")
red.exonic <- reduce(exonic)
exon.lengths <- vapply(width(red.exonic), sum, numeric(1))

Alternatively, if you have an external gtf file , you could do:

library(GenomicFeatures)
txdb <- makeTxDbFromGFF(gtfFile, format="gtf")
exonic <- exonsBy(txdb, by="gene")
red.exonic <- reduce(exonic)
exon.lengths <- vapply(width(red.exonic), sum, numeric(1))
ADD REPLY

Login before adding your answer.

Traffic: 2182 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6