Question

How To Calculate The RPKM For The Count Tables Of RNA-Seq Data

0

12.5 years ago

narges ▴ 230

Hi,

I have a count table of RNA-seq data with 8 biological replicates across two conditions (so 4 biological replicates for each condition), like below:

               V1 V2 V3 V4 V5 V6 V7 V8 V9
2 ENSG00000000003  0  0  0  0  1  0  0  0
3 ENSG00000000005  0  0  0  0  0  0  0  0
4 ENSG00000000419 10 24 19 20 19  8 14  6
5 ENSG00000000457 17 15 13 18 21 18 21 15
6 ENSG00000000460  2  3  5  2  4  6  8  2
7 ENSG00000000938 20  4 35 16 10 17 19  9

How can I calculate the RPKM values for each gene? I have the count table, but I do not have the gene lengths.

rpkm rna-seq • 20k views

ADD COMMENT • link updated 12.5 years ago by Ge ▴ 80 • written 12.5 years ago by narges ▴ 230

score 1 · Answer 1 · 2013-01-22

1

12.5 years ago

Damian Kao 16k

You cannot calculate the RPKM if you don't have the gene's length.

ADD COMMENT • link 12.5 years ago by Damian Kao 16k

0

And how can I get the gene's length?

ADD REPLY • link 12.5 years ago by narges ▴ 230

0

What kind of data do you have exactly? Is it just the table of counts? Do you know what the reads were mapped to?

ADD REPLY • link 12.5 years ago by Damian Kao 16k

0

I first used TopHat with the Bowtie2 index to map the reads from the BAM files, and then counted the reads with HTSeq using a GTF file.

ADD REPLY • link 12.5 years ago by narges ▴ 230

1

Like Ge said below, you can use the GTF file to get the gene lengths. You can write a script to do that. If you don't have experience in scripting, you can try opening the .gtf file in Excel and calculating the length of each feature as the 5th column (end position) minus the 4th column (start position), plus 1.

It's a nice little project for you to try to learn scripting if you don't already know how.

ADD REPLY • link 12.5 years ago by Damian Kao 16k

0

Thank you. Just one more question: do the start and end positions you mentioned also include the introns? I mean, are they the start and end positions of the genes on the chromosome or not? Is the gene length simply the difference of these two values, or should I take some other factors into account as well?

ADD REPLY • link 12.5 years ago by narges ▴ 230

0

That depends on how your GTF file is structured. Usually only transcript structures are listed in a GTF file, but not everyone follows the rules. Can you post a few lines of your file?

ADD REPLY • link 12.5 years ago by Damian Kao 16k

0

Generally, in a GTF file each row is one exon, CDS, or some other feature. You can tell which it is from the "feature" column (the third column). For an exon, for instance, end - start + 1 is its length. If you want the length of a transcript or gene, you can get the mapping between exons and transcripts, and even genes, from the "attributes" column.

ADD REPLY • link 12.5 years ago by Ge ▴ 80

0

Right. So you basically need to add up the lengths of the exons for each gene to get the transcript length.
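
A minimal Python sketch of that idea, for anyone who wants a starting point. It assumes a standard tab-separated GTF where the 9th column holds quoted attributes such as gene_id "DDX11L1"; and it takes the union of a gene's exons so that exons shared by several transcripts are not counted twice. The function name gene_lengths_from_gtf is made up for illustration, and the sketch also assumes each gene_id appears on only one chromosome.

```python
import re
from collections import defaultdict


def gene_lengths_from_gtf(lines):
    """Return {gene_id: length in bp of the union of that gene's exons}."""
    exons = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon rows contribute to the length
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if match is None:
            continue  # no quoted gene_id attribute on this row
        exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # overlapping exon: extend the current merged interval
                cur_end = max(cur_end, end)
            else:
                # gap (intron): close the current interval, start a new one
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene] = total
    return lengths


# Exon coordinates taken from the hg19 lines posted in this thread,
# written here with standard GTF attribute quoting (an assumption).
demo = [
    'chr1\tunknown\texon\t11874\t12227\t.\t+\t.\tgene_id "DDX11L1"; transcript_id "NR_046018_1";',
    'chr1\tunknown\texon\t12613\t12721\t.\t+\t.\tgene_id "DDX11L1"; transcript_id "NR_046018_1";',
    'chr1\tunknown\texon\t13221\t14408\t.\t+\t.\tgene_id "DDX11L1"; transcript_id "NR_046018_1";',
]
print(gene_lengths_from_gtf(demo))  # 354 + 109 + 1188 = 1651 bp for DDX11L1
```

To run it on a real file, pass an open file handle instead of the list, e.g. gene_lengths_from_gtf(open("genes.gtf")), where genes.gtf is a placeholder file name.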

ADD REPLY • link 12.5 years ago by Damian Kao 16k

0

Many thanks to both of you. I have downloaded the GTF file from the UCSC Genome Browser site; it is the latest version of hg19 and looks like this:

> chr1    unknown    exon    11874    12227    .    +    .    gene_id    DDX11L1    transcript_id    NR_046018_1    gene_name    DDX11L1    tss_id    TSS14523
> chr1    unknown    exon    12613    12721    .    +    .    gene_id    DDX11L1    transcript_id    NR_046018_1    gene_name    DDX11L1    tss_id    TSS14523
> chr1    unknown    exon    13221    14408    .    +    .    gene_id    DDX11L1    transcript_id    NR_046018_1    gene_name    DDX11L1    tss_id    TSS14523

ADD REPLY • link 12.5 years ago by narges ▴ 230

1

If your aim is to find differentially expressed genes between the two groups, I would suggest using one of the edgeR/DESeq/baySeq R packages rather than calculating RPKM values. See this paper: http://www.ncbi.nlm.nih.gov/pubmed/22988256

ADD REPLY • link 12.5 years ago by Sudeep ★ 1.7k

0

Actually, my goal is to rank genes by their expression level, not to do a DE analysis.

ADD REPLY • link 12.5 years ago by narges ▴ 230

0

Is there any script available for calculating RPKM? I have a matrix with genes in the first column and gene_length in the second column, followed by the counts for all the samples in the other columns.
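
For a table laid out like that, a rough Python sketch could look like the following. It assumes a tab-separated file with a header row, and it uses each sample's column sum of counts as a stand-in for the total number of mapped reads, which is an approximation (reads that HTSeq did not assign to any gene are not included). The function name rpkm_table is hypothetical.

```python
import csv
import io


def rpkm_table(tsv_text):
    """Convert a TSV table (gene, gene_length, one count column per sample)
    into RPKM values. Returns (sample_names, {gene: [rpkm per sample]})."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    n_samples = len(header) - 2

    # Assumption: total mapped reads per sample ~ sum of that sample's counts.
    totals = [sum(float(r[2 + j]) for r in body) for j in range(n_samples)]

    result = {}
    for r in body:
        length = float(r[1])
        values = []
        for j in range(n_samples):
            if length > 0 and totals[j] > 0:
                # RPKM = count * 1e9 / (gene length in bp * total reads)
                values.append(float(r[2 + j]) * 1e9 / (length * totals[j]))
            else:
                values.append(0.0)  # avoid dividing by zero
        result[r[0]] = values
    return header[2:], result


# Made-up numbers, only to show the expected layout.
example = "gene\tgene_length\ts1\ts2\ngeneA\t1000\t10\t20\ngeneB\t2000\t30\t20\n"
samples, table = rpkm_table(example)
for gene, values in table.items():
    print(gene, dict(zip(samples, values)))
```

If the real totals of mapped reads are known (for example from the alignment summary), it would be better to use those instead of the column sums.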

ADD REPLY • link 9.2 years ago by genie66 ▴ 30

score 1 · Answer 2 · 2013-01-22

1

12.5 years ago

Ge ▴ 80

raw counts = FPKM * (length of that transcript/1000) * (# of mapped reads / 1e6) and you can do the math.

The gene length or transcript length can be extracted from a GTF file.
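
Rearranged for RPKM, the relation above can be sketched in Python as below. The function names and the numbers in the example are made up for illustration; only the formula itself comes from this answer.

```python
def rpkm(count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count / ((length_bp / 1000) * (total_mapped_reads / 1e6))


def counts_from_rpkm(value, length_bp, total_mapped_reads):
    """The same relation solved for the raw count."""
    return value * (length_bp / 1000) * (total_mapped_reads / 1e6)


# Hypothetical gene: 10 reads, 2000 bp of exons, 5 million mapped reads.
value = rpkm(10, 2000, 5_000_000)
print(value)                                      # 10 / (2 * 5) = 1.0
print(counts_from_rpkm(value, 2000, 5_000_000))   # back to 10.0
```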

ADD COMMENT • link 12.5 years ago by Ge ▴ 80

0

Actually, I had the BAM files of my RNA-seq samples. So I used TopHat to get the alignments and then ran HTSeq on the accepted hits file to get the count tables. Now I need to know the expression level of the genes, so I decided to calculate the RPKM. But now I am not sure at which step I should have calculated the RPKM: before using HTSeq and getting the above table, or now, after getting the count table from HTSeq? Can I use the easyRNASeq R package now to get the RPKM values from the count table above?

ADD REPLY • link 12.5 years ago by narges ▴ 230

0

When you ran HTSeq to get the counts, it also needed a GTF file as input, right? That is where you can find the lengths of exons, genes, or transcripts (whatever you are interested in). Then calculate the RPKM after you get the counts. I have never used the easyRNASeq package, but I had a quick look at it. It seems that it can calculate RPKM and other normalized versions of the raw counts.