Hi,
I have a count table of RNA seq data with 8 biological replications for two conditions ( so 4 biological replicates for each conditions ) like below:
V1 V2 V3 V4 V5 V6 V7 V8 V9 2 ENSG00000000003 0 0 0 0 1 0 0 0 3 ENSG00000000005 0 0 0 0 0 0 0 0 4 ENSG00000000419 10 24 19 20 19 8 14 6 5 ENSG00000000457 17 15 13 18 21 18 21 15 6 ENSG00000000460 2 3 5 2 4 6 8 2 7 ENSG00000000938 20 4 35 16 10 17 19 9
How can I calculate the RPKM values for each gene? I have the count table but I do not have the gene's length.
And how can I get the gene's length?
What kind of data do you have exactly? Is it just the table of counts? Do you know what the reads were mapped to?
First used the TopHat and the Bowtie2Index to map the bam files and then using a gtf file I calculated the reads with the HTseq.
Like Ge said below, you can use the gtf file to get the gene lengths. You can write a script to do that. If you don't have experience in scripting, you can try to open up the .gtf file in excel and generate the lengths by subtracting the 4th column (start position) from the 5th column (end position) + 1.
It's a nice little project for you to try to learn scripting if you don't already know how.
Thank you. Just one more question: is this start and end position you mentioned also includes the introns? I mean is it the start and end position of the genes on the chromosome or not? Is the gene length the simple subtract of these two variables or i should take into account some other factors as well.
That depends on how your gtf file is structured. Usually there are only transcript structure listed in a gtf file, but not everyone follows the rules. Can you post a few lines of your file?
generally in the gtf file, one row is one exon or cds or something else. You can know what it is from "class" column or something..I cannot remember. end-start+1 is the length of one exon, for instance. if you want to get the length for transcript or gene, you can get the mapping relationship between the exons and transcripts, even genes in the "attributes" column.
Right. So you basically need to add up the lengths of the exons for each gene to get the transcript length.
Many thanks from both of you. I have downloaded the gtf file from the UCSC genome browser site and it is the latest version of hg19 like this:
If your aim is to find differentially expressed genes between the two groups I would suggest using any of edgeR/DESeq/baySeq R packages than calculating RPKM values. See this paper: http://www.ncbi.nlm.nih.gov/pubmed/22988256
Actually my goal is to rank genes based on their expression level not DE analysis.
Is there any script available for calculating RPKM? I have a matrix ofGenes in the first column, gene_length in second column followed by count of all the samples in other colums.