Can anyone suggest how to get the gene length for the RPKM calculation? I want to do a within-sample gene comparison. I went through some of the Biostars threads but was not able to figure out how to get the gene length with or without intronic regions (suggestions like (end - start) + 1 from the GTF file). I have an Ensembl GTF file. Any suggestions would be a great help.
You should get transcript-level expression (and then aggregate transcript-level RPKMs/TPMs/whatever to gene-level). Transcripts actually have lengths associated with them; genes don't.
Agree with dsull. Calculating RPKM or FPKM manually is not a good choice. You are better off calculating them with software such as RSEM or StringTie. However, if you want to do this anyway, you can take the records labeled "transcript" in the GTF and use "end - start + 1" to get the length including introns (GTF coordinates are 1-based and inclusive, so the +1 is needed). For the length without introns, there is a simpler way: download the cDNA file from GENCODE https://www.gencodegenes.org/human/ and use the cDNA length given in each header.
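If you do go the manual route, the GTF arithmetic could look roughly like the sketch below. It assumes a standard 9-column, tab-separated GTF whose "transcript" and "exon" records carry a transcript_id attribute; the function name and the toy records are made up for illustration and are not from any real annotation.

```python
import re
from collections import defaultdict

def transcript_lengths(gtf_lines):
    """Return {transcript_id: (span_with_introns, exonic_length)}.

    Hypothetical helper: assumes 9 tab-separated GTF columns and a
    transcript_id "..." attribute on transcript and exon records.
    """
    span = {}
    exonic = defaultdict(int)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        feature, start, end, attrs = fields[2], int(fields[3]), int(fields[4]), fields[8]
        match = re.search(r'transcript_id "([^"]+)"', attrs)
        if match is None:
            continue  # e.g. "gene" records have no transcript_id
        tid = match.group(1)
        length = end - start + 1  # GTF coordinates are 1-based, inclusive
        if feature == "transcript":
            span[tid] = length      # length including introns
        elif feature == "exon":
            exonic[tid] += length   # spliced length, introns excluded
    return {tid: (span[tid], exonic.get(tid, 0)) for tid in span}

# Toy records (made-up coordinates) for one transcript with two exons.
example = [
    'chr1\ttoy\ttranscript\t101\t1000\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\ttoy\texon\t101\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\ttoy\texon\t801\t1000\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
lengths = transcript_lengths(example)
print(lengths)
```

For real data you would pass an open file handle instead of the toy list. The exonic (spliced) length is the one that matters for RPKM, since reads come from mature transcripts.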
For example, two transcripts in the GENCODE cDNA FASTA file carry the numbers 1657 and 632 in their headers. Those are the transcript lengths, and you can use them directly.
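Pulling those numbers out of the headers could be sketched like this. The sketch assumes the headers are pipe-delimited with the transcript ID in the first field and the length in the seventh; please check that against your own file before relying on it. The header in the example uses made-up identifiers, and the function names are hypothetical.

```python
def length_from_header(header):
    """Return (transcript_id, length) from one FASTA header line.

    Assumption: fields are separated by "|", the transcript ID is field 1
    and the transcript length is field 7. Verify this on your own file.
    """
    fields = header.lstrip(">").strip().split("|")
    return fields[0], int(fields[6])

def lengths_from_fasta(lines):
    """Collect {transcript_id: length} from all header lines of a FASTA."""
    return dict(length_from_header(line) for line in lines if line.startswith(">"))

# Made-up header and sequence, only to show the assumed layout.
toy_fasta = [
    ">TX_A.1|GENE_A.1|-|-|GENEA-201|GENEA|1657|lncRNA|\n",
    "ACGTACGTACGT\n",
    ">TX_B.1|GENE_A.1|-|-|GENEA-202|GENEA|632|lncRNA|\n",
    "ACGT\n",
]
fasta_lengths = lengths_from_fasta(toy_fasta)
print(fasta_lengths)
```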
Here's why: Let's say a gene has two isoforms (a 1000 bp transcript and a 9000 bp transcript). What is the gene length? Do you take the average and say the gene length is 5000 bp? No, it can't be, because if the 9000 bp transcript is NEVER expressed in your sequenced tissue (only the 1000 bp transcript is expressed), then you're dividing by 5000 bp and are therefore underestimating your abundances (in this situation, you should be dividing by 1000 bp, not 5000 bp, since only the 1000 bp transcript is expressed).
The way to remedy this problem is to get transcript-level expression, divide your transcript-level counts by the transcript lengths, and THEN aggregate the transcript abundances to the gene level.
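To put numbers on the two-isoform example, here is a small sketch using the usual RPKM definition (reads per kilobase of transcript per million mapped reads). The read count (100) and library size (1,000,000 mapped reads) are invented for illustration, and gene-level aggregation is done by summing the transcript-level values.

```python
def rpkm(count, length_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads."""
    return count * 1e9 / (length_bp * total_mapped_reads)

total_reads = 1_000_000  # invented library size
# Invented counts: all 100 reads come from the 1000 bp isoform.
isoforms = {"short_1000bp": (100, 1000), "long_9000bp": (0, 9000)}

# Naive gene-level value: pooled count divided by the average length (5000 bp).
gene_count = sum(count for count, _ in isoforms.values())
naive = rpkm(gene_count, 5000, total_reads)

# Transcript-level values first, then aggregate (sum) to the gene level.
per_transcript = {name: rpkm(count, length, total_reads)
                  for name, (count, length) in isoforms.items()}
aggregated = sum(per_transcript.values())

print(naive, aggregated)
```

With these invented numbers the average-length approach gives 20 while the transcript-level route gives 100, a fivefold underestimate, which is exactly the problem described above.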
You are asking for something that mathematically (and biologically) does NOT make any sense. "Gene length" is nonsensical unless you're working with "one gene -> one transcript" organisms.
@dsull, what you mentioned makes sense. What would you suggest in my case? I have used htseq-count to get raw counts, and now I want to normalize them to RPKM. The equation requires a gene length, so I am asking how to get the gene length if I calculate it manually. Or is there a tool that takes my STAR-mapped BAM file and gives me RPKM? Please suggest.
You could use featureCounts which does output a Length column associated with each gene by doing a "union of exons" -- however, as stated previously, this will NOT give you accurate results. The correct way to do things is to get transcript-level estimates by STAR+RSEM or kallisto and then aggregate transcript abundances to gene-level.
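For reference, a "union of exons" length can be sketched as an interval merge: overlapping exons from all isoforms of a gene are collapsed and the non-overlapping bases are summed. This is only an illustration of the idea, not featureCounts' actual code; the function name and coordinates are made up, and the same caveat about accuracy applies to any RPKM computed from this length.

```python
def union_exon_length(exons):
    """Total number of bases covered by at least one exon.

    `exons` is a list of (start, end) pairs, 1-based and inclusive,
    pooled across all isoforms of one gene on the same chromosome.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(exons):
        if cur_end is None or start > cur_end:
            # No overlap with the current block: close it, start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current block if this exon reaches further.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total

# Made-up exons: the first two overlap and merge into 101-400.
gene_exons = [(101, 300), (201, 400), (801, 1000)]
union_len = union_exon_length(gene_exons)
print(union_len)
```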