RPKM genelength
1
0
Entering edit mode
2.6 years ago

Hello,

anyone suggest how to get the genelength for the rpkm calculation. i want to do the within sample gene comparision i went through some of the biostar threads i did not able figure out how to get the gene length with or without intronic regions (suggestion like (start-end)+1 from GTF file), I have ensembl GTF file any suggestion please help me,

Thank you

RPKM RNA-seq • 1.4k views
ADD COMMENT
2
Entering edit mode

You should get transcript-level expression (and then aggregate transcript-level RPKMs/TPMs/whatever to gene-level). Transcripts actually have lengths associated with them; genes don't.

ADD REPLY
1
Entering edit mode
2.6 years ago
shiyang_bio ▴ 170

Agree with dsull. Calculating RPKM or FPKM manually is not a good choice. You'd better calculate them using some software like RSEM or StringTie. However, if you want to do this anyway, you can just use the items labeled as "transcript" in GTF, and use "end - start" to get the length with intron. For the gene length without intron, there is a simpler way: download the cDNA file from Gencode https://www.gencodegenes.org/human/ and use the length of cDNA in the header.

Below are two transcripts from Gencode cDNA fasta file, you can see the number 1657 and 632, just use them is OK

ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript| GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTC TCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGA TGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTG CAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGG GCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCAT AGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAG TGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACG ACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGCAGGGCCATCA GGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGGCACCCTG TCCTGGACACGCTGTTGGCCTGGATCTGAGCCCTGGTGGAGGTCAAAGCCACCTTTGGTT CTGCCATTGCTGCTGTGTGGAAGTTCACTCCTGCCTTTTCCTTTCCCTAGAGCCTCCACC ACCCCGAGATCACATTTCTCACTGCCTTTTGTCTGCCCAGTTTCACCAGAAGTAGGCCTC TTCCTGACAGGCAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTG GAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACAAAGGTGAAACCCAGG AGAGTGTGGAGTCCAGAGTGTTGCCAGGACCCAGGCACAGGCATTAGTGCCCGTTGGAGA AAACAGGGGAATCCCGAAGAAATGGTGGGTCCTGGCCATCCGTGAGATCTTCCCAGGGCA GCTCCCCTCTGTGGAATCCAATCTGTCTTCCATCCTGCGTGGCCGAGGGCCAGGCTTCTC ACTGGGCCTCTGCAGGAGGCTGCCATTTGTCCTGCCCACCTTCTTAGAAGCGAGACGGAG CAGACCCATCTGCTACTGCCCTTTCTATAATAACTAAAGTTAGCTGCCCTGGACTATTCA CCCCCTAGTCTCAATTTAAGAAGATCCCCATGGCCACAGGGCCCCTGCCTGGGGGCTTGT CACCTCCCCCACCTTCTTCCTGAGTCATTCCTGCAGCCTTGCTCCCTAACCTGCCCCACA GCCTTGCCTGGATTTCTATCTCCCTGGCTTGGTGCCAGTTCCTCCAAGTCGATGGCACCT CCCTCCCTCTCAACCACTTGAGCAAACTCCAAGACATCTTCTACCCCAACACCAGCAATT GTGCCAAGGGCCATTAGGCTCTCAGCATGACTATTTTTAGAGACCCCGTGTCTGTCACTG AAACCTTTTTTGTGGGAGACTATTCCTCCCATCTGCAACAGCTGCCCCTGCTGACTGCCC TTCTCTCCTCCCTCTCATCCCAGAGAAACAGGTCAGCTGGGAGCTTCTGCCCCCACTGCC TAGGGACCAACAGGGGCAGGAGGCAGTCACTGACCCCGAGACGTTTGCATCCTGCACAGC TAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTG ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene| GTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGTTGGAGGAAAGA TGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGTGTGTGGTGATGCCAGGCATGC CCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTC TTCTCAGAGCCCAGGCCAGGGGCCCCCAAGAAAGGCTCTGGTGGAGAACCTGTGCATGAA GGCTGTCAACCAGTCCATAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGT GCTCCTGGACCAGTGATACACCCGGCACCCTGTCCTGGACACGCTGTTGGCCTGGATCTG AGCCCTGGTGGAGGTCAAAGCCACCTTTGGTTCTGCCATTGCTGCTGTGTGGAATTTCAC CAGAAGTAGGCCTCTTCCTGACAGGCAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTT TGCTCTGCCCGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACAA AGGTGAAACCCAGGAGAGTGTGGAGTCCAGAGTGTTGCCAGGACCCAGGCACAGGCATTA GTGCCCGTTGGAGAAAACAGGGGAATCCCGAA

ADD COMMENT
0
Entering edit mode

@shiyang_bio thanks for the reply but i need the gene level length not the transcript level and already mentioned i have ensembl gtf file

ADD REPLY
1
Entering edit mode

There's no such a thing as a "gene length".

Here's why: Let's say a gene has two isoforms (a 1000 bp transcript and a 9000 bp transcript). What is the gene length? Do you take the average and say the gene length is 5000 bp? No, it can't be, because if the 9000 bp transcript is NEVER expressed (only the 1000 bp transcript is expressed) in your sequenced tissue, then you're dividing by a 5000 bp and are therefore underestimating your abundances (in this situation, you should be dividing by 1000 bp, not 5000 bp since only the 1000 bp transcript is expressed).

The way to remedy this problem is to get transcript-level expression and dividing your transcript-level counts by the transcript lengths, and THEN aggregating the transcript abundances to the gene-level.

You are asking for something that mathematically (and biologically) does NOT make any sense. "Gene length" is nonsensical unless you're working with "one gene -> one transcript" organisms.

ADD REPLY
0
Entering edit mode

@dsull, As your metioned make sense, what will be the suggestion actually i have used htseq-count to get raw count now i wanted to normalized to RPKM, equation suggest to give gene length, so I am asking how to get the gene length if w i am caluculate manually or is there any tool that i can provide my STAR mapped bam file to get rpkm please suggest

thank you

ADD REPLY
3
Entering edit mode

You could use featureCounts which does output a Length column associated with each gene by doing a "union of exons" -- however, as stated previously, this will NOT give you accurate results. The correct way to do things is to get transcript-level estimates by STAR+RSEM or kallisto and then aggregate transcript abundances to gene-level.

ADD REPLY

Login before adding your answer.

Traffic: 2859 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6