Hi all, I have a question about converting read count data from RNASeq data to expression values.
For example, this is a sample of read count data from GSM1375434_p01_counts.txt from the study GSE57110 in the GEO database:
name read_count
ENSRNOG00000000001 4
ENSRNOG00000000007 1278
ENSRNOG00000000008 3
ENSRNOG00000000009 0
ENSRNOG00000000010 28
ENSRNOG00000000012 0
ENSRNOG00000000014 47
ENSRNOG00000000017 0
ENSRNOG00000000021 48
ENSRNOG00000000024 77
ENSRNOG00000000028 4
ENSRNOG00000000029 80
I have a number of samples/files like this, and I want to turn these read counts into expression values.
I found a helpful guide to doing that here, however, when I follow the instructions, I think my numbers are unreasonable.
I followed the above instructions to obtain a TPM score for each gene (I also have a similar table for RPKM, which doesn't make sense either). If someone could tell me where I've gone wrong with this table I would appreciate it. In the below table:
- name = Ensembl Gene ID, obtained from the GSM1375434_p01_counts.txt file (names shortened for this example)
- read_count = read count, obtained from the GSM1375434_p01_counts.txt file
- start = gene start position, obtained from BioMart (NOTE, for this example, I have shortened these numbers for presentation purposes)
- end = gene end position, obtained from BioMart (NOTE, for this example, I have shortened these numbers for presentation purposes)
- length = end - start
- length_in_kbp = length /1000
- RPK = length_in_kbp / read_count
- RPK/scaling_factor = RPK / (sum of all the RPK values / 1,000,000)
> name read_count start end length_(bp) length_in_kbp RPK RPK/scaling_factor > RNOG001 4 23066 23069 1420 1.42 0.355 27053 > RNOG007 1278 5686 5690 40761 40.761 0.031 2430 > RNOG008 3 8254 8258 36572 36.57 12.19 929030 > RNOG009 0 1047 1049 16385 16.38 0 0 > RNOG010 28 2060 2061 3809 3.80 0.136 10367 > RNOG012 0 1476 1478 6254 6.25 0 0 > RNOG017 0 2543 2544 10588 10.58 0 0 > RNOG021 48 1519 1525 1201 1.20 0.025 1906 > RNOG024 77 1689 1690 29514 29.51 0.383 29210 > Total_Read_Counts 1438 > Total_RPK 13.12 > Total_Read_Counts_Divided_By_A_Million 0.0014 > Total_RPK_Divided_By_A_Million 1.31E-05
So as you can see, my values for RPK/scaling_factor are in some cases high (like >300,000). I got similar numbers for a table that I did for RPKM instead of TPM. I'm wondering if someone could point out (possibly with an example for one line) how I mis-calculated the TPM for each gene? One thing I was thinking is maybe I took the wrong gene length, I simply used BioMart to get the start and end position for each gene, maybe I was meant to sum all the transcript lengths for a gene? However the information I'm given in GEO is only for genes, not transcripts?
Thanks
This thread has code examples if you want to double-check your calculations: Calculating TPM from featureCounts output