Entering edit mode
9.1 years ago
amyfm
▴
10
Hi,
After RNA seq, alignment, counting, and normalization of the counts (TPMs), finally I have a document representing a tissue at a particular condition with all the transcript IDs and TPMs. I have 2 questions:
- If I just want to know how many transcripts are expressed in that tissue at the particular condition (the total number of transcript expressed), which transcripts I select from the list? The ones that have TPMs more than 0 or a larger number?
- How can I select a TPM cut-off value for further analysis of the more expressed transcripts?
Thank you very much
Defining the number of transcripts expressed doesn't work that way. For each sample, a limited amount of sequencing reads is available to be distributed over your genes. For example, consider sample 1 with 10.000 genes expressed compared to sample 2 with 2000 genes expressed and assume for both samples 100.000 reads were gathered. So in sample 1, there are 10.000 genes with on average 10 read counts and in sample 2 there are 2000 genes with on average 50 reads. If you decide the threshold of should be >25 reads, you would conclude that sample 2 has lots more genes expressed than sample 1. In this example we didn't even correct for gene lenghts or "sequencebility". Also we assumed that library sizes are equal but if they are not and you normalize for it, the big assumption is that overall expression between the two samples was comparable, and that is used to make all number comparable (i.e. do the normalization). In my opinion, accurately determining the number of genes expressed with RNAseq/microArray is not possible and is often not informative. It's much more informative to make a histogram of the log-transformed (!!!) normalized gene expression estimates (like you would get from cpm() function from edgeR) and compare the histograms. If you still really want to use TPM (use logTPM!) to determine the number of genes expressed and you want to define a threshold, do it based on the logTPM histogram.
Thank you very much! So it is more difficult than I thought. I used DESeq2 for differential expression analysis, not edge R.
So please, if I have my TPM file (all my transcrit IDs and the tpms and reads for each transcript) could you tell me how I create the logTPM histogram and calculate the number of genes expressed? I am a new R user so I will really appreciate if you can give me the script.
But if I just want to know the total of transcripts detected, wouldn't it be enough to select the transcripts with TPM >1?