Good morning everyone. I am new to RNA-seq and have some questions; I hope some of you can guide me.
What is the difference between the units log2(tpm + 0.001) and log2(norm_count + 1)? From what I have read, both indicate that the data are normalized, but the difference is not clear to me. Can you help me, please?
IsoPct is a unit that represents the percentage of an isoform in RNA-seq data. Does anyone know how it is calculated and how it is interpreted?
Does the sum of the isoform percentages (IsoPct) of a gene represent 100% of that gene's expression?
NOTE: If you can share an article with me, I would greatly appreciate it.
TPM is a known metric. You should be able to find some YouTube videos from StatQuest explaining it, and Renesh Bedre's blog post is also quite helpful for beginners. There is no metric called norm_count - it is probably something a team uses internally. You'll need to read their paper/consult their website to understand that metric.
Read RSEM (or other quantification algorithm) papers to understand IsoPct. That should also answer your question #3, but intuitively, it makes sense that SUM(IsoPct) across transcripts for a particular gene would be 100%.
Thank you very much for your help.
The log of 0 is minus infinity (-Inf in R). A way to avoid dealing with it is to add a small number (a pseudocount) to your values before taking the log, so the result is finite (anything but -Inf). Whether you add 0.001, 0.5, or 1 doesn't really matter and is mainly driven by preference. I personally prefer to use +1, as I don't like to see negative values for the expression of genes.
I cannot help here.. :(
The total expression of a gene should be 100%. From the RSEM documentation:
IsoPct stands for isoform percentage. It is the percentage of this transcript's abundance over its parent gene's abundance. If its parent gene has only one isoform or the gene information is not provided, this field will be set to 100.
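To make the quoted definition concrete, here is a minimal sketch of that calculation in Python. It assumes abundance is measured in TPM; the transcript names, gene names, and values are hypothetical, the function `iso_pct` is my own illustration and not RSEM code, and returning 0 for a gene with no expression is an assumption on my part.

```python
from collections import defaultdict


def iso_pct(transcript_tpm, transcript_to_gene):
    """Return each transcript's share (in percent) of its parent gene's total TPM."""
    # Sum transcript abundances per parent gene.
    gene_total = defaultdict(float)
    for tx, tpm in transcript_tpm.items():
        gene_total[transcript_to_gene[tx]] += tpm

    result = {}
    for tx, tpm in transcript_tpm.items():
        total = gene_total[transcript_to_gene[tx]]
        # Assumption: report 0 when the gene has no expression at all.
        result[tx] = 100.0 * tpm / total if total > 0 else 0.0
    return result


# Hypothetical example: geneA has two isoforms, geneB has one.
tpm = {"tx1": 30.0, "tx2": 10.0, "tx3": 5.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

pct = iso_pct(tpm, tx2gene)
print(pct)
```

In this toy case tx1 and tx2 get 75 and 25, which sum to 100 for geneA, and the single-isoform geneB gets 100 for tx3, matching the behaviour the documentation describes.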
Your comment on #1 is irrelevant. Given that you've only added a couple of sentences from RSEM docs addressing one of OP's questions and cheekily (emoji are not really professional language) point to not addressing another, your post belongs better as a comment than as an answer and I am moving it to one.
I could be wrong, but the question was not on TPM vs norm_count, but on log2(tpm + 0.001) and log2(norm_count + 1). So it is relevant to me. Maybe not a definitive answer, but still relevant. I also don't see the problem in referencing a source that gives the exact answer to the question. I replaced the smiley face with a sad one... do you prefer it?
That is one hell of a technicality there. Should we also differentiate between ASCII values of t vs n - technically, that counts too. Emoji are unprofessional, period; doesn't matter which one you use. :D is definitely one of the less acceptable ones as it reads as "Haha, I cannot answer that one" - nothing funny about that.
Using any value but 1 as a pseudocount doesn't make intuitive sense to me. Like you say, most people wish to retain 0s (and avoid negative values) on log transformation and adding 1 makes a lot more sense. Unless the TPM values are in a pretty narrow range and adding 1 would make a LOT of difference, adding 0.001 is asking for negative values on log transformation.
As I said, it is personal. Adding +0.001 makes sense because you stay closer to the actual value than with +1. Just to give you an example, QIAGEN uses +0.02 (if I recall correctly) in some of their bioinformatics tools to record gene expression values.
The 0.02 looks more like a signature than a reasoned value. The choice of pseudocount depends on whether or not a negative value is acceptable after log transformation, so you're right, it is personal preference.
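For anyone following the pseudocount discussion, a small sketch in Python of how the two choices behave. The TPM values are made up for illustration and `log2_pseudo` is a hypothetical helper, not part of any library. Note that Python's `math.log2(0)` raises an error instead of returning -Inf as R does, which is one more reason a pseudocount is added first.

```python
import math


def log2_pseudo(x, pseudocount):
    """log2-transform a non-negative value after adding a pseudocount."""
    return math.log2(x + pseudocount)


# Hypothetical TPM values, including an unexpressed gene (0).
tpm_values = [0.0, 0.5, 1.0, 100.0]

for pc in (0.001, 1.0):
    transformed = [round(log2_pseudo(v, pc), 3) for v in tpm_values]
    print(f"pseudocount={pc}: {transformed}")
```

With +1, a TPM of 0 maps to exactly 0 and nothing goes negative. With +0.001, a TPM of 0 maps to roughly -9.97 and anything below about 1 TPM is negative, while highly expressed genes barely differ between the two.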
I'm using data from the Xena Browser, from the study "The Cancer Genome Atlas (TCGA)". The values come as log2(tpm + 0.001), so there are negative values. What can I do to convert them to positive values?
If you really want to work with positive values, and you are sure the data is only log transformed, you can convert the values back to the original TPM values: value = log2(TPM + 0.001), which means 2^value = (TPM + 0.001), so TPM = 2^value - 0.001. Once you have TPM, you can re-log them as you please.
If I add +1 to my values, is that enough for the expression of my gene to be positive, or is it necessary to do another mathematical calculation?
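Don't add or subtract anything without understanding (in full detail) how it impacts the data. Your values are already log2(TPM + 0.001), so adding 1 to them is not the same as using a +1 pseudocount: since log2(x) + 1 = log2(2x), you would be doubling (TPM + 0.001) for every gene, and any value below -1 would still be negative. Given that TPM is a fraction metric, shifting the logged values like this breaks it. Convert back to TPM first and then re-log with the pseudocount you want.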
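A minimal sketch of the back-conversion in Python, assuming the values really are plain log2(TPM + 0.001) with no other transformation applied. The function names and the example gene are hypothetical; the clamp to 0 is my own addition to absorb floating-point rounding.

```python
import math

PSEUDOCOUNT = 0.001  # from the thread: values are stored as log2(tpm + 0.001)


def to_tpm(value, pseudocount=PSEUDOCOUNT):
    """Invert value = log2(TPM + pseudocount) to recover TPM."""
    tpm = 2.0 ** value - pseudocount
    # Assumption: clamp tiny negative results caused by rounding to 0.
    return max(tpm, 0.0)


def relog(value, old_pc=PSEUDOCOUNT, new_pc=1.0):
    """Recover TPM, then log2-transform again with a different pseudocount."""
    return math.log2(to_tpm(value, old_pc) + new_pc)


# Hypothetical gene with TPM = 0.5, stored the way the thread describes.
value = math.log2(0.5 + PSEUDOCOUNT)

print(value)              # negative, because TPM + 0.001 < 1
print(to_tpm(value))      # recovers roughly 0.5
print(relog(value))       # log2(0.5 + 1), which is non-negative
print(2.0 ** (value + 1)) # adding 1 in log space doubles (TPM + 0.001)
```

The last line is the point of caution: adding 1 to an already-logged value rescales the underlying quantity by a factor of 2, whereas `relog` changes only the pseudocount.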
Thank you very much for your help