Hello,
I have a question regarding TPM-normalized RNA-seq data. The advantage of this method is that it normalizes for library depth, so every sample is on the same scale and the abundance of each gene can be compared across samples. Each library should sum to 1 million.
My question is: after doing some gene filtering (the purpose is to minimize noise from uninformative genes), the sum of each library will no longer be 1 million. It will differ from sample to sample. I'm not sure how crucial this is, and I'm afraid it will hurt my analysis. I'm doing a cell type enrichment analysis.
What do you guys think?
I filter using the following code, in R:
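# Flag entries with TPM above 0.5 (logical matrix, genes x samples)
thresh <- TPM > 0.5
# Keep genes that pass the threshold in at least 2 samples
keep <- rowSums(thresh) >= 2
TPM <- TPM[keep, ]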
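To make the concern concrete, here is a minimal sketch of what the filter does to the per-sample sums, and of one possible way to rescale afterwards. The matrix values and the names `TPM_filtered` and `TPM_rescaled` are made up for illustration, and whether rescaling is appropriate depends on what the downstream enrichment tool expects, so treat this as an assumption rather than a recommendation.

```r
# Toy TPM matrix (hypothetical values): 4 genes x 3 samples.
# matrix() fills column by column, so each row of numbers below is one sample.
TPM <- matrix(
  c(600000, 399999.6, 0.3, 0.1,
    500000, 499999.2, 0.6, 0.2,
    700000, 299999.0, 0.9, 0.1),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Before filtering, every sample sums to 1e6.
sums_before <- colSums(TPM)

# Same filter as in the question: TPM > 0.5 in at least 2 samples.
keep <- rowSums(TPM > 0.5) >= 2
TPM_filtered <- TPM[keep, , drop = FALSE]

# After filtering, the sums drop below 1e6 and differ between samples.
sums_after <- colSums(TPM_filtered)

# Optional: rescale each sample so the retained genes sum to 1e6 again.
TPM_rescaled <- sweep(TPM_filtered, 2, colSums(TPM_filtered), "/") * 1e6

print(sums_before)
print(sums_after)
print(colSums(TPM_rescaled))
```

Comparing `sums_before` and `sums_after` on the real data shows how much of each library the filter actually removes. If only very lowly expressed genes are dropped, the change per sample may be small.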
A word of caution about TPMs. They are only valid if calculated from a transcript abundance-based method such as Salmon or Kallisto.
Well, no, that's not how I calculated it. I started off with a counts matrix; I have no BAM or SAM files to use with Salmon or Kallisto. I calculated TPM using a code block from here.
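For readers following along, a counts-to-TPM conversion of this kind usually looks something like the sketch below. This is not the linked code block. The function name `counts_to_tpm`, the toy counts, and the gene lengths are hypothetical. It assumes one fixed length per gene rather than sample-specific effective transcript lengths, which is one reason the result is only approximate.

```r
# Sketch: convert a gene x sample count matrix to TPM using one length per gene.
# Assumption: gene_length_bp holds a single length (in bp) for each row of counts.
counts_to_tpm <- function(counts, gene_length_bp) {
  stopifnot(nrow(counts) == length(gene_length_bp))
  # Reads per kilobase: the length vector is recycled down each column.
  rate <- counts / (gene_length_bp / 1000)
  # Scale each sample (column) so its values sum to one million.
  sweep(rate, 2, colSums(rate), "/") * 1e6
}

# Toy input (hypothetical): 3 genes x 2 samples.
counts <- matrix(
  c(100, 200, 50,
    300, 100, 25),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2"))
)
gene_length_bp <- c(1000, 2000, 500)

tpm <- counts_to_tpm(counts, gene_length_bp)
print(colSums(tpm))
```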
You saw the part where it says "It will be only a rough estimate", right?
Yes, I did. Again, I don't really have a choice since I have no BAM files. Regarding the question, what do you say?
Jacky, I'm not sure we can answer without knowing more about the science itself.
If you're positive the reads are good quality, map uniquely to a gene you care about, etc., and it is a gene you are positive is normally found in only one cell type in the tissue you have harvested, I could see how it might be of value in a cell type enrichment analysis. Even then, I'd probably only use it as corroborating evidence, though, which begs the question.
The other thing to keep in mind is that even if those reads are real, using genes with very low expression values could inflate type I error by generating extreme ratios. What I mean is, for very low expression values, even a few reads could lead to the inference that one cell type is far more common in sample 1 than in sample 2 due to chance alone.
In particular, if the inferences such ratios would lead you to make are very different from the conclusions you might draw from moderately expressed genes, I might pause...
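A quick toy calculation illustrates the point about extreme ratios. The read counts and the helper `log2_ratio` are invented for illustration. The same absolute difference of three reads is used in both cases.

```r
# Sketch: the same 3-read difference at low vs. moderate expression.
# log2_ratio is a hypothetical helper: log2 of sample 2 over sample 1.
log2_ratio <- function(sample1, sample2) log2(sample2 / sample1)

# Lowly expressed gene: 1 read vs. 4 reads, a 4-fold ratio.
lfc_low <- log2_ratio(1, 4)

# Moderately expressed gene: 100 reads vs. 103 reads, a ratio close to 1.
lfc_mod <- log2_ratio(100, 103)

# Adding a pseudocount of 1 before taking the ratio dampens the low-count case.
lfc_low_pseudo <- log2_ratio(1 + 1, 4 + 1)

print(c(low = lfc_low, moderate = lfc_mod, low_pseudocount = lfc_low_pseudo))
```

Three extra reads are well within sampling noise, yet for the low-count gene they look like a large change. This is the usual motivation for filtering such genes or adding a pseudocount before computing ratios.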
Is there a way to cheaply corroborate the findings using, for instance, IHC?