Is it advisable to remove genes with very low expression in TPM data?
2.1 years ago
JACKY ▴ 160

Hello,

I have a question regarding TPM-normalized RNA-seq data. The advantage of this method is that it normalizes for sequencing depth (and gene length), so samples are on the same scale and the abundance of each gene can be compared fairly across samples. The sum of TPM values in each library should be 1 million.

My question is this: after doing some gene filtering (the purpose is to minimize noise from uninformative genes), the sum of each library will no longer be 1 million, and it will differ from sample to sample. I'm not sure how crucial this is, and I'm afraid it could hurt my analysis. I'm doing a cell type enrichment analysis.

What do you guys think?

I filter using the following code, in R:

  # keep genes with TPM > 0.5 in at least two samples
  # (rowSums of a logical matrix counts how many samples pass the threshold)
  thresh <- TPM > 0.5
  keep <- rowSums(thresh) >= 2
  TPM <- TPM[keep, ]
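To see the effect on the library sums mentioned above, you can compare the column sums before and after filtering (a minimal check, run before the last line above overwrites TPM):

  colSums(TPM)          # unfiltered: each sample sums to ~1e6
  colSums(TPM[keep, ])  # filtered: sums fall below 1e6 and differ between samples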
A word of caution about TPMs: they are only valid if calculated with a transcript-abundance-based method such as Salmon or Kallisto.

Well, no, that's not how I calculated it. I started off with a counts matrix; I have no BAM or SAM files, so I can't use Salmon or Kallisto. I calculated TPM using a code block from here.
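For context, the usual counts-to-TPM conversion looks roughly like this sketch (illustrative only, not the exact code referenced above; the function name and arguments are made up, and it needs per-gene lengths, which is part of why the result is only a rough estimate):

  # illustrative counts-to-TPM conversion; counts is a genes x samples matrix,
  # gene_length_bp is a vector of gene lengths in base pairs
  counts_to_tpm <- function(counts, gene_length_bp) {
    rpk <- counts / (gene_length_bp / 1000)   # reads per kilobase
    scaling <- colSums(rpk) / 1e6             # per-sample scaling factor
    t(t(rpk) / scaling)                       # each column now sums to 1e6
  }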

You saw the part where it says "It will be only a rough estimate", right?

Yes, I did. Again, I don't really have a choice since I have no BAM files. Regarding the question, what do you say?

Jacky, I'm not sure we can answer without knowing more about the science itself.

If you're positive the reads are good quality, map uniquely to a gene you care about, etc., and it's a gene you are positive is normally found in only one cell type in the tissue you have harvested, I could see how it might be of value in a cell type enrichment analysis. Even then, I'd probably only use it as corroborating evidence, which rather begs the question of whether it's worth keeping at all.

The other thing to keep in mind is that even if those reads are real, using genes with very low expression values could inflate type I error by generating extreme ratios. What I mean is that, at very low expression values, even a few reads could lead to the inference that one cell type is far more common in sample 1 than in sample 2 due to chance alone.

In particular, if the inferences such ratios would lead you to make are very different from the conclusions you would draw from moderately expressed genes, I would pause...
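To put numbers on the ratio point above (a toy example with made-up counts):

  # the same 2-read fluctuation gives a 3x ratio at low expression
  # but is negligible at moderate expression
  low  <- c(s1 = 1, s2 = 3)
  high <- c(s1 = 1000, s2 = 1002)
  low["s2"] / low["s1"]     # 3
  high["s2"] / high["s1"]   # 1.002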

Is there a way to cheaply corroborate the findings using, for instance, IHC?
