Question

If TPM is not comparable across cohorts, can it be used as the input for ML models if we have RNASeq data from multiple cohorts?

11 months ago

ivicts ▴ 10

From https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/:

While this is true, TPM is probably the most stable unit across experiments, though you still shouldn’t compare it across experiments.

Technically, then, we cannot use TPM as input to an ML model, because the values in one cohort are not comparable to those in another cohort.

So, let's say I have 10 different RNA-seq cohorts from different labs. What should I do if I want to train an ML model on 6 of the cohorts to predict something on the other 4? What's the pipeline? Should I use raw counts or TPM? What within-sample or between-sample normalization should I apply before feeding the data to the model? Should I do batch correction?

TPM RNA-seq machine-learning • 1.3k views

updated 11 months ago by LChart 5.0k • written 11 months ago by ivicts ▴ 10

What's the pipeline?

That's for you to figure out.

Should I use raw counts or TPM? What within-sample or between-sample normalization should I apply before feeding the data to the model? Should I do batch correction?

Raw counts, and yes, you may have to figure out both: some sort of normalized counts plus batch correction, based on a good understanding of the experimental design. IMO this will take some trial and error to figure out and may not be robust. Also, 6 training cohorts and 4 test cohorts is not a great amount to go on. I'd look for at least ~15 training and ~5 test cohorts. The more data you have to train on, the more robust your model will be.

11 months ago by Ram 45k

This question was also asked on bioinfo SE: https://bioinformatics.stackexchange.com/questions/22820/if-tpm-is-not-comparable-across-cohorts-can-it-be-used-as-the-input-for-ml-mode

Please keep in mind that posting the same question to multiple sites can be perceived as bad etiquette, because efforts may be made to address a problem that has already been solved elsewhere in the meantime.

The helpful thing to do if you do decide to post on multiple forums is to add a link to the other forum posts on each post so people will look at the other posts before investing their effort.

11 months ago by Ram 45k

score 1 · Answer 1 · 2024-08-13

There is no standard solution to this problem, and it remains an active area of research. The most straightforward approach is to quantile normalize every sample's TPM vector (1 x n_gene) to a reference distribution; by construction, every sample then has the same distribution of values, so no part of a sample's CDF is informative about batch. This does, however, induce distortions, particularly if the "thing" you want to predict makes up a different fraction of each batch (30% in batch 2 but 5% in batch 4). This is basically a big-hammer version of ComBat, and it often fails to address batch effects, because many of these effects impact gene covariances and will still show up on PC plots even after quantile normalization.
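
As a rough illustration of the quantile-normalization step, here is a minimal pure-Python sketch. It assumes every sample is a list of TPM values over the same genes in the same order, and that the reference distribution is the rank-wise mean over the training samples only. The function names are made up for this example and do not come from any library.

```python
def reference_distribution(samples):
    """Rank-wise mean of the sorted values across samples (assumed: training cohorts only)."""
    n_genes = len(samples[0])
    sorted_rows = [sorted(s) for s in samples]
    return [sum(row[i] for row in sorted_rows) / len(sorted_rows)
            for i in range(n_genes)]


def quantile_normalize(sample, reference):
    """Replace each value by the reference value at the same rank.

    Tied values receive the mean of the reference values over the ranks they span.
    """
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    out = [0.0] * len(sample)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at rank i.
        j = i
        while j + 1 < len(order) and sample[order[j + 1]] == sample[order[i]]:
            j += 1
        tied_value = sum(reference[i:j + 1]) / (j - i + 1)
        for k in range(i, j + 1):
            out[order[k]] = tied_value
        i = j + 1
    return out


# Toy data: two training samples, three genes.
train = [[2.0, 4.0, 6.0], [1.0, 3.0, 8.0]]
reference = reference_distribution(train)

# A sample from a new cohort on a very different scale.
new_sample = [100.0, 10.0, 50.0]
normalized = quantile_normalize(new_sample, reference)
```

After this step only the within-sample ordering of genes survives, which is exactly why it cannot fix batch effects that act on gene-gene covariances.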

Your next approach is a version of "Harmony", which is to normalize the PCs to a reference distribution. This works for any set of extracted features: on some reference distribution you learn f(X) -> Y (expression to features, e.g., the linear combinations that give the top PCs), and on a new batch you calculate Y' = f(X') and transform Y* = T(Y') so that mean(Y*) = mean(Y) and cov(Y*) = cov(Y). Harmony (built for single-cell data but applicable to large RNA-seq datasets) does this by unsupervised clustering and mean-matching in Y space.
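
One possible choice of T, written out as a sketch: a linear whitening-and-recoloring map. The symbols \mu, \Sigma (reference mean and covariance of Y) and \mu', \Sigma' (the same for the new batch) are notation introduced here, and \Sigma' is assumed to be invertible. This is not what Harmony itself computes; it is only the simplest map that satisfies the two moment conditions, and it is not unique.

```latex
% y' is one feature vector (column) from the new batch
y^{*} = \Sigma^{1/2}\,\Sigma'^{-1/2}\,(y' - \mu') + \mu

% Check of the two conditions:
\mathbb{E}[y^{*}] = \Sigma^{1/2}\,\Sigma'^{-1/2}\,(\mu' - \mu') + \mu = \mu

\operatorname{cov}(y^{*})
  = \Sigma^{1/2}\,\Sigma'^{-1/2}\,\Sigma'\,\Sigma'^{-1/2}\,\Sigma^{1/2}
  = \Sigma
```

Matching global moments like this implicitly assumes the two batches have a similar composition, which is the same caveat as for quantile normalization.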

The final method is some kind of "fine-tuning" approach that uses a multi-term objective. (Very) loosely speaking, scANVI combines a penalty for a set of supervised points (reference distribution) not explaining known labels with a penalty for unsupervised points (new batch) falling far from the reference distribution (ELBO). This is a trick you can play in many ways: you can learn feature extractions that preserve categorical cross-entropy on your reference using a known classifier, while minimizing the KL divergence (or some other distribution-matching scheme) between the new batch and the reference. The devil, as always, is in the details, and I'm unaware of any method that "just works."
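
Schematically, such a two-term objective could be written as below. This is only a sketch of the general shape, not the exact objective of any particular tool. The feature extractor f_\theta, the classifier p_\theta, the feature distributions q_\theta^{new} and q_\theta^{ref}, and the trade-off weight \lambda are all notation assumed for illustration.

```latex
\mathcal{L}(\theta)
  = \underbrace{\mathbb{E}_{(x,\,y)\,\sim\,\text{ref}}
      \bigl[-\log p_\theta\bigl(y \mid f_\theta(x)\bigr)\bigr]}_{\text{cross-entropy on labeled reference}}
  \;+\;
  \lambda\,
  \underbrace{D_{\mathrm{KL}}\bigl(q_\theta^{\text{new}} \,\big\|\, q_\theta^{\text{ref}}\bigr)}_{\text{new batch vs. reference in feature space}}
```

Here q_\theta^{new} and q_\theta^{ref} denote the distributions of f_\theta(x) over the new batch and the reference; swapping the KL term for another distribution-matching penalty gives the other variants mentioned above.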