Question

RNA-seq data for deep learning classification

6 months ago

yahn • 0

The objective of my task is binary classification of HPV status in head and neck cancer patients, using multi-modal data that includes genomics, transcriptomics and histopathology.

For the transcriptomics data, I downloaded mRNA-seq files with raw counts, as well as counts normalised with the TPM, FPKM and FPKM-UQ methods. However, it seems that these normalisation methods are not the preferred choice, and raw counts are often used directly as input for DESeq2 or edgeR normalisation.

I did some further reading on the DESeq2 and edgeR normalisation methods, but they use the label information, which I would not want, as this would require an independent test set etc.

Before feature selection, I would like to normalise the raw counts, but after days of reading I am still not sure which format of RNA-seq data to use. Could anyone advise on how I can proceed?

Thank you very much.

rna-seq • 479 views

score 1 · Accepted Answer · 2024-05-02

Honestly speaking, if it's deep learning, it probably doesn't matter that much if you use something like TPMs (probably not raw counts, unless one of your features is sequencing depth). I'm sure a deep learning model will be able to learn the things that cause between-sample differences and account for them naturally as it's making predictions.

Machine learning is practically constructing a complicated mathematical function over your features. Normalization is itself a mathematical function, so a sufficiently flexible model can in principle absorb it.

A patient walks into your clinic and you want to tell that n=1 patient their HPV status. You can get TPMs out from their sample fairly easily, and that's what you want to plug into your model.
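To make the n=1 point concrete, here is a minimal sketch of computing TPMs for a single sample from its raw counts and gene lengths. Nothing in it depends on labels or on other samples. The function names, the toy counts and the toy gene lengths are my own assumptions for illustration, and the log2(TPM + 1) step is just a common choice for model inputs, not something your pipeline has to use.

```python
import numpy as np

def counts_to_tpm(counts, gene_lengths_bp):
    """Convert raw counts to TPM using only the sample's own values.

    counts: array of shape (n_genes,) or (n_samples, n_genes).
    gene_lengths_bp: array of shape (n_genes,), lengths in base pairs.
    Assumes each sample has at least one non-zero count.
    """
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float) / 1000.0
    # Reads per kilobase: corrects for gene length.
    rate = counts / lengths_kb
    # Rescale so each sample sums to one million: corrects for depth.
    return rate / rate.sum(axis=-1, keepdims=True) * 1e6

def log_tpm(counts, gene_lengths_bp):
    """log2(TPM + 1), a common variance-compressing transform."""
    return np.log2(counts_to_tpm(counts, gene_lengths_bp) + 1.0)

# Hypothetical new patient: three genes, toy counts and lengths.
new_patient_counts = [100, 0, 300]
gene_lengths_bp = [1000, 2000, 3000]

tpm = counts_to_tpm(new_patient_counts, gene_lengths_bp)
features = log_tpm(new_patient_counts, gene_lengths_bp)

# Doubling the sequencing depth leaves the TPMs unchanged.
tpm_deeper = counts_to_tpm(np.array(new_patient_counts) * 2, gene_lengths_bp)
```

Because each sample is normalised on its own, the same function can be applied to the training set and to a patient seen later, without refitting anything.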