Question

How can I normalize RPKM data from TCGA (pan-cancer analysis)?

0

4.9 years ago

lenC_biotecLover ▴ 90

I have a matrix of miRNA RPKM values downloaded from TCGA, covering several TCGA projects (BRCA, LAML, LUAD, etc.). Columns are TCGA barcodes and rows are miRNA identifiers.

To perform a machine learning analysis, how can I normalize these data across the patients in my matrix? I have searched the web but couldn't find an answer.

I'm really a novice in bioinformatics and computational biology, so any advice is greatly appreciated. Thank you very much.

RNA-Seq RPKM pan-cancer TCGA normalization • 2.0k views

score 0 · Answer 1 · 2020-07-20

0

4.9 years ago

swbarnes2 15k

RPKM is already normalized (for sequencing depth and feature length).

0

I know, but I meant normalization between patients, given that my data come from different projects.

ADD REPLY • link 4.9 years ago by lenC_biotecLover ▴ 90

2

You can convert the RPKM values to log scale and then apply a variance-stabilizing transformation (vst).

ADD REPLY • link 4.5 years ago by DareDevil ★ 4.4k

0

Thank you. Once I have the vst-normalized data (using the DESeq2 package, right?), is that equivalent to count data transformed with the same vst function? For instance, if I have an RPKM dataset converted to log scale and then passed through vst, and a counts dataset normalized with vst, are the two comparable in terms of normalization? Thank you very much.

ADD REPLY • link 4.5 years ago by lenC_biotecLover ▴ 90

0

@dare_devil, OK, I tried, but the log-scaled RPKM values are negative in some cases, and the vst function doesn't work on negative values. How can I handle this?

ADD REPLY • link 4.5 years ago by lenC_biotecLover ▴ 90

2

You need a matrix in which every value is greater than or equal to 1 before taking the log. To achieve this, add 1 to the entire data frame and then convert to log scale; since log(1) = 0, this avoids negative values.
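
As a rough illustration of the pseudocount step described above, here is a base R sketch on a small made-up RPKM matrix (the values and the sample and miRNA names are hypothetical, not taken from TCGA). It only shows that values below 1 go negative under a plain log2, while log2(x + 1) stays at or above zero. Note that DESeq2's vst is documented as expecting raw counts, so whether it is appropriate on log-scaled RPKM is a separate question that this sketch does not settle.

```r
# Toy RPKM matrix: rows = miRNAs, columns = samples (all values hypothetical)
rpkm <- matrix(
  c(0, 0.25, 3, 15,
    0.5, 1, 7, 1023),
  nrow = 4,
  dimnames = list(paste0("miR_", 1:4), c("sample_A", "sample_B"))
)

# Plain log2: values below 1 become negative, and 0 becomes -Inf
log_plain <- log2(rpkm)

# log2 with a pseudocount of 1: log2(0 + 1) = 0, so nothing is negative
log_plus1 <- log2(rpkm + 1)

print(log_plain)
print(log_plus1)
```

The pseudocount of 1 is a common convention rather than a requirement; a smaller or larger value changes how strongly low-expression features are compressed.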

ADD REPLY • link 4.4 years ago by DareDevil ★ 4.4k

0

Thank you.

Now the problem is that I downloaded some data from GEO (tumor vs. normal breast samples), accession GSE68085. I suppose the data are already log2-normalized, and they contain some negative values. I want to use them as a validation dataset (I'm using an SVM classifier): I downloaded the series matrix and used the batch ID information for batch correction with the ComBat function. Should I invert the log transform (i.e. exponentiate) and then apply vst?

Thank you very much again.

ADD REPLY • link 4.4 years ago by lenC_biotecLover ▴ 90

3

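In this case, I would suggest nneg from the NMF package:

# Load the NMF package, which provides nneg()
library(NMF)
# Read the RPKM values (rows: features, columns: samples)
expr <- read.table("rpkm.txt", header = TRUE, sep = "\t", row.names = 1)
# Convert the data frame to a matrix
d <- as.matrix(expr)
# Replace negative values with 0
data_non_neg <- nneg(d, method = "pmax")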

This will convert all negative values to 0.

You can go through this link for other methods
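
To make the effect of clamping concrete, here is a base R sketch of the same idea on a toy matrix. It uses base R's pmax() rather than the NMF package, so it illustrates the behaviour described above (negatives become 0) and is not the nneg implementation itself; the matrix values and names are hypothetical.

```r
# Toy log-scale matrix with some negative entries (values are hypothetical)
d <- matrix(
  c(-1.5, 0, 2.3,
    0.7, -0.2, 4.1),
  nrow = 3,
  dimnames = list(paste0("miR_", 1:3), c("sample_A", "sample_B"))
)

# Clamp negatives to zero; pmax() keeps the dimensions and dimnames of d
d_clamped <- pmax(d, 0)

# Count how many entries were changed, to judge how much information is lost
n_changed <- sum(d < 0)

print(d_clamped)
print(n_changed)
```

Clamping throws away the magnitude of every negative entry, so it is worth checking the count of changed values before relying on the result.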

ADD REPLY • link 4.4 years ago by DareDevil ★ 4.4k

0

You can convert the log2-scaled data back to the corresponding RPKM values by applying the inverse function (2^x). However, I looked at your dataset, GSE68085, and I don't think the values are log-transformed.
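
A minimal base R sketch of the inverse transform mentioned above, on hypothetical values. It assumes the data really were produced as log2(RPKM + 1) or as plain log2(RPKM); which of these (if either) applies to a given dataset has to be confirmed from its documentation, and the thread does not establish it for GSE68085.

```r
# Hypothetical log2-scale values; assume they were produced as log2(RPKM + 1)
log_vals <- c(0, 1, 3.5, 10)

# Invert the transform: 2^x undoes log2, then remove the pseudocount
rpkm_back <- 2^log_vals - 1

# If no pseudocount was used (plain log2), the inverse is just 2^x
rpkm_back_plain <- 2^log_vals

print(rpkm_back)
print(rpkm_back_plain)
```

Applying the wrong inverse (for example 2^x to values that were never log-transformed) silently produces meaningless numbers, so the assumption should be checked first.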

ADD REPLY • link 4.4 years ago by DareDevil ★ 4.4k

0

Thank you! OK, but these data are described as "normalized", and I can't work out what type of normalization was applied. Does it just refer to RPKM? And if so, why are there negative values? I read the series matrix and could not find any other useful information. Thanks again.

ADD REPLY • link 4.4 years ago by lenC_biotecLover ▴ 90

2

You can download the data and redo the analysis. The raw data are available for download here.

ADD REPLY • link 4.4 years ago by DareDevil ★ 4.4k