Question

Normalisation of RNAseq data and NOIseq

1

Entering edit mode

6.5 years ago

Ankit ▴ 500

Hi everyone, I have two questions regarding normalisation of RNAseq data.

1) My first question is general one. When dealing with two conditions (Lets say wildtype and knockout), do we need to normalize the counts by length of the feature/gene after correcting coverage bias (dividing by total no. of reads mapped for that particular sample)----referred as RPKM normalisation or just count per million (CPM) is sufficient ?

2) In NOIseq, R package, I tried normalizing the data both by CPM and RPKM (length correction using feature length). The CPM normalisation by NOIseq was found to be correctly normalized which I checked manually from the output. But the RPKM normalisation shows different values when I rechecked manually. I wonder if any one can help me with this or suggest something.

R script I used are as follows:

#Import counts
mycounts=read.table("mycounts.txt", header = TRUE, stringsAsFactors = FALSE)
#Import factor table
myfactors = read.table("myfactors.txt", header=TRUE)
#Import feature length
mylength=read.table("mylength_sort.txt", header = TRUE, stringsAsFactors = FALSE)
#Create NOIseq object
mydata1 <- NOISeq::readData(data=mycounts, factors=myfactors, length = mylength)
#Normalize (rpkm, lc=1)
myRPKM = rpkm(assayData(mydata1)$exprs, long = mylength , k = 0, lc = 1)

RNA-Seq R • 5.4k views

ADD COMMENT • link updated 4.5 years ago by sokhnazainab • 0 • written 6.5 years ago by Ankit ▴ 500

0

Entering edit mode

Thanks Kevin, for suggesting on normalization methods. I read this paper which pointed out same as you mentioned. I thought to try TMM method after reading this paper (https://genomebiology.biomedcentral.com/track/pdf/10.1186/gb-2010-11-3-r25).

In NOIseq, I tried normalizing by TMM method both,

with length (lc = 1)

myTMM = tmm(assayData(mydata1)$exprs, long = mylength, lc = 1)

or without length (lc = 0)

myTMM = tmm(assayData(mydata1)$exprs, long = 1000, lc = 0)

But I am not sure how to validate the output in both cases (just to understand where the difference lies and which one to choose).

I will highly appreciate any suggestion on this.

Thanks a lot!

Ankit

ADD REPLY • link 6.5 years ago by Ankit ▴ 500

0

Entering edit mode

If you are unsure of the exact parameters to choose, then I would leave them at the default. For TMM, the default is:

tmm(datos, long = 1000, lc = 0, k = 0, refColumn = 1, logratioTrim = 0.3, sumTrim = 0.05, doWeighting = TRUE, Acutoff = -1e+10)

For TMM, this means no length correction. TMM will instead calculate scaling factors and normalise your data that way. This function that you're using, however, is not the native function that was used in the package for which TMM was developed, i.e., EdgeR.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

Entering edit mode

I had a previous post here: How do I explain the difference between edgeR, LIMMA, DESeq etc. to experimental Biologist/non-bioinformatician

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

Entering edit mode

Thanks Kevin, for important suggestions.

I appreciate your help.

After applying tmm normalisation, I observe something, which is unclear to me.

I applied TMM method after filtering data with two different cpm values: 1 and 100. With low cpm value cutoff, more differentially expressed features retained (14051) and with high cpm value (100) less DE features retained. However when I performed DEG, lesser number of DEG was observed in cpm=1 than in cpm=100,

The R scripts are as follows:

#With CPM=1

#Low-count filtering

myfilt = filtered.data(mycounts, factor = myfactors$Sample, norm = FALSE, depth = NULL, method = 1, cv.cutoff = 100, **cpm = 1**, p.adj = "fdr")

***14051*** features are to be kept for differential expression analysis with filtering method 1

#Create NOIseq object with filtered and normalized counts:

mydata2 <- NOISeq::readData(data=myfilt, factors=myfactors)

#Normalisation

myTMM = tmm(assayData(mydata2)$exprs, long = 1000, lc = 0)

mydata3 <- NOISeq::readData(data=myTMM, factors=myfactors)

mydatax = mydata3

# NOISeq-DEG

mynoiseq = noiseq(mydatax, k = 0.5, norm = "n", factor = "Sample", conditions = c("WT","KO"), pnr = 0.2, nss = 5, v = 0.02, lc = 0, replicates = "technical")

deg = degenes(mynoiseq, q = 0.98, M = NULL)

[1] "**220** differentially expressed features"

. 
.
.
.
.
.
.
.
.
.
.
.


#With CPM=100

#Low-count filtering

myfilt = filtered.data(mycounts, factor = myfactors$Sample, norm = FALSE, depth = NULL, method = 1, cv.cutoff = 100, **cpm = 100**, p.adj = "fdr")

***2845*** features are to be kept for differential expression analysis with filtering method 1

#Create NOIseq object with filtered counts

mydata2 <- NOISeq::readData(data=myfilt, factors=myfactors)

#Normalisation

myTMM = tmm(assayData(mydata2)$exprs, long = 1000, lc = 0)

#Create NOIseq object with filtered and normalized counts:

mydata3 <- NOISeq::readData(data=myTMM, factors=myfactors)

mydatax = mydata3

# NOISeq-DEG

mynoiseq = noiseq(mydatax, k = 0.5, norm = "n", factor = "Sample", conditions = c("WT","KO"), pnr = 0.2, nss = 5, v = 0.02, lc = 0, replicates = "technical")

deg = degenes(mynoiseq, q = 0.98, M = NULL)

[1] "**935** differentially expressed features"

It would really be helpful if I can get any suggestions on this.

Thanks

Ankit

ADD REPLY • link updated 6.5 years ago by GenoMax 147k • written 6.5 years ago by Ankit ▴ 500

0

Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

This belongs under @Kevin's answer.

ADD REPLY • link 6.5 years ago by GenoMax 147k

0

Entering edit mode

I would welcome any suggestion on this. I observe the similar pattern with other datasets also. Please suggest what could be the reason.

Thanks

ADD REPLY • link 6.5 years ago by Ankit ▴ 500

0

Entering edit mode

I'm unsure what the problem is, exactly. If you apply different normalisaion methods and different statistical tests to the same dataset, you will obtain different answers. Anyone with experience will give this same answer. The different normalisation methods process data differently and have different filters for, for example, 'outliers' and low count samples. They also have different statistical tests that are being employed.

Perhaps you should read some published literature on the different normalisation strategies and then decide which one is best for you. You could start with this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

Entering edit mode

Thanks Kevin,

I tried some of the methods for normalisation (like CPM, Upper quartile, TMM) (without length normalisation) after removing lowly expressed gene. Almost similar number of differentially expressed features were observed.

I am still figuring out whether this is a suitable way for my data.

I will read more about this.

Thanks

Ankit

ADD REPLY • link 6.5 years ago by Ankit ▴ 500

score 3 · Answer 1 · 2018-05-24

If you are using NOIseq and your intention is to conduct differential expression, then you should not be using RPKM. Please use TMM (trimmed mean of M), instead. RPKM does not adequately adjust for differences in library sizes across your samples, thus increasing the false-positive rate that you would get via differential expression.

Please read through the very well cited manuscript A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. The authors conclde that TMM and DESeq2's normalisation strategies are supreme.

Kevin

score 0 · Answer 2 · 2020-05-11

0

Entering edit mode

4.5 years ago

sokhnazainab • 0

he, you should change the variable name in cpm=100. e.g myfilt2, myTMM2... because R kept older values of your variable in memory

ADD COMMENT • link 4.5 years ago by sokhnazainab • 0