Normalisation of RNAseq data and NOIseq
2
1
Entering edit mode
6.5 years ago
Ankit ▴ 500

Hi everyone, I have two questions regarding normalisation of RNAseq data.

1) My first question is general one. When dealing with two conditions (Lets say wildtype and knockout), do we need to normalize the counts by length of the feature/gene after correcting coverage bias (dividing by total no. of reads mapped for that particular sample)----referred as RPKM normalisation or just count per million (CPM) is sufficient ?

2) In NOIseq, R package, I tried normalizing the data both by CPM and RPKM (length correction using feature length). The CPM normalisation by NOIseq was found to be correctly normalized which I checked manually from the output. But the RPKM normalisation shows different values when I rechecked manually. I wonder if any one can help me with this or suggest something.

R script I used are as follows:

#Import counts
mycounts=read.table("mycounts.txt", header = TRUE, stringsAsFactors = FALSE)
#Import factor table
myfactors = read.table("myfactors.txt", header=TRUE)
#Import feature length
mylength=read.table("mylength_sort.txt", header = TRUE, stringsAsFactors = FALSE)
#Create NOIseq object
mydata1 <- NOISeq::readData(data=mycounts, factors=myfactors, length = mylength)
#Normalize (rpkm, lc=1)
myRPKM = rpkm(assayData(mydata1)$exprs, long = mylength , k = 0, lc = 1)
RNA-Seq R • 5.4k views
ADD COMMENT
0
Entering edit mode

Thanks Kevin, for suggesting on normalization methods. I read this paper which pointed out same as you mentioned. I thought to try TMM method after reading this paper (https://genomebiology.biomedcentral.com/track/pdf/10.1186/gb-2010-11-3-r25).

In NOIseq, I tried normalizing by TMM method both,

with length (lc = 1)

myTMM = tmm(assayData(mydata1)$exprs, long = mylength, lc = 1)

or without length (lc = 0)

myTMM = tmm(assayData(mydata1)$exprs, long = 1000, lc = 0)

But I am not sure how to validate the output in both cases (just to understand where the difference lies and which one to choose).

I will highly appreciate any suggestion on this.

Thanks a lot!

Ankit

ADD REPLY
0
Entering edit mode

If you are unsure of the exact parameters to choose, then I would leave them at the default. For TMM, the default is:

tmm(datos, long = 1000, lc = 0, k = 0, refColumn = 1, logratioTrim = 0.3, sumTrim = 0.05, doWeighting = TRUE, Acutoff = -1e+10)

For TMM, this means no length correction. TMM will instead calculate scaling factors and normalise your data that way. This function that you're using, however, is not the native function that was used in the package for which TMM was developed, i.e., EdgeR.

ADD REPLY
0
Entering edit mode

Thanks Kevin, for important suggestions.

I appreciate your help.

After applying tmm normalisation, I observe something, which is unclear to me.

I applied TMM method after filtering data with two different cpm values: 1 and 100. With low cpm value cutoff, more differentially expressed features retained (14051) and with high cpm value (100) less DE features retained. However when I performed DEG, lesser number of DEG was observed in cpm=1 than in cpm=100,

The R scripts are as follows:

#With CPM=1

#Low-count filtering

myfilt = filtered.data(mycounts, factor = myfactors$Sample, norm = FALSE, depth = NULL, method = 1, cv.cutoff = 100, **cpm = 1**, p.adj = "fdr")

***14051*** features are to be kept for differential expression analysis with filtering method 1

#Create NOIseq object with filtered and normalized counts:

mydata2 <- NOISeq::readData(data=myfilt, factors=myfactors)

#Normalisation

myTMM = tmm(assayData(mydata2)$exprs, long = 1000, lc = 0)

mydata3 <- NOISeq::readData(data=myTMM, factors=myfactors)

mydatax = mydata3

# NOISeq-DEG

mynoiseq = noiseq(mydatax, k = 0.5, norm = "n", factor = "Sample", conditions = c("WT","KO"), pnr = 0.2, nss = 5, v = 0.02, lc = 0, replicates = "technical")

deg = degenes(mynoiseq, q = 0.98, M = NULL)

[1] "**220** differentially expressed features"

. 
.
.
.
.
.
.
.
.
.
.
.


#With CPM=100

#Low-count filtering

myfilt = filtered.data(mycounts, factor = myfactors$Sample, norm = FALSE, depth = NULL, method = 1, cv.cutoff = 100, **cpm = 100**, p.adj = "fdr")

***2845*** features are to be kept for differential expression analysis with filtering method 1

#Create NOIseq object with filtered counts

mydata2 <- NOISeq::readData(data=myfilt, factors=myfactors)

#Normalisation

myTMM = tmm(assayData(mydata2)$exprs, long = 1000, lc = 0)

#Create NOIseq object with filtered and normalized counts:

mydata3 <- NOISeq::readData(data=myTMM, factors=myfactors)

mydatax = mydata3

# NOISeq-DEG

mynoiseq = noiseq(mydatax, k = 0.5, norm = "n", factor = "Sample", conditions = c("WT","KO"), pnr = 0.2, nss = 5, v = 0.02, lc = 0, replicates = "technical")

deg = degenes(mynoiseq, q = 0.98, M = NULL)

[1] "**935** differentially expressed features"

It would really be helpful if I can get any suggestions on this.

Thanks

Ankit

ADD REPLY
0
Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

This belongs under @Kevin's answer.

ADD REPLY
0
Entering edit mode

I would welcome any suggestion on this. I observe the similar pattern with other datasets also. Please suggest what could be the reason.

Thanks

ADD REPLY
0
Entering edit mode

I'm unsure what the problem is, exactly. If you apply different normalisaion methods and different statistical tests to the same dataset, you will obtain different answers. Anyone with experience will give this same answer. The different normalisation methods process data differently and have different filters for, for example, 'outliers' and low count samples. They also have different statistical tests that are being employed.

Perhaps you should read some published literature on the different normalisation strategies and then decide which one is best for you. You could start with this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.

ADD REPLY
0
Entering edit mode

Thanks Kevin,

I tried some of the methods for normalisation (like CPM, Upper quartile, TMM) (without length normalisation) after removing lowly expressed gene. Almost similar number of differentially expressed features were observed.

I am still figuring out whether this is a suitable way for my data.

I will read more about this.

Thanks

Ankit

ADD REPLY
3
Entering edit mode
6.5 years ago

If you are using NOIseq and your intention is to conduct differential expression, then you should not be using RPKM. Please use TMM (trimmed mean of M), instead. RPKM does not adequately adjust for differences in library sizes across your samples, thus increasing the false-positive rate that you would get via differential expression.

Please read through the very well cited manuscript A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. The authors conclde that TMM and DESeq2's normalisation strategies are supreme.

Kevin

ADD COMMENT
0
Entering edit mode
4.5 years ago

he, you should change the variable name in cpm=100. e.g myfilt2, myTMM2... because R kept older values of your variable in memory

ADD COMMENT

Login before adding your answer.

Traffic: 2593 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6