Hi everyone, I have two questions regarding normalisation of RNAseq data.
1) My first question is a general one. When comparing two conditions (let's say wildtype and knockout), do we need to normalise the counts by feature/gene length after correcting for sequencing depth (dividing by the total number of reads mapped in that sample), which is referred to as RPKM normalisation, or is counts per million (CPM) sufficient?
2) In the NOISeq R package, I tried normalising the data both by CPM and by RPKM (length correction using feature length). I checked the CPM output manually and it matches my own calculation. The RPKM output, however, differs from the values I calculate manually. I wonder if anyone can help me with this or suggest something.
The R script I used is as follows:
#Import counts
mycounts=read.table("mycounts.txt", header = TRUE, stringsAsFactors = FALSE)
#Import factor table
myfactors = read.table("myfactors.txt", header=TRUE)
#Import feature length
mylength=read.table("mylength_sort.txt", header = TRUE, stringsAsFactors = FALSE)
#Create NOIseq object
mydata1 <- NOISeq::readData(data=mycounts, factors=myfactors, length = mylength)
#Normalize (rpkm, lc=1)
myRPKM = rpkm(assayData(mydata1)$exprs, long = mylength , k = 0, lc = 1)
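One way to cross-check the RPKM values by hand is sketched below in base R. This is only a sketch under assumptions: the counts and lengths are made-up toy values (not the data from this post), the names `gene_length`, `cpm_manual` and `rpkm_manual` are hypothetical, and RPKM is taken as CPM divided by gene length in kilobases. It also shows one thing worth checking, namely that the lengths are a plain numeric vector in the same gene order as the count rows. Whether that is the cause of the mismatch here cannot be told from the post.

```r
# Toy data: 3 genes x 2 samples (made-up values, not the poster's data).
counts <- matrix(c(100, 200, 700,
                   300, 300, 400),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("wt", "ko")))

# Gene lengths in bp, deliberately stored in a different order
# than the rows of the count matrix.
gene_length <- c(geneC = 3500, geneA = 1000, geneB = 2000)

# Reorder the lengths by gene name so each length lines up with its row.
gene_length <- gene_length[rownames(counts)]
stopifnot(identical(names(gene_length), rownames(counts)))

# Library size = total counts per sample.
lib_size <- colSums(counts)

# CPM: counts scaled to one million reads per sample.
cpm_manual <- t(t(counts) / lib_size) * 1e6

# RPKM: CPM divided by gene length in kilobases (row-wise division).
rpkm_manual <- cpm_manual / (gene_length / 1000)

print(rpkm_manual)
```

The resulting matrix could then be compared with the package output, for example with `all.equal()`, after making sure both have the same row order.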
Thanks Kevin, for the suggestions on normalisation methods. I read this paper, which makes the same point you mentioned, and after reading it I decided to try the TMM method (https://genomebiology.biomedcentral.com/track/pdf/10.1186/gb-2010-11-3-r25).
In NOISeq, I tried normalising with the TMM method in two ways,
with length correction (lc = 1)
myTMM = tmm(assayData(mydata1)$exprs, long = mylength, lc = 1)
or without length correction (lc = 0)
myTMM = tmm(assayData(mydata1)$exprs, long = 1000, lc = 0)
But I am not sure how to validate the output in either case (just to understand where the difference lies and which one to choose).
I would highly appreciate any suggestions on this.
Thanks a lot!
Ankit
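To see where length correction does and does not make a difference, a small base R sketch may help. It uses plain CPM and RPKM on made-up toy counts, not TMM and not the data from this post, and all object names are hypothetical. Because each gene is divided by the same length in every sample, the knockout/wildtype ratio for a given gene comes out the same on both scales; only comparisons between different genes change. The sketch does not model TMM scaling factors, which may themselves shift when length correction is switched on.

```r
# Toy data: 3 genes x 2 samples (made-up values).
counts <- matrix(c(100, 200, 700,
                   300, 300, 400),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("wt", "ko")))
gene_length <- c(geneA = 1000, geneB = 2000, geneC = 3500)  # bp

# Depth-only scaling (CPM) and depth + length scaling (RPKM).
cpm_vals  <- t(t(counts) / colSums(counts)) * 1e6
rpkm_vals <- cpm_vals / (gene_length / 1000)

# Per-gene knockout / wildtype ratio on each scale.
ratio_cpm  <- cpm_vals[, "ko"]  / cpm_vals[, "wt"]
ratio_rpkm <- rpkm_vals[, "ko"] / rpkm_vals[, "wt"]

# The length term cancels within a gene, so the two ratios agree.
ratios_match <- isTRUE(all.equal(ratio_cpm, ratio_rpkm))
print(ratios_match)
```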
If you are unsure of the exact parameters to choose, then I would leave them at the default. For TMM, the default is:
For TMM, this means no length correction. TMM will instead calculate scaling factors and normalise your data that way. The function that you're using, however, is not the native function from the package for which TMM was developed, i.e., edgeR.
I had a previous post here: How do I explain the difference between edgeR, LIMMA, DESeq etc. to experimental Biologist/non-bioinformatician
Thanks Kevin, for the important suggestions.
I appreciate your help.
After applying TMM normalisation, I observe something that is unclear to me.
I applied the TMM method after filtering the data with two different CPM cutoffs: 1 and 100. With the low CPM cutoff, more features were retained (14051), and with the high CPM cutoff (100), fewer features were retained. However, when I performed the differential expression analysis, fewer DEGs were found with cpm=1 than with cpm=100.
The R scripts are as follows:
It would really be helpful if I could get any suggestions on this.
Thanks
Ankit
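The filtering step described above can be sketched in base R as follows. This is not the script from the post: the counts are made up, `filter_by_cpm` is a hypothetical helper, and the rule "CPM above the cutoff in every sample" is an assumption, since the exact rule used is not shown. It only illustrates that a stricter cutoff leaves fewer features to test. In general, a larger set of tested features also means a heavier multiple-testing correction, which is one possible, unconfirmed contributor to seeing fewer DEGs at the lower cutoff.

```r
# Toy counts: each sample sums to exactly 1e6, so CPM equals the raw count.
counts <- matrix(c(0,  50, 950, 999000,
                   2, 150, 848, 999000),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("wt", "ko")))

# Hypothetical helper: keep features whose CPM exceeds the cutoff
# in every sample (an assumed rule, not taken from the post).
filter_by_cpm <- function(counts, cutoff) {
  cpm_vals <- t(t(counts) / colSums(counts)) * 1e6
  keep <- rowSums(cpm_vals > cutoff) == ncol(cpm_vals)
  counts[keep, , drop = FALSE]
}

n_low  <- nrow(filter_by_cpm(counts, cutoff = 1))    # lenient cutoff
n_high <- nrow(filter_by_cpm(counts, cutoff = 100))  # strict cutoff

print(c(cpm_1 = n_low, cpm_100 = n_high))
```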
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. This belongs under @Kevin's answer.
I would welcome any suggestions on this. I observe a similar pattern with other datasets as well. Please suggest what the reason could be.
Thanks
I'm unsure what the problem is, exactly. If you apply different normalisation methods and different statistical tests to the same dataset, you will obtain different answers. Anyone with experience will give this same answer. The different normalisation methods process data differently and have different filters for, for example, 'outliers' and low-count samples. The methods also employ different statistical tests.
Perhaps you should read some published literature on the different normalisation strategies and then decide which one is best for you. You could start with this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.
Thanks Kevin,
I tried some of the normalisation methods (CPM, upper quartile, TMM), without length normalisation, after removing lowly expressed genes. An almost identical number of differentially expressed features was observed with each.
I am still figuring out whether this is a suitable approach for my data.
I will read more about this.
Thanks
Ankit
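For anyone wanting to see how two of the methods mentioned above can differ, here is a minimal base R sketch comparing total-count scaling (the idea behind CPM) with upper-quartile scaling. The counts are made up, the object names are hypothetical, and upper-quartile scaling is taken here as dividing by the 75th percentile of the non-zero counts in each sample; TMM is not shown. In this toy case one highly expressed gene dominates the library size, so the two methods disagree on whether the other genes changed.

```r
# Toy counts: gene5 dominates both libraries (made-up values).
counts <- matrix(c(10, 20, 30, 40, 900,
                   20, 40, 60, 80, 900),
                 nrow = 5,
                 dimnames = list(paste0("gene", 1:5), c("wt", "ko")))

# Total-count scaling factors, centred on the mean library size.
total_factor <- colSums(counts) / mean(colSums(counts))

# Upper-quartile scaling factors: 75th percentile of non-zero counts.
uq <- apply(counts, 2, function(x) quantile(x[x > 0], 0.75, names = FALSE))
uq_factor <- uq / mean(uq)

# Divide each sample (column) by its scaling factor.
norm_total <- t(t(counts) / total_factor)
norm_uq    <- t(t(counts) / uq_factor)

print(round(norm_total, 1))
print(round(norm_uq, 1))
```

With these toy values, gene1 to gene4 look unchanged between samples after upper-quartile scaling but appear higher in the knockout after total-count scaling.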