Question

Normalization Procedure For The Rna-Seq Data Based On Counts

2

Entering edit mode

11.5 years ago

mssestak ▴ 20

Hi,

I have a RNA-Seq time-series data (vertebrate development with 9 stages and 2 samples per stage). So far I mapped these reads to the genome using BWA and used these alignments to count the number of reads on the level of genes using HTSeq count since I'm not interested in exons but complete genes. I'm using these counts to first calculate frequencies of expression value for each gene, (e.g., count of particular gene/sum of all counts in a sample), and then to calculate sum of frequencies for each sample and stage. As a result of this procedure I get the cumulative values (cumulative frequency) that show how the total expression varies between the stages. The question: I want to try different normalization techniques (on the HTSeq count results) to see how they affect my cumulative expression values. How to do only the normalization(s) without the differential expression of genes that usually comes later in packages?. What are my options among the packages available?

Greetings,

Martin

rna-seq normalization htseq • 17k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 11.5 years ago by mssestak ▴ 20

Ram · Answer 1 · 2013-05-27

8

Entering edit mode

11.5 years ago

Mikael Huss 4.8k

You could use something like this for the voom() normalization/variance stabilization in limma:

normalize.voom <- function(counts){
    require(limma)
    return(voom(counts)$E)
}

and something like this for the TMM normalization in edgeR (this function returns CPM, count per million, values as calculated by edgeR)

cpm.tmm <- function(counts, groups=NA){
    require(edgeR)
    ifis.na(groups)){
        d<-DGEList(counts=counts)
    }
    else{
        d<-DGEList(counts=counts, group=groups)
    }
    d <- calcNormFactors(d, method="TMM") 
    return(cpm(d, normalized.lib.sizes=TRUE))
}

Of course there are other options as well such as RLE normalization (edgeR/DESeq) and DESeq's variance stabilization.

ADD COMMENT • link 11.5 years ago by Mikael Huss 4.8k

1

Entering edit mode

The code that you post for using voom simply computes logCPM. If that's all you want, there's no need to run voom to get it. The benefit of running voom is to compute the weights on each observation, which you are throwing away by extracting just the $E from the resulting EList object.

ADD REPLY • link 11.5 years ago by Ryan Thompson ★ 3.6k

0

Entering edit mode

Yes, good point. The way I interpreted the question, the person who asked just wanted to look at expression distributions resulting from using different normalizations (like logCPM) without doing differential expression testing. But that could equally as well have been done by the cpm() in edgeR without TMM, for instance. So I guess the voom() snippet was superfluous.

ADD REPLY • link 11.5 years ago by Mikael Huss 4.8k

0

Entering edit mode

Is there a way to have a matrix with the data corrected for variance as in voom?

I would like to plot my samples before running DE.

Thanks for your help.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by bnj195 • 0

0

Entering edit mode

As mentioned above, voom calculates a log(count per million) but also provides a confidence weight on each observation. So with the code given above, you will in a sense get the "data corrected for variance" (while losing the confidence weights as Ryan commented) - or you can just divide the counts by million reads and take logarithms yourself. Or you could use DESeq2's rlogTransformation() or varianceStabilizingTransformation() instead.

ADD REPLY • link 10.0 years ago by Mikael Huss 4.8k

0

Entering edit mode

Can you use rlog function in R to normalize HT-seq counts? The issue is that I have loaded my data matrix consisting of counts generated from HTseq for RNA-Seq time-series data. There are sort of 2 replicates. First replicate has 4 time point samples(day0,2,5,14), while the Second has 5 time point samples(day0,2,5,15,30). The data was loaded as a read.table matrix in R. The following is the script used:

setwd("Downloads")
df_A <- read.table(file='Replicate_sample_Raw_RPM_counts-2',header=T,sep='\t')
df_B <- read.table(file='HT-seq_counts.txt-2',header=T,sep='\t')
merged_df <- merge(df_A,df_B,by='geneID')
write.table(merged_df,file='merged_count.txt',row.names=F,quote=F,sep='\t')
counts<-read.csv("merged_count.txt",header=T,sep="\t")
data<-counts[-1]
rlog(data)

head(data)

  day0.x day2.x day5.x day14 day0.y day2.y day5.y day15 day30
1    358    422    241   617    145    508    389   357   594
2     11     31     44    26      8     24     41    49    49
3      7      3     33   392      2      5     25   159   155
4     26     45     74  5624     45    175     94  4604 14238
5      4     10     66   338     19     13     70   229   242
6    477    138     64    21    747    507     98    25    22

But when I try to to run rlog or rlogTransformation I get the following error:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘sizeFactors’ for signature ‘"data.frame"’

How can I improve the script to make rlog run so that I can have my HT-seq counts normalized?

ADD REPLY • link 8.6 years ago by herman.pappoe.45 ▴ 10

0

Entering edit mode

Hi Mikael , I am using R code for TMM normalization, but am getting this error

  cpm.tmm <- function(CB_total, groups=NA) {
   + require(edgeR)
   + ifis.na (groups) {
   Error: unexpected '{' in:
   "require(edgeR)
   ifis.na (groups) {"

I am new to R . So can you give me your R code for TMM normalization with explaination (if you can). Thanks.

ADD REPLY • link 6.8 years ago by k.kathirvel93 ▴ 310

score 2 · Answer 2 · 2013-05-27

2

Entering edit mode

11.5 years ago

John St. John ★ 1.2k

You could always load the data into an R matrix and then apply whatever normalization technique you want (eg mean, upper quartile, median, whatever else).

ADD COMMENT • link 11.5 years ago by John St. John ★ 1.2k