Question

Draw Heatmap Or Do PCA Analysis With Raw Read Counts?

8

11.8 years ago

camelbbs ▴ 710

Hi,

I want to ask a question about viewing RNA-seq data with raw read counts. After I get the raw read counts from HTseq-count or similar tools, how do I normalize them? I can use "counts(cds,normalized=T)" in DESeq to get the normalized data, but it still needs to be normalized by gene length, right?

Do I need to use RPKM generated from cufflinks to draw a heatmap or perform PCA analysis? Can raw read counts be used for that?

Thanks,

Ch

rna-seq • 11k views

updated 11.8 years ago by Irsan ★ 7.8k • written 11.8 years ago by camelbbs ▴ 710

score 7 · Answer 1 · 2013-01-25

This is a good question, and I look forward to reading other answers. You certainly can do a PCA analysis and create heatmaps (presumably with the typical hierarchical clustering) on raw read counts, but you must interpret them within that context. If your libraries are of similar depth, then normalizing for read depth may not matter that much. And if your PCA or heatmap/clustering analysis is mostly focused on the relationship between samples, then normalizing for gene size won't matter as much. However, if libraries have dramatically different depths, this could certainly affect your clustering results (although that will depend heavily on what kind of distance metric you use). Similarly, if you are interested in how genes relate to each other, you will probably want to normalize for gene size.

Calculating an RPKM matrix from your raw read counts is very easy. Why not run all of them (raw, RPKM, and maybe some other normalization schemes) through your heatmap and PCA analysis and compare the results with the above caveats in mind? It will probably be educational and teach you something about your data.
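To make the "very easy" part concrete, here is a minimal sketch of an RPKM calculation in plain Python. It is an illustration under assumptions, not the output of any particular tool: the gene lengths, sample names, and counts are made up, and the library size is taken as the sum of the counted reads per sample (you may prefer the total number of mapped reads).

```python
def rpkm(counts_by_sample, gene_lengths_bp):
    """Convert raw read counts to RPKM.

    counts_by_sample: dict mapping sample name -> list of raw counts per gene
    gene_lengths_bp:  list of gene lengths in base pairs, same gene order
    RPKM = count * 1e9 / (gene length in bp * total reads in the sample)
    """
    result = {}
    for sample, sample_counts in counts_by_sample.items():
        # Assumption: library size = sum of counted reads in this sample.
        total_reads = sum(sample_counts)
        result[sample] = [
            count * 1e9 / (length * total_reads)
            for count, length in zip(sample_counts, gene_lengths_bp)
        ]
    return result


# Hypothetical toy data: three genes, two samples.
# sample_B has the same composition as sample_A at twice the depth.
gene_lengths_bp = [1000, 2000, 500]
raw_counts = {
    "sample_A": [100, 400, 500],
    "sample_B": [200, 800, 1000],
}

rpkm_values = rpkm(raw_counts, gene_lengths_bp)
for sample, values in rpkm_values.items():
    print(sample, values)
```

In this toy case the two samples differ twofold in raw counts but end up with the same RPKM values, which is the read-depth effect described above. For a heatmap or PCA you would typically also log-transform the values, for example log2(RPKM + 1).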

score 2 · Answer 2 · 2013-01-25

11.8 years ago

Irsan ★ 7.8k

It has been suggested that normalizing by calculating RPKM is not enough, because GC-content bias can be sample-specific and because longer genes have lower variance between samples and therefore generate lower p-values in significance testing. Have a look at this paper about RNA-seq normalization. But it's definitely worth just trying all the possibilities and making some diagnostic plots.

11.8 years ago by Irsan ★ 7.8k

0

Thanks. It seems normalization of RNA-seq data is a complex question. If RPKM is not good enough, then how should raw read counts be normalized so that they look reasonable in a heatmap? Maybe the R package in this paper works well.

11.8 years ago by camelbbs ▴ 710

0

Just try all of them: no normalization, normalizing for transcript length, normalizing for transcript length and total mapped reads per sample, normalizing with package N1...NX, and see whether you can discover any biases towards GC content, sequencing lanes, transcript length, and so on. But the obvious expression differences will be clear without extensive normalization, so don't spend months just to move your analysis from an A++ to an A+++.
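A rough sketch of what "try all of them" could look like for the first few schemes, in plain Python. Everything here is hypothetical: the sample names, counts, and gene lengths are invented, and the choice of Euclidean distance on log2(x + 1) values is just one assumption about how a clustering step might compare samples.

```python
import math


def per_kilobase(values, lengths_bp):
    # Scale each gene by its length in kilobases.
    return [v / (length / 1000) for v, length in zip(values, lengths_bp)]


def per_million(values):
    # Scale each sample by its total (here: sum of counted reads), per million.
    total = sum(values)
    return [v * 1e6 / total for v in values]


def log2_distance(x, y):
    # Euclidean distance between two samples on log2(x + 1) values.
    return math.sqrt(
        sum((math.log2(a + 1) - math.log2(b + 1)) ** 2 for a, b in zip(x, y))
    )


# Hypothetical toy data: four genes, three samples.
lengths_bp = [1000, 2000, 500, 4000]
samples = {
    "ctrl_1": [100, 400, 50, 800],
    "ctrl_2": [400, 1600, 200, 3200],  # same composition as ctrl_1, 4x depth
    "treated": [100, 400, 400, 450],   # same depth as ctrl_1, different composition
}

schemes = {
    "none": lambda c: list(c),
    "length": lambda c: per_kilobase(c, lengths_bp),
    "depth": per_million,
    # Depth first (using raw totals), then length: equivalent to RPKM.
    "length+depth": lambda c: per_kilobase(per_million(c), lengths_bp),
}

distances = {}
for name, normalize in schemes.items():
    normalized = {sample: normalize(c) for sample, c in samples.items()}
    distances[name] = {
        other: log2_distance(normalized["ctrl_1"], normalized[other])
        for other in ("ctrl_2", "treated")
    }
    print(name, distances[name])
```

With these made-up numbers, the unnormalized and length-only versions place the deeper replicate ctrl_2 further from ctrl_1 than the treated sample, while any scheme that corrects for depth brings the two controls back together. Length scaling alone barely changes the sample-to-sample picture, because it shifts each gene by the same amount in every sample.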

11.8 years ago by Irsan ★ 7.8k

0

Sorry I know this is an old thread. Trying everything can make it too easy to fall into confirmation bias. There really ought to be a good reason to try the approaches that you run.

10.8 years ago by Adamc ▴ 680