Hi everyone, I am working on a set of samples (sequenced on Illumina) for differential gene expression between control and treated samples. I pre-processed the reads with a few tools, ran a mapping tool, and did the feature counting. Now I have a single txt file with gene names and counts for all samples. I have a few questions.
1) To make tag counts comparable across samples, a normalization must be performed. I have been asked by my supervisor to use RPKM (reads per kilobase per million, a normalization method that is widely used in RNA-seq analysis). Do I have to first filter out all the genes that are zero across all the samples and then normalize? If yes, can anyone tell me how to normalize my file using RPKM, giving just my feature count txt file as input (any Bioconductor, R, or Python package)?
2) Once I have the RPKM-normalized file, I would like to find the differentially expressed genes. How should I do that?
I know that the DESeq and edgeR packages differ in their default normalization: edgeR uses the trimmed mean of M values (TMM), whereas DESeq uses a relative log expression approach. But I am interested in RPKM (as per the requirement).
Thank you so much for the reply. I read somewhere that RPKM is a normalization method, and that after applying RPKM you use that matrix to find differentially expressed genes. But what I understood from your answer is that first I need to apply RPKM using the edgeR rpkm() function and get a count matrix as a result, and then run edgeR or DESeq on that count matrix to find genes which differ between conditions. That is, do the normalization and then the statistical testing? (I am sorry, I am a bit confused now.)
No. First you need to separate the idea of RPKM from differential expression. edgeR (or DESeq) couldn't care less about RPKM, and until recently edgeR had no function to calculate it. Yet they both do DE analysis and normalization. How is that? Both methods are count based and will find genes differing between conditions, but neither is concerned with absolute expression levels (only relative levels), and in this regard a normalization will be applied between samples. RPKM is something you can calculate yourself, and it is also a form of normalization (since it's a common activity, a function was added to the edgeR package). However, RPKM (as discussed above) is simply a rate of observance in your data. Thus, for a given molecule, you can scale the number of reads observed in each condition by the number of mapped reads in that condition, and also scale by molecule length; you would then have normalized for sequencing depth (comparison between samples) and molecule length (comparison between genes). I would encourage you to read the edgeR user guide, a simple review (e.g. Oshlack et al., 2010; http://www.ncbi.nlm.nih.gov/pubmed/21176179), and the paper on TMM (Robinson & Oshlack, 2010; http://www.ncbi.nlm.nih.gov/pubmed/20196867). They're all free. There are probably better, more modern reviews, but those are simple, straightforward, and relevant to the issues above.
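To make the "calculate it yourself" point concrete, here is a minimal sketch of the scaling described above in Python with NumPy. It is only an illustration under stated assumptions: the function name `rpkm` and the toy numbers are made up for this example, gene lengths have to come from your annotation (they are not in a plain count file), and the library size is taken as the column sum of the count matrix unless you supply the number of mapped reads yourself.

```python
import numpy as np

def rpkm(counts, gene_lengths_bp, library_sizes=None):
    """Hypothetical helper: reads per kilobase per million.

    counts          : genes x samples matrix of raw read counts
    gene_lengths_bp : one length (in bases) per gene, from the annotation
    library_sizes   : mapped reads per sample; assumed to be the column
                      sums of `counts` when not given
    """
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(gene_lengths_bp, dtype=float)
    if library_sizes is None:
        library_sizes = counts.sum(axis=0)
    library_sizes = np.asarray(library_sizes, dtype=float)

    per_million = library_sizes / 1e6   # sequencing depth (between samples)
    per_kb = lengths / 1e3              # molecule length (between genes)
    return counts / per_million[np.newaxis, :] / per_kb[:, np.newaxis]

# Toy data: 4 genes x 2 samples; sample 2 is sequenced twice as deeply.
counts = np.array([[100, 200],
                   [300, 600],
                   [0, 0],
                   [600, 1200]])
lengths = np.array([1000, 2000, 1500, 3000])

values = rpkm(counts, lengths)
print(values)
```

In this toy matrix the second sample simply has twice the depth of the first, so after scaling both columns come out identical. Note that these values are a description of the data; the count-based tools discussed above take the raw counts as input and apply their own between-sample normalization.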