Question

Differential Gene Expression Between Control And Treated Samples, Rpkm

1

11.1 years ago

HNK ▴ 150

Hi everyone, I am working on a set of samples (sequenced on Illumina) to find differential gene expression between control and treated samples. I preprocessed the reads with a few tools, mapped them, and did the feature counting. Now I have a single txt file with gene names and counts for all samples. I have a few questions.

1) To make tag counts comparable across samples, a normalization must be performed. I have been asked by my supervisor to use RPKM (reads per kilobase per million, a normalization method that is widely used in RNA-seq analysis). Do I have to first filter out all genes that are zero across all samples and then normalize? If yes, can anyone tell me how to normalize my file using RPKM by just giving my feature count txt file as input (any Bioconductor, R, or Python package)?

2) Once I have the RPKM-normalized file, I would like to find the differentially expressed genes. How should I do that?

I know that the DESeq and edgeR packages differ in their default normalization: edgeR uses the trimmed mean of M values, whereas DESeq uses a relative log expression approach. But I am interested in RPKM (as per the requirement).

rpkm • 10.0k views

ADD COMMENT • link updated 11.1 years ago by seidel 11k • written 11.1 years ago by HNK ▴ 150

score 4 · Answer 1 · 2014-03-07

4

11.1 years ago

seidel 11k

Both of your questions can be addressed easily with edgeR, but there is some confusion in your question regarding normalization, RPKM, and differential expression. If you import your count table into edgeR, there is a function for calculating RPKM from the counts for each gene. It's called rpkm(), and you hand it your counts and the gene lengths. It returns a matrix of RPKM values. This is NOT what you use to find differentially expressed genes. The edgeR and DESeq packages both use the counts matrix, COMPLETELY INDEPENDENT OF RPKM, to find genes which differ between conditions. The RPKM values are simply a readout of reads per million for a given transcript, normalized by transcript length; whether a further normalization (e.g. trimmed mean) has been applied is a secondary consideration. For many people, a spreadsheet of edgeR (or DESeq) derived ratios and p-values is sufficient to find genes of interest, and then they want to see the RPKM values to get a sense of expression levels, even though the two may be generated by slightly different paths.
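
To make the arithmetic behind an RPKM table concrete, here is a minimal sketch in plain Python. It is not edgeR's rpkm() and does not reproduce its options; it assumes the library size is simply the column sum of the count table, and the function name, toy counts, and gene lengths are all made up for illustration.

```python
def rpkm(counts, gene_lengths_bp):
    """Reads per kilobase of gene per million counted reads.

    counts: list of rows, one per gene, each a list of raw counts per sample.
    gene_lengths_bp: gene lengths in base pairs, in the same order as the rows.
    Assumption: library size = total counts in each sample (column sum).
    """
    n_samples = len(counts[0])
    # Column sums: how many reads were counted in each sample.
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    result = []
    for row, length in zip(counts, gene_lengths_bp):
        # 1e9 = 1e3 (per kilobase) * 1e6 (per million reads).
        result.append([c * 1e9 / (lib_sizes[j] * length)
                       for j, c in enumerate(row)])
    return result


# Hypothetical toy data: 3 genes x 2 samples; sample 2 is sequenced twice as deep.
counts = [
    [100, 200],
    [300, 600],
    [600, 1200],
]
gene_lengths_bp = [1000, 2000, 4000]

values = rpkm(counts, gene_lengths_bp)
for gene_values in values:
    print(gene_values)
```

In this toy table each gene gets the same value in both samples, because the doubled counts in sample 2 are cancelled by its doubled library size. The raw counts, not these values, are what would go into edgeR or DESeq for testing.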

ADD COMMENT • link 11.1 years ago by seidel 11k

0

Thank you so much for the reply. I read somewhere that RPKM is a normalization method, and that after applying RPKM you use that matrix to find differentially expressed genes. But what I understood from your answer is that first I need to apply RPKM using the edgeR rpkm() function and get a count matrix as the result, and then use edgeR or DESeq with that count matrix to find genes which differ between conditions, i.e. do the normalization and statistical testing. (I am sorry, I am a bit confused now.)

ADD REPLY • link 11.1 years ago by HNK ▴ 150

2

No. First you need to separate the idea of RPKM from differential expression. edgeR (or DESeq) couldn't care less about RPKM, and until recently edgeR had no function to calculate it. Yet they both do DE analysis and normalization. How is that? Both methods are count based and will find genes differing between conditions, but neither is concerned with absolute expression levels (only relative levels), and in this regard a normalization is applied between samples. RPKM is something you can calculate yourself, and it is also a form of normalization (since it's a common activity, a function was added to the edgeR package). However, RPKM (as discussed above) is simply a rate of observance in your data. For a given molecule, you scale the number of reads observed in each condition by the number of mapped reads in that condition, and then by the molecule length; you have thus normalized for sequencing depth (comparison between samples) and molecule length (comparison between genes). I would encourage you to read the edgeR user guide, a simple review (e.g. Oshlack et al., 2010; http://www.ncbi.nlm.nih.gov/pubmed/21176179), and the paper on TMM (Robinson & Oshlack, 2010; http://www.ncbi.nlm.nih.gov/pubmed/20196867). They're all free. There are probably better, more modern reviews, but those are simple, straightforward, and relevant to the issues above.
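
Written out, the two scalings described above give the usual RPKM definition. The symbols are notation chosen here for illustration: C_g is the number of reads counted for gene g, N is the total number of mapped reads in the sample, and L_g is the gene length in base pairs. The numbers in the worked line are made up.

```latex
\[
\mathrm{RPKM}_g = \frac{10^{9}\, C_g}{N \, L_g}
\]
% Worked example with made-up numbers:
\[
C_g = 500,\quad N = 2\times 10^{7},\quad L_g = 2000\ \text{bp}
\;\Rightarrow\;
\mathrm{RPKM}_g = \frac{10^{9}\cdot 500}{2\times 10^{7}\cdot 2000} = 12.5
\]
```

Dividing by N handles sequencing depth (comparison between samples) and dividing by L_g handles molecule length (comparison between genes); the factor 10^9 only sets the units to "per kilobase, per million reads".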

ADD REPLY • link 11.1 years ago by seidel 11k