Question

Calculating RPKM for RNA-Seq Data with Several Samples per Condition

2

11.9 years ago

narges ▴ 210

Hi,

I asked earlier how to calculate RPKM for RNA-seq data and got an answer. Now I want to calculate RPKM for an experiment with two groups and 8 samples per condition. Is it a good idea to calculate the RPKM for each sample of a condition separately and then take the mean for each condition? I would be grateful if someone could point me to a paper or other reference that has done this. Thank you in advance.

rpkm rna-seq • 6.4k views

ADD COMMENT • link updated 11.9 years ago by swbarnes2 14k • written 11.9 years ago by narges ▴ 210

0

Are your 8 samples replicates? If so, what kind? What is your end goal? e.g., are you trying to identify significantly differentially expressed genes between the two groups? What software/package(s) are you using to process your RNA-seq data?

ADD REPLY • link 11.9 years ago by Obi Griffith 20k

0

Yes, they are 8 biological replicates for each group. The goal is to compare gene expression levels; of course the final aim is to find DE genes, but not yet.

ADD REPLY • link 11.9 years ago by narges ▴ 210

0

Hi, how did you calculate the RPKM values? Do you have single-end data? I would also like to know how to calculate RPKM values. Please reply soon. Thank you in advance.

ADD REPLY • link 11.1 years ago by raj.gzra ▴ 30

score 5 · Answer 1 · 2013-01-30

Hello narges,

I assume the 8 samples were sequenced with the same technique (possibly on the same flow cell).

There are different possibilities for what you want to do. Below I explain the two most common ones:

1. Use raw mapping counts (e.g. counted with bedtools) and plug them into DESeq to call differentially expressed genes. The idea is to keep the expression estimate as simple as possible, because you are comparing the same entity across conditions: is gene x differentially expressed in cond1 vs. cond2? You can specify biological replicates so that the expression variance within each condition is estimated. This variance is then used to judge the expression differences between conditions.
2. Use cuffdiff to call differentially expressed transcripts. This program uses an optimization procedure to estimate the expression of individual transcripts (exon skipping, ...) and can therefore also call differential splicing events.

Depending on the granularity you need (differential gene or transcript calling), you can choose one of these two approaches.

score 3 · Answer 2 · 2013-01-30

First, you need to ask the people who submitted the samples if they are true biological replicates, or technical replicates.

Technical replicates are good for knowing how much variability your library prep and instrument add. A technical replicate would be like taking the liver from one mouse, cutting it into 4 pieces, and treating them like 4 different samples. Any variation between the samples should be an artifact of the library prep and sequencing procedure. So if the exact same sample prepped multiple ways leads to big swings in RPKM, then you know that your prep is lousing things up, and you are going to have very little precision in your estimate of what the "real" expression was in the one sample.

In general, Illumina instruments do a good job with technical replicates, so your samples should be very, very similar to each other. If that's the case, I think combining them might be okay.

Biological replicates are when you, say, expose 4 organisms to the same condition. Differences between biological replicates are likely not artifacts but real variation between organisms. You hope that your condition is powerful enough that the differences between samples within a condition will be quite a bit smaller than the differences between organisms exposed to different conditions.

For instance, let's say you check one control animal and one condition animal. For one gene, the control has an RPKM of 3 and the other animal has an RPKM of 8. A big difference, right?

Well, now you check a bunch of different control animals, and you see that the range of RPKM among control animals is 2-10, and the range from your condition animals was 3-11. Now, it looks like the condition doesn't actually change expression of that gene; it's naturally pretty variable, and the two sets of animals look pretty much the same.

So for biological replicates, you need to keep them separate, because you need to know the average and the variance for each gene. If there is a lot of organism-to-organism variability, you need to know how large it is, and combining all the biological replicates together will lose that information.
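
As a rough illustration of keeping replicates separate, here is a minimal Python sketch that computes RPKM per sample and only then summarizes each group. The gene length, read counts, and library sizes are made-up toy numbers, and the names `rpkm` and `summarize` are hypothetical helpers, not part of any package.

```python
from statistics import mean, variance

def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

def summarize(samples, gene_length_bp):
    """Compute RPKM per replicate, then the group mean and sample variance.

    `samples` is a list of (reads on the gene, total mapped reads) pairs,
    one pair per biological replicate.
    """
    values = [rpkm(count, gene_length_bp, total) for count, total in samples]
    return values, mean(values), variance(values)

# Hypothetical toy data for one 2,000 bp gene, three replicates per group.
GENE_LENGTH_BP = 2000
control = [(100, 10_000_000), (150, 12_000_000), (90, 9_000_000)]
treated = [(300, 11_000_000), (280, 10_000_000), (350, 13_000_000)]

for name, group in (("control", control), ("treated", treated)):
    values, group_mean, group_var = summarize(group, GENE_LENGTH_BP)
    print(name, [round(v, 2) for v in values], round(group_mean, 2), round(group_var, 2))
```

Each replicate is normalized by its own library size before anything is averaged, so the per-gene variance within a group stays available for later testing.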

So biological replicates are required for good RNA-seq experiments. But people sometimes cut corners, so don't assume that that's what you have.

It's not the most sophisticated analysis around, but working out the averages and variances for each gene in each group would be a useful, simple place to start. Then do unpaired t-tests, which look at the averages of each group and their variances, and give you a p-value: how likely a difference at least this large would be if the condition did not really change the expression of the gene.
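
A minimal sketch of such an unpaired test for a single gene, using only the Python standard library. The answer does not say which variant of the t-test to use; this sketch assumes Welch's version (unequal variances). The RPKM values are invented to mirror the 2-10 and 3-11 ranges mentioned above, and `welch_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two
    independent samples, using Welch's unequal-variance formulas."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical per-replicate RPKM values for one gene, 8 replicates per group.
control = [2, 4, 5, 6, 7, 8, 9, 10]
condition = [3, 5, 6, 7, 8, 9, 10, 11]

t, df = welch_t(control, condition)
print(f"t = {t:.3f}, df = {df:.1f}")
```

Here the group means differ by only 1 RPKM while the within-group spread is large, so the t statistic is small in magnitude. Converting t and df into a p-value needs the t-distribution CDF, which the standard library does not provide; `scipy.stats.ttest_ind(control, condition, equal_var=False)` performs the whole test if SciPy is available.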