Hi everyone,
I have an RNA-Seq dataset made of 3 biological replicates for the control and 3 biological replicates for the experiment. I used RSEM to calculate gene expression and EBSeq to calculate the differential expression between the control and the experiment. When looking at either the control or the experiment, in some cases the expression values are not the same across the 3 replicates for a particular transcript. I have tried calculating the variance of the three controls and looking at the distribution of those variances, which does not seem to help either. Does anyone know of any other method or software I could use to filter out the transcripts that are not constant? I would truly appreciate any help.
Thank you
Perhaps I am misunderstanding something, but here goes.
Isn't that the whole point of doing biological replicates? To understand the variation present. Having variation between replicates is very normal. How different are they? Why do you need them to be constant?
There are many transcripts that have 3 very different values, for example:
You see, this causes problems when I am comparing the 3 replicates of the control to the 3 replicates of the experiment for differential expression. What I want to do is filter out transcripts that look like this, because I think they are not reliable. I am looking for a way to pick out transcripts that are similar in at least 2 out of 3 of these replicates.
These biological replicates are from flies coming from the same condition so they should be showing the same expression across these replicates.
As @Frederike said below, most commonly used DE analysis programs take the biological variability into consideration when fitting models to the data. Take a look at the workflow she has linked below.
Thank you. Do you know if EBSeq considers this variability too? The list I am looking at contains transcripts that have been marked as differentially expressed, but when looking at the expression values of the replicates you can see they are not constant.
I would be surprised if DESeq et al. would call the example that you showed above differentially expressed, although it, of course, also depends on the values from the other condition. If those values were considerably higher (e.g., 1000, 1200, 900), I'd still think that your gene in question could be identified as being significantly lower in whatever condition you showed above.
No idea what EBSeq does, but I can only repeat myself by saying to please follow the established workflows.
I have a feeling those values are normalized but @Sepd will need to clarify.
They are normalized. I don't know how to rely on a transcript being differentially expressed when there are big differences in the expression values. Sometimes it shows zero expression in two of the replicates, but one of the replicates has a huge value that throws off the whole calculation.
FBtr0085375 0.091674054 0.030532013 0.019790594 0.010525734 167.6551102 0.010300845
So for this transcript, the first three columns are the biological replicates for the experiment and the last three columns are the biological replicates for the control. This gave a log2 fold change of 6, but you can see how the 5th column is throwing it off.
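To make the effect of that 5th column concrete, here is a minimal Python sketch that summarizes each group from the row above. The helper name summarize and the choice of summary statistics (mean, median, coefficient of variation) are assumptions made for illustration only. This is not how EBSeq computes its fold change, so it is not expected to reproduce the reported value of 6.

```python
from statistics import mean, median, stdev

# Values copied from the FBtr0085375 row quoted above.
experiment = [0.091674054, 0.030532013, 0.019790594]  # columns 1-3
control = [0.010525734, 167.6551102, 0.010300845]     # columns 4-6


def summarize(values):
    """Return mean, median, and coefficient of variation (sd / mean).

    Hypothetical helper for illustration; not part of EBSeq or RSEM.
    """
    m = mean(values)
    return {"mean": m, "median": median(values), "cv": stdev(values) / m}


exp_summary = summarize(experiment)
ctl_summary = summarize(control)

# Fraction of the control group's total signal that comes from its
# single largest replicate.
outlier_share = max(control) / sum(control)

print("experiment:", exp_summary)
print("control:   ", ctl_summary)
print("largest control replicate, share of total:", outlier_share)
```

The control mean is driven almost entirely by one replicate while the control median stays close to the other two values. A gap like that between mean and median is a quick way to spot rows where a single replicate dominates.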
That's what the (adjusted) p-values are for.
You would not want to filter that away just for that reason. What if your treated samples have values of 800, 900, 1000? Anyway, you don't want to reinvent the wheel by DIYing this. I would be shocked if your program didn't handle biological replicates appropriately, so just let it do its job. That said, I would recommend using DESeq, edgeR, or limma instead: since more people use them, it's easier to get help. DESeq takes pretty much the same input as your program does.