Forum: DEG analysis when no replicates are available
4.7 years ago

A recent bioRxiv preprint has come to my attention. It evaluates different R packages for the analysis of differentially expressed genes (DEG) in RNA-Seq experiments when no replicates are available.

Although I also believe that this is not recommended, we certainly have to face this situation sometimes, and it is nice to learn that some approximations can be performed. The preprint evaluates several of these packages.

The article, which has not yet been peer reviewed, can be found HERE

NOISeq RNA-Seq edgeR • 1.3k views

I can only shake my head when reading studies like this. First of all, they use simulated data; not a single confirmation experiment was done to back up their strategy. Second, they benchmark (among others) against DESeq (deprecated since 2014 upon the DESeq2 release) and NOISeq (deprecated in the current Bioc version). And third, why would you do a study to develop methods for processing inherently underpowered (n=1) experiments? If you do not have replicates then you cannot make any claims, simple as that. There is no way you can separate technical noise from biological effects. They should have generated some data themselves (or downloaded experiments with replicates) and then compared their method with standard approaches that use replicates. The false discovery rate would probably skyrocket.


"If you do not have replicates then you cannot make any claims"

Hi, I think I know what you mean and I agree, but I think this statement is sometimes exaggerated. There are circumstances where even n=1 can be informative, even if not conclusive, at least for formulating further hypotheses. For example, if you expect low variability to start with (e.g. cell cultures), the effect of the treatment is large, and you have some expectation of which genes should change and which should stay the same, even n=1 is useful. Besides, in some cases n=1 may be all you have, so either you do the experiment with that one sample or you don't do it at all (which may sometimes be wiser, but not always; it depends...).

On the other hand, in some cases n=2 or 3 is only marginally better than n=1 and may give a false sense of security. For example, if you assign people completely blindly to treatment and control, with very small n there is a good chance that confounders like sex or age are completely associated with the condition, and you get very significant changes which are in fact misleading.
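To put a rough number on that risk, here is a minimal sketch. It assumes a hypothetical balanced pool of n males and n females split at random into treatment and control groups of size n each, and computes the chance that sex ends up perfectly confounded with the condition. The function name and the balanced-pool setup are illustrative assumptions, not taken from any package.

```python
from math import comb


def prob_fully_confounded(n_per_group: int) -> float:
    """Chance that each group is single-sex after a random split.

    Assumption: the pool holds n_per_group males and n_per_group females,
    and every split into two groups of n_per_group is equally likely.
    """
    # Only 2 of the C(2n, n) possible treatment groups are single-sex:
    # all males treated, or all females treated.
    return 2 / comb(2 * n_per_group, n_per_group)


for n in (1, 2, 3, 5):
    print(f"n={n} per group: P(sex fully confounded) = {prob_fully_confounded(n):.3f}")
```

Under these assumptions the chance is 1 for n=1, about 1 in 3 for n=2, and 1 in 10 for n=3, which is why two or three replicates alone do not protect against this kind of confounding.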

Sorry - it's only that sometimes I have the impression that n=1 is taken as "useless" while n=2 as "ok, now everything's fine".


"you cannot make any claims"

I agree that replicates are necessary, but I am not sure you cannot make any claims at all. If you have geneA with 10 counts in both conditions and geneB that goes from 10 to 1000 counts, would you say that the probabilities that those genes are differentially expressed are completely equal?
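As a rough illustration of that intuition, the sketch below compares the two genes under a deliberately simple model. It assumes equal library sizes and pure Poisson counting noise, so that, given the total count of a gene, the count in one condition follows a Binomial(total, 0.5) distribution. The function names and the prior count of 0.5 are illustrative choices. The model ignores biological variability entirely, which is exactly what replicates would be needed to estimate, so the p-value only speaks to counting noise.

```python
from math import comb, log2


def log2_fold_change(a: int, b: int, prior: float = 0.5) -> float:
    """log2(b / a) with a small prior count to avoid division by zero."""
    return log2((b + prior) / (a + prior))


def counting_noise_pvalue(a: int, b: int) -> float:
    """Two-sided exact binomial p-value for a vs b with p = 0.5.

    Assumption: equal library sizes and Poisson noise only.
    """
    n = a + b
    k = min(a, b)
    # Lower tail P(X <= k); the distribution is symmetric at p = 0.5,
    # so doubling the tail (capped at 1) gives a two-sided p-value.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


for gene, (a, b) in {"geneA": (10, 10), "geneB": (10, 1000)}.items():
    print(gene, round(log2_fold_change(a, b), 2), counting_noise_pvalue(a, b))
```

Under these assumptions geneA shows no evidence of change while geneB is far outside what counting noise alone would produce. Whether the geneB change is caused by the treatment is a separate question that this calculation cannot answer.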


The difference in counts between A and B can be convincing but without randomization of replicates you can't be sure that the difference is due to the treatment applied. The difference may be due to something other than the treatment (e.g. sex, age, handling of samples etc), unless you have some prior belief that tells you that that gene is unlikely to be different for reasons other than the treatment applied (basically I just rephrased my previous comment)


I agree. My main point was that I don't think that the values are completely random in a single-replicate experiment. As you pointed out in the earlier comment, having 2 replicates does not automatically transform complete noise into perfect signal.


Sure, this suggests that this single gene is a DEG, but this is nothing you can base a claim on. By claim I mean biological messages like alterations of pathways, groups of genes responding to treatment, drug response etc. You are right though that n=1 is not completely useless, but it is not reliable for a systematic analysis, and I guess most people do RNA-seq because they want a global picture. For single-gene studies one can and should do qPCR or similar low-throughput methods.

4.7 years ago

My 5 cents

I found some SRA data from roots infected with a fungus and took a personal interest in analyzing them. The SRA data did not contain replicates.

In the meantime, the authors of the SRA data published a paper. However, neither the DE genes nor the GO enrichment they published made sense to me, as no defense genes appeared in the list of DE genes. I do not know the reasons for that.

However, we analyzed the data with edgeR, following its specific instructions for samples without replicates. I must say I felt very confident in these results: a set of defense DE genes related to the biotrophic fungus with which the roots were initially infected appeared after 7 days, and was replaced after 15 days by another set of defense genes typical of necrotrophic organisms. The lists of DE genes did not merely include defense genes; they contained numerous other known genes related to infection. This process has been well described on many occasions, and it serves as a model.

We then became interested in analyzing the metatranscriptome of these samples to try to explain these results. It turned out that after 7 days a myriad of new opportunistic organisms emerged, including necrotrophic fungi and bacteria that were taking advantage of the initial fungal infection. The results have recently been published in BMC

So I cannot be convinced that data without replicates are not useful. That has not been my experience: with careful management and analysis of the data, you can get very useful results.


Probably you picked up the top differential genes with large fold changes and high expression. I agree that these data are not useless, but one has to remember what the point of differential analysis is. It is not only to pick the top, highly expressed genes, which one could probably even pick by examining an MA-plot. It is rather to decide in a data-driven manner what the cutoff is for genes that are variable, moderately or lowly expressed, and have either moderate FCs or high FCs due to small counts, and this is not possible without replicates. The question is whether filtering by a proper logCPM and logFC cutoff would have given similar findings.
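For concreteness, the kind of cutoff-based filter meant here could look roughly like the sketch below. Everything in it is an assumption for illustration: the toy counts, the prior count of 0.5, and the thresholds (average logCPM of at least 1, absolute logFC of at least 2) are arbitrary, and this is not how edgeR or any other package computes its statistics.

```python
from math import log2


def filter_by_cutoffs(counts, min_log_cpm=1.0, min_abs_log_fc=2.0, prior=0.5):
    """Keep genes passing both an expression and a fold-change cutoff.

    counts maps gene -> (control_count, treated_count), one sample each.
    Thresholds and prior count are arbitrary illustrative values.
    """
    lib_control = sum(c for c, _ in counts.values())
    lib_treated = sum(t for _, t in counts.values())
    kept = {}
    for gene, (c, t) in counts.items():
        # log2 counts per million, with a prior count to handle zeros
        lcpm_control = log2((c + prior) / lib_control * 1e6)
        lcpm_treated = log2((t + prior) / lib_treated * 1e6)
        avg_log_cpm = (lcpm_control + lcpm_treated) / 2
        log_fc = lcpm_treated - lcpm_control
        if avg_log_cpm >= min_log_cpm and abs(log_fc) >= min_abs_log_fc:
            kept[gene] = (round(avg_log_cpm, 2), round(log_fc, 2))
    return kept


# Toy data: "other_genes" stands in for the rest of each library,
# chosen so that both libraries sum to one million reads.
toy = {
    "geneA": (10, 10),
    "geneB": (10, 1000),
    "geneC": (0, 3),          # high logFC, but only from tiny counts
    "geneD": (5000, 5200),    # highly expressed, negligible change
    "other_genes": (994980, 993787),
}
print(filter_by_cutoffs(toy))
```

On this toy input only geneB passes: geneC has a large fold change but fails the expression cutoff, which is the "high FC due to small counts" case. Note that the thresholds are picked by hand, not estimated from the data.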

I agree though that my statement that you cannot make any claims was probably too strong, as in situations like the one you describe one probably has at least some data-driven basis for drawing conclusions. It depends on what the underlying question is and how dramatic the expression changes are.


As a biologist with limited knowledge, I think statistics can be a double-edged sword. It is only a feeling; I do not have the capacity or any means to demonstrate it. And honestly, sometimes I believe this could be an advantage.

This is one case. There are other cases in which a strong commitment and effort is put into the use of statistics, such as the analysis of alternative splicing with Illumina short reads. My common sense tells me that it is impossible to reach that kind of conclusion from paired or unpaired reads that are only 100 bases long. PacBio sequencing is proving me right. But I see that almost the whole community was convinced by that approach. The idea of using statistical analysis to reconstruct alternative splicing from short reads was firmly established as a canonical and advisable method.

This leads me to think: are we trusting statistics too much? According to statistics, no meaningful data could be obtained from my SRA data.
