Question

How to perform DE analysis on a data set in which biological replicates have high variance?

0

3.2 years ago

Gisele • 0

Hello guys! I would really appreciate it if someone could help me with DE analysis. This is my challenge:

I have four conditions and five biological replicates for each condition. In a PCA (DESeq2), most of the replicates don't cluster together. My first question is: can I simply analyze them using DESeq2 or edgeR, or would this be wrong given this replicate scenario? Second, is there a way to filter out the genes whose replicates are not good and keep only the ones that are consistent among the replicates of the same condition before running the DE analysis?

Thanks for the help!

[Figure: PCA plot]

high replicates biological variance DE • 2.6k views

ADD COMMENT • link 3.2 years ago by Gisele • 0

score 0 · Answer 1 · 2021-09-23

0

3.2 years ago

bigomics.team ▴ 90

If the same 5 replicates are repeated in each condition (e.g. 5 cell lines), you can do a paired t-test, even if the PCA doesn't look nice. The paired t-test will correct for the individual bias of each replicate. Alternatively, if the 5 replicates are repeated across the conditions, you can perform batch correction on the replicates. As a last resort, you can use SVA (surrogate variable analysis) to correct for any batch effects.

ADD COMMENT • link 3.2 years ago by bigomics.team ▴ 90

0

Since when is a paired t-test best practice for RNASeq?

ADD REPLY • link 3.2 years ago by swbarnes2 14k

0

You can do a paired t-test on voom-limma transformed data. For edgeR and DESeq2 you will need to put the sample pairing in the linear model as a covariate.
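As a rough sketch of what "pairing as a covariate" could look like, assuming (hypothetically) that replicate k of every condition shares a source such as the same cell line: the column names `condition` and `replicate`, the sample names, and the simulated counts below are all made up for illustration and are not from the original data.

```r
# Hypothetical layout: 4 conditions x 5 replicates, where replicate k of every
# condition shares a source (e.g. the same cell line or preparation batch).
coldata <- data.frame(
  condition = factor(rep(c("A", "B", "C", "D"), each = 5)),
  replicate = factor(rep(paste0("r", 1:5), times = 4)),
  row.names = paste0("s", 1:20)
)

# Additive design: replicate is the blocking factor, condition the effect of interest.
design_mat <- model.matrix(~ replicate + condition, data = coldata)
print(colnames(design_mat))

# The same formula plugs into DESeq2 (this part runs only if the package is installed).
if (requireNamespace("DESeq2", quietly = TRUE)) {
  set.seed(1)
  # Simulated counts, purely for illustration
  counts <- matrix(rnbinom(200 * 20, mu = 100, size = 5), nrow = 200,
                   dimnames = list(paste0("gene", 1:200), rownames(coldata)))
  storage.mode(counts) <- "integer"
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                                        design = ~ replicate + condition)
  dds <- DESeq2::DESeq(dds)
  # Condition B vs A, adjusted for the replicate effect
  res <- DESeq2::results(dds, contrast = c("condition", "B", "A"))
}
```

This only makes sense if the replicates really are matched across conditions. If they are independent samples, the `replicate` term has no meaning and should be left out of the design.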

ADD REPLY • link 3.2 years ago by bigomics.team ▴ 90

0

You "can", but is it best practice? "5 biological replicates" shouldn't mean "5 different cell lines". It should mean "5 samples that are the same except for incidental variance". You can't do a paired t-test on that.

ADD REPLY • link 3.2 years ago by swbarnes2 14k

0

That's why I started my sentence with "if". Voom-limma is well accepted and has been shown to be as good as DESeq2 or edgeR; there are numerous papers about that. Regarding the cell lines, I clearly wrote e.g. (= for example). swbarnes2, if you know a better method, please suggest it.

ADD REPLY • link 3.2 years ago by bigomics.team ▴ 90

score 0 · Answer 2 · 2021-09-23

0

3.2 years ago

swbarnes2 14k

I would hazard that the ground truth of your experiment is that your conditions do not affect RNA expression much. You can do DESeq2, but I predict it will return very few genes as DE.

You certainly cannot just throw away genes that don't agree with what you think the experiment should look like.

You might want to investigate and see if there is any experimental reason driving PC1.
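One way to sketch that check is to regress the PC1 scores on each sample-level covariate and see which one explains the most variance. The metadata columns (`batch`, `lib_size`) and the simulated counts with a planted batch effect are assumptions for illustration only; with a real DESeq2 object one would more likely start from `vst(dds)` and `plotPCA(..., intgroup = ...)`.

```r
set.seed(42)
n_genes <- 500

# Hypothetical sample metadata: 4 conditions x 5 replicates plus a made-up batch label
meta <- data.frame(
  condition = factor(rep(c("A", "B", "C", "D"), each = 5)),
  batch     = factor(rep(c("b1", "b2"), times = 10))
)

# Simulated counts with a batch effect planted in the first 100 genes
counts <- matrix(rpois(n_genes * 20, lambda = 50), nrow = n_genes)
b2 <- meta$batch == "b2"
counts[1:100, b2] <- counts[1:100, b2] * 4L

# Library size per sample, then log2 counts-per-million
meta$lib_size <- colSums(counts)
log_cpm <- log2(t(t(counts) / meta$lib_size) * 1e6 + 1)

# PCA on samples (rows of the transposed matrix)
pca <- prcomp(t(log_cpm), center = TRUE, scale. = FALSE)
pc1 <- pca$x[, 1]

# R^2 of PC1 regressed on each covariate: a high value flags a likely driver
r2 <- sapply(c("condition", "batch", "lib_size"), function(v) {
  summary(lm(pc1 ~ meta[[v]]))$r.squared
})
print(round(r2, 2))
```

If a technical covariate (extraction date, lane, library size, RIN) explains PC1 much better than the condition does, that covariate is a candidate for the design formula.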

ADD COMMENT • link 3.2 years ago by swbarnes2 14k

0

Thanks for the ideas! However, I got lots of DE genes; in some comparisons I got more than 20k! But what bothers me the most is that lots of the DE genes have bad concordance across replicates. I will put an example so you can have a better idea.

[Image: example plot]

ADD REPLY • link 3.2 years ago by Gisele • 0

0

You have to decide for yourself if all those genes expressed in only a couple of samples are real or not. You can filter away genes which, say, have fewer than 5 samples with > 10 counts, or whatever, but if that's all really biologically real, then I guess you should keep it. I assume you checked and all of these samples have a healthy number of total gene counts?
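A minimal sketch of that kind of filter on a plain count matrix, using the thresholds from the sentence above (more than 10 counts in at least 5 samples). The toy matrix and the names `min_count` and `min_samples` are made up for illustration.

```r
# Toy count matrix: 3 genes x 10 samples (values invented for illustration)
counts <- matrix(
  c( 0,  0,  0, 250, 300,  0,  0,  0,  0,  0,   # gene1: high in only 2 samples
    20, 35, 18,  40,  22, 25, 31, 19, 27, 33,   # gene2: expressed in every sample
     1,  0,  2,   0,   1,  0,  3,  0,  1,  0),  # gene3: near zero everywhere
  nrow = 3, byrow = TRUE,
  dimnames = list(c("gene1", "gene2", "gene3"), paste0("s", 1:10))
)

min_count   <- 10  # a sample "counts" for a gene if it has more than this many reads
min_samples <- 5   # require at least this many such samples

# Number of samples above the threshold per gene, then the keep flag
keep <- rowSums(counts > min_count) >= min_samples
filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))

# Sanity check on sequencing depth: total counts per sample
lib_sizes <- colSums(counts)
print(lib_sizes)
```

With a DESeq2 object the same logic would apply to `counts(dds)`, followed by `dds <- dds[keep, ]`.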

ADD REPLY • link 3.2 years ago by swbarnes2 14k

0

The DESeq2 manual doesn't recommend performing such a filter because it already does independent filtering. But I could try it to see what happens. Thanks for the tips!

ADD REPLY • link 3.2 years ago by Gisele • 0

1

You will find many posts at http://support.bioconductor.org where the developer recommends some minimum prefiltering, e.g. at least 10 counts in at least 3 samples. True, that is (I think) not explicitly in the manual, but I think it makes good sense, as the genes you show are probably not very reliable, at least not at that sample size. Three samples are zero and two are high, so which do you believe? That is not easy to say with 5 samples in that group; one would need many more to see whether zero or high is the reproducible truth, hence the conservative approach would be to filter. For example, you could require that one of the groups has at least 10 counts in all or in the majority of its samples; there is no strict rule for that. edgeR::filterByExpr does this type of filtering with well-tested defaults and is group-aware, so you could simply filter out from dds those genes flagged by the edgeR filter.

Something like:

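#/ assuming you had your factor of interest as "group":
library(DESeq2)
library(edgeR)
raw_counts <- counts(dds)  # raw count matrix from the DESeqDataSet
keep <- filterByExpr(raw_counts, group = dds$group)  # group-aware expression filter
dds <- dds[keep, ]  # keep only the genes that pass the filter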