I know similar topics have been posted before (rna seq replicates not clustering). However, I want to prompt some discussion on this topic with the plots and examples given here.
I have triplicates for three conditions (EV, Mut, and WT) in mouse cells. I followed the standard workflow for analyzing RNA-seq data: alignment (STAR) -> raw counts (HTSeq with intersection-strict mode) -> TMM normalization in edgeR -> VSN to 'dissociate the dependence of the variance on the mean intensity' -> use the VSN values for exploratory analysis (QC).
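For concreteness, that normalization/transformation step looks roughly like the sketch below. The counts file name, sample layout, and the exact way TMM scaling is combined with VSN are placeholder assumptions, not my exact code:

```r
## Rough sketch of the TMM + VSN QC step described above; the counts file,
## sample order, and condition labels are placeholders.
library(edgeR)
library(vsn)

counts <- as.matrix(read.delim("htseq_counts.txt", row.names = 1))  # genes x samples
condition <- factor(rep(c("EV", "Mut", "WT"), each = 3))

dge <- DGEList(counts = counts, group = condition)
dge <- calcNormFactors(dge, method = "TMM")            # TMM normalization

## Variance-stabilizing transform of the TMM-scaled values, for QC only.
## justvsn() returns glog2-scale values whose variance should no longer
## depend strongly on the mean.
norm_counts <- cpm(dge, normalized.lib.sizes = TRUE)   # TMM-scaled CPM
vsn_mat <- justvsn(norm_counts)
meanSdPlot(vsn_mat)                                    # check the mean-SD trend is flat
```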
I did both PCA and pairwise-correlation clustering to find 'outliers'.
- PCA plot:
- pairwise-correlation heatmap:
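Roughly, plots like these can be produced as in the sketch below, continuing from the `vsn_mat` and `condition` objects above. The top-500 variable genes and Spearman correlation are illustrative choices, not necessarily what I used:

```r
## Sketch of the exploratory plots; assumes vsn_mat and condition from above.
library(pheatmap)

## PCA on samples (rows after transposing), restricted to the most variable genes
rv  <- apply(vsn_mat, 1, var)
top <- order(rv, decreasing = TRUE)[seq_len(min(500, length(rv)))]
pca <- prcomp(t(vsn_mat[top, ]), center = TRUE, scale. = FALSE)
pct <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

plot(pca$x[, 1], pca$x[, 2],
     col = as.integer(condition), pch = 19,
     xlab = paste0("PC1 (", pct[1], "%)"),
     ylab = paste0("PC2 (", pct[2], "%)"))
text(pca$x[, 1], pca$x[, 2], labels = colnames(vsn_mat), pos = 3, cex = 0.7)

## Pairwise sample-sample correlation heatmap
cor_mat <- cor(vsn_mat, method = "spearman")
pheatmap(cor_mat,
         annotation_col = data.frame(condition, row.names = colnames(vsn_mat)))
```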
From the PCA plot, I can tell that one sample from each condition (wt_rep2, ev_rep1, and mut_rep1) is an obvious outlier. The pairwise-correlation heatmap confirms those outliers. However, the remaining samples do not cluster clearly by condition. Yes, these data are thoroughly messed up.
Now here are three options for how I could proceed (with questions):
- Delete the outliers and use the remaining samples for DE analysis. Since I only have 3 replicates per condition, can I really run a DE analysis with just two samples per group?
- Treat the outlier/non-outlier split as a batch effect and incorporate it as a covariate in the design matrix (~ condition + batch), even though I have no idea what condition-independent factor could have caused these samples to be outliers. Is this valid? (See the sketch after this list.)
- Ignore the QC and go ahead with the DE analysis. But then what is the point of doing exploratory QC at all?
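For option 2, the design would look something like the sketch below. To be clear, the `batch` factor here is made up, simply labelling the three flagged outliers, since no real batch variable has been recorded:

```r
## Sketch only: 'batch' is a hypothetical grouping of the three outliers
## (ev_rep1, mut_rep1, wt_rep2) versus the rest, not a recorded variable.
batch <- factor(c("B", "A", "A",   # EV:  rep1 flagged as outlier
                  "B", "A", "A",   # Mut: rep1 flagged as outlier
                  "A", "B", "A"))  # WT:  rep2 flagged as outlier
design <- model.matrix(~ batch + condition)

## edgeR route, continuing from the DGEList above
dge <- estimateDisp(dge, design)
fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, coef = "conditionMut")  # Mut vs the reference level (EV)
topTags(qlf)
```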
Just to help others: is it likely that there is a batch effect for the samples to the left of the PCA bi-plot? How does your PCA bi-plot look if you just use 'my' code: A: PCA plot from read count matrix from RNA-Seq?
PC1-5 scatterplots: https://ibb.co/nB1Bvp
PC1-3 bi-plots: https://ibb.co/iLKaMU
I would not go by CPM. I would analyse the samples in DESeq2 and produce regularised log (rlog) counts. Prior to normalisation, I would remove transcripts whose mean raw count is <10 or <15. Then regenerate the plots.
If you proceed to the DE analysis with the current samples, then I imagine that many transcripts would be flagged by Cook's distance outlier filtering because of their large variance. Your PC1 explains >40% of the variance, which is quite substantial.
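Roughly, that would look like the sketch below (object names are placeholders carried over from the sketches above; the <10 cutoff and `blind = TRUE` are just illustrative choices):

```r
## Sketch of the DESeq2 / rlog route; 'counts' and 'condition' stand in for
## the poster's actual matrix and sample table.
library(DESeq2)

coldata <- data.frame(condition = condition, row.names = colnames(counts))
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)

## Remove transcripts with a low mean raw count before transforming
dds <- dds[rowMeans(counts(dds)) >= 10, ]

## Regularised log transform for the exploratory plots
rld <- rlog(dds, blind = TRUE)
plotPCA(rld, intgroup = "condition")

## After the DE step, Cook's distances can be inspected per sample
dds   <- DESeq(dds)
cooks <- assays(dds)[["cooks"]]
boxplot(log10(cooks), las = 2, main = "Cook's distances per sample")
```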
This is not CPM; it is VSN output, which is like rlog in DESeq2.
Why is 'cpm' in your PCA bi-plot titles? VSN is not like rlog in DESeq2.
It is just a name; I labelled the VSN output 'vsnCPM'. The plots were made from the VSN output. VSN is a variance-stabilizing method, which is what rlog also aims to do, but rlog is less sensitive to sample size.
I did remove lowly expressed genes before normalization. Actually, I tried PCA on the top 50% of genes, and the results are very similar. I have very low depth for these samples: after mapping and filtering, I have only 2M ~ 9M (3M on average) reads left per sample. When filtering lowly expressed genes, I only required that the top 3 samples each have at least 1 mapped read, which filtered out 40% of the genes called. Do you think the results would change if I increased my expression filtering cutoff?
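For reference, a stricter and library-size-aware filter could look like the sketch below, using edgeR's `filterByExpr` with default settings (an assumption; the defaults may need tuning for such shallow libraries):

```r
## Sketch of a stricter filter that accounts for library size rather than
## a fixed raw-count cutoff; 'dge' and 'condition' come from the earlier sketch.
library(edgeR)

keep <- filterByExpr(dge, group = condition)
table(keep)                                     # how many genes survive the filter
dge_filt <- dge[keep, , keep.lib.sizes = FALSE]
dge_filt <- calcNormFactors(dge_filt, method = "TMM")
```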
You have a small sample size and shallow sequencing, so it is no surprise there is a lot of noise. Do the "outliers" correlate with sequencing depth?
Well, with only three samples per condition, I cannot really speak of correlation. I do observe that the outliers have relatively higher read counts. Here are the read counts per sample:
What is the % variation explained by your PC1?
Sequencing depth does not appear to be the sole reason, as Th2KO_wt_rep1 also has high depth but groups fine with the other samples.
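A quick, more formal check would be something like the sketch below (assuming the `dge` and `pca` objects from the earlier sketches):

```r
## Sketch: does library size track the first principal component?
libsize <- colSums(dge$counts)
plot(libsize, pca$x[, 1],
     xlab = "Library size (total counts)", ylab = "PC1", pch = 19)
text(libsize, pca$x[, 1], labels = names(libsize), pos = 3, cex = 0.7)
cor.test(libsize, pca$x[, 1], method = "spearman")
```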
If you have still not sorted this out, can you show your code and also clearly state the source of the samples and any errors that occurred during sample processing in the wet lab? You never clearly stated this, from what I can see. One cannot introduce a batch variable without a logical reason, of course.