Question

RNA Seq Sample Outlier

8 months ago

Emilie ▴ 10

Hi all,

I am having some issues with some RNA-seq data and a potential outlier. Below you will find the MDS plot (plotMDS function in R, edgeR). Sample X9026 has been eliminated based on sequencing quality, but X9001 looks fine regarding sequencing QC.

To me, just by looking at the MDS plot, 9001 is an outlier. I ran the DEG analysis (edgeR), and when I kept 9001 in, there were 100+ DEGs, but when I took the sample out, there were 0, indicating to me that the single sample was driving those differences.

What do y'all think about sample 9001? Does anyone know of some methods that identify outlier samples other than looking at the MDS plot?

Thanks in advance!

[Image: MDS plot of the samples]

MDS RNA-Seq • 895 views

ADD COMMENT • link updated 8 months ago by ATpoint 85k • written 8 months ago by Emilie ▴ 10

Sample X9026 has been eliminated based on sequencing quality

What was different about that sample (and X9001 as well)? Were they processed/sequenced along with the rest?

ADD REPLY • link 8 months ago by GenoMax 147k

All samples were processed at the same time!

9026 produced a significantly smaller number of reads than the other samples, and when the lab redid the library prep they did not see any significant quality improvement, so we decided not to resequence it.

Sequencing wise and QC wise, 9001 is on par with the rest!

ADD REPLY • link 8 months ago by Emilie ▴ 10

Can you show the edgeR code? To me, a single sample driving the results sounds like no prefiltering was done to remove genes that have essentially no counts apart from a few high outlier values.

ADD REPLY • link 8 months ago by ATpoint 85k

Sure! Here is the beginning part of my code. I did include pre-filtering, but sample 9001 still is an "issue".

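library(edgeR)

# x: count matrix, genes in rows and samples in columns (not shown in the post)
y <- DGEList(counts=x)
dim(y)
y
# Distribution of average log-CPM before filtering
AveLogCPM <- aveLogCPM(y)
hist(AveLogCPM)
AveLogCPM2 <- aveLogCPM(y)
hist(AveLogCPM2)
# Tried various values here
keep <- rowSums(cpm(y)>1) >= 8
y <- y[keep, , keep.lib.sizes=FALSE]
y$samples$norm.factors
dim(y)
# TMM normalization on the filtered counts
y <- calcNormFactors(y, method = 'TMM')
y$samples
dim(y)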

ADD REPLY • link 8 months ago by Emilie ▴ 10

I'd use filterByExpr as in the user guide.
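
For reference, a minimal sketch of what that filtering step might look like. The count matrix and the `group` factor here are simulated placeholders (assumptions for illustration, not the data from this thread); with a real experiment you would build the DGEList from your own counts and could pass a design matrix via `design=` instead of `group=`.

```r
library(edgeR)

set.seed(1)
# Toy data (assumption): 1000 genes x 8 samples, two hypothetical groups of 4.
# The first 500 genes are barely expressed, the last 500 are well expressed.
mu <- rep(c(0.5, 50), each = 500)
counts <- matrix(rnbinom(8000, mu = mu, size = 2), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("S", 1:8)))
group <- factor(rep(c("ctrl", "trt"), each = 4))

y <- DGEList(counts = counts, group = group)

# filterByExpr keeps genes with enough counts in enough samples, where the
# required number of samples is derived from the smallest group size
keep <- filterByExpr(y, group = group)
table(keep)

# Drop the filtered genes and recompute library sizes
y <- y[keep, , keep.lib.sizes = FALSE]

# Normalize after filtering
y <- calcNormFactors(y, method = "TMM")
dim(y)
```

The point of using this over a hand-picked CPM cutoff is that the threshold adapts to library size and group size, so genes with near-zero counts and a few large values in one sample are less likely to slip through.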

ADD REPLY • link 8 months ago by ATpoint 85k

score 0 · Answer 1 · 2024-03-08

8 months ago

swbarnes2 14k

I'd definitely remove 9001. Find out what genes are driving PC2. Maybe the sample is contaminated with something. Given how close your other samples are all clustering, I'd remove 9001 even if you don't see a smoking gun as to why it's so different.
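
A rough sketch of how one might look for the genes behind that separation, and get a second opinion beyond the MDS plot. Everything below runs on simulated counts with a made-up outlier sample `S8` (an assumption for illustration); the log-CPM is computed by hand in base R, whereas with edgeR you would normally use `cpm(y, log = TRUE)` on the filtered, normalized object.

```r
set.seed(1)
# Toy data (assumption): 500 genes x 8 samples with gene-specific mean counts
mu <- 2^seq(4, 10, length.out = 500)
counts <- matrix(rnbinom(4000, mu = mu, size = 50), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("S", 1:8)))
# Make S8 a hypothetical outlier by inflating 20 genes
counts[1:20, "S8"] <- counts[1:20, "S8"] * 20

# Simple log2-CPM with a 0.5 pseudocount
lib.size <- colSums(counts)
logcpm <- log2(sweep(counts + 0.5, 2, lib.size + 1, "/") * 1e6)

# PCA with samples as observations (hence the transpose)
pca <- prcomp(t(logcpm), center = TRUE, scale. = FALSE)
round(pca$x[, 1:2], 2)

# PC on which the suspect sample separates. In this toy data it is PC1;
# for the plot in this thread it would be the second dimension.
pc <- 1

# Genes with the largest absolute loadings on that PC
top <- head(sort(abs(pca$rotation[, pc]), decreasing = TRUE), 20)
names(top)

# Complementary check: median Spearman correlation of each sample with the rest.
# A sample that sits clearly below the others is worth a closer look.
cors <- cor(logcpm, method = "spearman")
diag(cors) <- NA
med_cor <- sort(apply(cors, 2, median, na.rm = TRUE))
med_cor
```

If the top-loading genes turn out to belong to another tissue, species, or a stress/degradation signature, that would support the contamination idea; if nothing stands out, the correlation check at least gives a number to report alongside the plot.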

ADD COMMENT • link 8 months ago by swbarnes2 14k

That was my overall thought process; the other samples are all so close together (clearly no treatment effect, lol). I have had other data sets where there is much less clustering of samples, and if that were the case here, I would not be as concerned about 9001.

ADD REPLY • link 8 months ago by Emilie ▴ 10