Question

Sample outlier detection in DESeq2

2

Entering edit mode

7.5 years ago

nikitavlassenko ▴ 120

I have 16 samples of bulk-RNA-seq data. There are 4 groups, so 4 samples are in each group. I want to check for outliers in each of the groups, and also get some characteristic estimate for each sample in each group of how much of an outlier it is. I am using DESeq2. There are threads that I found in Google, but I do not think they answer my question:

https://support.bioconductor.org/p/35918/

With this one I get a bunch of plots which all look approximately the same. abline() for some reason is not drawn for me, not sure what is the purpose of it (showing genes outliers? I am not interested in it).

If we look to the official documentation:

http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#approach-to-count-outliers

The assays(dds)[["cooks"]] gives the measures for outliers for each gene and sample, but I need to get it just for each sample in each group. I did a boxplot and it looks like that:

The PCA looks like that:

Which makes me think that there is an outlier in B6 group, the rightmost sample. The output from summary(res) gives the following:

What metrics to use to assess how much of an outlier each sample is in each group? I am a bit confused.

DESeq2 rna-seq • 9.5k views

ADD COMMENT • link updated 7.5 years ago by GenoMax 154k • written 7.5 years ago by nikitavlassenko ▴ 120

1

Entering edit mode

For future reference : How to add images to a Biostars post

ADD REPLY • link 7.5 years ago by GenoMax 154k

score 2 · Answer 1 · 2018-04-24

2

Entering edit mode

7.5 years ago

russhh 5.8k

Don't be too quick to throw away samples. That B6 sample is no more an outlier than the ET sample on the far-left, by eye. Run PCA on subsamples: subselect genes and use leave-one-out over the samples (you mentioned 16 samples, but presented 12); for each plie of leave-one-out, check which group your held-out sample clusters with (which is the nearest other point, or which is the nearest mean point in the PCA). What values are you passing to pca: log-counts or log-fpkm or on some zero-shifted version of the counts? Have you filtered-out your poorly-detected features before running PCA? If so, was this outlier robust to your arbitrary cutpoint?

ADD COMMENT • link 7.5 years ago by russhh 5.8k

3

Entering edit mode

I neither see outliers here. There looks to be good separation between B6, ET, and GF along both PC1 and PC2, in fact. Proceed with your analysis as it is.

Cook's Distance is one of the chief metrics used to assess outliers.

Going by a PCA bi-plot, one would look are various things:

% variation explained along axes
each sample's value to each PC (a true outlier may have a value of, e.g., -200, while all other samples are 'scrunched up' between -20 and +20 [along PC1])
Z-score of each sample's value (for example, a Z-score >3 along PC1 can be assumed outlier)

Edit: you pasted your code just as I was typing. Although there are outlier genes in that sample, the sample itself is not an outlier. In addition, there are most likely outlier genes in the other samples, too. Biology doesn't follow the rules that we'd like it [to follow]. Whilst we desperately would like everything to fit neatly into our statistical models, things never do. I see minimal justification for removing any sample from your study based on the information that you've provided. If you wanted to re-do normalisation and set a harder threshold for filtering low count transcripts, then that may work.

Note Bene: it looks like you've used the PCA function that's in-built into DESeq2. That function automatically removes a large proportion of your genes based on low variance without even informing the user (it should inform because all users would naturally assume that all transcripts are being used). If you wanted to perform an entirely unbiased PCA analysis, then use my code that utilises base R functions: A: PCA plot from read count matrix from RNA-Seq (start from your regularised log expression levels)

ADD REPLY • link 4.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Sorry Kevin

We have raw read counts from sequencing using a technique that only sequence 1402 immune related genes (HTG)

I used DESeq2 and I am seeing my samples are too disperse

enter image description here

And as you are seeing in this link I am not able to separate genes based on the differential expression at all

http://amp.pharm.mssm.edu/clustergrammer/viz_sim_mats/5d8cb789d367f150a908a313/a.txt

What one can do in such a samples please ?

ADD REPLY • link 6.1 years ago by zizigolu ★ 4.4k