Question

DESeq2: Filtering low counts per sample before differential expression analysis

6.0 years ago

Cdk • 0

Good morning everyone,

While I was reviewing the literature, I found the following article: Threshold-seq: a tool for determining the threshold in short RNA-seq datasets (Bioinformatics. 2017 Jul 1;33(13):2034-2036. doi: 10.1093/bioinformatics/btx073; https://www.ncbi.nlm.nih.gov/pubmed/28203700). It describes a tool that estimates how many reads need to support a short RNA molecule in a given dataset before it can be considered different from 'background'.

My question is: can I use this tool to obtain a read threshold for each sample (let's say an integer of 14 reads), set to zero every value in my count matrix that falls below its sample's threshold, and then provide this count matrix to the DESeq2 functions for differential expression analysis?

I understand that DESeq2 expects un-normalized counts as input. Does this kind of filtering affect the internal model of DESeq2? If so, may I ask how exactly?

I have seen the answers about filtering in other posts, like this one: https://support.bioconductor.org/p/65256/, but I do not really know how to apply them to my question. In particular, the script outputs an integer for every sample, so I am quite puzzled about how I could apply these thresholds with DESeq2.

Deseq2 threshold-seq • 10.0k views

ADD COMMENT • link 6.0 years ago by Cdk • 0

Have a look at the pre-filtering section of the DESeq2 manual: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pre-filtering

ADD REPLY • link 6.0 years ago by grant.hovhannisyan ★ 2.6k

Thank you for your reply.

But actually this is not something that I can translate directly: there the filtering is done by gene (row of the count matrix), while Threshold-seq outputs a number that would be applied to each column (sample).
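To make the row/column distinction concrete, here is a minimal base-R sketch. The count matrix and the per-sample thresholds are made-up illustrative values (they are not Threshold-seq output), and the row filter only mirrors the `rowSums(...) >= 10` style of pre-filter from the DESeq2 vignette on a plain matrix rather than a DESeq2 object.

```r
# Toy count matrix: rows are genes, columns are samples (hypothetical values).
counts <- matrix(
  c(20,  3, 15,
     5, 30,  2,
     1,  0,  4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("S1", "S2", "S3"))
)

# Row-wise pre-filter: keep genes whose total count across samples reaches 10.
keep <- rowSums(counts) >= 10
row_filtered <- counts[keep, , drop = FALSE]

# Column-wise thresholding: one hypothetical cutoff per sample.
thresholds <- c(S1 = 14, S2 = 10, S3 = 5)

# TRUE wherever a count is below the cutoff of its own sample (column).
below <- sweep(counts, 2, thresholds[colnames(counts)], FUN = "<")

# Set those entries to zero; the matrix keeps all of its rows and columns.
col_thresholded <- counts
col_thresholded[below] <- 0
```

The row filter removes whole genes, whereas the column-wise version keeps every gene and only changes individual entries. The sketch shows the mechanics only; it does not say whether the zeroed matrix is a suitable input for DESeq2.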

ADD REPLY • link 6.0 years ago by Cdk • 0

If you are using DESeq2, then, like Grant, I also recommend following the advice within the DESeq2 tutorial. There is advice for setting thresholds based on both raw and normalised count values.

I also looked at the Threshold-seq manuscript and, generally speaking, disagree with it. First, they have performed very little benchmarking against real datasets. Second, the documentation is poor. Third, they make the program available as a ZIP file that even contains hidden Mac system files, including ._.DS_Store. Fourth, the program is available on neither CRAN nor Bioconductor. Finally, I disagree with the premise that there exists a 'background' in RNA-seq experiments that is in any way like the background in microarrays. In microarrays, the background is due to fluorescent intensities; in RNA-seq, whilst many transcripts may return very low count values, these may genuinely be real and reflect transcriptional 'noise'. Certain experiments may actually want to look at these transcripts. In a 'heightened' transcriptional cellular state (for example, during proliferation), transcriptional noise may be elevated; however, again, these are likely real transcripts that may have no function.

ADD REPLY • link 6.0 years ago by Kevin Blighe 88k

@Kevin Blighe So is there a better approach for this filtering?

ADD REPLY • link 4.2 years ago by Dr.Animo ▴ 130

Please check the DESeq2 vignette, where this is mentioned. The zFPKM package also already addresses filtering for 'background' in RNA-seq data - please take a look at that too.

ADD REPLY • link 4.2 years ago by Kevin Blighe 88k

The method described in the manual is not appropriate, because it sums up each row and then applies the threshold. For example:

ID     W1  W2  W3  M1    M2  M3
Gene1  0   0   0   2313  55  699

In this example, you can see that the described method will fail to filter out these low counts.
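The Gene1 example can be checked with a short base-R sketch. The first filter is the row-sum form of pre-filter; the second, which counts how many samples reach a cutoff, is a commonly used alternative. The cutoff of 10 and the value of `min_samples` are assumptions chosen for illustration.

```r
# The single-gene example from above as a count matrix.
counts <- matrix(
  c(0, 0, 0, 2313, 55, 699),
  nrow = 1,
  dimnames = list("Gene1", c("W1", "W2", "W3", "M1", "M2", "M3"))
)

# Row-sum pre-filter: total count across all samples must reach 10.
keep_sum <- rowSums(counts) >= 10

# Alternative: require at least `min_samples` samples with a count >= 10.
min_samples <- 3  # assumed size of the smallest group (W or M)
keep_n <- rowSums(counts >= 10) >= min_samples

print(keep_sum)
print(keep_n)
```

Both filters retain Gene1: the row sum is 3067, and all three M samples are at or above 10. Neither filter looks at the zeros in W1 to W3 separately, because both operate on the gene as a whole.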

ADD REPLY • link 4.2 years ago by Dr.Animo ▴ 130