Question

How to deal with zeros in tag-seq NGS analysis

0

10.8 years ago

nash.claire ▴ 510

Hi,

I wonder if anyone can help. I am trying to analyse my NGS tag-sequencing data. We have a list of aligned genes with read counts and read counts per million for two different cell types (let's call them X and Y), plus the fold changes in gene expression between them. We are mainly interested in cell type X: we want to know which genes are enriched in X compared to Y.

However, the data we get back from the GeneProf programme has a considerable number of zero tag counts for aligned genes in both the X and Y cell types. Obviously, GeneProf can't compute a fold change when either tag count is zero, which leaves me with a large number of blank fold changes. Is there a standard way of dealing with this? If I set all the zeros to a constant of 1, will this skew some of my fold changes? Alternatively, I could set all the zeros to a very small constant such as 0.0000000001, but then the fold changes I get will be massive numbers, so I'm not sure that is any good either.

Can anyone help?

RNA-Seq alignment next-gen • 2.9k views

updated 4.4 years ago by Ram 45k • written 10.8 years ago by nash.claire ▴ 510

Ram · Answer 1 · 2014-10-24

1

10.8 years ago

Istvan Albert 103k

This problem is very common and does not have a universal answer other than the somewhat generic one below.

As you note, you can't really compute a fold change if one of the values is zero. What you can do is compute other statistics, such as tests that compare whether the two distributions have the same mean, variance, etc. (a t-test, for example, compares means).

Many statistical tests work just fine with zero counts; it is then a matter of not interpreting fold changes for these rows.
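To make the effect of the two constants raised in the question concrete, here is a minimal sketch, not a recommended pipeline. It adds the same constant to both values before taking a log2 ratio and flags the rows that contained a zero so they can be left uninterpreted. The gene names and counts-per-million values are hypothetical, and `log2_fold_change` and `summarise` are illustrative helpers, not part of GeneProf.

```python
import math

def log2_fold_change(cpm_x, cpm_y, pseudocount):
    """Return log2((X + c) / (Y + c)) for one gene."""
    return math.log2((cpm_x + pseudocount) / (cpm_y + pseudocount))

# Hypothetical counts-per-million for cell types X and Y (not real data).
genes = {
    "geneA": (8.0, 0.0),    # zero in Y
    "geneB": (0.0, 8.0),    # zero in X
    "geneC": (0.0, 0.0),    # zero in both
    "geneD": (64.0, 16.0),  # no zeros
}

def summarise(genes, pseudocount):
    """Compute the log2 fold change per gene and flag rows with a zero."""
    rows = {}
    for name, (x, y) in genes.items():
        rows[name] = {
            "log2fc": log2_fold_change(x, y, pseudocount),
            "has_zero": x == 0 or y == 0,
        }
    return rows

for c in (1.0, 1e-10):
    print(f"pseudocount = {c}")
    for name, row in summarise(genes, c).items():
        flag = " (zero count, interpret with care)" if row["has_zero"] else ""
        print(f"  {name}: log2FC = {row['log2fc']:+.2f}{flag}")
```

With a constant of 1, geneA comes out at about +3.17 and geneD at about +1.93 instead of exactly 2, so the constant slightly compresses fold changes for low-count genes. With 1e-10, geneD is essentially unchanged but geneA jumps to about +36.2, a value set almost entirely by the arbitrary constant and not by the data.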

updated 4.4 years ago by Ram 45k • written 10.8 years ago by Istvan Albert 103k

0

Thank you for your reply Istvan,

I was under the impression that statistics could not be done if you only have one replicate for each cell type or condition. I was hoping to take the fold changes and make kernel density plots of their distribution, to try to identify a fold change that is "significant" compared to the rest, as I understand that DESeq and Cuffdiff are meaningless without technical/biological replicates.
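For reference, the kernel density idea could be sketched roughly as below. This is only an illustration under stated assumptions: the log2 fold changes are made-up values, `gaussian_kde` and `rule_of_thumb_bandwidth` are hypothetical helper names, and the bandwidth uses the common normal-reference rule of thumb (1.06 × sd × n^(-1/5)). The curve describes the shape of the fold-change distribution; by itself it does not say which fold changes are significant.

```python
import math

def rule_of_thumb_bandwidth(values):
    """Normal-reference rule of thumb: 1.06 * sd * n^(-1/5)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return 1.06 * sd * n ** (-0.2)

def gaussian_kde(values, bandwidth):
    """Return a function giving the Gaussian kernel density estimate at x."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density

# Hypothetical log2 fold changes (X vs Y) for genes with no zero counts.
log2fc = [-0.4, -0.2, -0.1, 0.0, 0.1, 0.1, 0.3, 0.5, 2.8, 3.1]

bandwidth = rule_of_thumb_bandwidth(log2fc)
density = gaussian_kde(log2fc, bandwidth)

# Evaluate the density on a coarse grid and print a crude text plot.
for i in range(-4, 11):
    x = i * 0.5
    print(f"{x:+5.1f} {'#' * int(round(density(x) * 100))}")
```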

However, if you have any other strategy for this analysis then I'm very open to suggestions!

updated 4.4 years ago by Ram 45k • written 10.8 years ago by nash.claire ▴ 510

0

OK, that is now a different issue altogether. The first question is what to do with zero counts; the second is what to do if I don't have replicates. The second is unrelated to the first, and it actually calls the data into question even more. Without replicates, the fold change is even less reliable, as the errors in the two counts will add up quadratically and you don't have any way to mitigate that.
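As a rough illustration of how the errors combine, here is the standard first-order error propagation for a ratio. It assumes the errors in the two counts are independent, and the second line adds the further assumption that the counts are Poisson-distributed, which is an idealisation and not something established for this data set.

```latex
% Fold change R = X / Y with independent errors \sigma_X and \sigma_Y:
\left(\frac{\sigma_R}{R}\right)^2 \approx
  \left(\frac{\sigma_X}{X}\right)^2 + \left(\frac{\sigma_Y}{Y}\right)^2

% If the counts are Poisson, \sigma_X^2 = X and \sigma_Y^2 = Y, so:
\left(\frac{\sigma_R}{R}\right)^2 \approx \frac{1}{X} + \frac{1}{Y}
```

For example, with X = 4 and Y = 1 tags the right-hand side is 1/4 + 1 = 1.25, a relative error of about 112% on the fold change. As either count approaches zero the uncertainty grows without bound.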

The reason so many tools don't work without replicates is that one cannot infer anything useful without them.

The one potential way to use data with no replicates is to validate hypotheses derived by other means. So instead of this data being the driver for hypothesis discovery it becomes the means of validating a hypothesis derived in a different way.

updated 4.4 years ago by Ram 45k • written 10.8 years ago by Istvan Albert 103k