Question

DESeq2 bias when genes are missing from the annotation?

7.1 years ago

corend ▴ 70

Since this is about a Bioconductor package, I have also posted it here.

I am working on RNA-seq data.

I built my count table with kallisto, then imported it with tximport to work with DESeq2.

My reference is a set of cDNAs that is supposed to cover all the genes of my species, but the annotation is quite poor: when I align to these cDNAs, only 60% of reads map, compared with 95% against the whole genome.

I have 2 conditions (A and B) with 3 replicates in each condition.

My fear is this: if a gene is over-expressed in A, not expressed in B, and missing from my cDNA list, I expect to have fewer reads in A than in B. Could the DESeq2 normalization then create a bias?

Example (each number is one read, labelled with the gene it comes from):

A: 1 1 1 1 2 2 2 2 3 3

B: 1 1 1 1 2 3 3 3 3 3

Gene 3 is not annotated, so after normalization by DESeq2:

A: 1 1 1 1 1 2 2 2 2 2

B: 1 1 1 1 1 1 1 1 2 2

Gene 1 then appears over-expressed in B, which is not true.

How can I deal with this kind of problem?

Should I add a line in my table with "unmapped reads" to have a better normalization?

RNA-Seq DESeq2 • 1.8k views

7.1 years ago

h.mon 35k

If you are certain the culprit is an incomplete annotation, you can use Cufflinks or StringTie (recommended) to do a reference annotation-based transcript (RABT) assembly, then use this extended transcript set to run the kallisto / tximport / DESeq2 workflow.

It may be, however, that you have other problems, for example a high proportion of rRNA in your sequencing data. Did you check for other issues?

I don't know the proportion of rRNA in my data, but the sequencing was done after purifying polyA RNAs.

As you and the previous answer suggested, I will build a new GFF with Cufflinks; it seems to be the best option!

Reply by corend ▴ 70, 7.1 years ago

score 3 · Accepted Answer · 2017-11-13

I'll start from the end: adding unmapped reads will not help with normalization.

And for the main question: DESeq2 normalizes with the median of ratios. For each sample it takes the median, across genes, of the ratio between that sample's count and the gene's geometric mean over all samples, on the assumption that most genes have the same expression level in A and B. If this assumption holds for your data, then you're safe using DESeq2. You can start validating it by plotting expression levels in A against B and checking that you get a clean correlation. I think you'll be fine using DESeq2 normalization.
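To make the median-based normalization concrete, here is a minimal Python sketch of a median-of-ratios size factor calculation. It is a simplified re-implementation for illustration only, not DESeq2's own code, and the gene names and counts are made up: five genes that differ only by sequencing depth (B is twice as deep as A), plus one hypothetical A-specific gene `gX` that a poor annotation would miss.

```python
import math
from statistics import median


def median_of_ratios(counts):
    """Per-sample size factors from a {gene: [count per sample]} table.

    Simplified illustration of the median-of-ratios idea; not DESeq2's code.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if any(c <= 0 for c in row):
            continue  # a zero count gives no finite geometric mean, so skip the gene
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The median across genes ignores a minority of extreme ratios.
    return [math.exp(median(r)) for r in log_ratios]


def total_count_ratio(counts):
    """Ratio of total counts, sample B over sample A (naive depth estimate)."""
    totals = [sum(col) for col in zip(*counts.values())]
    return totals[1] / totals[0]


# Hypothetical toy counts; columns are samples (A, B), B sequenced twice as deep.
annotated = {
    "g1": [100, 200],
    "g2": [200, 400],
    "g3": [50, 100],
    "g4": [400, 800],
    "g5": [80, 160],
}
# Same table plus one strongly A-specific gene missing from the annotation.
complete = dict(annotated, gX=[5000, 10])

sf_annotated = median_of_ratios(annotated)
sf_complete = median_of_ratios(complete)

print("size factors without gX:", sf_annotated)
print("size factors with gX:   ", sf_complete)
print("total-count ratio B/A without gX:", total_count_ratio(annotated))
print("total-count ratio B/A with gX:   ", total_count_ratio(complete))
```

In this toy table the size factors come out the same whether or not `gX` is in the table, and their B/A ratio recovers the twofold depth difference. The total-count ratio, by contrast, drops from 2.0 to about 0.29 once `gX` is included. The sketch only holds under the stated assumption that most genes are unchanged; if a large share of the genes were differentially expressed, the median would shift as well.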

Of course, to get better results you might want a better annotation of your genome; you can easily build one from the transcriptome data that you already have.