As this is about a Bioconductor package, I have also posted it here.
I am working on RNA-seq data. I made my count table using kallisto and then tximport, to work with DESeq2.
My reference is a set of cDNAs that is supposed to correspond to all the genes of my species, but the annotation is quite poor: when I align to these cDNAs, 60% of reads map, instead of 95% against the whole genome.
I have 2 conditions (A and B) and 3 replicates in each condition.
My fear is this: if a gene is over-expressed in A, not expressed in B, and missing from my cDNA list, I expect to have fewer mapped reads in A than in B. When DESeq2 then normalizes the counts, could this create a bias?
Example (each number is one read, labelled by the gene it comes from):
A: 1 1 1 1 2 2 2 2 3 3
B: 1 1 1 1 2 3 3 3 3 3
Gene 3 is not annotated, so after normalization by DESeq2:
A: 1 1 1 1 1 2 2 2 2 2
B: 1 1 1 1 1 1 1 1 2 2
Gene 1 now appears over-expressed in B, but that is not true.
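To make the worry concrete, here is a minimal Python sketch comparing two ways of computing per-sample scaling factors on a small table that contains only the annotated genes. My example above effectively scales each sample to the same total number of mapped reads. As far as I understand, DESeq2 by default uses median-of-ratios size factors instead, which should be less sensitive to a few missing or differentially expressed genes as long as most annotated genes do not change. The gene names, the counts, and both helper functions are hypothetical and only illustrate the idea; this is not DESeq2's actual code.

```python
import math
from statistics import median

# Hypothetical counts for the ANNOTATED genes only (one sample per condition).
# Reads from an unannotated gene are simply absent from this table.
counts = {
    "A": {"g1": 40, "g2": 40, "g4": 100, "g5": 60, "g6": 30},
    "B": {"g1": 40, "g2": 10, "g4": 100, "g5": 60, "g6": 30},
}

def total_count_factors(counts):
    """Scale each sample by its mapped-read total relative to the mean total."""
    totals = {s: sum(genes.values()) for s, genes in counts.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {s: t / mean_total for s, t in totals.items()}

def median_of_ratios_factors(counts):
    """Simplified median-of-ratios size factors (a sketch of the idea only)."""
    samples = list(counts)
    # Keep only genes with a non-zero count in every sample.
    genes = [g for g in counts[samples[0]]
             if all(counts[s][g] > 0 for s in samples)]
    # Log of the per-gene geometric mean across samples.
    log_geo = {g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
               for g in genes}
    # Size factor = exp(median over genes of log(count / geometric mean)).
    return {s: math.exp(median(math.log(counts[s][g]) - log_geo[g] for g in genes))
            for s in samples}

tc = total_count_factors(counts)
mor = median_of_ratios_factors(counts)
for s in counts:
    print(s,
          "total-count factor:", round(tc[s], 3),
          "median-of-ratios factor:", round(mor[s], 3),
          "g1 / total-count:", round(counts[s]["g1"] / tc[s], 2),
          "g1 / median-of-ratios:", round(counts[s]["g1"] / mor[s], 2))
```

In this toy table, g1 has the same raw count in both samples. Total-count scaling gives the two samples different factors, so g1 looks higher in B after normalization, while the median-of-ratios factors are the same for both samples and g1 stays equal. Whether this holds for my real data depends on most annotated genes not changing between conditions, which I cannot be sure of.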
How can I deal with this kind of problem?
Should I add a row to my count table for "unmapped reads" to get a better normalization?
My expression levels are expected to be similar, since I am working on different tissues of the same organism.
Alternatively, building a new GFF with Cufflinks and using it to improve my results seems like a good option!