Hello, I've made a de novo transcriptome assembly of an invertebrate organism and have done some downstream analysis. I first ran a differential gene expression analysis with DESeq2 and found that ~3000 transcripts were considered differentially expressed (adj-pval < 0.001 and |LFC| >= 2). But when I annotated these transcripts with Trinotate, I saw that many of them were contaminants such as fungi, bacteria, parasites and others.
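To make the thresholds concrete, here is a minimal Python sketch of that kind of filter. It assumes the DESeq2 results were exported to a table and loaded as a list of dicts; `log2FoldChange` and `padj` are the standard DESeq2 result columns, while `select_de`, the `transcript_id` key and all values are made up for illustration.

```python
def select_de(results, padj_max=0.001, lfc_min=2.0):
    """Return IDs of transcripts with padj < padj_max and |LFC| >= lfc_min."""
    selected = []
    for row in results:
        padj = row["padj"]
        lfc = row["log2FoldChange"]
        # DESeq2 can report NA for padj (e.g. after independent filtering);
        # here NA is represented as None and such rows are skipped.
        if padj is None or lfc is None:
            continue
        if padj < padj_max and abs(lfc) >= lfc_min:
            selected.append(row["transcript_id"])
    return selected


# Hypothetical rows, not real data.
results = [
    {"transcript_id": "TR1", "log2FoldChange": 3.1, "padj": 1e-5},
    {"transcript_id": "TR2", "log2FoldChange": -2.4, "padj": 2e-4},
    {"transcript_id": "TR3", "log2FoldChange": 1.2, "padj": 1e-6},
    {"transcript_id": "TR4", "log2FoldChange": 4.0, "padj": None},
    {"transcript_id": "TR5", "log2FoldChange": -1.5, "padj": 0.01},
]

print(select_de(results))  # ['TR1', 'TR2']
```

Both conditions have to hold at once, so TR3 (very small adjusted p-value but small fold change) is not counted as DE.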
I don't think I can simply ignore these contaminants, select only the transcripts matching the organism of interest, and carry on with downstream analysis such as GO enrichment. My DE analysis would be biased if I did that, right?
So I inverted the order: I ran the annotation first, then removed the contaminants, and then selected only the transcripts matching a sequence from the organism of interest (~5000) for DE analysis. But by doing this, I got only about 30 DE transcripts with the same thresholds, and even with the thresholds relaxed to adj-pval < 0.05 and |LFC| >= 1, I got ~120 transcripts.
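The annotate-first filtering step can be sketched roughly as below. This is only an illustration under assumptions: the lineage strings are a simplified, hypothetical stand-in for the taxonomy carried in the annotation report (not the actual Trinotate output format), and `keep_target_transcripts`, the IDs and the counts are invented.

```python
def keep_target_transcripts(counts, lineage, target_taxon):
    """Keep count rows whose annotated lineage contains target_taxon.

    Transcripts with no annotation are dropped too, which mirrors
    keeping only transcripts that match the organism of interest.
    """
    kept = {}
    for tid, row in counts.items():
        taxa = [t.strip() for t in lineage.get(tid, "").split(";")]
        if target_taxon in taxa:
            kept[tid] = row
    return kept


# Hypothetical raw counts: transcript ID -> counts per sample.
counts = {
    "TR1": [10, 12, 50, 48],
    "TR2": [200, 180, 5, 7],
    "TR3": [30, 33, 29, 31],
    "TR4": [0, 1, 0, 2],
    "TR5": [75, 80, 20, 22],
}

# Hypothetical lineage of the top hit; TR4 has no annotation.
lineage = {
    "TR1": "Eukaryota; Metazoa; Arthropoda",
    "TR2": "Bacteria; Proteobacteria",
    "TR3": "Eukaryota; Fungi; Ascomycota",
    "TR5": "Eukaryota; Metazoa; Mollusca",
}

filtered = keep_target_transcripts(counts, lineage, "Metazoa")
print(sorted(filtered))  # ['TR1', 'TR5']
```

The filtered count matrix is what would then be passed to DESeq2 in this order of operations.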
Is it normal for the number of differentially expressed genes to be around 100 even with loose thresholds? Or should I run the DE analysis on all assembled transcripts? Most papers I have read seem to run the DE analysis before annotation, but I didn't see anything about removing contaminants.
If anyone has any tips for me, I would be very thankful!
Thank you so much!!!