Hello community!
I have several questions about differential gene expression (DGE) analysis. I have searched several forums, so I hope I am not repeating a question that has already been answered many times (if I am, sorry :P).
I have assembled a de novo transcriptome (invertebrate) made up of 120,000 transcripts/contigs/unigenes (I'm a bit confused about whether these terms are equivalent). I started with more than 1,000,000 and used EvidentialGene to compact the transcriptome. When planning the DGE analysis, I am worried about transcript redundancy: several transcripts in my transcriptome actually come from the same gene. So if I run the DESeq2 pipeline directly with the Trinity scripts, I could get a lot of "false positives/negatives", because reads that correspond to the same gene are split between different transcripts.
To address this, I used the tximport pipeline to import my count matrix into DESeq2 (sorry, I forgot to mention that I used kallisto to obtain my "read counts"). With tximport I wanted to 1) make my counts fit the model assumptions and 2) summarize my transcripts at the gene level. To get to the gene level, I annotated my transcripts in order to generate the tx2gene file. Then I just followed the DESeq2 pipeline.
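To make the tx2gene step concrete, here is a minimal Python sketch of how such a file could be built from the blastx results. It is only an illustration under several assumptions: blastx was run with the standard 12-column tabular output (`-outfmt 6`), the best hit per transcript is chosen by bitscore, and the subject ID is used directly as the gene ID (with 9 proteomes, orthologous proteins from different species would still need to be collapsed to a shared ID, which is not shown). The function names, the example IDs, and the `TXNAME`/`GENEID` header are hypothetical choices, not something tximport requires.

```python
import csv
import io


def best_hits(blast_lines):
    """Map each transcript to its highest-bitscore subject.

    Assumes BLAST tabular format (-outfmt 6): column 1 is the query
    (transcript) ID, column 2 the subject ID, column 12 the bitscore.
    """
    best = {}
    for row in csv.reader(blast_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid, bitscore = row[0], row[1], float(row[11])
        if qseqid not in best or bitscore > best[qseqid][1]:
            best[qseqid] = (sseqid, bitscore)
    return {tx: hit for tx, (hit, _score) in best.items()}


def write_tx2gene(mapping, handle):
    """Write a two-column table: transcript ID first, gene ID second."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["TXNAME", "GENEID"])  # hypothetical header names
    for tx in sorted(mapping):
        writer.writerow([tx, mapping[tx]])


# Made-up example hits: two isoforms of one Trinity gene hit the same protein.
example_hits = [
    "DN1_c0_g1_i1\tprotA\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200",
    "DN1_c0_g1_i1\tprotB\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-30\t120",
    "DN1_c0_g1_i2\tprotA\t85.0\t100\t15\t0\t1\t300\t1\t100\t1e-40\t150",
]

tx2gene = best_hits(example_hits)
buffer = io.StringIO()
write_tx2gene(tx2gene, buffer)
tx2gene_text = buffer.getvalue()
```

In this toy input both isoforms end up assigned to `protA`, which is the behaviour I want from the gene-level summary.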
In summary: Trinity (EvidentialGene) > kallisto > blastx vs custom database (9 proteomes) > generate tx2gene file > tximport > DESeq2
In my blast search I identified 40,000 of the 120,000 transcripts, which correspond to 19,500 different genes. I have "only" identified about a third of my transcripts, and I'm not sure which is better: use all transcripts in the DGE analysis, with reads split between several transcripts (which could belong to the same gene), or restrict the DGE analysis to the subset of transcripts whose identity I already know?
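The two options can be illustrated with a small Python sketch. It is a deliberate simplification: it only sums estimated counts per gene and does not reproduce what tximport actually does (for example, its handling of transcript lengths). The function name, the `keep_unannotated` flag, and the toy IDs and counts are all hypothetical.

```python
from collections import defaultdict


def summarize_to_genes(tx_counts, tx2gene, keep_unannotated=True):
    """Sum transcript-level counts per gene.

    Transcripts without an annotation are either kept as their own
    single-transcript feature or dropped, depending on keep_unannotated.
    """
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene = tx2gene.get(tx)
        if gene is None:
            if not keep_unannotated:
                continue  # option 2: annotated fraction only
            gene = tx  # option 1: unannotated transcript stands alone
        gene_counts[gene] += count
    return dict(gene_counts)


# Toy data: tx1 and tx2 are annotated to the same gene, tx3 has no hit.
tx_counts = {"tx1": 10.0, "tx2": 5.0, "tx3": 7.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA"}

all_features = summarize_to_genes(tx_counts, tx2gene, keep_unannotated=True)
annotated_only = summarize_to_genes(tx_counts, tx2gene, keep_unannotated=False)
```

With the first option the unannotated transcripts stay in the analysis but may still split reads from one gene; with the second they are simply discarded, along with whatever signal they carry.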
Please let me know whether this approach is viable or not. I have seen a lot of examples of tximport & DESeq2, but they are always based on a reference genome.
Thank you for your time and attention.
Edit: if you need more details, I can provide them.
Pablo GF
Yep, I know about Sleuth, but for some reason my supervisor has a preference for DESeq2. I think it is the eternal problem of "that is the software I have seen cited most often in the literature", even if in this particular case other software would suit this DGE analysis better.
I'm interested in remapping to the known fraction of the transcriptome. I'll look into that approach.
Thank you for your time Biojl.