Question

DESeq2 + de novo assembly

8.5 years ago

pablo61991 ▴ 90

Hello community!

I have several questions about differential gene expression (DGE) analysis. I have searched several forums, so I hope I am not repeating a question that has already been answered many times (if I am, my apologies :P).

I have assembled a de novo transcriptome for an invertebrate. It contains 120,000 transcripts/contigs/unigenes (I'm a little confused about whether these terms are equivalent); I started with more than 1,000,000 and used EvidentialGene to reduce the redundancy. My concern when planning the DGE analysis is transcript redundancy: several transcripts in my transcriptome actually come from the same gene. So if I run the DESeq2 pipeline directly with the Trinity scripts, I could get a lot of false positives/negatives, because reads that correspond to the same gene are split across different transcripts.

To address this, I used the tximport pipeline to import my count matrix into DESeq2 (sorry, I forgot to mention that I used kallisto to obtain my "read counts"). With tximport I wanted to 1) make my counts fit the model assumptions and 2) summarize my transcripts at the gene level. To get to the gene level, I annotated my transcripts in order to generate the tx2gene file. Then I simply followed the DESeq2 pipeline.
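For readers following along, here is a minimal sketch of how a tx2gene table could be derived from the blastx results. It assumes the search was run with tabular output (`-outfmt 6`, where column 1 is the query, column 2 the subject and column 12 the bitscore) and that the subject ID of the best hit is used directly as the gene ID. The transcript and protein names are made up for illustration, and `build_tx2gene` is a hypothetical helper, not part of tximport.

```python
import csv

def build_tx2gene(blast_lines):
    """Return {transcript_id: gene_id}, keeping the highest-bitscore hit per transcript.

    Assumes BLAST tabular format (-outfmt 6): qseqid, sseqid, ..., bitscore (12th column).
    """
    best = {}  # transcript_id -> (bitscore, gene_id)
    for row in csv.reader(blast_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        tx, gene, bitscore = row[0], row[1], float(row[11])
        if tx not in best or bitscore > best[tx][0]:
            best[tx] = (bitscore, gene)
    return {tx: gene for tx, (_, gene) in best.items()}

# Hypothetical blastx hits: two isoforms of one gene plus an unrelated transcript.
example_hits = [
    "DN1_g1_i1\tprotA\t95.0\t200\t10\t0\t1\t600\t1\t200\t1e-100\t350",
    "DN1_g1_i1\tprotB\t80.0\t200\t40\t0\t1\t600\t1\t200\t1e-60\t210",
    "DN1_g1_i2\tprotA\t93.0\t180\t12\t0\t1\t540\t1\t180\t1e-90\t320",
    "DN2_g1_i1\tprotC\t70.0\t150\t45\t0\t1\t450\t1\t150\t1e-40\t150",
]

tx2gene = build_tx2gene(example_hits)
for tx, gene in sorted(tx2gene.items()):
    print(tx, gene, sep="\t")
```

Written out as a two-column table (transcript ID first, gene ID second), this is the shape tximport expects for its `tx2gene` argument. With nine proteomes in the database, orthologous proteins from different species would still get different IDs, so some extra grouping step would likely be needed; that part is not covered here.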

In summary: Trinity (EvidentialGene) > kallisto > blastx vs custom database (9 proteomes) > generate tx2gene file > tximport > DESeq2

In my BLAST search I identified 40,000 of the 120,000 transcripts, which correspond to 19,500 different genes. I have "only" identified about a third of my transcripts, and I'm not sure which is better: use all transcripts in the DGE analysis, with reads split among several transcripts (which could belong to the same gene), or restrict the DGE analysis to the subset of transcripts whose identity I already know?
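To make the two options concrete, here is a small sketch of gene-level summarization that either keeps unannotated transcripts as their own "genes" or drops them. It only sums estimated counts per gene; tximport itself also handles transcript lengths and offsets, so this is an illustration of the idea, not a replacement. All IDs and counts are invented, and `summarize_to_gene` is a hypothetical helper.

```python
from collections import defaultdict

def summarize_to_gene(tx_counts, tx2gene, keep_unannotated=True):
    """Sum transcript-level counts per gene.

    Transcripts missing from tx2gene are either kept under their own ID
    (keep_unannotated=True) or discarded (keep_unannotated=False).
    """
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene = tx2gene.get(tx)
        if gene is None:
            if not keep_unannotated:
                continue
            gene = tx  # treat the unannotated transcript as its own gene
        gene_counts[gene] += count
    return dict(gene_counts)

# Invented estimated counts for one sample (e.g. the est_counts column from kallisto).
tx_counts = {"tx1": 120.0, "tx2": 30.0, "tx3": 50.0, "tx4": 10.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}  # tx4 has no BLAST hit

all_genes = summarize_to_gene(tx_counts, tx2gene, keep_unannotated=True)
annotated_only = summarize_to_gene(tx_counts, tx2gene, keep_unannotated=False)

# Fraction of the estimated counts that survives when unannotated transcripts are dropped.
retained = sum(annotated_only.values()) / sum(tx_counts.values())
print(all_genes, annotated_only, round(retained, 3))
```

Checking the retained fraction on the real data may help with the decision: if the annotated third of the transcripts carries most of the counts, restricting the analysis to it loses little.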

Please let me know whether this approach is viable. I have seen many examples of tximport & DESeq2, but they are always based on a reference genome.

Thank you for your time and attention.

Edit: if you need more details, I can provide them.

Pablo GF

RNA-Seq DESeq2 DGE de novo Trinity • 3.2k views

ADD COMMENT • link updated 8.5 years ago by Biojl ★ 1.7k • written 8.5 years ago by pablo61991 ▴ 90

score 0 · Answer 1 · 2017-05-25

8.5 years ago

Biojl ★ 1.7k

If you use kallisto you should use Sleuth, which is the DE tool designed for that kind of quantification. It is also easier to use than DESeq2.

I would focus on the fraction of transcripts with a BLAST hit, unless you have very good evidence that a number of genes that are not conserved beyond the species you BLAST against are relevant to your experimental design/question.

Get a filtered transcriptome (19,500 genes: for each gene, the best-hitting of the 40k transcripts) and then run the quantification on that transcriptome. This way you will be doing the analysis at the gene level, assuming most reads that would map to different isoforms will map to the best hit. Then you can run the DE analysis.
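A rough sketch of that filtering step, assuming "best hit" means the transcript with the highest bitscore for each gene. The helper names, IDs and sequences are hypothetical; real data would be read from the blastx table and the assembly FASTA.

```python
def pick_representatives(hits):
    """hits: iterable of (transcript_id, gene_id, bitscore).

    Return {gene_id: transcript_id} with the highest-scoring transcript per gene.
    """
    best = {}  # gene_id -> (bitscore, transcript_id)
    for tx, gene, score in hits:
        if gene not in best or score > best[gene][0]:
            best[gene] = (score, tx)
    return {gene: tx for gene, (_, tx) in best.items()}

def filter_fasta(fasta_text, keep_ids):
    """Keep only FASTA records whose ID (first word of the header) is in keep_ids."""
    out = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in keep_ids
        if keep:
            out.append(line)
    return "\n".join(out) + ("\n" if out else "")

# Invented example: two isoforms of geneA, one transcript of geneB, one without a hit.
hits = [("tx1", "geneA", 350.0), ("tx2", "geneA", 320.0), ("tx3", "geneB", 150.0)]
fasta = ">tx1 len=6\nATGCCC\n>tx2 len=6\nATGCCA\n>tx3\nATGAAA\nTTT\n>tx4\nGGGG\n"

representatives = pick_representatives(hits)
filtered = filter_fasta(fasta, set(representatives.values()))
print(filtered, end="")
```

The filtered FASTA would then be indexed and quantified again with kallisto, giving one row per gene in the count matrix.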

ADD COMMENT • link 8.5 years ago by Biojl ★ 1.7k

Yep, I know about Sleuth, but for some reason my supervisor has a preference for DESeq2. I think it's the eternal problem of "that's the software I've seen cited most often in the literature", even if in this particular case other software is a better fit for the DGE analysis.

I'm interested in remapping to the known fraction of the transcriptome. I'll look into that approach.

Thank you for your time Biojl.

ADD REPLY • link 8.5 years ago by pablo61991 ▴ 90