Question

Best Strategy For De Novo Assembly Of A Reference Transcriptome Without A Genome

5

12.4 years ago

Wrf ▴ 210

I have sequences from several tissues of the same animal. I'd like to generate a reference transcriptome, then map my reads onto it to search for differential expression. There is no genome for this animal, nor for anything closely related.

The most obvious strategy would be to assemble each tissue de novo, then combine them and remove duplicate sequences. Is there any reason why this would not be the best way?
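As a rough illustration of the "combine and remove duplicates" step, here is a minimal Python sketch. It assumes each per-tissue assembly has already been loaded into a dictionary of contig ID to sequence; the tissue names, contig IDs and sequences are made up, and `merge_assemblies` is a hypothetical helper, not part of any assembler. It collapses only exact duplicates (on either strand), so a shorter isoform that is merely a subsequence of a longer one is kept.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def canonical(seq):
    """Strand-independent key: the smaller of the sequence and its reverse complement."""
    seq = seq.upper()
    return min(seq, reverse_complement(seq))


def merge_assemblies(assemblies):
    """Combine per-tissue assemblies, collapsing exact duplicates.

    assemblies: dict of tissue -> dict of contig_id -> sequence.
    Returns a dict of canonical sequence -> list of (tissue, contig_id),
    so the tissue of origin of every contig is still traceable.
    """
    merged = {}
    for tissue, contigs in assemblies.items():
        for contig_id, seq in contigs.items():
            merged.setdefault(canonical(seq), []).append((tissue, contig_id))
    return merged


# Toy data (invented): contig b1 is the reverse complement of m1.
assemblies = {
    "muscle": {"m1": "ATGGCCTTAGGA", "m2": "ATGCCCGGGTTT"},
    "brain": {"b1": "TCCTAAGGCCAT", "b2": "ATGAAACCCGGG"},
}
merged = merge_assemblies(assemblies)
for seq, sources in merged.items():
    print(seq, sources)
```

Real assemblies will also contain near-duplicates (sequencing errors, partial contigs), which exact matching will not catch; a clustering tool would be needed for those.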

Does anyone know of a data structure or program that could hold one "gene" and all of its exon combinations for mapping, so I could clearly see that reads are mapping to splice variants of the same gene rather than to possibly unrelated contigs? For example, a gene with 3 exons (1,2,3) might have two transcripts (isoform A: 1+2, isoform B: 1+2+3). While the first is a subsequence of the second, I don't want to remove the first, since the inclusion of the C-terminal exon might be biologically important in one of the tissues. If I were to then map the reads with bowtie, some of them would hit isoform B and some isoform A. Since they are the same gene, at some level I would just want to know that, and could possibly disregard the cassette exons.
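To make the kind of structure I have in mind concrete, here is a minimal Python sketch. It is not an existing program: the `Gene` class, the exon sequences and the read counts are all invented for illustration, and it assumes isoform names are unique across genes. Each gene stores its exons once, each isoform is just an ordered tuple of exon IDs, and per-isoform read counts can be summed up to the gene.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Gene:
    name: str
    exons: dict                                   # exon id -> sequence
    isoforms: dict = field(default_factory=dict)  # isoform name -> tuple of exon ids

    def isoform_sequence(self, isoform):
        """Spliced sequence of one isoform, built from the shared exons."""
        return "".join(self.exons[e] for e in self.isoforms[isoform])

    def cassette_exons(self):
        """Exons present in some isoforms but not in all of them."""
        exon_sets = [set(ids) for ids in self.isoforms.values()]
        return set.union(*exon_sets) - set.intersection(*exon_sets)


def gene_level_counts(genes, isoform_counts):
    """Sum per-isoform read counts up to the gene each isoform belongs to."""
    owner = {iso: g.name for g in genes for iso in g.isoforms}
    totals = defaultdict(int)
    for iso, n in isoform_counts.items():
        totals[owner[iso]] += n
    return dict(totals)


# The example from the question: exons 1, 2, 3; isoform A = 1+2, isoform B = 1+2+3.
gene = Gene(
    name="geneX",
    exons={1: "ATGGCC", 2: "TTAGGA", 3: "CCCTGA"},
    isoforms={"A": (1, 2), "B": (1, 2, 3)},
)
counts = {"A": 40, "B": 25}  # made-up read counts per isoform

print(gene.isoform_sequence("B"))
print(gene.cassette_exons())
print(gene_level_counts([gene], counts))
```

With something like this, reads hitting isoform A or B are both reported under geneX, and exon 3 is flagged as the only exon that differs between the isoforms.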

assembly transcriptome rna-seq • 5.6k views


0

For the transcriptome assembly I would personally recommend pooling the reads from all the different tissues: the more information an assembler has available, the better it can perform.

Reply • 12.4 years ago by Sebastian Kurscheid ▴ 300

score 3 · Answer 1 · 2013-03-07

3

12.4 years ago

Wrf ▴ 210

I guess I can answer my own question a bit...

I had just exchanged some emails with Daniel Zerbino, the creator of Velvet. He said that in a case with the same 3 exons, if one tissue had 1+2 and another had 2+3, Velvet/Oases would make a final transcript of 1+2+3, even though this never occurs in the real animal. This is probably true of most assemblers.
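To see how pooling could produce such a transcript, here is a toy Python sketch. It is emphatically not how Velvet/Oases works internally (those use de Bruijn graphs); it is just a naive suffix-prefix overlap merge, with invented exon sequences and a hypothetical `merge_on_overlap` helper, showing that two real transcripts sharing exon 2 can be joined into one that neither tissue contains.

```python
def merge_on_overlap(left, right, min_overlap=4):
    """Join two sequences if a suffix of `left` equals a prefix of `right`.

    Tries the longest possible overlap first; returns None if no overlap
    of at least `min_overlap` bases exists.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


# Invented exon sequences.
EXON1, EXON2, EXON3 = "ATGGCC", "TTAGGA", "CCCTGA"

tissue_a = EXON1 + EXON2  # transcript 1+2, seen only in tissue A
tissue_b = EXON2 + EXON3  # transcript 2+3, seen only in tissue B

# Pooling both transcripts lets the shared exon 2 bridge them.
chimera = merge_on_overlap(tissue_a, tissue_b)
print(chimera)
print(chimera == EXON1 + EXON2 + EXON3)
```

The merged sequence is 1+2+3, which is consistent with each input but was never observed as a single molecule.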

As far as I can tell, that is a reason specifically NOT to pool the reads. One would never want to pool them and end up with more transcripts than the sum of the two individual assemblies. In fact, since housekeeping genes should be common to both tissues, I would expect the combined set to be smaller than the sum.

He also pointed to this program: http://flux.sammeth.net/capacitor.html, which supposedly can use the read counts to generate the splice variants.


0

This does not make sense to me.

The way I would approach this analysis (painting with the broadest brush here) is:

1) create a reference transcriptome based on reads obtained from all tissues
2) create an annotation of the transcriptome, including identification of putative splice sites
3) perform alignment of the same reads (from step 1) to this reference, but this time doing it for each library (tissue) separately

Reply • 12.4 years ago by Sebastian Kurscheid ▴ 300

0

So then step 1 is not really a reference "transcriptome", since it contains non-real transcripts; it's an all-introns-removed genome. That might work. My complaint is still that the 'reference transcriptome' might be treated as real when it has not been proven to be real.

Reply • 12.3 years ago by Wrf ▴ 210