I have around 1.5 billion SOLiD3 single-end 50 bp RNA-Seq reads taken from various conditions. Libraries were poly-A enriched and ribo-depleted. I've tried several different mapping/assembly strategies with TopHat + Cufflinks:
-mapped/assembled individual libraries
-mapped/assembled libraries grouped by biological condition, so all control libraries together, all irradiation libraries together, etc.
-mapped/assembled randomly chosen libraries in increasing numbers of reads: 100 million, 200 million, 300 million...
-mapped/assembled everything together
I decided to go with mapping/assembly grouped by biological condition because I figured different conditions would produce different transcript compositions, and pooling different compositions might confound the assembler's statistics. I then used cuffmerge to merge all the assemblies together.
I looked at the splice junctions reported by TopHat for each of these runs and found that the number of discovered splice junctions plateaus at around half a billion reads. This makes me think the data reach information saturation at about that depth. Interestingly, my assembly gets progressively worse (transcripts become fragmented) as I add more reads.
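For anyone who wants to reproduce the saturation check, here is a minimal sketch of how the junction count can be tallied as libraries are added. It assumes the junctions come from TopHat's junctions.bed output in BED12 layout, where the intron lies between the two flanking blocks. The function names `junction_key` and `cumulative_junctions` are made up for illustration and are not part of TopHat.

```python
def junction_key(bed_line):
    """Return (chrom, intron_start, intron_end, strand) for one BED12 junction line.

    Assumes column 11 holds the two flanking block sizes, so the intron
    starts after the first block and ends before the last block.
    """
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[5]
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    return (chrom, start + block_sizes[0], end - block_sizes[-1], strand)


def cumulative_junctions(libraries):
    """Running count of distinct junctions after each library is added.

    libraries: an ordered iterable; each item is an iterable of BED lines
    (for example an open junctions.bed file handle).
    """
    seen = set()
    counts = []
    for lines in libraries:
        for line in lines:
            # Skip blank lines and the BED "track" header line.
            if not line.strip() or line.startswith("track"):
                continue
            seen.add(junction_key(line))
        counts.append(len(seen))
    return counts
```

Keying on the intron coordinates rather than the raw BED start/end means the same junction supported by different read overhangs in two libraries is only counted once. A flattening list of counts is what I am calling saturation above.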
My main problems with the reference-based assembly are accuracy and coverage. Many of the assembled transcripts have ORFs interrupted by stop codons in the middle of the transcript. Could this be due to TopHat not predicting splice junctions correctly? Or is it perhaps a problem with the genome, which is AT-rich and split into around 25k supercontigs?
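To make "stops in the middle of the transcript" concrete, this is roughly the kind of check I mean, written as a minimal sketch. It assumes the standard stop codons (TAA, TAG, TGA), a forward-strand DNA sequence, and a known reading frame. The helper name `internal_stops` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumes the standard genetic code


def internal_stops(seq, frame=0):
    """Return 0-based nucleotide positions of in-frame stop codons.

    A stop codon that is the very last complete codon in the frame is
    treated as the normal terminator and is not reported.
    """
    seq = seq.upper()
    codons = [(i, seq[i:i + 3]) for i in range(frame, len(seq) - 2, 3)]
    stops = [pos for pos, codon in codons if codon in STOP_CODONS]
    if stops and stops[-1] == codons[-1][0]:
        stops = stops[:-1]  # drop the terminal stop
    return stops
```

This only flags where the premature stops fall; it does not say why they are there. Checking whether the flagged positions cluster near predicted splice junctions or near supercontig edges might help tell a junction problem from a genome problem.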
I also have around 1 million Roche 454 reads that I've assembled de novo with Newbler and mapped back to the genome with GMAP. A fair number of the 454-assembled transcripts have no ABI (SOLiD) reads mapping to them, hence the coverage issue.
What are your experiences with doing a transcriptome assembly with SOLiD reads? Reference-based or de novo?