Hi, I'm trying to build a de novo transcriptome from quite a large dataset with multiple conditions for the same organism.
I tried building it with SPAdes on the entire dataset, but it runs out of memory at some point.
I have resorted to building the transcriptome for one replicate and then passing the result as --trusted-contigs to the next sample's assembly, and so on, but it is going too slowly. Has anyone done this? Does it get faster as more samples are added, since the trusted contigs would act as a starting point?
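Roughly, each iteration looks like this (file names are placeholders, and I'm assuming plain spades.py here; I'm not sure --trusted-contigs is accepted in RNA mode):

```
# Assemble sample 2, seeded with the contigs from sample 1.
# --trusted-contigs treats the previous assembly as high-quality input;
# -m caps memory in GB (SPAdes aborts instead of swapping), -t sets threads.
spades.py \
    -1 sample2_R1.fastq.gz -2 sample2_R2.fastq.gz \
    --trusted-contigs assembly_sample1/contigs.fasta \
    -m 250 -t 16 \
    -o assembly_sample2
```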
Or, is there a way to build many small transcriptomes and then merge them, or would that amount to the same thing?
Would pre-merging the paired-end reads help?
I'd appreciate any ideas you might have on how to do this.
Thanks
Intuitively, it should be fine to build smaller assemblies (since you don't seem to have the infrastructure to build a single large one) and then remove redundancy with something like CD-HIT or Clumpify from the BBMap suite. It may be a little tricky with eukaryotic data (isoforms share long stretches of sequence, so they can cluster unpredictably), but at the very least remove sequences that are fully identical over their entire length.
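A minimal sketch, assuming each subset assembly ends up in its own directory and using cd-hit-est for the collapsing step (all paths are placeholders):

```
# Pool the per-subset assemblies into one file.
cat assembly_*/transcripts.fasta > all_transcripts.fasta

# Cluster at 100% identity. Since CD-HIT measures identity over the
# shorter sequence by default, -c 1.0 also drops transcripts fully
# contained in a longer one. -n 10 is the word size for high identity
# thresholds, -M 0 lifts the memory cap, -T 8 uses 8 threads.
cd-hit-est -i all_transcripts.fasta -o nr_transcripts.fasta \
    -c 1.0 -n 10 -M 0 -T 8
```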
Your reads should not normally be merged (unless you have short inserts, where the pairs overlap).
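If you want to check whether that applies to you, BBMerge can estimate the insert-size distribution from a subsample without writing merged output (file names are placeholders):

```
# Try to merge the first 1M pairs and write an insert-size histogram;
# a high merge rate means short inserts (i.e. overlapping pairs).
bbmerge.sh in1=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
    ihist=insert_histogram.txt reads=1000000
```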