Are there any tools/options to assemble transcriptome Illumina datasets on low memory machines? I know there are tools for machines with RAM in the hundreds of gigs, but I would like to know if there are any for low-mem workstation-like machines.
For example, what would work for a dataset of 3 GA2 runs with 7 lanes each, 75bp PE, about 200GB of fastq files, from an insect species whose closest reference genome diverged >100MYA?
trans-ABySS is designed to work on a cluster with each node having about 2GB of memory. It is not as memory intensive as other de Bruijn graph based methods, whose memory use scales linearly with genome size. I've not tried it on a single machine (nor do I know whether that is possible, sorry), but it is worth a look.
I suppose you might want to look at CLC or Ray. That said, sometimes it's just easier to find someone with a big machine.
Most assemblers will tell you their transcriptome assemblies suffer from:
splicing variants. With bigger k-mers, splicing variants don't introduce as many ambiguities, so use larger k-mers.
non-uniform coverage, though most won't explain why exactly that is deleterious. My understanding is that it comes down to high-coverage areas being treated as repeats, and coverage being made useless as a tiebreaker in ambiguous path traversal situations. At any rate there are a couple strategies available to normalize or flatten coverage. Obviously you'll want to remove exact duplicates, but an EST clustering approach (using Vmatch or other tools) might be handy as well.
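To make the "remove exact duplicates" step concrete, here is a minimal Python sketch. It assumes well-formed 4-line FASTQ records and two mate files in the same order; the function names (read_fastq, dedup_pairs) are made up for illustration and do not come from any of the tools mentioned. It stores one MD5 digest per unique pair instead of the sequences themselves, but that set still grows with the number of unique pairs, so on a dataset this size it would probably have to be run lane by lane or swapped for a disk-based sort.

```python
import hashlib
from typing import Iterable, Iterator, Tuple

# (header, sequence, separator line, quality string)
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield FASTQ records, assuming exactly 4 lines per record."""
    it = iter(lines)
    # zip over the same iterator four times groups the lines in fours;
    # a trailing incomplete record is silently dropped.
    for header, seq, plus, qual in zip(it, it, it, it):
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               plus.rstrip("\n"), qual.rstrip("\n"))


def dedup_pairs(fwd: Iterable[Record],
                rev: Iterable[Record]) -> Iterator[Tuple[Record, Record]]:
    """Keep only the first occurrence of each (mate 1, mate 2) sequence pair."""
    seen = set()
    for r1, r2 in zip(fwd, rev):
        # Hash both mate sequences together so a pair counts as a duplicate
        # only when both mates are identical to an earlier pair.
        key = hashlib.md5((r1[1] + "\t" + r2[1]).encode("ascii")).digest()
        if key in seen:
            continue
        seen.add(key)
        yield r1, r2
```

In practice you would pass two open file handles (one per mate file) to read_fastq and write the surviving pairs back out. Note that this only catches exact duplicates; reads that differ by a single sequencing error are kept.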
reduce the complexity of the input set. Sequencing errors likely increase the amount of RAM needed to hold it, so strict quality filtering may help.
error correction (though this may require about as much RAM as the assembly itself). There are programs for k-mer based error correction which, according to their authors, improve the quality of genomic assemblies. This is likely to hold for transcript assembly too.
instead of a reference genome, try a "reference transcriptome". Hopefully you will find some ESTs, be they Sanger or 454, which can be assembled with less pain.
try to get at least the mitochondrial sequences of your species or of anything close. A lot of RNA-Seq reads match them. The same goes for ribosomal sequences.
get even a small chunk of genomic sequence (say cosmid sized) with some repeats in it. Map your reads to it allowing a large number of mismatches, and filter out everything that maps to the repetitive parts.
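As a rough illustration of the strict quality filtering suggested in the list above, here is a minimal Python sketch. Everything in it is an assumption rather than a recommendation from this thread: the function names are hypothetical, the thresholds (mean quality of 20, no Ns) are arbitrary starting points, and the default quality offset of 64 assumes an older Illumina pipeline; check which encoding your files use and pass offset=33 if they are Sanger-style.

```python
from typing import Iterable, Iterator, List, Tuple

# (sequence, quality string) for one mate
Read = Tuple[str, str]


def phred_scores(qual: str, offset: int) -> List[int]:
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(c) - offset for c in qual]


def passes_filter(seq: str, qual: str, offset: int = 64,
                  min_mean_q: float = 20.0, max_n: int = 0) -> bool:
    """Return True if a read has few enough Ns and a high enough mean quality."""
    if not seq or len(seq) != len(qual):
        return False  # empty or malformed record
    if seq.upper().count("N") > max_n:
        return False
    scores = phred_scores(qual, offset)
    return sum(scores) / len(scores) >= min_mean_q


def filter_pairs(pairs: Iterable[Tuple[Read, Read]],
                 **kwargs) -> Iterator[Tuple[Read, Read]]:
    """Keep a pair only when both mates pass, so the mate files stay in sync."""
    for (s1, q1), (s2, q2) in pairs:
        if passes_filter(s1, q1, **kwargs) and passes_filter(s2, q2, **kwargs):
            yield (s1, q1), (s2, q2)
```

Dropping a pair when either mate fails is the simplest choice for paired-end data; keeping the surviving mate as a single-end read is an alternative if your assembler accepts mixed input.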
My understanding is that trans-ABySS takes pre-assembled contigs as input. Or is that only an optional input format, in addition to raw reads?
Yes, you need to run ABySS on the raw reads first.