Question

Transcriptome Assembly Of Illumina Reads On Low-Mem Machines?

5

13.6 years ago

2184687-1231-83- ★ 5.1k

Are there any tools/options to assemble transcriptome Illumina datasets on low memory machines? I know there are tools for machines with RAM in the hundreds of gigs, but I would like to know if there are any for low-mem workstation-like machines.

For example, what about a dataset of 3 GA2 runs with 7 lanes per run, 75 bp PE, about 200 GB of FASTQ files, from an insect species with no reference genome anywhere near (>100 MYA)?

transcriptome assembly memory • 4.1k views

ADD COMMENT • link updated 5.7 years ago by Biostar 20 • written 13.6 years ago by 2184687-1231-83- ★ 5.1k

score 3 · Answer 1 · 2011-05-16

3

13.6 years ago

Alastair Kerr 5.3k

trans-ABySS is designed to run on a cluster where each node has about 2 GB of memory. It is less memory-intensive than other de Bruijn graph based methods, whose memory use scales linearly with genome size. I have not tried it on a single machine (nor do I know whether that is possible, sorry), but it is worth a look.

0

My understanding is that trans-ABySS takes pre-assembled contigs as input. Or is that only an optional input format alongside raw reads?

ADD REPLY • link 13.6 years ago by 2184687-1231-83- ★ 5.1k

0

Yes, you need to run ABySS on the reads first.

ADD REPLY • link 13.6 years ago by Alastair Kerr 5.3k

score 1 · Answer 2 · 2011-05-16

I suppose you might want to look at CLC or Ray. That said, sometimes it's just easier to find someone with a big machine.

Most assemblers will tell you that their transcriptome assemblies suffer from:

Splice variants. With larger k-mers, splice variants introduce fewer ambiguities, so use larger k-mers.
Non-uniform coverage, though most won't explain exactly why that is harmful. My understanding is that high-coverage regions get treated as repeats, and coverage becomes useless as a tiebreaker when traversing ambiguous paths. At any rate, there are a couple of strategies for normalizing or flattening coverage. Obviously you'll want to remove exact duplicates, but an EST clustering approach (using Vmatch or other tools) might be handy as well.
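As a rough illustration of the exact-duplicate removal suggested above, here is a minimal Python sketch. It streams FASTQ records and keeps only an 8-byte hash per distinct sequence, so memory grows with the number of unique reads rather than with file size. The function names (`read_fastq`, `drop_exact_duplicates`) and the 8-byte digest size are my own assumptions for illustration, not part of any tool mentioned in this thread, and a truncated hash can in principle drop a non-duplicate read through a collision.

```python
import hashlib
from typing import Iterable, Iterator, Tuple

# (header, sequence, quality); hypothetical record layout for this sketch
Record = Tuple[str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from plain 4-line FASTQ input (no wrapped sequences)."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def drop_exact_duplicates(records: Iterable[Record]) -> Iterator[Record]:
    """Yield the first occurrence of each distinct read sequence."""
    seen = set()
    for header, seq, qual in records:
        # Assumption: an 8-byte digest is an acceptable trade-off between
        # memory use and the (small) chance of a hash collision.
        key = hashlib.blake2b(seq.encode("ascii"), digest_size=8).digest()
        if key in seen:
            continue
        seen.add(key)
        yield header, seq, qual


if __name__ == "__main__":
    demo = [
        "@r1\n", "ACGT\n", "+\n", "IIII\n",
        "@r2\n", "ACGT\n", "+\n", "IIII\n",
        "@r3\n", "TTTT\n", "+\n", "IIII\n",
    ]
    for header, seq, qual in drop_exact_duplicates(read_fastq(demo)):
        print(header, seq)
```

For paired-end data one would presumably hash both mates together (for example, the two sequences joined with a separator) so that a pair is only dropped when both reads are duplicated. This only removes identical reads; it does not flatten coverage in the broader sense discussed above.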

score 1 · Answer 3 · 2011-05-17

Just random untested ideas:

Reduce the complexity of the input set. Sequencing errors likely increase the amount of RAM needed to hold the data, so strict quality filtering may help.

Error correction (though this may require about as much RAM as the assembly itself). There are programs for k-mer based error correction which, according to their authors, improve the quality of genomic assemblies. This is likely to hold for transcript assembly too.

Instead of a reference genome, try a "reference transcriptome". Hopefully you will find some ESTs, be they Sanger or 454, which can be assembled with less pain.

Try to get at least the mitochondrial sequences of your species or of anything close. A lot of RNA-Seq reads match them. The same goes for ribosomal sequences.

Get even a small chunk of genomic sequence (say, cosmid-sized) with some repeats in it. Map the reads to it allowing a large number of mismatches, and filter out everything that maps to the repetitive parts.
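To make the "strict quality filtering" idea from the list above concrete, here is a small Python sketch, equally untested on real data. It drops reads whose mean Phred score is below a threshold or that contain ambiguous bases. The names `mean_phred` and `quality_filter` and the thresholds (mean quality 20, no `N` allowed) are assumptions chosen for illustration. The quality offset is a parameter because older Illumina pipelines (1.3 to 1.7) encoded qualities with offset 64 rather than 33.

```python
from typing import Iterable, Iterator, Tuple

# (header, sequence, quality); hypothetical record layout for this sketch
Record = Tuple[str, str, str]


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a non-empty quality string for the given offset."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def quality_filter(
    records: Iterable[Record],
    min_mean_q: float = 20.0,  # assumed threshold, tune for your data
    max_n: int = 0,            # assumed: discard any read containing N
    offset: int = 33,          # use 64 for Illumina 1.3 to 1.7 encoding
) -> Iterator[Record]:
    """Yield only reads that pass the mean-quality and N-count checks."""
    for header, seq, qual in records:
        if not qual:
            continue  # skip empty reads
        if seq.upper().count("N") > max_n:
            continue
        if mean_phred(qual, offset) < min_mean_q:
            continue
        yield header, seq, qual


if __name__ == "__main__":
    reads = [
        ("@good", "ACGT", "IIII"),   # 'I' is Phred 40 at offset 33
        ("@lowq", "ACGT", "!!!!"),   # '!' is Phred 0 at offset 33
        ("@has_n", "ACNT", "IIII"),
    ]
    for header, seq, qual in quality_filter(reads):
        print(header, seq)
```

For paired-end data, a pair would normally be kept or dropped as a unit so that the two files stay in sync; that bookkeeping is left out here.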

score 0 · Answer 4 · 2011-05-17

0

13.6 years ago

Geparada ★ 1.5k

We want to do the same and we don't have a very high memory machine either, so we will probably use Cufflinks on the public Galaxy server or on Galaxy in the cloud.

http://main.g2.bx.psu.edu/root?tool_id=cufflinks

Has anybody tried Cufflinks standalone or through Galaxy?

1

Cufflinks is not a de novo assembler. It first requires the reads to be aligned to a reference genome, and then combines the aligned reads into transcripts.

ADD REPLY • link 13.6 years ago by Brad Chapman 9.7k

0

Today (I think) it isn't a problem, because there are so many genomes available.

ADD REPLY • link 13.6 years ago by Geparada ★ 1.5k