De Novo Metatranscriptomic Assembly Failing - Trinity, Velvet/Oases
13.1 years ago
Newvin ▴ 360

I'm attempting de novo assembly of metatranscriptomic data, which is admittedly a very resource-intensive problem. I have ~206 million paired-end Illumina reads, each 100 bp long, generated via RNA-seq on environmental samples. I can create assemblies with Trinity and Velvet/Oases using a small portion of the reads; however, when I attempt to assemble the metatranscriptome from the full set of reads, both programs run for a day or so and then fail while attempting to allocate memory. The server I am running on has 32 processors and 256 GB of RAM. I should also mention that for Velvet/Oases I am using K=61; I believe Trinity's K value is fixed at 25.
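For reference, a minimal sketch of the kind of Velvet/Oases run I mean (file names are placeholders and my exact extra options may differ):

    velveth velvet_k61 61 -fastq -shortPaired -separate reads_R1.fq reads_R2.fq   # build the k-mer hash at K=61
    velvetg velvet_k61 -read_trkg yes                                             # read tracking is needed for Oases
    oases velvet_k61                                                              # transcript assembly on top of the Velvet graph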

I am rather new at this. Does anyone have a sense of how unreasonable my parameters are? Is the idea of assembling 200 million reads ludicrous? I may be able to perform a dereplication step that would reduce the number of reads to ~50 million. Does anyone have assembly experience suggesting that I might have more success with only 50 million reads?

Thanks...

assembly transcriptome trinity velvet
13.1 years ago

You might need to speak to Titus Brown, who has used Bloom filters to put metagenomic (perhaps not metatranscriptomic) reads into manageable piles.

http://www.google.com/search?q=titus+brown+bloom+filters


I'd be interested in your results with digital normalization: http://ivory.idyll.org/blog/mar-12/diginorm-paper-posted.html. I think it might work better for metatranscriptomic data than partitioning will.
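A rough sketch of what digital normalization looks like with the khmer scripts (the k-mer size, coverage cutoff, and table sizes here are illustrative, not tuned for your data):

    # interleave the pairs, then normalize to a median k-mer coverage of 20
    interleave-reads.py reads_R1.fq reads_R2.fq > reads.pe.fq
    normalize-by-median.py -p -k 20 -C 20 -N 4 -x 4e9 reads.pe.fq
    # output goes to reads.pe.fq.keep; assemble that instead of the raw reads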

13.1 years ago
pmenzel ▴ 310

Assembly of that many reads is not unreasonable. Try SOAPdenovo for the assembly. If you filter out low-abundance k-mers (e.g. with the -d option of SOAPdenovo), memory consumption will decrease.
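For example, a minimal SOAPdenovo run with a low-abundance k-mer cutoff might look like this (the insert size, cutoff value, and file names are placeholders you would adjust for your library):

    # soap.config
    max_rd_len=100
    [LIB]
    avg_ins=300
    reverse_seq=0
    asm_flags=3
    q1=reads_R1.fq
    q2=reads_R2.fq

    # -K 61 to match the Velvet run; -d 2 drops k-mers seen at most twice
    SOAPdenovo-63mer all -s soap.config -K 61 -d 2 -p 32 -o soap_k61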

11.1 years ago
Dgg32 ▴ 90

I would cluster the reads with cd-hit using a high identity cutoff and record each cluster's read count in the FASTA headers so I can keep track of them. This step alone cut my sequence set in half without losing a single read (though it will certainly mask some of the heterogeneity in your sequences). Then Velvet with default settings will finish the job.
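Roughly, with cd-hit-est (the identity cutoff and word size are illustrative; per-cluster read counts can be pulled from the .clstr output and written into the headers with a small script):

    # collapse near-identical reads at 98% identity
    cd-hit-est -i reads.fasta -o reads_nr.fasta -c 0.98 -n 10 -M 0 -T 32
    # assemble the cluster representatives; the hash length here is just a placeholder
    velveth velvet_cdhit 31 -fasta -short reads_nr.fasta
    velvetg velvet_cdhit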
