Hi everyone! I'm new to the bioinformatics field and I'm having some problems with de novo assembly, so I would like to ask for suggestions.
General information about my work:
- Illumina paired-end, read length 150 bp
- whole-genome sequencing (estimated genome size is 224 million bp)
- 100x coverage
- around 70 million reads per file (forward and reverse)
- I used an Amazon Web Services EC2 instance, type m4.xlarge (4 vCPUs, 16 GiB RAM), to perform all of the following processes.
After trimming the reads, I tried to assemble with two programs, Velvet and ABySS, but neither worked.
In the case of Velvet, I ran velveth with this command:
velveth /home/ubuntu/velvet21 21 -shortPaired -separate -fastq.gz /home/ubuntu//149-6_1_val_1.fq.gz /home/ubuntu//149-6_2_val_2.fq.gz
and got output like this:
[0.000001] Reading FastQ file /home/ubuntu/149-6_1_val_1.fq.gz;
[0.002344] Reading FastQ file /home/ubuntu/149-6_2_val_2.fq.gz;
[924.933978] 139366234 sequences found in total in the paired sequence files
[924.933995] Done
[924.983130] Reading read set file /home/ubuntu/velvet21/Sequences;
[1228.465533] 139366234 sequences found
Killed
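If I understand correctly, that "Killed" line means the Linux out-of-memory (OOM) killer terminated velveth because the instance ran out of RAM. One way to check this on the same instance (a rough sketch; the exact kernel message can vary) is:
# Check whether the kernel OOM killer stopped the process
dmesg | grep -i -E 'out of memory|killed process'
# See how much memory is free before trying again
free -h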
However, when I tried it with a much smaller genome (4.8 million bp, 1.4 million reads per file), it worked!
In the case of ABySS, I ran this command:
abyss-pe k=21 name=abyss21 in='149-6_1_val_1.fq.gz 149-6_2_val_2.fq.gz'
The output came up like this:
ABYSS -k21 -q3 --coverage-hist=coverage.hist -s output21-bubbles.fa -o output21-1.fa 149-6_1_val_1.fq.gz 149-6_2_val_2.fq.gz
ABySS 2.0.2
ABYSS -k21 -q3 --coverage-hist=coverage.hist -s output21-bubbles.fa -o output21-1.fa 149-6_1_val_1.fq.gz 149-6_2_val_2.fq.gz
Reading `149-6_1_val_1.fq.gz'...
sparsehash FATAL ERROR: failed to allocate 10 groups
/usr/bin/abyss-pe:506: recipe for target 'output21-1.fa' failed
make: *** [output21-1.fa] Error 1
However, ABySS again ran successfully with a small synthetic data set from this page (ftp://ccb.jhu.edu/pub/dpuiu/Docs/ABYSS.html).
Does this have anything to do with RAM? How can I resolve this problem?
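One thing I am also wondering about (this is a guess, not something I have tuned): ABySS 2.0 is supposed to have a Bloom-filter mode that needs much less memory than the default sparsehash mode, enabled through the B (Bloom filter memory), H (number of hash functions) and kc (minimum k-mer count) parameters. A sketch of what I might try, with the B/H/kc values picked only as a starting point:
abyss-pe k=21 name=abyss21 B=8G H=3 kc=2 in='149-6_1_val_1.fq.gz 149-6_2_val_2.fq.gz'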
Thank you
Putita
I am not experienced with genome assemblies, so more experienced folks will tell you for sure, but 16 GB is pretty much nothing for many bioinformatics tasks. From what I have read, you can need hundreds of GB for de novo assemblies. I would start by checking whether, and from where, you can get a cluster/service/node with that amount of memory.
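If you do get access to a bigger machine, it may also be worth logging memory use while the assembler runs, so you can see how close you get to the limit. A rough sketch with standard Linux tools (run it in a second terminal):
# Append the used memory (in MB) to a log every 30 seconds
while true; do echo "$(date +%T) $(free -m | awk 'NR==2{print $3}') MB used"; sleep 30; done >> mem_usage.log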
I agree. Boost it up to at least 64 GB of RAM.
I increased the RAM and it works!
Thank you ATpoint and Kevin Blighe for your suggestion :)
Can I ask how much memory your job used in the end?
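In case it helps for future runs: wrapping the assembler in GNU time should report the peak memory directly (a sketch, assuming /usr/bin/time is the GNU version, as it usually is on Ubuntu):
/usr/bin/time -v velveth /home/ubuntu/velvet21 21 -shortPaired -separate -fastq.gz /home/ubuntu/149-6_1_val_1.fq.gz /home/ubuntu/149-6_2_val_2.fq.gz 2> velveth_time.log
# Look for the "Maximum resident set size (kbytes)" line in velveth_time.log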