I am trying to assemble a pooled metagenomic dataset with 762,909,004 reads. Will using tools such as BBNorm to reduce the coverage in order to speed up the assembly significantly affect the final assemblies?
To answer your question first: normalizing down to 100x should have no ill effect, and it may even help reduce errors if you have very high coverage. Several times I have obtained better assemblies from 60x- or 80x-normalized data than from the original datasets. Still, I suggest that you also assemble the original data so you have a baseline. I have assembled about 350 million reads with megahit in about 10 hours, so you should be able to get through your dataset in a couple of days. It is always possible that the original data is of such quality that it yields the best assembly.
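If it helps, here is a minimal BBNorm sketch for normalizing paired reads to roughly 100x; the file names are placeholders and the target/min values are only illustrative, so adjust them to your data:

# normalize paired reads to a ~100x target depth, discarding reads below depth 5
bbnorm.sh in=reads_R1.fq.gz in2=reads_R2.fq.gz \
  out=normalized_R1.fq.gz out2=normalized_R2.fq.gz \
  target=100 min=5 threads=16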
Thank you for this. I will try megahit. I have been using metaSPAdes and constantly running out of memory despite around 380 GB being allocated; however, I recently realised the default thread count is high, which could be sapping that memory. Do you have any experience with metaSPAdes?
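In case it matters, what I plan to try next is capping the threads and memory explicitly rather than leaving them at the defaults, something along these lines (paths and numbers are placeholders):

spades.py --meta \
  -1 reads_R1.fq.gz -2 reads_R2.fq.gz \
  -o metaspades_out \
  -t 16 -m 380   # -m is the memory limit in GB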
What metrics are you using to determine assembly quality here?
I try to assemble with both SPAdes and megahit, unless someone has already done the former. In my hands there is no major difference in the end when using the same data, but megahit is faster and requires less memory.
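For reference, a plain megahit run on paired reads is as simple as something like the following (file names and thread count are placeholders; the preset is optional and meant for large, complex metagenomes):

megahit -1 reads_R1.fq.gz -2 reads_R2.fq.gz \
  -o megahit_out -t 16 \
  --presets meta-large   # optional; adjusts the k-mer list for large, complex metagenomes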
Rather than explaining exactly what I mean by a better assembly, I will give you a real-life example. Below are the statistics of two metagenomic assemblies: the first by JGI using SPAdes, and the second by megahit on 100x-normalized data.
JGI:
A C G T N IUPAC Other GC GC_stdev
0.2266 0.2715 0.2724 0.2295 0.0000 0.0000 0.0000 0.5439 0.1398
Main genome scaffold total: 1885295
Main genome contig total: 1885295
Main genome scaffold sequence total: 1822.235 MB
Main genome contig sequence total: 1822.235 MB 0.000% gap
Main genome scaffold N/L50: 182520/1.697 KB
Main genome contig N/L50: 182520/1.697 KB
Main genome scaffold N/L90: 1279363/353
Main genome contig N/L90: 1279363/353
Max scaffold length: 315.134 KB
Max contig length: 315.134 KB
Number of scaffolds > 50 KB: 885
% main genome in scaffolds > 50 KB: 3.96%
Minimum Scaffold Length   Number of Scaffolds   Number of Contigs   Total Scaffold Length   Total Contig Length   Scaffold Contig Coverage
-----------------------   -------------------   -----------------   ---------------------   -------------------   ------------------------
All 1,885,295 1,885,295 1,822,235,338 1,822,235,338 100.00%
100 1,885,295 1,885,295 1,822,235,338 1,822,235,338 100.00%
250 1,828,031 1,828,031 1,809,418,206 1,809,418,206 100.00%
500 822,637 822,637 1,450,835,268 1,450,835,268 100.00%
1 KB 351,832 351,832 1,128,227,824 1,128,227,824 100.00%
2.5 KB 110,888 110,888 764,967,803 764,967,803 100.00%
5 KB 43,184 43,184 532,889,781 532,889,781 100.00%
10 KB 15,857 15,857 345,804,186 345,804,186 100.00%
25 KB 3,646 3,646 164,948,352 164,948,352 100.00%
50 KB 885 885 72,234,256 72,234,256 100.00%
100 KB 160 160 24,610,336 24,610,336 100.00%
250 KB 10 10 2,916,604 2,916,604 100.00%
megahit, 100x-normalized data:
A C G T N IUPAC Other GC GC_stdev
0.2321 0.2707 0.2696 0.2275 0.0000 0.0000 0.0000 0.5403 0.1384
Main genome scaffold total: 2102597
Main genome contig total: 2102597
Main genome scaffold sequence total: 1946.904 MB
Main genome contig sequence total: 1946.904 MB 0.000% gap
Main genome scaffold N/L50: 199888/1.495 KB
Main genome contig N/L50: 199888/1.495 KB
Main genome scaffold N/L90: 1454160/359
Main genome contig N/L90: 1454160/359
Max scaffold length: 349.476 KB
Max contig length: 349.476 KB
Number of scaffolds > 50 KB: 1210
% main genome in scaffolds > 50 KB: 5.11%
Minimum Scaffold Length   Number of Scaffolds   Number of Contigs   Total Scaffold Length   Total Contig Length   Scaffold Contig Coverage
-----------------------   -------------------   -----------------   ---------------------   -------------------   ------------------------
All 2,102,597 2,102,597 1,946,904,126 1,946,904,126 100.00%
100 2,102,597 2,102,597 1,946,904,126 1,946,904,126 100.00%
250 1,991,743 1,991,743 1,922,255,288 1,922,255,288 100.00%
500 896,124 896,124 1,518,425,677 1,518,425,677 100.00%
1 KB 337,791 337,791 1,140,044,398 1,140,044,398 100.00%
2.5 KB 103,295 103,295 790,133,988 790,133,988 100.00%
5 KB 42,619 42,619 582,498,384 582,498,384 100.00%
10 KB 17,492 17,492 409,934,452 409,934,452 100.00%
25 KB 4,588 4,588 215,123,703 215,123,703 100.00%
50 KB 1,210 1,210 99,396,790 99,396,790 100.00%
100 KB 246 246 35,830,593 35,830,593 100.00%
250 KB 15 15 4,286,097 4,286,097 100.00%
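For what it's worth, the summary tables above are the standard report from BBTools' stats.sh; the same report can be produced for any assembly with a one-liner like the one below (the FASTA name is a placeholder).

stats.sh in=final_assembly.fasta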
I use BBNorm and I don't think it merges reads. However, some reads may become singletons as their mates are removed during normalization. I use extract-paired-reads.py from khmer to split the reads into paired and single, and then feed them as such to the assembly programs.
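A minimal sketch of that splitting step, assuming the normalized reads sit in a single interleaved FASTQ (file names are placeholders): extract-paired-reads.py writes the surviving pairs to a .pe file and the orphaned reads to a .se file, which can then be handed to the assembler separately.

# split interleaved, normalized reads into proper pairs and orphans
extract-paired-reads.py normalized_interleaved.fq
# -> normalized_interleaved.fq.pe (pairs) and normalized_interleaved.fq.se (orphans)

# megahit accepts interleaved pairs via --12 and single reads via -r
megahit --12 normalized_interleaved.fq.pe -r normalized_interleaved.fq.se -o megahit_out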
What do you want to do with your metagenome? If you want to bin genomes, then such normalization isn't a good idea, because coverage is one of the variables used by nearly all binning programs.
Coverage is only one variable in addition to the other 256 (or 136, depending on how one counts tetranucleotide frequencies). It is not really that important, and binning can be done just fine without it. Besides, after the assembly one can map the original reads back onto the contigs, which will yield the coverage that can be used for binning.
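A minimal sketch of that mapping step with BBMap, assuming the original (un-normalized) paired reads and the final contigs (file names are placeholders); the covstats output gives per-contig average depth, which most binners accept as a coverage table.

# map the original reads back to the contigs and record per-contig average depth
bbmap.sh ref=contigs.fa in=reads_R1.fq.gz in2=reads_R2.fq.gz \
  covstats=covstats.txt threads=16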