Will coverage downsizing of metagenomic data drastically affect assembly?
1
1
Entering edit mode
4.0 years ago

I am trying to assemble a pooled metagenomic dataset with 762,909,004 reads. Will using tools such as BBNorm to downsize the coverage in order to speed up the assembly significantly affect the final assemblies?
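For reference, the kind of normalization I have in mind is roughly the following (a sketch with placeholder file names and a placeholder target depth; I have not settled on exact parameters):

    # BBNorm (BBTools): normalize paired reads to ~100x target depth,
    # discarding k-mers below a minimum depth of 5
    bbnorm.sh in=reads_R1.fq.gz in2=reads_R2.fq.gz \
        out=norm_R1.fq.gz out2=norm_R2.fq.gz \
        target=100 min=5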

Assembly • 1.2k views
ADD COMMENT
0
Entering edit mode

What do you want to do with your metagenome? If you want to bin genomes, then such normalization isn't a good idea, because coverage is one of the variables used by nearly all binning programs.

ADD REPLY
0
Entering edit mode

Coverage is only one variable in addition to 256 others (or 136, depending on how one counts tetranucleotide frequencies). It is not really that important, and binning can be done just fine without it. Besides, after the assembly one can map the original reads back onto the contigs, which will yield the coverage that can be used for binning.
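As a rough sketch of that post-assembly coverage step (assuming BBMap is available; file names are placeholders):

    # Map the original, un-normalized reads back to the assembled contigs
    # and write per-contig coverage statistics for use in binning
    bbmap.sh ref=contigs.fasta in=reads_R1.fq.gz in2=reads_R2.fq.gz \
        covstats=contig_coverage.txt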

ADD REPLY
3
Entering edit mode
4.0 years ago
Mensur Dlakic ★ 28k

To answer your question first: normalizing it down to 100x should have no ill effect, and it may even help reduce errors if you have very high coverage. Several times I have obtained better assemblies from 60x- or 80x-normalized data than from the original datasets. Still, I suggest that you assemble your original data first so you have a baseline. I have assembled about 350 million reads with megahit in about 10 hours, so you should be able to do it with your dataset in a couple of days. It is always possible the original data is of such quality that it yields the best assembly.
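In case it helps, a minimal megahit invocation looks something like this (a sketch; file names and thread count are placeholders, not a recommendation for your data):

    # Co-assemble the pooled paired-end reads with MEGAHIT
    megahit -1 pooled_R1.fq.gz -2 pooled_R2.fq.gz \
        -o megahit_out -t 32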

ADD COMMENT
0
Entering edit mode

Thank you for this. I will try megahit. I have been using metaSPAdes and constantly running out of memory despite around 380 GB being allocated; however, I recently realised the default thread count is high, which could be sapping that memory. Do you have any experience with metaSPAdes?
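For what it's worth, my invocation looks roughly like this (a sketch; the thread count and memory cap are values I am experimenting with, not recommendations):

    # metaSPAdes with an explicit thread count and memory limit (in GB);
    # fewer threads reduces per-thread buffer usage
    metaspades.py -1 reads_R1.fq.gz -2 reads_R2.fq.gz \
        -o metaspades_out -t 16 -m 380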

What metrics are you using to determine assembly quality here?

ADD REPLY
0
Entering edit mode

I try to assemble with both SPAdes and megahit, unless someone has already done the former. In my hands there is no major difference in the end when using the same data, but megahit is faster and requires less memory.

Rather than explaining exactly what I mean by a better assembly, I will give you a real-life example. Below are the statistics of two metagenomic assemblies: the first by JGI using SPAdes, and the second by megahit on 100x-normalized data.
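Statistics in this format can be generated with stats.sh from BBTools (a minimal sketch with a placeholder file name):

    # Summarize an assembly: GC content, scaffold/contig counts,
    # N50/L50, and the cumulative length distribution shown below
    stats.sh in=assembly.fasta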

JGI:

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2266  0.2715  0.2724  0.2295  0.0000  0.0000  0.0000  0.5439  0.1398

Main genome scaffold total:             1885295
Main genome contig total:               1885295
Main genome scaffold sequence total:    1822.235 MB
Main genome contig sequence total:      1822.235 MB     0.000% gap
Main genome scaffold N/L50:             182520/1.697 KB
Main genome contig N/L50:               182520/1.697 KB
Main genome scaffold N/L90:             1279363/353
Main genome contig N/L90:               1279363/353
Max scaffold length:                    315.134 KB
Max contig length:                      315.134 KB
Number of scaffolds > 50 KB:            885
% main genome in scaffolds > 50 KB:     3.96%


Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All              1,885,295       1,885,295   1,822,235,338   1,822,235,338   100.00%
    100              1,885,295       1,885,295   1,822,235,338   1,822,235,338   100.00%
    250              1,828,031       1,828,031   1,809,418,206   1,809,418,206   100.00%
    500                822,637         822,637   1,450,835,268   1,450,835,268   100.00%
   1 KB                351,832         351,832   1,128,227,824   1,128,227,824   100.00%
 2.5 KB                110,888         110,888     764,967,803     764,967,803   100.00%
   5 KB                 43,184          43,184     532,889,781     532,889,781   100.00%
  10 KB                 15,857          15,857     345,804,186     345,804,186   100.00%
  25 KB                  3,646           3,646     164,948,352     164,948,352   100.00%
  50 KB                    885             885      72,234,256      72,234,256   100.00%
 100 KB                    160             160      24,610,336      24,610,336   100.00%
 250 KB                     10              10       2,916,604       2,916,604   100.00%
ADD REPLY
0
Entering edit mode

megahit 100x data:

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2321  0.2707  0.2696  0.2275  0.0000  0.0000  0.0000  0.5403  0.1384

Main genome scaffold total:             2102597
Main genome contig total:               2102597
Main genome scaffold sequence total:    1946.904 MB
Main genome contig sequence total:      1946.904 MB     0.000% gap
Main genome scaffold N/L50:             199888/1.495 KB
Main genome contig N/L50:               199888/1.495 KB
Main genome scaffold N/L90:             1454160/359
Main genome contig N/L90:               1454160/359
Max scaffold length:                    349.476 KB
Max contig length:                      349.476 KB
Number of scaffolds > 50 KB:            1210
% main genome in scaffolds > 50 KB:     5.11%


Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All              2,102,597       2,102,597   1,946,904,126   1,946,904,126   100.00%
    100              2,102,597       2,102,597   1,946,904,126   1,946,904,126   100.00%
    250              1,991,743       1,991,743   1,922,255,288   1,922,255,288   100.00%
    500                896,124         896,124   1,518,425,677   1,518,425,677   100.00%
   1 KB                337,791         337,791   1,140,044,398   1,140,044,398   100.00%
 2.5 KB                103,295         103,295     790,133,988     790,133,988   100.00%
   5 KB                 42,619          42,619     582,498,384     582,498,384   100.00%
  10 KB                 17,492          17,492     409,934,452     409,934,452   100.00%
  25 KB                  4,588           4,588     215,123,703     215,123,703   100.00%
  50 KB                  1,210           1,210      99,396,790      99,396,790   100.00%
 100 KB                    246             246      35,830,593      35,830,593   100.00%
 250 KB                     15              15       4,286,097       4,286,097   100.00%
ADD REPLY
0
Entering edit mode

This is helpful, thank you. What tool do you use for normalization? BBNorm seems to output merged reads, which metaSPAdes will not accept.

ADD REPLY
0
Entering edit mode

I use BBNorm and I don't think it merges reads. However, some reads may become unpaired when their mates are removed during normalization. I use extract-paired-reads.py from khmer to split the reads into paired and single reads, and then feed them as such to the assembly programs.
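A rough sketch of that splitting step (the file name is a placeholder):

    # khmer: split a normalized, interleaved FASTQ into properly paired
    # reads (*.pe) and orphaned single reads (*.se)
    extract-paired-reads.py normalized.fq
    # then supply normalized.fq.pe as the paired library and
    # normalized.fq.se as the single-read library to the assembler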

ADD REPLY
0
Entering edit mode

Ah, so the BBNorm output is interleaved paired-end? I shall try this tool to split them, thank you.

ADD REPLY
