Larger-Than-Expected Total Scaffolds Size Using Nextera Mate Pairs
4
3
Entering edit mode
10.7 years ago
Rayan Chikhi ★ 1.5k

Has anyone performed assembly with Nextera mate pairs and has seen the following problem?

We're doing mammalian assemblies using Nextera 8 kbp-insert mate-pairs, and the results are abnormal. The total assembly size is way larger than expected: because it has 1 Gbp of NNN's in scaffolds. The total contigs size matches the genome size.

Some detailed information regarding the data and the assembly:

  • mammalian genome
  • Lib1: HiSeq 150 bp paired-end 500 bp insert, 30x coverage
  • Lib2: HiSeq 150 bp mate-pairs 3kbp insert (not sure about protocol), 10x coverage
  • Lib3: HiSeq 150 bp mate-pairs 8kbp insert (Nextera), 30x coverage
  • assembler: SOAPdenovo2 latest
  • Lib1 and Lib2 were adapter-trimmed using nesoni
  • Lib3 was adapter-trimmed using nextclip (which had a positive impact on scaffold N50) and to those familiar with nextclip, we kept the A-B-C categories only.

Some extra steps we tried:

  • When we assemble Lib1 and Lib2 together, the total scaffolds size is what we expect (3 Gbp, 30 Kbp scaffold N50). So all is fine here.
  • When we assemble all libs together, the total scaffolds size is too high (4 Gbp, 150 Kbp scaffold N50).
  • When Lib3 is untrimmed, the total scaffolds size is terrible (6 Gbp) and contigs size is also odd (3.5 Gbp).
  • Whether Lib3 is included in the contigs step or not (asm_flags=2 or 3) does not have a significant impact on the results.
assembly • 7.6k views
ADD COMMENT
1
Entering edit mode

Did you check for identical duplicate reads - i.e. both ends are identical? Happens a lot in mate pair libraries.

ADD REPLY
0
Entering edit mode

Yes, Nextclip filtered those. For reference this is the duplicated reads report -- 60% were unique.

  n       Count   Percent
  1       149720460       57.95
  2       57356740        23.22
  3       25391361        10.28
  4       9071636 3.67
ADD REPLY
1
Entering edit mode

I'll add that Nextclip solved the Nextera artefact of very low insert mate-pairs.

Original library, mapped to the assembly with Lib1+Lib2, histogram created with bamstats.

After nextclip.

after nextclip

ADD REPLY
0
Entering edit mode

Did you map the reads on the GapClosed assembly or the scaffolded one ?

ADD REPLY
1
Entering edit mode

Rayan, would you please send me your SOAPdenovo2 configuration file, and the command line to my email addr. luoruibang@genomics.org.cn, thanks Manoj Samanta for directing me here.

ADD REPLY
0
Entering edit mode

I've sent it to your email, but I actually see no reason not to make it public. Here it is:

cmdline:

SOAPdenovo-63mer all -s soapdenovo2.config -K 61 -o [output prefix] -R -a 400 -p 30

config file:

max_rd_len=200
[LIB]
avg_ins=200
reverse_seq=0
asm_flags=3
rank=1
map_len=32
q1=MD_NoIndex_R1.clipped.fastq
q2=MD_NoIndex_R2.clipped.fastq

[LIB]
avg_ins=2859
reverse_seq=1
asm_flags=3
rank=2
map_len=32
q1=M273_R1.clipped.fastq
q2=M273_R2.clipped.fastq

[LIB]
avg_ins=6455
reverse_seq=1
asm_flags=3
rank=3
map_len=32
q1=8kbp/nextclip/nextclipped_ABC_R1.fastq
q2=8kbp/nextclip/nextclipped_ABC_R2.fastq
ADD REPLY
0
Entering edit mode

SOAPdenovo2 authors requested it, here is the log for the map and scaff steps: http://pastebin.com/bJ2a7dUp

ADD REPLY
0
Entering edit mode

A couple of people have recommended me to try another scaffolder -- that's a good idea, perhaps this is a SOAPdenovo2 specific problem, I'll explore this possibility.

ADD REPLY
0
Entering edit mode

Lars Arvestad ‏requested the scaffolds distribution length (Lib1+Lib2, prior to scaffolding with Lib3), here it is:

http://s30.postimg.org/htqcz7581/output.png

ADD REPLY
0
Entering edit mode
10.7 years ago

Perhaps you can try out these pre-processing steps and see if that helps?

http://genomics-pubs.princeton.edu/prv/scripts.shtml#Part_I

You would skip step #3 unless you were worried about contamination (and that would be specific to your own experiment anyways)

ADD COMMENT
0
Entering edit mode

Thanks for answering. My understanding is that step #1 should not have any influence on the assembler (SOAPdenovo2 doesn't take into account qualities as far as I'm aware), and step #2 (adaptor trimming) is already taken care of by Nextclip. Host contaminant (step #3) is difficult to believe; if the extra 400 Mbp - 1 Gbp in scaffolds was contamination, it'd also be present in the contigs.

ADD REPLY
1
Entering edit mode

True - but the mononucleotide and quality trimming might be useful

ADD REPLY
0
Entering edit mode

Good point -- I realize that Nextclip didn't do quality trimming. I'll run nesoni on nextclip output and report the assembly results.

ADD REPLY
0
Entering edit mode

That was a good idea apparently. Quality-trimming and adapter-trimming the reads using Nesoni, _after_ adapter-trimming them with Nextclip, further removed 20% of bases of low quality, or where the Nextera adapter was still present. I ran nextclip with recommended parameters from the manual (-min_length 25 --trim_ends 0 --remove_duplicates) and nesoni with default parameters.

ADD REPLY
0
Entering edit mode

To follow up: turns out that Nextclip-trimmed reads produced overall sightly more contiguous scaffolds (with SSPACE) than the Nextclip+Nesoni-trimmed reads (96 Kb vs 92 Kb, same assembly size). I suspect that further quality-trimming the reads reduces coverage.

ADD REPLY
0
Entering edit mode
10.7 years ago
Rayan Chikhi ★ 1.5k

Answering my own question because.. problem was solved by using a stand-alone scaffolder! instead of SOAPdenovo2's. Thanks to those who suggested it on Twitter.

I ran SSPACE, using the Nextera library (Lib3) trimmed by Nextclip, to scaffold the Lib1+Lib2 assembly. The results are now reasonable: 3.2 Gbp assembly, only 150 Mbp were added as NN's. The scaffold N50 is 95 Kb, which isn't as good as the 150 Kb produced by SOAPdenovo2, but at least the assembly doesn't have 1 Gbp of N's..

Other runs of SSPACE using different trimming parameters (nesoni only, nextclip + nesoni) did not produce significantly different results.

I'll report results with the BESST scaffolder once I have them -- for some reason Bowtie of SSPACE ran orders of magnitude faster than BWA-MEM of BESST on my node.

ADD COMMENT
1
Entering edit mode

What do you mean by 'for some reason'? Bowtie is for exact or near exact search only, whereas BWA-MEM is more versatile (and hence the cost).

ADD REPLY
0
Entering edit mode

We still need to know why SOAPdenovo2 did not work. Maybe Ruibang will help.

ADD REPLY
0
Entering edit mode

Yes, I agree...

ADD REPLY
0
Entering edit mode

I realize that Bowtie has less to do (it's in 0-errors mode in SSPACE), but I said "for some reason" because I suspect that my cluster node has other OS or hardware issues. Other benchmarks led me to believe so. BWA-MEM shouldn't take 24 hours to align 100M pairs using 15 threads, yet it did.

ADD REPLY
1
Entering edit mode

I suspect it is OS-related issues, but BWA-MEM can get slow in repetitive regions. Is that why SOAPdenovo failed?

http://www.homolog.us/blogs/blog/2014/03/31/bwa-mem-gets-very-slow-in-highly-repetitive-regions/

ADD REPLY
0
Entering edit mode

Those mate-pairs, despite being quality-filtered by Nextclip, indeed had significantly lower mapping-quality than other data that I typically map.

mapq mean       44.8779
mapq stdev      23.5473

Anecdotally, human tumor paired-end reads mapped to hg19 have the following stats:

mapq mean       52.3832
mapq stdev      17.3866

Stats computed using sam-stats from ea-tools.

ADD REPLY
0
Entering edit mode

"BESST scaffolder" ? Could you enlighten me? Where could I find any details on this (new?) scaffolder?

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Thanks a lot for the link !!

ADD REPLY
0
Entering edit mode
10.6 years ago
Rayan Chikhi ★ 1.5k

Also following up on a SOAPdenovo2-only assembly, I was able to get the assembly size down to 3.4 Gbp following advice from the SOAPdenovo team.

The stats are now: 600 Mbp of N's (instead of 1 Gbp before), 168 Kb scaf N50.

The changes were (in suspected order of importance, only a guess):

  1. Performed Nextclip on the 3kbp library as well, as it was also a Nextera lib
  2. Better insert sizes estimations in soapdenovo config: 250 for PE, 2500 for Lib2, 7500 for Lib3

Also noted that Lib2 had unusually large standard deviation (~1500), which may explain the bad scaffolding.

ADD COMMENT
0
Entering edit mode

By running SOAPdenovo2's GapCloser, the number of N's dropped from 600 Mbp to 160 Mbp and the assembly size dropped from 3.4 Gbp to 3.3 Gbp. While I cannot verify that the filled gaps are correct, I can only assume that SOAPdenovo2 over-estimated gap sizes during scaffolding, and gap-filling fixed that.

ADD REPLY
0
Entering edit mode
10.5 years ago
Rayan Chikhi ★ 1.5k

My final conclusion, whenever an assembly is larger than expected:

  1. Use a different scaffolder

    or

  2. SOAPdenovo2 should produce an assembly of correct size, when parameters are tweaked in the following way:

    • Run Nextclip on all Nextera libs (if applicable)
    • Run SOAPdenovo2's GapCloser: it will fill gaps, which as a byproduct will correct bad gap size estimation for the filled gaps
    • Provide accurate insert size estimations to SOAPdenovo2. (What I typically do is a quick assembly using SOAPdenovo2 or Minia with inaccurate parameters, then extract contigs above 10 kbp using seqtk, then use this script https://gist.github.com/rchikhi/7281991.)

Thanks to everyone who responded to this thread, and to SOAPdenovo's team!

ADD COMMENT
0
Entering edit mode

For me Nextclip works well with the scaffolded output of SOAPdenovo, I mean my mate pairs are well oriented in RF sens. But once GapCloser done, I come up with a lot of FR or tandem mate pairs.

Did you have the same problem ?

I map my reads with bowtie2 and compute my insert size with Picard.

ADD REPLY

Login before adding your answer.

Traffic: 2572 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6