Has anyone performed an assembly with Nextera mate pairs and seen the following problem?
We're doing mammalian assemblies using Nextera 8 kbp-insert mate pairs, and the results are abnormal. The total assembly size is much larger than expected because the scaffolds contain 1 Gbp of N's. The total contig size matches the genome size.
Some detailed information regarding the data and the assembly:
- mammalian genome
- Lib1: HiSeq 150 bp paired-end 500 bp insert, 30x coverage
- Lib2: HiSeq 150 bp mate-pairs, 3 kbp insert (not sure about protocol), 10x coverage
- Lib3: HiSeq 150 bp mate-pairs, 8 kbp insert (Nextera), 30x coverage
- assembler: SOAPdenovo2 latest
- Lib1 and Lib2 were adapter-trimmed using nesoni
- Lib3 was adapter-trimmed using NextClip, which had a positive impact on scaffold N50. For those familiar with NextClip, we kept the A, B and C categories only.
Some extra steps we tried:
- When we assemble Lib1 and Lib2 together, the total scaffold size is what we expect (3 Gbp, 30 kbp scaffold N50). So all is fine here.
- When we assemble all libs together, the total scaffold size is too high (4 Gbp, 150 kbp scaffold N50).
- When Lib3 is untrimmed, the total scaffold size is far too high (6 Gbp) and the total contig size is also off (3.5 Gbp).
- Whether or not Lib3 is included in the contig step (asm_flags=3 or 2) does not have a significant impact on the results.
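For anyone who wants to compare numbers on their own assemblies, here is a minimal sketch of how the figures quoted above (total size, amount of N's, N50) can be computed from a scaffold or contig FASTA file. It is not what SOAPdenovo2 reports internally; the function names are made up for illustration, and it assumes a plain uncompressed FASTA.

```python
def read_fasta_lengths(lines):
    """Yield (length, n_count) for each record in an iterable of FASTA lines."""
    length = n_count = 0
    seen_header = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                yield length, n_count
            length = n_count = 0
            seen_header = True
        elif line:
            length += len(line)
            n_count += line.upper().count("N")
    if seen_header:
        yield length, n_count


def n50(lengths):
    """Largest L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def assembly_stats(lines):
    """Summarise total size, gap (N) content and N50 of a FASTA stream."""
    records = list(read_fasta_lengths(lines))
    lengths = [length for length, _ in records]
    total = sum(lengths)
    n_bases = sum(n for _, n in records)
    return {
        "sequences": len(lengths),
        "total_bp": total,
        "n_bp": n_bases,
        "non_n_bp": total - n_bases,
        "n50": n50(lengths),
    }

# Hypothetical usage (the file name is a placeholder):
#   with open("scaffolds.fa") as fh:
#       print(assembly_stats(fh))
```

If `non_n_bp` is close to the genome size while `total_bp` is not, the excess is in the gaps, which is the situation described in this post.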
Did you check for identical duplicate reads - i.e. both ends are identical? Happens a lot in mate pair libraries.
Yes, NextClip filtered those. For reference, this is the duplicated reads report -- 60% were unique.
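For readers who want to run a similar check independently, here is a rough sketch of an exact-match duplicate count on a pair of FASTQ files. This is not necessarily how NextClip detects duplicates; it simply treats two pairs as duplicates when both ends have identical sequences, and it reports distinct pairs divided by total pairs. The function names are hypothetical, and holding every pair in memory is only practical on a subsample of a 30x mammalian library.

```python
from collections import Counter


def read_fastq_seqs(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def unique_pair_fraction(r1_lines, r2_lines):
    """Fraction of distinct (read1, read2) sequence pairs among all pairs."""
    counts = Counter(zip(read_fastq_seqs(r1_lines), read_fastq_seqs(r2_lines)))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0

# Hypothetical usage (file names are placeholders):
#   with open("lib3_R1.fastq") as r1, open("lib3_R2.fastq") as r2:
#       print(unique_pair_fraction(r1, r2))
```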
I'll add that NextClip solved the Nextera artefact of very short-insert mate pairs.
Original library, mapped to the assembly with Lib1+Lib2, histogram created with bamstats.
After nextclip.
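For those without bamstats, a comparable before/after histogram can be approximated from the alignments directly. The sketch below bins the absolute template length (SAM column 9, TLEN) of first mates whose mate maps to the same reference sequence. It reads SAM text lines, e.g. the output of `samtools view`; the function name and the 500 bp default bin size are arbitrary choices for illustration, not anything bamstats does.

```python
from collections import Counter


def insert_size_histogram(sam_lines, bin_size=500):
    """Bin |TLEN| of first-in-pair reads mapped to the same reference as their mate."""
    hist = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        tlen = int(fields[8])
        # Keep first mates only (0x40); skip unmapped read or mate (0x4, 0x8),
        # secondary (0x100) and supplementary (0x800) alignments.
        if not flag & 0x40 or flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue
        # RNEXT "=" means the mate is on the same reference sequence.
        if fields[6] != "=" or tlen == 0:
            continue
        hist[(abs(tlen) // bin_size) * bin_size] += 1
    return dict(sorted(hist.items()))
```

With reads piped in on standard input, `insert_size_histogram(sys.stdin)` returns a dict mapping the lower bound of each bin to a pair count. A large peak in the lowest bins would correspond to the short-insert artefact mentioned above.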
Did you map the reads to the GapClosed assembly or the scaffolded one?
Rayan, would you please send your SOAPdenovo2 configuration file and command line to my email address, luoruibang@genomics.org.cn? Thanks to Manoj Samanta for directing me here.
I've sent it to your email, but I actually see no reason not to make it public. Here it is:
cmdline:
config file:
The SOAPdenovo2 authors requested it, so here is the log for the map and scaff steps: http://pastebin.com/bJ2a7dUp
A couple of people have recommended that I try another scaffolder. That's a good idea; perhaps this is a SOAPdenovo2-specific problem. I'll explore this possibility.
Lars Arvestad requested the scaffold length distribution (Lib1+Lib2, prior to scaffolding with Lib3). Here it is: