Confused About Sequence Assembly Results
2
13.2 years ago
Plantae ▴ 390

I used an Illumina GA II to sequence two paired-end DNA libraries:

One was sequenced in forward-reverse orientation, with a 500 bp insert size (LIB1, FR).

The other was sequenced in reverse-forward orientation, with a 2 kb insert size; this library was built after circularization (LIB2, RF).

When I assemble the reads, I get curious results. If I do not reverse-complement the second library, I get a lower N50, but more reads can be assembled (N50 25k, 480M scaffolds/contigs were assembled).

When I reverse-complement the second library, I get a higher N50, but many reads cannot be assembled (N50 46k, only 360M scaffolds/contigs were assembled).

I thought the second setting (reverse-complementing the LIB2 reads) was right, but how could the wrong setting assemble more contigs/scaffolds?

The assembler I used is SOAPdenovo.

assembly • 4.9k views
4
13.2 years ago

First, you seem to be talking about two different things. Does "480/360M" refer to the number of scaffolds/contigs, or to the number of reads assembled?

As contig counts, 480/360M would be unrealistically high for any assembly. It is also no surprise that wrong settings lead to smaller and more numerous contigs. So I think this is not what you meant, and I will instead assume you are referring to the number of reads.

You observed that after using the correct reverse-forward orientation, significantly fewer reads were assembled. This can sometimes be due to heavy contamination of your 2 kb library (which is reverse-forward) with paired-end reads (which are forward-reverse). This is a common problem with the Illumina mate-pair protocol. You can map your 2 kb library reads to some of your largest contigs and see how many of them are indeed separated by about 2 kb and in reverse-forward orientation.
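One way to run that check, as a rough sketch: once the 2 kb reads are mapped to the largest contigs, each pair can be classified by orientation and span. The input tuples, the fixed `read_len`, and the `expected`/`tolerance` defaults are assumptions for illustration; in practice the coordinates and strands would come from whatever aligner output is available.

```python
from collections import Counter

def classify_pair(pos1, strand1, pos2, strand2, read_len):
    """Return (orientation, span) for two mates aligned to the same contig.

    pos1/pos2 are 0-based leftmost alignment coordinates; strand is '+' or '-'.
    """
    # Order the mates by leftmost coordinate so "left" and "right" are well defined.
    (lpos, lstrand), (rpos, rstrand) = sorted([(pos1, strand1), (pos2, strand2)])
    span = rpos + read_len - lpos  # outer distance covered by the pair
    if lstrand == "+" and rstrand == "-":
        orientation = "FR"  # mates point towards each other (paired-end style)
    elif lstrand == "-" and rstrand == "+":
        orientation = "RF"  # mates point away from each other (mate-pair style)
    else:
        orientation = "same-strand"
    return orientation, span

def summarize(pairs, read_len, expected=2000, tolerance=0.5):
    """Count pairs by (orientation, span within tolerance of the expected insert)."""
    counts = Counter()
    for pos1, strand1, pos2, strand2 in pairs:
        orientation, span = classify_pair(pos1, strand1, pos2, strand2, read_len)
        near_expected = abs(span - expected) <= tolerance * expected
        counts[(orientation, near_expected)] += 1
    return counts

# Made-up alignments: two genuine mate pairs and one short FR contaminant.
toy_pairs = [(1000, "-", 2900, "+"), (5000, "+", 5250, "-"), (100, "-", 2150, "+")]
for key, count in sorted(summarize(toy_pairs, read_len=100).items()):
    print(key, count)
```

A large share of FR pairs with short spans in the 2 kb library would point to paired-end contamination rather than a wrong orientation setting.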

How many reads are in your 2 kb library? A drop of 120M reads sounds like a lot, though.

1

How bad is the contamination? Do you have a histogram to post? Also, make sure that in the SOAPdenovo configuration file you set asm_flags=2 for the 2k library, so that it is used for scaffolding only.
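For reference, a minimal sketch of what the two `[LIB]` sections might look like, generated from Python so the values are easy to change. The keys (`max_rd_len`, `avg_ins`, `reverse_seq`, `asm_flags`, `rank`, `q1`, `q2`) are SOAPdenovo config keys; the read length, insert sizes, ranks, and file names are placeholders, not the poster's actual settings.

```python
def render_lib(avg_ins, reverse_seq, asm_flags, rank, q1, q2):
    """Render one [LIB] section of a SOAPdenovo config file as text."""
    return "\n".join([
        "[LIB]",
        f"avg_ins={avg_ins}",
        f"reverse_seq={reverse_seq}",  # 0 = forward-reverse, 1 = reverse-forward
        f"asm_flags={asm_flags}",      # 1 = contigs only, 2 = scaffolds only, 3 = both
        f"rank={rank}",                # order in which libraries are used for scaffolding
        f"q1={q1}",
        f"q2={q2}",
    ])

config_text = "\n".join([
    "max_rd_len=100",  # assumed maximum read length
    # Short-insert FR library: used for both contigs and scaffolds.
    render_lib(500, 0, 3, 1, "lib1_1.fq", "lib1_2.fq"),
    # 2 kb RF mate-pair library: scaffolding only, as suggested above.
    render_lib(2000, 1, 2, 2, "lib2_1.fq", "lib2_2.fq"),
])
print(config_text)
```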

0

Hi, 480/360 actually means 480 Mb/360 Mb: the total length of contigs and scaffolds after assembly. Sorry for not being clear.

The situation is that the wrong setting gives fewer contigs, while the right setting gives many more:
Setting FR: ~100k contigs and scaffolds in total, N50 25k, N90 4.5k
Setting RF: ~290k contigs and scaffolds in total, N50 46k, N90 280 bp

It seems likely that, under the right setting, a lot of the mate-pair information could not be used by the assembler, so too many short contigs were left unscaffolded.
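A high N50 together with a tiny N90 is not contradictory, and a small sketch shows why. The contig lengths below are made up for illustration (only the 280 bp value echoes the N90 reported above): a few long scaffolds set the N50, while a long tail of short contigs sets the N90.

```python
def nx(lengths, fraction):
    """Length L such that sequences of length >= L cover `fraction` of the total bases."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    raise ValueError("lengths must be non-empty")

# Three long scaffolds plus 100 leftover short contigs (hypothetical values).
toy_lengths = [50000, 40000, 30000] + [280] * 100
print("N50:", nx(toy_lengths, 0.5))
print("N90:", nx(toy_lengths, 0.9))
```

Here the three long sequences hold most of the bases, so the N50 is large, but reaching 90% of the total requires dipping into the short contigs, which drags the N90 down.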

0

I have checked the data: the 2k library has two insert-size peaks (300 bp and 2.4 kb). If the 2 kb library is heavily contaminated, can the data still be used for assembly, or should I rebuild and resequence the 2 kb libraries?

0

Can you give me the number of scaffolds only, in both settings?

0

Hi,
Setting for FR: 32319 scaffolds
Setting for RF: 21301 scaffolds

I don't know how to post histograms on this forum. About 25% of the mate pairs have an insert size between 1 and 400 bp.
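Without an image, the same information can be posted as text. A minimal sketch, assuming the insert sizes have already been extracted from the mapped pairs into a list; the sizes below are invented, and the 400 bp cutoff and 200 bp bin width are arbitrary choices.

```python
from collections import Counter

def short_insert_fraction(insert_sizes, cutoff=400):
    """Fraction of pairs whose insert size is at or below `cutoff` bp."""
    if not insert_sizes:
        raise ValueError("insert_sizes must be non-empty")
    return sum(1 for size in insert_sizes if size <= cutoff) / len(insert_sizes)

def bin_counts(insert_sizes, bin_width=200):
    """Count pairs per bin; keys are the lower edge of each bin in bp."""
    counts = Counter((size // bin_width) * bin_width for size in insert_sizes)
    return dict(sorted(counts.items()))

# Invented insert sizes with a short peak and a peak near 2.4 kb.
toy_sizes = [300, 320, 2400, 2300, 2500, 2450, 280, 2380]
width = 200
print(f"short-insert fraction: {short_insert_fraction(toy_sizes):.3f}")
for start, count in bin_counts(toy_sizes, width).items():
    print(f"{start:>5}-{start + width - 1:<5} {'#' * count}")
```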

0

And the assembled size of the scaffolds, in both settings? Your contamination doesn't sound like a huge problem, though. A few of my own libraries have been 50% contaminated.

0

FR scaffolds: 32319 scaffolds assemble to 454 Mb
RF scaffolds: 21301 scaffolds assemble to 300 Mb

With such a high level of spurious RF mate pairs, did you filter out these reads?

0

No, I did not use them, as they did more harm than good to my assembly. I would have filtered out the bad mates if I knew the truth from a reference or a close relative, but that is a luxury one does not always have in de novo assembly. Do the two sets of scaffolds contain a similar number of Ns? Also, if you kept the SOAPdenovo logs, what insert size does SOAPdenovo infer for the 2k library in the scaffolding step?

0

For scaffolds:
FR: 23 Mb of gaps, 289850 gaps in total
RF: 38 Mb of gaps, 57517 gaps in total

I guess the problem was caused by the "GapCloser" step used by SOAP. With the FR setting, several contigs could be used to fill gaps in the scaffolds.

I have the RF run logs only. For the 2k libraries:
2k_libA,insert size-1975. estimated PE size 56,insert_size estimated: 1180
2k_libB,insert size-2195. estimated PE size 45,insert_size estimated: 0
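Gap totals like these can be recomputed directly from the scaffold sequences. A minimal sketch, assuming each gap is represented as a run of N characters and that the sequences have already been read into Python strings; the two toy sequences are invented.

```python
import re

def gap_stats(sequences):
    """Return (number of gaps, total gap bases), where a gap is a run of N/n."""
    n_gaps = 0
    gap_bases = 0
    for seq in sequences:
        for run in re.finditer(r"[Nn]+", seq):
            n_gaps += 1
            gap_bases += run.end() - run.start()
    return n_gaps, gap_bases

toy_scaffolds = ["ACGTNNNNACGT", "ACNNGTNACGT"]
gaps, bases = gap_stats(toy_scaffolds)
print(f"{gaps} gaps, {bases} gap bases, mean {bases / gaps:.1f} bp per gap")
```

Dividing the totals reported above gives a mean gap of roughly 79 bp for the FR run and roughly 660 bp for the RF run, so the two assemblies differ in gap size as well as gap count.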

0
13.2 years ago
Fabian Bull ★ 1.3k

Could you describe your SOAPdenovo run in a little more detail? For example, could you paste your config?

In general, I think your pipeline could be better. You didn't say anything about preprocessing. There are programs for removing contamination and for quality trimming. These preprocessing steps generally improve the outcome of an assembly significantly.

Furthermore, libraries with larger insert sizes should not be used for contig assembly. None of the assemblers I know (CLC, SOAP, Newbler, MIRA) shows significantly better results if you include these large-insert reads. Those reads should be used for scaffolding instead (e.g. with SSPACE).
