Question

SPAdes insert size

8.9 years ago

treitlis ▴ 40

Hello,

I have a question about genome assembly.

I was trying to use SPAdes to assemble a genome de novo. The data were obtained from a culture that has some bacterial contamination, because the organism does not live without bacteria and even filtration techniques don't remove all of them.

For sequencing we used 3 libraries with insert sizes of 500, 5,000 and 10,000 bp. These were sequenced on a MiSeq (2x300).

Reads were trimmed, and then the genome was assembled with SPAdes.

The problem is the average insert size estimation in SPAdes. I always get this warning:

Estimated mean insert size 316.923 is very small compared to read length 300

I use default parameters, and I know that SPAdes is not the best assembler for long reads, but I wanted to try it.

For the 500 bp library the estimate is very close (475 bp), but for the 5k library the estimated average insert size is around 316 bp, and for the 10k library it is around 426 bp.
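To make the mismatch explicit, here is a minimal sketch that compares the insert sizes stated above with the means SPAdes reported. The `flag_suspicious` helper and its 50% tolerance are assumptions made for illustration, not anything SPAdes itself does.

```python
# Expected insert sizes (bp) as stated in the post, paired with the
# mean insert sizes that SPAdes estimated for each library.
libraries = {
    "500 bp": (500, 475),
    "5 kb": (5000, 316),
    "10 kb": (10000, 426),
}


def flag_suspicious(expected, estimated, tolerance=0.5):
    """Return True when the estimate deviates from the expected size
    by more than `tolerance`, expressed as a fraction of the expected size.

    The default tolerance is an arbitrary, hypothetical threshold.
    """
    return abs(estimated - expected) / expected > tolerance


for name, (expected, estimated) in libraries.items():
    ratio = estimated / expected
    status = "suspicious" if flag_suspicious(expected, estimated) else "ok"
    print(f"{name}: estimated/expected = {ratio:.3f} ({status})")
```

With these numbers only the 500 bp library passes the check; the other two estimates are a small fraction of what was expected.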

Because of this, I don't get long scaffolds. The number of Ns in the final assembly is also extremely low (4,613 bases marked N in a 70 MB assembly). We intentionally prepared the libraries this way to get a good assembly with long scaffolds. The longest scaffold is 150k, and there are just 133 scaffolds above 50k, which account for roughly 10% of the entire 70 MB of data.

Should I try a different assembler? Is this normal? I don't expect that the sequencing company prepared three libraries with 500 bp inserts. Can you suggest another good assembler for multi-cell data? I have little experience in genomics, and most of my previous work was on single-cell genomes, but I know that SPAdes can also be used for data obtained from multi-cell cultures.

Thank you,

Sebastian

SPAdes genome miseq • 4.6k views

"For sequencing we used 3 libraries with insert sizes of 500, 5,000 and 10,000 bp. These were sequenced on a MiSeq (2x300)."

Can we get a clarification about "libraries"? Are the libraries you refer to above plasmid/cosmid libraries? If they are sequencing libraries, then the 5 kb and 10 kb ones must be mate-pair; is that correct? In that case, if you are getting a 426 bp insert size with the 10 kb library, the simplest explanation is probably that you are not using SPAdes correctly. Perhaps you can provide the command line you used to run SPAdes.
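For reference, a rough sketch of how one paired-end and two mate-pair libraries might be declared to SPAdes. It assumes a SPAdes 3.x release that accepts the `--pe<#>-1/-2` and `--mp<#>-1/-2` options (check `spades.py --help` for your version). The `build_spades_command` helper, the file names and the output directory are hypothetical.

```python
import shlex


def build_spades_command(pe_reads, mp_libraries, out_dir):
    """Build a spades.py argument list (hypothetical helper).

    pe_reads: (forward, reverse) FASTQ paths for the short-insert library.
    mp_libraries: list of (forward, reverse) FASTQ paths, one tuple per
        mate-pair library.
    out_dir: output directory passed to -o.
    """
    cmd = ["spades.py", "--pe1-1", pe_reads[0], "--pe1-2", pe_reads[1]]
    # Mate-pair libraries are numbered separately from paired-end ones.
    for i, (fwd, rev) in enumerate(mp_libraries, start=1):
        cmd += [f"--mp{i}-1", fwd, f"--mp{i}-2", rev]
    cmd += ["-o", out_dir]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_spades_command(
    ("lib500_R1.fastq.gz", "lib500_R2.fastq.gz"),
    [
        ("lib5k_R1.fastq.gz", "lib5k_R2.fastq.gz"),
        ("lib10k_R1.fastq.gz", "lib10k_R2.fastq.gz"),
    ],
    "spades_out",
)
print(shlex.join(cmd))
```

Depending on how the mate-pair reads were processed, an orientation option such as `--mp1-fr` may also be needed; that detail is not covered by this sketch.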

8.9 years ago by GenoMax 152k

Hi there, thank you very much. It was such a stupid mistake on my part.

On our server's wiki page the libraries were marked as paired-end. Unfortunately, the libraries were prepared by the sequencing company, and the person who took the data and put it on the web page annotated it incorrectly.

Thank you again,

8.9 years ago by treitlis ▴ 40