Question

Smaller Assembled Genome Size Than Expected

1

12.8 years ago

Rahul Sharma ▴ 660

Dear all, I am assembling a 40 Mb genome with an expected coverage of 181x. I am using 76 bp Illumina reads with an insert size of 200 bp (SD 20 bp). I have tried Velvet for these assemblies; 86-99% of the reads were used, with an N50 of 80 kb (k-mer settings 21,55,2). The strange thing is that I get only a 19 Mb genome from every assembly. The whole genome has been covered during library preparation. What could be the reason for this? Could it be due to repeat elements, given that some of my NODEs have coverage above 5000x? I would appreciate your suggestions.

Thanks in advance, Rahul

illumina velvet assembly next-gen sequencing repeats • 4.9k views

ADD COMMENT • link updated 10.9 years ago by Adrian Pelin ★ 2.6k • written 12.8 years ago by Rahul Sharma ▴ 660

0

I think there's a good chance that some elements are over-covered. Try reducing the amount of data you assemble (e.g. from 181x to 40x) and see if you get the same results. Also, check this: http://www.illumina.com/Documents/products/technotes/technote_denovo_assembly_ecoli.pdf
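A minimal sketch of the downsampling suggested above, assuming paired reads stored as two plain 4-line FASTQ files in the same order. The function names and the file names in the usage note are hypothetical; the only number taken from the thread is the 181x to 40x target.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as tuples of 4 lines (assumes unwrapped FASTQ)."""
    while True:
        record = tuple(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=1):
    """Keep each read pair with probability `fraction`.

    Both mates are kept or dropped together so the files stay in sync.
    Returns (pairs kept, pairs seen).
    """
    rng = random.Random(seed)
    kept = total = 0
    for rec1, rec2 in zip(read_fastq(r1_in), read_fastq(r2_in)):
        total += 1
        if rng.random() < fraction:
            r1_out.writelines(rec1)
            r2_out.writelines(rec2)
            kept += 1
    return kept, total


# Fraction of pairs to keep when going from about 181x to about 40x.
FRACTION = 40 / 181
```

To use it, open the four files (for example `reads_1.fq`, `reads_2.fq` and two output files) and pass the handles together with `FRACTION`. Repeating the run with a different `seed` shows whether the 19 Mb result is stable across subsamples.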

ADD REPLY • link 12.8 years ago by Schrodinger'S Cat ▴ 210

score 6 · Answer 1 · 2012-02-01

Yes, collapsed repeats can lead to a smaller than expected assembly size. See Myers et al. (2000) for a good discussion of how to detect collapsed repeat contigs. If this is the case, then you have a very repetitive genome on your hands.
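One rough way to look for collapsed repeats is to compare each contig's coverage with the typical single-copy coverage. The sketch below is not the method from Myers et al.; it is a simple heuristic that assumes Velvet-style contig headers of the form `NODE_<id>_length_<n>_cov_<x>` and treats the length-weighted median coverage as the single-copy level. All function names are hypothetical.

```python
import re

# Assumed Velvet contig header layout: >NODE_12_length_3456_cov_48.7
HEADER = re.compile(r"^>?NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def parse_nodes(headers):
    """Return (node_id, length, coverage) for every header that matches."""
    nodes = []
    for line in headers:
        match = HEADER.match(line.strip())
        if match:
            nodes.append(
                (int(match.group(1)), int(match.group(2)), float(match.group(3)))
            )
    return nodes


def weighted_median_cov(nodes):
    """Coverage at which half of the assembled length has been passed."""
    ordered = sorted(nodes, key=lambda node: node[2])
    half = sum(length for _, length, _ in ordered) / 2
    running = 0
    for _, length, cov in ordered:
        running += length
        if running >= half:
            return cov


def flag_collapsed(nodes, min_ratio=1.5):
    """List contigs whose coverage suggests more than one genomic copy."""
    base = weighted_median_cov(nodes)
    return [
        (node_id, length, round(cov / base))
        for node_id, length, cov in nodes
        if cov / base >= min_ratio
    ]


def copy_adjusted_size(nodes):
    """Assembly size if every contig is counted once per estimated copy."""
    base = weighted_median_cov(nodes)
    return sum(length * max(1, round(cov / base)) for _, length, cov in nodes)
```

If `copy_adjusted_size` comes out much closer to 40 Mb than the raw 19 Mb, collapsed repeats are a plausible explanation. Velvet reports lengths and coverage in k-mer units, so treat the result as an estimate only.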

Also, have you confirmed that your observed sequencing throughput is compatible with your expected throughput? You can do this by reference mapping against a single copy locus that was isolated previously from your species of interest. If the library/sequencing was poor, you may have a lower coverage than you think which could lead to a partial assembly, although in the range you are talking about this seems unlikely.
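As a first sanity check before any mapping, the raw throughput can be compared with the expected coverage using coverage = total bases / genome size. This is a minimal sketch assuming unwrapped 4-line FASTQ input; the function names are hypothetical, and the numbers in the usage note are the ones given in the question.

```python
from itertools import islice


def total_bases(fastq_handle):
    """Sum the sequence lengths in a 4-line-per-record FASTQ file."""
    # The sequence is the second line of each record: lines 1, 5, 9, ...
    return sum(len(line.strip()) for line in islice(fastq_handle, 1, None, 4))


def coverage(n_bases, genome_size):
    """Average depth if the bases were spread evenly over the genome."""
    return n_bases / genome_size


def reads_needed(target_coverage, genome_size, read_length):
    """Number of reads required to reach a target average depth."""
    return target_coverage * genome_size / read_length
```

For example, `reads_needed(181, 40e6, 76)` gives roughly 95 million reads. If the read files contain far fewer than that, the 181x figure is too optimistic. Note that this only checks throughput; mapping to a known single-copy locus, as described above, checks the depth actually achieved on the genome.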

score 4 · Answer 2 · 2012-01-31

12.8 years ago

Francois Olivier Hébert ▴ 280

It is indeed possible. It is strange that you successfully assemble so many reads and still get such a small genome size. Have you tried to BLAST your unassembled reads against a database containing only repeated elements (e.g. Repbase)?

It also depends on how you obtained your reads. Maybe the whole genome isn't in your sample: even if there are a lot of repeated elements in the genome, they should be present in multiple copies, so you wouldn't assemble almost 100% of the reads. A large set of reads that are very similar to one another wouldn't assemble.

0

Many thanks, Francois, for your valuable comments. I will do some analysis and get back soon.

ADD REPLY • link 12.8 years ago by Rahul Sharma ▴ 660

score 1 · Answer 3 · 2012-02-08

12.8 years ago

Ahdf-Lell-Kocks ★ 1.6k

Many assemblers don't do well with repetitive regions and collapse them, which can lead to assemblies smaller than the expected genome size.

score 0 · Answer 4 · 2014-01-20

10.9 years ago

Adrian Pelin ★ 2.6k

I suggest trying the SPAdes assembler. It allows the use of multiple k-mer sizes and merges the results from all of them into a final assembly.

Some regions will benefit from lower k-mer sizes, others from higher ones. Try k=23,33,43,55,65 (SPAdes requires odd k values).
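A small sketch of how such a multi-k run could be set up from Python. It only builds the command line, using the `spades.py` options `-1`, `-2`, `-k` and `-o`. The helper names, file names and the k list in the usage note are illustrative assumptions. The checks reflect the SPAdes rules that k values must be odd, below 128 and in ascending order, plus a sanity check that k is shorter than the 76 bp reads.

```python
def check_k_values(k_values, read_length, max_k=127):
    """Validate a list of k-mer sizes before handing it to SPAdes."""
    ks = list(k_values)
    for k in ks:
        if k % 2 == 0:
            raise ValueError(f"k={k} is even; SPAdes expects odd k values")
        if k > max_k:
            raise ValueError(f"k={k} exceeds the maximum of {max_k}")
        if k >= read_length:
            raise ValueError(f"k={k} is not shorter than the reads ({read_length} bp)")
    if ks != sorted(set(ks)):
        raise ValueError("k values must be unique and in ascending order")
    return ks


def spades_command(reads_1, reads_2, out_dir, k_values, read_length):
    """Build the argument list for a paired-end SPAdes run."""
    ks = check_k_values(k_values, read_length)
    return [
        "spades.py",
        "-1", reads_1,
        "-2", reads_2,
        "-k", ",".join(str(k) for k in ks),
        "-o", out_dir,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, for example with `cmd = spades_command("reads_1.fq", "reads_2.fq", "spades_out", [23, 33, 43, 55, 65], 76)`.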

Adrian
