Smaller Assembled Genome Size Than Expected
12.9 years ago
Rahul Sharma ▴ 660

Dear all, I am assembling a 40 Mb genome with an expected coverage of 181x. I am using 76 bp Illumina reads with an insert size of 200 bp (SD 20 bp). I tried Velvet for these assemblies; 86-99% of the reads were used, with an N50 of 80 kb (k-mers 21,55,2). The strange thing is that every assembly comes out at only 19 Mb. The whole genome was covered during library preparation. What could be the reason for this? Is it due to repeat elements, given that some of my NODEs have more than 5000x coverage? I would appreciate your suggestions.

Thanks in advance, Rahul

illumina velvet assembly next-gen sequencing repeats • 5.0k views

I think there's a good chance that some elements are over-covered. Try reducing the amount of data you assemble (e.g. subsample from 181x to 40x) and see if you get the same results. Also, check this: http://www.illumina.com/Documents/products/technotes/technote_denovo_assembly_ecoli.pdf
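One way to try the down-sampling suggested above is to keep each read pair with probability 40/181, so the two mates always stay together. This is only a minimal sketch: the function names are made up for illustration, and it assumes plain 4-line FASTQ records with the two mate files in the same order.

```python
import random
from itertools import islice


def fastq_records(lines):
    """Yield 4-line FASTQ records from an iterable of lines."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return
        yield record


def subsample_pairs(lines_1, lines_2, fraction, seed=0):
    """Keep each read pair with probability `fraction`.

    Both mates are kept or dropped together, so the two outputs stay in sync.
    Assumes the two inputs list the pairs in the same order.
    """
    rng = random.Random(seed)
    kept_1, kept_2 = [], []
    for rec_1, rec_2 in zip(fastq_records(lines_1), fastq_records(lines_2)):
        if rng.random() < fraction:
            kept_1.extend(rec_1)
            kept_2.extend(rec_2)
    return kept_1, kept_2


# Target coverage divided by current coverage (about 0.22 for 181x -> 40x).
fraction = 40 / 181
```

With real data you would pass two open file handles (file names are up to you) and write the returned lines to new files. For very large inputs it would be better to write each kept record as you go instead of collecting lists in memory.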

12.9 years ago

Yes, collapsed repeats can lead to a smaller than expected assembly size. See Myers et al (2000) for a good discussion on how to detect collapsed repeat contigs. If this is the case then you have a very repetitive genome on your hands.
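As a rough first check for collapsed repeats, you can compare each contig's coverage with the typical coverage and see how much sequence the high-coverage contigs would represent if they were expanded. The sketch below is a simple coverage-ratio heuristic, not the statistic described by Myers et al. It assumes Velvet-style headers of the form `NODE_<id>_length_<n>_cov_<c>`, where the reported length is in k-mers (so k - 1 is added back), and it assumes the median contig coverage is a fair stand-in for single-copy coverage. The function names and the `min_ratio` threshold are made up for illustration.

```python
import re
from statistics import median

# Velvet-style contig header, e.g. ">NODE_12_length_4853_cov_23.45"
HEADER = re.compile(r"NODE_(\d+)_length_(\d+)_cov_(\d+(?:\.\d+)?)")


def parse_velvet_headers(headers, k):
    """Return (node_id, length_bp, coverage) tuples from contig headers."""
    contigs = []
    for header in headers:
        match = HEADER.search(header)
        if match:
            # Assumption: header length is in k-mers, so add k - 1 for bp.
            length_bp = int(match.group(2)) + k - 1
            contigs.append((match.group(1), length_bp, float(match.group(3))))
    return contigs


def estimate_uncollapsed_size(contigs, min_ratio=1.5):
    """Estimate genome size if high-coverage contigs were expanded.

    Each contig is counted round(coverage / median coverage) times,
    with a minimum of one copy. Contigs at or above `min_ratio` times
    the median are reported as candidate collapsed repeats.
    """
    base_cov = median(cov for _, _, cov in contigs)
    total = 0
    flagged = []
    for node, length_bp, cov in contigs:
        copies = max(1, round(cov / base_cov))
        if cov / base_cov >= min_ratio:
            flagged.append((node, copies))
        total += length_bp * copies
    return total, flagged
```

If the expanded total lands much closer to 40 Mb than the 19 Mb assembly does, collapsed repeats are a plausible explanation. If it stays near 19 Mb, something else is probably going on.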

Also, have you confirmed that your observed sequencing throughput is compatible with your expected throughput? You can do this by reference mapping against a single copy locus that was isolated previously from your species of interest. If the library/sequencing was poor, you may have a lower coverage than you think which could lead to a partial assembly, although in the range you are talking about this seems unlikely.
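The arithmetic behind this check is short enough to write down. A minimal sketch, with hypothetical function names: the read count in the example is not from the thread, it is simply the number of 76 bp reads that would give 181x on 40 Mb.

```python
def expected_coverage(n_reads, read_length, genome_size):
    """Coverage expected from total sequenced bases over genome size."""
    return n_reads * read_length / genome_size


def observed_coverage(mapped_bases, locus_length):
    """Mean depth over a single-copy locus from reference mapping."""
    return mapped_bases / locus_length


def implied_genome_size(n_reads, read_length, single_copy_coverage):
    """Genome size implied by the depth seen on single-copy sequence."""
    return n_reads * read_length / single_copy_coverage


# Hypothetical read count: about 95.3 million reads of 76 bp on 40 Mb.
n_reads = 95_263_158
print(expected_coverage(n_reads, 76, 40_000_000))
```

If the depth measured on the single-copy locus is well below the expected figure, the library is contributing less usable coverage than planned. If it is well above, the genome may be smaller than 40 Mb or part of it may be missing from the sample.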


Many thanks for your valuable comments, I will do some analysis and will get back again :)

12.9 years ago

It is possible indeed. It is strange that you successfully assemble so many reads and still get such a small genome size. Have you tried to BLAST your unassembled reads against a database containing only repeated elements (e.g. Repbase)?

It also depends on how you obtained your reads. Maybe the whole genome isn't in your sample. Even if the genome contains many repeated elements, they should be present in multiple copies, and in that case you wouldn't expect to assemble almost 100% of the reads: a large set of near-identical reads generally doesn't assemble.


Many thanks, Francois, for your valuable comments. I will do some analysis and get back soon.

12.9 years ago
Ahdf-Lell-Kocks ★ 1.6k

Many assemblers don't do well with repetitive regions and collapse them, which can lead to assemblies smaller than the expected genome size.

10.9 years ago
Adrian Pelin ★ 2.6k

I suggest trying the SPAdes assembler. It lets you use multiple k-mer sizes and merges the results from all of them into a final assembly.

Some regions will benefit from lower kmers, others from higher kmers. Try k=23,33,43,54,65

Adrian
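For reference, a multi-k SPAdes run could be launched along these lines. This is only a sketch: the file names and output directory are placeholders, and `spades_command` is a made-up helper. SPAdes documents that k values must be odd and below 128, so the sketch uses 53 in place of the 54 in the list above, which is an assumption about what was intended.

```python
import shlex


def spades_command(reads_1, reads_2, k_values, out_dir):
    """Build a spades.py command line for one paired-end library.

    Raises ValueError for k values SPAdes would reject (even or >= 128).
    """
    for k in k_values:
        if k % 2 == 0 or k >= 128:
            raise ValueError(f"k={k} must be odd and less than 128")
    k_arg = ",".join(str(k) for k in sorted(k_values))
    return ["spades.py", "-1", reads_1, "-2", reads_2, "-k", k_arg, "-o", out_dir]


# Placeholder file names; 53 replaces 54 (assumption, see above).
cmd = spades_command("reads_1.fastq", "reads_2.fastq", [23, 33, 43, 53, 65], "spades_out")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once SPAdes is installed and on your PATH.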
