Hello,
We are trying to assemble the genome of the common loon, and I have used abyss (v. 1.5.2) to produce de novo assemblies with the following output for different values of k:
n n:500 n:N50 min N80 N50 N20 E-size max sum name
k37
4689853 152662 53975 500 579 761 1177 935 10767 1.19E+08 test-unitigs.fa
4689829 152665 53982 500 579 761 1177 935 10767 1.19E+08 test-contigs.fa
4689653 152727 53871 500 579 762 1178 935 10767 1.19E+08 test-scaffolds.fa
k55
2564599 25203 9944 500 542 639 867 795 14461 1.70E+07 test-unitigs.fa
2564423 25179 9885 500 543 641 877 799 14461 1.71E+07 test-contigs.fa
2564028 25142 9769 500 543 645 902 811 14461 1.72E+07 test-scaffolds.fa
k32
5038033 198105 67641 500 591 802 1287 1005 7812 1.61E+08 test-unitigs.fa
5038000 198106 67653 500 591 802 1287 1005 7812 1.61E+08 test-contigs.fa
5037795 198153 67499 500 591 803 1287 1005 7812 1.62E+08 test-scaffolds.fa
k48
3736945 62667 24079 500 554 678 955 804 9769 4.42E+07 test-unitigs.fa
3736733 62628 24040 500 554 679 961 806 9769 4.43E+07 test-contigs.fa
3735950 62435 23669 500 555 684 986 817 9769 4.45E+07 test-scaffolds.fa
k64
1636437 5055 1730 500 542 655 1196 1133 12872 3.72E+06 test-unitigs.fa
1636380 5054 1717 500 542 657 1203 1142 12872 3.83E+06 test-contigs.fa
1636124 5088 1698 500 545 669 1282 1159 12872 3.83E+06 test.scaffolds
k25
6946557 228359 83689 500 578 747 1096 873 5000 1.74E+08 test-unitigs.fa
6946544 228358 83694 500 578 747 1096 873 5000 1.74E+08 test-contigs.fa
6946414 228386 83762 500 578 747 1096 873 5000 1.74E+08 test.scaffolds
k31
5114778 207133 70364 500 593 809 1301 1015 7999 1.70E+08 test-unitigs.fa
5114751 207137 70181 500 593 810 1301 1015 7999 1.70E+08 test-contigs.fa
5114566 207200 70239 500 593 810 1302 1015 7999 1.70E+08 test.scaffolds
k30
5192389 216073 73119 500 595 814 1312 1022 7998 1.78E+08 test-unitigs.fa
5192361 216073 73130 500 595 814 1312 1022 7998 1.78E+08 test-contigs.fa
5192194 216128 72984 500 595 814 1313 1022 7998 1.78E+08 test.scaffolds
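As a rough way to compare the runs, the contig rows above can be put side by side in a few lines of Python. This is only a sketch: the numbers are copied by hand from the test-contigs.fa rows, it assumes each k label applies to the three rows that follow it, and `contig_stats` and `best_k_by_n50` are names made up for this illustration.

```python
# Contig rows copied from the abyss-fac output above: k -> (N50 in bp, sum in bp).
# ASSUMPTION: each k label applies to the three rows that follow it.
contig_stats = {
    25: (747, 1.74e8),
    30: (814, 1.78e8),
    31: (810, 1.70e8),
    32: (802, 1.61e8),
    37: (761, 1.19e8),
    48: (679, 4.43e7),
    55: (641, 1.71e7),
    64: (657, 3.83e6),
}


def best_k_by_n50(stats):
    """Return the k with the largest contig N50 (ties broken by larger sum)."""
    return max(stats, key=lambda k: stats[k])


for k in sorted(contig_stats):
    n50, total = contig_stats[k]
    print(f"k={k:>2}  N50={n50:>4} bp  sum={total / 1e6:7.1f} Mb")

print("best k by contig N50:", best_k_by_n50(contig_stats))
```

Laid out this way, the sum column falls steeply as k grows while N50 barely moves, which is worth keeping in mind when choosing k.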
For the assembly with the highest N50 (814 bp), the contigs are small and highly fragmented (and essentially no scaffolds are produced) even after mapping these contigs to the available red-throated loon genome:
Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
All             5,237,924       5,238,436       767,438,425     767,326,331     99.99%
50              3,616,441       3,616,953       710,236,525     710,124,431     99.98%
100             2,146,720       2,147,232       604,271,394     604,159,300     99.98%
250             743,885         744,397         394,016,485     393,904,391     99.97%
500             247,247         247,755         223,350,732     223,238,838     99.95%
1 KB            62,044          62,409          98,533,822      98,431,583      99.90%
2.5 KB          5,725           5,731           18,713,830      18,710,728      99.98%
5 KB            231             231             1,310,589       1,310,589       100.00%
What I am wondering is whether anyone has any ideas about why our assembly is so fragmented and whether there are any techniques we can use to improve contig length. Sequence data are in the form of paired-end reads (291,098,878 after filtering) drawn from a single library with one insert size (8 kb). Could the fact that we do not have multiple library sizes be to blame for the small contigs? I do not have an estimate of genome size, but it should be in the range of 1 Gb, and the species is diploid.
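For reference, expected read depth can be estimated from these figures as total sequenced bases divided by genome size. The sketch below is hedged in several ways: the read length of 100 bp is an assumption (it is not stated in this post), the 291,098,878 figure is assumed to count individual reads rather than pairs, and the genome sizes are just a plausible range around 1 Gb. `read_depth` is a hypothetical helper name.

```python
def read_depth(n_reads, read_len_bp, genome_size_bp):
    """Expected average base coverage: total sequenced bases / genome size."""
    return n_reads * read_len_bp / genome_size_bp


n_reads = 291_098_878   # filtered reads, from the post (ASSUMED to be reads, not pairs)
read_len_bp = 100       # ASSUMPTION: read length is not given in the post

for genome_gb in (0.8, 1.0, 1.2):   # ASSUMPTION: plausible sizes around ~1 Gb
    depth = read_depth(n_reads, read_len_bp, genome_gb * 1e9)
    print(f"genome {genome_gb:.1f} Gb -> about {depth:.1f}x")
```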
Here is the command I used to run abyss for different k-mer sizes: nohup abyss-pe k=29 name=test29 np=48 in='/share/apps/Data/Loon/COLO1527-8kb_1.filtered.fastq.gz /share/apps/Data/Loon/COLO1527-8kb_2.fastq.gz' &
I am really hoping to find a way to improve contig length, but so far I have not found a way to do this or produce viable scaffolds. Thanks very much for any suggestions.
Zach
Thank you. The k-mer sizes are on the far right-hand side of the first table (they range from 25 to 64, but I did not run every value in between). The sum column is apparently the sum of the lengths of the contigs at least 100 bp in length.
I am not sure about the heterozygosity, and I had not heard of Platanus. Thanks for that, although I can't imagine why the heterozygosity would be higher than in other bird genomes assembled using abyss or SOAPdenovo.
The company that did the sequencing for us did an initial SOAPdenovo assembly, and we moved on to abyss since their assembly had a contig N50 of approximately 200. So at 814 we have improved slightly with abyss, but not as much as could be hoped.
I believe the read depth was about 33X.
Ah, didn't see the horizontal scrolling bar.. alright.
33x is rather low coverage for de novo assembly. You'd probably get higher contiguity if you sequenced an extra paired-end library.
Follow-up questions:
I concur; 33x is way too low for 100bp reads on a large diploid genome. Even with K=55 you already have only 100 - 55 + 1 = 46 kmers per read, or 33*(46/100)/2 = 7.59x kmer coverage per ploidy, and for a good assembly you would want an even higher K. You need a lot more coverage for a good assembly, preferably from longer reads (150bp at least; 250bp would be better).
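The arithmetic above generalizes to any K. Here is a minimal sketch, assuming 100bp reads and 33x base coverage as in this thread; `kmer_coverage` is a name made up for the example, not part of any assembler.

```python
def kmer_coverage(base_coverage, read_len, k, ploidy=1):
    """Kmer coverage per haplotype copy: base coverage scaled by the
    fraction of read positions that start a full kmer, split across ploidy."""
    kmers_per_read = read_len - k + 1
    if kmers_per_read <= 0:
        raise ValueError("k must not exceed the read length")
    return base_coverage * kmers_per_read / read_len / ploidy


# ASSUMPTIONS: 33x base coverage, 100bp reads, diploid genome.
for k in (25, 30, 37, 48, 55, 64):
    cov = kmer_coverage(33, 100, k, ploidy=2)
    print(f"K={k}: {100 - k + 1} kmers per read, {cov:.2f}x kmer coverage per ploidy")
```

For K=55 this gives the same 7.59x as the calculation above, and the value keeps dropping as K goes up.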
Try to aim for at least 15x-20x kmer coverage per ploidy; 30x is better.
Sometimes you can improve things by merging your paired reads, if they overlap. That can increase the number of long kmers yielded per pair.
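To see what those targets mean in terms of sequencing, the kmer-coverage formula can be inverted to get the base coverage needed for a given read length and K. The sketch below is only illustrative: the function names are hypothetical, K=55 and the 20-base overlap are example values, and a merged pair is assumed to behave like a single read of length 2*read_len - overlap.

```python
def required_base_coverage(target_kmer_cov, read_len, k, ploidy=2):
    """Base coverage needed to reach a target kmer coverage per ploidy."""
    kmers_per_read = read_len - k + 1
    if kmers_per_read <= 0:
        raise ValueError("k must not exceed the read length")
    return target_kmer_cov * ploidy * read_len / kmers_per_read


def kmers_per_merged_pair(read_len, overlap, k):
    """Kmers from one pair after merging into a single longer sequence."""
    return (2 * read_len - overlap) - k + 1


k = 55  # example K, as in the reply above
for read_len in (100, 150, 250):
    for target in (20, 30):
        need = required_base_coverage(target, read_len, k)
        print(f"{read_len}bp reads, {target}x kmer target: about {need:.0f}x base coverage")

# Effect of merging two 100bp reads that overlap by 20 bases (example value).
unmerged = 2 * (100 - k + 1)
merged = kmers_per_merged_pair(100, 20, k)
print(f"kmers per pair at K={k}: {unmerged} unmerged vs {merged} merged")
```

With these example values, longer reads need noticeably less base coverage to reach the same kmer coverage, and merging raises the kmer yield per pair.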
I think it is great that you've tried many kmer values, but you should try other assemblers too. I have had great luck with MaSuRCA compared with all the other assemblers that I've tested (e.g., ALLPATHS N50 was 27 kb, SOAPdenovo N50 was 16 kb, Ray I don't want to mention, and MaSuRCA was 2.4 Mb!!).
Also, repeat content. What is your genome's estimated repeat content? I know animals have relatively low repeat content, but it will definitely matter!
Thanks very much to all of you for your help. I really appreciate it. I had suspected for a while that our sequencing coverage was too low, as many of you have said, but I have just been trying to make the best of the existing data. Your comments have helped me decide that we probably can't improve contig length much without additional sequencing with another insert library size and higher coverage, so that is what we are going to look into next.
To answer, I do believe that adapters were removed from reads, and error correction has been done, although I was not familiar with BFC. That might be something to use if we get another paired-end library. I am pretty new to this type of work so I haven't estimated repeat content, but most bird genomes have low repeat content for amniotes.
Abyss has worked well for us compared to SOAPdenovo, but these are the only assemblers we have tried. I'll look into MaSuRCA. Any suggestions on what an appropriate insert size for another library would be?
Thanks,
Zach
Overlapping 2x250bp reads