Hi There,
Recently I have been trying to improve the assembly of a plant genome. It was first assembled using 454 data, and then again using Illumina data.
I tried two strategies. The first is to start from the raw reads, mixing both read types in a de novo assembler such as Velvet or Ray. I call this the direct hybrid assembly. I also tried to further combine the assemblies from both assemblers using a third assembler.
The second is to assemble the 454 reads using Newbler (i.e., GS De Novo Assembler) and the Illumina reads using Velvet, and then merge the two assemblies using a third assembler. I call this the stepwise hybrid assembly approach.
I found that the first strategy produced more misassemblies (assessed by comparing scaffolds to protein sequences) than the second one.
I also found that when I further combined the two assemblies produced by the two assemblers in strategy one (one assembler is better than the other, based on my assessment), even more erroneous assemblies were produced.
Could anyone suggest potential reasons for this?
Many thanks.
Lhl
Hi Ole,
Thanks for your response.
The plant genome size is estimated to be 2.7-2.8 Gb. We do not have a reference genome.
I chose Ray because it is designed to be a hybrid assembler (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3119603/).
I used Velvet because it can assemble both long (>200 bp) and short (<200 bp) reads, as mentioned in the manual.
We have only a small amount of 454 data (80M), which was produced by randomly sampling the genome.
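To put a rough number on that, here is a back-of-the-envelope sketch. It assumes "80M" means about 80 Mbp of sequence (an assumption; if it meant 80 million reads the result would be very different) and takes 2.75 Gbp as the midpoint of the size estimate. The covered fraction uses the simple Lander-Waterman approximation 1 - e^(-c), which ignores repeats and any sampling bias.

```python
import math

def expected_coverage(total_bases, genome_size):
    """Average depth c = sequenced bases / genome size."""
    return total_bases / genome_size

def fraction_covered(depth):
    """Lander-Waterman style estimate of the genome fraction hit by at least one read."""
    return 1.0 - math.exp(-depth)

# Assumed values, not measured: 80 Mbp of 454 data, 2.75 Gbp genome.
bases_454 = 80e6
genome_size = 2.75e9

depth = expected_coverage(bases_454, genome_size)
print(f"depth ~ {depth:.3f}x, covered ~ {fraction_covered(depth):.1%}")
```

Under those assumptions the 454 data amounts to about 0.03x depth, touching roughly 3% of the genome.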
As for the Illumina data, we constructed 3 paired-end libraries with insert lengths of 125 bp, 250 bp and 500 bp respectively, and we focused mainly on the gene-rich regions. We also have some Illumina sequences produced using RADseq (http://bfg.oxfordjournals.org/content/9/5-6/416.abstract).
The third assembler I used to combine assemblies (e.g., combining the Ray assembly of both read types with the Velvet assembly of both read types, or combining the Velvet assembly of Illumina reads with the Newbler assembly of 454 reads) is GAA (http://bioinformatics.oxfordjournals.org/content/28/1/13.full). I am sorry for forgetting to mention this in the question post.
I hope these are the details you need. And thanks a lot for your response.
Lhl
Hi Lhl.
That's a big genome. I'm surprised you were able to use Velvet on it (or not so surprising if all you're doing is assembling exons). Ray is probably a good choice. I am still not confident that either of them would do the best job with the combination of reads, but other programs need a bit of tweaking to get them to run properly.
So the Illumina reads are not randomly sampled from the genome? Most assemblers expect even coverage of the genome, and I guess some might not work well when that assumption is violated. If that is the case, I guess both Ray and Velvet would have problems with the combination of Illumina and 454 reads when the Illumina reads are unevenly distributed. Someone more knowledgeable than me would have to explain the reasons.
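One way to check how uneven the coverage actually is: a minimal sketch that summarises per-contig depth from a three-column, tab-separated table (contig, position, depth), which is the layout `samtools depth` prints. The function name and the toy rows are made up for illustration. Positions with zero depth only appear in such a table if you ask for them, so the numbers describe reported positions only.

```python
import statistics
from collections import defaultdict

def depth_summary(lines):
    """Return {contig: (mean_depth, coefficient_of_variation)} from contig/pos/depth rows."""
    depths = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        contig, _pos, depth = line.split("\t")[:3]
        depths[contig].append(int(depth))
    summary = {}
    for contig, values in depths.items():
        mean = statistics.fmean(values)
        # Population standard deviation; CV is 0 for perfectly even coverage.
        cv = statistics.pstdev(values) / mean if mean > 0 else float("nan")
        summary[contig] = (mean, cv)
    return summary

# Toy rows, not real data.
toy = [
    "contig1\t1\t10", "contig1\t2\t10", "contig1\t3\t10",
    "contig2\t1\t2", "contig2\t2\t40", "contig2\t3\t3",
]
for contig, (mean, cv) in depth_summary(toy).items():
    print(f"{contig}: mean depth {mean:.1f}, CV {cv:.2f}")
```

A contig with a high coefficient of variation relative to the others is a candidate for the kind of uneven coverage that could confuse an assembler.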
I guess you could take the 454 assembly, map your Illumina reads to it, and use the Illumina reads to correct mistakes in the 454 assembly.
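Very roughly, the idea is a per-position vote. Here is a toy sketch, assuming ungapped alignments given as (start, sequence) pairs; real polishing needs an aligner, indel handling and base qualities, and the function name and thresholds below are hypothetical.

```python
from collections import Counter

def polish(contig, alignments, min_depth=3, min_fraction=0.8):
    """Replace a contig base when aligned reads clearly agree on a different base.

    alignments: iterable of (start, read_seq), 0-based, ungapped (an assumption).
    """
    columns = [Counter() for _ in contig]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(contig):
                columns[pos][base] += 1
    polished = []
    for ref_base, counts in zip(contig, columns):
        depth = sum(counts.values())
        if depth >= min_depth:
            base, count = counts.most_common(1)[0]
            if base != ref_base and count / depth >= min_fraction:
                polished.append(base)
                continue
        # Too little evidence, or reads agree with the contig: keep the original base.
        polished.append(ref_base)
    return "".join(polished)

# Toy example: four reads agree on A at position 3, where the contig has T.
contig = "ACGTTGCA"
reads = [(0, "ACGAT"), (1, "CGATG"), (2, "GATGC"), (3, "ATGCA")]
print(polish(contig, reads))
```

Positions with too few reads are left untouched, so with uneven Illumina coverage only the well-covered parts of the 454 contigs would get corrected.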
What do you need the assembly for? If you just need the gene-rich regions, then you might already have a good enough assembly. Combining assemblies, at least with uneven coverage, is not an easy task. I guess I need to read that GAA article to learn more about that.
Good luck.
Ole
Hi Ole,
Thanks again for your response.
I have access to a university computer cluster, so basically I do not need to worry about computational resources. The Illumina reads are mainly sampled from the gene-rich genomic regions.
Since I only have a small number of 454 reads, which yield only a few contigs, I do not think I can use them as the base assembly and further improve it by mapping Illumina reads.
The main purpose of my study is to get as many functional components of the genome as possible.
Anyway, thanks for your discussion.
It helps to some extent.
Cheers,
Lhl