I am planning a de novo assembly of a bumblebee species, and the sample is haploid. Which assembler would be best for it? I am currently considering Platanus, MEGAHIT and Hapsembler.
Thanks for your help!
The choice of assembly approach depends on your research question. Why would you try a de novo assembly with just a single Illumina library when you already have such a good genome, as suggested by Philipp? If you are trying a reference-based assembly, MIRA would be a good start.
The whole assembly pipeline depends on what kind of data you have; coverage and quality are the prime factors to consider. Since you have 150 bp Illumina reads, you would have to go for de Bruijn graph assemblers that can handle multiple libraries. Because your insert size is small, merging the overlapping paired-end reads would be the first thing to do, and the merged reads should be supplied as an additional single-end library. All you would get in the end is many contigs broken at repetitive or high-complexity regions.
Without multiple libraries or sequencing technologies, you would get a highly fragmented assembly. Is this what you need?
Hi Rohit,
We have done a reference-based assembly (aligning against the published genome of a closely related species, Bombus impatiens, from the paper Philipp mentioned), but we are now trying to identify some possibly novel variation in a specific part of the genome (a 300 kb region), and that is why we want to do a de novo assembly. I do not need the whole assembly to be very good; I only need the piece we are interested in (around 300 kb) in order to check for indels or novel variation that we might have missed in our reference-based assembly. But I am not sure what would be a good assembler for de novo assembly of a paired-end library (insert size is 150 bp). Please let me know if you have any suggestions.
Hi Sarthok,
First you would have to check whether there is a breakpoint in the 300 kb region you are interested in. Mapping followed by breakpoint detection tools usually works well for this; OMICtools would be a good place to start:
http://omictools.com/whole-genome-resequencing-category
Since you have paired-end data and a small insert size, just start with paired-end read merging; I usually use FLASH for this. Then try the MIRA assembler. Since it is a haploid assembly with a <400 Mb genome, it should be pretty straightforward, and MIRA has a pretty impressive mailing list if you run into trouble. IDBA-UD also does a good job in assembling, as does SOAPdenovo, but mis-assemblies are something to watch out for with any de Bruijn graph assembler.
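To make the read-merging step concrete, here is a minimal sketch of what merging an overlapping read pair means. It is not FLASH's actual algorithm (FLASH scores candidate overlaps and tolerates mismatches); this toy version only accepts exact overlaps, and the function names and the `min_overlap` default are assumptions made for illustration.

```python
# Toy illustration of paired-end read merging (exact overlaps only).
# NOT the FLASH algorithm: names and the min_overlap default are hypothetical.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge a forward read and its reverse mate into one fragment.

    read2 is reverse-complemented so both reads are on the same strand,
    then the longest exact overlap between the end of read1 and the
    start of the mate is used. Returns None if no overlap of at least
    min_overlap bases is found.
    """
    mate = reverse_complement(read2)
    max_len = min(len(read1), len(mate))
    for olap in range(max_len, min_overlap - 1, -1):
        if read1[-olap:] == mate[:olap]:
            return read1 + mate[olap:]
    return None


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTAACCGGTTAGC"
    r1 = fragment[:18]                      # forward read
    r2 = reverse_complement(fragment[7:])   # reverse read from the other end
    print(merge_pair(r1, r2))
```

For real data, a dedicated merger such as FLASH should be used, since sequencing errors make exact matching far too strict.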
If your Illumina HiSeq read length is 250 bp I'd recommend DISCOVAR: https://www.broadinstitute.org/software/discovar/blog/?page_id=23
Since you don't seem to have mate-pair or similar long-insert libraries, I wouldn't expect the best results with other assemblers.
Have you seen this recent bumblebee genome paper? They used Newbler and SOAPdenovo: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0623-3
I went with MIRA and experimented with several parameters. I got an assembly with an N50 of 50 kb, and the BLAST results identified the contigs for the region I am interested in. Thanks all for your help!
I went through the MIRA manual (which is extremely helpful) to write and tune my parameters. The right settings depend entirely on your data type (read type, template size, etc.) and on what kind of assembly you expect to generate (draft/accurate, de novo/reference-based). Here is one of the parameter sets I used, but yours could be very different depending on your data and assembly type.
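For anyone comparing parameter runs by N50 as above, here is a small sketch of how N50 can be computed from a list of contig lengths. The function name is my own, and it assumes you have already extracted the contig lengths from the assembly FASTA; it is not part of MIRA.

```python
# Minimal N50 calculation from contig lengths.
# The function name is hypothetical; this is not a MIRA utility.

def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    Returns 0 for an empty input.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return 0


if __name__ == "__main__":
    print(n50([10, 20, 30, 40]))
```

Keep in mind that N50 only describes contiguity; it says nothing about whether the contigs are correctly assembled.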
parameters = COMMON_SETTINGS \
    -GENERAL:number_of_threads=20 \
    -NW:cnfs=warn \
    -NW:cmrnl=warn \
    SOLEXA_SETTINGS \
    -CL:pec
job = genome,denovo,accurate
readgroup = DataIlluminaPairedLib
data = /storage/foo_R1.fastq /storage/foo_R2.fastq
technology = solexa
template_size = 350 700 autorefine
segment_placement = ---> <---
What about your data? Is it Illumina, PacBio, Nanopore, etc.? What are the insert sizes if you have Illumina? This will inform the choice of assembler.
The data is from an Illumina HiSeq; it is paired-end and the insert size is 150 bp.