I have single-end Illumina reads with a read length of around 76 bp and probably around 30X coverage for a Drosophila-related species. After quality filtering the reads, I tried de novo assembly with the ABySS and IDBA assemblers and got contigs with an N50 of around 800 to 1200 bp, depending on the k-mer value used (25-60). Is that terrible? I am having difficulty building super-contigs (scaffolds) since the reads are single-end and I have no paired-end information. Please suggest possible de novo assembly pipeline(s) for building scaffolds with the SE data in hand.
Without paired-end data you are, sorry, out of luck when it comes to reliable scaffolding. You simply cannot build "scaffolds" if your data is not paired-end.
What you can do is place the contigs you have, via simple MUMmer searches (or BLAST or MegaBLAST, or even read aligners if that seems appropriate to you), onto the genome of a closely related species. Be very aware that this placement cannot be trusted to reflect the organism you sequenced: you will not be able to say with confidence that the contig order you get out of it is the one in your organism.
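To illustrate the placement idea, here is a minimal sketch that orders contigs by their best hit on a reference. It assumes you have already parsed your MUMmer or BLAST output into tuples of (contig, reference sequence, reference start, aligned length); that tuple layout and the function name `order_contigs` are made up for this example and are not the output format of any tool. The resulting order is only as trustworthy as the assumption that your species is collinear with the reference.

```python
def order_contigs(hits):
    """Order contigs by the reference position of their longest alignment.

    hits: iterable of (contig, ref_name, ref_start, aligned_len) tuples.
    This layout is an assumption for the sketch; parse your aligner's
    output into it yourself.
    """
    best = {}
    for contig, ref_name, ref_start, aligned_len in hits:
        # Keep only the longest alignment seen for each contig.
        if contig not in best or aligned_len > best[contig][2]:
            best[contig] = (ref_name, ref_start, aligned_len)
    # Sort contigs by (reference sequence name, start coordinate).
    return sorted(best, key=lambda c: best[c][:2])


hits = [
    ("c1", "2L", 5000, 700),
    ("c2", "2L", 100, 900),
    ("c1", "3R", 10, 200),   # shorter secondary hit, ignored
    ("c3", "X", 50, 400),
]
ordered = order_contigs(hits)
print(ordered)
```

Contigs with no hit simply do not appear in the result, and rearrangements relative to the reference will silently produce a wrong order.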
And regarding an N50 of ~1 kb: it is terrible. One can use this kind of data to go on a gene fishing expedition in prokaryotes, or perhaps even for targeted sequence analysis in higher eukaryotes, but for everything else I think it's just junk (sorry to be so blunt).
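For anyone unsure what the N50 figure means: it is the length of the contig at which, going from the longest contig downwards, the running total first reaches half of the total assembly length. A minimal sketch (the function name `n50` is just for illustration, and assemblers may differ slightly in how they handle ties or minimum contig length):

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that takes the running total to at least
        # half of the assembly length defines the N50.
        if running >= half:
            return length
    return 0


example = n50([2, 2, 2, 3, 3, 4, 8, 8])
print(example)
```

To compare assemblies run with different k-mer values, feed in the contig lengths from each run and make sure the same minimum contig length cutoff is applied to all of them.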
Thanks for your suggestions, @Bach and @ketil. The SE reads were originally generated with the intention of mapping them onto D. melanogaster as a reference, but we are now trying to see whether we can get a de novo assembly out of them. It looks like a very challenging task without PE/mate pairs.
Of the de Bruijn assemblers, I got the best results with the commercial CLC, way better than SOAP or ABySS. De Bruijn assembly is very sensitive to read quality, so make sure you aggressively remove low-quality reads. You might also want to try Celera, which is harder to get running but tends to give the best results in many cases.
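As a rough idea of what "aggressively remove low-quality reads" can look like, here is a minimal sketch that drops reads by mean base quality and length. It assumes plain 4-line FASTQ records and Phred+33 quality encoding; older Illumina pipelines used Phred+64, so check your data and change `offset` if needed. The thresholds and function names are arbitrary choices for the example, and dedicated trimming tools do considerably more than this.

```python
def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (offset is an assumption)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_fastq(lines, min_mean_q=20, min_len=50, offset=33):
    """Yield (header, seq, qual) for reads that pass both thresholds.

    lines: any iterable of FASTQ lines, e.g. an open file handle.
    An incomplete trailing record is silently ignored.
    """
    it = iter(lines)
    # Consume the input four lines at a time: header, sequence, '+', quality.
    for header, seq, _plus, qual in zip(it, it, it, it):
        seq, qual = seq.strip(), qual.strip()
        if len(seq) >= min_len and mean_quality(qual, offset) >= min_mean_q:
            yield header.strip(), seq, qual


records = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGT", "+", "!!!!"]
kept = list(filter_fastq(records, min_mean_q=20, min_len=4))
print(kept)
```

With real data you would pass an open file instead of a list, for example `filter_fastq(open("reads.fastq"))`, where the file name is of course a placeholder.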
Using MAIA you could (in principle; I have no personal experience) run different assembly programs and 'merge' their results. MAIA uses the reference to find start and end points in the merged 'contig graphs'.
I've had very good experience with SOAPdenovo for a eukaryotic genome (fire ant, with a ~500 Mb genome). It performed light years better than ABySS or Velvet. Obviously we had paired data, but I think you should give it a shot (it's free).
Have you removed duplicate reads?
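In case it helps, a minimal sketch of exact-duplicate removal for single-end reads: keep the first read seen for each distinct sequence. It works on (header, seq, qual) tuples, which is just a convenient layout assumed for the example. Note that it treats only identical sequences as duplicates (no reverse complements, no near-duplicates), and that with single-end data some identical reads arise by chance rather than from PCR, so this does discard a little genuine coverage.

```python
def remove_exact_duplicates(records):
    """Yield the first record for each distinct read sequence.

    records: iterable of (header, seq, qual) tuples (assumed layout).
    """
    seen = set()
    for header, seq, qual in records:
        if seq not in seen:
            seen.add(seq)
            yield header, seq, qual


reads = [
    ("@a", "ACGT", "IIII"),
    ("@b", "ACGT", "####"),  # same sequence as @a, dropped
    ("@c", "TTTT", "IIII"),
]
unique = list(remove_exact_duplicates(reads))
print(unique)
```

The set of seen sequences lives in memory, so for a full sequencing run this simple approach may need a lot of RAM.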
Also, you could try reducing the complexity of your assembly by separating your dataset into subsets (maybe chromosomes, or smaller regions that are syntenic between closely related Drosophila species). But given how cheap sequencing would be... it's probably not worth your time.
Hi guys, I'm new to this forum.
Could anyone help me with the procedure for single-end read assembly (SOAP, de Bruijn) and point me to publications related to it?