Question

De Novo Assembly/Scaffolding Pipeline(s) For The Eukaryote Genome: Single-End Illumina Reads

4


14.0 years ago

Rm 8.3k

I have single-end (SE) Illumina reads of about 76 bp, at probably around 30X coverage, for a species related to Drosophila. After quality-filtering the reads, I tried de novo assembly with the ABySS and IDBA assemblers and got contigs with an N50 of around 800 to 1,200 bp, depending on the k-mer value used (25-60). Is that terrible? I am having difficulty building super-contigs (scaffolds) because, with SE data, I have no paired-end information. Please suggest possible de novo assembly pipeline(s) for building scaffolds with the SE data in hand.

Treat this as a follow-up to my previous question asked on Biostar.

scaffolding read assembly next-gen sequencing • 7.0k views

updated 13.9 years ago by Senthilkumar ▴ 90 • written 14.0 years ago by Rm 8.3k

0


Hi guys, I'm new to this forum.

Could anyone help me with the procedure for single-end read assembly (SOAP, de Bruijn graph), and point me to publications about it?

12.8 years ago by Senthilkumar ▴ 90

score 2 · Answer 1 · 2010-12-09

Without paired-end data you are, sorry to say, out of luck when it comes to reliable scaffolding. You simply cannot build "scaffolds" if your data is not paired-end.

What you can do is place the contigs you have on the genome of a closely related species via simple MUMmer searches (or BLAST, MegaBLAST, or even read aligners if that seems appropriate to you). Be very aware that this placement of contigs cannot be taken to reflect the organism you sequenced: you will not be able to say with confidence that the contig order you get out of it is the order in your organism.
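
As a rough illustration of that placement step, here is a minimal Python sketch that orders contigs along a reference using each contig's longest hit. The input format (tuples of contig ID, reference sequence ID, reference start, aligned length) and the function name are assumptions made for the example; you would have to extract those fields yourself from whatever MUMmer or BLAST output you produce. The resulting order is only a hypothesis borrowed from the related species.

```python
from collections import defaultdict


def order_contigs_by_reference(hits):
    """Order contigs along each reference sequence by their longest hit.

    hits: iterable of (contig_id, ref_seq_id, ref_start, aligned_length)
          tuples (a hypothetical, pre-parsed form of aligner output).
    Returns {ref_seq_id: [contig_id, ...]} sorted by reference start.
    """
    # Keep only the longest alignment for each contig.
    best = {}
    for contig, ref, start, length in hits:
        if contig not in best or length > best[contig][2]:
            best[contig] = (ref, start, length)

    # Group contigs by reference sequence, then sort by start coordinate.
    layout = defaultdict(list)
    for contig, (ref, start, _length) in best.items():
        layout[ref].append((start, contig))
    return {ref: [c for _s, c in sorted(placed)] for ref, placed in layout.items()}
```

Contigs with no hit are simply absent from the result, and a contig that spans a rearrangement relative to the reference will be placed wherever its longest hit falls.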

And regarding an N50 of ~1 kb: it is terrible. One can use this kind of data to go on a gene fishing expedition in prokaryotes or perhaps even targeted sequence analysis in higher eukaryotes, but for everything else I think it's just junk (sorry to be so blunt).
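
For anyone comparing assemblies across k-mer values, N50 is easy to compute yourself: sort the contig lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. A minimal Python sketch (the function name is just for illustration, and different assemblers may apply their own minimum-length cutoffs before reporting N50):

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")
```

For example, `n50([2, 3, 4, 5, 6, 7, 8, 9, 10])` returns 8: the total is 54, and 10 + 9 + 8 = 27 reaches half of it.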

score 1 · Answer 2 · 2010-12-09

Of the de Bruijn assemblers, I got the best results with the commercial CLC assembler, way better than SOAP or ABySS. De Bruijn assembly is very sensitive to filtering, so make sure you aggressively remove low-quality reads. You might also want to try Celera, which is harder to get running but tends to give the best results in many cases.
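
To make "remove low-quality reads" concrete, here is a minimal Python sketch of a mean-quality filter over 4-line FASTQ records. The threshold of 20 and the Phred offset of 33 are assumptions for the example: older Illumina pipelines wrote Phred+64 qualities, so check your files and pass `offset=64` if needed. Dedicated trimming tools do considerably more than this (for instance, trimming low-quality read ends rather than discarding whole reads).

```python
def read_fastq(lines):
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    stripped = (line.rstrip("\n") for line in lines)
    # Consume the same iterator four lines at a time.
    for header, seq, _plus, qual in zip(*[stripped] * 4):
        yield header, seq, qual


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (offset is an assumption)."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def filter_reads(records, min_mean_q=20, offset=33):
    """Keep reads whose mean Phred quality is at least min_mean_q."""
    for header, seq, qual in records:
        if qual and mean_quality(qual, offset) >= min_mean_q:
            yield header, seq, qual
```

Usage would look like `kept = filter_reads(read_fastq(open("reads.fastq")))`, where `reads.fastq` is a placeholder file name.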

score 1 · Answer 3 · 2010-12-13

Since you have a reference genome (it seems), you could try MAIA: http://bioinformatics.oxfordjournals.org/content/26/18/i433.short

Using this tool you could (in principle; I have no personal experience with it) run different assembly programs and 'merge' their results. MAIA uses the reference to find start and end points in the merged 'contig graphs'.

score 1 · Answer 4 · 2010-12-14

I've had very good experience with SOAPdenovo for a eukaryotic genome (fire ant, with a ~500 Mb genome). It performed light-years better than ABySS or Velvet. Obviously we had paired data, but I think you should give it a shot (it's free). Have you removed duplicate reads?

Also, you could try reducing the complexity of your assembly by separating your dataset into subsets (maybe chromosomes or smaller regions that are syntenic between closely related Drosophilae). But given how cheap sequencing would be... it's probably not worth your time.
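
On the duplicate-read question, a minimal Python sketch of exact-duplicate removal for single-end reads is below. It keeps the first read seen for each distinct sequence; the record format (header, sequence, quality tuples) and the function name are assumptions for the example. With single-end data, identical reads can also arise by chance at high coverage, so this may discard some genuine coverage along with PCR duplicates.

```python
def drop_exact_duplicates(records):
    """Yield only the first read seen for each distinct sequence.

    records: iterable of (header, sequence, quality) tuples.
    """
    seen = set()
    for header, seq, qual in records:
        if seq not in seen:
            seen.add(seq)
            yield header, seq, qual
```

This holds every distinct sequence in memory, which is fine for a quick test but may need a more economical approach for a full eukaryotic dataset.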