Assembly Of Illumina Paired-End Data With Wide/Bimodal Insert Distributions
13.2 years ago
Mitchell ▴ 40

Has anyone seen a bimodal insert-length distribution (two distinct peaks ~200 bp apart) or a very wide one (s.d. of ~100 bp) in their paired-end data?

We believe the fault lies with our sequencing provider, but the data is already 6 months late, so we have gone ahead and tried to assemble it de novo.

We have been using Velvet, but setting the insertlen and insertlensd options to auto, or providing values determined from mapping with BWA, results in a large number of N characters in the scaffolded contigs. If we use a very tight insertlen_sd (10% of the mean), we can eliminate most, if not all, of the N characters. However, we lose a lot of the data that falls outside the defined range.

Has anyone tried assembling such data with Velvet? Does anyone have suggestions of things to try? Could someone suggest an assembly program or algorithm that might handle this kind of data better?

assembly velvet paired • 3.4k views

Do you have a close or similar reference genome you can use?


It sounds like very poor fragment library construction. You should be able to get it redone for free; however, the fact that it is so late suggests that probably won't happen.


Try just assembling them as single-end reads ("velveth -short") and see what sort of contigs you get. Then align your reads back to those contigs and plot the insert size distribution.
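For the "plot the insert size distribution" step, here is a minimal Python sketch, assuming the reads were aligned back to the contigs and the alignments are available as SAM text. It reads the observed template length (TLEN, column 9) once per pair and prints a coarse text histogram. The function names and the 25 bp bin width are illustrative choices, not part of any tool. Pairs are deliberately not filtered on the "properly paired" flag, since that flag depends on the aligner's own insert-size estimate and could hide a second peak.

```python
from collections import Counter


def insert_sizes(sam_lines):
    """Yield the observed template length once per aligned read pair."""
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        tlen = int(fields[8])
        if not flag & 0x1:                # read is not part of a pair
            continue
        if flag & 0xC:                    # this read or its mate is unmapped
            continue
        if flag & 0x900:                  # secondary or supplementary alignment
            continue
        if fields[6] != "=":              # mates map to different contigs
            continue
        if tlen > 0:                      # positive TLEN: leftmost mate only
            yield tlen


def histogram(sizes, bin_width=25):
    """Count insert sizes in fixed-width bins, keyed by bin start."""
    counts = Counter((s // bin_width) * bin_width for s in sizes)
    return dict(sorted(counts.items()))


def print_histogram(hist, width=50):
    """Print one bar per bin, scaled to the tallest bin."""
    peak = max(hist.values(), default=0)
    for start, n in hist.items():
        print(f"{start:>6} {n:>8} " + "#" * max(1, n * width // peak))
```

Something like `print_histogram(histogram(insert_sizes(open("aligned.sam"))))` (file name hypothetical) should make two peaks or a very broad single peak visible without any plotting library.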


Hi Torst - these come from a variety of bacteria, and in some cases we are lucky enough to have published references. Our latest data seems to have met a similar fate, which is worrying. I'll see what we get when we ignore the paired library information.

13.1 years ago
Botond Sipos ★ 1.7k

Velvet allows for multiple categories of reads (two by default) in order to deal with these situations. Check out the "Using multiple categories" section of the Velvet manual.
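As a rough illustration of what that looks like in practice, the Python sketch below only builds the velveth and velvetg command lines for several paired-end categories; it does not run them. It assumes each category is an interleaved paired FASTQ file. The helper name, file names, k-mer value and insert sizes are all hypothetical; the `-shortPaired2`, `-ins_length2` and `-ins_length2_sd` style options are the ones described in the Velvet manual.

```python
import shlex


def velvet_commands(out_dir, kmer, libraries):
    """Build velveth/velvetg argument lists for several read categories.

    libraries: list of (interleaved_fastq, ins_length, ins_length_sd),
    one tuple per category, in category order.
    """
    velveth = ["velveth", out_dir, str(kmer)]
    velvetg = ["velvetg", out_dir, "-exp_cov", "auto", "-cov_cutoff", "auto"]
    for i, (path, mean, sd) in enumerate(libraries, start=1):
        # The first category has no numeric suffix; later ones use 2, 3, ...
        suffix = "" if i == 1 else str(i)
        velveth += [f"-shortPaired{suffix}", "-fastq", path]
        velvetg += [f"-ins_length{suffix}", str(mean),
                    f"-ins_length{suffix}_sd", str(sd)]
    return velveth, velvetg


if __name__ == "__main__":
    # Hypothetical inputs: one file per peak of the bimodal distribution.
    h, g = velvet_commands("asm_k31", 31,
                           [("peak1.fastq", 300, 30),
                            ("peak2.fastq", 500, 50)])
    print(shlex.join(h))
    print(shlex.join(g))
```

Velvet is compiled with two categories by default, so more than two libraries would need a rebuild with a larger CATEGORIES setting.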

13.2 years ago
Vitis ★ 2.6k

If the distribution is bimodal, could you try dividing the data into two data sets, one per insert size, and then de novo assembling them separately? I think the SAM output would tell you the inferred insert sizes. You can merge the contigs later. Although we have never tried this, we routinely assemble the same reads with different k-mers and merge the assemblies afterwards using CAP or Phrap. Usually this yields better results than a single assembly.
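A minimal Python sketch of the splitting step, assuming the pairs have already been aligned (to a reference or to preliminary contigs) and written as SAM text. It labels each read pair "short" or "long" by comparing the template length (TLEN, column 9) with a cut-off chosen between the two peaks. The function name and the threshold are hypothetical, and pairs that did not align together are simply left unlabelled.

```python
def classify_pairs(sam_lines, threshold):
    """Map read name -> 'short' or 'long' by template length vs. threshold."""
    labels = {}
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        tlen = int(fields[8])
        if not flag & 0x1 or flag & 0xC:  # unpaired, or a mate is unmapped
            continue
        if flag & 0x900:                  # secondary or supplementary alignment
            continue
        if fields[6] != "=" or tlen <= 0: # different contigs, or rightmost mate
            continue
        labels[fields[0]] = "short" if tlen < threshold else "long"
    return labels
```

The resulting names can then be used to write the original reads into two FASTQ files, one per peak, for separate assemblies or for separate Velvet categories.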


This will only work if Mitchell has a reference genome to align reads to.
