Question

Illumina Pair-End Library De-Novo Assembly - Broken Pairs?

4

Entering edit mode

13.5 years ago

Martin A Hansen 3.0k

When trimming paired-end sequences according to quality scores some reads are filtered out because they end up being too short. Thus you may end up with a sequence file where one of a pair is missing. I am looking for information on how different de-novo assemblers handle broken pairs. I am especially interested in Velvet, IDBA, and Ray.

illumina assembly paired denovo • 5.9k views

ADD COMMENT • link updated 13.5 years ago by Lhl ▴ 760 • written 13.5 years ago by Martin A Hansen 3.0k

score 2 · Answer 1 · 2011-06-01

2

Entering edit mode

13.5 years ago

Benm ▴ 710

I don't think you need to input broken pairs. You can use this script below to classify them to normal pairs and single ends, then you can shuffle pairs and merge the single ends for velvet input. It is just available for Illumina reads.

#!/usr/bin/perl -w
use strict;

die "perl $0 <IN1:PE1.fastq> <IN2:PE2.fastq> <OUT1:out.PE1.fastq> <OUT2:out.PE2.fastq>\n" if (@ARGV!=4);

#@ILLUMINA-57021F:7:1:1020:11315#0/1
my ($PE1,$PE2,$out1,$out2)=@ARGV;
my %hash;
open (IN,$PE1) || die $!;
while(<IN>)
{
    if (/^\@(\S+)\#\S+\/[12]/)
    {
        $hash{$1}=$_;
        $hash{$1}.=<IN>;
        $hash{$1}.=<IN>;
        $hash{$1}.=<IN>;
    }
}
close IN;

open (IN,$PE2) || die $!;
open (OUT1,">$out1") || die $!;
open (OUT2,">$out2") || die $!;
open (SE1,">$out1.single") || die $!;
open (SE2,">$out2.single") || die $!;
while(<IN>)
{
    if (/^\@(\S+)\#\S+\/[12]/)
    {
        if (exists $hash{$1})
        {
            print OUT1 $hash{$1};
            undef $hash{$1};
            delete $hash{$1};
            print OUT2 $_;
            $_=<IN>;
            print OUT2 $_;
            $_=<IN>;
            print OUT2 $_;
            $_=<IN>;
            print OUT2 $_;
        }
        else
        {
            print SE2 $_;
            $_=<IN>;
            print SE2 $_;
            $_=<IN>;
            print SE2 $_;
            $_=<IN>;
            print SE2 $_;
        }
    }
}
foreach my $key(keys %hash)
{
    if ((defined $hash{$key})&&($hash{$key} ne ""))
    {
        print SE1 $hash{$key};
    }
}

close IN;
close OUT1;
close OUT2;
close SE1;
close SE2;

ADD COMMENT • link 13.5 years ago by Benm ▴ 710

1

Entering edit mode

I don't think Velvet is a robot, and it doesn't need to check this for your input, because it has already defined multiple input parameter, this problem should be dealt with before in data preprocessing. It won't tell you "oh, this is broken - let ... brabrabra" by current version as I know.

ADD REPLY • link 13.5 years ago by Benm ▴ 710

1

Entering edit mode

Hi, I use your script, but i found that this script use huge memory, I have two library, 8.3G and 7.8G, but the memory requirement seems to be larger than 128G.

ADD REPLY • link 12.8 years ago by Plantae ▴ 390

0

Entering edit mode

I know that you can input multiple libraries to Velvet. But I was wondering how Velvet (and the other assemblers) treat broken reads if supplied as paired-end reads? It would make everything a lot simpler if these assemblers automagically used the unbroken paired-end reads as paired-end reads while broken pairs were used as single end reads. I suspect this could be the case - but I fear that encountering broken pais may trigger a "oh, this is broken - lets make a single-end assembly only" response.

ADD REPLY • link 13.5 years ago by Martin A Hansen 3.0k

0

Entering edit mode

I'd be interested to know how BWA handles these scenarios also.

ADD REPLY • link 13.5 years ago by Travis ★ 2.8k

0

Entering edit mode

The safest way would be to split the data into unbroken an broken pairs beforehand. Velvet supports explicit loading paired and unpaired reads, but IDBA and Ray don't. I'd say that a clever assembler reports the number of paired reads and broken pairs loaded - and assemble away according to this.

ADD REPLY • link 13.5 years ago by Martin A Hansen 3.0k

0

Entering edit mode

As Plantae already mentioned, I don't think that your script will work with most of the NGS data due to memory problem. I myself tried your script and it exceed 256GB available for our machine.

ADD REPLY • link 12.6 years ago by Neil McCauley • 0

score 0 · Answer 2 · 2011-12-09

0

Entering edit mode

13.0 years ago

Lhl ▴ 760

"Note that it is not necessary to ensure that reads appear only once or that every read in the file is paired. Velvet detects which reads are unpaired and handles them as such." I read this in this paper by the author of velvet -- http://www.ncbi.nlm.nih.gov/pubmed/20836074

ADD COMMENT • link 13.0 years ago by Lhl ▴ 760

1

Entering edit mode

Yes, but that bit you quoted appears to refer specifically to providing reads in a SAM/BAM file, not FASTA/Q. The paragraph immediately following that one says, "Velvet requires that paired-end FASTA and FASTQ datasets come in a single merged file where each read is paired with the one directly above or the one directly below."

ADD REPLY • link 13.0 years ago by Kmcarr ▴ 10

0

Entering edit mode

Yes, Kmcarr. You are write i wrote to the author and he responded and said it is necessary to distinguish broken pairs from un-broken ones.

ADD REPLY • link 13.0 years ago by Lhl ▴ 760

0

Entering edit mode

Yes, Kmcarr. You are right. i wrote to the author and he responded and said it is necessary to distinguish broken pairs from paired ones.

ADD REPLY • link 13.0 years ago by Lhl ▴ 760