Illumina Pair-End Library De-Novo Assembly - Broken Pairs?
2
4
Entering edit mode
13.6 years ago

When trimming paired-end sequences according to quality scores some reads are filtered out because they end up being too short. Thus you may end up with a sequence file where one of a pair is missing. I am looking for information on how different de-novo assemblers handle broken pairs. I am especially interested in Velvet, IDBA, and Ray.

illumina assembly paired denovo • 5.9k views
ADD COMMENT
2
Entering edit mode
13.6 years ago
Benm ▴ 710

I don't think you need to input broken pairs. You can use this script below to classify them to normal pairs and single ends, then you can shuffle pairs and merge the single ends for velvet input. It is just available for Illumina reads.

#!/usr/bin/perl -w
use strict;

die "perl $0 <IN1:PE1.fastq> <IN2:PE2.fastq> <OUT1:out.PE1.fastq> <OUT2:out.PE2.fastq>\n" if (@ARGV!=4);

#@ILLUMINA-57021F:7:1:1020:11315#0/1
my ($PE1,$PE2,$out1,$out2)=@ARGV;
my %hash;
open (IN,$PE1) || die $!;
while(<IN>)
{
    if (/^\@(\S+)\#\S+\/[12]/)
    {
        $hash{$1}=$_;
        $hash{$1}.=<IN>;
        $hash{$1}.=<IN>;
        $hash{$1}.=<IN>;
    }
}
close IN;

open (IN,$PE2) || die $!;
open (OUT1,">$out1") || die $!;
open (OUT2,">$out2") || die $!;
open (SE1,">$out1.single") || die $!;
open (SE2,">$out2.single") || die $!;
while(<IN>)
{
    if (/^\@(\S+)\#\S+\/[12]/)
    {
        if (exists $hash{$1})
        {
            print OUT1 $hash{$1};
            undef $hash{$1};
            delete $hash{$1};
            print OUT2 $_;
            $_=<IN>;
            print OUT2 $_;
            $_=<IN>;
            print OUT2 $_;
            $_=<IN>;
            print OUT2 $_;
        }
        else
        {
            print SE2 $_;
            $_=<IN>;
            print SE2 $_;
            $_=<IN>;
            print SE2 $_;
            $_=<IN>;
            print SE2 $_;
        }
    }
}
foreach my $key(keys %hash)
{
    if ((defined $hash{$key})&&($hash{$key} ne ""))
    {
        print SE1 $hash{$key};
    }
}

close IN;
close OUT1;
close OUT2;
close SE1;
close SE2;
ADD COMMENT
1
Entering edit mode

I don't think Velvet is a robot, and it doesn't need to check this for your input, because it has already defined multiple input parameter, this problem should be dealt with before in data preprocessing. It won't tell you "oh, this is broken - let ... brabrabra" by current version as I know.

ADD REPLY
1
Entering edit mode

Hi, I use your script, but i found that this script use huge memory, I have two library, 8.3G and 7.8G, but the memory requirement seems to be larger than 128G.

ADD REPLY
0
Entering edit mode

I know that you can input multiple libraries to Velvet. But I was wondering how Velvet (and the other assemblers) treat broken reads if supplied as paired-end reads? It would make everything a lot simpler if these assemblers automagically used the unbroken paired-end reads as paired-end reads while broken pairs were used as single end reads. I suspect this could be the case - but I fear that encountering broken pais may trigger a "oh, this is broken - lets make a single-end assembly only" response.

ADD REPLY
0
Entering edit mode

I'd be interested to know how BWA handles these scenarios also.

ADD REPLY
0
Entering edit mode

The safest way would be to split the data into unbroken an broken pairs beforehand. Velvet supports explicit loading paired and unpaired reads, but IDBA and Ray don't. I'd say that a clever assembler reports the number of paired reads and broken pairs loaded - and assemble away according to this.

ADD REPLY
0
Entering edit mode

As Plantae already mentioned, I don't think that your script will work with most of the NGS data due to memory problem. I myself tried your script and it exceed 256GB available for our machine.

ADD REPLY
0
Entering edit mode
13.1 years ago
Lhl ▴ 760

"Note that it is not necessary to ensure that reads appear only once or that every read in the file is paired. Velvet detects which reads are unpaired and handles them as such." I read this in this paper by the author of velvet -- http://www.ncbi.nlm.nih.gov/pubmed/20836074

ADD COMMENT
1
Entering edit mode

Yes, but that bit you quoted appears to refer specifically to providing reads in a SAM/BAM file, not FASTA/Q. The paragraph immediately following that one says, "Velvet requires that paired-end FASTA and FASTQ datasets come in a single merged file where each read is paired with the one directly above or the one directly below."

ADD REPLY
0
Entering edit mode

Yes, Kmcarr. You are write i wrote to the author and he responded and said it is necessary to distinguish broken pairs from un-broken ones.

ADD REPLY
0
Entering edit mode

Yes, Kmcarr. You are right. i wrote to the author and he responded and said it is necessary to distinguish broken pairs from paired ones.

ADD REPLY

Login before adding your answer.

Traffic: 1740 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6