When trimming paired-end sequences according to quality scores some reads are filtered out because they end up being too short. Thus you may end up with a sequence file where one of a pair is missing. I am looking for information on how different de-novo assemblers handle broken pairs. I am especially interested in Velvet, IDBA, and Ray.
I don't think you need to input broken pairs. You can use this script below to classify them to normal pairs and single ends, then you can shuffle pairs and merge the single ends for velvet input. It is just available for Illumina reads.
#!/usr/bin/perl -w
use strict;
die "perl $0 <IN1:PE1.fastq> <IN2:PE2.fastq> <OUT1:out.PE1.fastq> <OUT2:out.PE2.fastq>\n" if (@ARGV!=4);
#@ILLUMINA-57021F:7:1:1020:11315#0/1
my ($PE1,$PE2,$out1,$out2)=@ARGV;
my %hash;
open (IN,$PE1) || die $!;
while(<IN>)
{
if (/^\@(\S+)\#\S+\/[12]/)
{
$hash{$1}=$_;
$hash{$1}.=<IN>;
$hash{$1}.=<IN>;
$hash{$1}.=<IN>;
}
}
close IN;
open (IN,$PE2) || die $!;
open (OUT1,">$out1") || die $!;
open (OUT2,">$out2") || die $!;
open (SE1,">$out1.single") || die $!;
open (SE2,">$out2.single") || die $!;
while(<IN>)
{
if (/^\@(\S+)\#\S+\/[12]/)
{
if (exists $hash{$1})
{
print OUT1 $hash{$1};
undef $hash{$1};
delete $hash{$1};
print OUT2 $_;
$_=<IN>;
print OUT2 $_;
$_=<IN>;
print OUT2 $_;
$_=<IN>;
print OUT2 $_;
}
else
{
print SE2 $_;
$_=<IN>;
print SE2 $_;
$_=<IN>;
print SE2 $_;
$_=<IN>;
print SE2 $_;
}
}
}
foreach my $key(keys %hash)
{
if ((defined $hash{$key})&&($hash{$key} ne ""))
{
print SE1 $hash{$key};
}
}
close IN;
close OUT1;
close OUT2;
close SE1;
close SE2;
I don't think Velvet is a robot, and it doesn't need to check this for your input, because it has already defined multiple input parameter, this problem should be dealt with before in data preprocessing. It won't tell you "oh, this is broken - let ... brabrabra" by current version as I know.
Hi, I use your script, but i found that this script use huge memory, I have two library, 8.3G and 7.8G, but the memory requirement seems to be larger than 128G.
I know that you can input multiple libraries to Velvet. But I was wondering how Velvet (and the other assemblers) treat broken reads if supplied as paired-end reads? It would make everything a lot simpler if these assemblers automagically used the unbroken paired-end reads as paired-end reads while broken pairs were used as single end reads. I suspect this could be the case - but I fear that encountering broken pais may trigger a "oh, this is broken - lets make a single-end assembly only" response.
The safest way would be to split the data into unbroken an broken pairs beforehand. Velvet supports explicit loading paired and unpaired reads, but IDBA and Ray don't. I'd say that a clever assembler reports the number of paired reads and broken pairs loaded - and assemble away according to this.
As Plantae already mentioned, I don't think that your script will work with most of the NGS data due to memory problem. I myself tried your script and it exceed 256GB available for our machine.
"Note that it is not necessary to ensure that reads appear only
once or that every read in the file is paired. Velvet detects which reads are unpaired and
handles them as such." I read this in this paper by the author of velvet -- http://www.ncbi.nlm.nih.gov/pubmed/20836074
Yes, but that bit you quoted appears to refer specifically to providing reads in a SAM/BAM file, not FASTA/Q. The paragraph immediately following that one says, "Velvet requires that paired-end FASTA and FASTQ datasets come in a single merged file where each read is paired with the one directly above or the one directly below."
I don't think Velvet is a robot, and it doesn't need to check this for your input, because it has already defined multiple input parameter, this problem should be dealt with before in data preprocessing. It won't tell you "oh, this is broken - let ... brabrabra" by current version as I know.
Hi, I use your script, but i found that this script use huge memory, I have two library, 8.3G and 7.8G, but the memory requirement seems to be larger than 128G.
I know that you can input multiple libraries to Velvet. But I was wondering how Velvet (and the other assemblers) treat broken reads if supplied as paired-end reads? It would make everything a lot simpler if these assemblers automagically used the unbroken paired-end reads as paired-end reads while broken pairs were used as single end reads. I suspect this could be the case - but I fear that encountering broken pais may trigger a "oh, this is broken - lets make a single-end assembly only" response.
I'd be interested to know how BWA handles these scenarios also.
The safest way would be to split the data into unbroken an broken pairs beforehand. Velvet supports explicit loading paired and unpaired reads, but IDBA and Ray don't. I'd say that a clever assembler reports the number of paired reads and broken pairs loaded - and assemble away according to this.
As Plantae already mentioned, I don't think that your script will work with most of the NGS data due to memory problem. I myself tried your script and it exceed 256GB available for our machine.