We do the same thing but add a pairFixer script. Basically it goes through the pair files twice to fix the pairs. The first time to see all the headers(just the common part) that are seen twice and a second time to create these files:
pair1
pair2
singles
I also added a -M12 to the clipper part to tolerate non-full length adapters but maybe 12 is a bit too short.
[HTML][HTML]
!/usr/bin/perl
use strict;
my $outName = $ARGV[0];
my $pair1 = $ARGV[1];
my $pair2 = $ARGV[2];
my $pair1FH;
my $pair2FH;
my $pair1outFH;
my $pair2outFH;
my $pairSingleFH;
open ($pair1FH, '-|', 'zcat', "$pair1") || die "Cannot Open pair 1 File";
open ($pair2FH, '-|', 'zcat', "$pair2") || die "Cannot Open pair 2 File";
my @pairedFiles = ($pair1FH,$pair2FH);
my %readCount;
my $totalRecords = 0;
for my $pairFH (@pairedFiles) {
while (my $head = <$pairFH>) {
<$pairFH>;
<$pairFH>;
<$pairFH>;
$totalRecords++;
my $key = substr($head, 0, length($head)-3);
my $value = $readCount{$key};
if(defined $value) {
$value++;
$readCount{$key} = $value;
}
else {
$readCount{$key} = 1;
}
}
close($pairFH);
}
open ($pair1FH, '-|', 'zcat', "$pair1") || die "Cannot Open pair 1 File";
open ($pair2FH, '-|', 'zcat', "$pair2") || die "Cannot Open pair 2 File";
@pairedFiles = ($pair1FH,$pair2FH);
open ($pair1outFH, ">$outName.pair1.fastq") || die "Cannot Open pair out 1 File";
open ($pair2outFH, ">$outName.pair2.fastq") || die "Cannot Open pair out 2 File";
open ($pairSingleFH, ">$outName.single.fastq") || die "Cannot Open single File";
my @outputFiles = ($pair1outFH,$pair2outFH);
my $idx=0;
my $records=0;
my $pairs=0;
my $single=0;
print "Copying reads\n";
for my $pairFH (@pairedFiles) {
my $outFH = $outputFiles[$idx];
$idx++;
while(my $head = <$pairFH>) {
my $seq = <$pairFH>;
my $head2 = <$pairFH>;
my $qual = <$pairFH>;
my $key = substr($head, 0, length($head)-3);
my $value = $readCount{$key};
my $fh;
$records++;
if($value == 1) {
$fh = $pairSingleFH;
$single++;
}
elsif($value > 1) {
$fh = $outFH;
$pairs++;
}
else {
die "No value found for: $key\n";
}
print $fh $head;
print $fh $seq;
print $fh $head2;
print $fh $qual;
print "\rTotal: $totalRecords, Records: $records, Pairs: $pairs, Single: $single ";
}
close($outFH);
close($pairFH);
}
close($pairSingleFH);
print "Total: $totalRecords, Records: $records, Pairs: $pairs, Single: $single\n";
[HTML]
[HTML]
2 hours per read? :-) hope you don't have any deadlines in the next few hundred years.
Oops. I mean per "sample" :), now corrected in the question. !
2 CPU hours? I think that is a reasonable speed, at least way faster than mapping.
It only takes 2 hours for your samples? We do something very similar (fastqc, fastx_clipper, fastq_quality_trimmer, fastqc again), and I've found that for our samples these steps can take up to 3X longer than alignment with BWA. It's on the order of 12 hours per sample for our Hiseq read sets.