It might seem like this is a problem that doesn't quite need a fix, but each step is rather large because the reads file is so large. So, I appreciate anyone helping me save some computation time by skipping the middle man (ie fasta)
Lee, this perl script 'fastq2afg.pl' should do what you want. You'll have to tweak it if your FASTQ files use a different offset. Let me know if it works, and I'll add in some command line options and release it on my website. Torst
#!/usr/bin/perl
use strict;
use warnings;
# Copyright (C) 2011 Torsten Seemann <torsten@seemann.id.au>
#
# http://bioinformatics.net.au
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
my $FASTQ_OFFSET = 33; # 33=sanger 64=illumina(<1.8)
my $AFG_OFFSET = 60;
my $iid=0;
while (my $eid = <ARGV>) {
die "bad fastq ID line '$eid'" unless $eid =~ m/^\@/;
$eid = substr $eid, 1; # remove '@'
my $seq = scalar(<ARGV>);
chomp $seq;
my $id2 = scalar(<ARGV>);
die "bad fastq ID2 line '$id2" unless $id2 =~ m/^\+/;
my $qlt = scalar(<ARGV>);
chomp $qlt;
$iid++;
print "{RED\n";
print "iid:$iid\n";
print "eid:$eid"; # already has \n
print "seq:\n";
for (my $i=0; $i < length($seq); $i+=60) {
print substr($seq, $i, 60), "\n";
}
print ".\n";
print "qlt:\n";
my @q = split m//, $qlt;
@q = map { chr( ord($_)-$FASTQ_OFFSET + $AFG_OFFSET ) } @q;
for (my $i=0; $i < @q; $i++) {
print $q[$i];
print "\n" if ($i+1) % 60 == 0;
}
print ".\n";
print "}\n";
}
I modified Torsten's script so that it is in a neat subroutine, and I modified it a little to make it faster (at the least, I think it could make it faster). It's not totally tested thoroughly as a subroutine yet though.
Thanks so much again Torsten.
# modified from Torsten Seemann <torsten@seemann.id.au>
# A: Convert Fastq To Afg
sub fastqToAfg{
my($fastq,$outfile,$settings)=@_;
$$settings{qOffset}||=33; # 33=sanger 64=illumina(<1.8)
$$settings{aOffset}||=60;
$|++;
my $secondLine=`head -2 $fastq|tail -1`;
chomp($secondLine);
my $readLength=length($secondLine);
open(FASTQ,$fastq) or die "Could not open $fastq because $!";
open(AFG,">",$outfile) or die "Could not open $outfile for writing because $!";
my $iid=0;
while (my $eid = <FASTQ>) {
die "bad fastq ID line '$eid'" unless $eid =~ m/^\@/;
$eid = substr $eid, 1; # remove '@'
my $seq = scalar(<FASTQ>);
chomp $seq;
$seq=~s/(.{60})/\1\n/g; #newline every 60 bp
my $id2 = scalar(<FASTQ>);
die "bad fastq ID2 line '$id2" unless $id2 =~ m/^\+/;
my $qlt = scalar(<FASTQ>);
chomp $qlt;
$iid++;
print AFG "{RED\n";
print AFG "iid:$iid\n";
print AFG "eid:$eid"; # already has \n
print AFG "seq:\n$seq\n";
print AFG "qlt:\n";
my @q = split m//, $qlt;
@q = map { chr( ord($_)-$$settings{qOffset} + $$settings{aOffset} ) } @q;
my $qltAfg=join("",@q);
$qltAfg=~s/(.{60})/\1\n/g; #newline every 60 bp
print AFG $qltAfg."\n";
print AFG "}\n";
print "." if($iid % 10000 == 0);
#last if($iid>20);
}
print "\n";
close FASTQ; close AFG;
$|--;
return $outfile;
}
I was going to use the s/(.{60})/ trick but didn't in the end. I reckon the substr might be faster, but ultimately doesn't matter much. I note your $readLength variable is never used? If it is, make sure it is in the INNER loop, as reads can be variable length. Also be aware that my/our script does not handle FASTQ where the seq/qual is spread over multiple lines! Also $|++ will SLOW it down as it turns off output buffering. And the map/split qlt changer could probably be done efficiently in situ with a pack() or tr// command.
I think I was in the middle of testing the subroutine and left in some artifacts. Your points are absolutely right on target, and I will fix my subroutine on my computer. I would love to see that tr// command but that's not something I trust myself doing (I wish I were as good as you at all this!)
It might seem like this is a problem that doesn't quite need a fix, but each step is rather large because the reads file is so large. So, I appreciate anyone helping me save some computation time by skipping the middle man (ie fasta)