This seems like a pretty straightforward scripting exercise. This script should work with your read naming scheme, if its use of a hash table might be a bit memory-intensive. How well this works may depend on the size of your input and system memory. To use it, run something like:
$ ./splitFq.pl < convertedReads.fq
This will create two files firstPair.fq
and secondPair.fq
, with reads in identical order (if read data are present for a given header). So if there are reads present in the input for headers A/1
, A/2
, B/1
, C/1
, C/2
(in whatever order: C, A, B
, or B, C, A
, etc.), then the output for the first file will be read data for A, B, C
and the output for the second file will be read data for A, C
— in that order. If header pairs match, then you will have identical ordering in both files.
Note the renaming of headers in the two output files, where the last two characters are stripped.
#!/usr/bin/env perl
use strict;
use warnings;
my $header;
my $pair;
my $sequence;
my $secondHeader;
my $quality;
my $lineIdx = 0;
my $fqRef;
while (<>) {
chomp;
if ($lineIdx == 0) {
$header = $_;
$pair = substr $header, -1, 1; # get read pair number, either '1' or '2'
$header = substr $header, 0, length($header) - 2; # strip last two characters to make an ID that points to both read pairs
}
elsif ($lineIdx == 1) {
$sequence = $_;
}
elsif ($lineIdx == 2) {
$secondHeader = $_;
}
elsif ($lineIdx == 3) {
$quality = $_;
}
$lineIdx++;
if ($lineIdx == 4) {
$lineIdx = 0;
$fqRef->{$header}->{$pair}->{sequence} = $sequence;
$fqRef->{$header}->{$pair}->{secondHeader} = $secondHeader;
$fqRef->{$header}->{$pair}->{quality} = $quality;
}
}
open FIRSTPAIR, "> firstPair.fq";
open SECONDPAIR, "> secondPair.fq";
foreach $header (sort keys %{$fqRef}) {
if (defined $fqRef->{$header}->{1}->{sequence}) {
$sequence = $fqRef->{$header}->{1}->{sequence};
$secondHeader = $fqRef->{$header}->{1}->{secondHeader};
$quality = $fqRef->{$header}->{1}->{quality};
my $firstFq = "$header\n$sequence\n$secondHeader\n$quality";
print FIRSTPAIR "$firstFq\n";
}
if (defined $fqRef->{$header}->{2}->{sequence}) {
$sequence = $fqRef->{$header}->{2}->{sequence};
$secondHeader = $fqRef->{$header}->{2}->{secondHeader};
$quality = $fqRef->{$header}->{2}->{quality};
my $secondFq = "$header\n$sequence\n$secondHeader\n$quality";
print SECONDPAIR "$secondFq\n";
}
}
close SECONDPAIR;
close FIRSTPAIR;
How look headers of two reads that make up a pair?
The header for Read1:
@D3NH4HQ1:150:C1FTFACXX:7:1101:1605:2090 1:N:0:TTAGGC
The header for Read 2:
@D3NH4HQ1:150:C1FTFACXX:7:1101:1605:2090 2:N:0:TTAGGC
I do have a script to convert the headers to:
@D3NH4HQ1:150:C1FTFACXX:7:1101:1605:2090#0/1
@D3NH4HQ1:150:C1FTFACXX:7:1101:1605:2090#0/2
Thanks,
NP