Sorting Fastq Files After Trimming (Orphans And Pe)
4
6
Entering edit mode
12.0 years ago
neal.platt ▴ 240

I have a bunch of Illumina PE data that has been run through fastx trimmer and clipper. I am ready to map these reads, but am needing to create 2 files for paired end reads (the left and right hand reads in separate files) and a file with the orphaned reads. Of course the paired end files need to have the reads in the same order.

This has to be a common problem, but I can't seem to find a tool that parses fastq files in this way (I swear I searched the Biostar forum).

Any help would be greatly appreciated.

NP

illumina paired-end mapping • 12k views
ADD COMMENT
0
Entering edit mode

How look headers of two reads that make up a pair?

ADD REPLY
0
Entering edit mode

The header for Read1:
@D3NH4HQ1:150:C1FTFACXX:7:1101:1605:2090 1:N:0:TTAGGC

The header for Read 2:
@D3NH4HQ1:150:C1FTFACXX:7:1101:1605:2090 2:N:0:TTAGGC

I do have a script to convert the headers to:
@D3NH4HQ1:150:C1FTFACXX:7:1101:1605:2090#0/1
@D3NH4HQ1:150:C1FTFACXX:7:1101:1605:2090#0/2

Thanks,

NP

ADD REPLY
3
Entering edit mode
12.0 years ago
lh3 33k

FIrst of all, processing reads separately is a bad practice we should always avoid. That is why you have not seen related questions here. Usually, mappers are able to perform basic filtering. We do not need to throw reads before hand. When we have to drop reads, we should mark the reads to be deleted (e.g. trim reads down to 1bp) and then merge the two ends. In that case, we just need to sequentially read two files and that is very easy and fast.

In your case, you can compare the original fastq and the read-dropped fastq to mark dropped reads. If you do not have the original fastq, use sort as follows:

mkfifo tmp
awk 'NR%4==1{n=$1}NR%4==2{s=$1}NR%4==0{print n,s,$1}' r1.fq | sort -S 2G > tmp &
awk 'NR%4==1{n=$1}NR%4==2{s=$1}NR%4==0{print n,s,$1}' r2.fq | sort -S 2G | join -a1 -a2 tmp - | awk 'NF==5{print $1"\n"$2"\n+\n"$3 >"x1.fq";print $1"\n"$4"\n+\n"$5 >"x2.fq"}NF==3{print $1"\n"$2"\n+\n"$3>"orphan.fq"}'

This procedure only requires 2G memory. You may also use Alex's script, but make sure you have a machine with RAM huge enough to hold your sequences. Again, I am showing the solution, but we should avoid this complication in the first place.

ADD COMMENT
0
Entering edit mode

Hi lh3,

Thanks for your input. In my case, I need to select the reads that have sequence X in the first mate and sequence Y in the second mate, which are most reads in the fastq files, but not all. So I really need to drop these non-useful reads before mapping because of the aim of my analysis.

In your example, r1.fq and r2.fq are the clipped/trimmed/etc files. I don't understand clearly, however, your statement that "you can compare the original fastq and the read-dropped fastq to mark dropped reads. If you do not have the original fastq, use sort as follows:" . Would you care to elaborate, please?

Thanks so much, Carmen

ADD REPLY
0
Entering edit mode
12.0 years ago

This seems like a pretty straightforward scripting exercise. This script should work with your read naming scheme, if its use of a hash table might be a bit memory-intensive. How well this works may depend on the size of your input and system memory. To use it, run something like:

$ ./splitFq.pl < convertedReads.fq

This will create two files firstPair.fq and secondPair.fq, with reads in identical order (if read data are present for a given header). So if there are reads present in the input for headers A/1, A/2, B/1, C/1, C/2 (in whatever order: C, A, B, or B, C, A, etc.), then the output for the first file will be read data for A, B, C and the output for the second file will be read data for A, C — in that order. If header pairs match, then you will have identical ordering in both files.

Note the renaming of headers in the two output files, where the last two characters are stripped.

#!/usr/bin/env perl

use strict;
use warnings;

my $header;
my $pair;
my $sequence;
my $secondHeader;
my $quality;
my $lineIdx = 0;
my $fqRef;

while (<>) {
    chomp;
    if ($lineIdx == 0) {
        $header = $_;
        $pair = substr $header, -1, 1; # get read pair number, either '1' or '2'
        $header = substr $header, 0, length($header) - 2; # strip last two characters to make an ID that points to both read pairs
    }
    elsif ($lineIdx == 1) {
        $sequence = $_;
    }
    elsif ($lineIdx == 2) {
        $secondHeader = $_;
    }
    elsif ($lineIdx == 3) {
        $quality = $_;
    }
    $lineIdx++;
    if ($lineIdx == 4) {
        $lineIdx = 0;
        $fqRef->{$header}->{$pair}->{sequence} = $sequence;
        $fqRef->{$header}->{$pair}->{secondHeader} = $secondHeader;
        $fqRef->{$header}->{$pair}->{quality} = $quality;
    }
}

open FIRSTPAIR, "> firstPair.fq";
open SECONDPAIR, "> secondPair.fq";
foreach $header (sort keys %{$fqRef}) {
    if (defined $fqRef->{$header}->{1}->{sequence}) {
        $sequence = $fqRef->{$header}->{1}->{sequence};
        $secondHeader = $fqRef->{$header}->{1}->{secondHeader};
        $quality = $fqRef->{$header}->{1}->{quality};
        my $firstFq = "$header\n$sequence\n$secondHeader\n$quality";
        print FIRSTPAIR "$firstFq\n";
    }
    if (defined $fqRef->{$header}->{2}->{sequence}) {
        $sequence = $fqRef->{$header}->{2}->{sequence};
        $secondHeader = $fqRef->{$header}->{2}->{secondHeader};
        $quality = $fqRef->{$header}->{2}->{quality};
        my $secondFq = "$header\n$sequence\n$secondHeader\n$quality";
        print SECONDPAIR "$secondFq\n";
    }
}
close SECONDPAIR;
close FIRSTPAIR;
ADD COMMENT
0
Entering edit mode

Hi Alex,

Forgive me, I'm a bit confused as to how to use your above script. My reads1 and reads2 to be checked for orphans and sorted are in 2 separate files, but in the script I don't see where it takes that input in. Would you mind pointing me in the right direction? My guess is all reads /1 and /2 must be in the same file?

Thanks! Carmen

ADD REPLY
0
Entering edit mode

Correct. Just do something like cat reads1 reads2 > reads.fq and then splitFq.pl < reads.fq. The reads come into the script via standard input by way of the < operand.

ADD REPLY
0
Entering edit mode
12.0 years ago

I've done something like this previously. I.e. take two fastq files for the /1 and /2 read pairs, do quality trimming etc and then interleave them into a single fastq file and spit out orphans into their own file. My approach (http://bigsa.org.au/node/92) uses cdbfasta and awk:

  1. Use cdbfasta to index a concatenated fastq file of the two paired fastq files.
  2. Use cdbyank to pull out the read IDs from the index
  3. Parse the read IDs with awk, in 2 passes, to identify duplicate and orphaned read names. Assuming "/" delimits the the template name from the read pair suffix.
  4. Use cdbyank to pull out the fastq sequences from the concatenated fastq file in step 1 for the paired and orphaned reads using the read IDs from step 3.
  5. It you want deinterleaved pairs, run the interleaved fastq file through my deinterleaving script.

There are lots of was to do this, but these scripts might give you a head start!

ADD COMMENT
0
Entering edit mode
12.0 years ago
Ryan Thompson ★ 3.6k

Perhaps you should interleave the read pairs into a single file, then do whatever trimming you require, and finally split the remaining interleaved pairs back out into separate files. This avoids the need to match up the reads afterward, because they will always be adjacent to each other.

Also, depending on your needs, SeqPrep might be a good fit for you. It does some kinds of trimming while preserving read pairs.

ADD COMMENT

Login before adding your answer.

Traffic: 946 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6