Split Fastq Files With A Pattern
2
1
Entering edit mode
11.4 years ago
Assa Yeroslaviz ★ 1.9k

Hi everybody,

I would like to split a fastq file into two separate fastq files on a specific position. For example, I have these reads:

@HWI-ST225:523:D1AY5ACXX:8:1101:1566:2149 1:N:0:GTGGCC
GCATACCCTCCCTGTCTCAGTTGCTGTTGAAAGAAGAAATCCGCGATATCTTATCCAACCCGCGATATCTTATCCAACGAAGCCAAAACCCTCGCAGTCTG
+
??@DDDDDHHHHHIGIHGHICHEGAHC<HHII9?FB3?F4C<FDD>0?9D*99*??D@;AA9=?'''3@@>@A@>::?B?2?-?CC3<A??8?B@B#####
@HWI-ST225:523:D1AY5ACXX:8:1101:2000:2191 1:N:0:GTGGCC
TTGCTTGTTGCGTGTCTCAGGCGGAAAAACGCCAAAATCAACCGCGATATACATTCCAACCCGCGATATACATTCCAACGTCAGCTCTGAGCTGCTGATCT
+
@@CFFFFFHHHHFGFIIJJJIIIHHIJJIIGGGGIEHICHAEDEHHHHADDB9?BEDDDBBDDDDBBB>DDC@@C@CCDBBD05>CCDACDDDDDDC:3>>

If you look a bit closer, you can see that there is a pair of linkers in each of the reads. The linkers are:

A - CCGCGATAT CTTA TCCAAC

B - CCGCGATAT ACAT TCCAAC

The linkers differ in only four bases in the middle. I would like to split the fastq file into two, or trim each time either sides of the fastq file exactly between these two linkers (but still keep the quality values for the reads).

Does anyone know a way to do so?

Thanks Assa

fastq split • 5.2k views
ADD COMMENT
2
Entering edit mode
11.4 years ago
JC 13k

One solution in Perl:

#!/usr/bin/perl

use strict;
use warnings;

my $linkA = 'CCGCGATATCTTATCCAAC';
my $linkB = 'CCGCGATATACATTCCAAC';

open L, ">left.fq"  or die;
open R, ">right.fq" or die;
my $nl = 0;
my ($id, $seq, $sep, $qual);
while (<>) {
    chomp;
    $nl++;
    if ($nl == 1) { 
        $id   = $_; 
    }
    elsif ($nl == 2) { 
        $seq  = $_; 
    }
    elsif ($nl == 3) { 
        $sep  = $_; 
    }
    elsif ($nl == 4) { 
        $qual = $_;
        splitRead($id, $seq, $sep, $qual);
        $nl = 0;
    }
}
close L;
close R;

sub splitRead {
    my ($id, $seq, $sep, $qual) = @_;
    my ($lseq, $rseq, $lqual, $rqual);
    if ($seq =~ m/$linkA$linkA/) {
        ($lseq, $rseq) = split (/$linkA$linkA/, $seq);
        $lqual = substr ($qual, 0, length $lseq);
        $rqual = substr ($qual, (2 * length $linkA) + length $lseq);
    }
    elsif ($seq =~ m/$linkB$linkB/) {
        ($lseq, $rseq) = split (/$linkB$linkB/, $seq);
        $lqual = substr ($qual, 0, length $lseq);
        $rqual = substr ($qual, (2 * length $linkB) + length $lseq);
    }
    else {
        warn "ERROR: read doesn't contain the linkers:\n$id $seq $sep $qual\n";
        return;
    }
    print L "$id\n$lseq\n$sep\n$lqual\n";
    print R "$id\n$rseq\n$sep\n$rqual\n";
    return;
}

Then, you run as:

perl splitReads.pl < file.fastq
ADD COMMENT
0
Entering edit mode

thanks for the help. Unfortunately I was wrong in my description. The combination of linkAlinkA is not correct. What I need is revcomp(linkA)linkA and rhe same for linker B. The reverse complement function I have already found:

sub reverse_complement {
        my $dna = shift;
    # reverse the DNA sequence
        my $revcomp = reverse($dna);    
    # complement the reversed DNA sequence
        $revcomp =~ tr/ACGTacgt/TGCAtgca/;
        return $revcomp;
}

To make the script works better I would also like to add two things (if possible)

  1. Is it possible to not only split the reads in the middle, but also trim them? I would like to have for the left linker also 20 bases before the linker and for the right linker I would like to have the 20 bases after the linker sequences.

  2. Is it possible to add an option for possible mismatches? Sometimes due to sequencing errors the linker sequences is not as exact as given in the pattern for linker A and B. Is it possible to add an option for adding mismatches?

Thanks Assa

ADD REPLY
1
Entering edit mode
11.1 years ago
SRKR ▴ 180

You can use simple regular expression to get it done. This is the regular expression you need to search for

(@[^\n]{1,}\n)([ATGC]{1,})CCGCGATAT[ATGC]{4}TCCAAC([ATGC]{1,})([\s]{1,}\+[\s]{1,}[^\n]{1,}\n)

Then simply replace it with

\1\2\4\1\3\4

This will remove your linker sequence completely and separate the sequences into two with same detail lines. The result for your query will be the following:

@HWI-ST225:523:D1AY5ACXX:8:1101:1566:2149 1:N:0:GTGGCC
GCATACCCTCCCTGTCTCAGTTGCTGTTGAAAGAAGAAATCCGCGATATCTTATCCAAC
+
??@DDDDDHHHHHIGIHGHICHEGAHC<HHII9?FB3?F4C<FDD>0?9D*99*??D@;AA9=?'''3@@>@A@>::?B?2?-?CC3<A??8?B@B#####
@HWI-ST225:523:D1AY5ACXX:8:1101:1566:2149 1:N:0:GTGGCC
GAAGCCAAAACCCTCGCAGTCTG
+
??@DDDDDHHHHHIGIHGHICHEGAHC<HHII9?FB3?F4C<FDD>0?9D*99*??D@;AA9=?'''3@@>@A@>::?B?2?-?CC3<A??8?B@B#####
@HWI-ST225:523:D1AY5ACXX:8:1101:2000:2191 1:N:0:GTGGCC
TTGCTTGTTGCGTGTCTCAGGCGGAAAAACGCCAAAATCAACCGCGATATACATTCCAAC
+
@@CFFFFFHHHHFGFIIJJJIIIHHIJJIIGGGGIEHICHAEDEHHHHADDB9?BEDDDBBDDDDBBB>DDC@@C@CCDBBD05>CCDACDDDDDDC:3>>
@HWI-ST225:523:D1AY5ACXX:8:1101:2000:2191 1:N:0:GTGGCC
GTCAGCTCTGAGCTGCTGATCT
+
@@CFFFFFHHHHFGFIIJJJIIIHHIJJIIGGGGIEHICHAEDEHHHHADDB9?BEDDDBBDDDDBBB>DDC@@C@CCDBBD05>CCDACDDDDDDC:3>>

You can do this regular expression replacement in any of the languages. I use PHP. Hope it's useful.

ADD COMMENT
0
Entering edit mode

Hi, thanks for that, But I do need the linker sequence for further down-stream analysis. But I can still do it the same way, just without loosing the linkers.

ADD REPLY

Login before adding your answer.

Traffic: 1980 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6