Question

Fastq Splitter For Paired End Reads

16

Entering edit mode

12.9 years ago

Geparada ★ 1.5k

Hi!!

I have to analyze paired-end RNA-seq read that are in an unusual format: both pair-end reads are joined in one FASTQ. I need to split the file in two separated FASTQ paried-end files.

There are a galaxy tool named FASTQ splitter that can do this:

FASTQ splitter

What it does

Splits a single fastq dataset representing paired-end run into two datasets (one for each end). This tool works only for datasets where both ends have the same length.

Sequence identifiers will have /1 or /2 appended for the split left-hand and right-hand reads, respectively.

Input format

A multiple-fastq file, for example:

@HWI-EAS91_1_30788AAXX:7:21:1542:1758
GTCAATTGTACTGGTCAATACTAAAAGAATAGGATCGCTCCTAGCATCTGGAGTCTCTATCACCTGAGCCCA
+HWI-EAS91_1_30788AAXX:7:21:1542:1758
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh`hfhhVZSWehR

Outputs

Left-hand Read:

@HWI-EAS91_1_30788AAXX:7:21:1542:1758/1
GTCAATTGTACTGGTCAATACTAAAAGAATAGGATC
+HWI-EAS91_1_30788AAXX:7:21:1542:1758/1
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

Right-hand Read:

@HWI-EAS91_1_30788AAXX:7:21:1542:1758/2
GCTCCTAGCATCTGGAGTCTCTATCACCTGAGCCCA
+HWI-EAS91_1_30788AAXX:7:21:1542:1758/2
hhhhhhhhhhhhhhhhhhhhhhhh`hfhhVZSWehR

Do you know any other standard alone script that can do this job?

rna • 51k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 12.9 years ago by Geparada ★ 1.5k

7

Entering edit mode

Just as a side note, this output is what you get when you use 'fastq-dump' of the SRA tools to download paired-end data from the SRA archive. In that case, there is a nice option: 'fastq-dump --split-files' which will output one file for all the /1 pairs and another file for the /2 pairs.

ADD REPLY • link 10.9 years ago by Leonor Palmeira 3.9k

0

Entering edit mode

I tried this command on interleaved fq containing paired reads from both ends, but it did not work even if I changed .fq to .sra. Should it have worked?

ADD REPLY • link updated 3.2 years ago by Ram 44k • written 10.5 years ago by trakhtenberg ▴ 160

2

Entering edit mode

Hi trakhtenberg,

Hopefully you were able to figure this out given how long ago this was, but for posterity of others looking for answers, the fastq-dump does not act on local files. It instead downloads files from SRA and converts them into fastq files for you. If the SRA accession number (usually SRR#######) stores paired-end reads, you should use the following command:

fastq-dump --split-files -O path/to/output SRR#######

More information about fastq-dump and other SRA toolkit utilities can be found here: http://www.ncbi.nlm.nih.gov/books/NBK158900/

(edit to make code clearer)

ADD REPLY • link 9.0 years ago by warren-mcgee ▴ 40

8

Entering edit mode

12.9 years ago

Marvin ▴ 900

For the left read:

awk 'NR%2==1 { print $0 "/1" } ; NR%2==0 { print substr($0,0,length($0)/2) }'

for the right read:

awk 'NR%2==1 { print $0 "/2" } ; NR%2==0 { print substr($0,length($0)/2) }'

ADD COMMENT • link updated 3.9 years ago by Ram 44k • written 12.9 years ago by Marvin ▴ 900

4

Entering edit mode

Very elegant, but string indices in awk are 1-based so that as is, above code duplicates the last base of the first read as the first base of the second read. To avoid that, instead run:

awk 'NR%2==1 { print $0 "/2" } ; NR%2==0 { print substr($0,length($0)/2+1) }'

for 2nd/right read. (Code for 1st/left read can remain unchanged since awk substr returns the same result for both 0 and 1 as the start index.)

ADD REPLY • link 8.8 years ago by rschulz ▴ 40

0

Entering edit mode

Thank you for this indexing fix, @rschulz, and for the general method, @Marvin!

I used this method to separate some PE reads downloaded through NCBI's sratools fastq-dump. This works just great with the fixed indexing at the end of the "right read" function.

ADD REPLY • link 8.3 years ago by paanvaannd • 0

0

Entering edit mode

much more simpler and efficient than perl or python

ADD REPLY • link 8.9 years ago by tiago211287 ★ 1.5k

2

Entering edit mode

12.9 years ago

Geparada ★ 1.5k

Thanks for your script David Langenberger!

I don't understand perl scripts, so I decided to make my own python script, but again thanks for your time!

import sys
from Bio import SeqIO
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
from Bio.SeqRecord import SeqRecord

def main(fastq):

    Rd1_out_name = fastq.replace(".fastq", "Rd1.fastq")
    Rd2_out_name = fastq.replace(".fastq", "Rd2.fastq")
    Rd1_out = open(Rd1_out_name, 'w')
    Rd2_out = open(Rd2_out_name, 'w')

    for record in SeqIO.parse(fastq, "fastq"):

        seq = str(record.seq)
        Rd1_seq = seq[:len(seq)/2]
        Rd2_seq = seq[len(seq)/2:]                                  
        Q = record.letter_annotations["phred_quality"]
        Rd1_Q = Q[:len(Q)/2]
        Rd2_Q = Q[len(Q)/2:]            
        Rd1_id = record.id.strip("/1").strip("/2") + "/1"
        Rd2_id = record.id.strip("/1").strip("/2") + "/2"

        Rd1 = SeqRecord( Rd1_seq , id = Rd1_id, description = "" )
        Rd1.letter_annotations["phred_quality"] = Rd1_Q         
        Rd2 = SeqRecord( Rd2_seq , id = Rd2_id, description = "" )
        Rd2.letter_annotations["phred_quality"] = Rd2_Q

        Rd1_out.write(Rd1.format("fastq"),)
        Rd2_out.write(Rd2.format("fastq"),)


if __name__ == '__main__':
    main(sys.argv[1])

ADD COMMENT • link updated 3.9 years ago by Ram 44k • written 12.9 years ago by Geparada ★ 1.5k

0

Entering edit mode

Great idea! I think in Galaxy they ask you to cite them, when using the tool, don't they? Haha! I think you can publish everything....

ADD REPLY • link updated 3.9 years ago by Ram 44k • written 12.9 years ago by David Langenberger 11k

0

Entering edit mode

I just try to don't use galaxy because I run my pipelines by shell scripts ...

ADD REPLY • link updated 3.9 years ago by Ram 44k • written 12.9 years ago by Geparada ★ 1.5k

0

Entering edit mode

I just use it sometimes to test some tools... when I'm too lazy to install it. But for high-throughput analysis it's completely useless.

ADD REPLY • link updated 3.9 years ago by Ram 44k • written 12.9 years ago by David Langenberger 11k

1

Entering edit mode

10.3 years ago

smilefreak ▴ 420

Using List Comprehension in python

fastq_1 = open('fastq_r1.fastq')
fastq_2 = open('fastq_r2.fastq')

[r1.write(line) if (i % 8 < 4) else r2.write(line) for i, line in enumerate(open('test.fastq'))]

fastq_1.close()
fastq_2.close()

ADD COMMENT • link updated 3.9 years ago by Ram 44k • written 10.3 years ago by smilefreak ▴ 420

1

Entering edit mode

Elegant and simple, but you necro'd a 2.6-year-old old topic :)

Edit: might as well contribute anyway. I'd recommend wrapping using 'with' to make it even more idiomatic:

fastq_1 = open('fastq_r1.fastq')

becomes

with open('fastq_r1.fastq', 'a') as fastq_1:

etc., in case you forget to close your file handles.

ADD REPLY • link updated 3.9 years ago by Ram 44k • written 10.3 years ago by Brice Sarver ★ 3.8k

Ram · Accepted Answer · 2012-04-04

14

Entering edit mode

12.9 years ago

David Langenberger 11k

copy this code to a file, e.g. splitPairedEndReads.pl

use strict;
use warnings;
my $file = $ARGV[0];
open(FILE, "<$file") || die "cannot open $file\n";
open(OUT1, ">$file\_1") || die "cannot open $file\_1\n";
open(OUT2, ">$file\_2") || die "cannot open $file\_2\n";
while(<FILE>){
    chomp;
    print OUT1 "$_\/1\n";
    print OUT2 "$_\/2\n";
    my $newline = <FILE>; chomp($newline);
    print OUT1 substr($newline, 0, length($newline)/2)."\n";
    print OUT2 substr($newline, length($newline)/2, length($newline)/2)."\n";
    $newline = <FILE>; chomp($newline);
    print OUT1 "$newline\/1\n";
    print OUT2 "$newline\/2\n";
    $newline = <FILE>; chomp($newline);
    print OUT1 substr($newline, 0, length($newline)/2)."\n";
    print OUT2 substr($newline, length($newline)/2, length($newline)/2)."\n";
}
close(FILE);

Now call it with ./splitPairedEndReads.pl input.fastq.

I hope it works.... haven't checked it.

ADD COMMENT • link updated 3.9 years ago by Ram 44k • written 12.9 years ago by David Langenberger 11k

0

Entering edit mode

@david : nice simple script: its splits to half: what if datasets where both ends have the not same length? thats what Geparada might be looking for?

ADD REPLY • link updated 3.9 years ago by Ram 44k • written 12.9 years ago by Rm 8.3k

0

Entering edit mode

Well, if they don't have the same length, then there has to be an information about the position to split. In his example, I cannot find anything like that. Furthermore, this format (split in the middle) is well known.

ADD REPLY • link updated 3.9 years ago by Ram 44k • written 12.9 years ago by David Langenberger 11k

0

Entering edit mode

But it's simple to change the script and split on different positions.

ADD REPLY • link updated 3.9 years ago by Ram 44k • written 12.9 years ago by David Langenberger 11k

0

Entering edit mode

I agree with that!!

ADD REPLY • link updated 3.9 years ago by Ram 44k • written 12.9 years ago by Rm 8.3k

0

Entering edit mode

@David: Thanks a lot for the Script. It is exactly wha was looking for. However, unfortunately it does not really do what it is supposed to. Here is the head of the original file:

@HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG
GCTAGGGATGACTCCAGAGACTGATAATATTTCTGAACAAAATGATATCAATTCCTTATTACTACCAAACTCAGAAAATAAAGATTTACCTCGACCAT
+HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG
DFFFFHHHHHJJJJJJJJJJJJJIJJJIJJJJJJJIJJJJJJJJJJJJIIJJJIIJJJJJJJJJJJJJJJJJHIJHHHHHFFFFFFFEEECEBDDDDD
@HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG
GATTCCATTCATTCATTGTAGTTTCACTTGAATTTAGCTCAGAATTCTGTGTATAAGCAGGTGAAGGCATATCACTTTGGTACCAGGTGGAAGAACTC
+HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG
FDFFFHGHHHJJJJJJJJJIJJJJJJJJIJHIIJIIJJJJJJJIIJJGHIHHJIJJJJJJJBFGHIJCGGIJIJJJJJFIHIJHFHHCDFEDEEEEEE
@HWI-ST980:107:D098EACXX:2:1101:1423:1987 1:N:0:ACAGTG
CGTTGGAGTGTTTGCGTTGCGAAGCTGCTGCAACCGTTGGCAGCTGATTTGAACTGTTCAGTTGTTGGCAGAAGGTAGGAGGCATTCCCGGGCTGG

and here after splitting:

File1:

@HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG/1
GCTAGGGATGACTCCAGAGACTGATAATATTTCTGAACAAAATGATATC
+HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG/1
DFFFFHHHHHJJJJJJJJJJJJJIJJJIJJJJJJJIJJJJJJJJJJJJI
@HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG/1
GATTCCATTCATTCATTGTAGTTTCACTTGAATTTAGCTCAGAATTCTG
+HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG/1
FDFFFHGHHHJJJJJJJJJIJJJJJJJJIJHIIJIIJJJJJJJIIJJGH
@HWI-ST980:107:D098EACXX:2:1101:1423:1987 1:N:0:ACAGTG/1
CGTTGGAGTGTTTGCGTTGCGAAGCTGCTGCAACCGTTGGCAGCTGAT

File2:

@HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG/2
AATTCCTTATTACTACCAAACTCAGAAAATAAAGATTTACCTCGACCAT
+HWI-ST980:107:D098EACXX:2:1101:1471:1986 1:N:0:ACAGTG/2
IJJJIIJJJJJJJJJJJJJJJJJHIJHHHHHFFFFFFFEEECEBDDDDD
@HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG/2
TGTATAAGCAGGTGAAGGCATATCACTTTGGTACCAGGTGGAAGAACTC
+HWI-ST980:107:D098EACXX:2:1101:1471:1986 2:N:0:ACAGTG/2
IHHJIJJJJJJJBFGHIJCGGIJIJJJJJFIHIJHFHHCDFEDEEEEEE
@HWI-ST980:107:D098EACXX:2:1101:1423:1987 1:N:0:ACAGTG/2
TTGAACTGTTCAGTTGTTGGCAGAAGGTAGGAGGCATTCCCGGGCTGG

I guess that the script did not realize the paired information that is given with the number (1 or 2) after the first space (CASAVA 1.8). I know that this is exactly the problem with other scipts that do not work anymore after the Illumina CASAVA 1.8 Update. What would I have to change to split the above shown format? Thanks a lot for any help on this issue!

ADD REPLY • link updated 3.9 years ago by Ram 44k • written 12.9 years ago by Nico Posnien • 0

0

Entering edit mode

At a first glance, it seems there is a problem with your format. Directly after the quality values in the third line, there seems to be no newline. Is that a problem of the editor here, or is it in you file?

There have to be 4 lines for each entry!

ADD REPLY • link updated 3.9 years ago by Ram 44k • written 12.9 years ago by David Langenberger 11k

0

Entering edit mode

I tried the head of your file (after manually inserting the newlines) and the output seems to be correct.

ADD REPLY • link updated 3.9 years ago by Ram 44k • written 12.9 years ago by David Langenberger 11k