subset fastq file
3
0
Entering edit mode
10.1 years ago
cara78 ▴ 10

Hello,

I have a question on perl coding to create a subset of a fastq file. I most take every 40th seqence from a larger file. The oringinal file is so big I cant use arrays and such.

I have this code

use strict;
use warnings;

open (my $fh,"fast.fastq") or die "Failed to open file: $!\n";
open (my $out_fh, "> out10.fastq");
while(<$fh>) {
      chomp;
#$. holds the number of the last read line
#every 40 seq
 next if $. <= 40;

   my $header_line = $_;
   my $seq = <$fh>;
   my $plus = <$fh>;
   my $qualities = <$fh>;
   print $out_fh "$header_line$seq$plus$qualities";

}
close $fh;
close $out_fh;

it skips the first 40 lines which isn't much good to me. And then prints out the rest of the file which also isn't much good to me.

I am looking for a way to loop it so I get every 40th seq not line and all the ways through the file to get the subset.

Thanks

fastq subset • 2.9k views
ADD COMMENT
2
Entering edit mode
10.1 years ago

Sorry it was long time ago when I've used Perl :)) Here is a solution in Java/Groovy

new File("out10.fastq").withPrintWriter { pw ->
   new File("fast.fastq").withReader { reader ->
      def header
      def index = 0
      while ((header = reader.readLine()) != null) {
         if (index++ % 40 == 0) {
            pw.println(header)
            pw.println(reader.readLine())
            pw.println(reader.readLine())
            pw.println(reader.readLine())
         } else {
            reader.readLine()
            reader.readLine()
            reader.readLine()
         }
      }
   }
}

Perl solution: change to

next if $. % 160 != 1
ADD COMMENT
0
Entering edit mode

Thanks but didn't work, it just print out 40th sequence no others

ADD REPLY
0
Entering edit mode

Thanks but didn't work. it only took lines 3 to 6 then line 32 to33

ADD REPLY
0
Entering edit mode

Changed solution to one from a language I'm more familiar with :)

ADD REPLY
0
Entering edit mode

I changed 3 to 1 and it worked. Thanks

ADD REPLY
0
Entering edit mode

Ok, great :) It also seems that you need to change

"$header_line$seq$plus$qualities"

to

"$header_line\n$seq$plus$qualities"
ADD REPLY
0
Entering edit mode
10.1 years ago

a bash solution.

split -l 160 fast.fastq mysplit

edited from your comment: linearize, use awk+line+modulo, convert to fastq

cat fast.fastq | paste - - - - | awk '(NR%4==0)' | tr "\t" "\n" > output.fastq
ADD COMMENT
0
Entering edit mode

Thanks but not exactly what I am looking for.

I have a fastq file. I want to take out every other 40th sequence and put them in a new file. A small example would be, if I had a fastq file of 12 sequences and want every 4th.

sequence 1
sequence 2
sequence 3
sequence 4
sequence 5
sequence 6
sequence 7
sequence 8
sequence 9
sequence 10
sequence 11
sequence 12

the out file would hold

sequence 4
sequence 8
sequence 12
ADD REPLY
0
Entering edit mode
10.1 years ago
smilefreak ▴ 420

Python solution

using list comprehensions.

160 = 40 * 4 and 156,157,158,159 are the indexes of the 40th line in the fastq file.

#!/usr/bin/env python

with open('reads.fastq') as f:
   with open('output.fastq) as out:
      [out.write(line) if (i % 160 in [156,157,158,159]) for i, line in enumerate(f)]
ADD COMMENT

Login before adding your answer.

Traffic: 2059 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6