Question

subset fastq file

0

Entering edit mode

10.1 years ago

cara78 ▴ 10

Hello,

I have a question on perl coding to create a subset of a fastq file. I most take every 40th seqence from a larger file. The oringinal file is so big I cant use arrays and such.

I have this code

use strict;
use warnings;

open (my $fh,"fast.fastq") or die "Failed to open file: $!\n";
open (my $out_fh, "> out10.fastq");
while(<$fh>) {
      chomp;
#$. holds the number of the last read line
#every 40 seq
 next if $. <= 40;

   my $header_line = $_;
   my $seq = <$fh>;
   my $plus = <$fh>;
   my $qualities = <$fh>;
   print $out_fh "$header_line$seq$plus$qualities";

}
close $fh;
close $out_fh;

it skips the first 40 lines which isn't much good to me. And then prints out the rest of the file which also isn't much good to me.

I am looking for a way to loop it so I get every 40th seq not line and all the ways through the file to get the subset.

Thanks

fastq subset • 2.8k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by cara78 ▴ 10

0

Entering edit mode

10.1 years ago

Pierre Lindenbaum 164k

a bash solution.

split -l 160 fast.fastq mysplit

edited from your comment: linearize, use awk+line+modulo, convert to fastq

cat fast.fastq | paste - - - - | awk '(NR%4==0)' | tr "\t" "\n" > output.fastq

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks but not exactly what I am looking for.

I have a fastq file. I want to take out every other 40th sequence and put them in a new file. A small example would be, if I had a fastq file of 12 sequences and want every 4th.

sequence 1
sequence 2
sequence 3
sequence 4
sequence 5
sequence 6
sequence 7
sequence 8
sequence 9
sequence 10
sequence 11
sequence 12

the out file would hold

sequence 4
sequence 8
sequence 12

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by cara78 ▴ 10

0

Entering edit mode

10.1 years ago

smilefreak ▴ 420

Python solution

using list comprehensions.

160 = 40 * 4 and 156,157,158,159 are the indexes of the 40th line in the fastq file.

#!/usr/bin/env python

with open('reads.fastq') as f:
   with open('output.fastq) as out:
      [out.write(line) if (i % 160 in [156,157,158,159]) for i, line in enumerate(f)]

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by smilefreak ▴ 420

Ram · Accepted Answer · 2014-11-02

2

Entering edit mode

10.1 years ago

mikhail.shugay 3.5k

Sorry it was long time ago when I've used Perl :)) Here is a solution in Java/Groovy

new File("out10.fastq").withPrintWriter { pw ->
   new File("fast.fastq").withReader { reader ->
      def header
      def index = 0
      while ((header = reader.readLine()) != null) {
         if (index++ % 40 == 0) {
            pw.println(header)
            pw.println(reader.readLine())
            pw.println(reader.readLine())
            pw.println(reader.readLine())
         } else {
            reader.readLine()
            reader.readLine()
            reader.readLine()
         }
      }
   }
}

Perl solution: change to

next if $. % 160 != 1

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by mikhail.shugay 3.5k

0

Entering edit mode

Thanks but didn't work, it just print out 40th sequence no others

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by cara78 ▴ 10

0

Entering edit mode

Thanks but didn't work. it only took lines 3 to 6 then line 32 to33

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by cara78 ▴ 10

0

Entering edit mode

Changed solution to one from a language I'm more familiar with :)

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by mikhail.shugay 3.5k

0

Entering edit mode

I changed 3 to 1 and it worked. Thanks

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by cara78 ▴ 10

0

Entering edit mode

Ok, great :) It also seems that you need to change

"$header_line$seq$plus$qualities"

to

"$header_line\n$seq$plus$qualities"

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by mikhail.shugay 3.5k