Question

Filtering Fastq Sequences Based On Lengths

8

11.8 years ago

empyrean999 ▴ 180

As the question says,

I have a FASTQ file from small RNA sequencing with sequence lengths between 15 and 30. I want to keep only the sequences with lengths between 21 and 25 and write them to another file. How can I do that?

awk perl unix • 32k views

ADD COMMENT • link updated 5.2 years ago by komjinhubuio ▴ 50 • written 11.8 years ago by empyrean999 ▴ 180

score 19 · Answer 1 · 2013-03-21

11.8 years ago

Wen.Huang ★ 1.2k

cat your.fastq | paste - - - - | awk -F'\t' 'length($2) >= 21 && length($2) <= 25' | tr '\t' '\n' > filtered.fastq

ADD COMMENT • link 11.8 years ago by Wen.Huang ★ 1.2k

4

If I may, a pure Awk command is about twice as fast:

awk 'BEGIN {FS = "\t" ; OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 21 && length(seq) <= 25) {print header, seq, qheader, qseq}}' < your.fastq > filtered.fastq

Like your solution with paste, it assumes that a fastq record takes exactly 4 lines.

Edit: deal with spaces in sequence names, as suggested by brianpenghe.

ADD REPLY • link 6.1 years ago by Frédéric Mahé ★ 3.2k
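The 4-line-per-record assumption mentioned above can be checked before filtering. What follows is only a minimal sketch: the function name `check_fastq_4line` is hypothetical, and the check is a heuristic (header starts with `@`, separator starts with `+`, quality string as long as the sequence, total line count divisible by 4), not a full FASTQ validator.

```shell
# Hypothetical helper: returns 0 if FILE looks like strict 4-line FASTQ, 1 otherwise.
check_fastq_4line() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@"   { bad = 1 }  # header line must start with @
        NR % 4 == 2                              { seq = $0 } # remember the sequence line
        NR % 4 == 3 && substr($0, 1, 1) != "+"   { bad = 1 }  # separator line must start with +
        NR % 4 == 0 && length($0) != length(seq) { bad = 1 }  # quality must match sequence length
        END { if (bad || NR % 4 != 0) exit 1 }                # line count must be a multiple of 4
    ' "$1"
}

# Usage (your.fastq is a placeholder):
#   check_fastq_4line your.fastq && echo "looks like 4-line FASTQ"
```

If the check fails, the file may contain wrapped (multi-line) sequences, and both the paste and the getline approaches would silently misread records.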

0

How would this be modified for gzipped FASTQ files?

ADD REPLY • link 8.6 years ago by shanasabri ▴ 40

0

zcat decompresses the data of all the input files and writes the result to standard output. zcat concatenates the data in the same way cat does:

zcat your.fastq.gz | ...

ADD REPLY • link 7.6 years ago by Medhat 9.8k
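Putting the zcat idea together with the pure Awk filter gives a complete gzip-in, gzip-out pipeline. This is a sketch under assumptions: the function name `filter_fastq_gz` and its optional MIN/MAX arguments are hypothetical, and `gzip -dc` is used in place of `zcat` because it does the same job and, on some systems, `zcat` expects a `.Z` suffix.

```shell
# Hypothetical helper: filter_fastq_gz IN.fastq.gz OUT.fastq.gz [MIN] [MAX]
# Keeps records whose sequence length is within [MIN, MAX] (defaults 21 and 25).
# Assumes strict 4-line FASTQ records.
filter_fastq_gz() {
    gzip -dc "$1" |
        awk -v min="${3:-21}" -v max="${4:-25}" '
            {
                header = $0                                # line 1: @name
                getline seq; getline qheader; getline qual # lines 2-4 of the record
                if (length(seq) >= min && length(seq) <= max)
                    print header "\n" seq "\n" qheader "\n" qual
            }' |
        gzip > "$2"
}

# Usage (file names are placeholders):
#   filter_fastq_gz your.fastq.gz filtered.fastq.gz 21 25
```

Because the header is taken from `$0` rather than from a field, spaces in read names do not affect the result.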

0

....................removed....................

ADD REPLY • link 6.1 years ago by shanasabri ▴ 40

0

Like, a normal for loop?

for fastq in *.fastq
do
    awk ... < "$fastq" > "filtered_$fastq"
done

ADD REPLY • link 7.6 years ago by WouterDeCoster 47k

0

One note: this command doesn't work when the read names contain spaces. It is better to use awk -F"\t" instead of plain awk.

ADD REPLY • link 6.1 years ago by brianpenghe ▴ 80

0

This is just printing zero lines! Note I changed constraints to either >=16 only, or >=16 && <= 500.

ADD REPLY • link 6.0 years ago by caverill ▴ 40
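When a length filter prints nothing, one way to narrow the problem down is to look at which sequence lengths the file actually contains. The sketch below is only a diagnostic aid; the function name `fastq_length_histogram` is hypothetical, and it assumes strict 4-line FASTQ records.

```shell
# Hypothetical helper: print "length<TAB>count" for every sequence length in FILE,
# sorted by length. Only the second line of each 4-line record is inspected.
fastq_length_histogram() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) print len "\t" n[len] }' "$1" | sort -n
}

# Usage (your.fastq is a placeholder):
#   fastq_length_histogram your.fastq
```

If the histogram shows reads in the expected range but the filter still returns nothing, the filter is probably measuring the wrong field, for example part of a read name that contains spaces.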

score 5 · Answer 2 · 2019-10-12

5.2 years ago

komjinhubuio ▴ 50

I would like to introduce you to a powerful tool, seqkit (https://bioinf.shenwei.me/seqkit/usage/), which makes it easy to manipulate sequences in FASTQ/FASTA format:

seqkit seq yourseq.fastq -m 21 -M 25 > result.fq

ADD COMMENT • link 5.2 years ago by komjinhubuio ▴ 50

score 3 · Answer 3 · 2013-03-21

11.8 years ago

Martin A Hansen 3.0k

Using Biopieces (www.biopieces.org):

read_fastq -i in.fq | grab -e 'SEQ_LEN>=21' | grab -e 'SEQ_LEN<=25' | write_fastq -o out.fq -x

And when you realize that you want to do a lot of extra things besides filtering on sequence length you will find lots of useful tools in Biopieces.

ADD COMMENT • link 11.8 years ago by Martin A Hansen 3.0k

0

Is grab the new grep? ;-)

ADD REPLY • link 11.8 years ago by Christof Winter ★ 1.1k

0

Not quite: https://code.google.com/p/biopieces/wiki/grab

ADD REPLY • link 11.8 years ago by Martin A Hansen 3.0k

score 2 · Answer 4 · 2013-03-20

11.8 years ago

Philipp Bayer 8.8k

Edit: If you're coming via Google, my answer is very, very old. Consider komjinhubuio's answer; seqkit is the 'modern' way to go.

Using the awesome readfq-library in perl, and their modified example:

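  # readfq() is not a Perl built-in: it comes from the readfq library,
  # so its sub definition must be pasted in (or loaded) before this loop.
  my @aux = undef; # this is for keeping intermediate data
  while (my ($name, $seq, $qual) = readfq(\*STDIN, \@aux)) {
     if( (length($seq) >= 21) && (length($seq) <= 25) ) {
         print "\@$name\n"; # escape the @, otherwise Perl interpolates @$name as an array
         print "$seq\n";
         print "+\n";
         print "$qual\n";
     }
  }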

(Beware: Haven't tested this yet)

ADD COMMENT • link 5.2 years ago by Philipp Bayer 8.8k

2

You are missing a closing bracket, a "n" in the last line (there's also an unnecessary comma), and use strict; use warnings; which will tell you such things :)

ADD REPLY • link 11.8 years ago by SES 8.6k

0

:) Thank you! I usually never use Perl.

ADD REPLY • link 11.8 years ago by Philipp Bayer 8.8k

score 2 · Answer 5 · 2016-08-24

You can easily do this with prinseq-lite:

FILTER OPTIONS

-min_len <integer>
        Filter sequence shorter than min_len.

-max_len <integer>
        Filter sequence longer than max_len.

prinseq-lite.pl -fastq yourfile.fastq -out_format 4 -out_good seqs_good -min_len 21 -max_len 25

http://prinseq.sourceforge.net/manual.html