Filtering Fastq Sequences Based On Lengths
5
8
Entering edit mode
11.7 years ago
empyrean999 ▴ 180

As the question says,

I have a fastq file from small RNA sequencing with sequence lengths between 15 - 30. I wanted to filter sequence lengths between 21-25 and write to another file. how can i do that?

awk perl unix • 32k views
ADD COMMENT
19
Entering edit mode
11.7 years ago
Wen.Huang ★ 1.2k
cat your.fastq | paste - - - - | awk 'length($2)  >= 21 && length($2) <= 25' | sed 's/\t/\n/g' > filtered.fastq
ADD COMMENT
4
Entering edit mode

If I may, a pure Awk command is twice faster:

awk 'BEGIN {FS = "\t" ; OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 21 && length(seq) <= 25) {print header, seq, qheader, qseq}}' < your.fastq > filtered.fastq

Like your solution with paste, it assumes that a fastq record takes exactly 4 lines.

Edit: deal with spaces in sequence names, as suggested by brianpenghe.

ADD REPLY
0
Entering edit mode

How would this be modified for gzip's fastqs?

ADD REPLY
0
Entering edit mode

zcat decompresses the data of all the input files, and writes the result on the standard output. zcat concatenates the data in the same way cat do

zcat your.fastq.gz | ...
ADD REPLY
0
Entering edit mode

....................removed....................

ADD REPLY
0
Entering edit mode

Like, a normal for loop?

for fastq in *.fastq
do
awk ... < $fastq > filtered_$fastq
done
ADD REPLY
0
Entering edit mode

One note: This command doesn't work when the read names contain spaces. better use awk -F"\t" instead of awk

ADD REPLY
0
Entering edit mode

This is just printing zero lines! Note I changed constraints to either >=16 only, or >=16 && <= 500.

ADD REPLY
5
Entering edit mode
5.1 years ago
komjinhubuio ▴ 50

I would like to introduce you to a powerful software: seqkit (https://bioinf.shenwei.me/seqkit/usage/), with which you can easily manipulate fastq/fasta format seq. seqkit seq youseq.fastq -m 21 -M 25 > result.fq

ADD COMMENT
3
Entering edit mode
11.7 years ago

Using Biopieces www.biopieces.org)

read_fastq -i in.fq | grab -e 'SEQ_LEN>=21' | grab -e 'SEQ_LEN<=25' | write_fastq -o out.fq -x

And when you realize that you want to do a lot of extra things besides filtering on sequence length you will find lots of useful tools in Biopieces.

ADD COMMENT
0
Entering edit mode

Is grab the new grep? ;-)

ADD REPLY
0
Entering edit mode
ADD REPLY
2
Entering edit mode
11.7 years ago

Edit: If you're coming via Google, my answer is very, very old. Consider komjinhubuio's answer, seqkit is the 'modern' way to go

Using the awesome readfq-library in perl, and their modified example:

  my @aux = undef; # this is for keeping intermediate data
  while (my ($name, $seq, $qual) = readfq(\*STDIN, \@aux)) { 
     if( (length($seq) >= 21) && (length($seq) <= 25) ) { 
         print "@$name\n";
         print "$seq\n"; 
         print "+\n";
         print "$qual\n";
     }
  }

(Beware: Haven't tested this yet)

ADD COMMENT
2
Entering edit mode

You are missing a closing bracket, a "n" in the last line (there's also an unnecessary comma), and use strict; use warnings; which will tell you such things :)

ADD REPLY
0
Entering edit mode

:) Thank you! I usually never use Perl.

ADD REPLY
2
Entering edit mode
8.2 years ago
etienne.rifa ▴ 20

You can easily do this with prinseq-lite:

FILTER OPTIONS

-min_len <integer>
        Filter sequence shorter than min_len.

-max_len <integer>
        Filter sequence longer than max_len.

prinseq-lite.pl -fastq yourfile.fastq -out_format 4 -out_good seqs_good -min_len 21 -trim_to_len 25

http://prinseq.sourceforge.net/manual.html

ADD COMMENT

Login before adding your answer.

Traffic: 1915 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6