As the question says,
I have a fastq file from small RNA sequencing with sequence lengths between 15 and 30. I want to keep only the sequences with lengths between 21 and 25 and write them to another file. How can I do that?
cat your.fastq | paste - - - - | awk 'length($2) >= 21 && length($2) <= 25' | sed 's/\t/\n/g' > filtered.fastq
I would like to introduce you to a powerful tool: seqkit (https://bioinf.shenwei.me/seqkit/usage/), with which you can easily manipulate FASTQ/FASTA sequences.
seqkit seq youseq.fastq -m 21 -M 25 > result.fq
Using Biopieces (www.biopieces.org):
read_fastq -i in.fq | grab -e 'SEQ_LEN>=21' | grab -e 'SEQ_LEN<=25' | write_fastq -o out.fq -x
And when you realize that you want to do a lot of extra things besides filtering on sequence length you will find lots of useful tools in Biopieces.
Edit: If you're coming via Google, my answer is very, very old. Consider komjinhubuio's answer; seqkit is the 'modern' way to go.
Using the awesome readfq-library in perl, and their modified example:
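my @aux = undef; # this is for keeping intermediate data
# readfq() must be defined above this loop (copied from the readfq Perl example)
while (my ($name, $seq, $qual) = readfq(\*STDIN, \@aux)) {
    if (length($seq) >= 21 && length($seq) <= 25) {
        print "\@$name\n"; # escape the @ so Perl does not try to dereference $name as an array
        print "$seq\n";
        print "+\n";
        print "$qual\n";
    }
}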
(Beware: Haven't tested this yet)
You can easily do this with prinseq-lite:
FILTER OPTIONS
-min_len <integer>
Filter sequence shorter than min_len.
-max_len <integer>
Filter sequence longer than max_len.
prinseq-lite.pl -fastq yourfile.fastq -out_format 4 -out_good seqs_good -min_len 21 -max_len 25
If I may, a pure Awk command is twice as fast:
Like your solution with paste, it assumes that a fastq record takes exactly 4 lines.
Edit: deal with spaces in sequence names, as suggested by brianpenghe.
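The four-lines-per-record idea can be sketched in awk alone. This is a minimal sketch and not necessarily the command the comment above refers to; the function name `filter_fastq_by_length` is a hypothetical helper, and it assumes every record is exactly 4 lines with no wrapped sequences. Because the header line is stored whole, spaces in read names are harmless.

```shell
# Usage: filter_fastq_by_length MIN MAX < in.fastq > out.fastq
# Hypothetical helper; assumes strict 4-line FASTQ records.
filter_fastq_by_length() {
    awk -v min="$1" -v max="$2" '
        NR % 4 == 1 { header = $0 }   # @name line, kept verbatim
        NR % 4 == 2 { seq = $0 }      # sequence line
        NR % 4 == 3 { plus = $0 }     # "+" separator line
        NR % 4 == 0 {                 # quality line: the record is complete
            if (length(seq) >= min && length(seq) <= max)
                print header "\n" seq "\n" plus "\n" $0
        }
    '
}
```

For the question as asked, that would be `filter_fastq_by_length 21 25 < your.fastq > filtered.fastq`.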
How would this be modified for gzipped fastq files?
zcat decompresses the data of all the input files and writes the result to standard output. zcat concatenates the data in the same way cat does.
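For gzipped input, one way to wire this together is to decompress on the fly, filter, and recompress. The sketch below assumes 4-line records and no tab characters inside read names; `filter_fastq_gz` is a hypothetical helper name. It uses `gzip -dc`, which behaves like zcat but also works on systems where zcat only accepts `.Z` files.

```shell
# Usage: filter_fastq_gz MIN MAX in.fastq.gz out.fastq.gz
# Hypothetical helper; assumes strict 4-line FASTQ records.
filter_fastq_gz() {
    gzip -dc "$3" |
        paste - - - - |                # join each 4-line record into one tab-separated line
        awk -F'\t' -v min="$1" -v max="$2" 'length($2) >= min && length($2) <= max' |
        tr '\t' '\n' |                 # split records back into 4 lines
        gzip > "$4"
}
```

For example: `filter_fastq_gz 21 25 your.fastq.gz filtered.fastq.gz`.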
Like, a normal for loop?
One note: This command doesn't work when the read names contain spaces. Better to use awk -F"\t" instead of plain awk.
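To see why the field separator matters, here is a small demonstration on a made-up record whose header contains a space (the read name and the helper `make_record` are hypothetical). With default splitting, awk treats the second word of the header as `$2`; with tab as the only separator, `$2` is the sequence.

```shell
# Hypothetical one-record FASTQ with an Illumina-style header containing a space.
make_record() {
    printf '@read1 1:N:0\nACGTACGTACGTACGTACGTAC\n+\nIIIIIIIIIIIIIIIIIIIIII\n'
}

# Default whitespace splitting: $2 is "1:N:0", so the length test looks at the wrong field.
make_record | paste - - - - | awk '{ print length($2) }'

# Tab as the only separator: $2 is the 22-base sequence.
make_record | paste - - - - | awk -F'\t' '{ print length($2) }'
```

The first command prints 5 and the second prints 22, so a 21-25 filter without `-F'\t'` would wrongly drop this read.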
This is just printing zero lines! Note I changed constraints to either >=16 only, or >=16 && <= 500.