How to filter sequencing reads in a fastq file by size?
2
0
Entering edit mode
5.6 years ago
Beginner ▴ 80

How to filter sequencing reads in a fastq file by size?

I want to filter a Ribo-Seq fastq file to only maintain sequences between 26 and 32 bp. The approaches I tried did all mess up the fastq line arrangement of the output fastq file. Also an old post that I found from 6 years ago did not help since the suggested options did not keep the fastq line arrangement.

Thank you for your help

Ribo-Seq RNA-Seq fastq • 7.3k views
ADD COMMENT
0
Entering edit mode

Also an old post that I found from 6 years ago did not help since the suggested options did not keep the fastq line arrangement.

which post ?

ADD REPLY
0
Entering edit mode

Filtering Fastq Sequences Based On Lengths

This one. Didn't try the biopieces option though due to issued I had with one requuired library.

ADD REPLY
2
Entering edit mode
5.6 years ago

I would use a combination of tools: seqtk, awk and cut. Each tool can be piped, so that the outcome of one tool is used by the second tool.

Code:

# Step 1: make a list of sequence names that are within the length range
seqtk comp my.fastq | awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }' | cut --fields 1 > selected-sequences-names.list

# Step 2: subset the fastq file for those sequences only
seqtk subseq my.fastq selected-sequences-names.list > subsetted.fq

Explanations:

seq comp my.fastq

will print a summary for each sequence found in the fastq file. The first column is the name of the sequence, the second column is the length of the sequence. See an example below:

seqtk comp my.fastq | head
ABCD-0123:987:GW1805231090:1:1101:13250:1696    27  52  14  21  63  0   0   0   4   0
ABCD-0123:987:GW1805231090:1:1101:20740:1766    27  46  19  21  64  0   0   0   10  0
ABCD-0123:987:GW1805231090:1:1101:16691:3004    56  44  14  22  10  0   0   0   3   0

awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }'

will print only the rows in which the second column (here coded $2) has values higher or equal to 26 ($2 >= 26) and lower or equal to 32 ($2 <= 32). See an example below:

seqtk comp my.fastq | awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }' | head  
ABCD-0123:987:GW1805231090:1:1101:13250:1696    27  52  14  21  63  0   0   0   4   0
ABCD-0123:987:GW1805231090:1:1101:20740:1766    27  46  19  21  64  0   0   0   10  0
ABCD-0123:987:GW1805231090:1:1101:15595:1784    27  51  27  34  38  0   0   0   12  0

cut --fields 1

will print the names of those sequences. They are found in the first column (here coded --fields 1). See an example below:

seqtk comp my.fastq | awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }' | cut --fields 1 | head  
ABCD-0123:987:GW1805231090:1:1101:13250:1696
ABCD-0123:987:GW1805231090:1:1101:20740:1766
ABCD-0123:987:GW1805231090:1:1101:15595:1784

To keep the names of the those sequences, the single greater-than (>) sign at the end of the command redirects the current output (here, the names of the sequences) into a new file (here, called selected-sequences-names.list)

seqtk comp my.fastq | awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }' | cut --fields 1 > selected-sequences-names.list
head selected-sequences-names.list
ABCD-0123:987:GW1805231090:1:1101:13250:1696
ABCD-0123:987:GW1805231090:1:1101:20740:1766
ABCD-0123:987:GW1805231090:1:1101:15595:1784

To actually subset the original fastq file for only those sequences, one command from seqtk is needed, and the new file (subsetted.fq) contains just what you wanted.

seqtk subseq my.fastq selected-sequences-names.list > subsetted.fq
ADD COMMENT
0
Entering edit mode

WOW! This was incredible helpful! Thank you for this great explanation!

ADD REPLY
0
Entering edit mode

I tried it and it worked. Thank you for your help!

ADD REPLY
1
Entering edit mode
5.6 years ago
h.mon 35k

reformat.sh from the BBTools package, see this post: filtering the reads based on the length

ADD COMMENT

Login before adding your answer.

Traffic: 2633 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6