Remove sequences <300 bases from FASTA file
6.3 years ago
zoppisemma ▴ 70

I have a multi-FASTA file containing contigs derived from metagenomic data. I need to remove all contigs shorter than 300 bp. How do I proceed?

genome next-gen sequencing assembly • 19k views

Hi zoppisemma

Several users have provided solutions. You should upvote or accept the answers that helped; this will help others looking for such solutions.


See this post and tweak for 300.


This should be just a comment and not an answer, as you're only pointing to an existing post/answer. I've moved it to one.


Thanks for the correction Ram.


Other people gave you excellent solutions. Nevertheless, you may also be interested in SEDA (http://www.sing-group.org/seda/), an open-source tool for processing FASTA files. Among other functions, it provides an operation for applying different filters, including one on sequence length (https://www.sing-group.org/seda/manual/operations.html#filtering). Regards.

6.3 years ago

Using seqkit:

seqkit seq -m 300 your_fasta.fa

download here

6.3 years ago
GenoMax 147k

Using reformat.sh from the BBMap suite:

reformat.sh in=your.fa out=filtered.fa minlength=300
6.3 years ago
harish ▴ 470

Hi!

You can use seqtk for the same. The command should be:

seqtk seq -L 300 contigs.fasta > file.fasta
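Whichever tool you use, it can be worth checking the result. Below is a minimal sketch of such a check using only awk; the function name fasta_min_len is made up for illustration and is not part of seqtk or any other tool mentioned here. It assumes every header line starts with ">" and prints the number of records followed by the length of the shortest one.

```shell
# Hypothetical helper (not part of any tool in this thread):
# print "<record count> <shortest sequence length>" for a FASTA file.
# Reads the file given as $1, or standard input if no file is given.
fasta_min_len() {
    awk '
        function finish_record() {
            # Nothing to do before the first header has been seen.
            if (!in_rec) return
            n++
            if (n == 1 || len < shortest) shortest = len
        }
        # A header line closes the previous record and starts a new one.
        /^>/   { finish_record(); in_rec = 1; len = 0; next }
        # Sequence lines (possibly several per record) add to the length.
        in_rec { len += length($0) }
        END    { finish_record(); print n + 0, shortest + 0 }
    ' "${1:--}"
}
```

Running fasta_min_len file.fasta on the filtered output should then report a second number of at least 300.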

@harish

Just FYI (for larger datasets), see this (seqkit benchmark)

https://bioinf.shenwei.me/seqkit/#benchmark


Ahh. That's nice. Glad to learn something new today!


I am using it for all sorts of fasta/q manipulation and found it really fast and effective.

6.3 years ago

An awk solution that should work for multi-line FASTA files. The length counter is reset for every record, and contigs of exactly 300 bp are kept:

awk -v RS=">" -v FS="\n" 'NR>1 {l=0; for(i=2;i<=NF;i++) l+=length($i); if(l>=300) printf ">%s", $0}' test.fasta
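If you would rather not redefine RS (for example because a header might itself contain a ">" character), here is a minimal sketch of the same filter as a reusable shell function. The name filter_fasta_min_len and its arguments are hypothetical, chosen only for this example; it assumes header lines start with ">" in the first column.

```shell
# Hypothetical helper (illustration only): keep FASTA records whose total
# sequence length is at least MIN bases. Handles multi-line records.
# Usage: filter_fasta_min_len MIN [FILE]   (reads stdin if FILE is omitted)
filter_fasta_min_len() {
    awk -v min="$1" '
        function emit() {
            # Print the buffered record only if it is long enough.
            if (header != "" && len >= min) printf "%s\n%s", header, seq
        }
        # A header line flushes the previous record and starts a new buffer.
        /^>/ { emit(); header = $0; seq = ""; len = 0; next }
        # Sequence lines are buffered verbatim and counted.
        { seq = seq $0 "\n"; len += length($0) }
        END { emit() }
    ' "${2:--}"
}
```

For the question asked here, the call would be filter_fasta_min_len 300 test.fasta > filtered.fasta, where filtered.fasta is just a placeholder output name.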