How to filter a FASTA file?
2.2 years ago
bioinfo223 ▴ 10

Hello, I need to filter this FASTA file based on the len value given in each header, keeping only records whose len is less than or equal to 100. I am new to bioinformatics; please let me know a one-line command for this.

>CM0 len:16 (+),score=7.52 CM040936.1:11567243-11571 523:2589-2636(+)
ATGGCATTAGTTCTGGCAGGTCACGTGAGTCAAGCTCGCATCAGCTGA

Thank you.

command linux • 1.2k views
2.2 years ago

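You can also use SeqKit to filter FASTA sequences by length. The -M option keeps only sequences whose length is less than or equal to the given maximum, so pass 100 here. Note that this filters on the actual sequence length, not on the len value in the header (in the example record the header says len:16 while the sequence is 48 bases long).

seqkit seq -M 100 -o output.fa input.fa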
2.2 years ago
iraun 6.2k

Hi! I believe this command would do what you are looking for.

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' test.fa | awk -F"\t" '{split($1,h,"len:");split(h[2],l," "); if (l[1]<=100){print}}' | awk -F'\t' '{print $1"\n"$2}'

This "one-liner" does three things:

1) Convert FASTA into tab format (header, a tab, then the sequence joined onto one line). The NR>1 condition already skips the empty record that precedes the first ">", so no extra step (such as tail -n+1, which prints every line unchanged) is needed to drop a blank first line:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' test.fa

2) Filter based on "len:" field in the header with:

awk -F"\t" '{split($1,h,"len:");split(h[2],l," "); if (l[1]<=100){print}}'

3) Go back from tab to fasta format with:

awk -F'\t' '{print $1"\n"$2}'
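
The same header-based filter can also be sketched as a single awk pass that skips the tab round-trip. This is only a minimal sketch, not the method above: the function name filter_by_header_len is hypothetical, it assumes each header carries one len:<integer> field, and it drops any record whose header has no such field.

```shell
# filter_by_header_len FILE [MAX]
# Hypothetical helper: keep FASTA records whose header "len:" value is <= MAX (default 100).
filter_by_header_len() {
    awk -v max="${2:-100}" '
        /^>/ {
            # Header line: decide whether this record is kept.
            keep = 0
            if (match($0, /len:[0-9]+/)) {
                # Skip the 4 characters of "len:" and read the digits as a number.
                len = substr($0, RSTART + 4, RLENGTH - 4) + 0
                keep = (len <= max)
            }
        }
        # Print the header and every following sequence line while keep is set.
        keep
    ' "$1"
}
```

For example, filter_by_header_len test.fa 100 > filtered.fa would write the kept records, with their original line wrapping, to filtered.fa.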
2.2 years ago
Asaf 10k

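You can use cutadapt to do that. Just use the -M (--maximum-length) option, which discards sequences longer than the given length; the lowercase -m option sets a minimum length instead. Something like:

cutadapt -M 100 -o output.fa input.fa

should work.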

2.2 years ago
GenoMax 147k

Using reformat.sh from the BBMap suite:

reformat.sh -Xmx2g in=input.fa out=filtered.fa maxlength=100
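
To sanity-check the output of a sequence-length filter like this one, the longest sequence in the result can be reported with plain awk. This is a rough sketch; the helper name max_seq_len is hypothetical, and it measures the actual sequence length (summed over wrapped lines), not the len value in the header.

```shell
# max_seq_len FILE
# Hypothetical helper: print the length of the longest sequence in a FASTA file.
max_seq_len() {
    awk '
        # New record: close out the previous sequence and reset the counter.
        /^>/ { if (n > max) max = n; n = 0; next }
        # Sequence line: ignore stray whitespace, then add its length.
        { gsub(/[ \t\r]/, ""); n += length($0) }
        # Close out the final record and print the result (0 for an empty file).
        END { if (n > max) max = n; print max + 0 }
    ' "$1"
}
```

Running max_seq_len filtered.fa should then print a number no greater than 100.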