Question

how to filter fasta file?

0

2.2 years ago

bioinfo223 ▴ 10

Hello, I need to filter this FASTA file based on the len value given in each header, keeping only records with len less than or equal to 100. I am new to bioinformatics, so please let me know a one-line command for this.

>CM0 len:16 (+),score=7.52 CM040936.1:11567243-11571 523:2589-2636(+)
 ATGGCATTAGTTCTGGCAGGTCACGTGAGTCAAGCTCGCATCAGCTGA

Thank you.

command linux • 1.2k views

ADD COMMENT • link updated 2.2 years ago by GenoMax 147k • written 2.2 years ago by bioinfo223 ▴ 10

0

duplicate of How To Filter Multi Fasta By Length?? ; FASTA file of fixed length ; Fasta Length ;

ADD REPLY • link 2.2 years ago by Pierre Lindenbaum 164k

2

2.2 years ago

Asaf 10k

You can use cutadapt to do that. Just use the -M (--maximum-length) option; -m sets a minimum length instead and would keep only the longer sequences. Something like:

cutadapt -M 100 -o output.fa input.fa

should work.

ADD COMMENT • link 2.2 years ago by Asaf 10k

0

2.2 years ago

GenoMax 147k

Using reformat.sh from the BBMap suite:

reformat.sh -Xmx2g in=input.fa out=filtered.fa maxlength=100

ADD COMMENT • link 2.2 years ago by GenoMax 147k

score 2 · Accepted Answer · 2022-09-18

2

2.2 years ago

Andrzej Zielezinski 11k

You can also use SeqKit to filter FASTA sequences by length. The -M (--max-len) option prints only sequences whose length is less than or equal to the given maximum, so -M 100 keeps sequences of 100 bases or fewer.

seqkit seq -M 100 -o output.fa input.fa
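Note that a length filter like this acts on the actual sequence length, not on the len: value written in the header. As a quick sanity check on the result, a minimal awk sketch such as the one below can report the longest sequence in a FASTA stream. The function name max_seq_len is a hypothetical helper, not part of any tool mentioned here.

```shell
# max_seq_len: hypothetical helper, prints the length of the longest
# sequence read from standard input (multi-line records are summed).
max_seq_len() {
    awk '
        # On each header, close the previous record and reset the counter.
        /^>/ { if (len > max) max = len; len = 0; next }
        # On sequence lines, drop whitespace and add to the running length.
        { gsub(/[ \t\r]/, ""); len += length($0) }
        # Close the final record and print the maximum as a number.
        END { if (len > max) max = len; print max + 0 }
    '
}
```

For example, `max_seq_len < output.fa` should print a value no greater than 100 if the filter did what was intended.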

ADD COMMENT • link 2.2 years ago by Andrzej Zielezinski 11k

score 2 · Accepted Answer · 2022-09-18

Hi! I believe this command would do what you are looking for.

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}'  test.fa | tail -n+1 | awk -F"\t" '{split($1,h,"len:");split(h[2],l," "); if (l[1]<=100){print}}'  | awk -F'\t' '{print $1"\n"$2}'

This "one-liner" works in several steps:

1) Convert the FASTA into tab-separated format (one record per line: header, tab, sequence) with:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}'  test.fa

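1.5) Optional pass-through: tail -n+1 prints everything from the first line onward, so it changes nothing here. The NR>1 condition in step 1 already skips the empty record before the first ">", so no blank line is produced: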

tail -n+1

2) Filter based on "len:" field in the header with:

awk -F"\t" '{split($1,h,"len:");split(h[2],l," "); if (l[1]<=100){print}}'

3) Go back from tab to fasta format with:

awk -F'\t' '{print $1"\n"$2}'
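The same header-based filter can also be sketched as a single awk pass that skips the tab round-trip and leaves sequence line wrapping untouched. This is only a sketch under stated assumptions: each header carries a field of the form len:N with N an integer, records without such a field are dropped, and filter_by_header_len is a hypothetical helper name.

```shell
# filter_by_header_len: hypothetical helper.
# Usage: filter_by_header_len MAXLEN < input.fa > output.fa
filter_by_header_len() {
    awk -v max="$1" '
        /^>/ {
            # Header line: keep the record only if it has a len:N field with N <= max.
            keep = 0
            if (match($0, /len:[0-9]+/)) {
                # Skip the 4 characters of "len:" and convert the digits to a number.
                n = substr($0, RSTART + 4, RLENGTH - 4) + 0
                keep = (n <= max)
            }
        }
        # Print the header and all following sequence lines of kept records.
        keep
    '
}
```

For the case in the question this would be called as `filter_by_header_len 100 < test.fa > filtered.fa`. Unlike the length options of the tools above, it trusts the len: value in the header rather than counting bases.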