Question

Filter sequences out of a FASTA file by pattern

2

10.2 years ago

Assa Yeroslaviz ★ 1.9k

Hi,

I would like to filter out some sequences from a fasta file by using a specific pattern.

For example I have this file:

>input1
UGAGGUAGUAGG
>input2
CUAUGCUUACC
>out1
UCCCUGAGACCGUGA
>out2
CUCCGGGUACC
>desc1
ACUUCCUUACAUGCCC

I already know how to extract all the FASTA sequences matching a specific pattern into a new file using awk.

What I would like to do instead is remove all entries matching a specific pattern from the original FASTA file and save the result to a new file. In the file above, for example, I would like to remove all sequences whose header contains the pattern "out" and save only the others to a new file.

Is there a tool for doing that, or is it possible in awk/sed or even grep?

Thanks

Assa

pattern fasta awk sed regexp • 9.8k views

ADD COMMENT • link updated 3.0 years ago by cpad0112 21k • written 10.2 years ago by Assa Yeroslaviz ★ 1.9k

0

3.0 years ago

hans ▴ 20

Using samtools

$ samtools faidx in.fasta
$ awk '/^>/ {print substr($1,2,400) }'  in.fasta | grep -v "out" list >selected_list
$ samtools faidx in.fasta -r selected_list > out.fasta

This will give you all the functionality of "grep" to select the desired sequences.

ADD COMMENT • link 3.0 years ago by hans ▴ 20

0

Why print substr($1,2,400)?

Why pipe into grep -v "out" when the same grep is also given a file, list, to read?

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 3.0 years ago by Pierre Lindenbaum 166k

0

print substr($1,2,400)

This keeps only the first field of the FASTA header and drops the leading ">". The file name "list" in the grep command is a mistake and should be deleted:

awk '/^>/ {print substr($1,2,400) }'  in.fasta | grep -v "out"  >selected_list
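To see what the awk step emits, here is a small sketch on a made-up header line (names.txt is a hypothetical output file). It also suggests that the 400-character cap is not needed: with the length argument omitted, substr($1, 2) returns everything after the ">" however long the name is.

```shell
# $1 is the header up to the first whitespace; substr($1, 2) drops the ">".
# The sequence line is skipped because it does not match /^>/.
printf '>out1 some description\nUCCCUGAG\n' \
    | awk '/^>/ {print substr($1, 2)}' > names.txt

cat names.txt
```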

ADD REPLY • link 3.0 years ago by hans ▴ 20

Ram · Accepted Answer · 2015-03-04

8

10.2 years ago

Pierre Lindenbaum 166k

I would like for example to remove all sequences with the header pattern out. and save only the other to a new file

awk '/^>/ {P=index($0,"out")==0} {if(P) print} ' in.fasta > out.fasta
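For wrapped (multi-line) records the same idea should still drop the whole record, because P is only recomputed on header lines and every sequence line that follows reuses the last value. A minimal sketch on a made-up file (demo.fasta and kept.fasta are hypothetical names; the pattern action is written in awk's shorthand form, which behaves the same as the if/print above):

```shell
# Build a small test file with wrapped (multi-line) sequences.
cat > demo.fasta <<'EOF'
>input1
UGAGGUAG
UAGG
>out1
UCCCUGAG
ACCGUGA
>desc1
ACUUCCUU
EOF

# P is set on header lines only: 1 if "out" is absent, 0 if present.
# The bare pattern P then prints every line while P is true.
awk '/^>/ {P = (index($0, "out") == 0)} P' demo.fasta > kept.fasta

cat kept.fasta
```

Note that index() looks for the literal substring anywhere in the header line, so a header such as ">scout7" would also be dropped.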

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by Pierre Lindenbaum 166k

0

Thanks. That was fast :-)

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by Assa Yeroslaviz ★ 1.9k

0

Hi Pierre! Thanks for your command.

I have a question on this issue. Instead of one header pattern "out" (in your case), I am looking for many patterns that are stored in a file. So what should I do? Your help is appreciated in advance.

ADD REPLY • link 9.9 years ago by xiachongjing ▴ 10

1

You could modify my answer below:

$ pip install pyfaidx
$ xargs faidx in.fasta -g > out.fasta < patterns.txt
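An awk-only alternative is also possible. The following is a sketch, assuming patterns.txt holds one fixed substring (not a regular expression) per line and that any header containing one of them should be dropped; demo.fasta and filtered.fasta are hypothetical names.

```shell
cat > demo.fasta <<'EOF'
>input1
UGAGGUAGUAGG
>out1
UCCCUGAGACCGUGA
>desc1
ACUUCCUUACAUGCCC
>input2
CUAUGCUUACC
EOF

# One fixed substring per line.
printf 'out\ndesc\n' > patterns.txt

# While reading the first file, load the patterns into an array.
# While reading the second, decide per header whether to keep the record.
awk '
    FILENAME == ARGV[1] { if ($0 != "") pats[$0] = 1; next }
    /^>/ {
        keep = 1
        for (p in pats) if (index($0, p) > 0) { keep = 0; break }
    }
    keep
' patterns.txt demo.fasta > filtered.fasta

cat filtered.fasta
```

Testing FILENAME against ARGV[1] rather than using the common NR == FNR idiom keeps the script from misreading the FASTA file as patterns when patterns.txt happens to be empty.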

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 9.9 years ago by Matt Shirley 10k

0

See also this post: How to remove some fasta sequences by header information from a large fasta file, any command and script please?

ADD REPLY • link 7.4 years ago by tlorin ▴ 370

0

Thank you, Pierre, for your answer. I used your command and it seems to remove the headers matching a particular pattern from the FASTA file. I was wondering whether this command also removes the whole sequence associated with each such header, since my FASTA file is not linearized and the sequences are stored across multiple lines.

Also, do you have any idea how I can store the sequences targeted for deletion in another file?

ADD REPLY • link 3.2 years ago by bioyas ▴ 20
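One way to keep the removed records as well is to let awk choose an output file each time it reads a header. This is only a sketch on made-up data; demo.fasta, kept.fasta and removed.fasta are hypothetical names, and "out" stands in for whatever pattern is being filtered.

```shell
cat > demo.fasta <<'EOF'
>input1
UGAGGUAG
UAGG
>out1
UCCCUGAG
ACCGUGA
>out2
CUCCGGGUACC
EOF

# On each header, pick the destination file for the whole record.
# Sequence lines that follow go to the same file as their header.
awk '
    /^>/ { dest = (index($0, "out") > 0) ? "removed.fasta" : "kept.fasta" }
    dest != "" { print > dest }
' demo.fasta

cat kept.fasta
cat removed.fasta
```

Because the destination only changes on header lines, wrapped sequences stay with their header. An output file is created only if at least one record is routed to it, so a leftover file from an earlier run is worth deleting first.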

0

Please

Do not revive years-old posts: the answers are just as old, and newer tools have appeared since.
Post your query as a new post, with example input and output.

ADD REPLY • link 3.0 years ago by cpad0112 21k

Ram · Accepted Answer · 2015-03-05

I like Pierre's answer to this since it's simple. However, I had been thinking about adding regular expression filtering to my pyfaidx project, and this morning I finished up adding this functionality:

$ pip install pyfaidx
$ faidx in.fasta -g "out" > out.fasta

The (small) advantage here is that faidx will perform filtering on an indexed file, preventing you from reading the entire file through your filter.