Filter out FASTA file entries by pattern
9.7 years ago
Assa Yeroslaviz ★ 1.9k

Hi,

I would like to filter out some sequences from a fasta file by using a specific pattern.

For example I have this file:

>input1
UGAGGUAGUAGG
>input2
CUAUGCUUACC
>out1
UCCCUGAGACCGUGA
>out2
CUCCGGGUACC
>desc1
ACUUCCUUACAUGCCC

I know already how I can extract all the fasta sequences with a specific pattern into a new file by using awk.

But what I would like to do is remove all entries matching a specific pattern from the original fasta file and save the result to a new file. In the file above, for example, I would like to remove all sequences whose header contains the pattern out and save only the others to a new file.

Is there a tool somewhere for doing that, or is it possible in awk/sed or even grep?

Thanks

Assa

pattern fasta awk sed regexp • 9.2k views
9.7 years ago

I would like for example to remove all sequences with the header pattern out. and save only the other to a new file

awk '/^>/ {P=index($0,"out")==0} {if(P) print} ' in.fasta > out.fasta
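Note that index($0,"out") matches "out" anywhere in the header line, so a header such as >scout7 would be dropped as well. If only names that start with "out" should go, an anchored regular expression is one option. The following is a minimal sketch run on the sample from the question; the flag name keep is just an illustrative choice.

```shell
# Recreate the sample file from the question.
cat > in.fasta <<'EOF'
>input1
UGAGGUAGUAGG
>input2
CUAUGCUUACC
>out1
UCCCUGAGACCGUGA
>out2
CUCCGGGUACC
>desc1
ACUUCCUUACAUGCCC
EOF

# On each header line, set keep to 1 unless the header starts with ">out".
# The bare pattern `keep` then prints every line while the flag is set.
awk '/^>/ {keep = ($0 !~ /^>out/)} keep' in.fasta > out.fasta

cat out.fasta
```

With this input, out.fasta should contain only the input1, input2 and desc1 records.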

Thanks. That was fast :-)


Hi Pierre! Thanks for your command.

I have a question on this issue. Instead of a single header pattern ("out" in your case), I want to filter on many patterns that are stored in a file. How should I do that? Thanks in advance for your help.


You could modify my answer below:

$ pip install pyfaidx
$ xargs faidx in.fasta -g > out.fasta < patterns.txt
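For readers who prefer to stay with awk, the same idea can be sketched without extra tools. This assumes patterns.txt is non-empty and holds one fixed substring (not a regular expression) per line; the array name pat and the flag keep are illustrative names, not part of any answer above.

```shell
# Sample FASTA from the question.
cat > in.fasta <<'EOF'
>input1
UGAGGUAGUAGG
>input2
CUAUGCUUACC
>out1
UCCCUGAGACCGUGA
>out2
CUCCGGGUACC
>desc1
ACUUCCUUACAUGCCC
EOF

# Hypothetical pattern file: one fixed substring per line.
printf 'out\ndesc\n' > patterns.txt

awk '
    # First file (patterns.txt): remember every non-empty line as a pattern.
    NR == FNR { if ($0 != "") pat[$0]; next }
    # Header line: keep the record only if no pattern occurs in the header.
    /^>/ {
        keep = 1
        for (p in pat) if (index($0, p) > 0) { keep = 0; break }
    }
    # Print header and sequence lines of kept records.
    keep
' patterns.txt in.fasta > out.fasta

cat out.fasta
```

Because the decision is made once per header and reused for the lines that follow, the loop over patterns runs only on header lines, not on sequence lines.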

Thank you Pierre for your answer. I used your command and it seems to remove the headers matching a particular pattern from the fasta file. I was wondering whether the command also removes the whole sequence associated with each header, as my fasta file is not linearized and the sequences are stored on multiple lines.

Also, do you have any idea how I can store the sequences targeted for deletion in another file?
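On the multi-line point: in a flag-based awk filter like the one-liner earlier in this thread, the flag is only recomputed on lines starting with ">", so all sequence lines up to the next header share the fate of their header. A minimal sketch of this, which also routes the removed records to a second file, might look as follows. The file names wrapped.fasta, kept.fasta and removed.fasta are hypothetical, as is the wrapped sample data.

```shell
# Hypothetical multi-line input: each sequence is wrapped over two lines.
cat > wrapped.fasta <<'EOF'
>input1
UGAGGUAG
UAGG
>out1
UCCCUGAG
ACCGUGA
>desc1
ACUUCCUU
ACAUGCCC
EOF

awk -v kept="kept.fasta" -v removed="removed.fasta" '
    # Decide once per record, on the header line only.
    /^>/ { drop = (index($0, "out") > 0) }
    # Route header and sequence lines alike to the matching output file.
    { print > (drop ? removed : kept) }
' wrapped.fasta

cat removed.fasta
```

If no header matches, removed.fasta is simply never created, since awk opens an output file only on the first print to it.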


Please

  1. Do not revive years-old posts: the answers are just as old, and new tools have appeared since.
  2. Post your query as a new post with example input and output.

9.7 years ago

I like Pierre's answer to this since it's simple. However, I had been thinking about adding regular expression filtering to my pyfaidx project, and this morning I finished up adding this functionality:

$ pip install pyfaidx
$ faidx in.fasta -g "out" > out.fasta

The (small) advantage here is that faidx performs the filtering on an indexed file, so the entire file does not have to be read through your filter.

2.6 years ago
hans ▴ 20

Using samtools

$ samtools faidx in.fasta
$ awk '/^>/ {print substr($1,2,400) }'  in.fasta | grep -v "out" list >selected_list
$ samtools faidx in.fasta -r selected_list > out.fasta

This will give you all the functionality of "grep" to select the desired sequences.


Why print substr($1,2,400)?

Why pipe into grep -v "out" when the same grep also reads a file named list?

print substr($1,2,400) 

Keeps only the first field of the FASTA header while dropping the leading ">". The file name "list" in the grep command is a mistake and should be deleted:

awk '/^>/ {print substr($1,2,400) }'  in.fasta | grep -v "out"  >selected_list
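As a small illustration of what "all the functionality of grep" can mean for the name list, the sketch below excludes several name prefixes at once with an extended regular expression. The header descriptions in the sample are hypothetical, and substr($1, 2) without a length argument is used here as an assumption-free way to take everything after the ">", so no upper bound such as 400 is needed.

```shell
# Hypothetical sample with descriptions after the sequence name.
cat > in.fasta <<'EOF'
>input1 first test sequence
UGAGGUAGUAGG
>input2 second test sequence
CUAUGCUUACC
>out1 unwanted
UCCCUGAGACCGUGA
>out2 unwanted
CUCCGGGUACC
>desc1 also unwanted
ACUUCCUUACAUGCCC
EOF

# Take the first whitespace-delimited field of each header, strip the ">",
# then drop every name that starts with "out" or "desc".
awk '/^>/ {print substr($1, 2)}' in.fasta | grep -v -E '^(out|desc)' > selected_list

cat selected_list
```

The resulting selected_list can then be handed to the samtools faidx step shown in the answer above.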