How to delete seq_id + sequence from FASTA file matching specific seq_id
3
0
Entering edit mode
9.2 years ago

Hi, I should delete sequences from a FASTA file and I have a list of all the sequences to delete.

Since I have to remove the sequence ID and the sequence I have to delete two lanes when there is the seq_id of interest.

I tried to use grep with the -v option:

grep -vA1 M03047:9:000000000-AD969:1:1101:16618:2205/1 file1 > file2

but it removes only the first line (the one with the seq_id): it seems that the -v option has effect only on -A and not on -A1

I tried other ways with sed instead of grep but I always get some problems, grep seems the best way but I have to figure out how to remove not only the first lane but also the second one.

Any ideas?

Thank you

grep FASTA • 13k views
ADD COMMENT
5
Entering edit mode
9.2 years ago
samuelmiver ▴ 440

I add a new solution using biopython able to remove the sequence with identifiers in a given file. I consider Biopython more reliable and reproducible than grep/sed/awk when working with a great number of sequences.

As I do not know your experience with python, I will explain the solution in detail:

1) Ensure you have python and biopython installed. Type in your terminal:

python -c "import Bio"
echo $?

If "0" displayed, pass to step 2. In case error message appears, install it by easy_install or pip:

sudo easy_install -f http://biopython.org/DIST/ biopython

or

sudo pip install biopython

In case of not having pip installed, type:

sudo apt-get install pip

2) Create a script ".py" with this script ( filter.py, for example ) using your favourite text editor:

#!/usr/bin/env python

import sys
from sets import Set
from Bio import SeqIO

def list_ids():
    """
    Return a set containing the identifiers presented in a file,
    line by line, starting with ">"
    """

    # read the first file given and generate a set (faster iteration respect lists

    identifiers = Set([])

    with open(sys.argv[1], 'r') as fi:
        for line in fi:
            line = line.strip()
            identifiers.add(str(line).replace(">", ""))

    return identifiers

def filter():
    """
    Writes a file containing only the sequences with identifiers NOT
    present in a set of identifiers
    """

    identifiers = list_ids()

    with open(sys.argv[2]) as original_fasta, open(sys.argv[3], 'w') as corrected_fasta:
        records = SeqIO.parse(original_fasta, 'fasta')
        for record in records:
            print record.id
            if record.id not in identifiers:
                SeqIO.write(record, corrected_fasta, 'fasta')

if __name__ == '__main__':
    filter()

3) Run the program as follows:

chmod +x filter.py
./filter.py your_ids.txt IN.fasta OUT.fasta

where:

  • your_ids.txt --> A file containing the identifiers including > you want to EXCLUDE, one identifier per line; like:

    >hhh 
    >ghag 
    >MAMND
    
  • IN.fasta --> your original fasta file

  • OUT.fasta --> your output file
ADD COMMENT
0
Entering edit mode

Thank you very much for your help. The script works well.

ADD REPLY
0
Entering edit mode

You're welcome, happy to help!

ADD REPLY
0
Entering edit mode

Thank you very much. It totally worked :)

ADD REPLY
0
Entering edit mode

Happy to hear that, you're welcome!

ADD REPLY
0
Entering edit mode

Will this script still work with python3 ? How can I modify it so it will ?

ADD REPLY
0
Entering edit mode

I had the same problem, I tried several alternatives in the windows subsystem for Linux, until I found your script and it has been very helpful. Thank you samuelmiver

ADD REPLY
0
Entering edit mode

Hi samuelmiver,

When I run filter.py, it says

  File "/home/zwl/test/./filter.py", line 35
    print record.id
          ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(record.id)?
ADD REPLY
2
Entering edit mode
9.2 years ago
michael.ante ★ 3.9k

You can also have a look at FAST (Fast Analysis of Sequences Toolbox). There you have fasgrep:

fasgrep -v "M03047:9:000000000-AD969:1:1101:16618:2205/1" file1 > file2

It will remove the whole entry, but change the window-length. You could adjust it with the fastx toolkit:

fasta_formatter -i file2 -o file3 -w 60
ADD COMMENT
0
Entering edit mode

Thank you michael.ante

ADD REPLY
1
Entering edit mode
9.2 years ago
samuelmiver ▴ 440

This is a perfect case where awk can save you:

cat your_fasta.fasta | awk '{if (substr($0,1) == ">M03047:9:000000000-AD969:1:1101:16618:2205/1") censor=1; else if (substr($0,1,1) == ">") censor=0; if (censor==0) print $0}' > fixed.fasta

I have just adapted one example found here

ADD COMMENT
0
Entering edit mode

I would ask for a second advice: I have to repeat this step (delete sequences) for 100K times or more.

Is there any way to use a file containing the list of sequences to remove instead of creating a script for everyone?

ADD REPLY
0
Entering edit mode

Any problem with (bio)python?

ADD REPLY
0
Entering edit mode

It would be a good idea to add this requirement to the initial message (EDIT button) in order to help and correctly inform further users about the whole problem.

ADD REPLY
0
Entering edit mode

Simpler awk code for deleting record:

cat your_fasta.fasta | awk '/>M03047:9:000000000-AD969:1:1101:16618:2205/ {getline; while(!/>/) {getline}} 1' > fixed.fasta
ADD REPLY

Login before adding your answer.

Traffic: 1160 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6