How To Remove Certain Sequences From A Fasta File
3
12
Entering edit mode
14.0 years ago
Emmecola ▴ 180

Hi everyone

Is there a fast way to do this filter?

I have a huge Fasta file (sequences are short reads coming from an Illumina instrument). I have also a list of nucleotide sequences (not Fasta, just the sequences) and I want to remove from the big Fasta file all entries identical to those in the list.

My idea was simply to go down through the Fasta file and then, for every read, check all the sequences of the list. If the read matches one of the sequences then do nothing, otherwise print the read into a new file. I made this with perl but it takes ages!

The list is made up of nucleotide sequences, not IDs. It's something like this:

AACGACTACTTATCGATC

TCGGCGATATACGTAC

CCAGTTTCGGGGCTAT ....

Thanks!

fasta parsing short • 31k views
ADD COMMENT
1
Entering edit mode

can you make a better example... you have a file with a lot of sequences, and a list of sequence ids to remove from it. Is it correct?

ADD REPLY
1
Entering edit mode

I'm fairly sure this has been asked/answered before - check the "Related" box or search the archives.

ADD REPLY
0
Entering edit mode

Yes, except that the list is not made up of IDs. They are nucleotide sequences.

ADD REPLY
0
Entering edit mode

Do I understand correctly that you have a list of nucleotide sequences, and you want to go through a FASTA file and remove all entries that contain an exact, full-length hit to one of the nucleotide sequences in your list? Are the nucleotide sequences in your list by any chance all of the same length?

ADD REPLY
0
Entering edit mode

Have a look at these two questions:

Also, make a search in the archives.

ADD REPLY
12
Entering edit mode
14.0 years ago
  • linearize your fastq file into 3 columns: sequence, ID, qualities (using 'awk', or 'perl')
  • sort it using the sequence as the key ( unix 'sort')
  • sort your 2nd file of sequences ( unix 'sort')
  • use unix 'join' to filter out the first file and re-transform it to FASTQ using awk or perl
ADD COMMENT
0
Entering edit mode

Pierre's solution is the way to go for sure, assuming you don't need to account for mismatches, INDELs, etc. Use the -v option in join to exclude the FASTQ sequences that match your filter set.

The beauty of this recipe is that is applies to many other scenarios and is not constrained by memory. I probably do something close to this recipe (in other situations) 100 times a year.

ADD REPLY
12
Entering edit mode
14.0 years ago

Hi,

EDIT: I had misunderstood that you actually had the nucleotide sequences in the exclusion file. I modified the code accordingly.

You can use the following script. Put it in a file named fasta_remove.py, make the file executable:

chmod +x fasta_remove.py

and use it by typing:

./fasta_remove.py <input_fasta> <remove_file> <output_fasta>

The remove_file file contains one sequence name per line that needs to be removed from the input_fasta file. The input fasta file is not limited in size. You will need to have Biopython installed and this works under Linux, obviously.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]  # Input fasta file
remove_file = sys.argv[2] # Input wanted file, one gene name per line
result_file = sys.argv[3] # Output fasta file

remove = set()
with open(remove_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            remove.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
    for seq in fasta_sequences:
        nuc = seq.seq.tostring()
        if nuc not in remove and len(nuc) > 0:
            SeqIO.write([seq], f, "fasta")

Cheers!

ADD COMMENT
0
Entering edit mode

Hi

I am working with old sanger reads, and I have the fasta and qual files separated. I will like to remove some sequences from the qual file using a list of names of those sequences but I am having problems with these. ¿Do yo have a solution for this?

Thankyou very much

ADD REPLY
0
Entering edit mode

ask this as a new question please.

ADD REPLY
0
Entering edit mode

For me, I had to change the code to this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]  # Input fasta file
remove_file = sys.argv[2] # Input wanted file, one gene name per line
result_file = sys.argv[3] # Output fasta file

remove = set()
with open(remove_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            remove.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
    for seq in fasta_sequences:
        **nam = seq.id
        **nuc = str(seq.seq)
        **if nam not in remove and len(nuc) > 0:
            SeqIO.write([seq], f, "fasta")

(I changed the lines beginning with '**' )

Versions: Python 2.7.11 and Biopython 1.67

ADD REPLY
0
Entering edit mode

Hi Eric. Thanks for the script. Do you have a suggestion so as how the 'wanted' file should look exactly? I tried with text file like this:

>TRINITY_DN10009_c0_g1_i1   
>TRINITY_DN103518_c0_g1_i1
>TRINITY_DN104798_c0_g1_i1

or just to be sure also like this

TRINITY_DN10009_c0_g1_i1    
TRINITY_DN103518_c0_g1_i1
TRINITY_DN104798_c0_g1_i1

but the script didn't work. The output file was the same as input.

Any suggestions for the troubleshooting? or other script? MilleThanks.

ADD REPLY
0
Entering edit mode

Hi I tried to use the script but it needs to be updated. I am not an expert so just changed my_seq.tostring() to str() but it is not working. Also does the fasta file needs to have a single line per sample?

Message:

'BiopythonDeprecationWarning: This method is obsolete; please use str(my_seq) instead of my_seq.tostring().
  BiopythonDeprecationWarning)'

Thanks

ADD REPLY
1
Entering edit mode
14.0 years ago

An alternative to Pierres suggestion: use

grep -v SeqToBeDeleted inputfile > outputfile

and iterate on all your Sequences to be removed.

ADD COMMENT
5
Entering edit mode

Unless the number of sequences that you want to remove is very small, this will be prohibitively slow. In any case, if you want to use grep for it, here are two simple optimizations:

  1. use the -f option to give grep a file with all sequences that you want to search with instead of running grep multiple times, and
  2. use fgrep instead of grep since you only need to do exact string matching and not full regular expression matching.
ADD REPLY
2
Entering edit mode

Plus it will only remove the sequence part of the FASTQ entry and will result in a mangled file consisting of proper FASTQ entries that should remain and incomplete (i.e. head, separator, qual, yet no sequence) entries for the matched sequences.

ADD REPLY

Login before adding your answer.

Traffic: 2637 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6