Filtering sequences from a Fasta file using Biopython
4
2
Entering edit mode
7.9 years ago
tpaisie ▴ 80

Hi everyone, I'm pretty sure this would be a super simple python script using Biopython, but could someone point me in the right direction? I have a Fasta file with 229 sequences, they are named by only their genbank accession number. I have a text file with the genbank accession numbers of the sequences I want to extract from the original fasta file with the 229 sequences. I would like to be able to filter out the sequences listed in my text file to create a new fasta file, with only my sequences of interest.

I feel like I accidentally did this once when I was trying to write a script for another purpose. I also am trying to actively learn biopython, so just a push in the right direction would help a lot!! Thanks!

Biopython Python Phylogenetics Fasta • 9.4k views
ADD COMMENT
1
Entering edit mode
7.9 years ago

I assume you went through the Biopython Tutorial and Cookbook?

This is indeed pretty straightforward. Read the text file with accession numbers in a list (desiredList), iterate over your fasta file and if the record.id attribute is in the desiredList, add it to an output list. Then after looping, write the desired sequences to an output file. Need more pointers?

ADD COMMENT
0
Entering edit mode

so biopython has this as their filtering example:

from Bio import SeqIO
input_file = "big_file.sff"
id_file = "short_list.txt"
output_file = "short_list.sff"
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print("Found %i unique identifiers in %s" % (len(wanted), id_file))
records = (r for r in SeqIO.parse(input_file, "sff") if r.id in wanted)
count = SeqIO.write(records, output_file, "sff")
print("Saved %i records from %s to %s" % (count, input_file, output_file))
if count < len(wanted):
    print("Warning %i IDs not found in %s" % (len(wanted)-count, input_file))

so when i modify this example to my files it gives me only one sequence in my output fasta file, I guess i'm confused why its doing that.

ADD REPLY
0
Entering edit mode

I assume you adapted the code to work for fasta format? Could you show your code as well? This seems right...

ADD REPLY
0
Entering edit mode
from Bio import SeqIO
input_file = "cov_uniq_aln_103016.fasta"
id_file = "CoV_seq_names_010517.txt"
output_file = "cov_less_seqs_010517.fasta"
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print("Found %i unique identifiers in %s" % (len(wanted), id_file))
records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
count = SeqIO.write(records, output_file, "fasta")
print("Saved %i records from %s to %s" % (count, input_file, output_file))
if count < len(wanted):
    print("Warning %i IDs not found in %s" % (len(wanted)-count, input_file))

Maybe i'm missing something in the code for the fasta format? it does output a fasta file, just only with one sequence instead of the 80 or so sequence ids in the text file.

ADD REPLY
0
Entering edit mode

And the wanted variable looks okay? It contains the expected number of identifiers (and the expected values)?

ADD REPLY
0
Entering edit mode

This is my output from the terminal:

Found 1 unique identifiers in CoV_seq_names_010517.txt Saved 1 records from cov_uniq_aln_103016.fasta to cov_less_seqs_010517.fasta

And the output fasta file has only the first sequences on my id_file, so I think the problem is this line:

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))

ADD REPLY
0
Entering edit mode

Indeed, only one identifier was found. You will need to modify the line you mentioned. How does your id_file look like? How is it separated?

ADD REPLY
1
Entering edit mode
7.9 years ago
mforde84 ★ 1.4k

You're making this way too hard on yourself. Bash is pretty awesome if you give it a try. :)

$ head lookup.txt;
GI123
GI987
GI3873
$ while read -r line; do
     sed -n -e "/>$line/,/>/ p" sequence.fasta | head -n-1 >> newfasta.txt;
  done < "lookup.txt";
ADD COMMENT
1
Entering edit mode

Definitely agree with the usefulness of bash for simple tasks, but in the spirit of learning biopython for more complex tasks I understand OP tries Biopython for this job as well.

(And I think I don't need a lot more lines in Python to do the same job ;-)
Constructs like "/>$line/,/>/ p" scare me. )

ADD REPLY
1
Entering edit mode
7.9 years ago
Joe 21k

Wrote a multipurpose fasta-fetcher a while back here: https://github.com/jrjhealey/bioinfo-tools/blob/master/fastafetcher.py

It supports finding the right fasta by a file of keys, by string directly, and also has an 'invert' parameter to retrieve everything but the ones you list. Usage I think is pretty self explanatory (script has a help function thanks to argparse) but feel free to ask for clarification:

python fastafetcher.py -k keyfile.txt -f fastafile.fasta [--invert]

key file has to be a simple list of exact strings, one per line.

# Extract fasta files by their descriptors stored in a separate file.
# Requires biopython

import StringIO
from Bio import SeqIO
import sys
import traceback
import warnings
import argparse

def getKeys(keyFile):
    """Turns the input key file into a tuple. May be memory intensive."""

    with open(keyFile, "r") as kfh:
        keys = []
            for line in kfh:
            line = line.rstrip('\n')
            line = line.lstrip('>')
            keys.append(line)

    return keys

def main():
    """Takes a list of strings in a text file (one per line) and retreives them and their sequences from a provided multifasta."""
    # Parse arguments from the commandline:
    try:
        parser = argparse.ArgumentParser(description='Retrieve one or more fastas from a given multifasta.')
        parser.add_argument(
            '-f',
            '--fasta',
            action='store',
            required=True,
            help='The multifasta to search.')
        parser.add_argument(
            '-k',
            '--keys',
            action='store',
            required=True,
            help='A file of header strings to search the multifasta for. Must be exact. Must be one per line.')
        parser.add_argument(
            '-o',
            '--outfile',
            action='store',
            default=None,
            help='Output file to store the new fasta sequences in. Just prints to screen by default.')
        parser.add_argument(
            '-v',
            '--verbose',
            action='store_true',
            help='Set whether to print the key list out before the fasta sequences. Useful for debugging.')
        parser.add_argument(
            '-i',
            '--invert',
            action='store_true',
            help='Invert the search, and retrieve all sequences NOT specified in the keyfile.')
        args = parser.parse_args()

    except:
        print "An exception occured with argument parsing. Check your provided options."
        traceback.print_exc()

    # Rename args to enable them to be provided to the getKeys function:
    keyFile = args.keys
    inFile = args.fasta
    outFile = args.outfile
# Main code:
    # Call getKeys() to create the tuple of keys from the provided file:
    try:
        keys = getKeys(keyFile)
    except IOError:
        keys = args.keys


    if args.verbose is not False:
        if args.invert is False:
            print('Fetching the following keys from: ' + inFile)
            for key in keys:
                print(key)
        else:
            print('Ignoring the following keys, and retreiving everything else from: ' + inFile)
            for key in keys:
                print(key)

    # Parse in the multifasta and assign an iterable variable:
        seqIter = SeqIO.parse(inFile, 'fasta')

    # For each sequence in the multifasta, check if it's in the keys[] tuple. If so, print it out:
    for seq in seqIter:
        if args.invert is False:
            if seq.id in keys:
                print(seq.format("fasta"))
            if args.outfile is not None:
                SeqIO.write(seq, outFile, "fasta")
        else:
    # If the --invert/-i flag is given, print all fastas NOT listed in the keyfile
            if seq.id not in keys:
                print(seq.format("fasta"))
            if args.outfile is not None:
                SeqIO.write(seq, outFile, "fasta")

if __name__ == "__main__":
    main()
ADD COMMENT
1
Entering edit mode

Looks pretty, nicely commented too.
But don't you mean "list" where you write tuple? That's not the same structure :p Also, I would rewrite getKeys() as below:

def getKeys(keyFile):
    """Turns the input key file into a list. May be memory intensive."""
    with open(keyFile, "r") as kfh:
        return([line.lstrip('>') for line in kfh if not line = ""])
ADD REPLY
1
Entering edit mode

Oh very probably :P I'm still fairly novice when it comes to python scripting so I tend to write things in an overly vebose way just partly because thats how I think about the problem, but also because I usually give the scripts to others in my lab who are even less well versed in it than I am so I figure it makes it easier to follow :P comments and criticisms are always welcome for my own learning anyway :)

ADD REPLY
0
Entering edit mode

There is for sure nothing wrong with writing verbose code, definitely a plus for readability. I think my version would be a bit faster, but for most (small) lists the difference is probably neglectable.

ADD REPLY
0
Entering edit mode
7.9 years ago
tpaisie ▴ 80

So I figured it out! The original script works, I saved my id_file has a tab delimited file in excel and that was the problem! You guys are really motivating me to work on my python!

ADD COMMENT
0
Entering edit mode

Good to hear! You were pretty close to the solution yourself, too.

ADD REPLY
0
Entering edit mode

Excel strikes again!

ADD REPLY

Login before adding your answer.

Traffic: 2512 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6