Extracting named fasta sequences according to list with Biopython
1
0
Entering edit mode
3.1 years ago
lachiemck • 0

Hi all, I'm trying to work out a quick script to extract a set of sequence fasta files from a multifasta and write them all to a new, single fasta file. To elaborate, I've got a proteome, and I want to extract a group of 15 or so proteins associated with a certain process, and write them to a new multifasta. To do so I want the script to read a document with a list of sequence names in it and sort the original multifasta using that list. I'm aiming to do this using Biopython.

This is my code so far:

from Bio import SeqIO
import sys

sample_file = open(str(sys.argv[2]), "r")
seq_list = []

outfile = open(str(sys.argv[3]), "w")

#This reads the guide document and turns each line into a list item in seq_list.
for line in sample_file:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    seq_list.append(line_list)

sample_file.close()
print(seq_list)
#This print function is to confirm that seq_list is indeed storing the names.

for record in SeqIO.parse(str(sys.argv[1]), "fasta"):
    for n in seq_list:
        if n == record.id:
            SeqIO.write(record, outfile, "fasta")

outfile.close()

The main problem so far is that I can load the document of names into seq_list and print the list, but parsing SeqIO with it doesn't seem to do anything. However, hardcoding the names into the code seemed to work fine. Any help would be greatly appreciated.

Thanks, Lachlan

Biopython FASTA • 1.8k views
ADD COMMENT
1
Entering edit mode
3.1 years ago

The problem is that the elements of your seq_list are other lists, whereas the record.id is a string. Plus you should never use the in operator on a list, here is a better solution

collect = set()
for line in sample_file:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    collect.update(line_list)

then later:

for record in SeqIO.parse(str(sys.argv[1]), "fast"):
    if record.id in collect:
         SeqIO.write(record, outfile, "fasta")
ADD COMMENT
0
Entering edit mode

the last line of SeqIO.write(record.....'fasta') in a for loop will input the the last record into new fasta file and it will only contain the one sequence or none considering the conditions. the best option would be using 'a' while opening up the file: outfile = open(str(sys.argv[3]), "a") or, if you want to use regular 'w' mode while opening the file: then,

with open(str(sys.argv[3],'w') as file:
     for record in SeqIO.parse(str(sys.argv[1],'fasta'):
              if record.id not in collect:
                   SeqIO.write(record,file,'fasta')
ADD REPLY

Login before adding your answer.

Traffic: 2358 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6