Question

Extracting named fasta sequences according to list with Biopython

0

3.1 years ago

lachiemck • 0

Hi all, I'm trying to work out a quick script to extract a set of sequences from a multi-FASTA file and write them all to a new, single FASTA file. To elaborate, I've got a proteome, and I want to extract a group of 15 or so proteins associated with a certain process and write them to a new multi-FASTA. To do so I want the script to read a document with a list of sequence names in it and filter the original multi-FASTA using that list. I'm aiming to do this using Biopython.

This is my code so far:

from Bio import SeqIO
import sys

sample_file = open(str(sys.argv[2]), "r")
seq_list = []

outfile = open(str(sys.argv[3]), "w")

#This reads the guide document and turns each line into a list item in seq_list.
for line in sample_file:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    seq_list.append(line_list)

sample_file.close()
print(seq_list)
#This print function is to confirm that seq_list is indeed storing the names.

for record in SeqIO.parse(str(sys.argv[1]), "fasta"):
    for n in seq_list:
        if n == record.id:
            SeqIO.write(record, outfile, "fasta")

outfile.close()

The main problem so far is that I can load the document of names into seq_list and print the list, but parsing SeqIO with it doesn't seem to do anything. However, hardcoding the names into the code seemed to work fine. Any help would be greatly appreciated.

Thanks, Lachlan

Biopython FASTA • 1.8k views

updated 6 months ago by Rubayetul • 0 • written 3.1 years ago by lachiemck • 0

score 1 · Answer 1 · 2021-10-19

1

3.1 years ago

Istvan Albert 102k

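The problem is that the elements of your seq_list are themselves lists (line.split() returns a list), whereas record.id is a string, so n == record.id is never true. Also, avoid membership tests against a list when a set will do, since each lookup on a list scans every element. Here is a better solution: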

collect = set()
for line in sample_file:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    collect.update(line_list)

then later:

for record in SeqIO.parse(str(sys.argv[1]), "fasta"):
    if record.id in collect:
        SeqIO.write(record, outfile, "fasta")
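To see why the original comparison never matched, here is a small sketch that needs no Biopython. The names file and the record ID are made-up stand-ins (protA, protB), not data from the question:

```python
import io

# Hypothetical stand-in for the names file: one ID per line.
names_file = io.StringIO("protA\nprotB\n")

as_lists = []   # what the question's loop builds: a list of lists
as_set = set()  # what the set-based loop builds: a flat set of strings
for line in names_file:
    fields = line.strip().split()
    as_lists.append(fields)  # appends ['protA'], then ['protB']
    as_set.update(fields)    # adds 'protA', then 'protB'

record_id = "protB"  # stand-in for record.id, which is a plain string

# A list such as ['protB'] is never equal to the string 'protB'.
list_match = any(n == record_id for n in as_lists)
# A set of strings can be tested directly against the string.
set_match = record_id in as_set

print(as_lists)
print(list_match, set_match)
```

Using seq_list.extend(line_list) instead of append would also flatten the names, but a set additionally makes each lookup fast.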

0

Calling SeqIO.write(record, ..., 'fasta') inside a for loop keeps only the last record if you pass it a filename, because the file is reopened in write mode on every call; the new fasta file then contains one sequence, or none, depending on the condition. Passing an already open handle, as in the code above, avoids this. One option is to open the file in append mode: outfile = open(str(sys.argv[3]), "a"). Or, if you want to use regular 'w' mode, keep the handle open with a with block:

with open(str(sys.argv[3]), 'w') as file:
    for record in SeqIO.parse(str(sys.argv[1]), 'fasta'):
        # keep only the records whose ID is in the list of names
        if record.id in collect:
            SeqIO.write(record, file, 'fasta')
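An alternative that avoids the file-mode question entirely is to hand SeqIO.write a generator of matching records, so the output is written in a single call. This is only a sketch: the function names load_ids and extract_records are made up for illustration, and it assumes the names in the list match record.id, which for FASTA is the first whitespace-delimited token of the header line.

```python
from Bio import SeqIO


def load_ids(names_path):
    """Read whitespace-separated sequence names into a set."""
    with open(names_path) as handle:
        return {token for line in handle for token in line.split()}


def extract_records(fasta_path, names_path, out_path):
    """Write records whose ID is in the names file; return how many were written."""
    wanted = load_ids(names_path)
    # Lazily filter the parsed records so the whole proteome is never held in memory.
    matches = (rec for rec in SeqIO.parse(fasta_path, "fasta") if rec.id in wanted)
    # One call with an iterable writes every matching record to out_path.
    return SeqIO.write(matches, out_path, "fasta")
```

It could be called as extract_records(sys.argv[1], sys.argv[2], sys.argv[3]), matching the argument order in the question. The returned count is a quick check that all the expected proteins were found.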

6 months ago by Rubayetul • 0