Struggle to loop over multiple files with BioPython
9.6 years ago
jon.brate ▴ 310

I modified a script that uses BioPython to remove sequences consisting only of gaps from multi-FASTA files, but I'm struggling to loop it over multiple files, as I don't fully understand how BioPython works. How would I set the input and output files in this case?

Thanks, Jon

from Bio import SeqIO

INPUT  = "test.fas"
OUTPUT = "test_output.fas"

def main():
    records = SeqIO.parse(INPUT, 'fasta')
    filtered = (rec for rec in records if any(ch != '-' for ch in rec.seq))
    SeqIO.write(filtered, OUTPUT, 'fasta')

if __name__=="__main__":
    main()
python loop biopython • 4.5k views
9.6 years ago
João Rodrigues ★ 2.5k

Hi Jon,

This has nothing to do with Biopython. You just need a regular for-loop that provides that INPUT variable of yours to the parser and the OUTPUT variable to the writer. Loops in Python work like this:

for var in iterable:
    # do something with var

You should create a new function to hold the parsing/writing logic. The main function would then call it once for each argument you provide. You probably want to be able to run the script with arguments rather than change the code every time you run it. For that, look into the sys module and its argv list or, preferably, the argparse module. Finally, you'll want to derive the output name from the input name, turning INPUT into something like INPUT_output, as you have in your code. Have a look at string slicing for this, assuming your filenames are all similar (i.e. ending in .fas), or at the str.split method.
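
A rough sketch of the argument handling and output naming described above, leaving out the Biopython part. The names make_output_name, parse_args and fasta_files, and the "_output" suffix default, are made up for illustration; os.path.splitext is used here in place of slicing or str.split because it copes with any extension.

```python
import argparse
import os


def make_output_name(path, suffix="_output"):
    """Turn 'test.fas' into 'test_output.fas' (suffix default is an assumption)."""
    root, ext = os.path.splitext(path)
    return root + suffix + ext


def parse_args(argv=None):
    """Accept one or more FASTA file names on the command line."""
    parser = argparse.ArgumentParser(
        description="Remove gap-only sequences from FASTA files.")
    parser.add_argument("fasta_files", nargs="+",
                        help="one or more input FASTA files")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    for in_path in args.fasta_files:
        out_path = make_output_name(in_path)
        # The parsing/writing function from the question would be called here.
        print(in_path, "->", out_path)
```

In a real script you would call main() under the usual if __name__ == "__main__": guard and run it as, say, python script.py a.fas b.fas.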

Good luck!

9.6 years ago

Extension to Rodrigues' answer:

It's something like:

Get a list of files, e.g. using the os module.

Write a function that takes the input and output files as arguments and performs the logic.

def filter(INPUT,OUTPUT):
    #your logic here
    pass

Use the function over each file in list of files.

for fasta in list_of_files:
    INPUT = fasta                                    # input file name
    OUTPUT = fasta.rsplit(".", 1)[0] + "_output.fa"  # output file name: input minus its extension, plus "_output.fa"
    filter(INPUT, OUTPUT)
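
For the first step, one possible way to build the list of files is the glob module, used here instead of os.listdir because it returns paths that already include the directory. This is only a sketch: list_fasta_files and the "*.fas" default pattern are assumptions, not something from the original script.

```python
import glob
import os


def list_fasta_files(directory, pattern="*.fas"):
    """Return the sorted paths of files in `directory` that match `pattern`."""
    return sorted(glob.glob(os.path.join(directory, pattern)))
```

The returned paths can be passed straight to the function that does the filtering, since they do not depend on the current working directory matching the input directory.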

Thanks both for the comments! If I may ask, am I onto something here?

from Bio import SeqIO
import re,fileinput,os

INPUTDIR  = "Users/test"

def main():
    for files in os.walk(INPUTDIR):
        records = SeqIO.parse(files, 'fasta')
        filtered = (rec for rec in records if any(ch != '-' for ch in rec.seq))
        with open (files+"stripped.fas", "w") as out:
            SeqIO.write(filtered, out, 'fasta')

if __name__=="__main__":
    main()

I get this error message:

Traceback (most recent call last):
  File "loop_test-remove_empty_fasta.py", line 14, in <module>
    main()
  File "loop_test-remove_empty_fasta.py", line 10, in main
    with open (files+"stripped.fas", "w") as out:
TypeError: can only concatenate tuple (not "str") to tuple

Run this and you will see where the problem is:

def main():
    for files in os.walk(INPUTDIR):
        records = SeqIO.parse(files, 'fasta')
        filtered = (rec for rec in records if any(ch != '-' for ch in rec.seq))
        print(files)
        #with open (files+"stripped.fas", "w") as out:
        #    SeqIO.write(filtered, out, 'fasta')
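
For reference, os.walk yields one (dirpath, dirnames, filenames) tuple per directory it visits, which is why adding a string to files raises the TypeError above. A minimal sketch of unpacking that tuple to get usable file paths; walk_fasta_files and the ".fas" default are hypothetical names chosen for illustration.

```python
import os


def walk_fasta_files(top, ext=".fas"):
    """Yield full paths of files under `top` whose names end with `ext`."""
    # Each iteration gives the directory path, its subdirectories, and its files.
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.endswith(ext):
                yield os.path.join(dirpath, name)
```

Each yielded path is a plain string, so it can be handed to SeqIO.parse or concatenated with a suffix. Note that os.walk also descends into subdirectories; if only one directory is wanted, os.listdir may be the simpler choice.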

Just a heads-up: filter is a built-in function in Python, and it's bad practice to shadow it with your own. Use another name.
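
To show what would be shadowed, here is a small sketch of the built-in filter applied to plain strings. The sample sequences are made up for illustration and stand in for real records.

```python
# The built-in filter keeps the items for which the predicate returns True.
seqs = ["ACGT", "----", "A--T"]

# A sequence is gap-only when its set of characters is exactly {"-"}.
non_empty = list(filter(lambda s: set(s) != {"-"}, seqs))
```

Once a function named filter is defined in the same module, this call would reach that function instead of the built-in.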

9.6 years ago
from Bio import SeqIO
import os

INPUTDIR = "."

# Collect every FASTA file in INPUTDIR (os.listdir returns bare names, not paths).
list_of_files = [name for name in os.listdir(INPUTDIR) if name.endswith(".fasta")]

def filter_fasta(in_path, out_path):
    # Keep only records that contain at least one non-gap character.
    records = SeqIO.parse(in_path, 'fasta')
    filtered = (rec for rec in records if any(ch != '-' for ch in rec.seq))
    with open(out_path, "w") as out:
        SeqIO.write(filtered, out, 'fasta')

for name in list_of_files:
    in_path = os.path.join(INPUTDIR, name)
    # e.g. "sample.fasta" -> "sample_filtered.fasta"
    out_path = os.path.splitext(in_path)[0] + "_filtered.fasta"
    filter_fasta(in_path, out_path)