Question

A code for splitting FASTA file, No output is written for the file

0

Entering edit mode

7.6 years ago

AsoInfo ▴ 300

I have written code for splitting FASTA into desired number of sequences and desired number of files. But it doesn't write output for the second file or the last file. I couldn't figure it out.

The code is as follows:

file_open=open("sequence.txt","r") # Input file
count_file=2 # Number of output files
count_sequence=5 # Sequences in each output file

sequence=0 # Count total sequences until count_sequences
for i in range(count_file,0,-1): # Loop for file names generation
    sequence_file=open("sequence"+str(i)+".fasta","w") # Output file
    for line in file_open: # Reading the sequences one by one
        if line.startswith(">"):
            fasta=""
            for data_line in file_open:
                if data_line.startswith("\n"):
                    break
                else:
                    fasta=fasta+data_line

            print(sequence_file)
            print(line+fasta)

            sequence_file.write(line+fasta+"\n") # Writing the FASTA sequences to file
            sequence=sequence+1 # Increment the count of the sequence 

        if sequence==count_sequence: # If number of sequences are equal to the desired number of sequences
            sequence=0 # reset the counter
            break

In this case, sequence1.fasta is empty but sequence2.fasta have the first five sequences.

I know that there are many tools mentioned in the forum that do the desired thing but I want the specific format and output extension, that's why I wrote this but couldn't run it.

Thanks!

Splitting FASTA Python • 2.3k views

ADD COMMENT • link updated 7.6 years ago by Andrzej Zielezinski 11k • written 7.6 years ago by AsoInfo ▴ 300

0

Entering edit mode

the loop logic for lines is not right. only one level is needed

ADD REPLY • link 7.6 years ago by shenwei356 8.7k

0

Entering edit mode

I modified the code and it worked....

ADD REPLY • link 7.6 years ago by AsoInfo ▴ 300

0

Entering edit mode

first I think you need to make an index from the fasta.

file_open = open (sys.argv[1],"r")  # Input file
seqs = {}
seq = ""
for line in file_open:

    line = line.rstrip('\n')

    if line.startswith('>'):

        if seq:

            seqs[name] = seq 
            seq = ""
        name = line

    else:

        seq = seq + line
#the last sequence
seqs[name] = seq 

file_open.close()

Then, you can print its elements in loop for i in range(int(count_file)):

ADD REPLY • link 7.6 years ago by Buffo ★ 2.4k

1

Entering edit mode

Just preference but this would be nicer, the with block is so handy (untested):

with open(sys.argv[1],"r") as file_open:  # Input file
    seqs = {}
    seq = ""
    for line in file_open:
        line = line.rstrip('\n')
        if line.startswith('>'):
            if seq:
            seqs[name] = seq 
            seq = ""
        name = line
    else:
        seq = seq + line
#the last sequence
seqs[name] = seq

ADD REPLY • link 7.6 years ago by Joe 21k

0

Entering edit mode

It worked fine in my computer with python 3.4.2 (windows 8.1). However, your code

data_line.startswith("\n"):

means that sequences in the file must end with empty line. I think it does not match the specification of the FASTA format. I recommend you to check your input file "sequence.txt". (Does it really have more than 5 sequences?)

But I recommend much stronger to change the code compatible for any FASTA formatted files.

In addition,

for line in file_open: # Reading the sequences one by one

for data_line in file_open:

I don't like to use same file object in inner loop (But I'm not python person. It may be a common way in python). (It may be the same thing which shenwei356 mentioned.)

ADD REPLY • link 7.6 years ago by fishgolden ▴ 520

0

Entering edit mode

Could you explain what your desired format / output extension is? It might be easier to use seqkit to split the sequence and then write a much simpler shell script to do the post-processing?

ADD REPLY • link 7.6 years ago by James Ashmore ★ 3.5k

0

Entering edit mode

I modified the code and it worked perfectly, thanks for your help.

ADD REPLY • link 7.6 years ago by AsoInfo ▴ 300

score 3 · Accepted Answer · 2017-04-21

3

Entering edit mode

7.6 years ago

Andrzej Zielezinski 11k

I followed the logic of your script and tweaked it a little. Now, the script should do what you wanted. However, I just want to let you know that there are other (more efficient and less complicated) - ways to implement this task in Python.

file_open = open("sequence.txt","r")
MAX_FILES = 2    # Number of output files
SEQ_PER_FILE = 5 # Sequences in each output file

sequence_n = 0 # Counter for sequences
file_n = 0  # Counter for output files

for line in file_open:
    if not line.strip(): continue
    if line.startswith('>'):
        if sequence_n:
            if sequence_n%SEQ_PER_FILE == 0:
                file_n += 1
                oh = open('sequence'+str(file_n)+'.fasta', 'w')
                oh.write(fasta)
                oh.close()
                if file_n == MAX_FILES: 
                    break
                fasta = line
            else:
                fasta += line
        else:
            fasta = line
        sequence_n += 1
    else:
        fasta += line

if sequence_n%SEQ_PER_FILE:
    oh = open('sequence'+str(file_n+1)+'.fasta', 'w')
    oh.write(fasta)
    oh.close()

ADD COMMENT • link 7.6 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

I have tried it and it worked perfectly...Thanks!

ADD REPLY • link 7.6 years ago by AsoInfo ▴ 300

0

Entering edit mode

Just curious if this was an assignment? You had done the right thing by posting your own code along with your question.