A code for splitting FASTA file, No output is written for the file
1
0
Entering edit mode
7.8 years ago
AsoInfo ▴ 300

I have written code for splitting FASTA into desired number of sequences and desired number of files. But it doesn't write output for the second file or the last file. I couldn't figure it out.

The code is as follows:

file_open=open("sequence.txt","r") # Input file
count_file=2 # Number of output files
count_sequence=5 # Sequences in each output file

sequence=0 # Count total sequences until count_sequences
for i in range(count_file,0,-1): # Loop for file names generation
    sequence_file=open("sequence"+str(i)+".fasta","w") # Output file
    for line in file_open: # Reading the sequences one by one
        if line.startswith(">"):
            fasta=""
            for data_line in file_open:
                if data_line.startswith("\n"):
                    break
                else:
                    fasta=fasta+data_line

            print(sequence_file)
            print(line+fasta)

            sequence_file.write(line+fasta+"\n") # Writing the FASTA sequences to file
            sequence=sequence+1 # Increment the count of the sequence 

        if sequence==count_sequence: # If number of sequences are equal to the desired number of sequences
            sequence=0 # reset the counter
            break

In this case, sequence1.fasta is empty but sequence2.fasta have the first five sequences.

I know that there are many tools mentioned in the forum that do the desired thing but I want the specific format and output extension, that's why I wrote this but couldn't run it.

Thanks!

Splitting FASTA Python • 2.4k views
ADD COMMENT
0
Entering edit mode

the loop logic for lines is not right. only one level is needed

ADD REPLY
0
Entering edit mode

I modified the code and it worked....

ADD REPLY
0
Entering edit mode

first I think you need to make an index from the fasta.

file_open = open (sys.argv[1],"r")  # Input file
seqs = {}
seq = ""
for line in file_open:

    line = line.rstrip('\n')

    if line.startswith('>'):

        if seq:

            seqs[name] = seq 
            seq = ""
        name = line

    else:

        seq = seq + line
#the last sequence
seqs[name] = seq 

file_open.close()

Then, you can print its elements in loop for i in range(int(count_file)):

ADD REPLY
1
Entering edit mode

Just preference but this would be nicer, the with block is so handy (untested):

with open(sys.argv[1],"r") as file_open:  # Input file
    seqs = {}
    seq = ""
    for line in file_open:
        line = line.rstrip('\n')
        if line.startswith('>'):
            if seq:
            seqs[name] = seq 
            seq = ""
        name = line
    else:
        seq = seq + line
#the last sequence
seqs[name] = seq
ADD REPLY
0
Entering edit mode

It worked fine in my computer with python 3.4.2 (windows 8.1). However, your code

data_line.startswith("\n"):

means that sequences in the file must end with empty line. I think it does not match the specification of the FASTA format. I recommend you to check your input file "sequence.txt". (Does it really have more than 5 sequences?)

But I recommend much stronger to change the code compatible for any FASTA formatted files.

In addition,

for line in file_open: # Reading the sequences one by one

for data_line in file_open:

I don't like to use same file object in inner loop (But I'm not python person. It may be a common way in python). (It may be the same thing which shenwei356 mentioned.)

ADD REPLY
0
Entering edit mode

Could you explain what your desired format / output extension is? It might be easier to use seqkit to split the sequence and then write a much simpler shell script to do the post-processing?

ADD REPLY
0
Entering edit mode

I modified the code and it worked perfectly, thanks for your help.

ADD REPLY
3
Entering edit mode
7.8 years ago

I followed the logic of your script and tweaked it a little. Now, the script should do what you wanted. However, I just want to let you know that there are other (more efficient and less complicated) - ways to implement this task in Python.

file_open = open("sequence.txt","r")
MAX_FILES = 2    # Number of output files
SEQ_PER_FILE = 5 # Sequences in each output file

sequence_n = 0 # Counter for sequences
file_n = 0  # Counter for output files

for line in file_open:
    if not line.strip(): continue
    if line.startswith('>'):
        if sequence_n:
            if sequence_n%SEQ_PER_FILE == 0:
                file_n += 1
                oh = open('sequence'+str(file_n)+'.fasta', 'w')
                oh.write(fasta)
                oh.close()
                if file_n == MAX_FILES: 
                    break
                fasta = line
            else:
                fasta += line
        else:
            fasta = line
        sequence_n += 1
    else:
        fasta += line

if sequence_n%SEQ_PER_FILE:
    oh = open('sequence'+str(file_n+1)+'.fasta', 'w')
    oh.write(fasta)
    oh.close()
ADD COMMENT
0
Entering edit mode

I have tried it and it worked perfectly...Thanks!

ADD REPLY
0
Entering edit mode

Just curious if this was an assignment? You had done the right thing by posting your own code along with your question.

ADD REPLY
0
Entering edit mode

No this was not an assignment. Just a small project stuff.

ADD REPLY

Login before adding your answer.

Traffic: 3461 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6