I have written code for splitting a FASTA file into a desired number of files, each containing a desired number of sequences. But it doesn't write any output to the second file, i.e. the last file, and I couldn't figure out why.
The code is as follows:
file_open=open("sequence.txt","r") # Input file
count_file=2 # Number of output files
count_sequence=5 # Sequences in each output file
sequence=0 # Count total sequences until count_sequences
for i in range(count_file,0,-1): # Loop for file names generation
    sequence_file=open("sequence"+str(i)+".fasta","w") # Output file
    for line in file_open: # Reading the sequences one by one
        if line.startswith(">"):
            fasta=""
            for data_line in file_open:
                if data_line.startswith("\n"):
                    break
                else:
                    fasta=fasta+data_line
            print(sequence_file)
            print(line+fasta)
            sequence_file.write(line+fasta+"\n") # Writing the FASTA sequences to file
            sequence=sequence+1 # Increment the count of the sequence
            if sequence==count_sequence: # If number of sequences are equal to the desired number of sequences
                sequence=0 # reset the counter
                break
In this case, sequence1.fasta is empty but sequence2.fasta has the first five sequences.
I know that many tools mentioned on the forum do this, but I want a specific format and output extension. That's why I wrote this, but I couldn't get it to work.
Thanks!
The loop logic for lines is not right. Only one level is needed.
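One way to read the "only one level" suggestion, as a minimal sketch and not the poster's actual code: go over the lines once, and each time a header line (starting with ">") shows up, decide whether to keep writing to the current output file or to open the next one. The function name split_fasta and the prefix parameter are made up for illustration, and dropping blank lines is an assumption about the desired output format.

```python
def split_fasta(in_path, count_file, count_sequence, prefix="sequence"):
    """Copy the first count_file * count_sequence records of in_path into
    prefix1.fasta, prefix2.fasta, ... in a single pass over the lines."""
    written = []    # paths of the output files created so far
    out = None      # currently open output file, if any
    in_current = 0  # records already started in the current output file
    with open(in_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A header starts a new record: open a new file if needed.
                if out is None or in_current == count_sequence:
                    if out is not None:
                        out.close()
                        out = None
                    if len(written) == count_file:
                        break  # the requested number of files is complete
                    path = "{}{}.fasta".format(prefix, len(written) + 1)
                    out = open(path, "w")
                    written.append(path)
                    in_current = 0
                in_current += 1
            # Assumption: blank separator lines are not wanted in the output.
            if out is not None and line.strip():
                out.write(line)
    if out is not None:
        out.close()  # closing flushes the last file to disk
    return written
```

Calling split_fasta("sequence.txt", 2, 5) should then give sequence1.fasta and sequence2.fasta with five records each, provided the input holds at least ten records. Every output file is closed explicitly, so nothing is left sitting in a write buffer.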
I modified the code and it worked....
First, I think you need to build an index (a list of records) from the FASTA file.
Then you can write out its elements in a loop such as for i in range(int(count_file)):
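The index idea could look roughly like the sketch below. It assumes the whole file fits in memory, and the names read_fasta, write_chunks and prefix are hypothetical, not taken from the thread.

```python
def read_fasta(path):
    """Return a list of (header, sequence_lines) tuples, one per record."""
    records = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                records.append((line, []))
            elif line.strip() and records:
                # Non-blank line after a header: part of the current record.
                records[-1][1].append(line)
    return records


def write_chunks(records, count_file, count_sequence, prefix="sequence"):
    """Write consecutive slices of count_sequence records to numbered files."""
    paths = []
    for i in range(int(count_file)):
        chunk = records[i * count_sequence:(i + 1) * count_sequence]
        if not chunk:
            break  # ran out of records before reaching count_file files
        path = "{}{}.fasta".format(prefix, i + 1)
        with open(path, "w") as out:
            for header, seq_lines in chunk:
                out.write(header + "\n")
                for seq in seq_lines:
                    out.write(seq + "\n")
        paths.append(path)
    return paths
```

With the list in hand, write_chunks(read_fasta("sequence.txt"), 2, 5) would write the first ten records into two files. Slicing makes the file boundaries explicit, at the cost of holding all records in memory.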
Just a preference, but this would be nicer; the with block is so handy (untested):
It worked fine on my computer with Python 3.4.2 (Windows 8.1). However, your code means that each sequence in the file must end with an empty line. I think that does not match the specification of the FASTA format. I recommend checking your input file "sequence.txt". (Does it really have more than 5 sequences?)
But I recommend even more strongly changing the code so that it is compatible with any FASTA-formatted file.
In addition,
I don't like using the same file object in an inner loop (but I'm not a Python person; it may be a common idiom in Python). (It may be the same thing that shenwei356 mentioned.)
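To illustrate the point about blank lines, here is a small sketch of a parser that ends a record at the next ">" line or at end of file, so it does not rely on empty separator lines and never iterates over the file object in a nested loop. The name iter_fasta is hypothetical, and joining the sequence onto one line is an assumption that discards the original line wrapping.

```python
def iter_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file.

    A record ends at the next '>' line or at end of file, so blank
    separator lines are optional and simply skipped.
    """
    header, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # blank lines carry no information here
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line, []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)  # last record has no following header


# Typical use, with the input file closed automatically by the with block:
# with open("sequence.txt") as handle:
#     for header, sequence in iter_fasta(handle):
#         ...
```

Because the generator only keeps one record at a time, it can feed whatever writing loop is preferred without loading the whole file.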
Could you explain what your desired format / output extension is? It might be easier to use seqkit to split the sequences and then write a much simpler shell script to do the post-processing.
I modified the code and it worked perfectly, thanks for your help.