Hello again everyone,
Once again I'm asking for help, this time I have managed to write a semi-decent script to split a FASTA file into multiple files dependent on a User defined number
Problem is, it is only counting headers, so for example counting up to 100 headers in the below example but not really taking notice of the sequence for the 100th headers sequence. That is pushed into the top of the next file. I've tried counting Name and saving it to a list to then zip with Listy to then be saved to the file but at this point I think I'm trying to edit some thing that's too far gone. Anyone have any ideas?
x = '/Users/me/Desktop/FUGU/Original/TakiFugu.fa'
d = '/Users/me/Desktop/FUGU/'
m = 'FUGU'
f = 100 #Takes about 3 minutes to run
def EnteriesPer(FileInput, FileOutPut, OrganismOI, EntryNo):
EntryNo += 1
Name = ""
Counter = 0
Listy = []
FileCounter = 0
FileCounter = 0
with open(FileInput, 'r') as Org:
for line in Org:
Listy.append(line.strip())
if line[0] == '>':
Counter += 1
if Name == "":
if Counter == EntryNo:
FileCounter += 1
with open(FileOutPut + OrganismOI + str(FileCounter) + '.fasta', 'w') as Oh:
Oh.write('\n'.join(Listy).strip())
Counter = 0
Listy = []
Length = 0
print('Thats '+ str(FileCounter))
This is currently part of my attempt at a much larger script which will parse files in a pipeline I will eventually set up (a rite of passage before starting work proper I am told), this will involve 5 functions and some arg.parse to call each function when needed so thats fun.
Thanks in advance everyone.
What you are experiencing is because you increment the
Counter
after you've seen the header. Then you check if counter is specified value and write the file - while you have NOT yet read the sequence.You'll need to either set up some kind of buffer for reading one sequence completely (and then add it to the others) or be able to seek to previous line and pop last item from the list, when you are writing the file (the
f.tell()
andf.seek()
might interest you).I'll leave the implementation up to you.