Hi, I have a large fasta file about ~200,000 KB. The files consist of around 1800 genes, more or less the structure of the file is :
>gene_1 ['chromosomal replication initiator protein'] H15100067005-3045-2-1_gnl|BL_ORD_ID|87 NODE_88_length_41459_cov_22.304518
seq
>gene_1 ['chromosomal replication initiator protein'] H16264065902-6889-1-1_gnl|BL_ORD_ID|140 NODE_151_length_52653_cov_21.638121
seq
and so on. There are 93 of each gene, gene_1, gene_3, gene_5 and so on. I have been trying the above script that I found in this link: []http://biopython.org/wiki/Split_large_file[1] This is the script the way I use it:
def batch_iterator(iterator, batch_size):
#to sort out the genes, and produce a file for each gene + the matching genes.
#from biopython code
entry = True # Make sure we loop once
while entry:
batch = []
while len(batch) < batch_size:
try:
entry = iterator.next()
except StopIteration:
entry = None
if entry is None:
# End of file
break
batch.append(entry)
if batch:
yield batch
record_iter=SeqIO.parse(open('/path/sorted_sequences.fa'), 'fasta')
for i, batch in enumerate(batch_iterator(record_iter, 93)):
filename='group_%i.fasta' %(i+1)
with open(filename, 'w') as handle:
count=SeqIO.write(batch,handle, 'fasta')
print"Wrote %i records to %s" % (count, filename)
However it does not split the files properly, it produces file_1.fa, file_2.fa which are exactly the same, file_3 contained a gene from file_1 or file_2 and the rest of genes. I don't understand where the error lies. Can someone kindly help me with this issue please? I just want to split the file so I have all gene_1 in a file, gene_2 in another and so on. Or could you direct me to a better way to do this, please? Thanks a lot
Am I understanding correctly that you want all of the "gene_1"s in a file, and all of the "gene_2"s in a separate file and so on?
If so that script doesn't split the sequences by name, it just subsets the input file in to batches. For example, if you had 10,000 sequences and a batch size of 1000, that script would make 10 files, each with 1000 sequences, but in the same order they occured in the original file.
Hi, Yes, you understood right. I know it doesn't split by name. By the file I tried to split is order, and the first 93 sequences are gene-1, and the following 93 are gene_2, and so on. So I thought that by adjusting the batch size to 93, it would split the file, and produce the files I need it. But my thoughts are wrong, obviously. Besides, when I run the scripts it produces duplicates of the files, meaning it will produce two files with 93 sequences that belong to gene_1. This maybe a case of indentation but this script is copy exactly the way that it shown in the website. Could you suggest a way to do what I am intending to do? Thanks,
Can you paste an extended version of your file? Maybe the first 5 seqs for each
gene1
,gene2
...gene5
or however long you want to do. I might have a script that does this already, but want to check.Hi, Thanks. I found where the problem lied. When, I was preparing a little file to post it for you, I saw something odd. I noticed that I had 96 sequences per gene instead of the 93. So I deleted the bad file, and run the full script (this one has several def():), and I got the right file, and it was split the way I wanted with 93 sequence on each. Now, I have 944 fasta file, and it is when the fun start to manually curate them. Thanks a lot for you nice comments, and help.
I think the indentation is wrong from the break like to the end of the function, couldn't understand the error though. You can iterate the fasta file, append to an array and write to a file whenever the name changes or keep all records in a dict according to sequence.name and write everything at the end
It doesn't give an error as it does the iteration. But instead of producing files with 93 sequences on each, it duplicates the files producing two files with 93 sequences per gene id i.e. gene_1, and then it does produce other files with mixed gene id i.e., gene_1 + gene_1001. The identation is the one that is on the original script enter link description here. But I am going to try some changes there. I am quite lost on how to deal with the file separation.