If you want to retrieve certain sequences by ID for further manipulation, you can use the index() function of the Bio.SeqIO module provided in Biopython.
This function lets you index moderately large sequence files without consuming much memory: it stores where each record sits within the file and parses a record on demand when you ask for its identifier.
A short Biopython script will do the job:
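from Bio import SeqIO

def get_ID(identifier):
    # Assumes six comma-separated labelled fields with the ID after "marker id:"; adjust to your headers.
    labels = ["chr", "genetic location:", "bin num:", "phase id:", "scaf", "marker id:"]
    parts = [part.strip() for part in identifier.split(",")]
    assert len(parts) == 6 and all(p.startswith(l) for p, l in zip(parts, labels))
    return parts[5][len(labels[5]):].strip()
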
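file_dictionary = SeqIO.index("your_file.fasta", "fasta", key_function=get_ID)
print(list(file_dictionary.keys())[:5])  # check that the keys look like the IDs you expect

wanted_ids = ["id1", "id2", "id3"]  # ... replace with your own list of IDs
# Write each selected record, header included, to a new FASTA file
with open("selected.fasta", "w") as handle:
    for ID in wanted_ids:
        SeqIO.write(file_dictionary[ID], handle, "fasta")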
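
For intuition about "stores where each record is within the file", here is a rough standard-library sketch of the same idea. It is not Biopython's actual implementation, and the names build_offset_index and fetch_record are made up for illustration. It assumes the record ID is the first whitespace-delimited word after ">".

```python
def build_offset_index(path):
    """Map each record ID (first word after '>') to the byte offset of its header line."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                words = line[1:].split()
                record_id = words[0].decode() if words else ""
                offsets[record_id] = position
    return offsets


def fetch_record(path, offsets, record_id):
    """Seek straight to one record and return its raw text, reading nothing else."""
    with open(path, "rb") as handle:
        handle.seek(offsets[record_id])
        lines = [handle.readline()]  # the header line
        for line in iter(handle.readline, b""):
            if line.startswith(b">"):  # next record starts here
                break
            lines.append(line)
    return b"".join(lines).decode()
```

Only the ID-to-offset dictionary stays in memory, which is why this kind of index stays small even when the sequences themselves are long.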
See section 5.4.2 of the Biopython tutorial for more details: http://biopython.org/DIST/docs/tutorial/Tutorial.html
Also, you can find other indexing methods for extremely large sequence files in section 5.4.
Hope you find this useful :)
EDITED to suit your file's header format.
Can you post an example of your header? Is it
ID: 79695
or
ID:79695
or just
79695
?

This is probably simplest in either Biopython or BioPerl. They'll handle multi-line entries transparently and let you use a regex to pull out the ID (just store the target IDs in a hash/dict and query that).
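
As a rough sketch of that regex-plus-lookup idea, here is a version in plain Python rather than Biopython or BioPerl, so it runs without extra dependencies. The helper names iter_fasta and select_records are hypothetical. The pattern assumes the ID is the run of digits at the end of the header, which would cover all three header forms asked about above, so change it once the real header format is known.

```python
import re


def iter_fasta(lines):
    """Yield (header, sequence) pairs; multi-line sequences are joined into one string."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


# Assumed pattern: the ID is the last run of digits in the header line.
ID_PATTERN = re.compile(r"(\d+)\s*$")


def select_records(lines, wanted_ids, pattern=ID_PATTERN):
    """Yield (header, sequence) for records whose extracted ID is in wanted_ids."""
    wanted = set(wanted_ids)  # set membership keeps each lookup cheap
    for header, sequence in iter_fasta(lines):
        match = pattern.search(header)
        if match and match.group(1) in wanted:
            yield header, sequence
```

To use it on a file, pass an open file handle as lines, for example select_records(open("your_file.fasta"), {"79695"}), and write out each header and sequence that comes back.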