Hi All,
I have a large FASTA file that I am trying to split into multiple smaller files based on the sequence IDs. The file starts like this:
>1MUSgi|116063569|ref|NM_010065.2|
AGGGG-TGGTTGACCATCAACAACATCGGCATCATG-AAGGGAGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
>2MUSgi|118130562|ref|NM_019880.3|
CGGCCCGCGGCTCAGCCGTCGGCGCGCAGGATGGACGGCG-A
>2MUSgi|118130562|ref|NM_019880.3|
AGTTTAGCCAGGCCCTGGCCATCCGGAGCTACACCAAGTTTGTGATGGGGATTGCAGTGAGCATGCTGACCTACCCCTTCCTGCTCGTTGGAGATCTCATGGCAGTGAACAACCCTGGAGTAACCT
>1HOMOgi|59853098|ref|NM_004408.2|
GCATCCGCAAGGGCTGGCTGACTATCAATAATATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
>1
GGTGATCCGCAGGGGCTGGCTGACCATCAACAACATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTCGTGCTCACTGCCGAGTCACTGTCCTGGTACAAGGACGAAGAGGAGAAAGAGAG
>2
CGCGCCAGCACCGGCCCGCGGCGCAGCCCTCGGCCCGCAGGATGGACGGCGCGTCCGGGGGCCTGGGCTCTGGGGATAGTGCC
I want all of the IDs beginning with 1 to go into one file, the IDs beginning with 2 into another, and so on.
I have been trying to use SeqIO:
for record in SeqIO.parse(open("QHM-clean.fasta", "rU"), "fasta") :
for i in range(1,3):
if record.id %i: #this needs to be changed "if record.id STARTS WITH %i"
print record.id
output_handle = open("%i.fasta", "w") #naming in this manner does not seem to be allowed
SeqIO.write(output_handle, "fasta")
output_handle.close()
But this is not working, in many obvious ways. Can anybody help me out with some advice on how to proceed?
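On strings, Python's percent operator does string formatting (on numbers it is modulo), so record.id % i does not test how the ID starts. Try record.id.startswith("1") for a Pythonic approach. The same goes for the file name: "%i.fasta" only becomes "1.fasta" once you apply a value, as in "%i.fasta" % i. Also, Biopython's SeqIO.write(...) function takes three required arguments (the records, the output handle, and the format), not just two. And there was also something wrong with your indentation. I'll try to give a more detailed answer later.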
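In the meantime, here is a rough sketch of how the pieces could fit together. It assumes Python 3 and a reasonably recent Biopython in which SeqIO.parse and SeqIO.write accept file names as well as handles. The function names (id_prefix, group_by_prefix, split_fasta) are made up for this example and are not part of Biopython. It groups on the leading run of digits rather than on startswith("1"), so that an ID starting with 10 would not end up in the file for 1.

```python
import re
from collections import defaultdict


def id_prefix(record_id):
    """Return the leading run of digits in an ID, e.g. '1' for '1MUSgi|...'."""
    match = re.match(r"\d+", record_id)
    return match.group(0) if match else None


def group_by_prefix(records):
    """Group records (anything with an .id attribute) by numeric prefix."""
    groups = defaultdict(list)
    for record in records:
        prefix = id_prefix(record.id)
        if prefix is not None:  # IDs without a leading number are skipped
            groups[prefix].append(record)
    return groups


def split_fasta(in_path):
    """Write one '<prefix>.fasta' file per numeric prefix found in in_path."""
    # Imported here so the two helpers above work without Biopython installed.
    from Bio import SeqIO

    groups = group_by_prefix(SeqIO.parse(in_path, "fasta"))
    for prefix, records in groups.items():
        # SeqIO.write takes three arguments: records, handle or file name, format.
        SeqIO.write(records, f"{prefix}.fasta", "fasta")
    return groups
```

Calling split_fasta("QHM-clean.fasta") on the file from the question should then produce 1.fasta and 2.fasta in the current directory. Note that this version holds every record in memory before writing. For a very large file you may prefer to keep one open output handle per prefix and write each record as it is read.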
I see now you've also asked on the Biopython mailing list, and got an answer there.
BioStar needs a clearer policy on "double requests" like this...