I'm trying to organize a FASTA file that contains multiple sequences. To do this, I want to add the names to one list and the sequences to a separate list that is parallel to the name list. I figured out how to add the names to a list, but I can't figure out how to add the sequence that follows each name as its own entry. I tried appending the sequence lines to an empty string, but that put all the lines of all the sequences into a single string.
def Name_Organizer(FASTA,output):
    import os
    import re
    in_file=open(FASTA,'r')
    dir,file=os.path.split(FASTA)
    temp = os.path.join(dir,output)
    out_file=open(temp,'w')
    data=''
    name_list=[]
    for line in in_file:
        line=line.strip()
        for i in line:
            if i=='>':
                name_list.append(line)
                break
            else:
                line=line.upper()
        if all([k==k.upper() for k in line]):
            data=data+line
    print data
How do I add the sequences to a list as a set of strings, one string per sequence?
The input file looks like this, with a > before the name on each header line:
>44664.3|G1E3M3IX1IW|Greengenes|2471 16S ribosomal RNA [Microbacterium oxydans]
gactATAATTTGTAAATTTCTTGAGATAGAATCATTCGTATTGAATGAGGTCAAATTCTC
TAAACTGATTAAGAAGTATAATACTTAGATGCGAGTTATTGCATCACTTAACGGAGAGTT
TGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGTGAAG
TCTGAATTGAGTACTTCGGTATGATATTTGGGTGGAAAGTGGCGGACGGGTGAGTAACAC
GTGGGTAACCTGCCTCGAAGTGGGGACAACCATTGGAAACGATGGCTAATACCGCATAGT
TCTTTAGATGCATGAGCATTTATAGATAAAACTCTGGTGCTTCGAGAGGGGTCTGCGTCC
GATTAGTTAGTTGGTGGGTAAAGGCCTACCAAGACGATGATCGGTAGCTGGTCTGAGAGG
ACGATCAGTCACACGGGAACTGAGACACGGTCCagtcgtgggagacaaggcacacagggg
ataggnnnnn
>44684.3|G1E3M3B01IW|Greengenes|2688 16S ribosomal RNA [Microbacterium oxydans]
gactATAATTTGTAAATTTCTTGAGATAGAATCATTCGTATTGAATGAGGTCAAATTCTC
TAAACTGATTAAGAAGTATAATACTTAGATGCGAGTTATTGCATCACTTAACGGAGAGTT
TGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGTGAAG
TCTGAATTGAGTACTTCGGTATGATATTTGGGTGGAAAGTGGCGGACGGGTGAGTAACAC
GTGGGTAACCTGCCTCGAAGTGGGGACAACCATTGGAAACGATGGCTAATACCGCATAGT
TCTTTAGATGCATGAGCATTTATAGATAAAACTCTGGTGCTTCGAGAGGGGTCTGCGTCC
GATTAGTTAGTTGGTGGGTAAAGGCCTACCAAGACGATGATCGGTAGCTGGTCTGAGAGG
ACGATCAGTCACACGGGAACTGAGACACGGTCCagtcgtgggagacaaggcacacagggg
ataggnnnnn
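
For reference, here is a minimal sketch of the behavior I'm after, not a tested solution for every FASTA variant. The function name parse_fasta and the sample records are made up for illustration. It assumes that every record starts with a header line beginning with >, that the > should be dropped from the stored name, and that sequence lines should be upper-cased as in my code above. The idea is to collect the lines of the current record in a temporary list and join them into one string each time a new header (or the end of the input) is reached.

```python
def parse_fasta(lines):
    """Return (names, sequences) as two parallel lists.

    `lines` is any iterable of strings, such as an open file object.
    """
    names = []
    sequences = []
    chunks = []  # sequence lines of the record currently being read

    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if there is one.
            if names:
                sequences.append(''.join(chunks))
            names.append(line[1:])  # assumption: drop the leading '>'
            chunks = []
        elif names:
            # Sequence line; lines that appear before any header are ignored.
            chunks.append(line.upper())

    # The last record has no header after it, so it is stored here.
    if names:
        sequences.append(''.join(chunks))
    return names, sequences


# Hypothetical sample input, shortened for illustration.
sample = [
    ">seq1 first record",
    "gactATAA",
    "TTTG",
    ">seq2 second record",
    "acgt",
]
names, sequences = parse_fasta(sample)
```

With this sample, names and sequences both have two entries, and sequences[i] is the full sequence for names[i]. For a real file, the same function could be called as parse_fasta(in_file) inside a with open(FASTA) as in_file: block, since a file object yields its lines one at a time.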