I need to count how many A, T, G and so on is in each sequence, for example:
>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
MRMRGRRLLPIILXKKK
in this sequence theres:
M - 2
R - 2
G - 1
RR - 2
LL - 2
P - 1
II - 1
L - 1
X - 1
KKK - 1
for now on, I have a code that counts only single letters (output from my app: 1 sequences found {'G': 1, 'I': 2, 'M': 2, 'L': 3, 'P': 1, 'R': 4}):
def FASTA(filename):
try:
f = file(filename)
except IOError:
print "The file, %s, does not exist" % filename
return
order = []
sequences = {}
counts = {}
for line in f:
if line.startswith('>'):
name = line[1:].rstrip('\n')
name = name.replace('_', ' ')
order.append(name)
sequences[name] = ''
else:
sequences[name] += line.rstrip('\n').rstrip('*')
for aa in sequences[name]:
if aa in counts:
counts[aa] = counts[aa] + 1
else:
counts[aa] = 1
print "%d sequences found" % len(order)
print counts
return order, sequences
x, y = FASTA("file.fasta")
How to change it to be able to get the output above? I don't want to use any other libs, (I mean - I dont want to do BioPython for this, because I want to do it without BioPython;)
Just to clarify the poster's question. He actually wants counts of single amino acids AND consecutive amino acids. Not just single amino acids. It's actually a bit more tricky than just using the native python/biopython count functions.
It seems you are not using the real power of biopython. read fasta file first, create Seq object
my_seq
and assign sequence from fasta file and usemy_seq.count("G")
. I my opinion it is good to use Biopython.Moved to "question" since this is clearly not a "tip".