How To Count Amino Acids In Fasta Formated File?
4
3
Entering edit mode
11.9 years ago
8mazix ▴ 30

I need to count how many A, T, G and so on is in each sequence, for example:

>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
MRMRGRRLLPIILXKKK

in this sequence theres:

M - 2
R - 2
G - 1
RR - 2
LL - 2
P - 1
II - 1
L - 1
X - 1
KKK - 1

for now on, I have a code that counts only single letters (output from my app: 1 sequences found {'G': 1, 'I': 2, 'M': 2, 'L': 3, 'P': 1, 'R': 4}):

def FASTA(filename):
  try:
    f = file(filename)
  except IOError:                     
    print "The file, %s, does not exist" % filename
    return

  order = []
  sequences = {}
  counts = {}

  for line in f:
    if line.startswith('>'):
      name = line[1:].rstrip('\n')
      name = name.replace('_', ' ')
      order.append(name)
      sequences[name] = ''
    else:
      sequences[name] += line.rstrip('\n').rstrip('*')
      for aa in sequences[name]:
        if aa in counts:
            counts[aa] = counts[aa] + 1
        else:
            counts[aa] = 1  

  print "%d sequences found" % len(order)
  print counts
  return order, sequences

x, y = FASTA("file.fasta")

How to change it to be able to get the output above? I don't want to use any other libs, (I mean - I dont want to do BioPython for this, because I want to do it without BioPython;)

python biopython • 13k views
ADD COMMENT
2
Entering edit mode

Just to clarify the poster's question. He actually wants counts of single amino acids AND consecutive amino acids. Not just single amino acids. It's actually a bit more tricky than just using the native python/biopython count functions.

ADD REPLY
1
Entering edit mode

It seems you are not using the real power of biopython. read fasta file first, create Seq object my_seq and assign sequence from fasta file and use my_seq.count("G"). I my opinion it is good to use Biopython.

ADD REPLY
0
Entering edit mode

Moved to "question" since this is clearly not a "tip".

ADD REPLY
4
Entering edit mode
11.9 years ago
Ketil 4.1k

Another solution is to use the standard Unix tools:

grep -v '^>' test.fasta | tr -d '\n' | sed -e 's/\(.\)/\1\n/g' | sort | uniq -c | sort -rn

:-)

ADD COMMENT
0
Entering edit mode

Awesome :) thnks

ADD REPLY
0
Entering edit mode

That's genius. Even 5 years later!

ADD REPLY
0
Entering edit mode

I know this was post a while ago, but would you mind explaining this in terms of the commands in steps? I'm very new to unix so I'm not familiar with the language/ terms

ADD REPLY
4
Entering edit mode
11.9 years ago

Here's one solution using standard Python library: itertools. The code also works with multiple Fasta format.

from itertools import groupby

fh = open('file.fasta')

groups = (x[1] for x in groupby(fh, lambda l: l[0] == ">"))
for g in groups:
    header = next(g).strip()
    seq = ''.join(s.strip() for s in next(groups))
    counts = {}
    for s in [''.join(g) for k, g in groupby(seq)]:
        if s not in counts: counts[s] = 0
        counts[s]+=1
    print header
    print counts

It gives you:

>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
{'G': 1, 'RR': 1, 'LL': 1, 'M': 2, 'L': 1, 'II': 1, 'P': 1, 'R': 2, 'KKK': 1, 'X': 1}
ADD COMMENT
0
Entering edit mode

Nice. I haven't really looked into itertools since a lot of the useful functions are 2.6+. The groupby function looks extremely useful. I think I will definitely look into it now.

ADD REPLY
0
Entering edit mode
11.9 years ago

Try this. The getCount function will get you the single and consecutive amino acid counts

inFile = open('filePath','r')

def getCounts(seq):
    current = seq[0]
    counts = {}
    for i in range(1,len(seq), 1):
        if seq[i] != current[-1]:
            if not counts.has_key(current):
                counts[current] = 0
            counts[current] += 1
            current = seq[i]
        else:
            current += seq[i]
    return counts

header = ''
seq = ''
for line in inFile:
    if line[0] == ">":
        if header != '':
            print header + "\t" + str(getCounts(seq))
        header = line.strip()[1:]
        seq = ''
    else:
        seq += line.strip()
print header + "\t" + str(getCounts(seq))
ADD COMMENT
0
Entering edit mode

Hey Damian Kao this code count 'all' separated aac once(ie. you get all 'L' in the seq?) and then counts the doubles or triples consecutive LL for ex.??

Thanks

ADD REPLY

Login before adding your answer.

Traffic: 2111 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6