Hello everyone,
I'm completely new to Python, but I have to do something that is pretty basic.
I have a FASTA file from which I need to calculate the percentage of each amino acid (across all sequences in the file).
How can I select only the sequences (not the names or descriptions) and turn them into a string or a list?
And how can I, for instance, select only one amino acid (let's say the 10th) from each sequence?
Thanks everyone
Adding to this reply (which is the only one actually answering the OP's question about Python code): you can calculate the percentages directly, without first converting the sequences to a list and then operating on it.
Create a dictionary that holds all twenty amino acids (and gaps, if you need them), each mapped to 0.
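A minimal sketch of such a dictionary in plain Python. The names AMINO_ACIDS and counts are only illustrative, and "-" is assumed to be the gap character in your file:

```python
# The twenty standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# One counter per residue, plus "-" for gaps (an assumption; drop it if
# your sequences are not aligned).
counts = {residue: 0 for residue in AMINO_ACIDS + "-"}
```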
For each sequence, iterate over it (it's a string) and simply increment the count of each residue you encounter.
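As a rough sketch of that loop, here is a version that uses only the standard library. The helper names read_fasta_sequences and count_residues are made up for this example. With Biopython you would normally get the sequences from SeqIO.parse(handle, "fasta") instead of the hand-written reader. The example input is invented, and it also shows how to pick the 10th residue of each sequence (index 9, since Python counts from 0).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def read_fasta_sequences(handle):
    """Yield each sequence as one string, skipping the '>' header lines.

    `handle` can be an open file or any iterable of lines.
    """
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: emit the previous sequence, if any.
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def count_residues(sequences):
    """Count every residue over all sequences."""
    counts = dict.fromkeys(AMINO_ACIDS + "-", 0)
    for seq in sequences:
        for residue in seq.upper():
            # .get() tolerates letters outside the twenty (e.g. X or B).
            counts[residue] = counts.get(residue, 0) + 1
    return counts


# Invented example data; with a real file use: open("my_file.fasta")
example = [">seq1 some description", "MKTAYIAKQR", "QISFVK", ">seq2", "MKV-"]
sequences = list(read_fasta_sequences(example))
counts = count_residues(sequences)

# 10th residue of every sequence that is long enough.
tenth = [seq[9] for seq in sequences if len(seq) >= 10]
```

Keeping the sequences in a list is convenient for small files; for very large files you could pass the generator straight to count_residues so that only one sequence is held in memory at a time.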
At the end, you should have the total count of each residue. The sum of all the dictionary values gives you the total number of amino acids in the entire file. Divide each count by that total and multiply by 100 to get the percentages. If you want to do this per sequence, just create the dictionary inside the for loop so that it is rebuilt at each iteration (i.e. for each new sequence, all counters reset).
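The percentage step could look something like this. The function name residue_percentages and the example data are hypothetical, and collections.Counter is used here as a shortcut for the per-sequence case in place of a hand-built dictionary:

```python
from collections import Counter


def residue_percentages(counts):
    """Convert raw residue counts into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero for an empty file or sequence.
        return {residue: 0.0 for residue in counts}
    return {residue: 100.0 * n / total for residue, n in counts.items()}


# Whole-file percentages from an invented set of counts.
example_counts = {"A": 2, "C": 1, "G": 1}
percentages = residue_percentages(example_counts)

# Per-sequence percentages: a fresh counter for every sequence.
example_sequences = ["AACG", "GGGG"]
per_sequence = [residue_percentages(Counter(seq)) for seq in example_sequences]
```

Residues that never occur in a sequence are simply absent from the Counter-based result; start from a dictionary of zeros instead if you want them reported as 0.0.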
Hope it helps a bit. Also, if you are just starting out, have a look at this tutorial.
Or if you feel lazy, use the letters already defined in Biopython:
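# Bio.Alphabet was removed in Biopython 1.78; the letters now live in Bio.Data.IUPACData
from Bio.Data.IUPACData import extended_protein_letters
# One zeroed counter per letter (the twenty standard residues plus B, X, Z, J, U, O)
d = dict.fromkeys(extended_protein_letters, 0)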