from Bio import SeqIO
seq_list = []
for record in SeqIO.parse("sample.fasta", "fasta")
# The sequence is in the `seq` attribute of the record
print record.seq
# Print only the 10th character
print record.seq[9]
# Add the sequences to a list
seq_list.append(record.seq)
# Turn the list into a string
single_seq = "".join(seq_list)
ADD COMMENT
• link
updated 4.9 years ago by
Ram
44k
•
written 10.5 years ago by
xbello
▴
20
0
Entering edit mode
Adding to this reply (which is the only one actually answering the OP's question about Python code), you can calculate the percentages directly without converting the sequences to a list and then operate on these.
Create a dictionary that holds all twenty amino acids (and gaps if you need them) associated to 0s:
aa_count = { 'A': 0, 'C': 0, ... etc etc...}
Iterate over the sequences using SeqIO (by the way, here is a link to the tutorial page, which is full of examples), just like @xbello mentioned.
For each sequence, iterate over it (it's a string) and simply increment the counts:
sequence = record.seq
for aa in sequence:
aa_count[aa] += 1
At the end, you should have the total amount of each residue. The sum of the entire dictionary values gives you the total of amino acids in the entire file. Then calculate your averages. If you want to do this per sequence, just place the dictionary inside the for loop so that it is created at each iteration (i.e. each new sequence, all counters reset).
Hope it helps a bit. Also, if you are starting, have a look at this tutorial. Might help a bit.
This can be done without python too. Awk is particularly fast for these stuffs.
With python you can easily implement like this but again, there are faster ways to do it especially if you use the re module or use stream read (io module)
I can give you an awk one liner (I am sorry for my obsession with awk):
awk 'BEGIN{FS=""} {#to concatenate the sequences\
if($0~/</){h=$0} else{seq[h]=seq[h]$10}} END{for(i in a){x=substr(a[i],10,1);y[x]++;z++} for(j in y){print j,y[j]/z}} yourfile.fa
or in the linux terminal (if you don't want to retain the sequence headers)
sed '/>/s/^.*/>/' yourfile.fa | tr -d "\n" | tr ">" "\n"| awk -F "" 'NR>1{a[$10]++;b++} END{for(i in a){print i,a[i]/b}}'
This won't work because all the lines will be concatenated to a single line; all you will get is a single amino acid. In the second case you will pick up 10th residue from every line!!
What OP asked is to find out the percentage likelihood of an amino acid at a certain position. If you concatenate all sequences then there is only one sequence left; you will end up reporting the 10th residue of the first sequence in the fasta file!!
In the second case you'll end up picking up 10th residue from every line, not every sequence (a sequence may be folded in a fixed window which generates multiple lines per sequence).
Adding to this reply (which is the only one actually answering the OP's question about Python code), you can calculate the percentages directly without converting the sequences to a list and then operate on these.
Create a dictionary that holds all twenty amino acids (and gaps if you need them) associated to 0s:
For each sequence, iterate over it (it's a string) and simply increment the counts:
At the end, you should have the total amount of each residue. The sum of the entire dictionary values gives you the total of amino acids in the entire file. Then calculate your averages. If you want to do this per sequence, just place the dictionary inside the for loop so that it is created at each iteration (i.e. each new sequence, all counters reset).
Hope it helps a bit. Also, if you are starting, have a look at this tutorial. Might help a bit.
Or if you feel lazy, use the letters already defined in Biopython:
from Bio.Alphabet.IUPAC import extended_protein
d = dict.fromkeys(list(extended_protein.letters), 0)