Alright, I think I understood now the question.
How are your python skills? Here I have quickly coded a script (that has a lot room for improvement), but that I believe it does what you want? Let me know.
You can run it in the following way:
python biostars9523808.py <multifasta> <position> <length>
The first argument is the multi-fasta file, the second the position you want to interrogate (0-based, therefore, position 0 corresponds to the first nucleotide of the sequences), and the third argument is the length of the sequences in the fasta file (this script assumes that all the sequences have the same length, if not, I think it could work if you specify the length of the longest sequence but I have not tested it).
The input looks like this:
>f1
CAGGTAGCC
>f2
CCGGTCAGA
>f3
AGGGTTTGA
>f4
TTGGTGAGG
And the output, for position 0, like this:
python biostars9523808.py test.fa 0 9
A 1 0.25
C 2 0.50
G 0 0.00
T 1 0.25
The second column is the count, and the third the frequency.
from Bio import SeqIO
import sys
input_file=sys.argv[1]
position=int(sys.argv[2])
sequence_length=int(sys.argv[3])
frequency_list = []
for i in range(sequence_length):
frequency_list.append({'A': 0, 'C': 0, 'G': 0, 'T': 0})
fasta_sequences = SeqIO.parse(open(input_file),'fasta')
total_sequences=0
for fasta in fasta_sequences:
total_sequences+=1
name, sequence = fasta.id, str(fasta.seq)
for idx, c in enumerate(sequence.strip()):
frequency_list[idx][c] += 1
for char in ['A', 'C', 'G', 'T']:
print("{}\t{}\t{}".format(char, frequency_list[position][char], "\t".join(["{:0.2f}".format(frequency_list[position][char]/total_sequences)])))
Something like bam-readcount tool could work, but you need a BAM as input with the alignment of the reads to your sequence of interest. The output is exactly what you need, frequency of A/T/G/C at the position of interest. There are many other tools as well, I am just suggesting the one I am familiar with :).
Thanks for your reply. This looks like a great solution, but only works for looking a single read set of interest. And I now realise I wasn't specific enough with my orginal post! Apologies, I have edited to give further clarification.
In my use case, I have multiple consensus sequences aligned in a multi-fasta file, so I am just interested in the consensus-level base frequency at particular sites across all the sequences in the aligment