Question

Summarise bases from alignment

1

Entering edit mode

2.5 years ago

mpm20 ▴ 10

Say I have an alignment of multiple consensus sequences in multi-fasta format, and a specific site (or sites) I am interested in. I'd like to know the base frequency across all sequences in the alingment at these sites of interest.

Is there a command line approach to help summarise those specific sites? For example, to tell me the number of A/T/G/C bases at position 95?

alignment • 892 views

ADD COMMENT • link updated 2.5 years ago by Pierre Lindenbaum 164k • written 2.5 years ago by mpm20 ▴ 10

0

Entering edit mode

Something like bam-readcount tool could work, but you need a BAM as input with the alignment of the reads to your sequence of interest. The output is exactly what you need, frequency of A/T/G/C at the position of interest. There are many other tools as well, I am just suggesting the one I am familiar with :).

ADD REPLY • link 2.5 years ago by iraun 6.2k

0

Entering edit mode

Thanks for your reply. This looks like a great solution, but only works for looking a single read set of interest. And I now realise I wasn't specific enough with my orginal post! Apologies, I have edited to give further clarification.

In my use case, I have multiple consensus sequences aligned in a multi-fasta file, so I am just interested in the consensus-level base frequency at particular sites across all the sequences in the aligment

ADD REPLY • link 2.5 years ago by mpm20 ▴ 10

score 0 · Answer 1 · 2022-05-20

Alright, I think I understood now the question. How are your python skills? Here I have quickly coded a script (that has a lot room for improvement), but that I believe it does what you want? Let me know.

You can run it in the following way:

python biostars9523808.py <multifasta> <position> <length>

The first argument is the multi-fasta file, the second the position you want to interrogate (0-based, therefore, position 0 corresponds to the first nucleotide of the sequences), and the third argument is the length of the sequences in the fasta file (this script assumes that all the sequences have the same length, if not, I think it could work if you specify the length of the longest sequence but I have not tested it).

The input looks like this:

>f1
CAGGTAGCC
>f2
CCGGTCAGA
>f3
AGGGTTTGA
>f4
TTGGTGAGG

And the output, for position 0, like this:

python biostars9523808.py test.fa 0 9
A   1   0.25
C   2   0.50
G   0   0.00
T   1   0.25

The second column is the count, and the third the frequency.

from Bio import SeqIO
import sys

input_file=sys.argv[1]
position=int(sys.argv[2])
sequence_length=int(sys.argv[3])

frequency_list = []
for i in range(sequence_length):
    frequency_list.append({'A': 0, 'C': 0, 'G': 0, 'T': 0})

fasta_sequences = SeqIO.parse(open(input_file),'fasta')
total_sequences=0
for fasta in fasta_sequences:
   total_sequences+=1
   name, sequence = fasta.id, str(fasta.seq)
   for idx, c in enumerate(sequence.strip()):
      frequency_list[idx][c] += 1

for char in ['A', 'C', 'G', 'T']:
    print("{}\t{}\t{}".format(char, frequency_list[position][char], "\t".join(["{:0.2f}".format(frequency_list[position][char]/total_sequences)])))

score 0 · Answer 2 · 2022-05-20

0

Entering edit mode

2.5 years ago

Pierre Lindenbaum 164k

convert to VCF ( https://github.com/sanger-pathogens/snp-sites ) , if needed update the AC,AN,AF fields using bcftools +fill-tags, if needed convert to a tabular format using bcftools query

ADD COMMENT • link 2.5 years ago by Pierre Lindenbaum 164k