Question

How to extract polymorphic positions from several alignments?

0

Entering edit mode

7.5 years ago

s_herrera ▴ 10

Hi everyone,

I have several fasta protein alignments (~5000) and I would want to identify polymorphic positions plus the aminoacid residues that change between sequences. I have tried to write a code myself, but it has been very difficult (I'm new in programming), and I looked on BioPython but I hasn't found anything yet. I want something like:

Protein alignment:

> sp1 MQGAAYMQAAAYYMQA
> sp2 MQGAARMQGAAYYMQA
> sp3 MQGAARMQGAAYYMQM
> sp4 MQGAARMQGAAYYMQA
> sp5 MQGAARMQAAAYYMQA
           ^  ^      ^

In the example above, the alignment has 3 polymorphic positions (marked with ^). The first one is located on the 6th position, the second one has the 9th position and the third one the 16th position. A common notation of a polymorphic site coult be as follows: R6Y, which means that a change occured in the 6th position from an R to a Y. The direction of the change (R->Y or Y->R) is based on the most frequent letter at that position. So, in this case R has the highest frequency and one can infer that the direction is R->Y.

As you can see, the 6th and 16th positions have single changes (the different letter is at frequency of 1). However, the 9th position has two sequences (sp1 and sp5) with the change. I would like two distinguish between this two types of polymorphisms. Thus, in this case I woul like an output something like this:

Output :

# Alignment #1
#   Single polymorphisms:
#     R6Y: sp1
#     A16M: sp3

#   Non-single polymorphisms:
#     G9A: sp1, sp5

Any suggestion is very appreciated, thanks!!

alignment python SNP • 3.0k views

ADD COMMENT • link updated 7.4 years ago by Rodrigo ▴ 190 • written 7.5 years ago by s_herrera ▴ 10

score 5 · Accepted Answer · 2017-06-26

Supposing your sequences are the same length and have no empty spaces (no '--') as in your example, you can use the following (explained) script.

# Start of script
from collections import Counter
from glob import glob
import numpy as np
from Bio import AlignIO

Along with module AlignIO of Bio (biopython) you'll use:

numpy: a widely used python module for numerical computing.

glob and collections: two standard modules (no need to install)

sequences = glob('/path/to/sequences/*')

Put all your sequences (fasta files) in a directory here called 'sequences' and replace /path/to/sequences with the real directory.

for sequence in sequences:
    print(sequence)
    alignment = AlignIO.read(sequence, 'fasta')
    align_array = np.array([list(rec) for rec in alignment], np.character)
    single = {}
    poly = {}

For each fasta file, the script prints the name of the file and creates an alignment (using numpy array).

    for column in range(alignment.get_alignment_length()):
        counter = Counter(align_array[:,column])
        main_letter = counter.most_common(1)[0][0]

For each column main_letter is the most frequent letter, if several letter have the same frequency it will be one of them.

        if len(counter) == 2:
            # single polymorphism
            letter = counter.most_common(2)[1][0]
            for pos in np.where(align_array[:,column] == letter)[0]:
                change = main_letter + str(column + 1) + letter
                single.setdefault(change, [])
                single[change].append('sp' + str(pos + 1))

If there is only two different letters in that position means that there is a single polyorphism in the position.

        elif len(counter) > 2:
            # Non-single polymorphism
            for letter in counter:
                if letter != main_letter:
                    for pos in np.where(align_array[:,column] == letter)[0]:
                        change = main_letter + str(column + 1) + letter
                        poly.setdefault(change, [])
                        poly[change].append('sp' + str(pos + 1))

If more than two differents letters in the column, then there is a non-single polymorphism.

    print("   " + "Single polymorphisms:")
    for change in single:
        print("     " + change + ": " + str(single[change])[1:-1])
    print("   " + "Non-single polymorphisms:")
    for change in poly:
        print("     " + change + ": " + str(poly[change])[1:-1])
# End of script

Run this script as python script.py >> output.txt to have the output printed in a file called output.txt.