Question

Detecting Polymorphic Sites In Multiple Sequence Aligned Files

12.6 years ago

ngsgene ▴ 380

I have used ClustalW2 to produce a multiple sequence alignment (15 sequences of a protein-coding region). I can see which alleles differ at specific positions. For example, my ClustalW2 alignment looks like this:

seq2            AAAAAT 6
seq3            AAAAAT 6
seq1            AAAAAA 6
seq4            TTAAAA 6
                ::***:

How can I make a table of the different alleles that exist at each position? For example, at positions 3, 4 and 5 all sequences have A, but at position 6 two sequences have A and two have T.

Is there a tool to make a table like this:

pos  1 6 
seq2 A T
seq3 A T
seq1 A A
seq4 T A

Thanks!

Edited: I was looking for a tool to make the job easier (and to be confident in the results for a large dataset). I guess there is nothing out there short of writing your own script.

multiple clustalw • 6.3k views

updated 12.6 years ago by raunakms ★ 1.1k • written 12.6 years ago by ngsgene ▴ 380

score 11 · Answer 1 · 2012-04-05

Here is a short script that uses the BioPerl module Bio::AlignIO. It reads your alignment file (alignment_file.aln), walks through the alignment one column at a time, and extracts the aligned nucleotide from every sequence at that column, producing a tab-separated table. The table is written to an output file (table_ouput.txt).

The table will look like this (the first field in each row is the alignment column number; the remaining fields follow the order of the sequences in the alignment):

1    A    A    A    T
2    A    A    A    T
3    A    A    A    A
4    A    A    A    A
5    A    A    A    A
6    T    T    A    A

You can then write another script to parse the table and do whatever calculation you need.

A sequence logo can also give you a rough idea of your overall alignment. Here is the sequence logo for your alignment above:

[image: sequence logo of the alignment]

Here is the script to parse the alignment:

#!/usr/bin/perl
use strict;
use warnings;

use Bio::AlignIO;

my $align_file = 'alignment_file.aln';
my $out_file   = 'table_ouput.txt';

# Read the first alignment from the ClustalW file
my $str = Bio::AlignIO->new(-file => $align_file, -format => 'clustalw');
my $aln = $str->next_aln();

# All sequences, in the order they appear in the alignment
# (works for any number of sequences, not just four)
my @seqs = $aln->each_seq;

open(my $out, '>', $out_file) or die "Cannot open $out_file: $!";

# One row per alignment column: column number, then one character per sequence
for my $col (1 .. $aln->length) {
    my @chars = map { substr($_->seq, $col - 1, 1) } @seqs;
    print $out join("\t", $col, @chars), "\n";
}

close $out;
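
The follow-up step mentioned above (parsing the table and keeping only the variable columns) could look roughly like the sketch below. It is written in Python rather than Perl purely for illustration, and the function names (`polymorphic_columns`, `as_site_table`) are made up for this example. It assumes the tab-separated layout shown earlier, treats a gap character as just another state, and relies on you supplying the sequence names in alignment order.

```python
def polymorphic_columns(table_text):
    """Return [(column_number, [allele per sequence]), ...] for variable columns."""
    sites = []
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        column, alleles = int(fields[0]), fields[1:]
        # Assumption: any second distinct character (including '-') makes a site polymorphic
        if len(set(alleles)) > 1:
            sites.append((column, alleles))
    return sites


def as_site_table(sites, names):
    """Lay the sites out one sequence per row, as in the question."""
    header = "\t".join(["pos"] + [str(column) for column, _ in sites])
    rows = [
        "\t".join([name] + [alleles[i] for _, alleles in sites])
        for i, name in enumerate(names)
    ]
    return "\n".join([header] + rows)


# The example table from this answer; in practice, read table_ouput.txt instead
example_table = "\n".join([
    "1\tA\tA\tA\tT",
    "2\tA\tA\tA\tT",
    "3\tA\tA\tA\tA",
    "4\tA\tA\tA\tA",
    "5\tA\tA\tA\tA",
    "6\tT\tT\tA\tA",
])
example_names = ["seq2", "seq3", "seq1", "seq4"]  # order of the alignment rows

print(as_site_table(polymorphic_columns(example_table), example_names))
```

For the example alignment this reports columns 1, 2 and 6 as polymorphic. For a real run, replace `example_table` with the contents of the output file and `example_names` with your 15 sequence names.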

score 3 · Answer 2 · 2012-04-05

12.6 years ago

Niek De Klein ★ 2.6k

I don't know whether you want to use the table for further processing or only for visualisation. In the latter case you'll probably want to make a sequence logo. This webserver makes a logo out of a multiple sequence alignment (see the examples if you don't know what a sequence logo is). It accepts FASTA, ClustalW and flat formats (see this for an explanation of the input).
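
For readers unfamiliar with logos, the quantity a logo draws at each position can be sketched in a few lines. The example below is only an illustration in Python, not what any particular webserver runs: it computes the per-column information content as log2(4) minus the Shannon entropy of the observed bases, with no small-sample correction (logo tools may apply one) and no special handling of gaps. The function name `column_information` is hypothetical.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content in bits of one alignment column (uncorrected)."""
    counts = Counter(column)
    total = len(column)
    # Shannon entropy of the observed character frequencies
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


# The four sequences from the question, in alignment order
alignment = ["AAAAAT", "AAAAAT", "AAAAAA", "TTAAAA"]

# zip(*alignment) yields the alignment column by column
for pos, column in enumerate(zip(*alignment), start=1):
    print(pos, dict(Counter(column)), round(column_information(column), 3))
```

A fully conserved column scores the maximum of 2 bits for DNA, and the score drops as the column becomes more mixed, which is why polymorphic positions show up as shorter stacks in a logo.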

From this table I have to detect which sites are derived (relative to the ancestor) and plot them to see their frequency. Still, it's interesting to see what a sequence logo is. Thanks for your response.

12.6 years ago by ngsgene ▴ 380
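
The derived-site step described in this reply could be approached along the lines of the sketch below. It is a hedged illustration, not an established tool: it assumes you already know the ancestral base at each polymorphic position (for example from an outgroup sequence), and the `ancestral` values used here are invented purely to make the example run. The function name `derived_allele_frequencies` is hypothetical.

```python
from collections import Counter


def derived_allele_frequencies(sites, ancestral):
    """Fraction of sequences carrying a non-ancestral base at each site.

    sites:     {position: [base per sampled sequence]}
    ancestral: {position: ancestral base}  (assumed known, e.g. from an outgroup)
    """
    freqs = {}
    for pos, alleles in sites.items():
        counts = Counter(alleles)
        # Every base that differs from the ancestral state counts as derived
        derived = sum(n for base, n in counts.items() if base != ancestral[pos])
        freqs[pos] = derived / len(alleles)
    return freqs


# Polymorphic columns of the example alignment (sequence order: seq2, seq3, seq1, seq4)
example_sites = {
    1: ["A", "A", "A", "T"],
    2: ["A", "A", "A", "T"],
    6: ["T", "T", "A", "A"],
}
# Hypothetical ancestral states, for illustration only
example_ancestral = {1: "A", 2: "A", 6: "A"}

for pos, freq in derived_allele_frequencies(example_sites, example_ancestral).items():
    print(pos, freq)
```

The resulting position-to-frequency mapping can be handed to whatever plotting tool you prefer to draw the frequency distribution.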

score 0 · Answer 3 · 2012-04-06

12.6 years ago

User 2005 ▴ 70

I would suggest MySQL if your table is smaller than 100 MB. Contact me by PM if you need more info :)

I fail to see why I would need a table in MySQL. I am referring to multiple sequence alignments and getting the polymorphic sites from them: a table for reference, not an RDBMS table.

12.6 years ago by ngsgene ▴ 380