Dear all, I have an alignment of several sequences. I need ot make statistics ot the identify of each nt site. Would you please give a hand? Thank you very much.
Dear all, I have an alignment of several sequences. I need ot make statistics ot the identify of each nt site. Would you please give a hand? Thank you very much.
Here is python script. I think this is what you want.
from numpy import matrix
from collections import defaultdict
Seq1="ATCCTTAGGT"
Seq2="ACCCTACAAT"
Seq3="TTCTGGCAGC"
SequenceLength=len(Seq1)
A = matrix( [list(Seq1),list(Seq2),list(Seq3)])
for i in range(0,SequenceLength):
dic = defaultdict(int)
for j in range(0,len(A)):
dic[A[j,i]] += 1
Str="Position:"+str(i+1)
if dic['A'] !=0:
Str = Str + " A: " + str(float(dic['A'])/len(A) *100)
if dic['T'] !=0:
Str = Str + " T: " + str(float(dic['T'])/len(A) *100 )
if dic['G'] !=0:
Str = Str + " G: " + str(float(dic['G'])/len(A) *100)
if dic['C'] !=0:
Str = Str + " C: "+ str( float(dic['C'])/len(A) *100)
print Str
Output:
Position:1 A: 66.6666666667 T: 33.3333333333
Position:2 T: 66.6666666667 C: 33.3333333333
Position:3 C: 100.0
Position:4 T: 33.3333333333 C: 66.6666666667
Position:5 T: 66.6666666667 G: 33.3333333333
Position:6 A: 33.3333333333 T: 33.3333333333 G: 33.3333333333
Position:7 A: 33.3333333333 C: 66.6666666667
Position:8 A: 66.6666666667 G: 33.3333333333
Position:9 A: 33.3333333333 G: 66.6666666667
Position:10 T: 66.6666666667 C: 33.3333333333
biolab, know what you're looking for...
This "statistics ot the identify of each nt site" is called Position weight matrix (or PSWM, PSSM).
I strongly recommend going trough this tutorial (you can analyze your sequences their).
Also, similar question were asked on biostars before: Parsing a Fasta file by columns.
Your script is close. What you want is something that keeps count of how often every letter comes up in each position. I'd use a hash of hashes. It's sort of like a three dimensional hash, where instead of one key having one value, each key is its own hash, with more keys and values. So
$hash{14}{'A'}++
would increment up the count of A's at position 14 by 1. $hash{14} is actually a reference to another hash that has the counts for all 4 letters. So:
$hash{$j}{$o}++
and so on for each sequence.
Then you use two nested foreach loops to get the data out
foreach my $position (keys %hash) {
foreach my $letter ('A', 'C', 'G', 'T') {
print "$position: $letter: ", $hash{$position}{$letter} / 4, "\n";
}
}
Once you get your script working, you next job is to generalize it, so it works on more than 4 sequences.
Dear swbarnes2,
I tried my best to write a code, as shown below. However, I still get problem. could you please check it and indicate the erros for me? THANK YOU VERY MUCH.
#!/usr/bin/perl
use strict;
use warnings;
open SEQ, $ARGV[0];
my %hash = 0;
while (<SEQ>){
chomp;
my @seq = split (/ /, $_);
for (my $i=0; $i<=$#seq; $i++){
$hash {$i}{$seq[$i]} +=1;
}
}
close SEQ;
foreach my $posi (keys %hash){
foreach my $nt ('A', 'T', 'G', 'C'){
print "$posi", "$nt", %hash{$posi}{$nt}/4, "\n";
}
}
Error message
syntax error at aln comp_2.pl line 18, near "%hash{"
Global symbol "$nt" requires explicit package name at aln comp_2.pl line 18.
syntax error at aln comp_2.pl line 18, near "},"
syntax error at aln comp_2.pl line 20, near "}"
Execution of aln comp_2.pl aborted due to compilation errors.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Can you make clear what '..calculate the identify ..' means ?
Sorry for unclear descirption. Actually my analysis is to calculate the frequency of each site for an alignment. For example, 3 sequences have been aligned as shown below. I want to know the frequency of these four sites (No.1: 100% A, NO.2: 100% T, NO3: 100%G, NO. 4: 33.3%C 66.6%G). THANKS!
i further make some comments: my sequences are the same in length. So I just need to get the nt composition information for each position. THANKS.
do you allow gaps in your alignment ?
no gaps in the alignment. Actually my sequences are all 21 nt miRNAs.
Then it would be easy. Put all sequences in array, make loop with increment counter by array indices, then calculate similarity percentage in each loop count
Hi World, I have encounted problem with perl. I am a true beginner of perl. I tried my best to write a script, but not working. I show it below. This is an example: I have one alignment file that contains 4 sequences, each of which is 21 nt long. Actually I also need to run more alignment files in batch, but do not know how to use @ARGV. Could you please take a look at the following script? THANKS!
Note I don't know how to calculate similarity percentage, so just print the four nts. Please modify my script. THANKS
I finally worked out a script for displaying the nt at each position for 4 sequences. for example
My script is as follows:
my output result is:
As a beginner i tried best to produce this output. However, I need to calculate in each position (here from position 1 to 20) the frequency of ATGC. How to achiev that? Could anyone give a hand? THANKS!