Hi,
i have a fasta file including multiple protein sequences. I want to do feature selection for them and to do this, i need features of the proteins (number of each amino acid, number of di-peptide and something like this) but i couldn't find any software or script. if possible help me. thanks a lot for any help
best regards
Thanks for your help. Compseq count dipeptide for a sequence or several sequences but it display a table (for all of them) but i need a table for each of sequences. in fact, i need a software that i can submit a fasta file and it bring me a table including number of dipeptides in each of sequences.
This is a very fundamental thing to do on protein sequences, hence you most likely won't find readily accessible scripts for that. Using your script-language of choice (like python, perl), this should be easily achievable by simple sequence parsing.
thanks for your answer. i am a bit fresh about bioinformatis specially protein bioinformatics. i need a script (it doesn't matter perl or python) to do that. if there is any thing like this inform me. thanks
What Chris is saying is that for this type of task, there is rarely a ready-made program to generate exactly the output you want. There may be something close (such as the tools in EMBOSS), but you still need to parse the output into the required form. So an experienced bioinformatician would write the code themselves, probably using a library for sequence analysis. It would be nice if they then made that code available but frequently, this is not the case. Sorry if that's not helpful, but at least you're now aware of how people do these tasks and what you need to do: learn to script yourself!
Here is a quick perl hack. Please check it thoroughly until it gives you the desired result. This however doesn't report frequencies but just only non zero counts. If certain dipeptides are missing the dipeptides and correspnding counts(0) are not reported.
use strict;
use diagnostics;
use Bio::SeqIO;
my $in = Bio::SeqIO->new(-file => "myfile.fasta" , '-format' => 'Fasta');
while ( my $seq= $in->next_seq ) {
my @dipeps=($seq->seq()=~/(?=(.{2}))/g);
my %di_count=();
$di_count{$_}++ for @dipeps;
print ">",$seq->id();
print " ",$_,",",$di_count{$_} for sort keys(%di_count);
print "\n";
}
This small modification can report all possible dipep counts. Note that I've included two extra amino acids U and X
Thanks for your answer. i am freshman about perl and it is such as stupid question. when i run the script, it give error (cant locate Bio/seqI0.pm in @INC ....)
I don't know that what things must be installed. if possible help me about this. regards
Bio::SeqIO is a Bioperl module - http://www.bioperl.org - so that's what you need to install. It's not easy for complete beginners I'm afraid, but nor is it very difficult, if you read and follow the instructions to the letter.
As others have indicated, most bioinformaticians would write their own code to perform this task.
Something like a "ready-made" solution exists in seqinr, an R package for sequence analysis. Of course, that requires that you know or are willing to learn some R. Some sample code:
# install seqinr if required
install.packages("seqinr")
library(seqinr)
# read fasta sequences in file "sequences.fasta"
seqs <- read.fasta("sequences.fasta")
# amino acid composition of first sequence
# note that you need to specify protein alphabet
count(seqs[[1]], 1, alphabet = s2c("acdefghiklmnpqrstvwy"))
a c d e f g h i k l m n p q r s t v w y
60 2 43 44 19 50 14 30 31 49 15 21 37 20 23 28 41 37 11 21
# for dipeptides, change 1 to 2 (output not shown)
count(seqs[[1]], 2, alphabet = s2c("acdefghiklmnpqrstvwy"))
# for all sequences in list, use lapply()
lapply(s1, function(x) count(x, 1, alphabet = s2c("acdefghiklmnpqrstvwy")))
$`gi|452453|gb|AAA93118.1|`
a c d e f g h i k l m n p q r s t v w y
60 2 43 44 19 50 14 30 31 49 15 21 37 20 23 28 41 37 11 21
$`gi|76667604|dbj|BAE45629.1|`
a c d e f g h i k l m n p q r s t v w y
36 6 35 29 17 43 15 32 53 39 12 20 41 16 26 16 36 42 14 23
$`gi|995617|emb|CAA62740.1|`
a c d e f g h i k l m n p q r s t v w y
57 2 39 24 14 43 14 30 45 46 10 19 35 12 23 28 39 41 11 22
Thanks for your help. Compseq count dipeptide for a sequence or several sequences but it display a table (for all of them) but i need a table for each of sequences. in fact, i need a software that i can submit a fasta file and it bring me a table including number of dipeptides in each of sequences.