Question

How Can I Count Dipeptide (Or Features Of Protein Sequences) In Multiple Protein Sequences

2

Entering edit mode

13.6 years ago

Mohammad Reza Bakhtiarizadeh ▴ 350

Hi, i have a fasta file including multiple protein sequences. I want to do feature selection for them and to do this, i need features of the proteins (number of each amino acid, number of di-peptide and something like this) but i couldn't find any software or script. if possible help me. thanks a lot for any help best regards

protein protein protein • 7.8k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 13.6 years ago by Mohammad Reza Bakhtiarizadeh ▴ 350

score 2 · Answer 1 · 2011-04-29

2

Entering edit mode

13.6 years ago

Julien ▴ 160

Try EMBOSS protein composition > compseq:

web mirror: http://pro.genomics.purdue.edu/emboss/

download: http://emboss.sourceforge.net/download/

ADD COMMENT • link 13.6 years ago by Julien ▴ 160

0

Entering edit mode

Thanks for your help. Compseq count dipeptide for a sequence or several sequences but it display a table (for all of them) but i need a table for each of sequences. in fact, i need a software that i can submit a fasta file and it bring me a table including number of dipeptides in each of sequences.

ADD REPLY • link 13.6 years ago by Mohammad Reza Bakhtiarizadeh ▴ 350

score 1 · Answer 2 · 2011-04-29

1

Entering edit mode

13.6 years ago

Chris ★ 1.6k

This is a very fundamental thing to do on protein sequences, hence you most likely won't find readily accessible scripts for that. Using your script-language of choice (like python, perl), this should be easily achievable by simple sequence parsing.

ADD COMMENT • link 13.6 years ago by Chris ★ 1.6k

0

Entering edit mode

thanks for your answer. i am a bit fresh about bioinformatis specially protein bioinformatics. i need a script (it doesn't matter perl or python) to do that. if there is any thing like this inform me. thanks

ADD REPLY • link 13.6 years ago by Mohammad Reza Bakhtiarizadeh ▴ 350

0

Entering edit mode

What Chris is saying is that for this type of task, there is rarely a ready-made program to generate exactly the output you want. There may be something close (such as the tools in EMBOSS), but you still need to parse the output into the required form. So an experienced bioinformatician would write the code themselves, probably using a library for sequence analysis. It would be nice if they then made that code available but frequently, this is not the case. Sorry if that's not helpful, but at least you're now aware of how people do these tasks and what you need to do: learn to script yourself!

ADD REPLY • link 13.6 years ago by Neilfws 49k

score 1 · Answer 3 · 2011-04-29

1

Entering edit mode

13.6 years ago

Woa ★ 2.9k

Here is a quick perl hack. Please check it thoroughly until it gives you the desired result. This however doesn't report frequencies but just only non zero counts. If certain dipeptides are missing the dipeptides and correspnding counts(0) are not reported.

use strict;
use diagnostics;
use Bio::SeqIO;
my $in  = Bio::SeqIO->new(-file => "myfile.fasta" , '-format' => 'Fasta');
while ( my $seq= $in->next_seq ) {
my @dipeps=($seq->seq()=~/(?=(.{2}))/g);
my %di_count=();
$di_count{$_}++ for @dipeps;
print ">",$seq->id();
print " ",$_,",",$di_count{$_} for sort keys(%di_count);
print "\n";
}

This small modification can report all possible dipep counts. Note that I've included two extra amino acids U and X

ADD COMMENT • link 13.6 years ago by Woa ★ 2.9k

0

Entering edit mode

Thanks for your answer. i am freshman about perl and it is such as stupid question. when i run the script, it give error (cant locate Bio/seqI0.pm in @INC ....) I don't know that what things must be installed. if possible help me about this. regards

ADD REPLY • link 13.6 years ago by Mohammad Reza Bakhtiarizadeh ▴ 350

0

Entering edit mode

Bio::SeqIO is a Bioperl module - http://www.bioperl.org - so that's what you need to install. It's not easy for complete beginners I'm afraid, but nor is it very difficult, if you read and follow the instructions to the letter.

ADD REPLY • link 13.6 years ago by Neilfws 49k

score 1 · Answer 4 · 2011-04-30

As others have indicated, most bioinformaticians would write their own code to perform this task.

Something like a "ready-made" solution exists in seqinr, an R package for sequence analysis. Of course, that requires that you know or are willing to learn some R. Some sample code:

# install seqinr if required
install.packages("seqinr")
library(seqinr)

# read fasta sequences in file "sequences.fasta"
seqs <- read.fasta("sequences.fasta")

# amino acid composition of first sequence
# note that you need to specify protein alphabet
count(seqs[[1]], 1, alphabet = s2c("acdefghiklmnpqrstvwy"))

 a  c  d  e  f  g  h  i  k  l  m  n  p  q  r  s  t  v  w  y 
60  2 43 44 19 50 14 30 31 49 15 21 37 20 23 28 41 37 11 21

# for dipeptides, change 1 to 2 (output not shown)
count(seqs[[1]], 2, alphabet = s2c("acdefghiklmnpqrstvwy"))
# for all sequences in list, use lapply()
lapply(s1, function(x) count(x, 1, alphabet = s2c("acdefghiklmnpqrstvwy")))

$`gi|452453|gb|AAA93118.1|`

 a  c  d  e  f  g  h  i  k  l  m  n  p  q  r  s  t  v  w  y 
60  2 43 44 19 50 14 30 31 49 15 21 37 20 23 28 41 37 11 21

$`gi|76667604|dbj|BAE45629.1|`

 a  c  d  e  f  g  h  i  k  l  m  n  p  q  r  s  t  v  w  y 
36  6 35 29 17 43 15 32 53 39 12 20 41 16 26 16 36 42 14 23

$`gi|995617|emb|CAA62740.1|`

 a  c  d  e  f  g  h  i  k  l  m  n  p  q  r  s  t  v  w  y 
57  2 39 24 14 43 14 30 45 46 10 19 35 12 23 28 39 41 11 22

score 0 · Answer 5 · 2011-04-29

0

Entering edit mode

13.6 years ago

Rm 8.3k

This script: protein_dipeptid.pl might be useful to you. I haven't used it though.

ADD COMMENT • link 13.6 years ago by Rm 8.3k

0

Entering edit mode

Thanks for your help. it works like Compseq (in EMBOSS ) and bring composition of dipeptides as a table (not for each sequence). regards

ADD REPLY • link 13.6 years ago by Mohammad Reza Bakhtiarizadeh ▴ 350

Ram · Answer 6 · 2014-08-01

0

Entering edit mode

10.3 years ago

ddofer ▴ 30

I'm actually in the final stages of a comprehensive package for doing exactly this right now!

That said, here's an older prototype, that should do all the things you're looking for (dipeptides, etc' ).

http://www.protonet.cs.huji.ac.il/neuropid/code/index.php

NPID_WebPackage.zip

(You want "SLEEK_FeatureGen+_new.py")

http://neuropid.cs.huji.ac.il/

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by ddofer ▴ 30