Question

How Do You Characterize Unknown Protein Sequences?

3

12.2 years ago

James Ashmore ▴ 100

Hello, I have around 400 protein sequences which have no sequence similarity, no identifiable protein domains and no identifiable motifs. What steps should I take to characterize these proteins, in both function and structure? My initial thoughts were as follows:

1) Compute physicochemical properties such as molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index, and grand average of hydropathicity (GRAVY).

2) Predict secondary structure

3) Determine subcellular localization

4) Ab initio modelling

Apart from these methods, what else could I employ to broaden the analysis of these sequences?
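As a rough illustration of step 1, here is a minimal sketch in plain Python that computes three of the listed properties: amino acid composition, GRAVY (using the Kyte-Doolittle hydropathy scale) and the aliphatic index (Ikai's formula, as used by ProtParam). The function names are made up for this example, it assumes sequences contain only the 20 standard residues, and it is not a substitute for a dedicated tool.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Return the fraction of each residue present in the sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in sorted(counts)}


def gravy(seq):
    """Grand average of hydropathicity: mean Kyte-Doolittle value.

    Raises KeyError on non-standard residues (e.g. X, B, Z).
    """
    seq = seq.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)),
    where X is the mole percent of each residue."""
    seq = seq.upper()
    pct = {aa: 100.0 * seq.count(aa) / len(seq) for aa in "AVIL"}
    return pct["A"] + 2.9 * pct["V"] + 3.9 * (pct["I"] + pct["L"])


if __name__ == "__main__":
    example = "MKTAYIAKQR"  # hypothetical peptide, for illustration only
    print(composition(example))
    print(gravy(example))
    print(aliphatic_index(example))
```

Running this over all sequences in a FASTA file and tabulating the results would give a quick first overview before moving on to structure and localization prediction.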

protein sequence modeling • 5.6k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 12.2 years ago by James Ashmore ▴ 100

0

You said no sequence similarity. Did you use BLAST to do your search? Which organism is this? And how much similarity was there?

ADD REPLY • link 12.2 years ago by Jordan ★ 1.3k

0

Hi Jordan, yes these are protein sequences from the salamander species N. viridescens, the transcriptome of which has only recently been produced. There were around 600 protein-coding transcripts that did not show any hits in the NCBI databases (BLAST searches) and around 300 which showed hits to urodeles only.

ADD REPLY • link 12.2 years ago by James Ashmore ▴ 100

0

Just put the 600 "orphan" ORFs up somewhere (figshare/Github) and we can have a fun competition to help you out - seriously

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by cdsouthan ★ 1.9k

score 1 · Answer 1 · 2013-05-23

You can modify the BLAST parameters and search again. If you have reason to suspect that the sequences you are looking for are highly divergent, then you could change the substitution matrix. The default is BLOSUM62, but you could try a matrix designed for more distant relationships, such as BLOSUM45 or PAM250.

Note that for BLOSUM, the higher the number, the more similar the sequences the matrix is meant for. So BLOSUM62 suits more similar sequences than BLOSUM45. For PAM it's the other way round: PAM1 is for highly similar sequences, whereas PAM250 is for distantly related sequences.
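To make the matrix change concrete, here is a small sketch that assembles a BLAST+ `blastp` command line with a non-default matrix. The file name `orphans.faa`, the database name `nr`, the output path and the helper function itself are placeholders for this example; only the `blastp` options shown (`-query`, `-db`, `-matrix`, `-evalue`, `-outfmt`, `-out`) come from BLAST+. The sketch only builds the command and does not run it.

```python
def blastp_command(query_fasta, database, matrix="BLOSUM45",
                   evalue=10.0, out_path="hits.tsv"):
    """Build a blastp argument list using a chosen substitution matrix.

    All paths and names passed in are hypothetical placeholders.
    """
    return [
        "blastp",
        "-query", query_fasta,
        "-db", database,
        "-matrix", matrix,       # e.g. BLOSUM45 or PAM250 for distant homologs
        "-evalue", str(evalue),  # report hits with E-value up to this threshold
        "-outfmt", "6",          # tabular output
        "-out", out_path,
    ]


if __name__ == "__main__":
    cmd = blastp_command("orphans.faa", "nr")
    print(" ".join(cmd))
    # To actually run it (requires BLAST+ and a formatted database):
    # import subprocess; subprocess.run(cmd, check=True)
```

Comparing the hit lists from BLOSUM62 and BLOSUM45 runs on the same queries is a cheap way to see whether the matrix choice matters for these sequences.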

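There are other things to consider as well, such as query length. I think your query should be sufficiently long if you are going to use matrices like PAM250, because matrices built for distant relationships carry less information per aligned position. In a nutshell, a longer query can support a longer alignment, which can accumulate a higher score and therefore a better (lower) E-value.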
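The relationship between length, score and E-value can be sketched with the standard bit-score form of the Karlin-Altschul statistics, E = m · n · 2^(−S′), where m is the query length, n the database length and S′ the bit score. The function below is a simplified illustration with a made-up name; it ignores the effective-length corrections that BLAST applies.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Simplified: uses raw lengths, without BLAST's edge-effect correction.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    db_len = 1_000_000  # hypothetical database size in residues
    for bits in (30, 40, 50):
        print(bits, expected_hits(bits, 200, db_len))
```

Each extra bit of score halves the E-value, while doubling the query length at a fixed score doubles it. A longer query therefore only helps insofar as it allows a longer alignment that scores more bits.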

score 1 · Answer 2 · 2013-05-24

1

12.2 years ago

chbelhumeur2000 ▴ 40

Are you sure they're "useful" sequences and not just nonsense DNA of some type? I take it you found stop and start codons that make you think they're functional proteins? How long is a typical sequence of those you've found?

ADD COMMENT • link 12.2 years ago by chbelhumeur2000 ▴ 40

0

The data is from the recently published proteome and transcriptome of the salamander N. viridescens. The measures taken to ensure they were protein-coding are substantial, involving proteomic validation by mass spec. Here are links to the two papers if you think anything may have been overlooked 1) http://genomebiology.com/2013/14/2/R16/abstract 2) http://www.sciencedirect.com/science/article/pii/S0014482713000694

ADD REPLY • link 12.2 years ago by James Ashmore ▴ 100

Ram · Answer 3 · 2014-12-04

0

10.7 years ago

5heikki 11k

You didn't specify what you blasted against. Was it NR? You could try InterProScan, HHpred, or HMMER against Pfam.
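For the HMMER-against-Pfam option, a hedged sketch of the command line might look like the following. It assumes a local `Pfam-A.hmm` file already prepared with `hmmpress`; the file names, the E-value cutoff and the helper function are placeholders for this example, while `--domtblout`, `-E` and `--cpu` are standard `hmmscan` options.

```python
def hmmscan_command(hmm_db, query_fasta, domtbl_path="pfam_hits.domtbl",
                    evalue=1e-3, cpus=4):
    """Build an hmmscan argument list: hmmscan [options] <hmmdb> <seqfile>.

    All paths are hypothetical; the HMM database must be hmmpress'ed first.
    """
    return [
        "hmmscan",
        "--domtblout", domtbl_path,  # per-domain tabular output
        "-E", str(evalue),           # per-sequence E-value reporting threshold
        "--cpu", str(cpus),
        hmm_db,
        query_fasta,
    ]


if __name__ == "__main__":
    print(" ".join(hmmscan_command("Pfam-A.hmm", "orphans.faa")))
```

Profile HMM searches are generally more sensitive than pairwise BLAST for remote homologs, so this may pick up domains that a BLAST search against NR missed.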

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by 5heikki 11k