How Do You Characterize Unknown Protein Sequences?
11.5 years ago
James Ashmore ▴ 100

Hello, I have around 400 protein sequences which have no sequence similarity to known proteins, no identifiable protein domains and no identifiable motifs. What steps should I take to characterize these proteins, in both function and structure? My initial thoughts were as follows:

1) Compute physicochemical properties such as molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index, and grand average of hydropathicity (GRAVY).

2) Predict secondary structure

3) Determine subcellular localization

4) Ab initio modelling

Apart from these methods, what else could I employ to broaden the analysis of these sequences?
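As a rough illustration of step 1, here is a minimal pure-Python sketch of three of the listed properties: amino acid composition, GRAVY, and aliphatic index. It assumes the Kyte-Doolittle hydropathy scale for GRAVY and Ikai's formula for the aliphatic index (mole percent Ala + 2.9 × Val + 3.9 × (Ile + Leu)). The function names are made up for this example, and the other properties in the list (pI, half-life, instability index and so on) are not covered.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Return the fraction of each residue present in the sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def gravy(seq):
    """Grand average of hydropathicity: mean Kyte-Doolittle value.

    Raises KeyError on non-standard residue codes (X, B, Z, ...),
    so filter or mask those beforehand.
    """
    seq = seq.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index from mole percentages of Ala, Val, Ile and Leu."""
    comp = composition(seq)
    ala = 100 * comp.get("A", 0.0)
    val = 100 * comp.get("V", 0.0)
    ile = 100 * comp.get("I", 0.0)
    leu = 100 * comp.get("L", 0.0)
    return ala + 2.9 * val + 3.9 * (ile + leu)


if __name__ == "__main__":
    example = "MKVLAAGIVALLLA"  # made-up peptide, not from the data set
    print(composition(example))
    print(round(gravy(example), 3), round(aliphatic_index(example), 1))
```

For 400 sequences you would loop this over the records of a FASTA file and write the values to a table, which makes it easy to spot outliers such as very hydrophobic or very short candidates.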

protein sequence modeling • 5.2k views

You said no sequence similarity. Did you use BLAST to do your search? Which organism is this? And how much similarity was there?


Hi Jordan, yes, these are protein sequences from the salamander species N. viridescens, whose transcriptome has only recently been produced. There were around 600 protein-coding transcripts that did not show any hits in the NCBI databases (BLAST searches) and around 300 which showed hits to urodeles only.


Just put the 600 "orphan" ORFs up somewhere (figshare/GitHub) and we can have a fun competition to help you out. Seriously.

11.5 years ago
Jordan ★ 1.3k

You can modify the BLAST parameters and search again. If you have reason to suspect that the sequences you are looking for are highly divergent, you could change the substitution matrix. The default is BLOSUM62, but you could try a matrix suited to more distant relationships, such as BLOSUM45 or PAM250.

Note that for BLOSUM matrices, the higher the number, the more similar the sequences the matrix is designed for. So BLOSUM62 targets more closely related sequences than BLOSUM45. For PAM matrices it is the other way round: PAM1 is for highly similar sequences, whereas PAM250 is for distantly related ones.
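To make the matrix suggestion concrete, here is a small sketch that assembles a BLAST+ `blastp` command line with a non-default matrix. The file names (`orphans.faa`, `hits.tsv`), the database name `nr`, and the helper name `build_blastp_command` are placeholders for illustration. Gap penalties are not set here, so check which gap costs your BLAST+ version supports for the matrix you pick.

```python
def build_blastp_command(query, db, matrix="BLOSUM45", evalue=10.0,
                         out="hits.tsv"):
    """Assemble a blastp command (as an argument list) using the given matrix."""
    return [
        "blastp",
        "-query", query,          # protein FASTA with the orphan sequences
        "-db", db,                # name/path of a formatted protein database
        "-matrix", matrix,        # e.g. BLOSUM45 or PAM250 for distant homologs
        "-evalue", str(evalue),   # permissive cutoff; filter the hits afterwards
        "-outfmt", "6",           # tabular output
        "-out", out,
    ]


if __name__ == "__main__":
    cmd = build_blastp_command("orphans.faa", "nr", matrix="PAM250")
    print(" ".join(cmd))
    # To actually run it (requires BLAST+ and a local database):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Running the same query set once per matrix and comparing the tabular outputs shows quickly whether a softer matrix recovers any hits that BLOSUM62 missed.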

There are other things to consider too, such as the query length. I think your query should be sufficiently long if you are going to use matrices like PAM250. In a nutshell, a longer query can produce a longer alignment with a higher score, which gives a better (lower) E-value.
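The E-value point can be checked with a little arithmetic. This is a minimal sketch of the standard relation between a bit score S' and the expected number of chance hits, E = m × n × 2^(−S'), where m is the query length and n the total database length. The numbers in the example are invented, and real BLAST applies further corrections (effective lengths, composition-based statistics) that are ignored here.

```python
def expected_hits(bit_score, query_len, db_len):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    db_len = 1_000_000_000  # hypothetical database size in residues

    # Short query, modest score.
    print(expected_hits(30, 50, db_len))

    # A query four times longer enlarges the search space fourfold,
    # but if the longer alignment gains 20 bits the E-value still drops sharply.
    print(expected_hits(50, 200, db_len))
```

The search space grows only linearly with query length, while the E-value falls exponentially with the score, which is why a longer alignable region usually helps even with a soft matrix like PAM250.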

11.5 years ago

Are you sure they're "useful" sequences and not just nonsense DNA of some type? I take it you found stop and start codons that make you think they're functional proteins? How long is a typical sequence of those you've found?
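The sanity check asked about here can be sketched in a few lines: does a nucleotide sequence start with ATG, end with a stop codon, stay in frame, and exceed some minimum length? The sketch assumes the standard genetic code, and the 100-codon default threshold is an arbitrary illustrative cutoff, not a value from this thread.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def looks_like_complete_orf(nt_seq, min_codons=100):
    """Return True if the sequence is a single in-frame ATG...stop ORF.

    min_codons counts every codon, including the start and stop codons.
    """
    seq = nt_seq.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if len(codons) < min_codons:
        return False
    has_start = codons[0] == "ATG"
    has_stop = codons[-1] in STOP_CODONS
    internal_stop = any(c in STOP_CODONS for c in codons[1:-1])
    return has_start and has_stop and not internal_stop


if __name__ == "__main__":
    print(looks_like_complete_orf("ATGAAACCCTAA", min_codons=4))
```

A check like this only shows that a sequence could be translated; it says nothing about whether it actually is, which is where evidence such as mass spectrometry comes in.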


The data is from the recently published proteome and transcriptome of the salamander N. viridescens. The measures taken to ensure they were protein-coding are substantial, involving proteomic validation by mass spec. Here are links to the two papers if you think anything may have been overlooked 1) http://genomebiology.com/2013/14/2/R16/abstract 2) http://www.sciencedirect.com/science/article/pii/S0014482713000694

10.0 years ago
5heikki 11k

You didn't specify what you BLASTed against. Was it NR? You could also try InterProScan, HHpred, or HMMER against Pfam.
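For the HMMER-against-Pfam route, a sketch of how the `hmmscan` call could be assembled is shown below. It assumes a local copy of `Pfam-A.hmm` that has already been prepared with `hmmpress`. The file names, the E-value cutoff of 1e-3, and the helper name `build_hmmscan_command` are placeholders chosen for illustration.

```python
def build_hmmscan_command(hmm_db, fasta, domtbl="pfam_hits.domtbl",
                          evalue=1e-3, cpus=4):
    """Assemble an hmmscan command that searches sequences against an HMM database."""
    return [
        "hmmscan",
        "--domtblout", domtbl,   # per-domain tabular output
        "-E", str(evalue),       # report sequences at or below this E-value
        "--cpu", str(cpus),      # number of worker threads
        hmm_db,                  # pressed profile database, e.g. Pfam-A.hmm
        fasta,                   # protein FASTA with the query sequences
    ]


if __name__ == "__main__":
    cmd = build_hmmscan_command("Pfam-A.hmm", "orphans.faa")
    print(" ".join(cmd))
    # To actually run it (requires HMMER 3 and a pressed Pfam-A.hmm):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Profile searches like this can pick up remote homology that a pairwise BLAST search misses, so they are worth running even when BLAST against NR returns nothing.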
