Comparing Protein Datasets By Sequence
13.4 years ago
Kevin ▴ 100

Hi, I have several protein datasets which I want to compare. I want to use sequence identity (not IDs) to count the number of shared proteins between each pair of datasets.

I want to identify proteins that match perfectly, proteins that are nearly identical (isoforms), and distinct proteins that are highly similar over a large region of both proteins.

So far, I have created a local blast database for each dataset and blasted each dataset against all others. I then parsed the XML output and have been able to find the highest scoring proteins.

I'm not sure which score is the best measurement of similarity for this task. If I look for high-scoring proteins (above a cutoff) I often miss some near-perfect matches, and I have similar problems with e-values. I'm writing my own filter (based on length of match vs query length, e-value and score), which is working reasonably well. Is this a suitable solution or am I missing something obvious? Thanks

blast comparison • 4.4k views
13.4 years ago

This seems to me a perfectly suitable solution. When you want to determine the similarity between two proteins (which is what your problem boils down to), I would indeed recommend a filter based on %similarity (or %identity, which is fairly linearly correlated), %length of matching fragments (compared to query length), and of course a threshold on the e-value. I rarely use thresholds on the score, as the score is correlated with sequence length and so is not easily comparable between hits.
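A minimal sketch of such a filter, in Python. The `Hit` class, the `classify` function and all three cutoff values are made up for illustration and would need tuning on real data. Coverage is computed against the longer of the two proteins rather than the query alone, which is an assumption taken from the question's "large region of both proteins".

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit (hypothetical container, not a BLAST data structure)."""
    query_len: int       # length of the query protein
    subject_len: int     # length of the matched protein
    align_len: int       # alignment length reported by BLAST
    pct_identity: float  # percent identity over the alignment
    evalue: float


def classify(hit, max_evalue=1e-10, min_identity=90.0, min_coverage=0.8):
    """Return 'identical', 'similar', or None for a single hit.

    The default cutoffs are placeholders, not recommended values.
    """
    if hit.evalue > max_evalue:
        return None
    # Coverage relative to the longer protein, so a short fragment that
    # matches part of a long protein does not pass. Capped at 1.0 because
    # gaps can make the alignment longer than either sequence.
    longest = max(hit.query_len, hit.subject_len)
    coverage = min(1.0, hit.align_len / longest)
    same_length = hit.query_len == hit.subject_len == hit.align_len
    if hit.pct_identity == 100.0 and same_length:
        return "identical"
    if hit.pct_identity >= min_identity and coverage >= min_coverage:
        return "similar"
    return None
```

Note that the bit score is deliberately left out, for the reason given above.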

13.4 years ago
Eric Fournier ★ 1.4k

If you are using NCBI's command-line BLAST suite, running the blast2 program with -m 8 (Alignment view options -> tabular) will output your results in a tabular format that contains the percent identity of the match as well as the alignment length. It will also be a lot more straightforward to parse than XML output.

You're still going to have to grab the original sequences' lengths from elsewhere, though.
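A rough sketch of how the two pieces could be combined in Python. It assumes the standard 12-column tabular layout (query, subject, % identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score) and that the query ID in the BLAST output is the first word of the FASTA header. The function names and the toy data are made up for illustration.

```python
import csv
import io


def fasta_lengths(handle):
    """Map each sequence ID (first word of the header) to its length."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def read_tabular(handle, query_lengths):
    """Yield (query, subject, pct_identity, query_coverage, evalue) per hit."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        pct_identity = float(row[2])
        q_start, q_end = int(row[6]), int(row[7])
        evalue = float(row[10])
        # Fraction of the query covered by this alignment.
        coverage = (q_end - q_start + 1) / query_lengths[query]
        yield query, subject, pct_identity, coverage, evalue


# Tiny in-memory example with made-up sequences and a made-up hit.
example_fasta = io.StringIO(">p1 example\nMKTAYIAKQR\nQISFVKSHFS\n>p2\nMKT\n")
example_blast = io.StringIO(
    "p1\tq9\t95.00\t20\t1\t0\t1\t20\t1\t20\t1e-08\t40.0\n"
)
example_lengths = fasta_lengths(example_fasta)
example_hits = list(read_tabular(example_blast, example_lengths))
```

With real data, the two `io.StringIO` objects would be replaced by open file handles for the query FASTA file and the tabular BLAST output.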

13.4 years ago
Iain ▴ 260

The CD-HIT programme would be very useful for this task:

http://weizhong-lab.ucsd.edu/cd-hit/


I tried to use CD-HIT for such a task some years ago, and it didn't work as well as BLAST. The clustering would miss some hits in an m:n orthology situation.


Interesting. I wouldn't expect it to be particularly good at orthology detection.

That being said, it is a useful tool for very quickly identifying a) identical sequences and b) very similar sequences, without having to apply extra criteria as one would with a BLAST-like approach.
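For case a) alone, the count of shared proteins can be had without any alignment tool at all. The sketch below is not CD-HIT or anything like its clustering, just a plain set intersection over the raw sequences in Python; the function names and toy data are made up, and it only finds exact full-length matches (ignoring case), so isoforms and near-identical sequences still need one of the approaches above.

```python
import io
from itertools import combinations


def read_sequences(handle):
    """Return the set of distinct sequences in a FASTA file (headers ignored)."""
    seqs, chunks = set(), []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                seqs.add("".join(chunks).upper())
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        seqs.add("".join(chunks).upper())
    return seqs


def shared_counts(datasets):
    """datasets: dict of name -> set of sequences.

    Returns {(name_a, name_b): number of sequences present in both}.
    """
    return {
        (a, b): len(datasets[a] & datasets[b])
        for a, b in combinations(sorted(datasets), 2)
    }


# Tiny in-memory example with made-up sequences.
example_a = read_sequences(io.StringIO(">x\nMKT\nAY\n>y\nGGG\n"))
example_b = read_sequences(io.StringIO(">z\nmktay\n>w\nCCC\n"))
example_counts = shared_counts({"A": example_a, "B": example_b})
```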
