Question

Comparing Protein Datasets By Sequence

5

13.8 years ago

Kevin ▴ 100

Hi, I have several protein datasets which I want to compare. I want to use sequence identity (not IDs) to count the number of shared proteins between each pair of datasets.

I want to identify proteins that match perfectly, proteins that are nearly identical (isoforms), and different proteins that are highly similar over a large region of both proteins.

So far, I have created a local blast database for each dataset and blasted each dataset against all others. I then parsed the XML output and have been able to find the highest scoring proteins.

I'm not sure which score is the best measure of similarity for this task. If I look for high-scoring proteins (above a cutoff), I often miss some near-perfect matches, and I have similar problems with e-values. I'm writing my own filter (based on length of match vs. query length, e-value and score), which is working reasonably well. Is this a suitable solution, or am I missing something obvious? Thanks

blast comparison • 4.7k views

updated 10.2 years ago by Biostar 20 • written 13.8 years ago by Kevin ▴ 100

score 5 · Answer 1 · 2011-07-14

This seems to me a perfectly suitable solution. When you want to determine the similarity between two proteins (which is what your problem boils down to), I would indeed recommend a filter based on %similarity (or %identity, which is fairly linearly correlated), the length of the matching fragments as a percentage of the query length, and of course a threshold on the e-value. I rarely use thresholds on the score, as it is correlated with sequence length and so is not easily comparable between hits.
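A filter along these lines could be sketched roughly as follows. This is only an illustration of combining %identity, coverage and e-value. The `Hit` class, the `classify_hit` function, the category names and every threshold are assumptions made for the example, not values taken from this thread, and would need tuning on real data.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit (hypothetical container for this sketch)."""
    pident: float   # percent identity over the aligned region
    aln_len: int    # alignment length, gaps included
    qlen: int       # full length of the query protein
    slen: int       # full length of the subject protein
    evalue: float


def classify_hit(hit, max_evalue=1e-10, near_identity=95.0,
                 similar_identity=70.0, min_coverage=0.8):
    """Assign a hit to a category. All thresholds are illustrative assumptions."""
    if hit.evalue > max_evalue:
        return "no_match"
    # Approximate coverage: alignment length includes gaps, so cap at 1.0.
    qcov = min(1.0, hit.aln_len / hit.qlen)
    scov = min(1.0, hit.aln_len / hit.slen)
    # Perfect match: 100% identity spanning both proteins end to end.
    if hit.pident >= 100.0 and hit.aln_len == hit.qlen == hit.slen:
        return "identical"
    # Near-identical (e.g. isoforms): very high identity covering most of
    # at least one of the two proteins.
    if hit.pident >= near_identity and max(qcov, scov) >= min_coverage:
        return "near_identical"
    # Highly similar over a large region of BOTH proteins.
    if hit.pident >= similar_identity and min(qcov, scov) >= min_coverage:
        return "similar"
    return "no_match"
```

Requiring coverage of both proteins for the "similar" category, but of only one for "near_identical", is a design choice made here so that a shorter isoform still matches its longer counterpart.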

score 3 · Answer 2 · 2011-07-14

If you are using NCBI's command-line BLAST suite, the blast2 program's -m 8 option (Alignment view options -> tabular; the equivalent in the newer BLAST+ programs is -outfmt 6) will output your results in a tabular format that contains the percent identity of the match as well as the alignment length. It is also a lot more straightforward to parse than XML output.

You're still going to have to grab the original sequences' lengths from elsewhere, though.
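For what it's worth, if the newer BLAST+ programs are an option, the tabular format accepts a custom column list, e.g. -outfmt "6 qseqid sseqid pident length qlen slen evalue bitscore", which puts the query and subject lengths in the same file. Below is a minimal parsing sketch that assumes exactly that column order; the function names `read_hits` and `best_hits` are made up for the example.

```python
import csv

# Assumed column order, matching the custom -outfmt string described above.
COLUMNS = ["qseqid", "sseqid", "pident", "length",
           "qlen", "slen", "evalue", "bitscore"]


def read_hits(handle):
    """Yield one dict per line of tab-separated BLAST output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        for key in ("pident", "evalue", "bitscore"):
            rec[key] = float(rec[key])
        for key in ("length", "qlen", "slen"):
            rec[key] = int(rec[key])
        yield rec


def best_hits(hits):
    """Keep the highest bit-score hit for each query ID."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best
```

Usage would be something like `best_hits(read_hits(open("a_vs_b.tsv")))`, where the file name is a placeholder.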

score 2 · Answer 3 · 2011-07-14

13.8 years ago

Iain ▴ 260

The CD-HIT programme would be very useful for this task.

http://weizhong-lab.ucsd.edu/cd-hit/

updated 5.6 years ago by Ram 45k • written 13.8 years ago by Iain ▴ 260

1

I tried to use CD-HIT for such a task some years ago, and it didn't work as well as BLAST. The clustering would miss some hits in an m:n orthology situation.

13.8 years ago by Michael Kuhn 5.0k

0

Interesting. I wouldn't expect it to be particularly good at orthology detection.

That being said, it is a useful tool for very quickly identifying a) identical sequences and b) very similar sequences, without having to apply extra criteria as one would with a BLAST-like approach.

13.8 years ago by Iain ▴ 260
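For the identical-sequence case alone, neither BLAST nor CD-HIT is strictly needed. A minimal sketch, assuming each dataset is a plain FASTA file, is to collect the sequences of each dataset into a set and intersect the sets pairwise. This is plain exact string matching, not what CD-HIT does, and the function names are made up for the example.

```python
from itertools import combinations


def read_fasta_sequences(lines):
    """Return the set of distinct sequences (upper-cased) in FASTA-formatted lines."""
    sequences = set()
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header starts a new record; store the one collected so far.
            if chunks:
                sequences.add("".join(chunks).upper())
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        sequences.add("".join(chunks).upper())
    return sequences


def shared_identical(datasets):
    """Count sequences shared exactly between every pair of named datasets."""
    return {
        (a, b): len(datasets[a] & datasets[b])
        for a, b in combinations(sorted(datasets), 2)
    }
```

With `datasets = {"A": read_fasta_sequences(open("a.fasta")), ...}` (placeholder file names), `shared_identical(datasets)` gives the exact-match count per dataset pair. Near-identical and similar proteins would still need an alignment-based step.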