Hi, I have several protein datasets which I want to compare. I want to use sequence identity (not IDs) to count the number of shared proteins between each pair of datasets.
I want to identify proteins which match perfectly, proteins which are nearly identical (isoforms), and different proteins which are highly similar over a large region of both proteins.
So far, I have created a local BLAST database for each dataset and BLASTed each dataset against all the others. I then parsed the XML output and have been able to find the highest-scoring hits for each protein.
I'm not sure which score is the best measure of similarity for this task. If I look for high-scoring proteins (above a cutoff), I often miss some near-perfect matches, and I have similar problems with e-values. I'm writing my own filter (based on length of match vs. query length, e-value and score), which is working reasonably well. Is this a suitable solution, or am I missing something obvious? Thanks
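For illustration, a filter of the kind described above might look something like the sketch below. It is only a rough outline, not the poster's actual filter: the `Hit` record, the `classify_hit` function and all threshold values are hypothetical, and the fields are assumed to have been filled in from the parsed BLAST output. It uses percent identity plus coverage of both query and subject, because a raw score cutoff tends to penalise short proteins even when they match perfectly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """One alignment (HSP) between a query and a subject protein.

    Field names are hypothetical; coordinates are 1-based and inclusive.
    """
    identities: int   # identical residues in the alignment
    align_len: int    # alignment length, gaps included
    q_start: int
    q_end: int
    query_len: int
    s_start: int
    s_end: int
    subject_len: int
    evalue: float


def classify_hit(hit: Hit,
                 near_identity: float = 0.90,   # assumed threshold
                 similar_identity: float = 0.40,  # assumed threshold
                 min_coverage: float = 0.80,    # assumed threshold
                 max_evalue: float = 1e-5) -> Optional[str]:
    """Return 'identical', 'near_identical', 'similar' or None."""
    identity = hit.identities / hit.align_len
    # Coverage is measured on BOTH sequences, so a short protein that
    # matches only a fragment of a long one is not counted as shared.
    q_cov = (hit.q_end - hit.q_start + 1) / hit.query_len
    s_cov = (hit.s_end - hit.s_start + 1) / hit.subject_len

    if hit.evalue > max_evalue or min(q_cov, s_cov) < min_coverage:
        return None
    # Full-length, gap-free alignment with every residue identical.
    if (hit.identities == hit.align_len
            == hit.query_len == hit.subject_len):
        return "identical"
    if identity >= near_identity:
        return "near_identical"
    if identity >= similar_identity:
        return "similar"
    return None
```

The cutoffs would need tuning against known cases in the actual datasets; the point is only that identity and two-sided coverage are length-independent, whereas score and e-value are not.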
I tried to use CD-HIT for such a task some years ago, and it didn't work as well as BLAST. The clustering would miss some hits in an m:n orthology situation.
Interesting. I wouldn't expect CD-HIT to be particularly good at orthology detection.
That being said, it is a useful tool for identifying a) identical sequences and b) very similar sequences very quickly, without having to apply extra criteria as one would with a BLAST-like approach.
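As an aside, case a) on its own does not need any alignment at all. The sketch below is not CD-HIT and does not reproduce its clustering; it is just a minimal hash-based way to find exactly identical sequences shared between two datasets. The function names are hypothetical, and each dataset is assumed to have already been loaded into a dict mapping protein ID to sequence.

```python
import hashlib
from typing import Dict, List, Tuple


def sequence_digest(seq: str) -> str:
    """Hash a sequence after normalising case and a trailing stop '*'."""
    normalized = seq.strip().upper().rstrip("*")
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def shared_exact(dataset_a: Dict[str, str],
                 dataset_b: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (id_in_a, id_in_b) pairs whose sequences are identical."""
    # Index dataset A by digest; several IDs may share one sequence.
    by_digest: Dict[str, List[str]] = {}
    for prot_id, seq in dataset_a.items():
        by_digest.setdefault(sequence_digest(seq), []).append(prot_id)

    pairs = []
    for prot_id, seq in dataset_b.items():
        for a_id in by_digest.get(sequence_digest(seq), []):
            pairs.append((a_id, prot_id))
    return pairs
```

Anything not caught this way (isoforms, highly similar proteins) would still have to go through CD-HIT or a filtered BLAST comparison.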