Question

Blast/Reblast To Verify Homologies

9

14.2 years ago

Mawe ▴ 90

Hi!

I am an informatics student from Germany attending a practical course on bioinformatics. I am new to the field but willing to learn a lot. My first task is the following: I have been given some protein sequences from one organism. These proteins form a class of similar proteins. Additionally, I have one database for this organism and a second database for another organism.

The goal is to find homologs of the protein sequences in the second database. So what I have to do is to blast the protein sequences against the second database and then "reblast" the results against the first database to verify the hits. To automate this task I am writing a program in Python, which locally blasts given sequences against databases.

My questions are not about the programming but about the way to find homologies: is this the normal way to do this? What exactly am I verifying with a reblast? And how do I know which blast results are significant? What output is interesting for a bioinformatician? The course focuses on programming, which is no problem for me, but I'd like to understand what I am doing and why.

Thanks in advance!

blast homology orthologues • 11k views

updated 14.2 years ago by Larry_Parnell 16k • written 14.2 years ago by Mawe ▴ 90

3

For a bioinformatics "newbie", you asked this question very well.

14.2 years ago by Neilfws 49k

score 24 · Answer 1 · 2011-03-05

From the description of the task, it sounds very much like you are implementing the often-used "reciprocal best hits" strategy for identifying orthologs (as opposed to paralogs).

The idea of the reciprocal blast (or reblasting) is as follows. If you search with protein A1 from organism A and find B1 to be the best hit in organism B, it might nonetheless be that there is a different protein in organism A that looks more like B1 than A1 does. By performing the reverse blast, you ensure that A1 and B1 are each other's best hits. When that is the case, they are very likely to be orthologs. It should be noted, though, that A1 and B1 being reciprocal best hits does not guarantee that they are orthologs. Similarly, you cannot conclude that A1 and B1 are not orthologs just because they are not reciprocal best hits.
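Since you are writing this in Python anyway, here is a minimal sketch of the reciprocal best hits check described above. It assumes you have already reduced each BLAST run to (query, subject, bit score) tuples; the function names and the toy scores are made up for illustration and are not part of any BLAST tool.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy data (hypothetical scores): A1's best hit is B1,
# but B1's best hit back in organism A is A2, not A1.
a_vs_b = [("A1", "B1", 250.0), ("A2", "B1", 400.0), ("A3", "B2", 310.0)]
b_vs_a = [("B1", "A2", 395.0), ("B1", "A1", 240.0), ("B2", "A3", 305.0)]

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)
```

In this toy case A1 is dropped because the reverse search from B1 prefers A2, which is exactly the situation the reblast is meant to catch.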

Typical cases where the reciprocal best hits strategy fails involve either a gene duplication that has taken place since the speciation event separating the two organisms, or a gene duplication that has taken place prior to the speciation event and which was followed by gene losses in both lineages. In the former case, the reciprocal best hits strategy will identify only one of several orthologs. In the latter case, it will wrongly suggest that two paralogous genes are orthologs.
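To make the first failure mode concrete, here is a small toy sketch of a duplication that happened after speciation. The protein names and bit scores are invented, and the helper functions are only for illustration: B1a and B1b are both orthologs of A1, yet only one of them survives the reciprocal check.

```python
def top_hit(scores, query):
    """Return the subject with the highest bit score for `query`."""
    candidates = scores[query]
    return max(candidates, key=candidates.get)


def reciprocal_pairs(a_vs_b, b_vs_a):
    """Return (a, b) pairs where a and b are each other's top hit."""
    pairs = []
    for a in a_vs_b:
        b = top_hit(a_vs_b, a)
        if b in b_vs_a and top_hit(b_vs_a, b) == a:
            pairs.append((a, b))
    return pairs


# Made-up bit scores: B1a and B1b arose by duplication in organism B
# after the speciation event, so both are orthologs of A1.
a_vs_b = {"A1": {"B1a": 300.0, "B1b": 280.0}}
b_vs_a = {"B1a": {"A1": 300.0}, "B1b": {"A1": 280.0}}

pairs = reciprocal_pairs(a_vs_b, b_vs_a)
paired_b = {b for _, b in pairs}
# Proteins in B whose top hit is A1 but that were not paired with it.
missed = [b for b in b_vs_a if top_hit(b_vs_a, b) == "A1" and b not in paired_b]
print(pairs, missed)
```

If you care about such cases, one option is to also report the proteins collected in `missed`, since their best hit points back to the same query.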

Which significance cutoffs to use with BLAST depends very much on the evolutionary distance between the two organisms as well as on the protein family in question. The more evolutionarily distant the two organisms are from each other, the more relaxed the cutoffs you need to use. However, you can generally get away with very relaxed cutoffs when using the reciprocal best hits strategy, since most of the false positive hits that come out of a BLAST search will not fulfill the reciprocity criterion and will hence be filtered out anyway.
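As a rough sketch of applying a relaxed cutoff, the snippet below filters hits by E-value. It assumes you ran BLAST+ with tabular output (`-outfmt 6`) and its default twelve columns; the cutoff of 1e-3 and the function name `parse_hits` are my own assumptions for the example, not a recommendation for your particular organisms.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(handle, max_evalue=1e-3):
    """Return (query, subject, evalue, bitscore) for hits passing the cutoff."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(COLUMNS, row))
        evalue = float(record["evalue"])
        if evalue <= max_evalue:
            hits.append((record["qseqid"], record["sseqid"],
                         evalue, float(record["bitscore"])))
    return hits


# Two made-up result lines: one strong hit and one that is likely noise.
example = io.StringIO(
    "A1\tB1\t62.5\t200\t75\t0\t1\t200\t5\t204\t2e-80\t250\n"
    "A1\tB7\t28.0\t50\t36\t0\t10\t59\t3\t52\t4.2\t21.5\n"
)
kept = parse_hits(example)
print(kept)
```

With real data you would pass an open file instead of the `io.StringIO` object, and you can loosen `max_evalue` further if you rely on the reciprocal check to remove false positives.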

In terms of which output would be relevant, I would suggest the following: the bit score obtained from BLAST, the percent identity of the alignment, the aligned fractions (i.e. what percentage of each protein's total length could be aligned), and possibly also the lengths of the two proteins.
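A possible way to collect these values per hit is sketched below. It assumes 1-based, inclusive alignment coordinates (as in BLAST tabular output) and a single aligned segment per hit; if a hit consists of several segments you would have to merge them first. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HitSummary:
    query: str
    subject: str
    bit_score: float
    percent_identity: float
    query_aligned_pct: float    # percent of the query length in the alignment
    subject_aligned_pct: float  # percent of the subject length in the alignment
    query_length: int
    subject_length: int


def summarize(query, subject, bit_score, percent_identity,
              qstart, qend, sstart, send, query_length, subject_length):
    """Build a summary record, computing aligned fractions from coordinates."""
    query_aligned = 100.0 * (qend - qstart + 1) / query_length
    subject_aligned = 100.0 * (send - sstart + 1) / subject_length
    return HitSummary(query, subject, bit_score, percent_identity,
                      query_aligned, subject_aligned,
                      query_length, subject_length)


# Made-up hit: 200 aligned residues of a 250-residue query
# against a 400-residue subject.
summary = summarize("A1", "B1", 250.0, 62.5, 1, 200, 5, 204, 250, 400)
print(summary)
```

A hit like this one, where only half of the subject is covered, may point to a shared domain rather than a full-length ortholog, which is why the aligned fractions are worth reporting next to the score.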

score 0 · Answer 2 · 2011-09-01

The assumption is that orthologous genes have identical or highly related functions, and that this similarity is greater than for paralogs. But Nehrt, Hahn et al. challenge this, arguing that "the most important factor in the evolution of function is not amino acid sequence, but rather the cellular context in which proteins act."

They combined experimentally derived function with gene expression data on nearly 9000 proteins.

This is certainly a controversial statement, but a thought-provoking one. After all, it is the integration of diverse data that is driving a lot of genomics. One example is GWAS (genome-wide association studies) + gene expression = better identification of likely causal variants. The same might be applied to the ortholog/paralog definition.