Question

How To Find Paralogs Using Blast

8

Entering edit mode

13.3 years ago

ngsgene ▴ 380

I am trying to find paralogous genes in a model organism (Arabidopsis). To do that am using blast with the sequence for its own database. The first hit I get is the protein sequence aligned with it self (using peptide - amino acid sequence) which is mostly 100%. I am trying to understand which genes are highly homologous to each other and I come across similar examples where there are multiple alignments with the sequence of a single protein for example

Protein1 Protein1 Score410

Protein1 Protein2A Score300

Protein1 Protein2B Score210

Protein1 Protein3 Score280

Protein1 Protein4 Score250

The alignment to itself is an 'ofcourse' here but is it the norm to consider the best alignment from all other alignments like Protein1 Protein2A in this case and ignore Protein1 Protein2B (due to overlap between these alignments) Or are there some calculations/options to take that into account.

And then do I consider Protein1 Protein3, Protein1 Protein4 as close homologs of Protein 1 - are there some commonly used thresholds for such blast results?

Thanks!

homology blast • 42k views

ADD COMMENT • link updated 12.7 years ago by Larry_Parnell 16k • written 13.3 years ago by ngsgene ▴ 380

4

Entering edit mode

Take a look at this question. Most of it is applicable to yours as well.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 13.3 years ago by Michael Schubert ★ 7.1k

1

Entering edit mode

+1 @Michael Schubert

ADD REPLY • link 13.3 years ago by Leszek 4.2k

score 14 · Answer 1 · 2011-08-09

14

Entering edit mode

13.3 years ago

Lars Juhl Jensen 11k

There are several points to be made regarding your question. I will try my best to address them below.

I know it may sound like nitpicking, but I think it is important to make clear that there is no such thing as "highly homologous". If two proteins share a common evolutionary origin, they are homologous; if they do not, they are not homologous. Homology is binary - two proteins either are or are not homologous. What you can talk about is whether two homologous proteins are highly similar (i.e. have high sequence similarity) or not, but in that case it is entirely your choice how similar two proteins must be to be "highly similar".

Like Stefano said, because homology (and paralogy) is defined by evolution, you normally would have to create evolutionary trees to be able to say anything for sure. However, the reason why one has to do so is to be able to distinguish between orthologs and paralogs (both of which are homologs). However, you are trying to find paralogs within a single organism. Because all your proteins are from the same organism, they by definition cannot be orthologs. In other words, within an organism all homologs are by definition paralogs. I would thus argue that in your specific case there is no need to make phylogenetic trees.

Your question can thus be rephrased to "How high a BLAST score must I have in my database search to infer that two proteins are homologous?". This is a purely statistical question, which E-values were specifically designed to answer. So my advice would be to simply pick an E-value cutoff on the BLAST search in this case. Which one to pick is a tradeoff between sensitivity and specificity - I would normally use something in the range from 1e-3 to 1e-6.

An alternative to this is to not do it yourself. You mention that you work on a model organism, which means that the genome is presumably included in the various orthology databases. You should thus consider extracting paralogs from, for example, the COG/KOG or eggNOG databases.

ADD COMMENT • link 13.3 years ago by Lars Juhl Jensen 11k

1

Entering edit mode

@Lars, I would think e-val is not enough and length of the alignment should also be considered. e-val of 1e-10 for a 30bp stretch when looking at 300bp sequences would not imply paralogs

ADD REPLY • link 13.3 years ago by Bioinfosm ▴ 620

1

Entering edit mode

@bioinfosm I agree that taking into account also alignment length (and in fact also percent identity) would be better. However, the moment you go down that path and introduce multiple different parameters to identify paralogs, you should probably first be asking yourself why you are reinventing the wheel? In other words, if an E-value is not good enough for your purpose, you should probably resort to using orthology databases rather than your own home-cooked solution ;-)

ADD REPLY • link 13.3 years ago by Lars Juhl Jensen 11k

1

Entering edit mode

@ngsgene 50% identity over 90% of the length sounds very strict to me; you can easily end up having less than 90% coverage due to 1) masked low complexity regions, 2) gene annotation errors, and 3) differences in which (putative) splice isoforms have been annotated for each gene. So I would suggest lowering the 90% alignment length cutoff a lot. For comparison, Inparanoid uses a cutoff of 50%.

ADD REPLY • link 13.3 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

Thanks for your input Lars

nitpicking appreciated - helps me understand what I am doing better. For BLAST I had used 1e-10 as the cutoff.

But the concept I would like to understand is, should only the best hit be considered (right now I am looking at the next highest score after that of the protein with itself) as you suggest 'highly similar' is the keyword I am trying to get at - and the thresholds that are perhaps commonly used like >50 percent identity or >90% length alignment to generate 'highly similar' data.

ADD REPLY • link 13.3 years ago by ngsgene ▴ 380

score 3 · Answer 2 · 2011-08-11

To consider the score obtained in the alignment to itself as the reference could be a good option. Each protein aligned to itself obtains in BLAST an specific score. To know if the similarity is sufficiently significant to consider the two aligned proteins as paralogs it would be more precise if you refer the obtained score to the score obtained in the comparison with itself. You can calculate for each alignment the ratio between the score of the protein / score of the query protein to itself and then, you can fix a threshold of 0.7, by instance, for selecting the paralogs. With that threshold, in your example, only Protein 2A with Score 300 (300/410=0.73) would be selected as paralog protein for Protein1. To fix an appropriate threshold you can calculate these values for real known paralog proteins and extrapolate a threshold for your analysis.

The other point about the selection of only one alignment for each pair of proteins you can select only the alignment corresponding to the first HSP (HSP_num=1) which is the better because in BLAST(by default) the HSPs of each protein are ordered by E value from better alignment (minor E value) to worst alignment (higher E value) . The HSP with the minor E- value is the HSP with the higher score (perhaps in some very exceptional cases it could not be true).

Ram · Answer 3 · 2011-08-08

To find paralogs, you need a phylogenetic program such as Phylip or Paup. With BLAST, collect all sequences with enough similarity, plus an outgroup, a protein that diverged before all the others (the homologue in a non related species like yeast, Arabidopsis or a bacteria if your model organism is mouse) select the conserved motif, use ClustalW and then Phylip/Paup.

Once you have learned how to do it, remember you need bootstrapping to add some statistical significance.

I also answered a related question here that might give you further ideas and keyword to look up.

score 0 · Answer 4 · 2011-09-01

The assumption is orthologous genes have identical or highly related functions and this sharing is greater than for paralogs. But Nehrt, Hahn et al challenge this by offering that "the most important factor in the evolution of function is not amino acid sequence, but rather the cellular context in which proteins act."

They combined experimentally derived function with gene expression data on nearly 9000 proteins.

This is certainly a controversial statement, but in a thought-provoking manner. After all, it is the integration of diverse data that are driving a lot of genomics. One example is GWAS (genome-wide association studies) + gene expression = better identification of likely causal variant. The same might be applied to the ortholog/paralog definition.