How to perform reciprocal best hit (RBH) when there are multiple versions of a protein sequence
1
4
Entering edit mode
3 months ago
Chilly ★ 1.3k

I am doing a reciprocal best hit (RBH) analysis between two closely related species. I used the protein sequence fasta files of each species. I used the mmseqs easy-rbh tool for analysis:

mmseqs easy-rbh sci.pep.ann.fasta sca.pep.fasta Sca_Sci_RBH.pairs.tab ~/

The whole pipeline is very simple and runs well. But then I realised the problem:

The two species I studied (sca and sci) are non-classical species. In their protein sequence fasta files, each protein has one or more versions. In layman's terms, when performing genome assembly, for a gene (e.g. scip1.0054322), we may have multiple versions of transcripts (e.g. scip1.0054322.1, scip1.0054322.2, and scip1.0054322.3), which correspond to multiple versions of protein sequences. For example, scip1.0054322 has up to 19 versions of protein sequences. When I BLASTp scip1.0054322.9 sequence to sci itself, most other versions of scip1.0054322 sequences will be hit, but a few versions will not.

BLASTp scip1.0054322.9 sequence to 'sci' itself

When using two multi-sequence versions of protein fasta files (sci.pep.ann.fasta and sca.pep.fasta): if the scap2.0102435.1 sequence (from sca species) is input for BLASTp, the hit with scip1.0054322.2 (from sci species) is the best hit, and the hit with scap2.0102435.3 ranks second; and when the scip1.0054322.2 sequence is input for BLASTp, the hit with scap2.0102435.3 is the best hit, and the hit with scap2.0102435.1 ranks second. This will cause scap2.0102435 and scip1.0054322 to fail to form a reciprocal best hit (RBH) pair; but in fact this is a false negative error caused by different protein versions.

I tried to fix this problem. I currently have two ideas:

  • Merge the different sequence versions of each protein in each protein fasta, but I don't know how to do it. Not every version has a common sequence or overlaps with each other, that is, protein consensus sequence.

  • Improve the algorithm of the reciprocal best hit (RBH) pipeline so that the best hit between different versions can also be attributed to the reciprocal best hit (RBH) pairs at the protein level rather than the protein version level. But this seems to be tricky. Because there are many forms of false positive errors.

Please let me know if anyone has encountered a similar situation and has a solution. I really appreciate any help you can provide.

reciprocal.best.hit protein sequence RBH blast • 515 views
ADD COMMENT
2
Entering edit mode
3 months ago
LChart 4.6k

It seems to me the fundamental issues are:

(1) Hits between two isoforms of the same protein. This can be dealt with simply by annotating isoforms with gene names, and removing all hits between a gene at itself.

(2) Quasi-reciprocity: gene1 and gene2 are reciprocal hits, but gene1.iso1 best hit is gene2.iso1, gene2.iso1 best hit is gene1.iso2. You can fix this by defining a reciprocal hit as the following: two genes have reciprocal hits if there exist isoforms A (gene 1) and B (gene 2) such that the bitscore of A against B exceeds the bitscore of A against all other genes, and the bitscore of B against A exceeds the bitscore of B against all other genes.

In cases where there are many redundant sequences shared between various isoforms, there will be a number of essentially identical reciprocal matches. You can basically filter these to an exemplar based on isoform preference.

ADD COMMENT
4
Entering edit mode

I'm very interested in your answer, it sounds like it's easy to implement. But I don't quite understand what you mean. Can you explain it in more detail?

(1) Hits between two isoforms of the same protein. This can be dealt with simply by annotating isoforms with gene names, and removing all hits between a gene at itself.

Do you mean to change all isoform IDs to gene IDs so that different isoforms share the same gene ID? I am not sure whether duplicate sequence names will cause errors in the pipeline.

(2) Quasi-reciprocity: gene1 and gene2 are reciprocal hits, but gene1.iso1 best hit is gene2.iso1, gene2.iso1 best hit is gene1.iso2. You can fix this by defining a reciprocal hit as the following: two genes have reciprocal hits if there exist isoforms A (gene 1) and B (gene 2) such that the bitscore of A against B exceeds the bitscore of A against all other genes, and the bitscore of B against A exceeds the bitscore of B against all other genes.

I agree with you, but I have no idea how to program based on BLAST. In fact, I am only proficient in R programming, and for Python, I am only at running some pipelines and packages through Linux bash. Any further suggestions or examples?

ADD REPLY
1
Entering edit mode

You can simply have BLAST output all hits below a given e-score/bitscore, and post-process them in R. You can add the gene annotations to the fasta sequence name like:

scip1.00056.1 GENE_NAME

and read the fasta into R (or only the name lines) to associate genes with isoforms. All the logic above you can do with an exhaustive blast output and R.

ADD REPLY
4
Entering edit mode

My PBH protocol (the mmseqs easy-rbh tool) keeps using all the protein isoforms and is not subject to representative sequences. Is it possible that I name my protein sequences in the fasta file imperfect?

I have listed my example protein fasta head below:

    >scip1.0000056.1
    KDENFTDIEKPARQADANTHFTRTAKRHTS*
    >scip1.0000364.1
    DYFTSKFQQRVLNLLGYMSQVVCSPS*
    >scip1.0000364.2
    DYFTSKFQQRVLNMFSVSFQADQPSKEHLGFPARRQRL*
    >scip1.0000364.3
    FQADQPSKEHLGFPARRQRFRNLRLFVRRQRSKWPQKTVTKTISP*

How can I add a new protein ID to each sequence so that the protocol can prioritise obtaining or calculate the overall unique protein sequence to find the best hit and ultimately test the best-hit pair in units of protein rather than in units of each isoform?

ADD REPLY

Login before adding your answer.

Traffic: 2533 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6