I am doing a reciprocal best hit (RBH) analysis between two closely related species. I used the protein sequence fasta files of each species. I used the mmseqs easy-rbh
tool for analysis:
mmseqs easy-rbh sci.pep.ann.fasta sca.pep.fasta Sca_Sci_RBH.pairs.tab ~/
The whole pipeline is very simple and runs well. But then I realised the problem:
The two species I studied (sca and sci) are non-classical species. In their protein sequence fasta files, each protein has one or more versions. In layman's terms, when performing genome assembly, for a gene (e.g. scip1.0054322), we may have multiple versions of transcripts (e.g. scip1.0054322.1, scip1.0054322.2, and scip1.0054322.3), which correspond to multiple versions of protein sequences. For example, scip1.0054322 has up to 19 versions of protein sequences. When I BLASTp scip1.0054322.9 sequence to sci itself, most other versions of scip1.0054322 sequences will be hit, but a few versions will not.
When using two multi-sequence versions of protein fasta files (sci.pep.ann.fasta
and sca.pep.fasta
): if the scap2.0102435.1 sequence (from sca species) is input for BLASTp, the hit with scip1.0054322.2 (from sci species) is the best hit, and the hit with scap2.0102435.3 ranks second; and when the scip1.0054322.2 sequence is input for BLASTp, the hit with scap2.0102435.3 is the best hit, and the hit with scap2.0102435.1 ranks second. This will cause scap2.0102435 and scip1.0054322 to fail to form a reciprocal best hit (RBH) pair; but in fact this is a false negative error caused by different protein versions.
I tried to fix this problem. I currently have two ideas:
Merge the different sequence versions of each protein in each protein fasta, but I don't know how to do it. Not every version has a common sequence or overlaps with each other, that is, protein consensus sequence.
Improve the algorithm of the reciprocal best hit (RBH) pipeline so that the best hit between different versions can also be attributed to the reciprocal best hit (RBH) pairs at the protein level rather than the protein version level. But this seems to be tricky. Because there are many forms of false positive errors.
Please let me know if anyone has encountered a similar situation and has a solution. I really appreciate any help you can provide.
I'm very interested in your answer, it sounds like it's easy to implement. But I don't quite understand what you mean. Can you explain it in more detail?
Do you mean to change all isoform IDs to gene IDs so that different isoforms share the same gene ID? I am not sure whether duplicate sequence names will cause errors in the pipeline.
I agree with you, but I have no idea how to program based on BLAST. In fact, I am only proficient in R programming, and for Python, I am only at running some pipelines and packages through Linux bash. Any further suggestions or examples?
You can simply have BLAST output all hits below a given e-score/bitscore, and post-process them in R. You can add the gene annotations to the fasta sequence name like:
scip1.00056.1 GENE_NAME
and read the fasta into R (or only the name lines) to associate genes with isoforms. All the logic above you can do with an exhaustive blast output and R.
My PBH protocol (the
mmseqs easy-rbh
tool) keeps using all the protein isoforms and is not subject to representative sequences. Is it possible that I name my protein sequences in the fasta file imperfect?I have listed my example protein fasta head below:
How can I add a new protein ID to each sequence so that the protocol can prioritise obtaining or calculate the overall unique protein sequence to find the best hit and ultimately test the best-hit pair in units of protein rather than in units of each isoform?