Entering edit mode
2.9 years ago
Teo
▴
10
Hello every one, I'm trying to write a pipeline that identifies the RBH for the “subject” genes by means of Blast+. I have to download any two Nostoc bacterial genomes from NCBI and use one complete genome (the “reference”) and of the other only a small fraction e.g. one hundred genes (the “subject”).
Can you help me to suggest a workflow/code for this pipeline?
Forget
BLAST
. UseMMseqs2
'seasy-rbh
module. The command is a simple one-liner, and you get the results in aBLAST
-style output table. Here's a link to the relevant documentation: https://github.com/soedinglab/mmseqs2/wiki#reciprocal-best-hit-using-mmseqs-rbh.Thanks you for your comment. However I have to start from two FASTA file: In a FASTA file there is the reference genome of a species, and in the second FASTA (subject) there is a small fraction of the genome of another species (this second FASTA would be a multi FASTA) . I would like to identifies the RBH for the subject genes. How can I start the pipeline from these data?
Assuming your genomic sequences from the two species are in
FASTA
files namedspecies1.fq
andspecies2.fq
, all you have to do is run:mmseqs easy-rbh species1.fq species2.fq results.tab tmpdir
from the command line. (I am assuming here that you have access to a
UNIX
-like operating system; likeUbuntu
, for example.)All your pairwise RBH results will be in the file
results.tab
.Note: you can install
MMseqs2
viaconda
into an environment namedmmseqs2
like so:conda create -n mmseqs2 -c bioconda mmseqs2
And then you can activate the environment with
conda activate mmseqs2
.Ok, I've another two questions:
i) I start from these two file FASTA (genome of a species and genes of another species). To obtain the right RBH is better to aline amino acidic sequences or nucleotide sequences? or to do both?
ii) i have obtain this output by means of MMseqs2's easy-rbh module:
Which is the RBH value?
i) So amino acid sequences are better conserved, so in theory, yes. But this really depends on how sensitive you want the search to be and what you're looking for. I think for the generic use-case of trying to identify orthologs between two genomes, protein sequences work really well.
ii) There is no such thing as an
RBH
value, at least not that I know of. It's just two searches, with the queries and targets swapping places for the second search. You just get ane-value
like you get with a regularMMseqs2
/BLAST
search. I assume you ran the search without specifying an output format (via--format-output
), so it should have defaulted to theBLAST
tabular format. The output columns in this case correspond toquery, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits
. In this example here, thee-value
appears to be0.000E+00
.Edit: I see you added an image of the output. So yeah, all of those are reciprocal best hits to one another, and the
e-values
are in the second-to-last column.ahhh ok!!! I thought that RBH is a value. Instead it's a list of values of best hits. Ok perfect. But If I would like to use Blast+... is more complicate?
It's not just a list of best hits. It's a specific subset of the best hits you'd get from a normal one-way search.
When you do a one-way search you get something like this:
And so forth.
If you swap the datasets around and search, you'd get something like this:
Not how only
q1
andt1
have found each other in both tables. Now if in addition they both also have the loweste-values
for one another, they would be each other's reciprocal best hits.If you're interested in understanding this better, I think this here is a fairly good step-by-step explanation of
RBH
s and their relationship with orthology.And well, with
BLAST
you'd have to manually do everythingeasy-rbh
has automated for you, so yes it would be more complicated. And it'll also be significantly slower. Tools likeMMseqs2
have been developed specifically with the objective of supersedingBLAST
in mind.