I have a fairly large BLAST output from a metagenomics project with millions of lines per sample. The input sequences for the blast search were prodigal annotated proteins. I have been looking for a while now to find a suitable program that calculates the LCA of the BLAST outputs of the sequences.
I have looked into the following softwares:
MEGAN6: This is based on a UI (there is a command line interface, which is only available for certain editions). I am very hesitant to ever use any GUI, so this is not optimal.
BASTA: In principle, this tool would be suitable. However, the implementation is not ideal, since it doesn't allow for parallel processing of files (it locks the database folder) and it is incredibly slow when running it on larger files.
What alternatives are out there? I am debating whether it's worth writing everything from scratch, but usually it isn't.
Any inputs would be highly appreciated.
What did you use as your reference database? It's really a rather simple process, like just a few LOC. This is assuming that you have the lineage information of your hits at hand..
I used the
nr
diamond database (which is essentially a modified NCBI database). Thus I have the NCBI TaxIDs available in the second column of the output.So essentially all you need to do then is to decide the threshold for your hits, join your blast output with lineage information (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/), and come up with a good awk oneliner..
Edit. for relatively simple things like this, I recommend doing it yourself vs. using some black(ish) box and not fully understanding what is happening. For example, if you were to use Megan, it could just skip many of your hits because they're not in its txid - taxinfo map file. Of course it wouldn't even tell you that something like that occurred (at least this used to be the case many years ago)
Hi @5heikki I realize this post is rather old. However I am really interested in this last comment from you. I wonder if you have already written the codes which would do as you describe and whether you would be willing to share it? i.e. identify LCA from a list of blast results. I am not good enough at coding to write such code from scratch but if I could see an example I might be able to modify it to fit my specific data.
Here's an example:
A list of NCBI tax ids:
The lineages of these tax ids:
A simple loop which stops after there's no more common "column":
So here the last common ancestor would be an ancestral Firmucutes bacterium. This is one way to do it. I'm sure that there are more sophisticated methods available. Note that a single horizontal gene transfer event (or e.g. including a tax id from a blast hit which shouldn't be included) to some unrelated bacterial group would take the ancestor down to Bacteria..
I never saw this reply ! And here I am years later looking for the same thing - Thank you ! I'll give it a try now.
Does anyone know an R function/code to do the LCA from a blast output that already contains all the ranks on each column? Something like this:
qseqid superkingdom phylum class order family genus species
id1 Eukaryota Chordata Mammalia Rodentia Muridae Mus Mus musculus
id1 Eukaryota Chordata Mammalia Rodentia Caviidae Cavia Cavia porcellus
We want to get the following:
id1 Eukaryota Chordata Mammalia Rodentia
This is two lines, but we would have multiple ids and multiple lines for each.
If you know the lineage information is complete you could compare which columns are identical, however, there may be unranked levels in between and I would always calculate the LCA from the species name only or much better the TaxId using the NCBI taxonomy and some LCA tool or library. There are many that can calculate the LCA: - https://taxonomy.jgi-psf.org/tax/ - https://github.com/apcamargo/taxopy - https://bioinf.shenwei.me/taxonkit/ - the get_lca function in BioPerl https://bioperl.org/howtos/Trees_HOWTO.html
Inferring the taxon from the scientific name is not as robust as using the taxon id and not all tools can infer the taxon from the name, therefore you should always output the taxonid in your blast results..