LCA from BLAST output
1
2
Entering edit mode
4.1 years ago
makrez ▴ 60

I have a fairly large BLAST output from a metagenomics project with millions of lines per sample. The input sequences for the blast search were prodigal annotated proteins. I have been looking for a while now to find a suitable program that calculates the LCA of the BLAST outputs of the sequences.

I have looked into the following softwares:

  • MEGAN6: This is based on a UI (there is a command line interface, which is only available for certain editions). I am very hesitant to ever use any GUI, so this is not optimal.

  • BASTA: In principle, this tool would be suitable. However, the implementation is not ideal, since it doesn't allow for parallel processing of files (it locks the database folder) and it is incredibly slow when running it on larger files.

What alternatives are out there? I am debating whether it's worth writing everything from scratch, but usually it isn't.

Any inputs would be highly appreciated.

lca blast • 3.1k views
ADD COMMENT
0
Entering edit mode

What did you use as your reference database? It's really a rather simple process, like just a few LOC. This is assuming that you have the lineage information of your hits at hand..

ADD REPLY
0
Entering edit mode

I used the nr diamond database (which is essentially a modified NCBI database). Thus I have the NCBI TaxIDs available in the second column of the output.

ADD REPLY
0
Entering edit mode

So essentially all you need to do then is to decide the threshold for your hits, join your blast output with lineage information (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/), and come up with a good awk oneliner..

Edit. for relatively simple things like this, I recommend doing it yourself vs. using some black(ish) box and not fully understanding what is happening. For example, if you were to use Megan, it could just skip many of your hits because they're not in its txid - taxinfo map file. Of course it wouldn't even tell you that something like that occurred (at least this used to be the case many years ago)

ADD REPLY
0
Entering edit mode

Hi @5heikki I realize this post is rather old. However I am really interested in this last comment from you. I wonder if you have already written the codes which would do as you describe and whether you would be willing to share it? i.e. identify LCA from a list of blast results. I am not good enough at coding to write such code from scratch but if I could see an example I might be able to modify it to fit my specific data.

ADD REPLY
2
Entering edit mode

Here's an example:

A list of NCBI tax ids:

$ cat list
138119
1507783
1602889
1856264
221807
527029
529122
537377
696750
712173

The lineages of these tax ids:

$ join -1 1 -2 1 <(sort list) <(sort -k1,1 fullnamelineage.dmp) | awk 'BEGIN{FS=" \\| "}{print $3}'
cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Clostridia; Eubacteriales; Peptococcaceae; Desulfitobacterium; Desulfitobacterium hafniense; |
cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus; unclassified Bacillus (in: Bacteria); |
cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus; unclassified Bacillus (in: Bacteria); |
cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Oceanobacillus; unclassified Oceanobacillus; |
cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus; unclassified Bacillus (in: Bacteria); |
cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus; Bacillus cereus group; Bacillus thuringiensis; Bacillus thuringiensis serovar pondicheriensis; |
cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus; Bacillus cereus group; Bacillus thuringiensis; |
cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Clostridia; Eubacteriales; unclassified Eubacteriales; |
cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus; Streptococcus dysgalactiae group; Streptococcus dysgalactiae; Streptococcus dysgalactiae subsp. equisimilis; |
cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus; unclassified Bacillus (in: Bacteria); |

A simple loop which stops after there's no more common "column":

$ for i in $(seq 1 10); do join -1 1 -2 1 <(sort list) <(sort -k1,1 fullnamelineage.dmp) | awk 'BEGIN{FS=" \\| "}{print $3}' | cut -f"$i" -d ";" | sort -u>tmpf; if [ $(wc -l < tmpf) = "1" ]; then awk '{printf "%s ;", $0}' tmpf; else break; fi; done
cellular organisms ; Bacteria ; Terrabacteria group ; Firmicutes ;

So here the last common ancestor would be an ancestral Firmucutes bacterium. This is one way to do it. I'm sure that there are more sophisticated methods available. Note that a single horizontal gene transfer event (or e.g. including a tax id from a blast hit which shouldn't be included) to some unrelated bacterial group would take the ancestor down to Bacteria..

ADD REPLY
0
Entering edit mode

I never saw this reply ! And here I am years later looking for the same thing - Thank you ! I'll give it a try now.

ADD REPLY
0
Entering edit mode

Does anyone know an R function/code to do the LCA from a blast output that already contains all the ranks on each column? Something like this:

qseqid superkingdom phylum class order family genus species

id1 Eukaryota Chordata Mammalia Rodentia Muridae Mus Mus musculus

id1 Eukaryota Chordata Mammalia Rodentia Caviidae Cavia Cavia porcellus


We want to get the following:

id1 Eukaryota Chordata Mammalia Rodentia

This is two lines, but we would have multiple ids and multiple lines for each.

ADD REPLY
1
Entering edit mode

If you know the lineage information is complete you could compare which columns are identical, however, there may be unranked levels in between and I would always calculate the LCA from the species name only or much better the TaxId using the NCBI taxonomy and some LCA tool or library. There are many that can calculate the LCA: - https://taxonomy.jgi-psf.org/tax/ - https://github.com/apcamargo/taxopy - https://bioinf.shenwei.me/taxonkit/ - the get_lca function in BioPerl https://bioperl.org/howtos/Trees_HOWTO.html

Inferring the taxon from the scientific name is not as robust as using the taxon id and not all tools can infer the taxon from the name, therefore you should always output the taxonid in your blast results..

ADD REPLY
0
Entering edit mode
12 months ago
Dunois ★ 2.8k

You should probably be using MMseqs2's taxonomy module for this: https://github.com/soedinglab/MMseqs2#taxonomy .

ADD COMMENT

Login before adding your answer.

Traffic: 2508 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6