Question

taxonomic identification least common ancestor approach

6.1 years ago

Moses ▴ 150

Hi all,

I have 3,300 binned contigs (bacterial sequences) for which I would like to know the species (where possible) or the least common ancestor, i.e. the clade each bin most likely comes from. I understand that MEGAN is designed to do that; however, I have built my own phylogenetic tree and would like to annotate these bins with clade information that is as specific as I can get. To do that, I have predicted all the protein-coding genes for each bin and extracted marker genes (ribosomal proteins and elongation factors) from them (anywhere between 30 and 80 genes per bin, depending on the size of the genome or the completeness of the binning). Since I'm using the protein sequences, I am running blastp on these marker proteins against the nr database so that I can later use MEGAN or a similar tool to infer taxonomy.

My questions are the following: is there a better tool than MEGAN that can infer taxonomy from my sequences? At the current rate, blasting these marker genes against the nr database would take about a month, I think. Does anyone know of other methods or techniques to do this? I just want to come up with a list that maps each bin to its most specific taxonomy, be it at the species level, genus level or higher, as specific as it can get.
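For reference, the least common ancestor idea itself can be sketched in a few lines of Python. This is only an illustration, not how MEGAN implements it: it assumes each hit for a bin has already been turned into a root-to-leaf lineage (a list of names), and the function name and the two example lineages are made up for the sketch.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest prefix shared by all root-to-leaf lineages."""
    if not lineages:
        return []
    shared: List[str] = []
    # zip(*lineages) walks the lineages rank by rank and stops at the shortest one.
    for names_at_rank in zip(*lineages):
        if all(name == names_at_rank[0] for name in names_at_rank):
            shared.append(names_at_rank[0])
        else:
            break
    return shared


# Illustrative lineages for the hits of one bin (hypothetical input).
hit_lineages = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Escherichia", "Escherichia coli"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Salmonella", "Salmonella enterica"],
]

lca = lowest_common_ancestor(hit_lineages)
print(lca[-1] if lca else "unclassified")
```

With hits that agree down to family but differ at genus, the bin would be assigned at the family level, which is the "as specific as it can get" behaviour described above.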

Many thanks in advance.

phylogeny taxonomy phylum genus • 2.0k views

ADD COMMENT • link updated 6.1 years ago by Asaf 10k • written 6.1 years ago by Moses ▴ 150

score 0 · Answer 1 · 2019-05-09

6.1 years ago

Asaf 10k

To answer some of your questions: you can use DIAMOND instead of BLAST, which will cut the running time dramatically. I also don't see why you are comparing your proteins against nr; the reference dataset should be much smaller, containing only the specific orthology groups from bacteria.
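A minimal sketch of what the DIAMOND step could look like, written as a small Python helper that only builds the command lines. The file names (marker_reference.faa, bin_markers.faa), the e-value cutoff and the number of target sequences are assumptions for illustration, not values from this thread; the smaller reference FASTA has to be assembled separately.

```python
from typing import List, Tuple


def diamond_commands(reference_faa: str, query_faa: str, db_prefix: str,
                     out_tsv: str, threads: int = 8) -> Tuple[List[str], List[str]]:
    """Build the argv lists for 'diamond makedb' and 'diamond blastp'."""
    # Build a DIAMOND database from the (small) reference protein FASTA.
    makedb = ["diamond", "makedb", "--in", reference_faa, "-d", db_prefix]
    # Protein-vs-protein search with BLAST-style tabular output (format 6).
    blastp = [
        "diamond", "blastp",
        "-d", db_prefix,
        "-q", query_faa,
        "-o", out_tsv,
        "--outfmt", "6",
        "--max-target-seqs", "25",   # assumed cutoff, tune as needed
        "--evalue", "1e-5",          # assumed cutoff, tune as needed
        "--threads", str(threads),
    ]
    return makedb, blastp


# Hypothetical file names, for illustration only.
makedb_cmd, blastp_cmd = diamond_commands(
    "marker_reference.faa", "bin_markers.faa",
    "marker_reference", "bin_markers.diamond.tsv",
)
print(" ".join(makedb_cmd))
print(" ".join(blastp_cmd))
```

To actually run the search, each list can be passed to `subprocess.run(cmd, check=True)` on a machine where DIAMOND is installed.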

As for other tools, you can use checkm lineage_wf, to which you can supply your list of genes (--genes), and it will give you the best taxonomic identity. I'm not sure, but I guess there is a way to alter the database to use your taxonomic tree (I haven't fully understood what it contains).

ADD COMMENT • link 6.1 years ago by Asaf 10k

Hi, thank you for your comments and sorry for the late reply. I was actually looking into checkm and trying to get it running. Unfortunately it uses Python 2, so I had to roll back my Python version to 2 and install the dependencies. I just started a run and will see what it produces. Thank you for your suggestion, I hope I can get it to work.

ADD REPLY • link 6.1 years ago by Moses ▴ 150

The way to go is to create a virtualenv using python2 and install checkm in that virtualenv. You should try taxonomy_wf as well; this is the one you're looking for.

ADD REPLY • link 6.1 years ago by Asaf 10k

taxonomy_wf sounds more like it is meant for a particular phylum and requires specifying that phylum beforehand. The bins I want to identify are very diverse, so I don't see how I could run the taxonomy_wf command. lineage_wf, on the other hand, just extracts general marker genes, I think, and infers the various taxonomies from them. I'm quite new to this, though, and not entirely sure I'm understanding it correctly.

ADD REPLY • link 6.1 years ago by Moses ▴ 150

You're right. I took a look at my code. I first run lineage_wf and then tree_qa with the lineage_wf output dir as input. tree_qa will give the best assignment on the tree.
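Roughly, the two steps could be chained like this. It is a sketch of the order of operations, not my actual code: the directory names, the fa extension and the thread count are placeholders, and the helper only builds the command lines.

```python
from typing import List, Tuple


def checkm_commands(bin_dir: str, out_dir: str, extension: str = "fa",
                    threads: int = 8) -> Tuple[List[str], List[str]]:
    """Build argv lists for 'checkm lineage_wf' followed by 'checkm tree_qa'."""
    # Step 1: place the bins in the reference tree and run the lineage workflow.
    lineage_wf = [
        "checkm", "lineage_wf",
        "-x", extension,          # file extension of the bin FASTA files
        "-t", str(threads),
        bin_dir, out_dir,
    ]
    # Step 2: report the tree placement, using the lineage_wf output dir as input.
    tree_qa = [
        "checkm", "tree_qa",
        "--tab_table",
        "-f", out_dir + "/tree_qa.tsv",   # hypothetical output file name
        out_dir,
    ]
    return lineage_wf, tree_qa


# Placeholder directory names, for illustration only.
lineage_cmd, tree_qa_cmd = checkm_commands("bins", "checkm_out")
print(" ".join(lineage_cmd))
print(" ".join(tree_qa_cmd))
```

Each list can be run in order with `subprocess.run(cmd, check=True)` inside the python2 virtualenv where checkm is installed.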

ADD REPLY • link 6.1 years ago by Asaf 10k

Oh, I see. Right now I'm running the lineage_wf command over my bins; maybe I should also try running tree_qa after it's done. Thanks for the advice.

ADD REPLY • link 6.1 years ago by Moses ▴ 150