Hi all,
I have 3,300 binned contigs (bacterial sequences), and for each bin I would like to know the species (where possible) or the lowest common ancestor, i.e. the clade that bin most likely comes from. I understand that MEGAN is designed to do that; however, I have to build my own phylogenetic tree and would like to annotate these bins with clade information that is as specific as I can get. To do that, I have predicted all the protein-coding genes for each bin and extracted marker genes (ribosomal proteins and elongation factors) from them, anywhere between 30 and 80 genes per bin depending on the size of the genome or the completeness of the binning. I am now blasting these marker proteins against the nr database (with blastp, since I'm using their protein sequences) so that later I can use MEGAN or a similar tool to infer taxonomy.
My questions are the following: is there a better tool than MEGAN out there that can infer taxonomy from my sequences? At the current rate, blasting these marker genes against the nr database would take about a month, I think. Does anyone know any other methods or techniques to do this? I just want to come up with a list that maps each bin to its most specific taxonomy, be it at the species level, the genus level or a higher level, however specific it can get.
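To make the goal concrete, here is a minimal sketch of the per-bin assignment described above: given the taxonomic lineage of the best hit for each marker gene in a bin, report the deepest rank that all hits agree on. The function name and the example lineages are hypothetical, and this is a strict lowest common ancestor, not MEGAN's actual algorithm.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest root-to-tip prefix shared by all lineages."""
    if not lineages:
        return []
    shared = []
    # zip() walks the lineages rank by rank and stops at the shortest one.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # the hits disagree at this rank, so stop here
        shared.append(ranks[0])
    return shared


# Hypothetical best-hit lineages for three marker genes of one bin.
bin_hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Escherichia", "Escherichia coli"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Escherichia", "Escherichia albertii"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Salmonella", "Salmonella enterica"],
]

print(" > ".join(lowest_common_ancestor(bin_hits)))
```

For this made-up bin the shared prefix ends at the family level (Enterobacteriaceae). A strict version like this gets pulled up the tree by a single outlier hit, so in practice some kind of minimum-support threshold would probably be needed.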
Many thanks in advance.
Hi, thank you for your comments and sorry for the late reply. I was actually looking into checkm and trying to get it running. Unfortunately it uses Python 2, so I had to roll back my Python version to 2 and install the dependencies etc. I have just started a run and will see what it produces. Thank you for your suggestion, I hope I can get it to work.
The way to go is to create a virtualenv using python2 and install checkm in that virtualenv. You should try taxonomy_wf as well; this is the one you're looking for.
taxonomy_wf sounds more like it is meant for a particular phylum and involves specifying that phylum beforehand. The bins that I want to identify taxonomically are very diverse, so I don't see how I can run the taxonomy_wf command on them. lineage_wf, on the other hand, just extracts general marker genes, I think, and infers the various taxonomies from them. I'm quite new to this, though, and not entirely sure I'm understanding this correctly.
You're right. I took a look at my code: I first run lineage_wf and then tree_qa, with the lineage_wf output dir as input. tree_qa will then give the best assignment on the tree.
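The two steps described here could be scripted roughly as follows. This is only a sketch: it assumes the checkm executable is on the PATH (for example from the activated virtualenv), the directory names "bins" and "checkm_out" and the output file name are made up, and the flags should be checked against `checkm lineage_wf -h` and `checkm tree_qa -h` for the installed version. By default it only prints the commands.

```python
import shlex
import subprocess
from typing import List


def checkm_commands(bin_dir: str, out_dir: str,
                    extension: str = "fna", threads: int = 8) -> List[List[str]]:
    """Build the lineage_wf call and the follow-up tree_qa call."""
    lineage_wf = ["checkm", "lineage_wf",
                  "-x", extension,      # file extension of the bin FASTA files
                  "-t", str(threads),
                  bin_dir, out_dir]
    # tree_qa reads the lineage_wf output directory; the output file name
    # below is an arbitrary choice for this sketch.
    tree_qa = ["checkm", "tree_qa",
               "-o", "2", "--tab_table",
               "-f", f"{out_dir}/tree_qa.tsv",
               out_dir]
    return [lineage_wf, tree_qa]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


commands = checkm_commands("bins", "checkm_out")
run(commands)  # pass dry_run=False to actually launch checkm
```

Keeping the two commands in one list makes the order explicit: tree_qa only makes sense once lineage_wf has finished writing its output directory.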
Oh I see. Right now I'm running the lineage_wf command over my bins; maybe I should also try running tree_qa after it's done. Thanks for the advice.