I have a question similar to the one posted here. I would like to check whether an organism has certain pathway modules present, given a GenBank file with CDS information.
The way I do it is that I create a .fasta file from the GenBank file and feed it to BlastKoala, which then returns a list of KOs present in the organism. I can then use this information to check for particular modules, e.g. the module M00010, which is described by
Definition K01647 (K01681,K01682) (K00031,K00030)
which translates to
K01647 AND (K01681 OR K01682) AND (K00031 OR K00030)
So if this expression evaluates to True, i.e. the required KOs are among the ones returned by BlastKoala, the module is complete and is therefore present in the organism.
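The check described above can be sketched in Python. This is only a minimal sketch under stated assumptions: it handles just the syntax used in the definition quoted here (a space means AND, a comma means OR, parentheses group), and it deliberately rejects anything else that can appear in module definitions. The names `module_complete` and `read_kos` are hypothetical, and `read_kos` assumes the KO assignments are available as text lines with a gene ID followed by an optional KO identifier; adapt it to whatever your download actually looks like.

```python
import re

KO_ID = re.compile(r"K\d{5}")
TOKEN = re.compile(r"K\d{5}|[(),]")


def read_kos(lines):
    """Collect KO identifiers from lines such as 'gene_id<TAB>K01647'.

    Assumed layout: first field is the gene ID, later fields (if any) are KOs.
    Genes without an assignment contribute nothing.
    """
    kos = set()
    for line in lines:
        fields = line.split()
        kos.update(f for f in fields[1:] if KO_ID.fullmatch(f))
    return kos


def module_complete(definition, kos):
    """Evaluate a module definition (space = AND, comma = OR) against a KO set."""
    # Refuse syntax this sketch does not understand instead of guessing.
    if TOKEN.sub("", definition).strip():
        raise ValueError("unsupported characters in definition")
    tokens = TOKEN.findall(definition)
    pos = 0

    def parse_or():
        # Lowest precedence: alternatives separated by commas.
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == ",":
            pos += 1
            right = parse_and()  # parse first so the tokens are always consumed
            result = result or right
        return result

    def parse_and():
        # Juxtaposed items (separated only by whitespace) are all required.
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] not in (",", ")"):
            right = parse_atom()
            result = result and right
        return result

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] in (",", ")"):
            raise ValueError("malformed definition")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1
            return value
        return tok in kos

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unbalanced parentheses")
    return result


if __name__ == "__main__":
    m00010 = "K01647 (K01681,K01682) (K00031,K00030)"
    found = read_kos(["gene1\tK01647", "gene2", "gene3\tK01681", "gene4\tK00030"])
    print(module_complete(m00010, found))
```

Note that this only tests whether the KO assignments satisfy the definition; it says nothing about whether those assignments are themselves correct, which is the actual question below.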
Is this the way to do it? I ask because I am unsure about how the output of BlastKoala is created. As far as I understand, the output is based only on sequence similarity, which might cause issues: just because two sequences are similar does not mean that the associated proteins are identical. So the risk is that I end up with a lot of false positives, because I might conclude that a cherry is the same as a strawberry just because both of them are red.
Is my concern justified? Can I assume that a module is (not) present based only on the output of BlastKoala, or do I need to run something afterwards, e.g. Inparanoid?
BLAST implements a local alignment heuristic, so it could easily report that sequences are similar merely because they share a domain. If you're concerned about non-orthologous sequences being caught in the process, then you should rely on a phylogenetic analysis (your own or one provided by a database).
Which database would you recommend? So the way I currently infer whether a module is present or not is error-prone?
Identifying an orthologous group based on a single BLAST alignment is definitely error-prone. Which database of orthologs to use depends first on which organism you're working with, since it has to be represented in the database. For vertebrates, I would recommend working with Ensembl and its Compara database.
Ok, thanks! Then I still wonder what the output of BlastKoala actually means: what can this information about KOs be used for? I always assumed that it is based on more than a single BLAST alignment, but I am not sure (that's why I asked here). Any suggestions for prokaryotes?
I think COG has prokaryotes. Also a quick search revealed:
Quartets-DB
ATGC
I don't work with prokaryotes, except for cloning plasmids :) so I can't advise on which one has the best features.