Hi,
I'm new to bioinformatics and am still learning how all the public databases interconnect so please bear with me :)
I have de novo assembled a set of bacterial genomes, located and extracted putative ORFs, and want to do a very simple overview of "my" genomes compared to the RefSeq genomes for this bacterium. I used discontiguous megablast on the NCBI web page to compare all extracted genes and got a nice result. Using the web interface I can select the matched genes, look at the alignments, and see putative annotations of matching or overlapping "known" genes. By annotations I mean things like "flagellar motor protein".
The problem is that when I try to download the results from the web page, I lose the annotations. They are simply not anywhere in the XML or ASN.1 output, but they obviously are on the web page. So my question is: from which database, and how, did the NCBI BLAST result page extract this information? I want to integrate that step into my pipeline, so I'd rather not use blast2go or some other big program for it, but I could write some Python. There are so many different ID numbers associated with the results, but I can't find any single one that maps to this information.
I'm aware that these annotations should be taken with a huge grain of salt and that they can be misleading, etc. I just want a cursory glance at the data that is more interesting to look at than 1500 base start/stop numbers.
I do have the BLAST+ suite locally, and the refseq_genomic db as well. Maybe it is possible to look this info up using some of the blastdb commands?
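To make the local route concrete, here is a rough sketch of what I have in mind, not something I have verified end to end. It assumes the tabular output of BLAST+ (`-outfmt 6`) with the `stitle` column requested, and keeps the best-scoring hit per ORF. The function names `blast_command` and `best_hits`, the column selection, and the file name `orfs.fa` are just my own placeholders.

```python
import csv
import io

# Columns requested from BLAST+ via -outfmt "6 ..."; stitle is the subject title.
FIELDS = ["qseqid", "sseqid", "pident", "length", "evalue", "bitscore", "stitle"]


def blast_command(query_fasta, db="refseq_genomic"):
    """Build (but do not run) a blastn command line that emits tabular output."""
    return [
        "blastn",
        "-task", "dc-megablast",
        "-query", query_fasta,
        "-db", db,
        "-outfmt", "6 " + " ".join(FIELDS),
    ]


def best_hits(tabular_text):
    """Return {query id: hit row}, keeping the highest bitscore for each query."""
    best = {}
    reader = csv.DictReader(
        io.StringIO(tabular_text),
        fieldnames=FIELDS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # titles may contain quote characters
    )
    for row in reader:
        row["bitscore"] = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


if __name__ == "__main__":
    # Hypothetical file name; the command is only printed here, not executed.
    print(" ".join(blast_command("orfs.fa")))
```

The idea would be to run the printed command, feed its output to `best_hits`, and read the `stitle` field of each hit. My worry is that for a genomic database `stitle` is probably just the title of the whole subject sequence rather than the per-gene annotation shown on the web page, which is really what I am asking about.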
Thanks