Related to my previous post, I have a list of genes that have been pulled out from around 60 different bacterial species based on some criteria. Now, I would like to see if there is any functional enrichment in this group. My idea was to use DAVID; however, none of the locus_tag inputs are recognized. When I manually try some of them in NCBI's Gene database, they are found but often marked as "discontinued" or "updated".
Any idea how to work around this problem? locus_tag is the only gene identifier common to all these different bacterial species. Is there another tool for functional clustering that will accept locus_tag, or is there a way to translate those locus_tags into something DAVID can read?
Also, I was thinking of extracting the amino acid sequences of the hits and using them as queries to get protein domain signatures. Is there a tool that performs functional clustering based on protein signatures and domains?
Thank you, TP
Certainly! Here are some representative examples:
Thank you for taking a look at this.
I was able to recognise some of these using DAVID, but only by using the previous version (DAVID 6.7): https://david-d.ncifcrf.gov/
I was also trying to search for them in Entrez using the following Python script:
Execute this as
python LocusTagSearch.py -f 1 -e myname@gmail.com locustags.list
(locustags.list just contains a single list of your locus_tags.)
Using esearch, this script is capable of finding the species in which each locus_tag is found, but only in the Entrez nucleotide or nuccore databases, and I have been struggling to then extract the gene name for each using efetch. I have noted that these locus_tags appear to have mostly been discontinued.
Note, however, that with the species ID you can then download a txt file of the species and possibly parse out the info of interest. Here's an example for CTLon_0753: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=352951305&retmode=txt
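For what it's worth, the "parse out the info of interest" step could look roughly like the sketch below. This is not the script mentioned above. It assumes efetch is asked for a GenBank flat file (rettype=gb with retmode=text is my choice here, not the retmode=txt from the example URL), and the names efetch_url and features_by_locus_tag are made up for illustration. It only reads the feature table and collects qualifiers such as old_locus_tag, product and db_xref for every feature that carries a locus_tag.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(nuccore_id):
    """Build an efetch URL for a GenBank flat file (rettype=gb is an assumption)."""
    params = {"db": "nucleotide", "id": nuccore_id, "rettype": "gb", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def _store(feature, table):
    # Merge qualifiers of features sharing a locus_tag (e.g. gene + CDS).
    tag = feature.get("locus_tag")
    if tag:
        table.setdefault(tag, {}).update(feature)


def features_by_locus_tag(genbank_text):
    """Map locus_tag -> {qualifier: value} from a GenBank flat-file string.

    Simplification: if a qualifier repeats (several db_xref lines, say),
    only the last value is kept.
    """
    table, current, last_key = {}, {}, None
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if not line.startswith(" "):       # ORIGIN, CONTIG, // end the table
            break
        body = line.strip()
        if line[5:6] != " ":               # new feature line, e.g. "     CDS   1..300"
            _store(current, table)
            current, last_key = {}, None
        elif body.startswith("/"):         # qualifier line, e.g. /product="..."
            key, _, value = body[1:].partition("=")
            current[key] = value.strip('"')
            last_key = key
        elif last_key is not None:         # continuation of a wrapped qualifier value
            sep = "" if last_key == "translation" else " "
            current[last_key] += sep + body.strip('"')
    _store(current, table)
    return table
```

You would download the text behind efetch_url("352951305") with whatever HTTP client you prefer, pass it to features_by_locus_tag, and then look up each locus_tag in the returned dictionary. Whether a usable GeneID or old_locus_tag is actually present will depend on the record.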
I'm sure that there's still a way to do this, but I have run out of time.
Thank you Kevin, I really appreciate the effort. One thing I didn't mention is that I have, or rather know, all the bacterial species associated with the locus_tags. The problem is that only a minority of the records contain more useful info (such as a GeneID or an "old locus_tag" that is sometimes recognized by DAVID). The only consistent and unique feature common to all of them is locus_tag, which is unfortunately not recognized by DAVID.
My next step is to pull out amino acid sequences from each of the loci (which I've done) and then use InterPro to get all associated conserved domains. I am hoping I can find a tool that can do functional clustering based on these; otherwise I'll just use the domains as a proxy to get some enriched functions and work with these limitations in mind.
PS - your script will come in handy for some other projects I have, so thank you!
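To make the "domains as a proxy" idea concrete, here is a rough sketch of a per-domain over-representation test. It is only one possible approach, not an existing tool: it assumes the InterPro results have already been reduced to a dictionary of gene -> set of domain IDs for both the hit list and a background (for instance all annotated proteins from the same genomes, which is itself an assumption), and the function names and the placeholder domain IDs are invented for illustration. The p-value is a plain one-sided hypergeometric tail, so it would still need multiple-testing correction.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the domain."""
    upper = min(K, n)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)


def domain_enrichment(hits, background):
    """Rank domains by over-representation in the hit list.

    hits, background: dict mapping gene ID -> set of domain IDs.
    The hit genes are assumed to be a subset of the background genes.
    Returns (domain, hits_with_domain, background_with_domain, p_value)
    tuples sorted by p_value, uncorrected for multiple testing.
    """
    N, n = len(background), len(hits)
    domains = set().union(*hits.values())
    results = []
    for dom in domains:
        k = sum(dom in doms for doms in hits.values())
        K = sum(dom in doms for doms in background.values())
        results.append((dom, k, K, hypergeom_tail(k, K, n, N)))
    return sorted(results, key=lambda row: row[3])


# Toy data with placeholder domain IDs (not real InterPro accessions).
background = {f"gene{i}": {"DOM_B"} for i in range(10)}
for g in ("gene0", "gene1", "gene2"):
    background[g] = {"DOM_A", "DOM_B"}
hits = {g: background[g] for g in ("gene0", "gene1", "gene2")}

for row in domain_enrichment(hits, background):
    print(row)
```

In the toy data DOM_A is carried by every hit but only 3 of the 10 background genes, so it ranks first, while DOM_B is present everywhere and gets a p-value of 1. The choice of background matters a lot when pooling 60 species, so treat any result from something like this with that limitation in mind.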
Don't thank me! These guys here have some neat scripts for interrogating Entrez/RefSeq: https://wikis.utexas.edu/display/bioiteam/NCBI+Entrez+Interface
It's easy to modify these for custom use though: A: Need help to retrive sequences
Good luck!