5.7 years ago
Lille My
Hi all, I have a list of protein ids, of the WP_ type. I need to find the assembly (GCA_, GCF_ ) they come from. Any ideas on how to do it? Thanks :-)
Have you checked NCBI's page on non-redundant RefSeq protein accession numbers here? As far as I can see, these accession numbers are tied to genomic records that are not G*. See the example of WP_003547430. If you click on "genomic records" under "Related information" you can see the constituent genomes.
On the command line you can use Entrez Direct to get this information:
This will give you (truncated for space):
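For reference, one way to go from a WP_ accession to its assemblies is the Identical Protein Groups (IPG) report, which lists the genomic records that carry the protein. A minimal sketch, assuming Entrez Direct is installed and that the report is tab-delimited with a header row containing a column named `Assembly`. The function names are made up for illustration, and this is not necessarily the command originally posted here:

```shell
#!/bin/sh
# Read a tab-delimited IPG report on stdin and print the unique,
# non-empty values of its "Assembly" column. The column is located by
# header name, which is an assumption about the report layout.
assembly_column() {
  awk -F '\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Assembly") col = i; next }
    col && $col != "" { print $col }
  ' | sort -u
}

# Fetch the IPG report for one protein accession and list its assemblies.
# Needs network access and Entrez Direct (efetch) on PATH.
wp_to_assembly() {
  efetch -db protein -id "$1" -format ipg | assembly_column
}
```

It would be called as `wp_to_assembly WP_043107373.1`. A non-redundant WP_ protein can occur in many genomes, so expect more than one assembly per accession.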
Interesting. When I manually checked a few on the website, I found a link through the Identical Protein Groups database. An example is WP_043107373.1, which through some digging is associated with GCF_000801295.1. It also has an NZ number (NZ_AP012978.1), but I think this might be the accession number of the gene.
I can get the NZ* ID but not GCF* yet.
Thanks a lot, it works. Do you know how to do this query in batch? I have a list of IDs in a file, and at the second attempt there is an error message.
Post a few examples here. Idea would be to do something like this:
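A batch run can be sketched as a plain shell loop that handles one accession per iteration. Two things can break such loops after the first ID: Windows line endings in the ID file, and a lookup command that reads standard input and so consumes the rest of the list. The sketch guards against both. `for_each_id` and `DELAY` are hypothetical names, and the lookup is passed in as an argument so that whatever per-accession command worked for a single ID can be reused:

```shell
#!/bin/sh
# Run a lookup command once per accession listed in a file (one ID per line).
# $1 = ID file, $2 = name of a command or function taking one accession.
for_each_id() {
  # Strip carriage returns so IDs saved on Windows do not carry a stray \r.
  tr -d '\r' < "$1" | while IFS= read -r id; do
    [ -n "$id" ] || continue        # skip blank lines
    # Give the lookup an empty stdin so it cannot swallow the ID list.
    "$2" "$id" < /dev/null
    # Pause between requests; NCBI asks for at most three requests per
    # second without an API key. The default of 1 second is arbitrary.
    sleep "${DELAY:-1}"
  done
}
```

With a file named `ids.txt` (an assumed name) and a per-accession function `my_lookup`, the call would be `for_each_id ids.txt my_lookup`.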
Please use ADD REPLY/ADD COMMENT when responding to existing answers to keep threads logically organized.
Thanks! This works. I'm also trying to have it return the original query protein ID. Could you also help with that? Thanks
What do you mean by that? Just the ID that is in your own file/you are using for search?
You can accept @Sej's answer below to provide closure to this thread at some point.
I'm using the following:
I get a list of GCA numbers that is longer than the list of accessions. I would like to have the final result in the format of:
<query> GCA_number or: 12345 GCA_12345
I don't think there is a way to do this within Entrez Direct. Since we are cross-linking to different databases, the information about the original query is not carried forward.
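One workaround, sketched here under assumptions, is to keep track of the query outside the Entrez Direct pipeline: run one accession at a time and prepend it to every line of output. `with_query` is a hypothetical helper, and the lookup is passed in as a command that takes one accession and prints one result per line:

```shell
#!/bin/sh
# Run a lookup for one accession and prefix each output line with that
# accession (tab-separated), so the query stays paired with its results.
# $1 = accession, $2 = name of a command or function taking one accession.
with_query() {
  "$2" "$1" < /dev/null | awk -v q="$1" '{ print q "\t" $0 }'
}
```

Inside a loop this would look like `while IFS= read -r id; do with_query "$id" my_lookup; done < ids.txt`, where `my_lookup` and `ids.txt` are placeholder names. A protein present in several assemblies produces several lines with the same query, which would also explain a result list longer than the list of accessions.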