Finding the genome of origin of a protein on GenBank
5.8 years ago
Lille My ▴ 30

Hi all, I have a list of protein IDs of the WP_ type. I need to find the assemblies (GCA_, GCF_) they come from. Any ideas on how to do it? Thanks :-)

ncbi Assembly • 1.5k views

Have you checked NCBI's page on non-redundant RefSeq protein accession numbers here? As far as I can see, these accession numbers are tied to genomic records that are not G*. See the example of WP_003547430. If you click on the genomic records under "Related information", you can see the constituent genomes.

On the command line you can use Entrez Direct (EDirect) to get this information:

esearch -db protein -query "WP_003547430.1" | elink -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -element Caption,Title

This will give you (truncated for space):

NZ_ATYZ01000003 Rhizobium leguminosarum bv. viciae UPM1131 A19QDRAFT_scaffold_2.3_C, whole genome shotgun sequence
NZ_ATTP01000009 Rhizobium leguminosarum bv. viciae GB30 A3A3DRAFT_scaffold_8.9_C, whole genome shotgun sequence
NC_021905       Rhizobium etli bv. mimosae str. Mim1, complete genome
NZ_ARRT01000006 Rhizobium leguminosarum bv. viciae 248 RLEG17DRAFT_Scaffold1.7_C, whole genome shotgun sequence
NZ_MRDL01000032 Rhizobium leguminosarum bv. viciae USDA 2370 scaffold22, whole genome shotgun sequence
NZ_MRDM01000002 Rhizobium laguerreae strain FB206 scaffold16, whole genome shotgun sequence

Interesting. When I manually checked a few on the website, I found a link through the Identical Protein Groups database. An example is WP_043107373.1, which, after some digging, turns out to be associated with GCF_000801295.1. It also has an NZ number (NZ_AP012978.1), but I think this might be the accession number of the gene.
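The Identical Protein Groups route can also be scripted. The following is only a sketch: it assumes that `efetch -format ipg` returns a tab-delimited report whose header row contains `Protein` and `Assembly` columns, and `ipg_assemblies` is a made-up helper name, not part of EDirect.

```shell
# ipg_assemblies (hypothetical helper) reads an IPG report on stdin and prints
# "<protein><TAB><assembly>" for every row that has an assembly.
# Assumption: the report is tab-delimited and its first line is a header
# that names the "Protein" and "Assembly" columns.
ipg_assemblies() {
    awk -F '\t' '
        # Remember the position of each column from the header row.
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Print only rows where both columns exist and Assembly is non-empty.
        ("Protein" in col) && ("Assembly" in col) && $(col["Assembly"]) != "" {
            print $(col["Protein"]) "\t" $(col["Assembly"])
        }
    '
}
```

It would be called along the lines of `efetch -db protein -id WP_043107373.1 -format ipg | ipg_assemblies`, which needs EDirect installed and network access. Looking the columns up by header name rather than by position keeps the sketch working if the column order differs from what is assumed here.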


I can get the NZ* ID but not GCF* yet.

esearch -db protein -query "WP_043107373.1" | elink -target nuccore | efetch -format acc
NZ_AP012978.1

Thanks a lot, it works. Do you know how to run this query in batch? I have a list of IDs in a file, and on the second attempt I get an error message.


Post a few examples here. The idea would be to do something like this:

epost -db protein -input your_file_w_id | elink -target nuccore | elink -target assembly | esummary | xtract -pattern AssemblyAccession -element AssemblyAccession

Please use ADD REPLY/ADD COMMENT when responding to existing answers to keep threads logically organized.


Thanks! This works. I'm also trying to have it return the original query protein ID. Could you also help with that? Thanks.


What do you mean by that? Just the ID that is in your own file/you are using for search?

You can accept @Sej's answer below to provide closure to this thread at some point.


I'm using the following:

epost -input file-with-gi-numbers -db protein | elink -target nuccore -db protein | elink -target assembly | esummary  | xtract -pattern AssemblyAccession -element AssemblyAccession

I get a list of GCA numbers that is longer than the list of accessions. I would like to have the final result in the format of:

<query> GCA_number or: 12345 GCA_12345


I don't think there is a way to do this within a single Entrez Direct pipeline. Since we are cross-linking between different databases, the information about the original query is not carried forward.
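One workaround is to look the IDs up one at a time, so the query is still known when the assemblies come back. This is only a sketch: `lookup_assemblies` and `map_ids` are made-up helper names, and the pipeline inside `lookup_assemblies` is the one from the answer in this thread. It is slower than a single batch query because each ID triggers its own requests.

```shell
# lookup_assemblies (hypothetical helper): assembly accessions for one protein ID,
# using the elink/esummary/xtract pipeline from the answer in this thread.
lookup_assemblies() {
    elink -db protein -id "$1" -target nuccore |
        elink -target assembly |
        esummary |
        xtract -pattern AssemblyAccession -element AssemblyAccession
}

# map_ids (hypothetical helper): reads one protein ID per line on stdin and
# prints "<query><TAB><assembly>" for every distinct assembly found.
map_ids() {
    local id acc
    while IFS= read -r id || [ -n "$id" ]; do
        [ -z "$id" ] && continue    # skip blank lines
        # stdin is redirected so the lookup cannot consume the remaining IDs
        lookup_assemblies "$id" < /dev/null | sort -u |
            while IFS= read -r acc; do
                printf '%s\t%s\n' "$id" "$acc"
            done
    done
}
```

Usage would be something like `map_ids < file-with-ids > id_to_assembly.tsv`. A protein that occurs in several genomes produces several lines, one per assembly, which is also why the plain batch output can be longer than the input list.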

5.8 years ago
Sej Modha 5.3k

The following command would return the assembly accession number:

elink -target nuccore -db protein -id "WP_043107373.1" | elink -target assembly | esummary | xtract -pattern AssemblyAccession -element AssemblyAccession

GCF_000801295.1

Truncated for space.

$ elink -target nuccore -db protein -id "WP_003547430.1" | elink -target assembly | esummary | xtract -pattern AssemblyAccession -element AssemblyAccession
GCF_004307185.1
GCF_004307195.1
GCF_004307135.1
GCF_004307165.1
GCF_004307125.1
GCF_004303745.1
GCF_004307045.1
GCF_004307035.1
GCF_004307025.1
GCF_004306835.1
GCF_004306925.1
GCF_004306885.1