Question

Finding the genome of origin of a protein on GenBank

6.1 years ago

Lille My ▴ 40

Hi all, I have a list of protein IDs of the WP_ type. I need to find the assemblies (GCA_, GCF_) they come from. Any ideas on how to do it? Thanks :-)

ncbi Assembly • 1.6k views

ADD COMMENT • link 6.1 years ago by Lille My ▴ 40

Have you checked NCBI's page on non-redundant RefSeq protein accession numbers? As far as I can see, these accession numbers are tied to genomic records that are not G* (GCA_/GCF_) accessions. See the example of WP_003547430. If you click on the genomic records under "Related information", you can see the constituent genomes.

On the command line you can use Entrez Direct to get this information:

 esearch -db protein -query "WP_003547430.1" | elink -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -element Caption,Title

This will give you (truncated for space):

NZ_ATYZ01000003 Rhizobium leguminosarum bv. viciae UPM1131 A19QDRAFT_scaffold_2.3_C, whole genome shotgun sequence
NZ_ATTP01000009 Rhizobium leguminosarum bv. viciae GB30 A3A3DRAFT_scaffold_8.9_C, whole genome shotgun sequence
NC_021905       Rhizobium etli bv. mimosae str. Mim1, complete genome
NZ_ARRT01000006 Rhizobium leguminosarum bv. viciae 248 RLEG17DRAFT_Scaffold1.7_C, whole genome shotgun sequence
NZ_MRDL01000032 Rhizobium leguminosarum bv. viciae USDA 2370 scaffold22, whole genome shotgun sequence
NZ_MRDM01000002 Rhizobium laguerreae strain FB206 scaffold16, whole genome shotgun sequence

ADD REPLY • link 6.1 years ago by GenoMax 150k

Interesting. When I manually checked a few on the website, I found a link through the Identical Protein Groups database. An example is WP_043107373.1, which, after some digging, turns out to be associated with GCF_000801295.1. It also has an NZ number (NZ_AP012978.1), but I think this might be the accession number of the gene.
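The Identical Protein Groups route can also be scripted. The sketch below is only an illustration: it assumes that `efetch -format ipg` returns a tab-separated table whose header row contains columns named `Protein` and `Assembly`, and the function name `ipg_protein_assembly` is made up for this example. It finds the two columns by header name and prints protein/assembly pairs.

```shell
# Reduce an Identical Protein Groups table (tab-separated, header on line 1)
# to unique "protein<TAB>assembly" pairs.
# Assumption: the header row has columns named "Protein" and "Assembly".
ipg_protein_assembly() {
    awk -F'\t' '
        NR == 1 {
            # Remember the position of every column by its header name.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Print only rows that actually carry an assembly accession.
        ("Protein" in col) && ("Assembly" in col) && $col["Assembly"] != "" {
            print $col["Protein"] "\t" $col["Assembly"]
        }
    ' | sort -u
}

# Assumed usage (needs network access and Entrez Direct on the PATH):
# efetch -db protein -id WP_043107373.1 -format ipg | ipg_protein_assembly
```

Because the protein accession sits on the same row as the assembly, each result stays paired with the protein it belongs to. The table may also list other members of the identical protein group, so filter on the first column of the output if only the WP_ accessions are wanted.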

ADD REPLY • link 6.1 years ago by Lille My ▴ 40

I can get the NZ* ID but not GCF* yet.

esearch -db protein -query "WP_043107373.1" | elink -target nuccore | efetch -format acc
NZ_AP012978.1

ADD REPLY • link 6.1 years ago by GenoMax 150k

Thanks a lot, it works. Do you know how to run this query in batch? I have a list of IDs in a file, and on the second attempt I get an error message.

ADD REPLY • link 6.1 years ago by Lille My ▴ 40

Post a few examples here. The idea would be to do something like this:

epost -db protein -input your_file_w_id | elink -target nuccore | elink -target assembly | esummary | xtract -pattern AssemblyAccession -element AssemblyAccession
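Before posting a whole file, it may help to clean the ID list first, since stray whitespace, Windows line endings, or non-accession lines are one possible (assumed, not confirmed) cause of batch errors. A minimal sketch follows. The function name `clean_ids` is hypothetical, and the pattern assumes IDs of the form `WP_` plus digits with an optional `.version` suffix.

```shell
# Normalise a list of WP_ accessions read from stdin:
#  - drop carriage returns left over from Windows line endings
#  - trim leading and trailing whitespace
#  - keep only lines that look like WP_<digits> or WP_<digits>.<version>
#  - remove duplicates
clean_ids() {
    tr -d '\r' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        grep -E '^WP_[0-9]+(\.[0-9]+)?$' |
        sort -u
}

# Assumed usage: write a cleaned copy and post that file instead.
# clean_ids < your_file_w_id > your_file_w_id.clean
```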

Please use ADD REPLY/ADD COMMENT when responding to existing answers to keep threads logically organized.

ADD REPLY • link 6.1 years ago by GenoMax 150k

Thanks, this works! I'm also trying to have it return the original query protein ID. Could you help with that as well?

ADD REPLY • link 6.1 years ago by Lille My ▴ 40

What do you mean by that? Just the ID that is in your own file/you are using for search?

You can accept @Sej's answer below to provide closure to this thread at some point.

ADD REPLY • link 6.1 years ago by GenoMax 150k

I'm using the following:

epost -input file-with-gi-numbers -db protein | elink -target nuccore -db protein | elink -target assembly | esummary  | xtract -pattern AssemblyAccession -element AssemblyAccession

I get a list of GCA numbers that is longer than the list of accessions. I would like to have the final result in the format:

<query> GCA_number, e.g. 12345 GCA_12345

ADD REPLY • link 6.1 years ago by Lille My ▴ 40

I don't think there is a way to do this within Entrez Direct. Since we are cross-linking between different databases, the information about the original query is not carried forward.
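One workaround, offered as a sketch rather than a built-in Entrez Direct feature, is to run the lookup once per ID so the query can be printed next to each result. The function names `lookup_assemblies` and `map_proteins_to_assemblies` and the file name `ids.txt` are hypothetical. The lookup itself reuses the elink/esummary/xtract pipeline from this thread. `NA` marks IDs that return no assembly.

```shell
# Look up the assembly accessions linked to one protein accession,
# using the same pipeline as elsewhere in this thread.
lookup_assemblies() {
    elink -db protein -id "$1" -target nuccore |
        elink -target assembly |
        esummary |
        xtract -pattern AssemblyAccession -element AssemblyAccession
}

# Read one protein accession per line on stdin and print
# "query<TAB>assembly", one line per linked assembly.
map_proteins_to_assemblies() {
    local id accs
    while IFS= read -r id; do
        [ -z "$id" ] && continue
        # Redirect stdin so the lookup cannot swallow the remaining IDs.
        accs=$(lookup_assemblies "$id" < /dev/null | sort -u)
        if [ -z "$accs" ]; then
            printf '%s\tNA\n' "$id"
            continue
        fi
        # Prefix every assembly accession with the query it came from.
        printf '%s\n' "$accs" | awk -v q="$id" '{ print q "\t" $0 }'
    done
}

# Assumed usage (needs network access and Entrez Direct on the PATH):
# map_proteins_to_assemblies < ids.txt > protein_to_assembly.tsv
```

The output can be longer than the input list because one WP_ accession may be linked to several assemblies. Each of them appears on its own line next to the query. Making one request per ID is slower than a single epost, so expect long lists to take a while.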

ADD REPLY • link 6.1 years ago by GenoMax 150k

score 2 · Answer 1 · 2019-03-12

6.1 years ago

Sej Modha 5.3k

The following command would return the assembly accession number:

elink -target nuccore -db protein -id "WP_043107373.1" | elink -target assembly | esummary | xtract -pattern AssemblyAccession -element AssemblyAccession

GCF_000801295.1

ADD COMMENT • link 6.1 years ago by Sej Modha 5.3k

Truncated for space.

$ elink -target nuccore -db protein -id "WP_003547430.1" | elink -target assembly | esummary | xtract -pattern AssemblyAccession -element AssemblyAccession
GCF_004307185.1
GCF_004307195.1
GCF_004307135.1
GCF_004307165.1
GCF_004307125.1
GCF_004303745.1
GCF_004307045.1
GCF_004307035.1
GCF_004307025.1
GCF_004306835.1
GCF_004306925.1
GCF_004306885.1

ADD REPLY • link 6.1 years ago by GenoMax 150k