Question

Extract NCBI's refseq assembly accession number from nuccore IDs

4.0 years ago

genomes_and_MGEs ▴ 10

Hey guys,

I have a list of nuccore IDs in a text file (let's call it file.txt) and want to put the NCBI RefSeq assembly accession number next to each nuccore ID, like this:

GCF_000006765.1_NC_002516.2

I tried the following command, but only the RefSeq assembly accession number shows up:

for file in $(cat file.txt) ; do esearch -db nuccore -query "$file" | elink -db assembly -target assembly | esummary | xtract -pattern DocumentSummary -element Caption,AssemblyAccession,BioSample >> GCFs_nucl_accessions.txt; done

Can you help me out? Thanks!

sequence • 2.4k views

ADD COMMENT • link updated 4.0 years ago by Ram 45k • written 4.0 years ago by genomes_and_MGEs ▴ 10

What if there was no assembly for a nucleotide sequence?

ADD REPLY • link 4.0 years ago by Michael 55k

All the nucleotide IDs I have correspond to either the chromosome or plasmids from complete bacterial genomes, so I expect each ID will have a corresponding assembly accession.

ADD REPLY • link 4.0 years ago by genomes_and_MGEs ▴ 10

OK, your query was fine. I think I fixed the shell code, so it should work as expected now.

ADD REPLY • link 4.0 years ago by Michael 55k

This is the assembly database record for the ID included above. Are you looking to get the NC* ID based on the GCF ID? By the way, GCF IDs are RefSeq IDs; they are not nuccore IDs.

ADD REPLY • link 4.0 years ago by GenoMax 151k

I have a list of NC* IDs, and want to append the corresponding GCF ID to each one.

ADD REPLY • link 4.0 years ago by genomes_and_MGEs ▴ 10

Please post more than one example. It is always good to do this when you ask questions about IDs. You can simply do this to get the GCF ID:

$ esearch -db assembly -query "NC_002516.2"  | esummary | xtract -pattern DocumentSummary -element RefSeq
GCF_000006765.1

ADD REPLY • link 4.0 years ago by GenoMax 151k

Sure, here are the top 5 IDs:

NC_002774.1
NC_003140.1
NC_005951.1
NC_006625.1
NC_007790.1

ADD REPLY • link updated 4.0 years ago by Ram 45k • written 4.0 years ago by genomes_and_MGEs ▴ 10

score 1 · Answer 1 · 2021-05-27

Using EntrezDirect:

$ more id

NC_002774.1
NC_003140.1
NC_005951.1
NC_006625.1
NC_007790.1

$ for i in $(cat id); do printf '%s\t' "${i}"; esearch -db assembly -query "${i}" | esummary | xtract -pattern DocumentSummary -element RefSeq; done
NC_002774.1 GCF_000009665.1
NC_003140.1 GCF_000009645.1
NC_005951.1 GCF_000011525.1
NC_006625.1 GCF_000009885.1
NC_007790.1 GCF_000013465.1

You can change the printed output as needed to concatenate the IDs the way you want them.
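
For the `GCF_..._NC_...` layout asked for in the question, the same pipeline could be wrapped in two small functions. This is only a sketch under stated assumptions: `lookup_gcf` and `join_ids` are hypothetical names, the `NA` placeholder for an ID with no assembly hit is an assumption, and the `< /dev/null` redirect is a precaution so that `esearch` cannot read the ID list from the loop's standard input.

```shell
# lookup_gcf (hypothetical helper): print the RefSeq assembly accession
# for one nuccore accession, using the same EDirect pipeline as above.
lookup_gcf() {
  # < /dev/null: precaution so esearch leaves the caller's stdin alone
  esearch -db assembly -query "$1" < /dev/null |
    esummary |
    xtract -pattern DocumentSummary -element RefSeq
}

# join_ids (hypothetical helper): read one nuccore accession per line
# from stdin and print "<GCF>_<NC>" pairs joined by an underscore.
join_ids() {
  while IFS= read -r nc || [ -n "$nc" ]; do
    [ -n "$nc" ] || continue          # skip blank lines
    gcf=$(lookup_gcf "$nc")
    # NA is an assumed placeholder when no assembly accession comes back
    printf '%s_%s\n' "${gcf:-NA}" "$nc"
  done
}
```

With the `id` file from above, `join_ids < id > GCFs_nucl_accessions.txt` should write lines such as `GCF_000009665.1_NC_002774.1`.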

Ram · Answer 2 · 2021-05-27

4.0 years ago

Michael 55k

Your shell code was almost correct; try the following:

for file in $(cat file.txt) ; do
   # Look up the assembly record, then append "_<nuccore ID>" to the result
   echo "$(esearch -db nuccore -query "$file" | \
   elink -db assembly -target assembly | \
   esummary | xtract -pattern DocumentSummary -element \
   Caption,AssemblyAccession,BioSample)_$file" >> GCFs_nucl_accessions.txt
done

ADD COMMENT • link 4.0 years ago by Michael 55k

Thanks for sharing this, but at least for me the code doesn't work exactly as expected; it outputs:

-bash: GCF_000009665.1_NC_002774.1: command not found
-bash: GCF_000009645.1_NC_003140.1: command not found
...

ADD REPLY • link updated 4.0 years ago by Ram 45k • written 4.0 years ago by genomes_and_MGEs ▴ 10
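
A message of this form generally means bash tried to execute the joined string as a command instead of printing it, which can happen when the command substitution ends up in command position (for example, if the `echo` in front of it is lost). The sketch below reproduces that behaviour without any network access. `fake_lookup`, `run_as_command`, and `print_joined` are hypothetical names, and `fake_lookup` is only a stand-in for the EDirect pipeline; whether this is what happened in the run above is an assumption.

```shell
# fake_lookup (hypothetical): stands in for the
# esearch | elink | esummary | xtract pipeline.
fake_lookup() { printf 'GCF_000009665.1\n'; }

# Substitution in command position: the shell tries to run
# "GCF_000009665.1_<ID>" as a program and reports "command not found".
run_as_command() {
  $(fake_lookup)_"$1"
}

# Substitution passed to printf as an argument: the string is printed.
print_joined() {
  printf '%s_%s\n' "$(fake_lookup)" "$1"
}
```

Here `print_joined NC_002774.1` prints the joined ID, while `run_as_command NC_002774.1` fails with exit status 127.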