Retrieve all sequence ids from a master record
6.7 years ago
john ▴ 130

Consider the following genome entry in NCBI.

https://www.ncbi.nlm.nih.gov/nuccore/NZ_ABAX00000000.3

This is a master entry for the assembly project of a bacterium. As you can see at the top of the page, it says that the record does not contain any sequence data. The genomic sequence is distributed across multiple other entries (NZ_DS499719-NZ_DS499744), one entry per assembled scaffold.

Hence, if I use, for example, the following Entrez query:

esearch -db nuccore -query 'NZ_ABAX00000000.3' | efetch -format fasta

it returns a FASTA file containing just "N".
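One way to notice this situation after the fact is to check whether the downloaded FASTA holds nothing but "N". The following is only a minimal sketch: the function name `is_placeholder_fasta` and the example header are hypothetical, and it assumes the whole FASTA text fits in memory.

```python
def is_placeholder_fasta(fasta_text):
    """Return True if the text has at least one FASTA header and every
    sequence line is empty or consists only of the letter N."""
    sequence_chars = set()
    saw_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            saw_header = True  # header line, not sequence
            continue
        sequence_chars.update(line.upper())
    # An empty set (no sequence at all) also counts as a placeholder.
    return saw_header and sequence_chars <= {"N"}


# Hypothetical header; only the single "N" sequence mirrors the case above.
print(is_placeholder_fasta(">example_master_record\nN\n"))
print(is_placeholder_fasta(">example_scaffold\nACGTN\n"))
```

This check only works after downloading, so it does not replace a test on the accession itself.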

My questions are the following:

  • How can I automatically identify whether an NCBI entry is such a master entry?
  • How can I automatically get all entries for a master entry? (In this case NZ_DS499719-NZ_DS499744)
NCBI Assembly • 2.6k views
6.7 years ago

How can I automatically identify whether an NCBI entry is such a master entry?

Master records have distinguishable accession numbers. The accession of a master record consists of a four-letter prefix followed only by zeros, optionally with a version suffix such as ".3". The number of zeros varies: it increases to nine for Whole Genome Shotgun projects with one million or more contigs. To distinguish master records from normal records programmatically, you can use a regular expression. For example, here is a Python function that takes an accession number as input and returns True if it belongs to a master record.

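import re

def is_master_record(accession):
    # A master record accession is a four-letter prefix followed only by zeros,
    # optionally ending in a version suffix such as ".3".
    return bool(re.search(r'[A-Z]{4}0+(\.\d+)?$', accession))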

A quick validation:

NZ_ABAX000000000.2 True
NZ_ABAX000000000 True
NZ_ABAX00003200 False
NZ_YYYY00000 True
NZ_ABAX0000.1 True
NZ_DS499731.1 False
NZ_AAAAAA0000 True
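For a script that has to handle a mixed list of accessions, the same idea can be used to split the list into master records and ordinary records before choosing which Entrez call to make. This is a sketch only: `split_accessions` and `MASTER_PATTERN` are hypothetical names, and the pattern is a variant of the one above that uses a raw string and allows multi-digit version suffixes.

```python
import re

# Four uppercase letters, then only zeros, then an optional ".<version>".
MASTER_PATTERN = re.compile(r"[A-Z]{4}0+(\.\d+)?$")


def split_accessions(accessions):
    """Return (masters, regular), preserving the input order in each list."""
    masters, regular = [], []
    for acc in accessions:
        if MASTER_PATTERN.search(acc):
            masters.append(acc)
        else:
            regular.append(acc)
    return masters, regular


masters, regular = split_accessions(["NC_004663.1", "NZ_ABAX00000000.3"])
print(masters)  # ['NZ_ABAX00000000.3']
print(regular)  # ['NC_004663.1']
```

The pattern is purely syntactic, so it relies on the accession naming convention holding for every record you feed it.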

How can I automatically get all entries for a master entry? (In this case NZ_DS499719-NZ_DS499744)

The following command will give you all GenBank and RefSeq records related to the master entry NZ_ABAX00000000.3.

esearch -db genome -query NZ_ABAX00000000.3 | elink -target assembly | elink -target nuccore | efetch -format fasta

If you want RefSeq entries only (NZ_DS499719-NZ_DS499744), you can filter the list using efilter.

esearch -db genome -query NZ_ABAX00000000.3 | elink -target assembly | elink -target nuccore | efilter -query "refseq[Filter]" | efetch -format fasta

As a result, you will get FASTA sequences for the following entries:

NZ_DS499744.1 Anaerostipes caccae DSM 14662 Scfld_03_25, whole genome shotgun sequence
NZ_DS499743.1 Anaerostipes caccae DSM 14662 Scfld_03_24, whole genome shotgun sequence
NZ_DS499742.1 Anaerostipes caccae DSM 14662 Scfld_03_23, whole genome shotgun sequence
NZ_DS499741.1 Anaerostipes caccae DSM 14662 Scfld_03_22, whole genome shotgun sequence
NZ_DS499740.1 Anaerostipes caccae DSM 14662 Scfld_03_21, whole genome shotgun sequence
NZ_DS499739.1 Anaerostipes caccae DSM 14662 Scfld_03_20, whole genome shotgun sequence
NZ_DS499738.1 Anaerostipes caccae DSM 14662 Scfld_03_19, whole genome shotgun sequence
NZ_DS499737.1 Anaerostipes caccae DSM 14662 Scfld_03_18, whole genome shotgun sequence
NZ_DS499736.1 Anaerostipes caccae DSM 14662 Scfld_03_17, whole genome shotgun sequence
NZ_DS499735.1 Anaerostipes caccae DSM 14662 Scfld_03_16, whole genome shotgun sequence
NZ_DS499734.1 Anaerostipes caccae DSM 14662 Scfld_03_15, whole genome shotgun sequence
NZ_DS499733.1 Anaerostipes caccae DSM 14662 Scfld_03_14, whole genome shotgun sequence
NZ_DS499732.1 Anaerostipes caccae DSM 14662 Scfld_03_13, whole genome shotgun sequence
NZ_DS499731.1 Anaerostipes caccae DSM 14662 Scfld_03_12, whole genome shotgun sequence
NZ_DS499730.1 Anaerostipes caccae DSM 14662 Scfld_03_11, whole genome shotgun sequence
NZ_DS499729.1 Anaerostipes caccae DSM 14662 Scfld_03_10, whole genome shotgun sequence
NZ_DS499728.1 Anaerostipes caccae DSM 14662 Scfld_03_9, whole genome shotgun sequence
NZ_DS499727.1 Anaerostipes caccae DSM 14662 Scfld_03_8, whole genome shotgun sequence
NZ_DS499726.1 Anaerostipes caccae DSM 14662 Scfld_03_7, whole genome shotgun sequence
NZ_DS499725.1 Anaerostipes caccae DSM 14662 Scfld_03_6, whole genome shotgun sequence
NZ_DS499724.1 Anaerostipes caccae DSM 14662 Scfld_03_5, whole genome shotgun sequence
NZ_DS499723.1 Anaerostipes caccae DSM 14662 Scfld_03_4, whole genome shotgun sequence
NZ_DS499722.1 Anaerostipes caccae DSM 14662 Scfld_03_3, whole genome shotgun sequence
NZ_DS499721.1 Anaerostipes caccae DSM 14662 Scfld_03_2, whole genome shotgun sequence
NZ_DS499720.1 Anaerostipes caccae DSM 14662 Scfld_03_1, whole genome shotgun sequence
NZ_DS499719.1 Anaerostipes caccae DSM 14662 Scfld_03_0, whole genome shotgun sequence
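If you only need the accession list from the FASTA output, the headers can be parsed with a few lines of Python. This is a minimal sketch, assuming each record starts with a ">" header whose first whitespace-separated token is the accession; `accessions_from_fasta` is a hypothetical helper name and the sequence lines in the example are placeholders, not real data.

```python
def accessions_from_fasta(fasta_text):
    """Collect the first token of every FASTA header line, in order."""
    accessions = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip headers that are empty after ">"
                accessions.append(fields[0])
    return accessions


# Headers follow the listing above; the sequences are placeholders.
example = (
    ">NZ_DS499744.1 Anaerostipes caccae DSM 14662 Scfld_03_25, whole genome shotgun sequence\n"
    "ACGT\n"
    ">NZ_DS499743.1 Anaerostipes caccae DSM 14662 Scfld_03_24, whole genome shotgun sequence\n"
    "ACGT\n"
)
print(accessions_from_fasta(example))  # ['NZ_DS499744.1', 'NZ_DS499743.1']
```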

Sweet.

That is a nice answer to the second question, but I am still not completely satisfied.

The problem is that some of my entries are fully assembled genomes, like NC_004663.1. Others, like NZ_ABAX00000000.3, are just the record for the assembly project. Hence I can't use the same Entrez call for both.

It seems that all accessions of assembly projects contain eight zeros, but I am not sure whether this is consistent. To build a script that handles both, I need a way to discriminate between them.


Sorry, somehow I missed your first question. I've just updated my answer.


Hey, I found a master record (NZ_ARET00000000.1) that does not work with your query.

esearch -db genome -query "NZ_ARET00000000.1" | elink -target assembly | elink -target nuccore | efilter -query "refseq[Filter]" | efetch -format acc

The master record says that only the accessions NZ_KB892637-NZ_KB892704 are the scaffolds of the project, but the query returns many more accessions, for example:

NZ_AUUC01000, NZ_KE3922, NZ_JPJF01000, NZ_PQGB0100

Is this problem caused by an incorrectly maintained NCBI database?


Hmm, that's interesting. I truly don't know how to interpret this case. I would also guess that this may have something to do with database maintenance. Can we somehow deduce which of the two results is correct: the information in the master record (NZ_KB892637-NZ_KB892704) or the results returned by the query (NZ_KB892637-NZ_KB892704 plus extra records)? In the worst case, we could contact NCBI about this issue.
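This does not settle which source is right, but it may help to list exactly which returned accessions fall outside the range stated in the master record. The sketch below expands an accession range and reports the extras. Both function names are hypothetical, it assumes the two endpoints share a prefix and differ only in their trailing digits, and the second accession in the usage example is made up for illustration.

```python
import re


def expand_accession_range(first, last):
    """Expand e.g. NZ_KB892637..NZ_KB892704 into a list of accessions."""
    m1 = re.fullmatch(r"(.*?)(\d+)", first)
    m2 = re.fullmatch(r"(.*?)(\d+)", last)
    if not m1 or not m2 or m1.group(1) != m2.group(1):
        raise ValueError("range endpoints must share a prefix and end in digits")
    prefix = m1.group(1)
    width = len(m1.group(2))  # keep leading zeros when formatting
    start, stop = int(m1.group(2)), int(m2.group(2))
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]


def unexpected_accessions(returned, expected):
    """Return the accessions in `returned` whose unversioned form is not in `expected`."""
    expected_set = set(expected)
    return [acc for acc in returned if acc.split(".")[0] not in expected_set]


expected = expand_accession_range("NZ_KB892637", "NZ_KB892704")
print(len(expected))  # 68
# "NZ_XX000001.1" is a made-up accession standing in for an extra record.
print(unexpected_accessions(["NZ_KB892637.1", "NZ_XX000001.1"], expected))
```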


That sounds really complicated. I will message NCBI and see what they say.
