Question

Retrieve all sequence ids from a master record



6.7 years ago

john ▴ 130

Consider the following genome entry in NCBI.

https://www.ncbi.nlm.nih.gov/nuccore/NZ_ABAX00000000.3

This is the master entry for the assembly project of a bacterium. As the note at the top of the page says, it does not contain any sequence. The genomic sequence is distributed across multiple other entries (NZ_DS499719-NZ_DS499744), one entry per assembled scaffold.

Hence, if I use, for example, the following Entrez query:

esearch -db nuccore -query 'NZ_ABAX00000000.3' | efetch -format fasta

it returns a FASTA file containing just "N".

My questions are the following:

How can I automatically identify whether an NCBI entry is such a master entry?
How can I automatically get all entries belonging to a master entry? (In this case NZ_DS499719-NZ_DS499744)

NCBI Assembly • 2.6k views


score 4 · Accepted Answer · 2018-03-22

How can I automatically identify whether an NCBI entry is such a master entry?

Master records have distinctive accession numbers. The accession of a master record consists of a four-letter prefix followed by zeroes. The number of zeroes varies: it increases to nine for Whole Genome Shotgun projects with one million or more contigs. To distinguish master records from ordinary records programmatically, you can use a regular expression. For example, here is a Python function that takes an accession number and returns True if it belongs to a master record.

import re

def is_master_record(accession):
    # Four capital letters, then only zeroes, then an optional ".version" suffix.
    return bool(re.search(r'[A-Z]{4}0+(\.\d+)?$', accession))

A quick validation:

NZ_ABAX000000000.2 True
NZ_ABAX000000000 True
NZ_ABAX00003200 False
NZ_YYYY00000 True
NZ_ABAX0000.1 True
NZ_DS499731.1 False
NZ_AAAAAA0000 True
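Because the function above uses re.search, it only looks at the end of the string, so any text in front of the four letters is accepted as well. Below is a minimal sketch of a stricter variant that matches the whole accession. The name is_master_record_strict is hypothetical, and the pattern assumes that an accession optionally starts with a two-letter RefSeq-style prefix and an underscore (such as NZ_), followed by a four- or six-letter project prefix; adjust it if your accessions look different.

```python
import re

# Assumed layout: optional "XX_" prefix, 4 or 6 capital letters,
# one or more zeroes, optional ".version" suffix.
MASTER_PATTERN = re.compile(r'(?:[A-Z]{2}_)?(?:[A-Z]{4}|[A-Z]{6})0+(?:\.\d+)?')


def is_master_record_strict(accession):
    """Return True if the whole accession looks like a master record."""
    return MASTER_PATTERN.fullmatch(accession) is not None


if __name__ == "__main__":
    for acc in ["NZ_ABAX000000000.2", "NZ_ABAX00003200", "NZ_DS499731.1"]:
        print(acc, is_master_record_strict(acc))
```

On the seven accessions listed above, this variant gives the same answers; it differs only for strings with extra leading text, which it rejects.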

How can I automatically get all entries belonging to a master entry? (In this case NZ_DS499719-NZ_DS499744)

The following command will give you all GenBank and RefSeq records related to the master entry NZ_ABAX00000000.3.

esearch -db genome -query NZ_ABAX00000000.3 | elink -target assembly | elink -target nuccore | efetch -format fasta

If you want RefSeq entries only (NZ_DS499719-NZ_DS499744), you can filter the list using efilter.

esearch -db genome -query NZ_ABAX00000000.3 | elink -target assembly | elink -target nuccore | efilter -query "refseq[Filter]" | efetch -format fasta

As a result, you will get FASTA sequences for the following entries:

NZ_DS499744.1 Anaerostipes caccae DSM 14662 Scfld_03_25, whole genome shotgun sequence
NZ_DS499743.1 Anaerostipes caccae DSM 14662 Scfld_03_24, whole genome shotgun sequence
NZ_DS499742.1 Anaerostipes caccae DSM 14662 Scfld_03_23, whole genome shotgun sequence
NZ_DS499741.1 Anaerostipes caccae DSM 14662 Scfld_03_22, whole genome shotgun sequence
NZ_DS499740.1 Anaerostipes caccae DSM 14662 Scfld_03_21, whole genome shotgun sequence
NZ_DS499739.1 Anaerostipes caccae DSM 14662 Scfld_03_20, whole genome shotgun sequence
NZ_DS499738.1 Anaerostipes caccae DSM 14662 Scfld_03_19, whole genome shotgun sequence
NZ_DS499737.1 Anaerostipes caccae DSM 14662 Scfld_03_18, whole genome shotgun sequence
NZ_DS499736.1 Anaerostipes caccae DSM 14662 Scfld_03_17, whole genome shotgun sequence
NZ_DS499735.1 Anaerostipes caccae DSM 14662 Scfld_03_16, whole genome shotgun sequence
NZ_DS499734.1 Anaerostipes caccae DSM 14662 Scfld_03_15, whole genome shotgun sequence
NZ_DS499733.1 Anaerostipes caccae DSM 14662 Scfld_03_14, whole genome shotgun sequence
NZ_DS499732.1 Anaerostipes caccae DSM 14662 Scfld_03_13, whole genome shotgun sequence
NZ_DS499731.1 Anaerostipes caccae DSM 14662 Scfld_03_12, whole genome shotgun sequence
NZ_DS499730.1 Anaerostipes caccae DSM 14662 Scfld_03_11, whole genome shotgun sequence
NZ_DS499729.1 Anaerostipes caccae DSM 14662 Scfld_03_10, whole genome shotgun sequence
NZ_DS499728.1 Anaerostipes caccae DSM 14662 Scfld_03_9, whole genome shotgun sequence
NZ_DS499727.1 Anaerostipes caccae DSM 14662 Scfld_03_8, whole genome shotgun sequence
NZ_DS499726.1 Anaerostipes caccae DSM 14662 Scfld_03_7, whole genome shotgun sequence
NZ_DS499725.1 Anaerostipes caccae DSM 14662 Scfld_03_6, whole genome shotgun sequence
NZ_DS499724.1 Anaerostipes caccae DSM 14662 Scfld_03_5, whole genome shotgun sequence
NZ_DS499723.1 Anaerostipes caccae DSM 14662 Scfld_03_4, whole genome shotgun sequence
NZ_DS499722.1 Anaerostipes caccae DSM 14662 Scfld_03_3, whole genome shotgun sequence
NZ_DS499721.1 Anaerostipes caccae DSM 14662 Scfld_03_2, whole genome shotgun sequence
NZ_DS499720.1 Anaerostipes caccae DSM 14662 Scfld_03_1, whole genome shotgun sequence
NZ_DS499719.1 Anaerostipes caccae DSM 14662 Scfld_03_0, whole genome shotgun sequence
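If you already know the first and last scaffold accession (here NZ_DS499719 and NZ_DS499744), the individual accessions can also be enumerated locally, for example to pass them to efetch one by one. The following is only a sketch: expand_accession_range is a hypothetical helper, and it assumes that the scaffolds are numbered contiguously, share the same prefix, and are given without a version suffix.

```python
import re


def expand_accession_range(first, last):
    """List every accession from first to last, inclusive.

    Assumes both accessions share the same non-numeric prefix and that
    the numbering between them is contiguous.
    """
    split = re.compile(r'^(.*?)(\d+)$')
    m_first, m_last = split.match(first), split.match(last)
    if not m_first or not m_last:
        raise ValueError("accessions must end in digits")
    if m_first.group(1) != m_last.group(1):
        raise ValueError("accessions must share the same prefix")

    prefix = m_first.group(1)
    width = len(m_first.group(2))  # keep zero padding, if any
    start, end = int(m_first.group(2)), int(m_last.group(2))
    if start > end:
        raise ValueError("first accession must not come after the last")

    return [f"{prefix}{number:0{width}d}" for number in range(start, end + 1)]


if __name__ == "__main__":
    accessions = expand_accession_range("NZ_DS499719", "NZ_DS499744")
    print(len(accessions), accessions[0], accessions[-1])
```

For the range above this yields 26 accessions, matching the 26 scaffolds listed.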