How can I automatically identify whether an NCBI entry is such a master entry?
Master records have distinguishable accession numbers. Each master record accession consists of a four-letter prefix followed by zeros. The number of zeros varies: it increases to nine for Whole Genome Shotgun projects with one million or more contigs. To programmatically distinguish master records from normal records, you can use a regular expression. For example, here is a Python function that takes an accession number as input and returns True if it belongs to a master record.
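import re

def is_master_record(accession):
    # Four uppercase letters, a run of zeros, then an optional ".version" suffix.
    # \d+ (rather than a single \d) also accepts versions of 10 and above.
    return bool(re.search(r'[A-Z]{4}0+(\.\d+)?$', accession))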
A little validation:
NZ_ABAX000000000.2 True
NZ_ABAX000000000 True
NZ_ABAX00003200 False
NZ_YYYY00000 True
NZ_ABAX0000.1 True
NZ_DS499731.1 False
NZ_AAAAAA0000 True
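The list above can be reproduced with a short, self-contained check. This is only a sketch: it redefines the function with `\d+` so that multi-digit versions would also match (my assumption, not something tested against real records), and `EXPECTED` and `check_all` are hypothetical names introduced for illustration.

```python
import re

# Four uppercase letters, one or more zeros, optional ".version" suffix.
MASTER_RE = re.compile(r'[A-Z]{4}0+(\.\d+)?$')

def is_master_record(accession):
    """Return True if the accession looks like a WGS master record."""
    return bool(MASTER_RE.search(accession))

# The accessions and expected results from the validation list above.
EXPECTED = {
    "NZ_ABAX000000000.2": True,
    "NZ_ABAX000000000": True,
    "NZ_ABAX00003200": False,
    "NZ_YYYY00000": True,
    "NZ_ABAX0000.1": True,
    "NZ_DS499731.1": False,
    "NZ_AAAAAA0000": True,
}

def check_all(expected=EXPECTED):
    """Return the accessions whose classification differs from the expectation."""
    return [acc for acc, want in expected.items()
            if is_master_record(acc) != want]

if __name__ == "__main__":
    for acc in EXPECTED:
        print(acc, is_master_record(acc))
```

An empty list from `check_all()` means every accession was classified as listed.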
How can I get automatically all entries for a master entry? (In this case NZ_DS499719-NZ_DS499744)
The following command will give you all GenBank and RefSeq records related to the master entry NZ_ABAX00000000.3.
esearch -db genome -query NZ_ABAX00000000.3 | elink -target assembly | elink -target nuccore | efetch -format fasta
If you want RefSeq entries only (NZ_DS499719-NZ_DS499744), you can filter the list using efilter.
esearch -db genome -query NZ_ABAX00000000.3 | elink -target assembly | elink -target nuccore | efilter -query "refseq[Filter]" | efetch -format fasta
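If you want to call this pipeline from a script, one option is to assemble the command string in Python and hand it to the shell. The sketch below uses only the EDirect commands and flags shown above. `build_pipeline` and `fetch_fasta` are hypothetical helper names, and `fetch_fasta` assumes EDirect is installed and on your PATH.

```python
import shlex
import subprocess

def build_pipeline(accession, refseq_only=False):
    """Build the EDirect pipeline shown above as a single shell command string."""
    steps = [
        ["esearch", "-db", "genome", "-query", accession],
        ["elink", "-target", "assembly"],
        ["elink", "-target", "nuccore"],
    ]
    if refseq_only:
        # Same filter as in the efilter example above.
        steps.append(["efilter", "-query", "refseq[Filter]"])
    steps.append(["efetch", "-format", "fasta"])
    # Quote every argument so brackets and similar characters reach EDirect intact.
    return " | ".join(
        " ".join(shlex.quote(arg) for arg in step) for step in steps
    )

def fetch_fasta(accession, refseq_only=False):
    """Run the pipeline and return FASTA text (needs EDirect and network access)."""
    result = subprocess.run(
        build_pipeline(accession, refseq_only),
        shell=True, capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # Only prints the command; call fetch_fasta() to actually run it.
    print(build_pipeline("NZ_ABAX00000000.3", refseq_only=True))
```

Building the command separately from running it makes it easy to inspect the exact pipeline before sending anything to NCBI.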
As a result, you will get FASTA sequences for the following entries:
NZ_DS499744.1 Anaerostipes caccae DSM 14662 Scfld_03_25, whole genome shotgun sequence
NZ_DS499743.1 Anaerostipes caccae DSM 14662 Scfld_03_24, whole genome shotgun sequence
NZ_DS499742.1 Anaerostipes caccae DSM 14662 Scfld_03_23, whole genome shotgun sequence
NZ_DS499741.1 Anaerostipes caccae DSM 14662 Scfld_03_22, whole genome shotgun sequence
NZ_DS499740.1 Anaerostipes caccae DSM 14662 Scfld_03_21, whole genome shotgun sequence
NZ_DS499739.1 Anaerostipes caccae DSM 14662 Scfld_03_20, whole genome shotgun sequence
NZ_DS499738.1 Anaerostipes caccae DSM 14662 Scfld_03_19, whole genome shotgun sequence
NZ_DS499737.1 Anaerostipes caccae DSM 14662 Scfld_03_18, whole genome shotgun sequence
NZ_DS499736.1 Anaerostipes caccae DSM 14662 Scfld_03_17, whole genome shotgun sequence
NZ_DS499735.1 Anaerostipes caccae DSM 14662 Scfld_03_16, whole genome shotgun sequence
NZ_DS499734.1 Anaerostipes caccae DSM 14662 Scfld_03_15, whole genome shotgun sequence
NZ_DS499733.1 Anaerostipes caccae DSM 14662 Scfld_03_14, whole genome shotgun sequence
NZ_DS499732.1 Anaerostipes caccae DSM 14662 Scfld_03_13, whole genome shotgun sequence
NZ_DS499731.1 Anaerostipes caccae DSM 14662 Scfld_03_12, whole genome shotgun sequence
NZ_DS499730.1 Anaerostipes caccae DSM 14662 Scfld_03_11, whole genome shotgun sequence
NZ_DS499729.1 Anaerostipes caccae DSM 14662 Scfld_03_10, whole genome shotgun sequence
NZ_DS499728.1 Anaerostipes caccae DSM 14662 Scfld_03_9, whole genome shotgun sequence
NZ_DS499727.1 Anaerostipes caccae DSM 14662 Scfld_03_8, whole genome shotgun sequence
NZ_DS499726.1 Anaerostipes caccae DSM 14662 Scfld_03_7, whole genome shotgun sequence
NZ_DS499725.1 Anaerostipes caccae DSM 14662 Scfld_03_6, whole genome shotgun sequence
NZ_DS499724.1 Anaerostipes caccae DSM 14662 Scfld_03_5, whole genome shotgun sequence
NZ_DS499723.1 Anaerostipes caccae DSM 14662 Scfld_03_4, whole genome shotgun sequence
NZ_DS499722.1 Anaerostipes caccae DSM 14662 Scfld_03_3, whole genome shotgun sequence
NZ_DS499721.1 Anaerostipes caccae DSM 14662 Scfld_03_2, whole genome shotgun sequence
NZ_DS499720.1 Anaerostipes caccae DSM 14662 Scfld_03_1, whole genome shotgun sequence
NZ_DS499719.1 Anaerostipes caccae DSM 14662 Scfld_03_0, whole genome shotgun sequence
Sweet.
That is a nice answer to the second question, but I am still not completely satisfied.
The problem is that some of my entries are fully assembled genomes, like NC_004663.1. Others, like NZ_ABAX00000000.3, are just the record for the assembly project. Hence I can't use both with the same Entrez call.
It seems that all accessions of assembly projects contain eight zeros, but I am not sure whether this is consistent. To build a script that handles both, I need a way to discriminate between them.
Sorry, somehow I missed your first question. I've just updated my answer.
Hey, I found a master record (NZ_ARET00000000.1) that does not work with your query.
The master record says that only the accessions NZ_KB892637-NZ_KB892704 are the scaffolds of the project, but the query returns many more accessions:
NZ_AUUC01000, NZ_KE3922, NZ_JPJF01000, NZ_PQGB0100
Is the cause of this problem that the NCBI database is not maintained correctly?
Hmm, that's interesting. I truly don't know how to interpret this case. I would also guess that it has something to do with database maintenance. Can we somehow deduce which of these two results is correct: the information in the master record (NZ_KB892637-NZ_KB892704) or the results returned by esearch (NZ_KB892637-NZ_KB892704 plus extra records)? In the worst case, we can contact NCBI about this issue.
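This cannot tell us which source is right, but the discrepancy can at least be listed precisely. A minimal sketch, assuming the range in the master record always has the form PREFIX + number on both sides, and that the returned accessions carry a ".version" suffix that can be dropped (`expand_range` and `compare_with_master` are hypothetical helper names):

```python
import re

def expand_range(range_text):
    """Expand 'NZ_DS499719-NZ_DS499744' into every accession in the range."""
    first, last = range_text.split("-")
    m_first = re.match(r"^(.*?)(\d+)$", first)
    m_last = re.match(r"^(.*?)(\d+)$", last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError(f"cannot expand range: {range_text!r}")
    prefix = m_first.group(1)
    width = len(m_first.group(2))  # keep any zero padding
    start, stop = int(m_first.group(2)), int(m_last.group(2))
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]

def compare_with_master(range_text, returned):
    """Return (extra, missing) accessions relative to the master record range."""
    expected = set(expand_range(range_text))
    # Drop the ".version" suffix before comparing.
    found = {acc.split(".")[0] for acc in returned}
    return sorted(found - expected), sorted(expected - found)

if __name__ == "__main__":
    print(len(expand_range("NZ_KB892637-NZ_KB892704")))
```

Feeding the accessions parsed from the FASTA headers into `compare_with_master` gives the extra records and any expected scaffolds that did not come back, which is probably useful to include when writing to NCBI.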
That sounds really complicated. I will message NCBI and see what they say.