Question

How to identify locus_tag by using RefSeq protein info (WP_*)

3.0 years ago

beginner123 ▴ 30

Hi, I would like to know the locus tag of a protein annotated with RefSeq (WP_*).

For example, I would like to identify the genomic location of a protein (WP_073031595.1) and also know its adjacent proteins.

The GenBank file has a locus tag and I can easily identify the gene location, but is there any way to identify the location on the genome in the RefSeq file?

NCBI tag RefSeq locus • 1.2k views

updated 3.0 years ago by MirianT_NCBI ▴ 800 • written 3.0 years ago by beginner123 ▴ 30

Hi, you can use the NCBI Datasets command-line tool to answer some of your questions (download instructions are on the NCBI Datasets site).

You can download a data package for a protein accession and find the gene location in every genome that carries that annotation. For example, using the WP accession you posted:

datasets download gene accession WP_073031595 --filename wp_073031595.zip
unzip wp_073031595.zip -d wp_073031595

Archive:  wp_073031595.zip
  inflating: wp_073031595/README.md  
  inflating: wp_073031595/ncbi_dataset/data/data_report.jsonl  
  inflating: wp_073031595/ncbi_dataset/data/annotation_report.jsonl  
  inflating: wp_073031595/ncbi_dataset/data/gene.fna  
  inflating: wp_073031595/ncbi_dataset/data/protein.faa  
  inflating: wp_073031595/ncbi_dataset/data/dataset_catalog.json

The RefSeq location is in the file annotation_report.jsonl. You can view it with jq:

jq . wp_073031595/ncbi_dataset/data/annotation_report.jsonl 
{
  "genbankGenomicLocation": {
    "assemblyAccession": "GCA_900129935.1",
    "sequenceRange": {
      "accessionVersion": "FQXJ01000017.1",
      "range": [
        {
          "begin": "76448",
          "end": "77617",
          "orientation": "plus"
        }
      ]
    }
  },
  "organism": {
    "organismName": "Desulfosporosinus lacus DSM 15449",
    "strain": "DSM 15449",
    "taxId": 1121420
  },
  "proteinAccession": "WP_073031595.1",
  "refseqGenomicLocation": {
    "assemblyAccession": "GCF_900129935.1",
    "sequenceRange": {
      "accessionVersion": "NZ_FQXJ01000017.1",
      "range": [
        {
          "begin": "76448",
          "end": "77617",
          "orientation": "plus"
        }
      ]
    }
  }
}

In your example, the WP accession is annotated on a single genome. For the protein accession WP_000997656, by contrast, annotation_report.jsonl has 4,338 lines, one per annotated genome.
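When the report has many lines, a small script can be easier to read than scrolling through pretty-printed JSON. The following is a minimal Python sketch, not part of NCBI Datasets, that pulls the RefSeq placement out of each JSONL record. It assumes only the field names visible in the jq output above; the names `Location` and `refseq_locations` are made up for this illustration.

```python
import json
from typing import Iterable, Iterator, NamedTuple


class Location(NamedTuple):
    protein: str
    assembly: str
    sequence: str
    begin: int
    end: int
    orientation: str


def refseq_locations(lines: Iterable[str]) -> Iterator[Location]:
    """Yield one Location per range under each record's refseqGenomicLocation."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        loc = record.get("refseqGenomicLocation")
        if not loc:
            continue  # this record has no RefSeq placement
        seq = loc.get("sequenceRange", {})
        for rng in seq.get("range", []):
            yield Location(
                protein=record.get("proteinAccession", ""),
                assembly=loc.get("assemblyAccession", ""),
                sequence=seq.get("accessionVersion", ""),
                begin=int(rng["begin"]),  # coordinates are strings in the report
                end=int(rng["end"]),
                orientation=rng.get("orientation", ""),
            )


# Record copied from the jq output above, reduced to the fields used here.
SAMPLE_LINE = json.dumps({
    "proteinAccession": "WP_073031595.1",
    "refseqGenomicLocation": {
        "assemblyAccession": "GCF_900129935.1",
        "sequenceRange": {
            "accessionVersion": "NZ_FQXJ01000017.1",
            "range": [{"begin": "76448", "end": "77617", "orientation": "plus"}],
        },
    },
})

for hit in refseq_locations([SAMPLE_LINE]):
    print(*hit, sep="\t")
```

To run it on the downloaded package, pass an open file handle for wp_073031595/ncbi_dataset/data/annotation_report.jsonl instead of the sample list.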

We currently don't have the locus_tag in our report. Is that information something that you or your research group would find useful? Please let us know. NCBI Datasets is in active development, and we would love to hear feedback from users.
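Until the report carries locus_tag, one possible workaround is to look the protein up in a GFF3 annotation of the RefSeq assembly (GCF_900129935.1 in this example). The sketch below is a hedged illustration only. It assumes you have obtained the GFF3 separately (for example with something like `datasets download genome accession GCF_900129935.1 --include gff3`, if your version of the CLI supports that option) and that its CDS rows carry `protein_id` and `locus_tag` attributes, as NCBI-produced GFF3 files typically do. The function names are hypothetical, and the sample rows use placeholder locus tags and neighbour accessions, not real annotation.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Cds(NamedTuple):
    seqid: str
    start: int
    end: int
    strand: str
    attrs: Dict[str, str]


def parse_attributes(field: str) -> Dict[str, str]:
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    return dict(item.split("=", 1) for item in field.split(";") if "=" in item)


def cds_features(lines: Iterable[str]) -> List[Cds]:
    """Collect CDS rows, sorted by sequence and start coordinate."""
    feats = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        feats.append(Cds(cols[0], int(cols[3]), int(cols[4]), cols[6],
                         parse_attributes(cols[8])))
    feats.sort(key=lambda f: (f.seqid, f.start))
    return feats


def locus_tag_with_neighbors(
    lines: Iterable[str], protein_id: str, flank: int = 1
) -> Optional[Tuple[Optional[str], List[Optional[str]]]]:
    """Return (locus_tag, neighbouring protein_ids) for the first matching CDS."""
    feats = cds_features(lines)
    for i, feat in enumerate(feats):
        if feat.attrs.get("protein_id") != protein_id:
            continue
        window = feats[max(0, i - flank):i] + feats[i + 1:i + 1 + flank]
        # Keep only neighbours on the same contig as the hit.
        same_seq = [f for f in window if f.seqid == feat.seqid]
        return feat.attrs.get("locus_tag"), [f.attrs.get("protein_id") for f in same_seq]
    return None


# Placeholder rows: only the middle coordinates and accession come from the thread.
SAMPLE_GFF = "\n".join([
    "##gff-version 3",
    "NZ_FQXJ01000017.1\texample\tCDS\t75000\t76300\t.\t+\t0\tID=cds-a;locus_tag=EXAMPLE_0001;protein_id=WP_EXAMPLE_A",
    "NZ_FQXJ01000017.1\texample\tCDS\t76448\t77617\t.\t+\t0\tID=cds-b;locus_tag=EXAMPLE_0002;protein_id=WP_073031595.1",
    "NZ_FQXJ01000017.1\texample\tCDS\t77700\t78500\t.\t-\t0\tID=cds-c;locus_tag=EXAMPLE_0003;protein_id=WP_EXAMPLE_C",
])

print(locus_tag_with_neighbors(SAMPLE_GFF.splitlines(), "WP_073031595.1"))
```

Note that this returns only the first match. A WP accession that occurs more than once in a genome, or a CDS split across several rows, would need extra handling.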

Feel free to reach out if you have any additional questions.

3.0 years ago by MirianT_NCBI ▴ 800

Working with WP_* records can be tricky because a single non-redundant protein accession can point to multiple species. See: https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/
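One way to check how widely a given WP accession is shared is to tally the organisms in the annotation_report.jsonl file from the data package discussed earlier in the thread. This is a rough Python sketch that assumes the `organism.taxId` and `organism.organismName` fields shown in that report; the function name `organisms_per_protein` is invented for the example.

```python
import json
from collections import Counter
from typing import Iterable


def organisms_per_protein(lines: Iterable[str]) -> Counter:
    """Count report records per (taxId, organismName) pair."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        org = json.loads(line).get("organism", {})
        counts[(org.get("taxId"), org.get("organismName"))] += 1
    return counts


# Organism block copied from the report shown earlier in the thread.
SAMPLE_RECORD = json.dumps({
    "organism": {
        "organismName": "Desulfosporosinus lacus DSM 15449",
        "strain": "DSM 15449",
        "taxId": 1121420,
    },
    "proteinAccession": "WP_073031595.1",
})

for (tax_id, name), n in organisms_per_protein([SAMPLE_RECORD]).most_common():
    print(tax_id, name, n, sep="\t")
```

More than one key in the resulting counter means the protein is annotated in several organisms, so a single locus_tag or location cannot be assumed.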

3.0 years ago by GenoMax 151k