How to go from GenBank assembly identifier to taxonomic name? (Python or CLI)
2.7 years ago
O.rka ▴ 740

I have a set of viral GenBank assembly identifiers such as these: ['GCA_001041635.1', 'GCA_002958635.1', 'GCA_000915755.1', 'GCA_001041575.1']

How can I get the actual taxonomic names for these?

My process right now is cumbersome:

  1. Look up assembly ID on NCBI https://www.ncbi.nlm.nih.gov/assembly/GCF_001041635.1/

  2. Click on Related Information -> Taxonomy https://www.ncbi.nlm.nih.gov/taxonomy?LinkName=assembly_taxonomy&from_uid=4780598

  3. Click on the taxon under "Links from Assembly": https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1500814

  4. Copy "Lineage": Viruses; Duplodnaviria; Heunggongvirae; Uroviricota; Caudoviricetes; Caudovirales; Siphoviridae; Pahexavirus; Propionibacterium virus PHL095N00

How can I automate this process? I wrote a crude web-scraping function, but I get locked out, and it is neither reliable nor robust.

Essentially I'm looking for one of the following approaches:

  1. A Python module, installable with pip or conda, that provides a function f(genbank_assembly_id) -> taxonomic_lineage_string (probably unlikely to exist).

  2. A flat file with at least two columns, one holding the GenBank accession ID and the other the lineage.

taxonomy metagenomics database genbank • 1.1k views
2.7 years ago
GenoMax 147k

Using EntrezDirect:

$ esearch -db assembly -query "GCA_001041635" | elink -target taxonomy | efetch -format native -mode xml | grep ScientificName | awk -F ">|<" 'BEGIN{ORS=", ";}{print $3;}'
Propionibacterium phage PHL095N00, Viruses, Duplodnaviria, Heunggongvirae, Uroviricota, Caudoviricetes, Caudovirales, Siphoviridae, Pahexavirus, Propionibacterium virus PHL095N00,

I will leave crafting a loop to get output described in #2 up to you.
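
One way such a loop might look, as a rough sketch rather than a tested recipe: the function names names_to_lineage and assembly_lineages are made up for illustration, identifiers.list is a placeholder file with one accession per line, and the one-second pause is only a guess at what keeps NCBI from throttling you. Note that, as in the one-liner above, the first name in each line is the organism itself, followed by its lineage.

```shell
#!/usr/bin/env bash

# Join the ScientificName values found in taxonomy XML (read from stdin)
# into a single "; "-separated line. Same grep/awk idea as the one-liner
# above, without the trailing separator.
names_to_lineage() {
    grep ScientificName |
    awk -F '>|<' '{ printf "%s%s", (NR > 1 ? "; " : ""), $3 } END { if (NR) print "" }'
}

# Print "accession<TAB>names" for every accession listed in the file "$1".
# Assumes EntrezDirect (esearch, elink, efetch) is installed and on PATH.
assembly_lineages() {
    while read -r acc; do
        [ -n "$acc" ] || continue          # skip blank lines
        # </dev/null stops esearch from swallowing the rest of the list
        lineage=$(esearch -db assembly -query "$acc" </dev/null |
                  elink -target taxonomy |
                  efetch -format native -mode xml |
                  names_to_lineage)
        printf '%s\t%s\n' "$acc" "$lineage"
        sleep 1                            # assumed pause to stay under rate limits
    done < "$1"
}
```

Calling `assembly_lineages identifiers.list > lineages.tsv` should then give the two-column file from #2, provided every accession resolves to exactly one taxonomy record.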


for ID in $(cat identifiers.list); do [YOUR CODE]; done

Awesome, thank you so much.

2.7 years ago

with bio (https://www.bioinfo.help/) you can do:

bio search GCA_001041635

that prints:

[
{
    "assembly_accession": "GCA_001041635.1",
    "bioproject": "",
    "biosample": "",
    "wgs_master": "",
    "refseq_category": "na",
    "taxid": "1500814",
    "species_taxid": "1982283",
    "organism_name": "Propionibacterium phage PHL095N00",
    "infraspecific_name": "",
    "isolate": "PHL095N00",
    "version_status": "latest",
    "assembly_level": "Complete Genome",
    "release_type": "Major",
    "genome_rep": "Full",
    "seq_rel_date": "2015/04/20",
    "asm_name": "ViralProj288014",
    "submitter": "",
    "gbrs_paired_asm": "GCF_001041635.1",
    "paired_asm_comp": "identical",
    "ftp_path": "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/041/635/GCA_001041635.1_ViralProj288014",
    "excluded_from_refseq": "",
    "relation_to_type_materialasm_not_live_date": "ICTV species exemplar"
}

]

you can turn the output into tab-delimited or CSV output, with or without headers

bio search GCA_001041635 -tab | cut -f 6

will print the taxid

1500814

you can chain that into bio taxon with:

 bio search GCA_001041635 -tab | cut -f 6 | bio taxon --lineage

it prints:

superkingdom,10239,Viruses
  clade,2731341,Duplodnaviria
    kingdom,2731360,Heunggongvirae
      phylum,2731618,Uroviricota
        class,2731619,Caudoviricetes
          order,28883,Caudovirales
            family,10699,Siphoviridae
              genus,1982251,Pahexavirus
                species,1982283,Propionibacterium virus PHL095N00
                  no rank,1500814,Propionibacterium phage PHL095N00

having worked all that out, you can list all of your accessions in one shot like so:

bio search GCA_001041635.1 GCA_002958635.1 GCA_000915755.1 GCA_001041575.1 -tab | cut -f 6 | bio taxon --lineage

that will print:

superkingdom,10239,Viruses
  clade,2731341,Duplodnaviria
    kingdom,2731360,Heunggongvirae
      phylum,2731618,Uroviricota
        class,2731619,Caudoviricetes
          order,28883,Caudovirales
            family,10699,Siphoviridae
              genus,1982251,Pahexavirus
                species,1982283,Propionibacterium virus PHL095N00
                  no rank,1500814,Propionibacterium phage PHL095N00
superkingdom,10239,Viruses
  clade,2731341,Duplodnaviria
    kingdom,2731360,Heunggongvirae
      phylum,2731618,Uroviricota
        class,2731619,Caudoviricetes
          order,28883,Caudovirales
            family,10699,Siphoviridae
              genus,1982251,Pahexavirus
                no rank,2079398,unclassified Pahexavirus
                  species,2079407,Propionibacterium phage pa35
superkingdom,10239,Viruses
  clade,2731341,Duplodnaviria
    kingdom,2731360,Heunggongvirae
      phylum,2731618,Uroviricota
        class,2731619,Caudoviricetes
          order,28883,Caudovirales
            family,10699,Siphoviridae
              genus,1922243,Sextaecvirus
                species,1922247,Staphylococcus virus SEP9
                  no rank,1434319,Staphylococcus phage vB_SepS_SEP9
superkingdom,10239,Viruses
  clade,2731341,Duplodnaviria
    kingdom,2731360,Heunggongvirae
      phylum,2731618,Uroviricota
        class,2731619,Caudoviricetes
          order,28883,Caudovirales
            family,10699,Siphoviridae
              genus,1982251,Pahexavirus
                species,1982301,Propionibacterium virus PHL301M00
                  no rank,1500831,Propionibacterium phage PHL301M00
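
if you would rather have one lineage string per line (closer to the flat file asked for in the question), the indented output could be flattened with a bit of awk. This is only a sketch: flatten_lineage is a made-up helper name, and it assumes each lineage block starts with an unindented line and that every line has the rank,taxid,name layout shown above.

```shell
#!/usr/bin/env bash

# Flatten indented "rank,taxid,name" lineage blocks (stdin) into one line
# per block: <taxid of the last node> TAB <names joined with "; ">.
# Assumption: a line without leading whitespace starts a new block.
flatten_lineage() {
    awk '
        function emit() {
            if (n > 0) printf "%s\t%s\n", taxid, lineage
            n = 0; lineage = ""
        }
        /^[^ \t]/ { emit() }                 # unindented line: previous block is done
        NF {
            line = $0
            sub(/^[ \t]+/, "", line)         # drop the indentation
            split(line, f, ",")
            taxid = f[2]
            # everything after the second comma is the name
            name = substr(line, length(f[1]) + length(f[2]) + 3)
            lineage = (n++ ? lineage "; " name : name)
        }
        END { emit() }
    '
}
```

used like this:

bio search GCA_001041635.1 GCA_002958635.1 -tab | cut -f 6 | bio taxon --lineage | flatten_lineage

the first column is then the taxid of the last node in each block, so mapping it back to the accession still relies on the taxid column from bio search.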