API for generating NCBI accession IDs (GenBank or RefSeq) from a list of species names?
2.8 years ago
Rijan ▴ 30

TL;DR: How can I convert a list of species names (common or scientific) into a corresponding list of NCBI database accession IDs, so that I can download each species’ reference genome from NCBI? The NCBI URLs do not accept common or scientific names, so a regular Python web-scraping script that builds each URL as base URL + item from the list does not work. (My list is at least 200 species long, so I want to automate this.)

What I am trying to do: Download the reference genomes/transcriptomes of some 200 species. My preferred source for this is the NCBI datasets (URL: https://www.ncbi.nlm.nih.gov/datasets/genomes/). I am trying to automate this just for convenience.

What I was hoping to do to automate the task: NCBI has a software package called “datasets” that takes an accession number (the GenBank GCA or RefSeq GCF ID) and downloads a zipped data package containing the genome. It is fairly easy to use as long as you have the accession number. To generate the accession IDs, I planned to write a Python web-scraping script: take a base URL, loop through the list of species names to build a list of URLs, request each one from NCBI’s servers, and use BeautifulSoup to look for the accession IDs in the returned HTML. But alas, it turns out the NCBI servers don’t take species names in the URL, only a specific taxon ID. I guess this makes sense because the same species can have many different data files associated with it. So, for example, to get NCBI data on the house mouse you have to pass “taxon=10090” instead of “mouse” in the URL. This is a problem I have not been able to work around because I do not know the taxon IDs for my 200 species. I know mouse is mouse, but I have no way of knowing that mouse is taxon=10090. I am looking for a resource to generate these taxon IDs.

What I tried: I looked at whether the Entrez ESearch utility (https://www.ncbi.nlm.nih.gov/books/NBK25501/) is the API that would get me these IDs, but it seems I might be barking up the wrong tree. ESearch uses URL-based data retrieval: you take a base URL like eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi? and add parameters such as db=genome (the database) and term=mouse (the search term) to get an XML file listing the mouse-related records in that NCBI database. The final URL looks like this: eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=genome&term=house+mouse[orgn]. I can get an XML file for any species, but these XML files do not seem to contain accession IDs for reference genomes. I also looked at this table (https://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.T._entrez_unique_identifiers_ui/?report=objectonly) to work out which database holds the reference genomes, without much luck. How can I solve this issue?

Any help is much appreciated!

NCBI genomes refrence • 2.2k views
2.8 years ago
MirianT_NCBI ▴ 760

Hi Rijan,

Just to be sure I understand what you want: you have a list of species and you would like to get the accession numbers for the reference genomes for each one of them, is that correct?

If so, you could try using the command datasets summary in combination with the flag --reference and loop through your list of species and print out the accession IDs. Here's an example:

species.txt

cat
dog
mouse

Command:

while IFS= read -r SPECIES; do
  datasets summary genome taxon "${SPECIES}" --reference --assmaccs
done < species.txt

This will print the list of accessions for all the species in your list in JSON format:

{"assemblies": [{"assembly": {"assembly_accession":"GCF_000001635.27"}}],"total_count": 1}{"assemblies": [{"assembly": {"assembly_accession":"GCF_011764305.1"}}],"total_count": 1}{"assemblies": [{"assembly": {"assembly_accession":"GCF_018350175.1"}}],"total_count": 1}{"assemblies": [{"assembly": {"assembly_accession":"GCF_014441545.1"}}],"total_count": 1}
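
If jq is not installed, the accessions can still be pulled out of this back-to-back JSON with a regular expression. This is only a minimal sketch: it assumes every accession has the form GCA_ or GCF_ followed by nine digits and a version suffix, and the sample string is copied from the first two objects of the output above.

```shell
# Two of the JSON objects from the output above, back to back on one line.
json='{"assemblies": [{"assembly": {"assembly_accession":"GCF_000001635.27"}}],"total_count": 1}{"assemblies": [{"assembly": {"assembly_accession":"GCF_011764305.1"}}],"total_count": 1}'

# grep -o prints each match on its own line; -E enables the {9} repetition.
accessions=$(printf '%s\n' "$json" | grep -oE 'GC[AF]_[0-9]{9}\.[0-9]+')
printf '%s\n' "$accessions"
```

In practice the `printf` would be replaced by the output of the loop above. A pattern match like this ignores the JSON structure entirely, so it is a fallback rather than a substitute for a real JSON parser.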

Personally, I prefer to use jq in this case, so the list is easier to read:

while IFS= read -r SPECIES; do
  # one TSV row per assembly: scientific name, taxid, accession
  datasets summary genome taxon "${SPECIES}" --reference |
    jq -r '.assemblies[].assembly
      | [.org.sci_name,.org.tax_id,.assembly_accession]
      | @tsv'
done < species.txt

Output:
Mus musculus    10090   GCF_000001635.27
Mustela putorius furo   9669    GCF_011764305.1
Felis catus     9685    GCF_018350175.1
Canis lupus familiaris  9615    GCF_014441545.1

If you want only the accession numbers, you can omit .org.sci_name,.org.tax_id, from the jq part of the command. And if you absolutely need the taxids, you can use the same command, take them from .org.tax_id, and print them out as a list.
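
If you keep the full table, the accession column can also be split off afterwards. The following is a minimal sketch, assuming the rows are tab-separated exactly as printed above; the function name accession_column and the "Unknown species" row are hypothetical and only there for illustration.

```shell
# accession_column: read name<TAB>taxid<TAB>accession rows on stdin and
# print only the accession, skipping rows where that field is missing.
accession_column() {
  awk -F'\t' 'NF >= 3 && $3 != "" { print $3 }'
}

# Two rows copied from the table above, plus a hypothetical row with no hit.
rows=$(printf 'Mus musculus\t10090\tGCF_000001635.27\nFelis catus\t9685\tGCF_018350175.1\nUnknown species\t\t\n')

accessions=$(printf '%s\n' "$rows" | accession_column)
printf '%s\n' "$accessions"

# Each accession could then be handed to the download step, for example:
#   datasets download genome accession "$ACC" --filename "$ACC.zip"
# (a separate --filename per accession avoids overwriting the default zip).
```

Saving the printed list to a file gives you something you can loop over later without querying the species names again.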

I hope this helps! :)
Please feel free to reach out if you run into any issues :)


Hi, Mirian. Yes, you understood my query right. Your answer is exactly what I was looking for, so thank you so much. You even wrote out the bash script that I was planning to write. This makes my day so much easier! I cannot thank you enough.


Hey, Mirian. I have a follow-up question regarding a couple of the EntrezDirect tools.

TL;DR: When you provide a query term to Entrez Direct's esearch, is there a way to search for similar terms (spelling-wise or context-wise) in NCBI's taxonomy database?

For example, esearch -db taxonomy -query "physcomitrella [orgn]" | esummary gives the following response,

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE DocumentSummarySet>
<DocumentSummarySet status="OK">
  <DbBuild>Build220217-1440.1</DbBuild>
  <DocumentSummary>
    <Id>3217</Id>
    <Status>active</Status>
    <Rank>genus</Rank>
    <Division>mosses</Division>
    <ScientificName>Physcomitrella</ScientificName>
    <TaxId>3217</TaxId>
    <AkaTaxId>0</AkaTaxId>
    <ModificationDate>2002/05/08 00:00</ModificationDate>
    <GenBankDivision>Plants and Fungi</GenBankDivision>
  </DocumentSummary>
</DocumentSummarySet>

So, it seems that the query is case-insensitive, but is there a way to make it less sensitive to spelling (perhaps by looking for the closest spelling from the command line) or less sensitive to the exact name? I tried to look for an answer here https://www.ncbi.nlm.nih.gov/books/NBK179288/, but could not find anything close to what I was looking for.

The elink tool has this description: "Elink looks up precomputed neighbors within a database, or finds associated records in other databases", but I cannot get it to work on queries against db=taxonomy. I was trying to see if I could search the NCBI taxonomy database based on the closeness of names to the provided query, or perhaps submit one of the common names of a species and get back the closest possible match in the taxonomy database. Going back to the above example, "Physcomitrella" is also known as "Physcomitrium", but I can only download it under the name "Physcomitrium" from the database.

Is there a way around this?


Hi Rijan,

I'm not sure if this is exactly what you're looking for, but we have a taxon-suggest in our REST API service. It works with scientific and common names, as well as taxids. It's a case-insensitive substring search, so it won't find anything that's misspelled. If you search for physcomitr with the higher taxon option, here's the result:

{
  "sci_name_and_ids": [
    {
      "sci_name": "Physcomitrium patens",
      "tax_id": "3218",
      "matched_term": "physcomitrium patens"
    },
    {
      "sci_name": "Physcomitrium",
      "tax_id": "37414",
      "matched_term": "physcomitrium"
    },
    {
      "sci_name": "Physcomitrella",
      "tax_id": "3217",
      "matched_term": "physcomitrella"
    },
    {
      "sci_name": "Paenibacillus physcomitrellae",
      "tax_id": "1619311",
      "matched_term": "paenibacillus physcomitrellae"
    },
    {
      "sci_name": "Physcomitrellopsis",
      "tax_id": "1031683",
      "matched_term": "physcomitrellopsis"
    },
    {
      "sci_name": "Physcomitrellopsis africana",
      "tax_id": "2050917",
      "matched_term": "physcomitrellopsis africana"
    },
    {
      "sci_name": "Physcomitrium acutifolium",
      "tax_id": "1921131",
      "matched_term": "physcomitrium acutifolium"
    },
    {
      "sci_name": "Physcomitrium collenchymatum",
      "tax_id": "1130754",
      "matched_term": "physcomitrium collenchymatum"
    },
    {
      "sci_name": "Physcomitrium eurystomum",
      "tax_id": "1130755",
      "matched_term": "physcomitrium eurystomum"
    },
    {
      "sci_name": "Physcomitrium hookeri",
      "tax_id": "2050918",
      "matched_term": "physcomitrium hookeri"
    },
    {
      "sci_name": "Physcomitrium immersum",
      "tax_id": "1094768",
      "matched_term": "physcomitrium immersum"
    },
    {
      "sci_name": "Physcomitrium japonicum",
      "tax_id": "2050919",
      "matched_term": "physcomitrium japonicum"
    },
    {
      "sci_name": "Physcomitrium lorentzii",
      "tax_id": "130440",
      "matched_term": "physcomitrium lorentzii"
    },
    {
      "sci_name": "Physcomitrium magdalenae",
      "tax_id": "487794",
      "matched_term": "physcomitrium magdalenae"
    },
    {
      "sci_name": "Physcomitrium pyriforme",
      "tax_id": "37415",
      "matched_term": "physcomitrium pyriforme"
    },
    {
      "sci_name": "Physcomitrium readeri",
      "tax_id": "1405089",
      "matched_term": "physcomitrium readeri"
    },
    {
      "sci_name": "Physcomitrium serratum",
      "tax_id": "130306",
      "matched_term": "physcomitrium serratum"
    },
    {
      "sci_name": "Physcomitrium spathulatum",
      "tax_id": "2050923",
      "matched_term": "physcomitrium spathulatum"
    },
    {
      "sci_name": "Physcomitrium sphaericum",
      "tax_id": "1094769",
      "matched_term": "physcomitrium sphaericum"
    },
    {
      "sci_name": "Physcomitrium subsphaericum",
      "tax_id": "2050924",
      "matched_term": "physcomitrium subsphaericum"
    }
  ]
}
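
To turn a response like this into a short table of candidates to pick from, something along these lines might work. It is a rough sketch that assumes the response is pretty-printed with one key per line and sci_name before tax_id, as in the result above; the function name suggestions_to_tsv is hypothetical, and a real JSON parser such as jq would be more robust if it is available.

```shell
# suggestions_to_tsv: read a pretty-printed response (one key per line, as
# shown above) on stdin and print tax_id<TAB>sci_name for every entry.
suggestions_to_tsv() {
  awk -F'"' '
    $2 == "sci_name" { name = $4 }            # remember the latest name
    $2 == "tax_id"   { print $4 "\t" name }   # emit once the taxid is seen
  '
}

# First two entries of the response above, used as sample input.
sample='{
  "sci_name_and_ids": [
    {
      "sci_name": "Physcomitrium patens",
      "tax_id": "3218",
      "matched_term": "physcomitrium patens"
    },
    {
      "sci_name": "Physcomitrium",
      "tax_id": "37414",
      "matched_term": "physcomitrium"
    }
  ]
}'

printf '%s\n' "$sample" | suggestions_to_tsv
```

The taxids in the first column can then be used in place of names in a datasets query, which sidesteps the synonym problem once you have picked the right entry.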

Let me know if that helps, or if you have any questions. :)

2.8 years ago
GenoMax 147k

Using EntrezDirect (output truncated for space):

% esearch -db assembly -query "mouse [orgn]" | esummary | xtract -pattern DocumentSummary -element AssemblyAccession
GCA_921997125.1
GCA_921999005.1
GCA_921998335.1
GCA_921997145.1
GCA_921998635.1
GCA_921999865.1

Using a scientific name:

% esearch -db assembly -query "Gallus gallus [orgn]" | esummary | xtract -pattern DocumentSummary -element AssemblyAccession
GCA_016700215.2
GCF_016699485.2
GCF_016700215.1
GCF_000002315.6
GCA_000002315.4

If you specifically want RefSeq:

% esearch -db assembly -query "Gallus gallus [orgn]" | esummary | xtract -pattern DocumentSummary -element RefSeq
GCF_016699485.2
GCF_016700215.1
GCF_000002315.6
GCF_000002315.4
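
Since a query can return several versions of the same assembly, you may want only the most recent version of each. A minimal sketch, assuming accessions always have the form base.version with a numeric version; the function name latest_versions is hypothetical and the input is the RefSeq list above.

```shell
# latest_versions: read accessions (one per line) on stdin and print, for each
# accession base such as GCF_000002315, only the highest version seen.
latest_versions() {
  awk -F. '
    NF == 2 {
      base = $1; ver = $2 + 0
      if (!(base in best)) { order[++n] = base; best[base] = ver }
      else if (ver > best[base]) { best[base] = ver }
    }
    END { for (i = 1; i <= n; i++) print order[i] "." best[order[i]] }
  '
}

printf '%s\n' GCF_016699485.2 GCF_016700215.1 GCF_000002315.6 GCF_000002315.4 | latest_versions
```

The xtract output can be piped straight into the function; results are printed in the order each assembly was first seen.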

Awesome thanks!
