Getting taxid from accession number
1
1
Entering edit mode
6.7 years ago
auninsaw ▴ 10

Hi - I have a question that is related to this thread:

Automatically Getting The Ncbi Taxonomy Id From The Genbank Identifier

I assumed if I had a new question, I needed to open a new thread. I apologize if I should have posted under the original question.

I have a file with a list of accession numbers in it:

KU587513
KU587514
KU605633

I have a bash shell script I am using from that referenced thread to get the taxid from each accession using efetch:

#!/bin/bash
file="accessions_out.txt"
while read -r ACC
do
     touch taxid_out.txt
     curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id={ACC}&rettype=fasta&retmode=xml" |\
     grep TSeq_taxid |\
     cut -d '>' -f 2 |\
     cut -d '>' -f 1 >> taxid_out.txt
done <"$file"

This works and gives me an output file taxid_out.txt containing:

1844429
1636871
129076

The problem I have is when the file with the accession numbers includes an accession number that has no corresponding taxid, I can't figure out how to ouptut something like "no_match". For example, this accessions file:

KU587513
KUBOGUS
KU605633

gives the following output file:

1844429
129076

I want it to give:

1844429
no_match
129076

I don't know how to code this correctly. This is what I have tried:

 #!/bin/bash
file="accessions_out.txt"
while read -r ACC
do
     touch taxid_out.txt
     VAR = $(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id={ACC}&rettype=fasta&retmode=xml")
     if [ -z "$VAR" ]
     then
        "no_match"  >> taxid_out.txt 
     else
         taxid="$(grep 'TSeq_taxid' $VAR)" #at this point I'm happy to just print the whole TSeq_taxid
         "$taxid" >> taxid_out.txt
     fi
done <"$file"

The problem with this is if the accession number has no corresponding taxid, the curl returns something with <error> but I am not sure how to grab that. My real list of accession numbers is ~10000, and there are a few "bad apples" in the list that I am having trouble finding. Getting "no_match" printed would help me find those. I am not very familiar with shell scripting, but someone who is could probably figure this out very quickly. I would sincerely appreciate any help. Thanks -

taxid taxonomy genbank eutils efetch • 6.0k views
ADD COMMENT
1
Entering edit mode
6.7 years ago
Joe 21k

Without testing this, my first guess would be that your VAR test is failing. Also, you shouldn't have spaces around the VAR =.

You'll want to try something like this instead, to tell the differece:

if [ -z ${VAR+x} ]; then echo "VAR is unset"; else echo "var is set to '$VAR'"; fi

There is a difference between a variable that is unset, versus empty.

See:

https://stackoverflow.com/questions/3601515/how-to-check-if-a-variable-is-set-in-bash

As an alternative approach, you might be interested in parsing the tabular assembly_summary files which contain the accessions and the TaxIDs:

e.g.

$ wget --no-check-certificate https://ftp.ncbi.nih.gov/genomes/genbank/assembly_summary_genbank.txt

or

$ wget --no-check-certificate https://ftp.ncbi.nih.gov/genomes/refseq/assembly_summary_refseq.txt

Provides the following (see columns 1 and 6-7)

$ head assembly_summary_genbank.txt
#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession    bioproject  biosample   wgs_master  refseq_category taxid   species_taxid   organism_name   infraspecific_name  isolate version_status  assembly_level  release_type    genome_rep  seq_rel_date    asm_name    submitter   gbrs_paired_asm paired_asm_comp ftp_path    excluded_from_refseq    relation_to_type_material
GCA_000001215.4 PRJNA13812  SAMN02803731        reference genome    7227    7227    Drosophila melanogaster         latest  Chromosome  Major   Full    2014/08/01  Release 6 plus ISO1 MT  The FlyBase Consortium/Berkeley Drosophila Genome Project/Celera Genomics   GCF_000001215.4 identicalftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/215/GCA_000001215.4_Release_6_plus_ISO1_MT
GCA_000001405.27    PRJNA31257          reference genome    9606    9606    Homo sapiens            latest  Chromosome  Patch   Full    2017/12/21  GRCh38.p12  Genome Reference Consortium GCF_000001405.38    different   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.27_GRCh38.p12
GCA_000001515.5 PRJNA13184  SAMN02981217    AACZ00000000.4  na  9598    9598    Pan troglodytes     Yerkes chimp pedigree #C0471 (Clint)    latest  Chromosome  Major   Full    2016/05/03  Pan_tro 3.0 Chimpanzee Sequencing and Analysis Consortium   GCF_000001515.7 different   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/515/GCA_000001515.5_Pan_tro_3.0
GCA_000001545.3 PRJNA20869  SAMN02981238    ABGA00000000.1  na  9601    9601    Pongo abelii        ISIS 71 latest  Chromosome  Major   Full    2008/11/13  P_pygmaeus_2.0.2    Orangutan Genome Sequencing Consortium  na  na  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/545/GCA_000001545.3_P_pygmaeus_2.0.2
GCA_000001635.8 PRJNA20689          reference genome    10090   10090   Mus musculus            latest  Chromosome  Patch   Full    2017/09/15  GRCm38.p6   Genome Reference Consortium GCF_000001635.26    identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.8_GRCm38.p6
ADD COMMENT
0
Entering edit mode

Thanks for taking the time to respond to me, and the information about the unset versus empty variable, plus the error in the spacing in my code. I think I have it figured out and will post the updated code below.

I will also look into the assembly summary files as well as an alternative to my approach. Thanks again -

ADD REPLY
0
Entering edit mode

This seems to get the job done, but there are probably still errors present or ways to make it shorter:

 #!/bin/bash
file="accessions_out.txt"
while read -r ACC
do
     touch taxid_out.txt
     VAR=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id {ACC}&rettype=fasta&retmode=xml")
     if  echo "$VAR" | grep -q "<ERROR>"; then
          echo "Not present" >> taxid_out.txt         
    elif echo "$VAR" | grep -q "TSeq_taxid"; then
        taxid=$(echo "$VAR" | grep TSeq_taxid)
        echo "$taxid" |\
        cut -d '>' -f 2 |\
        cut -d '<' -f 1 >> taxid_out.txt
 fi
done <"$file"
ADD REPLY

Login before adding your answer.

Traffic: 2047 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6