Hi - I have a question that is related to this thread:
Automatically Getting The Ncbi Taxonomy Id From The Genbank Identifier
I assumed if I had a new question, I needed to open a new thread. I apologize if I should have posted under the original question.
I have a file with a list of accession numbers in it:
KU587513
KU587514
KU605633
I have a bash shell script I am using from that referenced thread to get the taxid from each accession using efetch:
#!/bin/bash
file="accessions_out.txt"
while read -r ACC
do
touch taxid_out.txt
curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ACC}&rettype=fasta&retmode=xml" |\
grep TSeq_taxid |\
cut -d '>' -f 2 |\
cut -d '<' -f 1 >> taxid_out.txt
done <"$file"
This works and gives me an output file taxid_out.txt containing:
1844429
1636871
129076
The problem is that when the file of accession numbers includes one that has no corresponding taxid, I can't figure out how to output something like "no_match". For example, this accessions file:
KU587513
KUBOGUS
KU605633
gives the following output file:
1844429
129076
I want it to give:
1844429
no_match
129076
I don't know how to code this correctly. This is what I have tried:
#!/bin/bash
file="accessions_out.txt"
while read -r ACC
do
touch taxid_out.txt
VAR = $(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id={ACC}&rettype=fasta&retmode=xml")
if [ -z "$VAR" ]
then
"no_match" >> taxid_out.txt
else
taxid="$(grep 'TSeq_taxid' $VAR)" #at this point I'm happy to just print the whole TSeq_taxid
"$taxid" >> taxid_out.txt
fi
done <"$file"
The problem with this is that if the accession number has no corresponding taxid, curl returns something containing <error>, and I am not sure how to catch that. My real list has ~10000 accession numbers, and there are a few "bad apples" in it that I am having trouble finding. Getting "no_match" printed would help me find those. I am not very familiar with shell scripting, but someone who is could probably figure this out very quickly. I would sincerely appreciate any help. Thanks -
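For what it's worth, here is a minimal sketch of one way the "no_match" behaviour could be done. It is not tested against the real ~10000-accession list. The function names extract_taxid and lookup_taxids are made up for illustration, and the sed pattern assumes the taxid sits on one line as <TSeq_taxid>digits</TSeq_taxid>. Rather than testing for an <error> element, it prints "no_match" whenever no taxid is found in the response. Compared with the attempt above, the assignment has no spaces around "=", and the variable is written as ${acc} inside the URL. The sleep is there because NCBI asks for no more than 3 requests per second without an API key.

```shell
#!/bin/bash

# Read an efetch XML response on stdin and print the first taxid found,
# or "no_match" when the response contains no TSeq_taxid element.
extract_taxid() {
    local taxid
    taxid=$(sed -n 's:.*<TSeq_taxid>\([0-9]*\)</TSeq_taxid>.*:\1:p' | head -n 1)
    # ${taxid:-no_match} expands to no_match when taxid is unset OR empty
    echo "${taxid:-no_match}"
}

# Look up every accession listed in file $1; write one result per line to file $2.
# (Hypothetical helper: the file names are passed in rather than hard-coded.)
lookup_taxids() {
    local infile=$1 outfile=$2 acc
    : > "$outfile"    # create or empty the output file once, before the loop
    # "|| [ -n "$acc" ]" keeps the last line even if it lacks a trailing newline
    while read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue    # skip blank lines
        curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${acc}&rettype=fasta&retmode=xml" |
            extract_taxid >> "$outfile"
        sleep 0.4    # stay under 3 requests per second
    done < "$infile"
}
```

After sourcing the file, something like lookup_taxids accessions_out.txt taxid_out.txt should write one line per accession, so a bad accession shows up as no_match at the same position it has in the input list.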
Thanks for taking the time to respond to me, and the information about the unset versus empty variable, plus the error in the spacing in my code. I think I have it figured out and will post the updated code below.
I will also look into the assembly summary files as an alternative to my approach. Thanks again -
This seems to get the job done, but there are probably still errors present or ways to make it shorter: