The question is whether, given a (long) list of GenBank identifiers, it is possible to get the NCBI taxonomy identifier for each one. I know it may seem very easy, but I have not found any web service that does this, and I would rather not do it manually.
Man, the script kicks ass. I was only looking for some orientation, but the script is _exactly_ what I want. Thanks a lot!
I modified your script slightly to retrieve these data for some accession IDs that I failed to recover when running blastdbcmd -entry_batch against my local blast database install, even though I could get information on them when using Entrez (my local data was the latest release from the FTP). I ran the following, where failed_accessions.txt was a list of 662 accession IDs, one per line:
The get_failed_acc.sh script was a modification of yours as follows:
I added the sleep in so I didn't hammer the server too much!
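A minimal sketch of what such a loop with a pause could look like. This is not the poster's actual script: the function names are hypothetical, and it assumes that efetch with rettype=fasta and retmode=xml returns a TSeq_taxid element for each record.

```shell
#!/usr/bin/env bash
# Sketch only: extract_taxid and fetch_taxid are hypothetical helper names.

# Read efetch XML on stdin and print the first taxonomy ID found.
extract_taxid() {
  sed -n 's:.*<TSeq_taxid>\([0-9][0-9]*\)</TSeq_taxid>.*:\1:p' | head -n 1
}

# Query NCBI for one accession (note https, not http).
fetch_taxid() {
  curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${1}&rettype=fasta&retmode=xml" |
    extract_taxid
}

# Print "accession<TAB>taxid" for every accession given as an argument.
for ACC in "$@"; do
  printf '%s\t%s\n' "$ACC" "$(fetch_taxid "$ACC")"
  sleep 1   # pause between requests so the server is not hammered
done
```

It could then be called with something like xargs ./get_failed_acc.sh < failed_accessions.txt, which passes the accessions in the file as arguments.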
NCBI is now using https:// instead of http://
Make sure you include the "s" in your link!
Hi thanks for the script! I am new to linux and I got a question. I wanna use my own accession numbers to replace "A00002 X53307 BB145968 CAA42669 V00181 AH002406 HQ844023". But my accession numbers are in a .CSV file. There are several hundreds of them. How can I copy them to the for loop? Or is it possible to read the .csv file directly? Thanks a lot!
Just do
tr ',' ' ' < filename.csv | xargs ./get_failed_acc.sh
The tr command replaces all commas with spaces, and xargs then passes the resulting values to the script as arguments.
This may only work if the URL uses https rather than http, as NCBI switched to https only in 2017.
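If the CSV was exported from a spreadsheet, it may also contain quotes, spaces, or Windows line endings that would end up inside the accession numbers. A small sketch of a cleanup step, assuming the file holds nothing but accession numbers (no header row, no other columns); csv_to_accessions is a hypothetical name:

```shell
#!/usr/bin/env bash
# Sketch only: csv_to_accessions is a hypothetical helper name.

# Print one accession per line from a CSV file given as $1.
csv_to_accessions() {
  tr ',' '\n' < "$1" |   # one field per line
    tr -d '\r" ' |       # drop carriage returns, double quotes, and spaces
    grep -v '^$'         # drop empty lines
}

# Possible usage (not run here):
#   csv_to_accessions filename.csv | xargs ./get_failed_acc.sh
```

With several hundred accessions, xargs may split them over more than one call to the script, which does no harm here since each accession is handled independently.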
I've fixed this, thanks.
How can I use this script with multiple files as input? I want to extract the locus_tag using start and stop genomic positions.
For example
ACC=accession.txt
START=start.txt
STOP=stop.txt
curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ACC}&seq_start=$(START)&$(STOP)&rettype=gb" |\
grep locus_tag |\
I could only manage with
for ACC in $(cat accession.txt)
and could not write a nested for loop that takes the other variables from the input files. When the loop does work, I get duplicate or triplicate hits, without the accession number, so I could not use the retrieved data :(
curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=AE006468.2&seq_start=8718&seq_stop=9319&rettype=gb" | grep "/locus_tag "
What should I do?
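One possible approach is to avoid nested loops altogether: a nested loop pairs every accession with every start and every stop, whereas paste joins the three files line by line so that each accession stays with its own coordinates. The sketch below assumes the three files have the same number of lines in matching order; the function names are hypothetical. The repeated hits are probably there because the same /locus_tag appears on several features of a GenBank record (for example gene and CDS), so the output is de-duplicated with sort -u and prefixed with the accession and coordinates.

```shell
#!/usr/bin/env bash
# Sketch only: efetch_url and locus_tags_for_regions are hypothetical names.

# Build the efetch URL for one accession and one start/stop pair.
efetch_url() {
  printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=%s&seq_start=%s&seq_stop=%s&rettype=gb' "$1" "$2" "$3"
}

# $1, $2, $3: files with one accession, start, and stop per line, same order.
locus_tags_for_regions() {
  paste "$1" "$2" "$3" | while read -r ACC START STOP; do
    [ -n "$ACC" ] || continue          # skip blank lines
    curl -s "$(efetch_url "$ACC" "$START" "$STOP")" |
      grep -o '/locus_tag="[^"]*"' |   # keep only the qualifier itself
      sort -u |                        # one line per distinct locus_tag
      while read -r TAG; do
        printf '%s\t%s\t%s\t%s\n' "$ACC" "$START" "$STOP" "$TAG"
      done
    sleep 1                            # be gentle with the NCBI server
  done
}

# Run only when called with the three file names as arguments.
if [ "$#" -eq 3 ]; then
  locus_tags_for_regions "$@"
fi
```

Saved as, say, get_locus_tags.sh (a made-up name), it would be run as ./get_locus_tags.sh accession.txt start.txt stop.txt.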