Depending on what kind of accessions you have (GenBank, RefSeq or Swiss-Prot), you should be able to tell which database the accession is coming from. Take a look at this page. For example, all RefSeq proteins will have accessions in the format [NAXWY]P_\d+\.\d+
. Once you know that, you can filter your input list and use the same EntrezDirect command.
Once you have a filtered list, you can use EntrezDirect as follows:
for acc in `cat accs_list.txt` ; do
echo -ne "${acc}\t" ;
elink -db protein -target taxonomy -id ${acc} \
| efetch -format native -mode xml \
| xtract -pattern TaxaSet -sep ',' -element ScientificName ;
done
## example output
NP_002817 Homo sapiens,cellular organisms,Eukaryota,Opisthokonta,Metazoa,Eumetazoa,...
WP_003131952 Lactococcus,cellular organisms,Bacteria,Terrabacteria group,Firmicutes,...
This will generate a two column table with accession and the lineage in tab-delimited format. However, if you don't particularly care about the accession to lineage map and just want to see a list of lineages for your entire set of accessions then you can use epost
as follows:
epost -db protein -format acc -input accs_list.txt \
| elink -db protein -target taxonomy \
| efetch -format native -mode xml \
| xtract -pattern TaxaSet -sep ',' -element ScientificName
## example output
Homo sapiens,cellular organisms,Eukaryota,Opisthokonta,Metazoa,Eumetazoa,...
Lactococcus,cellular organisms,Bacteria,Terrabacteria group,Firmicutes,...
Would it work if you add the possible dbs to a file (db.list) each db in one line and use something like this?
This command is not efficient because if searches all the databases in the list, so I think you can write a script with
if and else/elseif
conditions to check each db and only continue if not found (for example continue as long as found==0).