Question

Entrez Direct: epost removes duplicates

5.1 years ago

jaydu • 0

Dear community,

I am trying to fetch the TaxId for each accession number in a large file. I found a great solution in the post "NCBI Accession Number to Taxonomy ID".

Unfortunately there is a problem. epost seems to remove duplicates, but I would like to keep them.

My text file looks like this:

AAF56856.1
NP_001263054.1
NP_001036754.1
XP_026832094.1
XP_001973014.1
XP_015010324.1
AWA82273.1
WP_010946064.1
VDK17967.1
VDM28455.1
VDO54167.1
VDO54167.1
VDM28456.1
CDW61002.1
VDM28456.1

The epost command:

$ epost -db protein -input file.txt | esummary -db protein | xtract -pattern DocumentSummary -element Caption,TaxId
NP_001263054    7227
NP_001036754    7227
AAF56856    7227
VDO54167    42155
VDM28456    6265
VDM28455    6265
VDK17967    6269
XP_026832094    7220
XP_015010324    7220
XP_001973014    7220
AWA82273    2170591
CDW61002    36087
WP_010946064    446

The two duplicated accessions (VDO54167.1 and VDM28456.1) are each returned only once. I would like to get one output line per input line if possible. The order of the list is changed too, but that is just a minor issue.

Thank you for any suggestions.

Cheers, JD

Entrez Direct NCBI epost • 1.3k views

updated 5.1 years ago by vkkodali_ncbi ★ 3.8k • written 5.1 years ago by jaydu • 0

score 2 · Accepted Answer · 2019-11-12

One option is to use a bash loop as follows:

# One esummary request per line of file.txt, so duplicates and order are kept
for acc in $(cat file.txt); do
  esummary -db protein -id "${acc}" \
    | xtract -pattern DocumentSummary -element Caption,TaxId
done

This also retains the order of the accessions. However, it is slower than epost because it sends one request per accession, and the difference grows with the length of your list. Alternatively, you can fetch the taxids with the epost method first and then use Unix join to add the taxid to each line of your original list. Perhaps something like this:

join -t $'\t' -j 1 \
  <(sort file.txt) \
  <(epost -db protein -input file.txt \
      | esummary \
      | xtract -pattern DocumentSummary -element AccessionVersion,TaxId \
      | sort -t $'\t' -k1,1)

Note that I used AccessionVersion instead of Caption in the xtract command so that the full accession.version string is returned rather than just the accession. This is needed for the join keys to match the entries in your file. Because join requires sorted input, this approach loses the original order of your accessions, although duplicates are kept. There are other methods you can use to preserve the original order.
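One such order-preserving method, as a rough sketch rather than a tested recipe: save the deduplicated epost output as a lookup table, then let awk walk the original list line by line and print the matching taxid. The file names taxid_lookup.tsv, acc_sample.txt and annotated.tsv are hypothetical, and the two printf lines only stand in for real data (the accession/TaxId pairs are taken from the output in the question) so that the snippet runs without network access.

```shell
# Stand-in for the deduplicated lookup table (accession.version <TAB> TaxId).
# In practice this would come from the epost | esummary | xtract pipeline.
printf 'VDO54167.1\t42155\nVDM28456.1\t6265\nVDM28455.1\t6265\n' > taxid_lookup.tsv

# Stand-in for the original accession list, duplicates included.
printf 'VDM28455.1\nVDO54167.1\nVDO54167.1\nVDM28456.1\nCDW61002.1\nVDM28456.1\n' > acc_sample.txt

# First file: load the lookup table into an array keyed by accession.version.
# Second file: print every input line in its original order with its taxid,
# or NA when the accession is missing from the lookup table.
awk -F'\t' -v OFS='\t' '
  NR == FNR { taxid[$1] = $2; next }
  { print $1, (($1 in taxid) ? taxid[$1] : "NA") }
' taxid_lookup.tsv acc_sample.txt > annotated.tsv

cat annotated.tsv
```

Since the lookup is done per input line, duplicates come out as many times as they appear in the list, and accessions that returned no summary are flagged with NA instead of being dropped silently.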