Entrez Direct: epost removes duplicates
5.1 years ago
jaydu • 0

Dear community,

I am trying to fetch the TaxId for each entry in a large file of accession numbers. I found a great solution in the thread "NCBI Accession Number to Taxonomy ID".

Unfortunately there is a problem. epost seems to remove duplicates, but I would like to keep them.

My text file looks like this:

AAF56856.1
NP_001263054.1
NP_001036754.1
XP_026832094.1
XP_001973014.1
XP_015010324.1
AWA82273.1
WP_010946064.1
VDK17967.1
VDM28455.1
VDO54167.1
VDO54167.1
VDM28456.1
CDW61002.1
VDM28456.1

The epost command:

$ epost -db protein -input file.txt | esummary -db protein | xtract -pattern DocumentSummary -element Caption,TaxId
NP_001263054    7227
NP_001036754    7227
AAF56856    7227
VDO54167    42155
VDM28456    6265
VDM28455    6265
VDK17967    6269
XP_026832094    7220
XP_015010324    7220
XP_001973014    7220
AWA82273    2170591
CDW61002    36087
WP_010946064    446

The two duplicates (VDO54167.1, VDM28456.1) are each returned only once. I would like to keep them as separate rows if possible. The order of the list is changed too, but that is only a minor issue.

Thank you for any suggestions.

Cheers, JD

Entrez Direct NCBI epost • 1.3k views
5.1 years ago
vkkodali_ncbi ★ 3.8k

One option is to use a bash loop as follows:

for acc in $(cat file.txt); do
  esummary -db protein -id "${acc}" \
    | xtract -pattern DocumentSummary -element Caption,TaxId
done

This also retains the order of the accessions. However, it runs esummary once per accession, so it will be slower than using epost, more so if you have a long list of accessions to work with. Alternatively, you can first fetch the taxids with the epost method and then use Unix join to add the taxid to your original list. Perhaps something like this:

join -t $'\t' -j 1 \
  <(sort file.txt) \
  <(epost -db protein -input file.txt \
      | esummary \
      | xtract -pattern DocumentSummary -element AccessionVersion,TaxId \
      | sort -t $'\t' -k1,1)

Note that I have used AccessionVersion instead of Caption in the xtract command to fetch the entire accession.version string instead of just the accession, so that the join keys match the entries in your file. Also, join requires both inputs to be sorted on the join field, so this destroys the original order of your accessions; there are other methods that keep the original order.
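
One such order-preserving method, as a minimal sketch rather than a definitive recipe: load the accession-to-taxid table into an awk array and then walk the original list line by line. The function name annotate_taxids, the file name taxids.tsv, and the "NA" placeholder for accessions with no hit are all assumptions made for illustration; the table is assumed to be the tab-separated AccessionVersion/TaxId output of the epost pipeline above, saved to a file.

```shell
# Hypothetical helper (name and "NA" placeholder are illustrative).
# $1 = tab-separated lookup table: accession.version <TAB> taxid
# $2 = original accession list, one accession.version per line
annotate_taxids() {
  awk -F'\t' -v OFS='\t' '
    # First file: remember the taxid for each accession.version.
    FILENAME == ARGV[1] { taxid[$1] = $2; next }
    # Second file: print every line in its original order, duplicates
    # included, with the stored taxid or "NA" if none was found.
    { print $1, (($1 in taxid) ? taxid[$1] : "NA") }
  ' "$1" "$2"
}
```

For example, after redirecting the epost | esummary | xtract output to taxids.tsv, running annotate_taxids taxids.tsv file.txt would print one line per input accession, so VDO54167.1 and VDM28456.1 would each appear twice. No sorting is needed with this approach.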

Thank you so much for this nice solution! Just out of curiosity, is it known why epost reorders the input and removes duplicates? I did not find anything about that in the documentation.

Thanks again, JD
