I need to obtain taxonomy information (taxon IDs) for the NCBI NR library by protein accession number. I found two useful files, prot.accession2taxid.gz and pdb.accession2taxid.gz, at https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/. However, some accession numbers still cannot be mapped to a taxon ID. These accession numbers mainly fall into the following categories:
Accessions for which NCBI shows "Record removed", like "AYN07615.1". Why do removed records appear in the NR library?
Accession numbers from unknown resources, for example pir||S69889 and prf||1403304A.
Accession numbers from PDB that cannot be found in pdb.accession2taxid.gz, for example 6F1U_FF.
How can I obtain taxonomy information for these special accession numbers?
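For context, a minimal sketch of the kind of lookup described above. It assumes the accession2taxid files are tab-separated with one header line and the columns accession, accession.version, taxid, gi; the function name load_taxids is made up for illustration.

```python
import gzip


def load_taxids(path, wanted):
    """Return {accession.version: taxid} for the accessions listed in `wanted`.

    Assumption: `path` is a gzipped, tab-separated file with a header line and
    the columns accession, accession.version, taxid, gi.
    """
    wanted = set(wanted)
    found = {}
    with gzip.open(path, "rt") as handle:
        next(handle, None)  # skip the header line
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # ignore malformed or empty lines
            acc_ver, taxid = fields[1], fields[2]
            if acc_ver in wanted:
                found[acc_ver] = int(taxid)
    return found
```

Streaming the file line by line keeps memory use proportional to the number of accessions requested rather than to the size of the mapping file. Any accession in `wanted` that is missing from the returned dictionary is one of the problem cases listed above.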
Which version of BLAST/nr are you using (a local copy?)? Or are you simply looking for the list of all taxonomy for each protein?
I downloaded the NR library from https://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz.
And yes, you could say that I am simply looking for the taxonomy of each protein. But I cannot obtain it from the headers in the NR FASTA file, because of some non-standard naming and possibly ambiguous taxon names (one taxon name can map to multiple taxon IDs).
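To show why the accession is a safer key than the organism name in the header, here is a rough sketch that pulls the identifiers out of one NR header line. It assumes that the entries merged into a single header are separated by the Ctrl-A character (\x01) and that each entry starts with its identifier followed by whitespace. The function name and the accession AAA11111.1 in the example are made up.

```python
def accessions_from_defline(defline):
    """Return the leading identifier of every entry in one NR FASTA header.

    Assumption: entries are joined by Ctrl-A (\\x01) and each entry begins
    with its identifier, followed by whitespace and a free-text description.
    """
    entries = defline.lstrip(">").rstrip("\n").split("\x01")
    return [entry.split(None, 1)[0] for entry in entries if entry.strip()]


header = ">AAA11111.1 protein A [Organism one]\x01pir||S69889 protein B [Organism two]\n"
print(accessions_from_defline(header))  # ['AAA11111.1', 'pir||S69889']
```

The identifiers collected this way can then be looked up in the accession2taxid files, without relying on the organism name in square brackets.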
The 'removed' record might be because the version you can download is always a little behind the online version (normally you can check when a record was removed, and I would not be surprised if the date is after the time you downloaded nr from NCBI).
PIR and PRF are not unknown resources; lesser known, OK. Normally both (or at least PIR) are nowadays included in UniProt.
For the PDB one, I think you have to search for 6F1U (the _FF denotes the chain).
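As a small illustration of the PDB suggestion, a sketch that separates the entry ID from the chain so the entry ID can be searched on its own. It assumes the chain is whatever follows the first underscore; the function name is made up.

```python
def split_pdb_accession(accession):
    """Split an accession such as '6F1U_FF' into (entry_id, chain).

    Assumption: the chain is everything after the first underscore;
    chain is None when the accession contains no underscore.
    """
    entry, sep, chain = accession.partition("_")
    return entry, (chain if sep else None)


print(split_pdb_accession("6F1U_FF"))  # ('6F1U', 'FF')
```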