Hello everyone.
I have a list of 2000+ bacterial proteins with UniProt accession numbers (e.g., Q7CHB5, Q8D0R8, Q9ZC68) and need to download their FASTA sequences. I’ve used the UniProt ID Mapping Tool for download. However, after downloading, I noticed many sequences are assigned new accession numbers starting with "A0A" (e.g., A0A5P8YEA2), instead of the original ones.
I need the sequences with the original IDs for my analysis. Has anyone encountered this? Any suggestions on how to get the sequences with the original accession numbers?
Thank you so much.
GenoMax' answer is correct. I would just like to add that the merge and split operations which cause UniProtKB accession numbers to become secondary can happen to any type of accession number, whether it starts with A or any other letter, and whether it is 6 or 10 characters long.
The accession number format is described here https://www.uniprot.org/help/accession_numbers
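Since the format is fixed, a quick sanity check of a long ID list is possible before mapping anything. Below is a minimal sketch using the regular expression given on that help page; the function name is my own, not part of any UniProt tool.

```python
import re

# Regular expression for UniProtKB accession numbers, as published on the
# UniProt "accession_numbers" help page (6- or 10-character accessions).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][0-9A-Z]{2}[0-9]){1,2}"
)


def is_valid_accession(accession: str) -> bool:
    """Return True if the whole string has the UniProtKB accession format.

    This only checks the format. It cannot tell whether an accession is
    currently primary, secondary, or deleted.
    """
    return UNIPROT_ACCESSION.fullmatch(accession) is not None


if __name__ == "__main__":
    # Accessions quoted in this thread.
    for acc in ["Q7CHB5", "Q8D0R8", "Q9ZC68", "A0A5P8YEA2"]:
        print(acc, is_valid_accession(acc))
```

Both the 6-character accessions and the 10-character A0A one pass the same check, which is the point: the format says nothing about primary versus secondary status.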
Thank you for your comment. I have explained my problem in another comment. Would you kindly help me with it?
Thank you for your reply.
I need the original accession numbers (the ones in my list) because my host-pathogen PPI dataset only uses those IDs (e.g., Q7CHB5, Q8D0R8). When I download FASTA sequences, some accessions change to A0A..., making it hard to match them back to my dataset.
To keep my analysis consistent, I need a way to map old accessions to their primary IDs before downloading sequences. Since I have 2000+ proteins, manual checking isn’t an option.
Is there an efficient way to do this in bulk? Any suggestions would be greatly appreciated!
You will need to do some work here. The file I linked above has the mapping from the A* accessions to the original IDs. Use that information to rename the sequences. Create a key/value pair file like the example above for all IDs, then use seqkit replace with that file to rename the sequences. See https://bioinf.shenwei.me/seqkit/usage/#replace.
Be aware that where one original ID now corresponds to more than one entry (for example after a split), you will get duplicate IDs in your file. You will need to change those IDs (add a _1 suffix or something similar) to make them unique.
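If you would rather script the renaming than use seqkit, here is a rough Python sketch of the same idea. It rests on several assumptions: the key/value file is tab-separated with the current accession first and the original ID second, the FASTA headers look like the usual UniProt ">tr|ACCESSION|NAME ..." form, and repeated IDs get _1, _2, ... appended from the second occurrence on. The function names are my own, and the pairing of A0A5P8YEA2 with Q7CHB5 in the demo is made up for illustration, not a real mapping.

```python
import re
from collections import Counter

# Matches the start of a UniProt-style FASTA header: >sp|ACCESSION| or >tr|ACCESSION|
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|\s]+)\|")


def load_key_value(lines):
    """Parse 'current_accession<TAB>original_id' lines into a dict."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, value = line.split("\t")[:2]
        mapping[key] = value
    return mapping


def rename_headers(fasta_lines, current_to_original):
    """Return FASTA lines with header accessions swapped for original IDs.

    Accessions missing from the mapping are kept as they are. If an ID
    would appear more than once, later occurrences get _1, _2, ... appended.
    """
    seen = Counter()
    renamed = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        match = HEADER_RE.match(line)
        if match:
            accession = match.group(2)
            new_id = current_to_original.get(accession, accession)
            count = seen[new_id]
            seen[new_id] += 1
            if count:
                new_id = f"{new_id}_{count}"
            line = f">{match.group(1)}|{new_id}|" + line[match.end():]
        renamed.append(line)
    return renamed


if __name__ == "__main__":
    # Hypothetical pairing and made-up sequence, for illustration only.
    kv = load_key_value(["A0A5P8YEA2\tQ7CHB5"])
    fasta = [">tr|A0A5P8YEA2|A0A5P8YEA2_EXAMPLE Example protein", "MKTAYIAKQR"]
    print("\n".join(rename_headers(fasta, kv)))
```

For real use, read the key/value file and the downloaded FASTA from disk and write the result to a new file, then check how many headers received a suffix so you know which IDs were affected by a split.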