Question

Problem with Accession Number Changes in UniProt FASTA Sequences

4 months ago

bioinfo_enthusiast • 0

Hello everyone.

I have a list of 2000+ bacterial proteins with UniProt accession numbers (e.g., Q7CHB5, Q8D0R8, Q9ZC68) and need to download their FASTA sequences. I’ve used the UniProt ID Mapping Tool for download. However, after downloading, I noticed many sequences are assigned new accession numbers starting with "A0A" (e.g., A0A5P8YEA2), instead of the original ones.

I need the sequences with the original IDs for my analysis. Has anyone encountered this? Any suggestions on how to get the sequences with the original accession numbers?

Thank you so much.

FASTA Accession Uniprot IdMapping • 937 views

ADD COMMENT • link 4 months ago by bioinfo_enthusiast • 0

score 2 · Answer 1 · 2025-03-25

4 months ago

GenoMax 153k

It looks like UniProt is returning the current primary accession numbers (the ones starting with A0A here) because the entries were changed and your original accessions have become secondary. The two cases where this can happen are listed on this help page: https://www.uniprot.org/help/accession_numbers

Accession numbers become secondary when entries are merged or split. When two entries are merged into one, the accession numbers from both entries are stored for the new entry.

A complete list of changed accessions is available: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/sec_ac.txt
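For a list of 2000+ IDs, the lookup in that file can be scripted. Below is a minimal Python sketch, assuming each data row of sec_ac.txt is a secondary accession followed by whitespace and its current primary accession (the layout of the grep output shown further down in this thread). The function name `load_secondary_map`, the `wanted` parameter, and the two header lines in the sample are illustrative only and not part of any UniProt tool.

```python
import re
from collections import defaultdict

# UniProt accession pattern (6 or 10 characters), used to recognise data rows.
ACC = r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
ROW = re.compile(rf"^({ACC})\s+({ACC})\s*$")

def load_secondary_map(lines, wanted=None):
    """Return {secondary accession: [primary accessions]} from sec_ac.txt lines."""
    mapping = defaultdict(list)
    for line in lines:
        match = ROW.match(line)
        if match is None:
            continue  # header text, separator lines, blank lines
        secondary, primary = match.groups()
        if wanted is None or secondary in wanted:
            # A secondary accession can list several primaries (split entries).
            mapping[secondary].append(primary)
    return dict(mapping)

# Sample rows: the header lines are an assumed layout, the data row is from this thread.
sample = [
    "Secondary AC                  Primary AC\n",
    "____________                  __________\n",
    "Q7CHB5                        A0A2U2GXH7\n",
]
print(load_secondary_map(sample, wanted={"Q7CHB5", "Q8D0R8"}))
```

With the real file you would pass the open file handle instead of `sample`. IDs from your list that do not show up in the result were not listed as secondary, for example because they are still primary.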

"I need the sequences with the original IDs for my analysis."

You should still be getting the correct sequence based on the change above.

ADD COMMENT • link 4 months ago by GenoMax 153k

GenoMax's answer is correct. I would just like to add that the merge and split operations which cause UniProtKB accession numbers to become secondary can happen to any type of accession number, whether it starts with A or any other letter, and whether it is 6 or 10 characters long.

The accession number format is described here: https://www.uniprot.org/help/accession_numbers
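As an illustration of that format, here is a small Python check that accepts both the 6-character and the 10-character form. The regular expression is the one I recall from the help page linked above, so please verify it there. The helper name `is_uniprot_accession` is made up for this sketch.

```python
import re

# Pattern for UniProtKB accessions: 6 characters, or 10 for the newer format.
UNIPROT_AC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

def is_uniprot_accession(text):
    """True if the whole string has the shape of a UniProtKB accession."""
    return UNIPROT_AC.fullmatch(text) is not None

# Two accessions from this thread plus one truncated string.
for acc in ["Q7CHB5", "A0A5P8YEA2", "Q7CHB"]:
    print(acc, len(acc), is_uniprot_accession(acc))
```

Note that the pattern only checks the shape of the string. It cannot tell whether an accession is currently primary or secondary.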

ADD REPLY • link 4 months ago by Elisabeth Gasteiger ★ 2.4k

Thank you for your comment. I have explained my problem in another comment. Would you kindly help me with it?

ADD REPLY • link 4 months ago by bioinfo_enthusiast • 0

Thank you for your reply.

I need the original (primary) accession numbers because my host-pathogen PPI dataset only uses those IDs (e.g., Q7CHB5, Q8D0R8). When I download FASTA sequences, some accessions change to A0A..., making it hard to match them back to my dataset.

To keep my analysis consistent, I need a way to map old accessions to their primary IDs before downloading sequences. Since I have 2000+ proteins, manual checking isn’t an option.

Is there an efficient way to do this in bulk? Any suggestions would be greatly appreciated!

ADD REPLY • link 4 months ago by bioinfo_enthusiast • 0

"... making it hard to match them back to my dataset."

You will need to do some work here. The file I linked above has the mapping from the A* accessions to the original IDs. Use that information to rename the sequences.

$ grep Q7CHB5 sec_ac.txt 
Q7CHB5                        A0A2U2GXH7

Create a "key/value" pair file like the example above for all IDs, then use seqkit replace with that file to rename the sequences. See https://bioinf.shenwei.me/seqkit/usage/#replace for details. Be aware that where an original ID was split into two new IDs, you will get duplicate IDs in your file. Make sure you change those IDs (add a _1 or similar) so that they are unique.
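If you would rather not use seqkit, the same renaming can be sketched in plain Python. This is an alternative to the seqkit route, not a description of what seqkit does internally. It assumes UniProt-style headers of the form `>db|ACCESSION|ENTRY_NAME description` and a dictionary that goes from the new accession to the original one (the reverse of the column order in sec_ac.txt). The function name `rename_fasta` and the placeholder header and sequence are invented for the example.

```python
from collections import Counter

def rename_fasta(lines, primary_to_original):
    """Yield FASTA lines with the accession field of each header replaced.

    Assumes headers like '>tr|A0A2U2GXH7|ENTRY_NAME description'.
    """
    # How many new accessions point back to each original ID (>1 means a split).
    totals = Counter(primary_to_original.values())
    seen = Counter()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split("|", 2)
            if len(parts) == 3 and parts[1] in primary_to_original:
                original = primary_to_original[parts[1]]
                if totals[original] > 1:
                    # Split entry: append _1, _2, ... to keep IDs unique.
                    seen[original] += 1
                    original = f"{original}_{seen[original]}"
                parts[1] = original
                line = ">" + "|".join(parts)
        yield line

# Mapping taken from the grep output above; header text and sequence are placeholders.
key_value = {"A0A2U2GXH7": "Q7CHB5"}
fasta = [
    ">tr|A0A2U2GXH7|PLACEHOLDER_NAME placeholder description\n",
    "MKT\n",
]
print("".join(rename_fasta(fasta, key_value)), end="")
```

Only the accession field between the two `|` characters is changed. Headers whose accession is not in the dictionary are passed through untouched, so sequences that kept their original ID are unaffected.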