Problem with Accession Number Changes in UniProt FASTA Sequences
23 hours ago

Hello everyone.

I have a list of 2000+ bacterial proteins with UniProt accession numbers (e.g., Q7CHB5, Q8D0R8, Q9ZC68) and need to download their FASTA sequences. I’ve used the UniProt ID Mapping Tool for download. However, after downloading, I noticed many sequences are assigned new accession numbers starting with "A0A" (e.g., A0A5P8YEA2), instead of the original ones.

I need the sequences with the original IDs for my analysis. Has anyone encountered this? Any suggestions on how to get the sequences with the original accession numbers?

Thank you so much.

FASTA Accession Uniprot IdMapping • 171 views
22 hours ago
GenoMax 150k

It looks like UniProt is using secondary accession numbers starting with A* when some of the entries are changed. The two cases where this can happen are listed on this help page: https://www.uniprot.org/help/accession_numbers

Accession numbers become secondary when entries are merged or split. When two entries are merged into one, the accession numbers from both entries are stored for the new entry.

A complete list of changed accessions is available: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/sec_ac.txt
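For a list of 2000+ IDs, the file can be scanned once rather than searched by hand. The following is a minimal sketch, assuming each data line of sec_ac.txt holds exactly two whitespace-separated accessions; the function name `lookup_changes` and the sample lines are illustrative only, not part of any UniProt tool.

```python
def lookup_changes(sec_ac_lines, wanted):
    """Return {accession: [paired accessions]} for the IDs we care about."""
    wanted = set(wanted)
    hits = {}
    for line in sec_ac_lines:
        fields = line.split()
        # Header prose never has a wanted accession as its first of two fields.
        if len(fields) == 2 and fields[0] in wanted:
            hits.setdefault(fields[0], []).append(fields[1])
    return hits


# Toy input: one header-like line plus the pair quoted in this thread.
sample = [
    "This line stands in for the descriptive header of the file",
    "Q7CHB5                        A0A2U2GXH7",
]
hits = lookup_changes(sample, ["Q7CHB5", "Q8D0R8"])
print(hits)
```

With the real file, `sample` would be replaced by an open file handle, e.g. `open("sec_ac.txt")`. An ID that is missing from the result simply has no line in the file.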

I need the sequences with the original IDs for my analysis.

You should still be getting the correct sequence based on the change above.


GenoMax's answer is correct. I would just like to add that the merge and split operations that cause UniProtKB accession numbers to become secondary can happen to any type of accession number, whether it starts with A or any other letter, and whether it is 6 or 10 characters long.

The accession number format is described here: https://www.uniprot.org/help/accession_numbers
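To check an ID list for typos before mapping, the format can be tested with a regular expression. This is a small sketch; the pattern is reproduced as it is understood to appear on the help page above, so it is worth comparing against the page before relying on it. The name `is_accession` is made up for the example.

```python
import re

# Accession pattern as understood from the UniProt accession_numbers help page:
# 6 characters, or 10 characters for the newer style (e.g. A0A...).
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_accession(text):
    """True if the whole string has the shape of a UniProtKB accession."""
    return ACCESSION.fullmatch(text) is not None


for acc in ["Q7CHB5", "A0A5P8YEA2", "q7chb5", "Q7CHB"]:
    print(acc, is_accession(acc))
```

A match only means the string is well formed; it says nothing about whether the entry exists or is still primary.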


Thank you for your comment. I have explained my problem in another comment. Would you kindly help me with it?


Thank you for your reply.

I need the original (primary) accession numbers because my host-pathogen PPI dataset only uses those IDs (e.g., Q7CHB5, Q8D0R8). When I download FASTA sequences, some accessions change to A0A..., making it hard to match them back to my dataset.

To keep my analysis consistent, I need a way to map old accessions to their primary IDs before downloading sequences. Since I have 2000+ proteins, manual checking isn’t an option.

Is there an efficient way to do this in bulk? Any suggestions would be greatly appreciated!


making it hard to match them back to my dataset.

You will need to do some work here. The file I linked above maps the A* accessions to the original IDs. Use that information to rename the sequences.

$ grep Q7CHB5 sec_ac.txt 
Q7CHB5                        A0A2U2GXH7

Create a "key/value" pair file like the example above for all IDs, and then use seqkit replace with that file to rename the sequences. See --> https://bioinf.shenwei.me/seqkit/usage/#replace. Be aware that where one original ID was split into two new IDs, you will get duplicated IDs in your file. You will need to change those IDs (add a _1 or something similar) to make them unique.
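If seqkit is not at hand, the same renaming can be sketched in plain Python. This is a stand-in for the seqkit replace step, not seqkit itself. It assumes the key/value file has the original ID first and the A* ID second, as in the grep output above, and that FASTA headers follow the usual `db|ACCESSION|NAME` layout. The function names, the second accession, the entry names and the sequences below are all made up for illustration.

```python
def load_pairs(lines):
    """Read 'original<whitespace>new' lines into a {new: original} dict."""
    new_to_old = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            new_to_old[fields[1]] = fields[0]
    return new_to_old


def rename_fasta(lines, new_to_old):
    """Yield FASTA lines with each header ID swapped for the original one."""
    seen = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith(">"):
            yield line
            continue
        header_id = line[1:].partition(" ")[0]
        parts = header_id.split("|")
        # db|ACCESSION|NAME headers: take the middle field, else the whole ID.
        current = parts[1] if len(parts) >= 3 else header_id
        label = new_to_old.get(current, current)
        # Repeats of the same original ID get _1, _2, ... to stay unique.
        seen[label] = seen.get(label, 0) + 1
        if seen[label] > 1:
            label = f"{label}_{seen[label] - 1}"
        # Keep the old header as the description so nothing is lost.
        yield f">{label} {line[1:]}"


# Made-up demo: A0A000XXX0 is a placeholder, not a real accession.
pairs = load_pairs(["Q7CHB5  A0A2U2GXH7", "Q7CHB5  A0A000XXX0"])
fasta = [">tr|A0A2U2GXH7|NAME1 demo", "MKT", ">tr|A0A000XXX0|NAME2 demo", "MKV"]
renamed = list(rename_fasta(fasta, pairs))
print("\n".join(renamed))
```

Keeping the old header in the description makes it possible to trace each renamed record back to the accession UniProt returned.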
