Question

PDB residue serial inconsistency and PDB id&UniProt ID mismatch

0

15 months ago

Nafi • 0

I have analyzed the PDB files for many PDB IDs and found inconsistencies in the residue numbering. The inconsistencies are:

1. The residues in a particular chain do not always start at index 1.
2. There are gaps in the residue numbering. For example, the numbering jumps from 50 to 58, with no residues in between.
3. Sometimes the number of residues doesn't match the given FASTA sequence.

I also have questions about PDB ids and UniProtIds.

Why are multiple PDB IDs grouped under a single UniProt ID? Why are their protein sequences different?

pdb UniProtID pdbID • 3.8k views

ADD COMMENT • link updated 15 months ago by Wayne ★ 2.1k • written 15 months ago by Nafi • 0

1

15 months ago

Jiyao Wang ▴ 380

There are two sets of residue numbers. People usually use the ones provided by authors. These residue numbers are actually strings, not numbers. For example, sometimes residue numbers are 100, “100A”, “100B”, etc. The other set of residue numbers are integers starting from 1 for the first residue in the sequence, not the first residue with coordinates. Some residues in the sequence may not have coordinates as explained by others above.
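To make the point about author residue numbers being strings concrete, here is a minimal sketch that reads them from the fixed columns of ATOM records in the legacy PDB format (resSeq in columns 23-26, insertion code in column 27). The function name and the sample lines are made up for illustration; they are not taken from any real entry.

```python
def author_residue_ids(pdb_lines):
    """Return author-assigned residue IDs in file order as
    (chain, number, insertion_code) tuples, one per residue."""
    seen = []
    for line in pdb_lines:
        if not line.startswith(("ATOM  ", "HETATM")):
            continue
        chain = line[21]              # chainID, column 22
        number = int(line[22:26])     # resSeq, columns 23-26
        icode = line[26].strip()      # iCode, column 27 ("" when blank)
        key = (chain, number, icode)
        # Many atoms share one residue, so only record a residue once.
        if not seen or seen[-1] != key:
            seen.append(key)
    return seen


# Hypothetical toy records: residue 100 is followed by insertions 100A and 100B.
sample = [
    "ATOM      1  CA  ALA A 100      11.000  12.000  13.000",
    "ATOM      2  CA  GLY A 100A     14.000  15.000  16.000",
    "ATOM      3  CA  SER A 100B     17.000  18.000  19.000",
    "ATOM      4  CA  THR A 101      20.000  21.000  22.000",
]

ids = author_residue_ids(sample)
labels = [f"{number}{icode}" for _, number, icode in ids]
print(labels)
```

Treating the identifier as a (number, insertion code) pair rather than a bare integer avoids collapsing 100, 100A and 100B into a single residue.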

ADD COMMENT • link 15 months ago by Jiyao Wang ▴ 380

0

This reminds me: there can also be alternate positions for sidechains. That will further complicate residue consistency relative to the counts in the PDB file and in the FASTA. See the Proteopedia page on 'Alternate locations', which shows a snippet from a PDB file under 'Visualizing alternate locations'. You'll potentially come across some of those if you are examining a lot of PDB structure files.
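As a rough illustration of how alternate locations show up when parsing, the sketch below collects the altLoc labels (column 17 of ATOM/HETATM records) per residue and counts residues by identity rather than by number of atom lines. The function names and the toy records are assumptions for illustration only.

```python
from collections import defaultdict


def alternate_locations(pdb_lines):
    """Map (chain, resSeq, iCode, resName) -> sorted list of altLoc labels."""
    found = defaultdict(set)
    for line in pdb_lines:
        if line.startswith(("ATOM  ", "HETATM")):
            alt = line[16]  # altLoc, column 17 (blank when there is none)
            if alt != " ":
                key = (line[21], int(line[22:26]), line[26].strip(), line[17:20])
                found[key].add(alt)
    return {key: sorted(labels) for key, labels in found.items()}


def count_residues(pdb_lines):
    """Count distinct residues, so alternate conformers are not double counted."""
    residues = {
        (line[21], line[22:27])  # chain + resSeq + iCode
        for line in pdb_lines
        if line.startswith(("ATOM  ", "HETATM"))
    }
    return len(residues)


# Hypothetical toy records: the OG atom of Ser 10 has two conformers, A and B.
sample = [
    "ATOM      1  CA  SER A  10      11.000  12.000  13.000",
    "ATOM      2  OG ASER A  10      14.000  15.000  16.000",
    "ATOM      3  OG BSER A  10      14.500  15.500  16.500",
    "ATOM      4  CA  GLY A  11      17.000  18.000  19.000",
]

alts = alternate_locations(sample)
n_residues = count_residues(sample)
print(alts, n_residues)
```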

ADD REPLY • link 15 months ago by Wayne ★ 2.1k

score 3 · Accepted Answer · 2024-02-28

3

15 months ago

Wayne ★ 2.1k

"The residue in a particular chain does not always start with index 1.

There are some gaps in residue sequences for example 50-58. There is a jump from 50 to 58. There are no residue in between.

There are sometimes residue length dont match with the given FASTA sequence."

Welcome to structural biology!

Unless I am misunderstanding, this is all common. (I'll get back to that.)
One thing that would make your post clearer and easier to answer is to point to examples. For example, you bring up "for example 50-58. There is a jump from 50 to 58." Which PDB ID code are you referring to here? A specific one? A lot? Cite the PDB ID records if it is just one or a few. Might they all be from the same protein?

And that citing becomes more important in the next section. I suspect it is a simple explanation but we cannot respond well without examples.

Back to what is going on here in the three bullet points you posted...

Please read the Proteopedia entry 'Unusual sequence numbering'. That should definitely cover bullet points #1 and #2. In particular the 'Gaps In Sequence Numbering' section addresses number 2.
Without examples, I am having trouble getting at what you mean by bullet point #3. Is it the PDB differing from UniProt? Then see 'Renumbering PDB files' under the 'See Also' section there. Or is it something else? The PDB pages used to be poor at reporting the missing residues, and you needed to check the information at PDBsum specifically. (You can see more about this here with an example with images.) There was a similar issue with the FASTA files, too: the ones at PDB would show the sequence of what the scientist used in the construct, while PDBsum would give you the FASTA sequence of only what is represented in the resulting structure. So I'm wondering if that is what you are asking about?

ADD COMMENT • link 15 months ago by Wayne ★ 2.1k

0

Thank you for answering. I will see the links you provided. You mostly answered with the right answers. I will be careful in giving examples in my future posts.

For question #3: you can see 3CHB as an example. The FASTA has 104 residues but the PDB file has 103; it skips the first residue of the FASTA. For the UniProt ID and PDB ID question: PDB IDs 5N9G, 8ITY, 8IUE, and 8IUH are grouped under UniProt ID A6H8Y1. What is the difference between UniProt proteins and RCSB PDB proteins? Why are multiple PDB IDs grouped under a single UniProt ID? Why do their FASTA sequences not match?

ADD REPLY • link 15 months ago by Nafi • 0

2

You can see 3CHB as an example. FASTA has 104 residues but pdb has 103.

Which FASTA? From where? And what do you mean by "pdb has 103"? The structure file?
I suspect the discrepancy comes from the initial methionine. If you scroll down at https://www.rcsb.org/structure/3CHB , you will see a gray box next to each 'UNMODELED' entry, under the initial M (Met).
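One way to check this kind of off-by-one yourself is to compare the deposited sequence (SEQRES records) with the residues that actually have coordinates (ATOM records). The sketch below does that on made-up toy records, not on the real entry; the function names are hypothetical and the column positions follow the legacy PDB format.

```python
def seqres_residues(pdb_lines, chain_id):
    """Residue names of the deposited sequence for one chain (SEQRES records)."""
    names = []
    for line in pdb_lines:
        if line.startswith("SEQRES") and line[11] == chain_id:
            names.extend(line[19:70].split())  # residue names, columns 20-70
    return names


def modeled_residues(pdb_lines, chain_id):
    """Residues of one chain that have coordinates, as (resSeq, iCode, resName)."""
    seen = []
    for line in pdb_lines:
        if line.startswith("ATOM  ") and line[21] == chain_id:
            key = (int(line[22:26]), line[26:27].strip(), line[17:20])
            if key not in seen:
                seen.append(key)
    return seen


# Hypothetical toy entry: five residues deposited, the initial Met not modeled.
sample = [
    "SEQRES   1 A    5  MET ALA GLY SER THR",
    "ATOM      1  CA  ALA A   2      11.000  12.000  13.000",
    "ATOM      2  CA  GLY A   3      14.000  15.000  16.000",
    "ATOM      3  CA  SER A   4      17.000  18.000  19.000",
    "ATOM      4  CA  THR A   5      20.000  21.000  22.000",
]

deposited = seqres_residues(sample, "A")
modeled = modeled_residues(sample, "A")
print(len(deposited), len(modeled))
```

A difference of one, with the first modeled residue numbered 2, is the pattern you would expect when only the initial methionine is unmodeled.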

For the UniProt ID and PDB ID question: PDB IDs 5N9G, 8ITY, 8IUE, and 8IUH are grouped under UniProt ID A6H8Y1. What is the difference between UniProt proteins and RCSB PDB proteins? Why are multiple PDB IDs grouped under a single UniProt ID?

(Please use links as links to make things specific, so others don't need to look up what you presumably already did.)
UniprotID A6H8Y1 is Human Transcription factor TFIIIB component B'' homolog.

For PDB entry 5n9g it is present as chains designated C,H, as you can see there by scrolling down to under the 'Macromolecules' section.

Similarly for 8ITY where it is present as chain W.

Similarly for 8IUE where it is present as chain Y (or possibly W if it is using Author designation).

Similarly for 8IUH where it is present as chain W.

Indeed for the last three, if you simply look under the 'Literature' section of an entry, such as for 8IUH, you'll see that paper contained those three structures they solved as indicated by them being listed under 'Primary Citation of Related Structures'.

Why do their FASTA sequences not match?

That is probably because the different structures were solved in different ways and combinations, and so gave different experimental results. This is very common in the course of solving a structure experimentally and documenting it in the literature: a complex of proteins is often solved under various conditions and in various combinations.

ADD REPLY • link 15 months ago by Wayne ★ 2.1k

0

Thank you for taking your time replying. It helped a lot.

ADD REPLY • link 15 months ago by Nafi • 0

score 3 · Accepted Answer · 2024-02-28

3

15 months ago

Mensur Dlakic ★ 29k

1. The residue numbering refers to the protein that was used for crystallization. That is, it follows the intended sequence of the protein that was cloned and purified, and eventually used for structure determination. It often happens that residues at either end are not visible. There are at least two reasons for this. One is that those residues are flexible (they don't have a single conformation in the crystal), so their diffraction pattern is not strong enough to fit anything into the map. The other is that parts of the protein may get degraded during purification or crystallization. Either way, those residues can't be modelled, and are omitted from the PDB file.
2. Same answer as above, except that flexible residues can occur anywhere in the chain.
3. Same answer as above, depending on where you got the FASTA sequences. PDB sequence records usually contain the expected protein sequence. If you look for the word "missing" in PDB files, it will indicate when some residues are absent with regard to that expectation.
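To illustrate the "missing" records mentioned above: in legacy PDB files, unmodelled residues are typically listed under REMARK 465 (MISSING RESIDUES). The sketch below pulls them out by fixed columns, assuming the usual "M RES C SSSEQI" layout. The function name and the toy lines are made up for illustration.

```python
def missing_residues(pdb_lines):
    """Return residues listed under REMARK 465 as
    (chain, number, insertion_code, residue_name) tuples."""
    missing = []
    for line in pdb_lines:
        if not line.startswith("REMARK 465"):
            continue
        try:
            number = int(line[21:26])  # SSSEQ, columns 22-26
        except ValueError:
            continue  # header or free-text line, not a residue entry
        chain = line[19]                  # C, column 20
        icode = line[26:27].strip()       # I, column 27
        resname = line[15:18]             # RES, columns 16-18
        missing.append((chain, number, icode, resname))
    return missing


# Hypothetical toy header: the initial Met and a short loop are unmodelled.
sample = [
    "REMARK 465 MISSING RESIDUES",
    "REMARK 465   M RES C SSSEQI",
    "REMARK 465     MET A     1",
    "REMARK 465     GLY A    51",
    "REMARK 465     ASP A    52",
]

absent = missing_residues(sample)
print(absent)
```

Skipping every line whose sequence-number field is not an integer is a simple way to drop the explanatory header text; a stricter parser could check the template line instead.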

The main point is that everything you observed is common and normal, even though it may have startled you if this is the first time you looked at many PDB structures at once.

ADD COMMENT • link 15 months ago by Mensur Dlakic ★ 29k

0

By the way, in some PDB structures you may also find negative residue numbers. That's usually the case when a purification tag was added to the protein, so tag residues get negative numbers and the actual protein sequence starts at 1.
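If you are scanning many files, a small report per chain can flag all of these numbering features at once: the first residue number, any negative numbers, and any jumps. This is only a sketch on made-up records; the function name is hypothetical, and it ignores insertion codes when looking for jumps.

```python
def numbering_report(pdb_lines, chain_id):
    """Summarize author residue numbering for one chain."""
    numbers = []
    for line in pdb_lines:
        if line.startswith("ATOM  ") and line[21] == chain_id:
            n = int(line[22:26])  # resSeq, columns 23-26 (may be negative)
            if not numbers or numbers[-1] != n:
                numbers.append(n)
    # A jump of more than 1 between consecutive residues is reported as a gap.
    gaps = [(a, b) for a, b in zip(numbers, numbers[1:]) if b - a > 1]
    return {
        "first": numbers[0] if numbers else None,
        "negative": [n for n in numbers if n < 0],
        "gaps": gaps,
    }


# Hypothetical toy chain: numbering starts below 1 and jumps from 3 to 8.
sample = [
    "ATOM      1  CA  GLY A  -1      11.000  12.000  13.000",
    "ATOM      2  CA  SER A   0      14.000  15.000  16.000",
    "ATOM      3  CA  MET A   1      17.000  18.000  19.000",
    "ATOM      4  CA  ALA A   2      20.000  21.000  22.000",
    "ATOM      5  CA  GLY A   3      23.000  24.000  25.000",
    "ATOM      6  CA  THR A   8      26.000  27.000  28.000",
]

report = numbering_report(sample, "A")
print(report)
```

Note that a file whose numbering goes straight from -1 to 1 (no residue 0) would also be reported as a gap by this check.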

ADD REPLY • link 15 months ago by Mensur Dlakic ★ 29k

0

Thank you for answering my questions. But you missed the PDB ID and UniProt ID question. To elaborate with examples: PDB IDs 5N9G, 8ITY, 8IUE, and 8IUH are grouped under UniProt ID A6H8Y1. What is the difference between UniProt proteins and RCSB PDB proteins? Why are multiple PDB IDs grouped under a single UniProt ID? Why do their FASTA sequences not match?

ADD REPLY • link 15 months ago by Nafi • 0

0

But you missed the PDBId and uniprotID question.

I didn't miss anything - I just can't answer all the questions. Assuming that anyone owes you answers to all the questions - and you asked many - is the surest way not to get any answers. That goes double when asking a question that is short on details.

ADD REPLY • link 15 months ago by Mensur Dlakic ★ 29k