Question

How can I list all amino acids in a protein using PyMOL?

3.3 years ago

user366312 ▴ 20

What is the command in PyMOL for listing all the amino acids in a specific protein, say, 1a62.pdb?

pdb pymol • 6.0k views

ADD COMMENT • link updated 3.3 years ago by Wayne ★ 2.1k • written 3.3 years ago by user366312 ▴ 20

score 0 · Answer 1 · 2022-03-03

3.3 years ago

Julian ▴ 20

Use get_fastastr(). See https://pymolwiki.org/index.php/Get_fastastr for details:

PyMOL>fetch 1a62
TITLE     CRYSTAL STRUCTURE OF THE RNA-BINDING DOMAIN OF THE TRANSCRIPTIONAL TERMINATOR PROTEIN RHO
 ExecutiveLoad-Detail: Detected mmCIF
 CmdLoad: "./1a62.cif" loaded as "1a62".
PyMOL>print(cmd.get_fastastr('all'))
>1a62_A
?NLTELKNTPVSELITLGEN?GLENLAR?RKQDIIFAILKQHAKSGEDIFGDGVLEILQDGFGFLRSADS
SYLAGPDDIYVSPSQIRRFNLRTGDTISGKIRPPKEGERYFALLKVNEVNFDKPENARNK

ADD COMMENT • link 3.3 years ago by Julian ▴ 20

Is there any way to see the 3-letter names rather than 1-letter symbols?

ADD REPLY • link 3.3 years ago by user366312 ▴ 20

You can use a web tool, such as One to three, to convert the sequence. Or, inside PyMOL, use this code (written here for chain A):

secondary_structure_list_by_aa = []
iterate (chain A and name ca), secondary_structure_list_by_aa.append((resn))
aas = " ".join(secondary_structure_list_by_aa)
print(aas)

Or, if you prefer no spaces and only the first letter capitalized, as One to three produces, use this in PyMOL:

secondary_structure_list_by_aa = []
iterate (chain A and name ca), secondary_structure_list_by_aa.append((resn))
aas = "".join([aa.title() for aa in secondary_structure_list_by_aa])
print(aas)

Those are based on my answer here about iterating.
Change the chain designation to match the one of interest to you if it isn't chain A.

ADD REPLY • link 3.3 years ago by Wayne ★ 2.1k

I am not working on the web. I am working on a terminal and running Python scripts.

ADD REPLY • link 3.3 years ago by user366312 ▴ 20

In my answer I also linked an example of how to use the PyMOL iterate code inside a Python script. With that as a guide, you could adapt the code I supplied for getting resn for each residue to the PyMOL API.
Other alternatives are provided on the PyMOL wiki aa page; these let you take what get_fastastr returns and convert it. An example based on that, using a Python dictionary:

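from pyfaidx import Fasta

# One-letter to three-letter codes; '?' (unknown residue) maps to UNK
three_letter = {'V':'VAL', 'I':'ILE', 'L':'LEU', 'E':'GLU', 'Q':'GLN',
                'D':'ASP', 'N':'ASN', 'H':'HIS', 'W':'TRP', 'F':'PHE', 'Y':'TYR',
                'R':'ARG', 'K':'LYS', 'S':'SER', 'T':'THR', 'M':'MET', 'A':'ALA',
                'G':'GLY', 'P':'PRO', 'C':'CYS', '?':'UNK'}
seqs = Fasta('test_seq.fa')  # FASTA text saved from get_fastastr
for sq in seqs:
    # Symbols missing from the dictionary fall back to UNK instead of raising KeyError
    aas = "".join([three_letter.get(aa, 'UNK').title() for aa in str(sq)])
    print(aas)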

With the FASTA for 1a62_A from the original example saved as test_seq.fa, that gives:

UnkAsnLeuThrGluLeuLysAsnThrProValSerGluLeuIleThrLeuGlyGluAsnUnkGlyLeuGluAsnLeuAlaArgUnkArgLysGlnAspIleIlePheAlaIleLeuLysGlnHisAlaLysSerGlyGluAspIlePheGlyAspGlyValLeuGluIleLeuGlnAspGlyPheGlyPheLeuArgSerAlaAspSerSerTyrLeuAlaGlyProAspAspIleTyrValSerProSerGlnIleArgArgPheAsnLeuArgThrGlyAspThrIleSerGlyLysIleArgProProLysGluGlyGluArgTyrPheAlaLeuLeuLysValAsnGluValAsnPheAspLysProGluAsnAlaArgAsnLys
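
For a terminal script that should not depend on pyfaidx, the same conversion can be sketched with the standard library alone. This is only a minimal sketch, not part of the original reply: it assumes the FASTA text is already held in a string (for example, the value returned by cmd.get_fastastr), and the names parse_fasta and to_three_letter are hypothetical helpers introduced for illustration. Any symbol not in the table is rendered as Unk.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    'A': 'ALA', 'R': 'ARG', 'N': 'ASN', 'D': 'ASP', 'C': 'CYS',
    'Q': 'GLN', 'E': 'GLU', 'G': 'GLY', 'H': 'HIS', 'I': 'ILE',
    'L': 'LEU', 'K': 'LYS', 'M': 'MET', 'F': 'PHE', 'P': 'PRO',
    'S': 'SER', 'T': 'THR', 'W': 'TRP', 'Y': 'TYR', 'V': 'VAL',
}


def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text (hypothetical helper)."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: ''.join(parts) for name, parts in records.items()}


def to_three_letter(sequence, sep=''):
    """Convert one-letter symbols to title-case three-letter names.

    Symbols outside the table (such as '?') become 'Unk'.
    """
    return sep.join(THREE_LETTER.get(aa, 'UNK').title() for aa in sequence.upper())


if __name__ == '__main__':
    # First 20 symbols of the 1a62_A output shown earlier, split over two lines.
    fasta_text = ">1a62_A\n?NLTELKNTPV\nSELITLGEN\n"
    for name, seq in parse_fasta(fasta_text).items():
        print(name, to_three_letter(seq, sep=' '))
```

Passing sep=' ' gives the space-separated form; the default gives the run-together form shown above.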

ADD REPLY • link 3.3 years ago by Wayne ★ 2.1k

score 0 · Answer 2 · 2022-03-03

If you do want an actual list of the amino acids represented in the structure while in PyMOL, or via its API, an alternative to the get_fastastr() approach pointed out by Julian is to iterate over the residues using PyMOL's iterate command. This has the added benefit that iterate exposes additional variables, so you can get further details about individual residues while iterating. The additional variables are listed here. For example, to iterate over each residue in chain A and collect the one-letter amino acid code, residue number, and secondary structure type for each, use:

secondary_structure_list_by_aa_resnumber = []
iterate (chain A and name ca), secondary_structure_list_by_aa_resnumber.append((oneletter, resv, ss))
print(secondary_structure_list_by_aa_resnumber)

Iterating via PyMOL's API is demonstrated here. You can get an active form of that Jupyter notebook in a temporary session served via MyBinder.org by going here, clicking the launch binder badge, and then selecting the notebook entitled 'Demo of Iterating over residue secondary structure' from the list of available notebooks after the session launches.
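
For a plain Python script run from a terminal, the same iteration might look like the sketch below. It is not taken from the linked notebook. It assumes PyMOL is importable as the pymol module and that a local file such as 1a62.pdb exists; it uses cmd.load and cmd.iterate, passing a dictionary through the space argument to collect the results. The function names residues_in and format_residues are hypothetical.

```python
def residues_in(path, selection='chain A and name CA'):
    """Load a structure file and return (resn, resv, ss) per matching atom.

    Assumes PyMOL is installed; the import is done lazily so that
    format_residues below can be used without it.
    """
    from pymol import cmd
    cmd.load(path)
    collected = {'rows': []}
    # The expression is evaluated once per atom; 'rows' is looked up in `space`.
    cmd.iterate(selection, 'rows.append((resn, resv, ss))', space=collected)
    return collected['rows']


def format_residues(rows):
    """Render rows as lines like 'ASN 2 H'; an empty ss value is shown as '-'."""
    return '\n'.join(f'{resn} {resv} {ss or "-"}' for resn, resv, ss in rows)


# Example use (requires PyMOL and a local copy of the file):
#     print(format_residues(residues_in('1a62.pdb')))
```

Restricting the selection to name CA yields one row per residue, as in the iterate command above.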

A couple of things to bear in mind:

There are easier ways to get such a list without opening PyMOL, getting a structure, and running a command. And if you are already in PyMOL's graphical user interface, you can toggle on display of the sequence by choosing Display > Sequence from the program's menu bar.
Be aware that you are listing only the amino acids that happen to be represented in the chain in the structural model of that protein.

PDBsum will provide this information for every structure in the Protein Data Bank and at the same time make it much clearer what is or isn't represented in both the sequence & structure context.

From the main PDBsum page of your example 1a62, you can see that although the E. coli transcription termination factor rho is 419 amino acids long, only 125 amino acids at the N-terminus are represented in this structure. The 'Protein' tab provides details of the specific residues represented. If there were gaps caused by unrepresented residues internal to the chain, they would show in this view as a break in the secondary structure representation and an absence of the letters for the amino acids in that region. See the 'Protein' tab of 2ace here for an example where 485-489 are missing. See here for more about such gaps. When you open a structure in the viewer FirstGlance in Jmol, it immediately highlights regions with missing residues with mesh baskets, as they are often easy to overlook when first scanning the 3D structure. You can click on 'Missing Residues' for a report on them. For example, the model 2ace is missing 10 residues of the protein (1-3, 485-489, and 536-537 in chain A), and these include 3 negatively charged amino acids.
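
Internal gaps of this kind can also be spotted from the residue numbers alone. The sketch below is one way to do that, assuming you already have the integer residue numbers for a single chain (for example, the resv values collected with iterate); find_gaps is a hypothetical helper. It cannot see residues missing at either terminus, and a jump in numbering is not always a missing residue, so treat the result as a hint to check against PDBsum or FirstGlance rather than as a definitive report.

```python
def find_gaps(residue_numbers):
    """Return (first_missing, last_missing) for each break in the numbering."""
    ordered = sorted(set(residue_numbers))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > 1:
            gaps.append((previous + 1, current - 1))
    return gaps


if __name__ == '__main__':
    # Numbering modelled on the 2ace description above:
    # residues 4-484 and 490-535 present in chain A.
    present = list(range(4, 485)) + list(range(490, 536))
    print(find_gaps(present))  # [(485, 489)]
```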

While on the 'Protein' tab for a chain at PDBsum, you can obtain the FASTA file for the sequence represented in the structural model of that chain by clicking the file icon to the right of the top line of the secondary structure, just to the left of the wire diagram of the topology. From the URL of that page you can work out the pattern for requesting and retrieving this information computationally, using wget or curl on the command line without a browser.