Question

How to extract a subset of a protein structure (PDB format) file based on a subsequence of the protein

1

3.5 years ago

gundalav ▴ 380

I am looking at a particular protein structure, 2LY4, accessible from the RCSB PDB website. The corresponding FASTA sequence for that structure is:

>2LY4_1|Chain A|High mobility group protein B1|Homo sapiens (9606)
GKGDPKKPRGKMSSYAFFVQTCREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYIPPKGE
>2LY4_2|Chain B|Cellular tumor antigen p53|Homo sapiens (9606)
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPL

And the PDB format file can be downloaded here.

What I want to do is extract the part of the PDB file that corresponds to a subsequence of the FASTA above, namely Chain A from the 1st residue to the 30th residue:

GKGDPKKPRGKMSSYAFFVQTCREEHKKKH

How can I do that in R or Python?

protein pdb python r • 1.7k views

updated 3.5 years ago by jgreener ▴ 390 • written 3.5 years ago by gundalav ▴ 380

score 0 · Answer 1 · 2021-06-06

0

3.5 years ago

jgreener ▴ 390

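I would use Biopython. See page 192 of the tutorial PDF for information on how to write out part of a structure. In this case you'll want to subclass Select from Bio.PDB and override its accept_residue method so that it only accepts residues in Chain A whose residue number falls in the range you want.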
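As a rough sketch of that approach (not tested against 2LY4 itself): the class name ResidueRangeSelect, the helper extract_residue_range, and the file names in the usage comment are made up for illustration. It assumes the residue numbers in the PDB file line up with positions in the FASTA sequence, which is not guaranteed, so check the numbering in your file first. Since 2LY4 is an NMR entry it will likely contain several models, and this sketch keeps the selected residues from every model.

```python
from Bio.PDB import PDBParser, PDBIO, Select


class ResidueRangeSelect(Select):
    """Accept only standard residues of one chain within a residue-number range."""

    def __init__(self, chain_id, start, end):
        self.chain_id = chain_id
        self.start = start
        self.end = end

    def accept_chain(self, chain):
        # Drop every chain except the requested one.
        return chain.get_id() == self.chain_id

    def accept_residue(self, residue):
        # A residue id is a tuple: (hetero flag, residue number, insertion code).
        hetflag, resseq, _icode = residue.get_id()
        # hetflag == " " keeps standard residues and skips waters/ligands.
        return hetflag == " " and self.start <= resseq <= self.end


def extract_residue_range(in_path, out_path, chain_id, start, end):
    """Write residues start..end (inclusive, by residue number) of one chain."""
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure("input", in_path)
    io = PDBIO()
    io.set_structure(structure)
    io.save(out_path, ResidueRangeSelect(chain_id, start, end))


# Hypothetical usage, assuming the file was saved locally as 2ly4.pdb:
# extract_residue_range("2ly4.pdb", "2ly4_A_1-30.pdb", "A", 1, 30)
```

If the numbering in the file does not start at 1, shift start and end accordingly, or select by position within the chain instead of by residue number.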