Question

How To Extract Just The Coordinate Values From A Pdb File, Using Perl Only?

2

Entering edit mode

14.1 years ago

Sakshi ▴ 20

I have a pdb id of a protein, need to get the 3d co-ordinate values of its atoms only. Can anyone help me in writing the code to grip out just the co-ordinate values.

pdb protein coordinates parsing perl • 19k views

ADD COMMENT • link updated 8.0 years ago by Biostar 20 • written 14.1 years ago by Sakshi ▴ 20

score 10 · Answer 1 · 2011-03-31

Rather than write the code for you, I thought it would be useful to explore in general terms how we go about solving this type of problem.

What you want to do is parse a file: a very common operation in bioinformatics. So common in fact, that you will often find that someone has done the work for you. I'll come back to that briefly at the end of the answer.

Here are the steps involved in developing a parser.

First, understand the structure of the file - in your case, a PDB file. This is achieved by reading documentation and by examining the file yourself in e.g. a text editor. Knowing that PDB files come from the Protein Data Bank, a web search should lead you to the appropriate documentation.

Second, identify the sub-parts of the file that you want to extract. Try to determine what makes them distinct from other parts of the file. When you look at a PDB file, you'll see that the parts containing coordinates look like this:

ATOM    834  CD2 TYR A 108       2.688  48.014   2.423  1.00 13.01           C
HETATM  968 CU   CU1 A 201      16.622  34.752   4.635  0.90 18.04          CU

Try to explain in words what you see. You might say something like "line starts with ATOM or HETATM, followed by several space-separated fields, of which columns 7, 8 and 9 are X, Y and Z coordinates."

Third, without writing any code, describe how you would process the file to extract only those parts that interest you. Be precise and detailed. For example:

open the file
read one line at a time
does the line begin with ATOM or HETATM?
if so, split the line on space
print out only the fields with X, Y and Z

Finally, you are ready to write your code. What you wrote in step 3 is a guide to the methods that you need to learn (if you don't know them already) in your chosen programming language. So you need to know how to open a file for reading, scan one line at a time, match patterns to a line, split lines into arrays and print out the correct array elements.

And that is how you write a parser.

However - there's often no need to reinvent the wheel. There is a format called "XYZ" which is essentially just the coordinates and a web search for "convert PDB to XYZ" is sure to throw up a few tools. And don't be tied to one language if a good solution is available in another - such as this Python PDB->XYZ converter.

score 3 · Answer 2 · 2011-03-31

3

Entering edit mode

14.1 years ago

Jorge Amigo 14k

I honestly didn't revise this for a very long time, but I wrote this code back in 2004 that seems to do exactly what you need. feel free to use it, and let me know if it helps you in any way.

#!/usr/bin/perl -w

#
# coordExtract.pl
#
# parser that allows editing a pdb
# file to extract the coordinates
#
# Jorge Amigo Lechuga
#

#######################
# READ THE INPUT FILE #
#######################

# input file query
print "\nEnter the input file: ";
$inputFile = <STDIN>;
chomp $inputFile;

unless (open(INPUTFILE, $inputFile)) {
    print "Cannot read from '$inputFile'.\nProgram closing.\n";
    <STDIN>;
    exit;
}

# load the file into an array
chomp(@dataArray = <INPUTFILE>);

# close the file
close(INPUTFILE);

###############
# LET'S WORK! #
###############

# parse the input file saving only backbone atoms coordinates
# format: [string "ATOM"] [number] [atom] [aa] whateva [3 decimal numbers] whateva with two dots in between
for ($line = 0; $line < scalar @dataArray; $line++) {
    if ($dataArray[$line] =~ m/ATOM\s+\d+\s+(\w+)\s+\w{3}\s+.+\s+(\S+\.\S+)\s+(\S+\.\S+)\s+(\S+\.\S+)\s+.+\..+\..+/ig) {
        if ($1 eq "N" || $1 eq "CA" || $1 eq "C") {
            $parsedData{$line} = $2."\t".$3."\t".$4;
        }
    }
}

# create the output file name
$outputFile = "coordinates_".$inputFile;

# open the output file
open (OUTFILE, ">$outputFile");

# print the data lines
foreach $line (sort {$a <=> $b} keys %parsedData) {
    print OUTFILE $parsedData{$line}."\n";
}

# close the output file
close (OUTFILE);

# end message
print "The coordinates of '$inputFile' were saved into '$outputFile'.\n";

# end the program
exit;

ADD COMMENT • link 14.1 years ago by Jorge Amigo 14k

1

Entering edit mode

Why save it all in a hash first? You might as well have the "print OUTFILE" statement right where you parse the data :-)

ADD REPLY • link 14.1 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

hahaha... this were days when data was light, and the state of the art was to load everything onto RAM to increase speed. sure dealing with the input lines as they are being read, without storing them into memory, would be more elegant. as I said, I just dumped my original code created when I only had like 1 year experience with Perl ;)

ADD REPLY • link 14.1 years ago by Jorge Amigo 14k

0

Entering edit mode

Thanks Jorge. Thanks so much. yes, it worked for me.

ADD REPLY • link 14.1 years ago by Sakshi ▴ 20

0

Entering edit mode

this code prints the coordinates in tab delimited form which is not useful when we have to open the structure in visualizing tools..

ADD REPLY • link 14.0 years ago by Kanika ▴ 10

0

Entering edit mode

true, but as I stated this was a piece of code I had from long ago, so I just wanted to share it here in case it helped in any way. if you don't like the tabulated format, you may modify the printing hash $parsedData{$line} = $2."t".$3."t".$4; as needed.

ADD REPLY • link 14.0 years ago by Jorge Amigo 14k

0

Entering edit mode

@Jorge this is a great illustration of why reading the format documentation is so important. The coordinates in the PDB format are in fixed width fields which do not have a separator between them, thus positional parsing needs to be used for the coordinates.

ADD REPLY • link 13.2 years ago by Hamish ★ 3.3k

Ram · Answer 3 · 2011-04-11

Using Bioperl with some code lifted from this post

use strict;
use warnings;
use Bio::Structure::IO;
my $filename="2AAI.pdb";
#use Data::Dumper;
my $stream = Bio::Structure::IO->new(-file => $filename,
                                    -format => 'PDB');

while (my $struc = $stream->next_structure) {   
    for my $chain ($struc->get_chains) {
        my $chainid = $chain->id;
        # one-letter chaincode if present, 'default' otherwise
        for my $res ($struc->get_residues($chain)) {
            my $resid = $res->id;
            # format is 3-lettercode - dash - residue number, e.g. PHE-20
            my @atoms = $struc->get_atoms($res); # Atom objects, given a Residue
            map{my @xyz = $_->xyz;print join("\t",@xyz),"\n";}@atoms;  
        }
    }
}

Ram · Answer 4 · 2011-03-31

1

Entering edit mode

14.1 years ago

Jordeu ▴ 20

Check this other question: A: How To Extract Just The Coordinate Values From A Pdb File Converted To A Text Fi

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 14.1 years ago by Jordeu ▴ 20

0

Entering edit mode

although the answers on that thread should be read since a couple of interesting ideas are mentioned, I understood that Sakshi was interested on a "perl only" solution and not java based.

ADD REPLY • link 14.1 years ago by Jorge Amigo 14k

0

Entering edit mode

yeah.... Jordeu... thats in Java. I need the code in perl only. thanks anyways.

ADD REPLY • link 14.1 years ago by Sakshi ▴ 20

Ram · Answer 5 · 2011-04-10

1

Entering edit mode

14.1 years ago

Woa ★ 2.9k

Using Perlmol :

use strict;
use warnings;
use Chemistry::Mol;
use Chemistry::File::PDB;
my $pdb = Chemistry::Mol->read("2AAI.pdb");

map{my $serial_no=$_;my $name=$_->name;my @xyz = $_->coords->array;
    print $serial_no,"\t",$_->name,"\t",join("\t",@xyz),"\n";
}($pdb->atoms);

A very good introduction on perl based PDB parsing and a OpenGl based molecule viewer can be found from THIS issue of Perl Journal, authored by Simon Cozens

ADD COMMENT • link 13.2 years ago by Woa ★ 2.9k

0

Entering edit mode

For more details and other attributes see this tutorial.

There is a Bioperl module (Bio::Structure::IO::pdb) too, but I didn't try it so far.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 14.1 years ago by Woa ★ 2.9k

0

Entering edit mode

A very good introduction on perl based PDB parsing and a OpenGl based molecule viewer can be found from the following issue of Perl Journal, authored by Simon Cozens http://japhy.perlmonk.org/personal/0408tpj.pdf

ADD REPLY • link 14.1 years ago by Woa ★ 2.9k