How To Extract Just The Coordinate Values From A Pdb File, Using Perl Only?
5
2
Entering edit mode
13.8 years ago
Sakshi ▴ 20

I have a pdb id of a protein, need to get the 3d co-ordinate values of its atoms only. Can anyone help me in writing the code to grip out just the co-ordinate values.

pdb protein coordinates parsing perl • 19k views
ADD COMMENT
10
Entering edit mode
13.8 years ago
Neilfws 49k

Rather than write the code for you, I thought it would be useful to explore in general terms how we go about solving this type of problem.

What you want to do is parse a file: a very common operation in bioinformatics. So common in fact, that you will often find that someone has done the work for you. I'll come back to that briefly at the end of the answer.

Here are the steps involved in developing a parser.

First, understand the structure of the file - in your case, a PDB file. This is achieved by reading documentation and by examining the file yourself in e.g. a text editor. Knowing that PDB files come from the Protein Data Bank, a web search should lead you to the appropriate documentation.

Second, identify the sub-parts of the file that you want to extract. Try to determine what makes them distinct from other parts of the file. When you look at a PDB file, you'll see that the parts containing coordinates look like this:

ATOM    834  CD2 TYR A 108       2.688  48.014   2.423  1.00 13.01           C
HETATM  968 CU   CU1 A 201      16.622  34.752   4.635  0.90 18.04          CU

Try to explain in words what you see. You might say something like "line starts with ATOM or HETATM, followed by several space-separated fields, of which columns 7, 8 and 9 are X, Y and Z coordinates."

Third, without writing any code, describe how you would process the file to extract only those parts that interest you. Be precise and detailed. For example:

  • open the file
  • read one line at a time
  • does the line begin with ATOM or HETATM?
  • if so, split the line on space
  • print out only the fields with X, Y and Z

Finally, you are ready to write your code. What you wrote in step 3 is a guide to the methods that you need to learn (if you don't know them already) in your chosen programming language. So you need to know how to open a file for reading, scan one line at a time, match patterns to a line, split lines into arrays and print out the correct array elements.

And that is how you write a parser.

However - there's often no need to reinvent the wheel. There is a format called "XYZ" which is essentially just the coordinates and a web search for "convert PDB to XYZ" is sure to throw up a few tools. And don't be tied to one language if a good solution is available in another - such as this Python PDB->XYZ converter.

ADD COMMENT
1
Entering edit mode

This is one of those answers that makes you wish you could up-vote more than once.

ADD REPLY
0
Entering edit mode

Thank you Sir. You explained in such a nice way. Thank you so much.

ADD REPLY
0
Entering edit mode

Wicked! I really prefer such conceptual answers rather than the usual code copy & pasting. The learning effect is much greater with this kind of response, even though it might not be the quickest route to take.

ADD REPLY
0
Entering edit mode

While the pseudocode will work for most cases the split on space bit won't give the correct results for all PDB entries. This is due to the fields being position based rather than delimited. The appropriate positions to find each field in the line can be obtained from the PDB format documentation.

ADD REPLY
3
Entering edit mode
13.8 years ago

I honestly didn't revise this for a very long time, but I wrote this code back in 2004 that seems to do exactly what you need. feel free to use it, and let me know if it helps you in any way.

#!/usr/bin/perl -w

#
# coordExtract.pl
#
# parser that allows editing a pdb
# file to extract the coordinates
#
# Jorge Amigo Lechuga
#

#######################
# READ THE INPUT FILE #
#######################

# input file query
print "\nEnter the input file: ";
$inputFile = <STDIN>;
chomp $inputFile;

unless (open(INPUTFILE, $inputFile)) {
    print "Cannot read from '$inputFile'.\nProgram closing.\n";
    <STDIN>;
    exit;
}

# load the file into an array
chomp(@dataArray = <INPUTFILE>);

# close the file
close(INPUTFILE);

###############
# LET'S WORK! #
###############

# parse the input file saving only backbone atoms coordinates
# format: [string "ATOM"] [number] [atom] [aa] whateva [3 decimal numbers] whateva with two dots in between
for ($line = 0; $line < scalar @dataArray; $line++) {
    if ($dataArray[$line] =~ m/ATOM\s+\d+\s+(\w+)\s+\w{3}\s+.+\s+(\S+\.\S+)\s+(\S+\.\S+)\s+(\S+\.\S+)\s+.+\..+\..+/ig) {
        if ($1 eq "N" || $1 eq "CA" || $1 eq "C") {
            $parsedData{$line} = $2."\t".$3."\t".$4;
        }
    }
}

# create the output file name
$outputFile = "coordinates_".$inputFile;

# open the output file
open (OUTFILE, ">$outputFile");

# print the data lines
foreach $line (sort {$a <=> $b} keys %parsedData) {
    print OUTFILE $parsedData{$line}."\n";
}

# close the output file
close (OUTFILE);

# end message
print "The coordinates of '$inputFile' were saved into '$outputFile'.\n";

# end the program
exit;
ADD COMMENT
1
Entering edit mode

Why save it all in a hash first? You might as well have the "print OUTFILE" statement right where you parse the data :-)

ADD REPLY
0
Entering edit mode

hahaha... this were days when data was light, and the state of the art was to load everything onto RAM to increase speed. sure dealing with the input lines as they are being read, without storing them into memory, would be more elegant. as I said, I just dumped my original code created when I only had like 1 year experience with Perl ;)

ADD REPLY
0
Entering edit mode

Thanks Jorge. Thanks so much. yes, it worked for me.

ADD REPLY
0
Entering edit mode

this code prints the coordinates in tab delimited form which is not useful when we have to open the structure in visualizing tools..

ADD REPLY
0
Entering edit mode

true, but as I stated this was a piece of code I had from long ago, so I just wanted to share it here in case it helped in any way. if you don't like the tabulated format, you may modify the printing hash $parsedData{$line} = $2."t".$3."t".$4; as needed.

ADD REPLY
0
Entering edit mode

@Jorge this is a great illustration of why reading the format documentation is so important. The coordinates in the PDB format are in fixed width fields which do not have a separator between them, thus positional parsing needs to be used for the coordinates.

ADD REPLY
2
Entering edit mode
13.8 years ago
Woa ★ 2.9k

Using Bioperl with some code lifted from this post

use strict;
use warnings;
use Bio::Structure::IO;
my $filename="2AAI.pdb";
#use Data::Dumper;
my $stream = Bio::Structure::IO->new(-file => $filename,
                                    -format => 'PDB');

while (my $struc = $stream->next_structure) {   
    for my $chain ($struc->get_chains) {
        my $chainid = $chain->id;
        # one-letter chaincode if present, 'default' otherwise
        for my $res ($struc->get_residues($chain)) {
            my $resid = $res->id;
            # format is 3-lettercode - dash - residue number, e.g. PHE-20
            my @atoms = $struc->get_atoms($res); # Atom objects, given a Residue
            map{my @xyz = $_->xyz;print join("\t",@xyz),"\n";}@atoms;  
        }
    }
}
ADD COMMENT
1
Entering edit mode
ADD COMMENT
0
Entering edit mode

although the answers on that thread should be read since a couple of interesting ideas are mentioned, I understood that Sakshi was interested on a "perl only" solution and not java based.

ADD REPLY
0
Entering edit mode

yeah.... Jordeu... thats in Java. I need the code in perl only. thanks anyways.

ADD REPLY
1
Entering edit mode
13.8 years ago
Woa ★ 2.9k

Using Perlmol :

use strict;
use warnings;
use Chemistry::Mol;
use Chemistry::File::PDB;
my $pdb = Chemistry::Mol->read("2AAI.pdb");

map{my $serial_no=$_;my $name=$_->name;my @xyz = $_->coords->array;
    print $serial_no,"\t",$_->name,"\t",join("\t",@xyz),"\n";
}($pdb->atoms);

A very good introduction on perl based PDB parsing and a OpenGl based molecule viewer can be found from THIS issue of Perl Journal, authored by Simon Cozens

ADD COMMENT
0
Entering edit mode

For more details and other attributes see this tutorial.

There is a Bioperl module (Bio::Structure::IO::pdb) too, but I didn't try it so far.

ADD REPLY
0
Entering edit mode

A very good introduction on perl based PDB parsing and a OpenGl based molecule viewer can be found from the following issue of Perl Journal, authored by Simon Cozens http://japhy.perlmonk.org/personal/0408tpj.pdf

ADD REPLY

Login before adding your answer.

Traffic: 2789 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6