I have a pdb id of a protein, need to get the 3d co-ordinate values of its atoms only. Can anyone help me in writing the code to grip out just the co-ordinate values.
I have a pdb id of a protein, need to get the 3d co-ordinate values of its atoms only. Can anyone help me in writing the code to grip out just the co-ordinate values.
Rather than write the code for you, I thought it would be useful to explore in general terms how we go about solving this type of problem.
What you want to do is parse a file: a very common operation in bioinformatics. So common in fact, that you will often find that someone has done the work for you. I'll come back to that briefly at the end of the answer.
Here are the steps involved in developing a parser.
First, understand the structure of the file - in your case, a PDB file. This is achieved by reading documentation and by examining the file yourself in e.g. a text editor. Knowing that PDB files come from the Protein Data Bank, a web search should lead you to the appropriate documentation.
Second, identify the sub-parts of the file that you want to extract. Try to determine what makes them distinct from other parts of the file. When you look at a PDB file, you'll see that the parts containing coordinates look like this:
ATOM 834 CD2 TYR A 108 2.688 48.014 2.423 1.00 13.01 C
HETATM 968 CU CU1 A 201 16.622 34.752 4.635 0.90 18.04 CU
Try to explain in words what you see. You might say something like "line starts with ATOM or HETATM, followed by several space-separated fields, of which columns 7, 8 and 9 are X, Y and Z coordinates."
Third, without writing any code, describe how you would process the file to extract only those parts that interest you. Be precise and detailed. For example:
Finally, you are ready to write your code. What you wrote in step 3 is a guide to the methods that you need to learn (if you don't know them already) in your chosen programming language. So you need to know how to open a file for reading, scan one line at a time, match patterns to a line, split lines into arrays and print out the correct array elements.
And that is how you write a parser.
However - there's often no need to reinvent the wheel. There is a format called "XYZ" which is essentially just the coordinates and a web search for "convert PDB to XYZ" is sure to throw up a few tools. And don't be tied to one language if a good solution is available in another - such as this Python PDB->XYZ converter.
I honestly didn't revise this for a very long time, but I wrote this code back in 2004 that seems to do exactly what you need. feel free to use it, and let me know if it helps you in any way.
#!/usr/bin/perl -w
#
# coordExtract.pl
#
# parser that allows editing a pdb
# file to extract the coordinates
#
# Jorge Amigo Lechuga
#
#######################
# READ THE INPUT FILE #
#######################
# input file query
print "\nEnter the input file: ";
$inputFile = <STDIN>;
chomp $inputFile;
unless (open(INPUTFILE, $inputFile)) {
print "Cannot read from '$inputFile'.\nProgram closing.\n";
<STDIN>;
exit;
}
# load the file into an array
chomp(@dataArray = <INPUTFILE>);
# close the file
close(INPUTFILE);
###############
# LET'S WORK! #
###############
# parse the input file saving only backbone atoms coordinates
# format: [string "ATOM"] [number] [atom] [aa] whateva [3 decimal numbers] whateva with two dots in between
for ($line = 0; $line < scalar @dataArray; $line++) {
if ($dataArray[$line] =~ m/ATOM\s+\d+\s+(\w+)\s+\w{3}\s+.+\s+(\S+\.\S+)\s+(\S+\.\S+)\s+(\S+\.\S+)\s+.+\..+\..+/ig) {
if ($1 eq "N" || $1 eq "CA" || $1 eq "C") {
$parsedData{$line} = $2."\t".$3."\t".$4;
}
}
}
# create the output file name
$outputFile = "coordinates_".$inputFile;
# open the output file
open (OUTFILE, ">$outputFile");
# print the data lines
foreach $line (sort {$a <=> $b} keys %parsedData) {
print OUTFILE $parsedData{$line}."\n";
}
# close the output file
close (OUTFILE);
# end message
print "The coordinates of '$inputFile' were saved into '$outputFile'.\n";
# end the program
exit;
hahaha... this were days when data was light, and the state of the art was to load everything onto RAM to increase speed. sure dealing with the input lines as they are being read, without storing them into memory, would be more elegant. as I said, I just dumped my original code created when I only had like 1 year experience with Perl ;)
Using Bioperl with some code lifted from this post
use strict;
use warnings;
use Bio::Structure::IO;
my $filename="2AAI.pdb";
#use Data::Dumper;
my $stream = Bio::Structure::IO->new(-file => $filename,
-format => 'PDB');
while (my $struc = $stream->next_structure) {
for my $chain ($struc->get_chains) {
my $chainid = $chain->id;
# one-letter chaincode if present, 'default' otherwise
for my $res ($struc->get_residues($chain)) {
my $resid = $res->id;
# format is 3-lettercode - dash - residue number, e.g. PHE-20
my @atoms = $struc->get_atoms($res); # Atom objects, given a Residue
map{my @xyz = $_->xyz;print join("\t",@xyz),"\n";}@atoms;
}
}
}
Check this other question: A: How To Extract Just The Coordinate Values From A Pdb File Converted To A Text Fi
Using Perlmol :
use strict;
use warnings;
use Chemistry::Mol;
use Chemistry::File::PDB;
my $pdb = Chemistry::Mol->read("2AAI.pdb");
map{my $serial_no=$_;my $name=$_->name;my @xyz = $_->coords->array;
print $serial_no,"\t",$_->name,"\t",join("\t",@xyz),"\n";
}($pdb->atoms);
A very good introduction on perl based PDB parsing and a OpenGl based molecule viewer can be found from THIS issue of Perl Journal, authored by Simon Cozens
For more details and other attributes see this tutorial.
There is a Bioperl module (Bio::Structure::IO::pdb
) too, but I didn't try it so far.
A very good introduction on perl based PDB parsing and a OpenGl based molecule viewer can be found from the following issue of Perl Journal, authored by Simon Cozens http://japhy.perlmonk.org/personal/0408tpj.pdf
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
This is one of those answers that makes you wish you could up-vote more than once.
Thank you Sir. You explained in such a nice way. Thank you so much.
Wicked! I really prefer such conceptual answers rather than the usual code copy & pasting. The learning effect is much greater with this kind of response, even though it might not be the quickest route to take.
While the pseudocode will work for most cases the split on space bit won't give the correct results for all PDB entries. This is due to the fields being position based rather than delimited. The appropriate positions to find each field in the line can be obtained from the PDB format documentation.