Question

How to extract and calculate all protein atom to ligand atom distances from pdb files in PERL

0

Entering edit mode

7.4 years ago

srdjanmasirevic2 ▴ 10

Hello, I am trying to find the way how to calculate all atom distances from pdb file ? CAn anybody help me. I have a txt files that look like this :

ATOM       1279 C    ALA    81      -1.925 -11.270   1.404
ATOM       1280 O    ALA    81      -0.279   9.355  15.557
ATOM       1281 OXT  ALA    81      -2.188  10.341  15.346  
HETATM   1282 N    THR    82      29.632   5.205   5.525
HETATM   1283 H1   THR    82      30.175   4.389   5.768
HETATM   1284 H2   THR    82      28.816   4.910   5.008

I wrote a script for calculating distances but still I am having some problem, it doesnt print any distances. I was trying to understand why but I am not sure sinse doesnt give me any error.

#!/usr/local/bin/perl 

use strict;
use warnings;

open(IN, $ARGV[0]) or die "$!"; 
my (@refer, @points);
my $part = 0;
my $dist;
while (my $line = <IN>) { 
    chomp($line);
    if ($line =~ /^HETATM/) {
        $part++;
        next;
    }
    my @array = (substr($line, 30, 8),substr($line,38,8),substr($line,46,8));
#    print "@array\n";
    if ($part == 0) {
        push @refer, [ @array ]; 
    } elsif ($part ==1){
        push @points, [ @array ]; 
    }
}

    foreach my $ref(@refer) {
    my ($x1, $y1, $z1) = @{$ref};
    foreach my $atom(@points) {
        my ($x, $y, $z) = @{$atom};
        my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
    print $dist;

    }

}

perl • 4.4k views

ADD COMMENT • link updated 7.4 years ago by fishgolden ▴ 520 • written 7.4 years ago by srdjanmasirevic2 ▴ 10

0

Entering edit mode

This looks like homework, what have you tried?

ADD REPLY • link 7.4 years ago by Michael 55k

0

Entering edit mode

It's some kind of personal homework, not for school but form me to make some dataset. So what I have done so far is wrote a script in perl that from the whole pdb dataset I kept only those pdb files that have ligand in its binding pocket (I included some cut off for number of heavy atoms to exclude ions and solvents, so ended up with pdb files that have ligands). I also extracted from each file protein and ligand coordinates into another txt file. So currently I am having a files that look like this: for example file. 2KRJ.txt:

           Ax     Ay    Az
ATOM(1)   -1.23  4.55  -1.78 
ATOM(2)   -1.23  4.55  -1.78 
ATOM(3)   -1.23  4.55  -1.78 
ATOM(4)   -1.23  4.55  -1.78 
...(and so on)
           Bx     By    Bz
HETATM(1)  -7.5    5.7  -1.4
HETATM(2)  -7.5    5.7  -1.4
HETATM(3)  -7.5    5.7  -1.4
HETATM(4)  -7.5    5.7  -1.4
...(and so on)

The values for these coordinates are not the same of course. So I also found the formula fo calculating distances, but I am not sure how to calculate it: I want each atom (x,y,z coordinates) from protein (that starts with ^ATOM) to be calculated with each atom from ligand (that starts with ^HETATM). If the number of each calculatation is more then 5 (I should assigned some cut off) then move the file or something like that.

This is the formula: distance = sqrt(($Ax - $Bx)**2 + ($Ay - $By)**2 + ($Az - $Bz)**2)

I am not that experienced in programming so trying to figure out how to approach to this problem. Thanks

ADD REPLY • link 7.4 years ago by srdjanmasirevic2 ▴ 10

0

Entering edit mode

So basically I am trying to calculate ATOM(1) with all HETATM(1,2,3,4) and the same for each atom of the protein. if I end up with result that is <= to five then I should keep this file. Overall from all files I am trying to keep only those in which the distance between ligand and some protein atoms is less or equal to 5 Armstrongs.

ADD REPLY • link 7.4 years ago by srdjanmasirevic2 ▴ 10

0

Entering edit mode

Ok, that is totally feasible in Perl with two nested loops or maps, but why don't you use R on the file you made? read the file using atoms <- read.delim(my.file, sep="\t", header=T, row.names=1) ; dist(atoms); that what be about 1/10 of the code required in Perl.

ADD REPLY • link 7.4 years ago by Michael 55k

score 1 · Answer 1 · 2018-01-08

1

Entering edit mode

7.4 years ago

Michael 55k

see Algorithm::DistanceMatrix package in Perl or check out the R-solution:

atoms <- read.delim(my.file, sep="\t", header=T, row.names=1) ; dist(atoms);

ADD COMMENT • link 7.4 years ago by Michael 55k

Michael · Answer 2 · 2018-01-08

0

Entering edit mode

7.4 years ago

srdjanmasirevic2 ▴ 10

Im using perl because whole project I started to do in perl and want to finish it in the same manner. I have some basics of Python and Perl but I have never used R; overall to be honest my programming sucks :P Perl seems to be fine for data proccessing but it is also a bit confusing in some aspects. I will take a look at this package, I have never used to install any of the additional packages for perl. This should provide me with some additional commands in perl?

Overall I know that this task is pretty straight forward but I am having a problem because I'm actually not sure how to assign a value for each coordinate and to give command to calculate distance between protein and ligand atoms. Maybe I can make use subruotines but Im not completely sure about it :P

ADD COMMENT • link 7.4 years ago by srdjanmasirevic2 ▴ 10

0

Entering edit mode

I have made some progress, but I cant figure it out why script doesnt work. I have wrote two loops and pushed values of coordinates into seperate arrays and ttried to print distances but it doesnt work. Can You please take a look cause I am trying to change it numerous times. This is the script :

#!/usr/local/bin/perl 

use strict;
use warnings;

open(IN, $ARGV[0]) or die "$!"; 
my (@refer, @points);
my $part = 0;
my $dist;
while (my $line = <IN>) { 
    chomp($line);
    if ($line =~ /^HETATM/) {
        $part++;
        next;
    }
    my @array = (substr($line, 30, 8),substr($line,38,8),substr($line,46,8));
    ### MD: use split to split by delimiter:
    my @coords = split $line;


   ### MD: not sure what this is for. Do you need to distinguish between 
   ### two types of atoms? I thought you wanted just 
   ### the distances between all atoms?
   ### why not generate a simple nested list of all atom coords first?

#    print "@array\n";
    if ($part == 0) {
        push @refer, [ @array ]; 
    } elsif ($part ==1){
        push @points, [ @array ]; 
    }
}


   ### for calculation of a symmetric distance a nested for loop is more efficient 
   ### than a simple foreach, because you can calculate the the lower triangle 


  ### something like this should do the trick, not tested, and you need to define a sub dist
  my $distMat = [];
   for (my $i=0; $i < $M; $i++) {
        $distMat->[$i] = [];
        for (my $j=$i; $j < $N; $j++) {
            $distMat->[$i][$j] = dist($coords[$i], $coords[$j]);
        }
   }



    foreach my $ref(@refer) {
    my ($x1, $y1, $z1) = @{$ref};
    foreach my $atom(@points) {
        my ($x, $y, $z) = @{$atom};
        my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
    print $dist;

    }

}

ADD REPLY • link updated 7.4 years ago by Michael 55k • written 7.4 years ago by srdjanmasirevic2 ▴ 10

0

Entering edit mode

I have put some edits into your example code, please have a look.

ADD REPLY • link 7.4 years ago by Michael 55k

0

Entering edit mode

I wanted to check the distances between all protein atoms (ATOM) and ligand atoms (HETATM). Something like ATOM1 X HETATM1; ATOM2 X HETATM1 X ATOM3 X HETATM1 and so on.. I tried to put coordinated from the protein atoms ATOM into the one array and same for ligand atoms and then to calculate the distance between each of them. I am also trying to figure out how to assign that if the result of each calculation is less more then 5 Armstrongs then delete these lines that were involved into these calculation. I want to add that because at the end want to know which atoms from ligand and proteins are closest to each other, but still have no idea how to implement it into this script

ADD REPLY • link 7.4 years ago by srdjanmasirevic2 ▴ 10

score 0 · Answer 3 · 2018-01-12

I think the main problems of the script are

if ($line =~ /^HETATM/) {
    $part++;
    next;
}

and

 elsif ($part ==1){
        push @points, [ @array ]; 
    }

If HETATM label is found, it will go to the first line of the "while" loop (because of "next") and no element will be pushed in "@points".

In addition, your sample lines are not written in a valid PDB format. Please check

https://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM

Thus the simple pdb entry and code will be as follows

ATOM   1279 C    ALA    81      -1.925 -11.270   1.404
ATOM   1280 O    ALA    81      -0.279   9.355  15.557
ATOM   1281 OXT  ALA    81      -2.188  10.341  15.346
HETATM 1282 N    THR    82      29.632   5.205   5.525
HETATM 1283 H1   THR    82      30.175   4.389   5.768
HETATM 1284 H2   THR    82      28.816   4.910   5.008

====

#!/usr/local/bin/perl 

use strict;
use warnings;
open(IN, $ARGV[0]) or die "$!"; 
my (@refer, @points);
my $dist;
while (my $line = <IN>) { 
    chomp($line);
    my $hetflag = 0;
    if ($line =~ /^HETATM/) {
        $hetflag = 1;
    }elsif($line !~ /^ATOM/){
        next;
    }

    my @array = (substr($line, 30, 8),substr($line,38,8),substr($line,46,8));
    if ($hetflag == 0) {
        push @refer, [ @array ]; 
    } elsif ($hetflag ==1){
        push @points, [ @array ]; 
    }

}



foreach my $ref(@refer) {
    my ($x1, $y1, $z1) = @{$ref};
    foreach my $atom(@points) {
        my ($x, $y, $z) = @{$atom};
        my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
         print $dist."\n";
    }
}

By the way, finding ligand coordinates from PDB format file is a bit difficult. Because HETATM label is used for unusual amino acid (modified residues) like SELENOMETHIONINE

http://www.rcsb.org/pdb/explore/explore.do?structureId=2a0u

Usually, coordinates of ligands come after "TER" line(, I think).

If you use PDBx/mmCIF format, it is easy to distinguish ligands but the format is so difficult to parse.

BioPerl may handle the format. But I don't recommend Perl because it's an old language.