Parse Many/Multiple .Pdb Files Using Perl Script
2
0
Entering edit mode
11.0 years ago

I am final year bioinformatics student, doing final year project on "Protein Database". Im using lots of .pdb files, rely alot. I saved all the .pdb files in a folder. Following that i want parse all the .pdb files simultaneously around (50 - 60 .pdb files) and output in a single output file. I dont want to input file name manualy in 'cmd'. Instead,i want the perl script that can read the stored .pdb files in the folder as input, 1 by 1..

(THIS IS THE PERL SCRIPT THAT HELPED ME ALOT. BUT I WANT TO PARSE AROUND 50-60 .pdb FILES WHICH IS STORED IN A FOLDER 1 BY 1 AND OUTPUT IN A SINGLE TEXT FILE. I DONT WANT ENTER THE STORED .pdb FILE NAMES MANULAY IN COMMAND PROMPT)

(THANKS TO MR.JORGE AMIGO LECHUGA)

#!/usr/bin/perl -w
#
# coordExtract.pl
#     # parser that allows editing a pdb
# file to extract the coordinates
#
# Jorge Amigo Lechuga
#

#######################
# READ THE INPUT FILE #
#######################

# input file query
print "\nEnter the input file: ";
$inputFile = <STDIN>;
chomp $inputFile;

unless (open(INPUTFILE, $inputFile)) {
    print "Cannot read from '$inputFile'.\nProgram closing.\n";
    <STDIN>;
    exit;
}

# load the file into an array
chomp(@dataArray = <INPUTFILE>);

# close the file
close(INPUTFILE);

###############
# LET'S WORK! #
###############

# parse the input file saving only backbone atoms coordinates
# format: [string "ATOM"] [number] [atom] [aa] whateva [3 decimal numbers] whateva with two dots in between
for ($line = 0; $line < scalar @dataArray; $line++) {
    if ($dataArray[$line] =~ m/ATOM\s+\d+\s+(\w+)\s+\w{3}\s+.+\s+(\S+\.\S+)\s+(\S+\.\S+)\s+(\S+\.\S+)\s+.+\..+\..+/ig) {
        if ($1 eq "N" || $1 eq "CA" || $1 eq "C") {
            $parsedData{$line} = $2."\t".$3."\t".$4;
        }
    }
}

# create the output file name
$outputFile = "coordinates_".$inputFile;

# open the output file
open (OUTFILE, ">$outputFile");
# print the data lines
foreach $line (sort {$a <=> $b} keys %parsedData) {
    print OUTFILE $parsedData{$line}."\n";
}

# close the output file
close (OUTFILE);

# end message
print "The coordinates of '$inputFile' were saved into '$outputFile'.\n";

# end the program
exit;
script pdb multiple protein • 7.8k views
ADD COMMENT
0
Entering edit mode

hi, the perl script is running but, the loop is not working. the script give output for only 1 .pdb file. :(

ADD REPLY
1
Entering edit mode

OK I'm taking a look right now. That's encouraging that it's processing one file correctly. Probably just a minor tweak.

ADD REPLY
0
Entering edit mode

stil the same. loop is not working.

unless (open(INPUTFILE, $pdbFile)) { print "Cannot read from '$pdbFile'.\nProgram closing.\n"; ; #exit; (i disable the exit command and its stil the same) }

ADD REPLY
2
Entering edit mode
11.0 years ago
Dan D 7.4k

EDIT: I foolishly left the "exit" command inside the loop. My bad!

OK, try running this in the directory with your PDB files. Back them up before you run this!

#!/usr/bin/perl -w
#
# coordExtract.pl
#     # parser that allows editing a pdb
# file to extract the coordinates
#
# Jorge Amigo Lechuga
#

#######################
# READ THE INPUT FILE #
#######################

my @pdbFiles = glob("*.pdb");
foreach my $pdbFile (@pdbFiles){

    unless (open(INPUTFILE, $pdbFile)) {
        print "Cannot read from '$pdbFile'.\nProgram closing.\n";
        <STDIN>;
        exit;
    }

    # load the file into an array
    chomp(@dataArray = <INPUTFILE>);

    # close the file
    close(INPUTFILE);

    ###############
    # LET'S WORK! #
    ###############

    # parse the input file saving only backbone atoms coordinates
    # format: [string "ATOM"] [number] [atom] [aa] whateva [3 decimal numbers] whateva with two dots in between
    for ($line = 0; $line < scalar @dataArray; $line++) {
        if ($dataArray[$line] =~ m/ATOM\s+\d+\s+(\w+)\s+\w{3}\s+.+\s+(\S+\.\S+)\s+(\S+\.\S+)\s+(\S+\.\S+)\s+.+\..+\..+/ig) {
            if ($1 eq "N" || $1 eq "CA" || $1 eq "C") {
                $parsedData{$line} = $2."\t".$3."\t".$4;
            }
        }
    }

    # create the output file name
    $outputFile = "coordinates_".$pdbFile;

    # open the output file
    open (OUTFILE, ">$outputFile");
    # print the data lines
    foreach $line (sort {$a <=> $b} keys %parsedData) {
        print OUTFILE $parsedData{$line}."\n";
    }

    # close the output file
    close (OUTFILE);

    # end message
    print "The coordinates of '$pdbFile' were saved into '$outputFile'.\n";

}

# end the program
exit;
ADD COMMENT
0
Entering edit mode

stil the same. unless (open(INPUTFILE, $pdbFile)) { print "Cannot read from '$pdbFile'.\nProgram closing.\n"; <stdin>; #exit; (i disable the exit command and its stil the same) }

ADD REPLY
1
Entering edit mode

It worked for me. See this screenshot:

https://www.dropbox.com/s/5csowphrqncui7n/pdb.JPG

Did you copy the entire script above? If so, put it in the directory with your PDB files. Navigate to the directory with the PDB files and type perl coordExtract.pl

ADD REPLY
0
Entering edit mode

OMG, ITS WORKS IN window o/s!!!!!! thanks alot DEEDEE!!!!! b4 tiz i was tryin in ubuntu and it did not work, bt works charm is WINDOWS!!!! THANKS AGAIN!!!!!!!! 1 more thing, can u give all the output in single file???

ADD REPLY
1
Entering edit mode

In Windows, navigate to the directory with the pdb output. Type the following command: type coordinates_* > aggregateOutput.pdb

ADD REPLY
0
Entering edit mode

#!/usr/bin/perl -w #

coordExtract.pl

# parser that allows editing a pdb

file to extract the coordinates

#count for all amino acids existing in the protein $count_of_alanine=0; $count_of_arginine=0; $count_of_asparagine=0; $count_of_aspartic_acid=0; $count_of_cysteine=0; $count_of_glutamic_acid=0; $count_of_glutamine=0; $count_of_glycine=0; $count_of_histidine=0; $count_of_isoleucine=0; $count_of_leucine=0; $count_of_lysine=0; $count_of_methionine=0; $count_of_phenylalanine=0; $count_of_proline=0; $count_of_serine=0; $count_of_threonine=0; $count_of_tryptophan=0; $count_of_tyrosine=0; $count_of_valine=0; #count for groups of amino acids $count_of_charged=0; $count_of_polar=0; $count_of_aromatic=0; $count_of_hydrophobic=0;

input file query

my @pdbFiles = glob("*.pdb"); foreach my $pdbFile (@pdbFiles){

unless (open(INPUTFILE, $pdbFile)) { print "Cannot read from '$pdbFile'.\nProgram closing.\n"; <stdin>; exit; }

load the file into an array

chomp(@dataArray = <inputfile>);

close the file

close(INPUTFILE);

###############
# LET'S WORK! #
###############

parse the input file saving only backbone atoms coordinates

format: [string "ATOM"] [number] [atom] [aa] whateva [3 decimal numbers] whateva with two dots in between

for ($line = 0; $line < scalar @dataArray; $line++) { if ($dataArray[$line]=~/^HEADER\s+(.?)$/) { $header = $1;
} if ($dataArray[$line]=~/^TITLE\s+(.
?)$/) { $parsing{$line} = $1; } if ($dataArray[$line] =~ m/ATOM\s+(\d+)\s+(\w+)\s+(\w{3})\s+.+\s+(\S+.\S+)\s+(\S+.\S+)\s+(\S+.\S+)\s+.+..+..+/ig) { if ($2 eq "N" || $2 eq "CA" || $2 eq "C") { $parsedData{$line} = $1."\t\t".$3."\t\t".$4."\t\t".$5."\t\t".$6;

if($3 eq "ALA"){
$count_of_alanine++;
}
if($3 eq "ARG"){
$count_of_arginine++;
}
if($3 eq "ASN"){
$count_of_asparagine++;
}
if($3 eq "ASP"){
$count_of_aspartic_acid++;
}
if($3 eq "CYS"){
$count_of_cysteine++;
}
if($3 eq "GLU"){
$count_of_glutamic_acid++;
}
if($3 eq "GLN"){
$count_of_glutamine++;
}
if($3 eq "GLY"){
$count_of_glycine++;
}
if($3 eq "HIS"){
$count_of_histidine++;
}
if($3 eq "ILE"){
$count_of_isoleucine++;
}
if($3 eq "LEU"){
$count_of_leucine++;
}
if($2 eq "LYS"){
$count_of_lysine++;
}
if($3 eq "MET"){
$count_of_methionine++;
}
if($3 eq "PHE"){
$count_of_phenylalanine++;
}
if($3 eq "PRO"){
$count_of_proline++;
}
if($3 eq "SER"){
$count_of_serine++;
}
if($3 eq "THR"){
$count_of_threonine++;
}
if($3 eq "TRP"){
$count_of_tryptophan++;
}
if($3 eq "TYR"){
$count_of_tyrosine++;
}
if($3 eq "VAL"){
$count_of_valine++;
}
if($3 eq "ARG"||$3 eq "ASP"||$3 eq "GLU"||$3 eq "LYS"){
$count_of_charged++;
}
if($3 eq "ASN"||$3 eq "GLN"||$3 eq "GLY"||$3 eq "MET"||$3 eq "PRO"){
$count_of_polar++;
}
if($3 eq "PHE"||$3 eq "TRP"||$3 eq "TYR"||$3 eq "HIS"){
$count_of_aromatic++;
}
if($3 eq "ALA"||$3 eq "ILE"||$3 eq "LEU"||$3 eq "VAL"){
$count_of_hydrophobic++;
}
}

}

}

create the output file name

$outputFile = "coordinates_".$pdbFile;

open the output file

open (OUTFILE, ">$outputFile");

print the data lines

print OUTFILE $header, "\n";
foreach $line (sort {$a <=> $b} keys %parsing) {
print OUTFILE $parsing{$line}."\n";

} print OUTFILE $title, "\n"; print OUTFILE "-----------------------------------------------------------------------\n"; print OUTFILE "------------ALL BACKBONE AMINO ACIDS IN THE MEMBRANE PROTEIN-----------\n"; print OUTFILE "-----------------------------------------------------------------------\n"; print OUTFILE "\n";
print OUTFILE "Atom Number\tAmino acid\tX coordinate\tY Coordinate\tZ Coordinate\n"; foreach $line (sort {$a <=> $b} keys %parsedData) { print OUTFILE $parsedData{$line}."\n"; }

print OUTFILE "\n";
print OUTFILE "\n";
print OUTFILE "------------------------------------------------------------------------\n";
print OUTFILE "-----AMINO ACID NUMBERS AND PERCENTAGE FOR THE BACKBONE AMINO ACIDS-----\n";
print OUTFILE "------------------------------------------------------------------------\n";
print OUTFILE "\n";    
print OUTFILE "Amino acid\tTotal Number(N)\n";
print OUTFILE "--------------------------------------------------\n";
print OUTFILE "Alanine\t\t", $count_of_alanine++, "\t\t\n";
print OUTFILE "Arginine\t", $count_of_arginine++, "\t\t\n";
print OUTFILE "Asparagine\t", $count_of_asparagine++, "\t\t\n";
print OUTFILE "Aspartic Acid\t", $count_of_aspartic_acid++, "\t\t\n";
print OUTFILE "Cysteine\t", $count_of_cysteine++, "\t\t\n";
print OUTFILE "Glutamic Acid\t",$count_of_glutamic_acid++, "\t\t\n";
print OUTFILE "Glutamine\t",$count_of_glutamine++, "\t\t\n";
print OUTFILE "Glycine\t\t",$count_of_glycine++, "\t\t\n";
print OUTFILE "Histidine\t",$count_of_histidine++, "\t\t\n";
print OUTFILE "Isoleucine\t",$count_of_isoleucine++, "\t\t\n";
print OUTFILE "Leucine\t\t",$count_of_leucine++, "\t\t\n";
print OUTFILE "Lysine\t\t",$count_of_lysine++, "\t\t\n";
print OUTFILE "Methionine\t",$count_of_methionine++, "\t\t\n";
print OUTFILE "Phenylalanine\t",$count_of_phenylalanine++, "\t\t\n";
print OUTFILE "Proline\t\t",$count_of_proline++, "\t\t\n";
print OUTFILE "Serine\t\t",$count_of_serine++, "\t\t\n";
print OUTFILE "Threonine\t",$count_of_threonine++, "\t\t\n";
print OUTFILE "Tryptophan\t",$count_of_tryptophan++, "\t\t\n";
print OUTFILE "Tyrosine\t",$count_of_tyrosine++, "\t\t\n";
print OUTFILE "Valine\t\t",$count_of_valine++, "\t\t\n";
print OUTFILE "--------------------------------------------------\n";
print OUTFILE "Total\t\t", scalar(keys %parsedData), "\t\t\n";
print OUTFILE "\n";
print OUTFILE "\n";
print OUTFILE "CHARGED AMINO ACIDS\t\t\t",$count_of_charged++,"\n";
print OUTFILE "POLAR AMINO ACIDS\t\t\t",$count_of_polar++,"\n";
print OUTFILE "AROMATIC AMINO ACIDS\t\t\t",$count_of_aromatic++,"\n";
print OUTFILE "HYDROPHOBIC AMINO ACIDS\t\t\t",$count_of_hydrophobic++,"\n";
print OUTFILE "\n";
print OUTFILE "\n";


# close the output file
close (OUTFILE);

# end message
print "The coordinates of '$pdbFile' were saved into '$outputFile'.\n";

}

end the program

exit;

ADD REPLY
0
Entering edit mode

i use this perl script to parse .pdb files in a folder(around 1000 files). unfortunately the loop in the script causing problem.. the number of amino acids and other data in the .pdb files keep on increasing... the data stacking with another .pdb file i guess.. help me with the loop..

ADD REPLY
0
Entering edit mode

i tried, bt its not working.. im using this mac now.. i use this command to read the entire .pdb files 1 folders, "for fl in *.pdb;do perl filename.pl $fl; done".... its working prety well.. but compile all the outputs in only 1 single file???

ADD REPLY
1
Entering edit mode
11.0 years ago
Dan D 7.4k

Step 1: Make a copy of the coordExtract.pl script and change this line:

print "\nEnter the input file: "; $inputFile = ; chomp $inputFile;

to

$inputFile = $ARGV[0];

Step 2: In your Linux bash shell, navigate to the directory containing the .PDB files and type the following:

for f in *.pdb
do
   /path/to/coordExtract.pl $f
done

Make copies of the original script and the PDB files before you test this solution!

ADD COMMENT
0
Entering edit mode

thank you for the commend, bt i using windows OS.. can you give the perl script for windows??

ADD REPLY
1
Entering edit mode

OK, so just copy the files to a linux server and do the above. ;) Seriously though, I'll stop being lazy and write you something that's O/S agnostic. One minute...

ADD REPLY

Login before adding your answer.

Traffic: 2121 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6