For this data you should use unpack, not substr(), and definitely not a regex, if you are just trying to parse the file:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dump qw(dd);
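my @data;
while (my $line = <DATA>) {
    # Keep only the ATOM records
    if ($line =~ /^ATOM/) {
        # Split the fixed-width line into fields; "A" strips trailing whitespace
        push @data, [unpack "A6A7A4A4A2A*", $line];
    }
}
dd @data;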
__DATA__
ATOM 10360 H41 C B 602
ATOM 10361 P G B 602
ATOM 10362 C5' G B 602
ATOM 10363 O5' G B 602
If you just want to match lines, Perl already hands you the data one line at a time, so you don't need a regex; just sort your lines. If you want a specific column, sort @data in the code above. Running this code:
perl biostar64928.pl
gives you the components of each column:
(
["ATOM", 10360, "H41", "C", "B", 602],
["ATOM", 10361, "P", "G", "B", 602],
["ATOM", 10362, "C5'", "G", "B", 602],
["ATOM", 10363, "O5'", "G", "B", 602],
)
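As a rough sketch of the sorting and column selection mentioned above, assuming @data holds the rows shown in the output (the names @atoms, @by_name and @names are only for illustration and are not part of the original script):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Rows as produced by the unpack loop (copied from the output shown above).
my @data = (
    ["ATOM", 10360, "H41", "C", "B", 602],
    ["ATOM", 10361, "P",   "G", "B", 602],
    ["ATOM", 10362, "C5'", "G", "B", 602],
    ["ATOM", 10363, "O5'", "G", "B", 602],
);

# Keep only the rows whose first field is "ATOM".
my @atoms = grep { $_->[0] eq 'ATOM' } @data;

# Sort by the third field (atom name) as a string,
# breaking ties on the second field (serial number).
my @by_name = sort { $a->[2] cmp $b->[2] or $a->[1] <=> $b->[1] } @atoms;

# Pull out a single column from the sorted rows.
my @names = map { $_->[2] } @by_name;

print join(",", @names), "\n";   # prints C5',H41,O5',P
```

Because each row is an array reference, any column can be used as the sort key by changing the index, and grep does the same row selection as the regex without re-reading the file.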
Note that real PDB files contain many more lines, in formats different from those beginning with ATOM. However, I think this will still work in that case; you just have to pull out the ATOM elements.
If that's the case, your regex would not work but my solution would :). Just answering the question based on what was provided.
Actually your solution will work; edited my comment. And yes, my regex will work since ATOM lines start with ATOM and are well-defined. Suggest you look at some real PDB data :)
My solution does not depend on what the lines start with; that is the point. It is a solution that will work with any fixed-width file.
Yes, I see that. It's a good solution, but then you still have to pull out the arrays where the first element is "ATOM". A regex does that for you straight away. It won't be a huge performance hit, since PDB files are not very large.
You are correct. I added a line to address your point (although it's silly in this example :) ).