Question

Perl Reg-Ex Matching

12.4 years ago

GouthamAtla 12k

How can I match exactly the following lines with a Perl regex?

ATOM  10360  H41 C   B 602
ATOM  10361  P   G   B 602
ATOM  10362  C5' G   B 602
ATOM  10363  O5' G   B 602

I tried something like:

/^ATOM\s\s\s[0-9]+\s\s\s...[A-Z]\s/

but this also matches:

ATOM   5248  HB2 SER A 326
ATOM   5249  HG  SER A 326
ATOM   5250  N   LEU A 327

perl • 3.4k views

ADD COMMENT • link updated 12.4 years ago by SES 8.6k • written 12.4 years ago by GouthamAtla 12k

score 2 · Answer 1 · 2013-02-25

I assume from the question that:

your PDB file contains 2 or more chains
of which one or more is protein, one or more is nucleic acid
you want to match the nucleic acid, not the protein

You could try this:

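^ATOM\s+\d+\s+\S+\s+[ACGT]\s+

This assumes that: (1) the second column contains only digits; (2) the third column (atom name) contains no whitespace; (3) the fourth column contains only A, C, G or T (upper-case; add U for RNA). The atom name is matched with \S+ rather than \w+ because names such as C5' and O5' contain an apostrophe, which \w does not match.

If you want to match any single upper-case letter in column 4:

^ATOM\s+\d+\s+\S+\s+[A-Z]\s+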

You may also want to look at BioPerl methods for parsing PDB files. I'm having trouble locating a good one-stop resource for that, so you'll have to web search using those terms.
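
A quick way to check a pattern of this kind is to run it against both sets of sample lines from the question. The following is only a sketch: the variable names are made up for illustration, and the use of \S+ for the atom-name column is an assumption made so that names containing an apostrophe (C5', O5') are accepted. A \w+ variant is included for comparison.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sample lines copied from the question.
my @nucleic = (
    "ATOM  10360  H41 C   B 602",
    "ATOM  10361  P   G   B 602",
    "ATOM  10362  C5' G   B 602",
    "ATOM  10363  O5' G   B 602",
);
my @protein = (
    "ATOM   5248  HB2 SER A 326",
    "ATOM   5249  HG  SER A 326",
    "ATOM   5250  N   LEU A 327",
);

# Assumption: \S+ for the atom name, so the apostrophe in C5' is accepted.
my $re_nonspace = qr/^ATOM\s+\d+\s+\S+\s+[ACGT]\s+/;
# For comparison: \w+ stops at the apostrophe.
my $re_word     = qr/^ATOM\s+\d+\s+\w+\s+[ACGT]\s+/;

my @hits       = grep { $_ =~ $re_nonspace } @nucleic;
my @word_hits  = grep { $_ =~ $re_word } @nucleic;
my @false_hits = grep { $_ =~ $re_nonspace } @protein;

printf "nucleic acid lines matched (\\S+): %d of %d\n", scalar @hits,       scalar @nucleic;
printf "nucleic acid lines matched (\\w+): %d of %d\n", scalar @word_hits,  scalar @nucleic;
printf "protein lines matched (\\S+):      %d of %d\n", scalar @false_hits, scalar @protein;
```

On these seven lines the \S+ pattern matches all four nucleic acid lines and none of the protein lines, while the \w+ variant matches only the H41 and P lines.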

score 2 · Answer 2 · 2013-02-25

12.4 years ago

Woa ★ 2.9k

Keep in mind that PDB ATOM records are not space- or tab-delimited but fixed-width, so Perl's substr() function may be a better candidate than regex matching for parsing the file.
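
As a rough sketch of the substr() approach: the offsets below assume the standard PDB layout, where the record name occupies columns 1-6 and the residue name columns 18-20. The helper name residue_name is hypothetical, and treating a one-letter residue name as a nucleotide simply mirrors the sample data in the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return the residue name of an ATOM record, or undef for any other line.
# Assumption: standard PDB columns, residue name in columns 18-20
# (zero-based offset 17, length 3).
sub residue_name {
    my ($line) = @_;
    return unless substr($line, 0, 6) eq 'ATOM  ';
    return if length($line) < 18;
    my $res = substr($line, 17, 3);
    $res =~ s/^\s+|\s+$//g;    # trim the column padding
    return $res;
}

my @lines = (
    "ATOM  10362  C5' G   B 602",
    "ATOM  10363  O5' G   B 602",
    "ATOM   5248  HB2 SER A 326",
    "ATOM   5250  N   LEU A 327",
);

# Keep only lines whose residue name is a single letter.
my @kept;
for my $line (@lines) {
    my $res = residue_name($line);
    next unless defined $res && length($res) == 1;
    push @kept, $line;
}

print "$_\n" for @kept;
```

Because the columns are read by position, this does not depend on how many spaces separate the fields.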

ADD COMMENT • link 12.4 years ago by Woa ★ 2.9k

score 1 · Answer 3 · 2013-02-25

12.4 years ago

SES 8.6k

For this data, you should be using unpack, not substr() and definitely not a regex if you are just trying to parse the file:

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dump qw(dd);

my @data;

while(my $line = <DATA>) {
    if ($line =~ /^ATOM/) {
        push @data, [unpack "A6A7A4A4A2A*", $line];
    }
}

dd @data;

__DATA__
ATOM  10360  H41 C   B 602
ATOM  10361  P   G   B 602
ATOM  10362  C5' G   B 602
ATOM  10363  O5' G   B 602

If you just want to match lines, Perl already gives you line-buffered data so you don't need a regex, just sort your lines. If you want a specific column, sort @data in the above code. Executing this code:

perl biostar64928.pl

gives you the components of each column:

(
  ["ATOM", 10360, "H41", "C", "B", 602],
  ["ATOM", 10361, "P", "G", "B", 602],
  ["ATOM", 10362, "C5'", "G", "B", 602],
  ["ATOM", 10363, "O5'", "G", "B", 602],
)
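
If the goal is to pick out records by one column, the unpacked arrays can be filtered with grep. This is a minimal sketch that reuses the unpack template from the code above. The mixed sample list and the choice of index 3 (the residue field) with a single-letter test are assumptions for illustration, and plain print is used instead of Data::Dump so that it runs with core Perl only.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Mixed sample lines taken from the question.
my @lines = (
    "ATOM  10362  C5' G   B 602",
    "ATOM  10363  O5' G   B 602",
    "ATOM   5248  HB2 SER A 326",
    "ATOM   5250  N   LEU A 327",
);

# Same fixed-width template as above; only ATOM lines are unpacked.
my @records = map { [ unpack "A6A7A4A4A2A*", $_ ] } grep { /^ATOM/ } @lines;

# Keep records whose residue field (index 3) is a single upper-case letter.
my @single_letter = grep { $_->[3] =~ /^[A-Z]$/ } @records;

print join("\t", @$_), "\n" for @single_letter;
```

With these four lines, only the two records with residue G are printed.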

ADD COMMENT • link 12.4 years ago by SES 8.6k

Note that real PDB files also contain lines in other formats, not just those beginning with ATOM. However, I think this will still work in that case; you just have to pull out the ATOM elements.

ADD REPLY • link 12.4 years ago by Neilfws 49k

If that's the case, your regex would not work but my solution would :). Just answering the question based on what was provided.

ADD REPLY • link 12.4 years ago by SES 8.6k

Actually your solution will work; edited my comment. And yes, my regex will work since ATOM lines start with ATOM and are well-defined. Suggest you look at some real PDB data :)

ADD REPLY • link 12.4 years ago by Neilfws 49k

My solution does not depend on what the lines start with; that is the point. It is a solution that will work with any fixed-width file.

ADD REPLY • link 12.4 years ago by SES 8.6k

Yes, I see that. It's a good solution, but then you still have to pull out the arrays where the first element is "ATOM". A regex does that for you straight away. It won't be a huge performance hit since PDB files are not very large.

ADD REPLY • link 12.4 years ago by Neilfws 49k

You are correct. I added a line to address your point (although it's silly in this example :) ).

ADD REPLY • link 12.4 years ago by SES 8.6k

score 0 · Answer 4 · 2013-02-25

12.4 years ago

diltsjeri ▴ 470

/^ATOM\t[\d]+\t[.]\t[\w]+\t[\w]+\t\d\d\d/

Should work.

ADD COMMENT • link 12.4 years ago by diltsjeri ▴ 470

The field separator is not a tab; it is multiple spaces.

ADD REPLY • link 12.4 years ago by GouthamAtla 12k

score 0 · Answer 5 · 2013-02-25

12.4 years ago

GouthamAtla 12k

I think I got it.

/^ATOM\s+\d+\s+\S+\s+\b[A-Z]\b/

Thanks.

ADD COMMENT • link 12.4 years ago by GouthamAtla 12k