Question

Is it possible using a regex in perl to include both accession number and the matched sequence when printing?

1

11.0 years ago

agabali ▴ 10

Hello BioStars community,

New to perl and programming in general, so I thought I might try out my luck asking a question here.

I am trying to match a fairly conserved protein sequence to a proteome using a regex. I am able to output the matching lines, as well as their positions, but I cannot find a way to output the accession numbers along with lines that match my conserved protein.

Here's part of my code:

my $proteins;
open( file, "Athaliana_167_protein.fa" ) or die "can't open file!";
while (<file>){
        if (/W[S]TRRKIAI/) {print}
}

Would using lookahead/lookbehinds possibly work to print out the match line and accession number?

Thanks!

regex match perl protein • 3.9k views

updated 3.6 years ago by Ram 45k • written 11.0 years ago by agabali ▴ 10

0

Your code will most likely not find the sequence you are looking for. Most FASTA files wrap the sequence over several lines, so a line-by-line match misses the pattern whenever it spans a line break. You need to put the whole sequence into one string first.
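
A small illustration of the wrapping problem, as a sketch only: the two sequence lines below are made up so that the motif from the question straddles the line break.

```perl
use strict;
use warnings;

# Made-up wrapped record: the motif WSTRRKIAI is split across the two lines.
my @lines = ("MKTAYWST", "RRKIAIQQ");
my $motif = qr/WSTRRKIAI/;

# Line-by-line matching, as in the question: count the lines that match.
my $line_hits = grep { $_ =~ $motif } @lines;

# Matching against the joined sequence instead.
my $joined     = join '', @lines;
my $joined_hit = $joined =~ $motif ? 1 : 0;

print "line-by-line hits: $line_hits\n";
print "joined hit: $joined_hit\n";
```

Neither line contains the motif on its own, so the line-by-line count is 0, while the joined string matches.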

11.0 years ago by Michael 55k

Ram · Accepted Answer · 2014-05-16

3

11.0 years ago

Michael 55k

This statement is important for parsing fasta efficiently in perl (unless you want to use BioPerl):

local($/) = "\n>";

It allows you to read a complete fasta record instead of each line.

{
  local($/) = "\n>"; # read each fasta record, always use local!
  while (my $fastarec = <FASTA>) {
    chomp $fastarec;
    my ($defline, @seq) = split "\n", $fastarec;  #seq id is the first line
    $defline =~ s/^\>//; # remove left over >, just in case
    my $seq = join "", @seq; # put together the sequence again
    $seq =~ s/\s//g; # remove potential left-over spaces, empty lines etc.
    if ($seq =~ /W[HA]TEVER/) { # the condition needs parentheses; [HA] matches H or A
     print join("\n", ">$defline", @seq), "\n"; # output sequence in original formatting
    }
  }
}
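
To get at what the question asked for (accession plus the matched sequence), the record-based reading above could be extended roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the accession is the first whitespace-delimited word of the definition line, and `find_motif`, the sample records, and the IDs in them are made up for illustration.

```perl
use strict;
use warnings;

# Return [accession, matched text, 1-based start] for every motif hit.
sub find_motif {
    my ($fh, $motif) = @_;
    my @hits;
    local $/ = "\n>";                        # one FASTA record per read
    while (my $rec = <$fh>) {
        chomp $rec;                          # strip the trailing "\n>" separator
        my ($defline, @seq) = split /\n/, $rec;
        next unless defined $defline;        # skip empty records
        $defline =~ s/^>//;                  # the first record keeps its leading ">"
        my ($acc) = split ' ', $defline;     # assumption: accession is the first word
        my $seq = join '', @seq;             # undo the line wrapping
        $seq =~ s/\s//g;
        while ($seq =~ /($motif)/g) {
            push @hits, [ $acc, $1, $-[1] + 1 ];   # $-[1] is the 0-based match start
        }
    }
    return @hits;
}

# Made-up two-record proteome, read through an in-memory filehandle.
my $fasta = ">AT1G01010.1 | hypothetical\nMKTAYWST\nRRKIAIQQ\n"
          . ">AT1G01020.1 | hypothetical\nMKKLLVVA\n";
open my $fh, '<', \$fasta or die "Can't open in-memory file: $!";
for my $hit (find_motif($fh, qr/W[S]TRRKIAI/)) {
    print join("\t", @$hit), "\n";
}
close $fh;
```

For a real file, open it with a lexical filehandle and pass that to `find_motif` instead of the in-memory one. No lookahead or lookbehind is needed, because the definition line and the sequence arrive together in each record.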

updated 5.4 years ago by Ram 45k • written 11.0 years ago by Michael 55k

1

+1 Nice tip!

11.0 years ago by Alex Reynolds 36k

0

Thank you for the tips! This worked very well.

11.0 years ago by agabali ▴ 10

0

That's actually how BioPerl does it internally :)

11.0 years ago by Chris Fields ★ 2.2k

0

Your script doesn't compile.
Always: use strict; use warnings;
Use lexical file handles.
No need to s/^\>// since > is chomped as the unique FASTA record separator.
No need to s/\s//g when splitting on ' '.
No need to join the sequence lines for the pattern matching (set a local copy of $" to an empty string and do a match on "@{ [ split ' ', $fastarec ] }").

Here's an alternative solution:

use strict;
use warnings;

open my $FASTA, '<', 'Athaliana_167_protein.fa' or die "Can't open file: $!";
local ( $/, $" ) = ( '>', '' );

while (<$FASTA>) {
    chomp if s/(.+)\n// and my $defline = $1 or next;
    print ">$defline\n$_" if "@{ [ split ' ' ] }" =~ /W(H|A)TEVER/;
}
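
For anyone puzzled by the `$"` trick: `$"` is the separator Perl puts between array elements when an array is interpolated into a double-quoted string, so setting it to an empty string makes the interpolation glue the pieces together. A small sketch with a made-up record body:

```perl
use strict;
use warnings;

# Made-up record body after the definition line has been removed.
my $body = "MKTAYWST\nRRKIAIQQ\n";

my $flat;
{
    local $" = '';                        # no separator between interpolated elements
    $flat = "@{ [ split ' ', $body ] }";  # split on whitespace, then interpolate the list
}

my $joined = join '', split ' ', $body;   # the more conventional spelling

print "$flat\n";
print $flat eq $joined ? "same\n" : "different\n";
```

Both spellings produce the same unwrapped string, so which one to use is mostly a matter of taste.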

updated 5.4 years ago by Ram 45k • written 11.0 years ago by Kenosis ★ 1.3k