Is it possible using a regex in perl to include both accession number and the matched sequence when printing?
2
1
Entering edit mode
10.5 years ago
agabali ▴ 10

Hello BioStars community,

New to perl and programming in general, so I thought I might try out my luck asking a question here.

I am trying to match a fairly conserved protein sequence to a proteome using a regex. I am able to output the matching lines, as well as their positions, but I cannot find a way to output the accession numbers along with lines that match my conserved protein.

Here's part of my code:

my $proteins;
open( file, "Athaliana_167_protein.fa" ) or die "can't open file!";
while (<file>){
        if (/W[S]TRRKIAI/) {print}
}

Would using lookahead/lookbehinds possibly work to print out the match line and accession number?

Thanks!

regex match perl protein • 3.7k views
ADD COMMENT
0
Entering edit mode

Your code does most likely not work for finding the sequence you are looking for, most fasta files contain linebreaks in the sequence where you will miss your pattern in case it is wrapped, you need to put the whole sequence into one string first.

ADD REPLY
3
Entering edit mode
10.5 years ago
Michael 55k

This statement is important for parsing fasta efficiently in perl (unless you want to use BioPerl):

local($/) = "\n>";

It allows you to read a complete fasta record instead of each line.

{
  local($/) = "\n>"; # read each fasta record, always use local!
  while (my $fastarec = <FASTA>) {
    chomp $fastarec;
    my ($defline, @seq) = split "\n", $fastarec;  #seq id is the first line
    $defline =~ s/^\>//; # remove left over >, just in case
    my $seq = join "", @seq; # put together the sequence again
    $seq =~ s/\s//g; # remove potential left-over spaces, empty lines etc.
    if $seq =~ /W[H|A]TEVER/ {
     print (join "\n", ">$defline", @seq), "\n"; # output sequence in original formatting
    }
  }
}
ADD COMMENT
1
Entering edit mode

+1 Nice tip!

ADD REPLY
0
Entering edit mode

Thank you for the tips! This worked very well.

ADD REPLY
0
Entering edit mode

That's actually how BioPerl does it internally :)

ADD REPLY
0
Entering edit mode
  1. Your script doesn't compile.
  2. Always: use strict; use warnings;
  3. Use lexical file handles.
  4. No need to s/^\>// since > is chomped as the unique FASTA record separator.
  5. No need to s/\s//g when splitting on ' '.
  6. No need to join the sequence lines for the pattern matching (set a local copy of $" to an empty string and do a match on "@{ [ split ' ', $fastarec ] }").

Here's an alternative solution:

use strict;
use warnings;

open my $FASTA, '<', 'Athaliana_167_protein.fa' or die "Can't open file: $!";
local ( $/, $" ) = ( '>', '' );

while (<$FASTA>) {
    chomp if s/(.+)\n// and my $defline = $1 or next;<br />
    print ">$defline\n$_" if "@{ [ split ' ' ] }" =~ /W(H|A)TEVER/;
}
ADD REPLY

Login before adding your answer.

Traffic: 1824 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6