Question

Identifying Amino Acids in a FASTA Sequence File by Their Properties (Hydrophobic, Charged, etc.)

0

12.2 years ago

Shweta ▴ 20

I have a protein sequence in a file. I want to check whether the pattern hxxhcxc is present in the file and, if it is, print the matching stretch. Here, h = hydrophobic, c = charged, x = any residue (including the remaining ones). How can I do this in Perl?

What I could think of is to make 3 arrays (hydrophobic, charged, and all residues) and compare each array with the file containing the FASTA sequence. I can't think of anything beyond this, especially how to maintain the order, which is the main thing. I am a beginner in Perl, so please make the explanation as simple as possible.

Thanks in advance.

perl sequence protein • 3.7k views

ADD COMMENT • link updated 12.2 years ago by Eric ▴ 40 • written 12.2 years ago by Shweta ▴ 20

score 4 · Answer 1 · 2012-09-04

4

12.2 years ago

Eric ▴ 40

What you need is a regular expression.

This script should do it. The code could be compacted a bit, but I thought this version was more readable:

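#!/usr/bin/perl
use strict;
use warnings;

# Read the FASTA file one record at a time: each read stops at the next ">"
$/ = ">";
<>;    # discard the chunk before the first ">"

while (my $record = <>) {
    chomp $record;    # remove the trailing ">" (the record separator)
    my ($header, @seq) = split /\n/, $record;
    my $sequence = join '', @seq;
    $sequence =~ s/\s+//g;    # drop stray whitespace such as a Windows "\r"

    # Find all non-overlapping occurrences of the pattern.
    # hydrophobic = [AVILMFYW], charged = [RHKDE], "." matches any character
    while ( $sequence =~ m/([AVILMFYW]..[AVILMFYW][RHKDE].[RHKDE])/gi ) {
        print "$1\n";
    }
}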

ADD COMMENT • link 12.2 years ago by Eric ▴ 40

0

Hi, thanks for the reply. But this is not showing any output. There are no errors; the prompt just moves on to the next line. Also, I didn't quite understand the code (please pardon my ignorance), so if you could briefly walk through what it is doing, I would be able to avoid asking silly questions in future.

ADD REPLY • link 12.2 years ago by Shweta ▴ 20

0

If your FASTA-formatted file is named sequences.fa, usage would be:

perl script.pl sequences.fa > output

The matches are written to the file named output; leave off "> output" to print them to the screen instead.

The script first changes the input record separator ($/) from a newline (\n) to ">", which is the first character of each FASTA record. This lets you cycle through the sequences one record at a time. I believe I saw this method in "Beginning Perl for Bioinformatics". I use it often.
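To see what changing the record separator does, here is a minimal sketch that reads a made-up two-record FASTA string through an in-memory filehandle. The sequence data and the names $fasta and @records are hypothetical, and the chomp call is my addition to strip the trailing ">" from each record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-record FASTA text, used here instead of a real file.
my $fasta = ">seq1 example\nMKV\nLLA\n>seq2 example\nGGH\n";

# Open the string as an in-memory filehandle so the sketch is self-contained.
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = '>';             # read up to each ">" instead of each newline
    my $leading = <$fh>;        # the chunk before the first ">" (just ">")
    while (my $record = <$fh>) {
        chomp $record;          # chomp removes a trailing ">" because $/ is ">"
        push @records, $record;
    }
}
close $fh;

print scalar(@records), " records\n";
print "[$_]\n" for @records;
```

Each element of @records now holds one header line followed by its sequence lines, which is what the outer while loop in the script receives on every pass.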

The outer while loop cycles through each FASTA record and combines all of the lines of sequence into a single variable, $sequence, so that we can match the pattern even if it is on multiple lines.
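Why the join matters can be shown with a small sketch. The record below is made up so that the motif is split across two sequence lines; no single line matches, but the joined sequence does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record (header plus wrapped sequence lines).
# The stretch AGGVKSD is split across the two sequence lines.
my $record = "seq1 example\nMMAGG\nVKSDP\n";

# First line is the header, the rest are sequence lines.
my ($header, @seq) = split /\n/, $record;
my $sequence = join '', @seq;

my $motif = qr/[AVILMFYW]..[AVILMFYW][RHKDE].[RHKDE]/i;

# Test each line on its own, then the joined sequence.
my @line_hits = grep { $_ =~ $motif } @seq;
my ($joined_hit) = $sequence =~ /($motif)/;

print "header: $header\n";
print "lines matching on their own: ", scalar(@line_hits), "\n";
print "match in joined sequence: $joined_hit\n";
```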

The inner while loop finds each occurrence of the regular expression within $sequence and prints it to STDOUT.
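One thing to be aware of: with /g, each search resumes after the end of the previous match, so occurrences that overlap an earlier match are skipped. If you need those too, a common variation (not what the script above does) wraps the pattern in a lookahead. A rough sketch on a made-up sequence, which also reports 1-based start positions:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sequence containing two overlapping occurrences of the motif.
my $sequence = "AGGVKVDKLDSE";

my $motif = qr/[AVILMFYW]..[AVILMFYW][RHKDE].[RHKDE]/i;

# Plain /g: the search resumes after the end of the previous match.
my @plain;
while ($sequence =~ /($motif)/g) {
    push @plain, $1;
}

# Lookahead: the match itself is zero-length, so the search resumes
# one residue later and overlapping occurrences are found as well.
my @overlapping;
while ($sequence =~ /(?=($motif))/g) {
    my $start = $-[1] + 1;          # $-[1] is the 0-based start of group 1
    push @overlapping, "$start:$1";
}

print "plain: @plain\n";
print "overlapping: @overlapping\n";
```

Here the plain loop finds only AGGVKVD, while the lookahead version also finds VDKLDSE starting at residue 6.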

The real work of the script is the regular expression: m/([AVILMFYW]..[AVILMFYW][RHKDE].[RHKDE])/gi
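If it helps to see where that expression comes from, it can be built letter by letter from the pattern hxxhcxc. This is only a sketch: the %class hash and the variable names are my own, and the residue classes are the same ones used in the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One regex fragment per pattern letter (same classes as the script above).
my %class = (
    h => '[AVILMFYW]',   # hydrophobic
    c => '[RHKDE]',      # charged
    x => '.',            # any residue
);

my $pattern = 'hxxhcxc';

# Translate the pattern one letter at a time, keeping the order.
my $regex_text = '';
for my $symbol (split //, $pattern) {
    die "Unknown symbol '$symbol' in pattern\n" unless exists $class{$symbol};
    $regex_text .= $class{$symbol};
}

my $motif = qr/$regex_text/i;    # /i makes the match case-insensitive

# Try it on a hypothetical sequence.
my ($hit) = 'MMAGGVKSDP' =~ /($motif)/;

print "regex: $regex_text\n";
print "hit: $hit\n";
```

Because the fragments are concatenated in the same order as the letters in the pattern, the order of residue types is preserved automatically, which was the main worry in the question.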

Regular expressions can be tricky, but they are very powerful. There are several good books on regular expressions, and most Perl books will have at least a section on them. "Mastering Regular Expressions" is the serious guide, but there are many good tutorials online.

ADD REPLY • link 12.2 years ago by Eric ▴ 40