I have a protein sequence in a file. I want to check whether the pattern hxxhcxc is present in the file and, if it is, print the matching stretch. Here, h = hydrophobic, c = charged, x = any residue (including the remaining ones). How can I do this in Perl?
What I could think of is make 3 arrays—of hydrophobic, charged and all residues. Compare each array with the file having the FASTA sequence. I can't think of anything beyond this, especially how to maintain the order—that's the main thing. I am a beginner in Perl, so please make the explanation as simple as possible.
Thanks in advance.
Hi, thanks for the reply. But this is not showing any output. There are no errors, though; the prompt just moves on to the next line. Also, I didn't quite understand the code (please pardon my ignorance), so if you could briefly walk through what it is doing, I would be able to refrain from asking silly questions in future.
If your FASTA-formatted file is named sequences.fa, usage would be
The script first changes the input record separator ($/) from a newline (\n) to ">", which is the first character of each FASTA record. This lets you read the sequences one record at a time. I believe I saw this method in "Beginning Perl for Bioinformatics". I use it often.
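To make the record-separator trick concrete, here is a small sketch that only does the splitting step. The helper name read_fasta_records and the in-memory test data are made up for illustration; the one real Perl feature it relies on is that setting $/ changes what a single read from a file handle returns.

```perl
use strict;
use warnings;

# Hypothetical helper: return one string per FASTA record.
sub read_fasta_records {
    my ($fh) = @_;
    local $/ = ">";                  # a "line" now ends at the next ">"
    my @records;
    while (my $record = <$fh>) {
        chomp $record;               # chomp removes the trailing ">" (the current $/)
        next unless length $record;  # the first read is empty: the file starts with ">"
        push @records, $record;
    }
    return @records;
}

# Demo on an in-memory file handle (made-up sequences)
my $fasta = ">seq1\nMKVA\nLLDE\n>seq2\nAAAA\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my @records = read_fasta_records($fh);
close $fh;

print scalar(@records), " records read\n";
```

Each element of @records holds the header line followed by the sequence lines of one record, still separated by newlines. Using local means $/ goes back to its normal value as soon as the subroutine returns.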
The outer while loop cycles through each FASTA record and combines all of the lines of sequence into a single variable, $sequence, so that we can match the pattern even if it spans multiple lines.
The inner while loop finds each occurrence of the regular expression within $sequence and prints it to STDOUT.
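Putting the three steps described above together, a minimal sketch of such a script could look like the following. Treat it as an illustration of the approach rather than a definitive version: the subroutine name find_motifs and the tab-separated output format are my own choices, and it assumes a well-formed FASTA file in which every record starts with ">".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns a list of [header, matched_stretch] pairs, one per motif hit.
sub find_motifs {
    my ($fh) = @_;
    local $/ = ">";                        # read one FASTA record per iteration
    my @hits;
    while (my $record = <$fh>) {           # outer loop: one record at a time
        chomp $record;                     # strip the trailing ">"
        next unless $record =~ /\S/;       # skip the empty chunk before the first ">"
        my ($header, @lines) = split /\r?\n/, $record;
        my $sequence = join '', @lines;    # glue the sequence lines together
        $sequence =~ s/\s+//g;             # drop any stray whitespace
        # inner loop: every non-overlapping occurrence of h x x h c x c
        while ($sequence =~ m/([AVILMFYW]..[AVILMFYW][RHKDE].[RHKDE])/gi) {
            push @hits, [$header, $1];
        }
    }
    return @hits;
}

# Only touch a file when one is given on the command line.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    print "$_->[0]\t$_->[1]\n" for find_motifs($fh);
    close $fh;
}
```

Saved under a name of your choosing (say find_motif.pl, a hypothetical name), it would be run as perl find_motif.pl sequences.fa. If nothing is printed, this sketch found no stretch matching the pattern in any record.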
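The real work of the script is done by the regular expression: m/([AVILMFYW]..[AVILMFYW][RHKDE].[RHKDE])/gi
Reading it left to right: [AVILMFYW] matches one hydrophobic residue, each . matches any single residue, and [RHKDE] matches one charged residue, which gives the order hxxhcxc. The parentheses capture the matched stretch in $1, the g modifier finds every non-overlapping match, and the i modifier makes the match case-insensitive. If you classify residues differently, change the letters inside the square brackets.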
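If it helps to see how the order is maintained, the pattern can also be built from the motif string itself, one letter at a time. This is only a sketch: the %class table and the helper motif_to_regex are names I made up, and the residue sets are simply the ones used in the regular expression above.

```perl
use strict;
use warnings;

my %class = (
    h => '[AVILMFYW]',   # hydrophobic (same set as in the regex above)
    c => '[RHKDE]',      # charged
    x => '.',            # any residue
);

# Hypothetical helper: turn a motif such as "hxxhcxc" into a compiled regex.
sub motif_to_regex {
    my ($motif) = @_;
    my $pattern = '';
    for my $letter (split //, lc $motif) {
        die "Unknown motif letter '$letter'\n" unless exists $class{$letter};
        $pattern .= $class{$letter};   # appending in order keeps the motif order
    }
    return qr/($pattern)/i;            # parentheses capture the match in $1
}

my $regex    = motif_to_regex('hxxhcxc');
my $sequence = 'MKAGGLKAEQQ';          # made-up test sequence
while ($sequence =~ /$regex/g) {
    my $end   = pos($sequence);        # offset just past the match = 1-based end
    my $start = $end - length($1) + 1; # 1-based start position
    print "Found $1 at residues $start-$end\n";
}
# prints: Found AGGLKAE at residues 3-9
```

Because the regex is assembled letter by letter, trying a different motif only means changing the string passed to motif_to_regex.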
Regular expressions can be tricky, but they are very powerful. There are several good books on regular expressions, and most Perl books will have at least a section on them. "Mastering Regular Expressions" is the serious guide, but there are many good tutorials online.