Hello,
I am looking for the way to convert amino acid sequences to numerical values based on custom file.
Those numerical values were mixed up with positive and negative values.
Reference file: tab separated
A R N D C Q E G H K
-0.5 3 0.2 3 -1 0.2 3 0 -0.5 5
#!/usr/bin/perl
use strict;
use warnings;
$ARGV[2] or die "use convert.pl ref_data fasta_in fasta_out\n";
my $ref_file = shift @ARGV;
my $in_file = shift @ARGV;
my $out_file = shift @ARGV;
my %code = ();
my @aa = ();
open (my $rh, "<", $ref_file) or die "cannot read $ref_file\n";
my $nl = 0;
while (<$rh>) {
$nl++;
chomp;
my @a = split (/\t/, $_);
if ($nl == 1) { # capture first line with aa
@aa = @a;
}
elsif ($nl == 2) { # capture second line with values, combine both in a Hash
for (my $i = 0; $i<=$#a; $i++) {
$code{ $aa[$i] } = $a[$i];
}
}
else { last; }
}
close $rh;
open (my $ih, "<", $in_file) or die "cannot read $in_file\n";
open (my $oh, ">", $out_file) or die "cannot write $out_file\n";
$/ = "\n>"; # read fasta in blocks
while (<$ih>) {
s/>//g;
my ($id, @seq) = split (/\n/, $_);
my $seq = join "", @seq; # reconstruct sequence if multiline
print $oh ">$id\n"; # fasta header
@seq = split (//, $seq);
my @val = ();
foreach my $aa (@seq) {
if (defined $code{$aa}) { push @val, $code{$aa}; }
else { die "$aa is not defined in $ref_file\n"; }
}
print $oh join "\t", @val;
print $oh "\n";
}
close $ih;
close $oh;
Making some major assumptions about the actual file formats, I'd do something like:
from Bio import SeqIO
with open('reference.txt', 'r') as rh:
content = rh.read()
mappings = dict(zip(content[0].split(), content[1].split())
for record in SeqIO.parse('sequences.fasta', 'fasta'):
scores = []
for char in str(record.seq):
try:
scores.append(mappings[char])
except KeyError:
scores.append('.')
print(">" + record.description + "\n" +
" ".join(scores))
I have just written this from memory/off the top of my head so it's untested. I also expect there's a more concise and faster way to do it with str.translate but this will probably work (with some minor tweaks).
The try/except block will prevent this crashing in the even that you have a residue in the sequence that isn't in the numerical mappings, but it will put a dummy character in (.), so its up to you how you want to handle this.
this looks like a homework ... otherwise please explain why do you need the file in your research or offer to pay at least a coffee for a script ...
Those custom values are from AAindex by amino acid property. I will use it these results for research purpose only after converting.
I would do it in python and use dictionaries