Converting amino acid sequences to numerical values
3
1
Entering edit mode
5.1 years ago
karish03 ▴ 10

Hello, I am looking for the way to convert amino acid sequences to numerical values based on custom file.
Those numerical values were mixed up with positive and negative values.

Reference file: tab separated

A   R   N   D   C   Q   E   G   H    K
-0.5    3   0.2 3   -1  0.2 3   0   -0.5    5

Input file: Multi-fasta file

>Example1
ARN
>Example2
DCQEGH

Expected output: tab separated

>Example1
 -0.5   3   0.2 
>Example2
  3 -1   0.2    3   0   -0.5

Thank you

R perl python sequence script • 1.5k views
ADD COMMENT
0
Entering edit mode

this looks like a homework ... otherwise please explain why do you need the file in your research or offer to pay at least a coffee for a script ...

ADD REPLY
0
Entering edit mode

Those custom values are from AAindex by amino acid property. I will use it these results for research purpose only after converting.

ADD REPLY
0
Entering edit mode

I would do it in python and use dictionaries

ADD REPLY
1
Entering edit mode
5.1 years ago
zx8754 12k

These examples should help you to get started:

strsplit("XYZ", "")
# [[1]]
# [1] "X" "Y" "Z"

lookup <- setNames(c(11, 22, 33), c("X", "Y", "Z"))
lookup[ c("Z", "Z", "Y") ]
#  Z  Z  Y 
# 33 33 22
ADD COMMENT
0
Entering edit mode

Thank you. I will check it.

ADD REPLY
1
Entering edit mode
5.1 years ago
JC 13k

here how I could do in Perl:

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use convert.pl ref_data fasta_in fasta_out\n";
my $ref_file = shift @ARGV;
my $in_file  = shift @ARGV;
my $out_file = shift @ARGV;

my %code = ();
my @aa   = ();
open (my $rh, "<", $ref_file) or die "cannot read $ref_file\n";
my $nl = 0;
while (<$rh>) {
    $nl++;
    chomp;
    my @a = split (/\t/, $_);
    if ($nl == 1) { # capture first line with aa
        @aa = @a;
    }
    elsif ($nl == 2) { # capture second line with values, combine both in a Hash
        for (my $i = 0; $i<=$#a; $i++) {
            $code{ $aa[$i] } = $a[$i];
        }
    }
    else { last; }
}
close $rh;

open (my $ih, "<", $in_file) or die "cannot read $in_file\n";
open (my $oh, ">", $out_file) or die "cannot write $out_file\n";
$/ = "\n>"; # read fasta in blocks
while (<$ih>) {
    s/>//g;
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq; # reconstruct sequence if multiline
    print $oh ">$id\n"; # fasta header
    @seq = split (//, $seq);
    my @val = ();
    foreach my $aa (@seq) {
        if (defined $code{$aa}) { push @val, $code{$aa}; }
        else { die "$aa is not defined in $ref_file\n";  }
    }
    print $oh join "\t", @val;
    print $oh "\n";
}
close $ih;
close $oh;

Testing:

$ perl convert.pl ref.data seqs.pep seqs.vals
$ cat seqs.vals
>Example1
-0.5    3       0.2
>Example2
3       -1      0.2     3       0       -0.5
ADD COMMENT
0
Entering edit mode
5.1 years ago
Joe 21k

Making some major assumptions about the actual file formats, I'd do something like:

from Bio import SeqIO

with open('reference.txt', 'r') as rh:
    content = rh.read()
    mappings = dict(zip(content[0].split(), content[1].split())

for record in SeqIO.parse('sequences.fasta', 'fasta'):
    scores = []
    for char in str(record.seq):
        try:
            scores.append(mappings[char])
        except KeyError:
            scores.append('.')

    print(">" + record.description + "\n" +
          " ".join(scores))

I have just written this from memory/off the top of my head so it's untested. I also expect there's a more concise and faster way to do it with str.translate but this will probably work (with some minor tweaks).

The try/except block will prevent this crashing in the even that you have a residue in the sequence that isn't in the numerical mappings, but it will put a dummy character in (.), so its up to you how you want to handle this.

ADD COMMENT
0
Entering edit mode

I will try to fix it. Thank you

ADD REPLY

Login before adding your answer.

Traffic: 1499 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6