Is There A Perl Script To Select Longest Sequence Region(Translated) Between Stop Codons Translated Using Transeq?
2
1
Entering edit mode
10.9 years ago
biostar ▴ 170

Hi Everyone, I am trying to figure out the easiest way to extract the longest sequence from a translated frame of DNA sequences. The sequences I have were translated using transeq, now I need to extract the region with the longest gap between two stop codons (as indicated by * in the sequence below. The translated sequence look like this:

>Hh_TY_4
 KSHLATLH*LSQCH*NSMSISAF*VIYVYFE*S*HTFRIPFVKTF**FK**NSSIHNEKFVLFP**KDISSRWKSHSCCWRRERPKQFVCIWRWYSICQTSTRVRSRNNQENPVRN*HLQSQ*SEESQRCR
>Hh_TY_5
 KSHLATLH*LSQCH*NSMSISAF*VIYVYFE*S*HTFRIPFVKTFFKNSSIHNEKFVLFPKDISENPVRN*HLQSQ*SEESQRCR

And the output I want is like this:

>Hh_TY_4
KDISSRWKSHSCCWRRERPKQFVCIWRWYSICQTSTRVRSRNNQENPVRN
>Hh_TY_5
HTFRIPFVKTFFKNSSIHNEKFVLFPKDISENPVRN

Any suggestions on this would be really appreciated. Thanks!

perl translation • 3.7k views
ADD COMMENT
1
Entering edit mode
10.9 years ago
JC 13k

try this:

#!/usr/bin/perl

use strict;
use warnings;

$/ = "\n>";
while (<>) {
    s/>//g;
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq;
    my @orfs = split (/\*/, $seq);
    shift @orfs; pop @orfs; # remove last and first, I think you want only if an ORF is BETWEEN two *, if not, just delete this line
    my $sel = shift @orfs;
    foreach my $next (@orfs) {
        $sel = $next if ((length $sel) < (length $next))
    }
    print ">$id\n$sel\n";
}

then run as: perl getLongestORF.pl < FASTA > OUT

ADD COMMENT
1
Entering edit mode
10.9 years ago
Kenosis ★ 1.3k

Perhaps the following will be helpful:

use strict;
use warnings;
use List::Util qw/max/;

local ( $/, $" ) = qw/ > * /;

while (<>) {
    if ( my ( $id, @seq ) = /([^>\n]+)/g ) {
        my %h;
        push @{ $h{ length $_ } }, $_ for split /\*/, join '', @seq;
        print ">$id\n@{ $h{ max keys %h } }\n";
    }
}

Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs output to a file.

Output on your dataset:

>Hh_TY_4
KDISSRWKSHSCCWRRERPKQFVCIWRWYSICQTSTRVRSRNNQENPVRN
>Hh_TY_5
HTFRIPFVKTFFKNSSIHNEKFVLFPKDISENPVRN

This script handles cases where there is more than one 'longest' sequence region, separating those sequences with a "*". For example, a possible result might be:

>Hh_TY_6
HTFRIPFVKTFFKNSSIHNEKFVLFPKDISENPVRN*KDISSRWKSHSCCWRRERPKQFVCIWRWYSICQTST

This script builds a hash of arrays (HoA) for each set of sequence regions (strings), where the key is the length and the associated value is a reference to a list of sequence regions. max keys %h returns the key with the greatest value--thus the key whose value is a reference to the list of sequence regions with the greatest length. $h{ max keys %h } returns the reference to the list of those strings. @{ $h{ max keys %h } } produces an array of those strings, and since it's enclosed within double quotes, the array is interpolated. Because of the local $" = '*', multiple array elements will be displayed with a "*" between them during this interpolation.

ADD COMMENT
0
Entering edit mode

This is cool, I learned something new. I had to read perlvar and take a long look to understand this one because it's pretty esoteric. I like seeing different ways of solving problems, and this is very useful.

ADD REPLY

Login before adding your answer.

Traffic: 1604 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6