Adapt perl script
3
0
Entering edit mode
8.6 years ago
weissert • 0

I am not a Perl coder and hope to find help in this forum. I aim to run the script.pl, shown below on the following input but do not get the desired output:

here is my desired output:

>Pseudo_TTA 33-35
AGATCTGTAAGCTCAAAATGGTAGAGCTCTCGTTATGAGAGTTTGGTGGTTCGAATCCACCCAGATCTGcca
>His_GTG 34-36
GTGGCTGTAGTTTAGTGGTGAGAATTCCACGTTGTGGCCGTGGAGACCTGGGCTCGAATCCCAGCAGCCACAcca
>Trp_CCA 33-35
GGATCCGTGGCGCAATGGTAGCGCGTCTGACTCCAGATCAGAAGGTTGCGTGTTCGATTGACGTCGGGTTCAcca
>Met_CAT 35-37
GGGGTGGTGGCGCAGTTGGCTAGCGCGTAGGTCTCATAATCCTGAGGTCGAGAGTTCGAGCCTCTCTCACCCCAcca

input file1: tRNAfasta.fa

>chrA01_622419_622487_+_Pseudo_TTA_28.44
AGATCTGTAAGCTCAAAATGGTAGAGCTCTCGTTATGAGAGTTTGGTGGTTCGAATCCACCCAGATCTG
>chrA01_2528533_2528604_+_His_GTG_63.70
GTGGCTGTAGTTTAGTGGTGAGAATTCCACGTTGTGGCCGTGGAGACCTGGGCTCGAATCCCAGCAGCCACA
>chrA01_2624977_2625048_+_Trp_CCA_69.73
GGATCCGTGGCGCAATGGTAGCGCGTCTGACTCCAGATCAGAAGGTTGCGTGTTCGATTGACGTCGGGTTCA
>chrA01_21878294_21878379_+_Met_CAT_61.56
GGGGTGGTGGCGCAGTTGGCTAGCGCGTAGGTCTCATAGCTATCTGAGTAATCCTGAGGTCGAGAGTTCGAGCCTCTCTCACCCCA

Input file 2: codonpos.txt

chrA01.trna1 (622419-622487)    Length: 69 bp
Type: Sup   Anticodon: TTA at 33-35 (622451-622453) Score: 28.44
Possible pseudogene:  HMM Sc=-4.46  Sec struct Sc=32.90
         *    |    *    |    *    |    *    |    *    |    *    |    *   
Seq: AGATCTGTAaGCTCAAaaTGGTAGAGCTCTCGTTATGAGAGTtTGGTGGTTCGAATCCACCCAGATCTG
Str: >>>>>>>...>>>>.........<<<<.>>>.....<<<.....>>>>>.......<<<<<<<<<<<<.

chrA01.trna2 (2528533-2528604)  Length: 72 bp
Type: His   Anticodon: GTG at 34-36 (2528566-2528568)   Score: 63.70
         *    |    *    |    *    |    *    |    *    |    *    |    *    | 
Seq: GTGGCTGTAGTTTAGTGGTgAGAATTCCACGTTGTGGCCGTGGAGACCTGGGCTCGAATCCCAGCAGCCACA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.

chrA01.trna3 (2624977-2625048)  Length: 72 bp
Type: Trp   Anticodon: CCA at 33-35 (2625009-2625011)   Score: 69.73
         *    |    *    |    *    |    *    |    *    |    *    |    *    | 
Seq: GGATCCGTGGCGCAATGGTAGCGCGTCTGACTCCAGATCAGAAGGtTGCGTGTTCGATTGACGTCGGGTTCA
Str: >>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<.....>>>>.........<<<<<<<<<<<.

chrA01.trna27 (21878294-21878379)   Length: 86 bp
Type: Met   Anticodon: CAT at 35-37 (21878328-21878330) Score: 61.56
Possible intron: 39-50 (21878332-21878343)
         *    |    *    |    *    |    *    |    *    |    *    |    *    |    *    |    *
Seq: GGGGTGGTGGCGCAGTTGGCtAGCGCGTAGGTCTCATAgctatctgagtaATCCTGAGGtCGAGAGTTCGAGCCTCTCTCACCCCA
Str: >>>>>>>..>>>>.........<<<<.>>>>.....................<<<<.....>>>>>.......<<<<<<<<<<<<.

here is the command:

perl script.pl  tRNAfasta.fa codonpos.txt > out.fa

script.pl:

$fa = shift;
open(FA,$fa);

while($line = <FA>){
   chomp $line; 
   if($line =~ />/){
      $header = $line;
   }
   my($tRNA,$chrID,$loc,$strand,$aa,$codon,$len,$pb,$sc,$score) = split(" ",$header);
   $tRNA =~  s/>Mus_musculus_tRNA-//g; 
      $chrID =~ s/[()]//g;
   my($id,$name) = split("-",$chrID);
   $tRNA_id_hash{$id} = $tRNA;
}

$sort = shift;
open(SORT,$sort);

while($name = <SORT>){
   chomp $name;
   $aminoAcidInfo = <SORT>;
   chomp $aminoAcidInfo;
   my($type,$aa,$anti,$anti_nt,$at,$anticodon_loc,$chr) = split(" ",$aminoAcidInfo);
   $HMM = <SORT>;
   if($HMM =~ /intron/){
      $intron = <SORT>;
   }
   $star = <SORT>;
   $seq = <SORT>;
   chomp $seq;
   $seq =~ s/Seq: //g;
   $seq = $seq."CCA";
   $seq = uc($seq);
   $structure = <SORT>;
   $blank = <SORT>;
   my($id,$len) = split(/\t/,$name);
   my($chrID,$loc) = split(" ",$id);
   print ">$tRNA_id_hash{$chrID}\t$anticodon_loc\n$seq\n";
}

Here is the current output which is not at all what I desired :-( I tried to make sense of it but do not get it.

>   33-35
AGATCTGTAAGCTCAAAATGGTAGAGCTCTCGTTATGAGAGTTTGGTGGTTCGAATCCACCCAGATCTG
CCA
>   34-36
STR: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.
CCA
>   |

CCA
>   
CCA
software error sequence gene • 2.4k views
ADD COMMENT
0
Entering edit mode

What is the use of input file1 here? It seems we can generate desired output from input file 2, codonpos.txt.

ADD REPLY
0
Entering edit mode

true. But the script is the only thing on hand and it requires the input of both file 1 and file 2.

ADD REPLY
0
Entering edit mode

if you can explain what the pseudo code/algorithm is or whatever your are trying to achieve in a little more detail, may be I can help you. Waiting for your response!

ADD REPLY
0
Entering edit mode

a) Reformat the header of the fasta sequences to be: >tRNA_name predicted_anti-codon_positions b) Remove predicted introns c) add CCA to the sequence tail

does it help`?

ADD REPLY
0
Entering edit mode

A general advice: add comments to your code.

ADD REPLY
1
Entering edit mode
8.6 years ago
Prakki Rama ★ 2.7k

If perl program is not compulsory, you can get desired output by running a series of "SED" commands (i used 8 in this case :-). I used only your "codonpos.txt" file to generate the output.

   egrep "^Type:|^Seq:" codonpos.txt \|
   sed 's/Sup/Pseudo/g' \| 
   sed '/^Type:/s% (.*%%' \|
   sed 's/Type: />/' \| 
   sed 's/   Anticodon: /_/' \|
   sed 's/ at /_/' \|
   sed 's/Seq: //g' \| 
   sed -e '/^>/! s/\(.*\)/\U\1/' \| 
   sed '/^>/! s%$%cca%'

Updated (as per Giovanni M Dall'Olio suggestion)

egrep "^Type:|^Seq:" input.txt | sed -e 's/Sup/Pseudo/g' -e '/^Type:/s% (.*%%' -e 's/Type: />/' -e 's/   Anticodon: /_/' -e 's/ at /_/' -e 's/Seq: //g' -e '/^>/! s/\(.*\)/\U\1/' -e '/^>/! s%$%cca%'

Output

>Pseudo_TTA_33-35
AGATCTGTAAGCTCAAAATGGTAGAGCTCTCGTTATGAGAGTTTGGTGGTTCGAATCCACCCAGATCTGcca
>His_GTG_34-36
GTGGCTGTAGTTTAGTGGTGAGAATTCCACGTTGTGGCCGTGGAGACCTGGGCTCGAATCCCAGCAGCCACAcca
>Trp_CCA_33-35
GGATCCGTGGCGCAATGGTAGCGCGTCTGACTCCAGATCAGAAGGTTGCGTGTTCGATTGACGTCGGGTTCAcca
>Val_CAC_35-37
GTCTGGGTGGTGTAGTCGGTTATCACGCTAGTCTCACACACTAGAGGTCCCCGGTTCGAACCCGGGCTCAGACAcca
ADD COMMENT
1
Entering edit mode

You could invoke sed only once, specifying each expression with -e. This should reduce IO usage.

ADD REPLY
0
Entering edit mode

@Giovanni M Dall'Olio: Updated. Thank you!

ADD REPLY
0
Entering edit mode

perl is not a must :-)

I copied your script and saved it in script.sed Running on Ubuntu, executed the following commmand:

sed script.sed codonpos.txt

This is what happend:

sed: -e expression #1, char 10: unterminated `s' command
ADD REPLY
0
Entering edit mode

If you copied and pasted Prakki Rama's code into a file named script.sed, you have to type:

bash script.sed > output.fa

But, you can use Prakki Rama's command directly in the terminal, i.e., cd into the folder with the codonpos.txt and paste the command to your terminal.

ADD REPLY
0
Entering edit mode
8.6 years ago
Anima Mundi ★ 2.9k

An ugly (but working) solution in Python:

def define_field(string, field_number): 
    field = ''
    take = False 
    string = ':' + string 
    string = string.replace('\n', '') 

    for char in string:

        if char == ':':

            if field_number - 1 == 0: 
                take = True
                field_number -= 1
            elif field_number - 1 == -1: 
                break
            else:
                field_number -= 1 
        else:
            if take:
                field += char

    return field


type = ''
anti = ''
at = ''

header = ''
seq = ''
for line in open('codonpos.txt'):
    if 'Type' in line:
        type = define_field(line, 2).replace('Anticodon','').replace('_','')
        anti = define_field(line, 3)[0:4]
        at = define_field(line, 3)[7:14]

        header = '>' + type.replace(' ','') + '_' + anti.replace(' ','') + at
        print header
    elif 'Seq' in line:
        print line.replace('Seq: ','').replace('\n','').upper() + 'cca'
    else:
        pass
ADD COMMENT
0
Entering edit mode
8.6 years ago

Here is a perl script for your purpose:

run it like $ perl test.pl

output file called out.fa will be created in same directory where you save this script.

open($fh, "codonpos.txt") or die "Can't open!";
open($fg, ">out.fa") or die "Can't open!";

@file=<$fh>;

foreach $line(@file){

if($line=~/^Type: (\w+)\s+Anticodon:\s(\w+)\sat\s(\d+\-\d+)/){

    if($1 eq 'Sup'){
    $codon='Psuedo';
    }

    else{
    $codon=$1;
    }

print $fg '>'.$codon.'_'.$2.'_'.$3."\n";
}

if($line=~/Seq: (\w+)/){
print $fg $1.'cca'."\n";
}


}

close $fh;
close $fg;

Please let me know if it works for you. In case of any difficulty, please get back to me.

ADD COMMENT

Login before adding your answer.

Traffic: 2286 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6