How To Write A Perl Script To Transform A Sequence In A Tabular File Into A Fasta Format
11.6 years ago

I have sequences in tab-delimited format as follows:

GGCGGATGTAGCCACGTGGATC    35    12
AGCTGCTGTAGGGTATGGCGAGCC    1    1
TGGATAATGGACGAGTACCGCCTG    14    5
......

I need a Perl script that extracts the first column (i.e. the sequence) and writes it to the output file as follows:

>seq_1
GGCGGATGTAGCCACGTGGATC
>seq_2
AGCTGCTGTAGGGTATGGCGAGCC
>seq_3
TGGATAATGGACGAGTACCGCCTG
......

Could anybody help me with that? Thanks a lot!


What have you tried? Also, try not to create tags simply by splitting sentences into words. "to" is not a useful tag.


Thanks for your suggestion.

11.6 years ago
Gabriel R. ★ 2.9k

One line in awk:

cat [your file]  | awk 'BEGIN{COUNTER=1}{print ">seq_"COUNTER"\n"$1; COUNTER++;}' > output.fa

Or more simply:

awk '{print ">seq_" ++i "\n" $1}' your_file > output.fa

In awk, an uninitialized variable evaluates to 0 in a numeric context, so there is no need to declare it. You just need to pre-increment it so that the numbering starts at 1.

Edit: one can also use the built-in variable NR (the number of records read so far). This avoids creating an ad hoc variable i and should be slightly faster (not tested).

awk '{print ">seq_" NR "\n" $1}' your_file > output.fa
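
The two variants above only differ once some lines are skipped: NR counts every record read, so the numbering would have gaps, whereas a counter that is incremented only when a record is printed stays consecutive. Here is a minimal sketch of that case, assuming blank lines in the input should be skipped (the original question does not mention blank lines, and the function name tab_to_fasta is hypothetical):

```shell
# Hypothetical helper: convert a tab-delimited file (sequence in column 1) to FASTA.
# Assumption: blank lines should be skipped rather than emitted as empty records.
tab_to_fasta() {
    # NF is 0 on blank lines, so the pattern "NF" skips them.
    # i is incremented only for printed records, so numbering stays consecutive.
    awk 'NF {print ">seq_" ++i "\n" $1}' "$1"
}
```

Usage would be: tab_to_fasta your_file > output.fa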

This wins the code golf...and is more readable compared to the Perl equivalent.

perl -ne 'print ">seq_".++$i."\n".(split)[0]."\n";' your_file > output.fa

Thank you Alastair. Using shell commands, I came up with this:

cut -f 1 your_file | nl | sed -e 's/^\ */>seq_/' -e 's/\t/\n/' > output.fa

Does anyone know how to do it using only sed?
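
One possible sed-only approach, offered as a sketch rather than a definitive answer: the first sed prints each line number on its own line (the "=" command), and the second sed joins the number with the following line (the "N" command) and keeps only the number and the first column. The function name tab_to_fasta_sed is hypothetical, and printf is used only to obtain a literal tab character so the expression does not depend on GNU-specific escapes.

```shell
# Hypothetical helper: tab-delimited file (sequence in column 1) to FASTA using sed.
tab_to_fasta_sed() {
    # A literal tab, so the bracket expression works in non-GNU sed as well.
    tab=$(printf '\t')
    # 1st sed: "=" prints the line number on a separate line before each input line.
    # 2nd sed: "N" appends the next line, then the substitution rewrites
    #          "<number>\n<col1><tab>..." as ">seq_<number>" + newline + "<col1>".
    sed = "$1" | sed "N; s/^\([0-9]*\)\n\([^${tab}]*\).*/>seq_\1\\
\2/"
}
```

Usage would be: tab_to_fasta_sed your_file > output.fa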

11.6 years ago
csiu ▴ 60

Another way to do this is:

$ perl below-script.pl sequence-input.txt

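#!/usr/bin/perl
use strict;
use warnings;

# Open the tab-delimited input named on the command line and the FASTA output.
open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
open(my $out, '>', 'Output.fa') or die "Cannot open Output.fa: $!";

while (<$in>) {
    chomp;
    my ($seq) = split /\t/;          # the first column is the sequence
    print $out ">seq_$.\n$seq\n";    # $. is the current input line number
}

close($out);
close($in);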
11.6 years ago
Naren ▴ 1000

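Try this:

#!/usr/bin/perl
use strict;
use warnings;

print "Enter Input File: ";
my $in = <STDIN>;
chomp $in;
open(my $fh, '<', $in) or die "Cannot open $in: $!";
open(my $out, '>', 'output_sequence.fasta') or die "Cannot open output_sequence.fasta: $!";

my $count = 0;
while (my $line = <$fh>) {
    chomp $line;
    next unless $line =~ /\S/;       # skip blank lines
    my @word = split /\t/, $line;    # columns: the sequence, then the two numbers
    $count++;
    $word[0] =~ s/\s+//g;            # strip stray whitespace from the sequence
    print $out ">seq_$count\n$word[0]\n";
}
close $fh;
close $out;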
11.6 years ago
Kenosis ★ 1.3k

Here's another option:

perl -lane 'print ">seq_$.\n$F[0]"' inFile >outFile

Or as a script:

use strict;
use warnings;

while (<>) {
    print ">seq_$.\n" . (/(\S+)/)[0] . "\n";
}

Usage: perl script.pl inFile >outFile

Output of both on your dataset:

>seq_1
GGCGGATGTAGCCACGTGGATC
>seq_2
AGCTGCTGTAGGGTATGGCGAGCC
>seq_3
TGGATAATGGACGAGTACCGCCTG

Hope this helps!
