Question

How To Write A Perl Script To Transform A Sequence In A Tabular File Into A Fasta Format

1

11.6 years ago

redspider19800915 ▴ 40

I have sequences in tab-delimited format as follows:

GGCGGATGTAGCCACGTGGATC    35    12
AGCTGCTGTAGGGTATGGCGAGCC    1    1
TGGATAATGGACGAGTACCGCCTG    14    5
......

I need a Perl script that extracts the first column (i.e. the sequence) and writes it to an output file as follows:

>seq_1
GGCGGATGTAGCCACGTGGATC
>seq_2
AGCTGCTGTAGGGTATGGCGAGCC
>seq_3
TGGATAATGGACGAGTACCGCCTG
......

Could anybody help me make that? Thanks a lot!

perl • 6.7k views

ADD COMMENT • link updated 11.6 years ago by Istvan Albert 102k • written 11.6 years ago by redspider19800915 ▴ 40

1

What have you tried? Also, try not to create tags simply by splitting sentences into words. "to" is not a useful tag.

ADD REPLY • link 11.6 years ago by Neilfws 49k

0

Thanks for your suggestion.

ADD REPLY • link 11.6 years ago by redspider19800915 ▴ 40

Sukhi Singh · Answer 1 · 2013-05-16

3

11.6 years ago

Gabriel R. ★ 2.9k

A one-liner in awk:

cat [your file]  | awk 'BEGIN{COUNTER=1}{print ">seq_"COUNTER"\n"$1; COUNTER++;}' > output.fa

ADD COMMENT • link updated 11.6 years ago by Sukhi Singh 11k • written 11.6 years ago by Gabriel R. ★ 2.9k

8

Or more simply:

awk '{print ">seq_" ++i "\n" $1}' your_file > output.fa

In awk, an uninitialized variable evaluates to 0 in a numeric context, so there is no need to declare it. Pre-incrementing it makes the numbering start at 1.

Edit: one can also use the internal variable NR (number of records). It avoids the creation of an ad hoc variable i and should be slightly faster (not tested).

awk '{print ">seq_" NR "\n" $1}' your_file > output.fa

ADD REPLY • link 11.6 years ago by Frédéric Mahé ★ 3.2k
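One practical difference between the two counters: NR counts every input record, blank lines included, so if empty lines are skipped the NR-based headers would have gaps, while a separate counter stays consecutive. A minimal sketch, assuming blank lines should simply be ignored (the function name tab_to_fasta and the NF guard are illustrative additions, not part of the commands above):

```shell
# tab_to_fasta FILE: print column 1 of each non-blank line as a FASTA record.
# NF is 0 on blank lines, so the NF pattern skips them; ++i only advances on
# lines that are actually printed, which keeps the seq_N numbering consecutive.
tab_to_fasta() {
    awk 'NF {print ">seq_" ++i "\n" $1}' "$1"
}
```

Usage would be something like: tab_to_fasta your_file > output.fa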

2

This wins the code golf... and is more readable than the Perl equivalent:

perl -ne 'print ">seq_".++$i."\n".(split)[0]."\n";' your_file > output.fa

ADD REPLY • link updated 11.6 years ago by Sukhi Singh 11k • written 11.6 years ago by Alastair Kerr 5.3k

0

Thank you Alastair. Using only shell commands, I came up with this:

cut -f 1 your_file | nl | sed -e 's/^\ */>seq_/' -e 's/\t/\n/' > output.fa

Does anyone know how to do it using only sed?

ADD REPLY • link 11.6 years ago by Frédéric Mahé ★ 3.2k
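On the sed-only question: sed has no arithmetic, but its = command prints the current line number, so two chained sed calls seem to be enough. A rough sketch, assuming a tab-delimited input file; the function name tab_to_fasta_sed is made up for illustration, and the literal tab is produced with printf so the pattern does not rely on sed understanding \t:

```shell
# tab_to_fasta_sed FILE: same conversion using sed alone (two invocations).
# 1st sed: '=' prints the line number on its own line before each input line.
# 2nd sed: 'N' joins each number line with the data line that follows it,
#          then '>seq_' is prefixed and everything from the first tab onward
#          is deleted, leaving ">seq_N<newline>SEQUENCE".
tab_to_fasta_sed() {
    tab=$(printf '\t')
    sed = "$1" | sed -e 'N' -e 's/^/>seq_/' -e "s/${tab}.*//"
}
```

Usage would be something like: tab_to_fasta_sed your_file > output.fa. Unlike nl, sed = also numbers blank lines, so a blank input line would yield a header followed by an empty sequence.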

score 1 · Answer 2 · 2013-05-16

Another way to do this is:

$ perl below-script.pl sequence-input.txt

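#!/usr/bin/perl
use strict;
use warnings;

# Input file is the first command-line argument; output goes to Output.fa.
open(my $in,  '<', $ARGV[0])   or die "Cannot open $ARGV[0]: $!";
open(my $out, '>', 'Output.fa') or die "Cannot open Output.fa: $!";

while (<$in>) {
    chomp;
    my ($seq) = split /\t/;          # keep only the first column
    print $out ">seq_$.\n$seq\n";    # $. is the current input line number
}

close $out;
close $in;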

score 0 · Answer 3 · 2013-05-16

try this:

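#!/usr/bin/perl
use strict;
use warnings;

# Ask for the input file name interactively.
print "Enter Input File: ";
chomp(my $in = <STDIN>);

open(my $fh,  '<', $in) or die "Cannot open $in: $!";
open(my $out, '>', 'output_sequence.fasta') or die "Cannot open output_sequence.fasta: $!";

my $count = 0;
while (my $line = <$fh>) {
    chomp $line;
    next unless $line =~ /\S/;       # skip blank lines
    my ($seq) = split /\t/, $line;   # first tab-separated column
    $count++;
    print $out ">seq_$count\n$seq\n";
}
close $fh;
close $out;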

score 0 · Answer 4 · 2013-05-16

Here's another option: perl -lane 'print ">seq_$.\n$F[0]"' inFile >outFile

Or as a script:

use strict;
use warnings;

while (<>) {
    print ">seq_$.\n" . (/(\S+)/)[0] . "\n";
}

Usage: perl script.pl inFile >outFile

Output of both on your dataset:

>seq_1
GGCGGATGTAGCCACGTGGATC
>seq_2
AGCTGCTGTAGGGTATGGCGAGCC
>seq_3
TGGATAATGGACGAGTACCGCCTG

Hope this helps!