Question

Why Perl Or Sed Command Not Working

0

11.4 years ago

biolab ★ 1.4k

Hi everyone, I have a FASTA file like the one below.

>miR156a
GACAGAA
>miR156b
GACAGAA
>miR156c
GACAGAA
............

I need to format it as below.

    miR156a   GACAGAA
    miR156b   GACAGAA
    miR156c   GACAGAA
    ............

First, I want to replace every newline with a tab, and then replace each > with a newline. For the first step I used the command sed -e 's/\n/\t/g' IN > OUT, but it didn't work. I then tried an alternative Perl command, cat IN | perl -ne 's/\n/\t/' > OUT, and this time the OUT file contains nothing. What's my problem? Thank you very much for your answers!
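For reference, a minimal sketch of why the two commands behave this way. sed reads its input one line at a time and strips the trailing newline before applying the script, so s/\n/\t/g never finds a newline to replace. perl -n loops over the input but prints nothing unless the script calls print (-p is the switch that prints each line). The sample data and the temporary file below are illustrative assumptions, not the original IN file.

```shell
# Two-record sample written to a temporary file (path chosen by mktemp).
tmp=$(mktemp)
printf '>miR156a\nGACAGAA\n>miR156b\nGACAGAA\n' > "$tmp"

# sed applies the script to one line at a time, with the newline already
# removed, so the substitution has nothing to match and the text is unchanged.
sed_out=$(sed -e 's/\n/\t/g' "$tmp")

# tr operates on the raw byte stream, so it does turn newlines into tabs.
tr_out=$(tr '\n' '\t' < "$tmp")

printf '%s\n' "$sed_out"
printf '%s\n' "$tr_out"
rm -f "$tmp"
```

The first printf shows the file unchanged; the second shows everything joined on one tab-separated line.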

perl • 5.8k views

ADD COMMENT • link updated 11.4 years ago by Vivek Krishnakumar ▴ 400 • written 11.4 years ago by biolab ★ 1.4k

0

Following up on my question, I tried a new Perl command, cat IN | perl -ne 'while (<>) {chomp; print "$_\t"}' > OUT, and got the following output.

GACAGAA >miR156b^M  GACAGAA >miR156c^M  GACAGAA  ......

This is probably caused by mixing Windows and Linux line endings. Could anyone give me some suggestions and comments? Thanks a lot!

ADD REPLY • link 11.4 years ago by biolab ★ 1.4k

0

It looks like your input file comes from Windows and you are on a *NIX machine. Try running it through dos2unix first, e.g. cat IN | dos2unix | perl ...
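If dos2unix is not installed, deleting the carriage returns with tr should have a similar effect. This is only a sketch on made-up sample data: it counts the CR characters (the ^M seen in the output above) before and after cleaning.

```shell
# Sample with Windows (CRLF) line endings in a temporary file.
tmp=$(mktemp)
printf '>miR156a\r\nGACAGAA\r\n>miR156b\r\nGACAGAA\r\n' > "$tmp"

# Count carriage returns before cleaning (keep only CR bytes, count them).
cr_before=$(tr -cd '\r' < "$tmp" | wc -c)

# Delete every CR; what remains has plain Unix line endings.
unix_text=$(tr -d '\r' < "$tmp")
cr_after=$(printf '%s' "$unix_text" | tr -cd '\r' | wc -c)

echo "CRs before: $cr_before, after: $cr_after"
rm -f "$tmp"
```

In a pipeline this would read cat IN | tr -d '\r' | perl ..., in place of the dos2unix step.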

ADD REPLY • link 11.4 years ago by aheinzel ▴ 130

score 8 · Answer 1 · 2014-01-02

8

11.4 years ago

Pavel Senin ★ 1.9k

cat test.fa | sed -n '/>/ {h; N; s/>//; s/[\r\n]/\t/g; p}'

miR156a    GACAGAA
miR156b    GACAGAA
miR156c    GACAGAA

How it works:

sed -n '          # turn off default printing
 />/{             # if the pattern matches a sequence header
 h;               # put it in the hold space
 N;               # fetch the next line
 s/>//;           # remove a '>' symbol
 s/[\r\n]/\t/g;   # replace every CR or LF with a tab ('g' = all matches)
 p }              # print it
 '
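For input with Windows line endings, a variant that deletes the carriage returns before joining the two lines may be safer, because replacing each CR with a tab leaves extra tabs in the output. This is a sketch that assumes GNU sed (for the \r and \t escapes) and one sequence line per header. The sample data is made up.

```shell
# CRLF sample in a temporary file.
tmp=$(mktemp)
printf '>miR156a\r\nGACAGAA\r\n>miR156b\r\nGACAGAA\r\n' > "$tmp"

sed_out=$(sed -n '/^>/ {
  N
  s/\r//g
  s/^>//
  s/\n/\t/
  p
}' "$tmp")
# N        appends the sequence line to the header line
# s/\r//g  drops every carriage return
# s/^>//   removes the leading ">"
# s/\n/\t/ turns the embedded newline into a tab

printf '%s\n' "$sed_out"
rm -f "$tmp"
```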

ADD COMMENT • link 11.4 years ago by Pavel Senin ★ 1.9k

0

Nice, that's rather more concise than my awk solution!

ADD REPLY • link 11.4 years ago by Devon Ryan 105k

0

Thanks! I hope it'll work for the OP.

ADD REPLY • link 11.4 years ago by Pavel Senin ★ 1.9k

0

And you could:

cat test.fa | sed 'h; N; s/>\(.*\)[\r\n]/\1\t/'

ADD REPLY • link 11.4 years ago by Kenosis ★ 1.3k

score 4 · Answer 2 · 2014-01-02

4

11.4 years ago

Devon Ryan 105k

You're creating an extremely long line, at least if your input file is largish. That's likely screwing things up. Why not just do things in one step:

awk 'BEGIN{ORS="";OFS="";}{gsub(">","",$1); if(NR%2==0) {print "\t",$1,"\n"} else {print "\t",$1}}' foo.fa

ADD COMMENT • link 11.4 years ago by Devon Ryan 105k

5

awk '{x=substr($0,2);getline;print x"\t"$0;}' foo.fa

ADD REPLY • link 11.4 years ago by lh3 33k

0

Nice, I guess I have a penchant for verbosity :P

ADD REPLY • link 11.4 years ago by Devon Ryan 105k

0

this one is cool!

ADD REPLY • link 11.4 years ago by Pavel Senin ★ 1.9k

0

Thank you both! The commands work well!

ADD REPLY • link 11.4 years ago by biolab ★ 1.4k

score 4 · Answer 3 · 2014-01-03

4

11.4 years ago

Kenosis ★ 1.3k

Here's another option:

perl -pe 's/>(.+)[\r\n]/$1\t/' foo.fa

Output on your dataset:

miR156a    GACAGAA
miR156b    GACAGAA
miR156c    GACAGAA

ADD COMMENT • link 11.4 years ago by Kenosis ★ 1.3k

score 3 · Answer 4 · 2014-01-05

Since TMTOWTDI ;), here is another Perl-based method, which does not assume that the FASTA sequence sits on a single line after the header:

perl -076 -l12 -ne 'next unless /\w/; chomp; @b = split /\n/; $h = shift @b; $s = join "", @b; print "$h\t$s";' IN > OUT

Here is how it works:

-076   : Sets the input record separator ($/) to ">" (`76` is its octal character code) so that you can iterate through chunks of FASTA records
-l12   : Sets the output record separator ($\) to "\n" (`12` in octal) and enables automatic line-ending processing (each chunk is chomped on input)
-n     : Wraps the script in a loop over every available chunk, as delimited by the input record separator
-e     : Tells the perl interpreter that the following text is a line of perl code

next unless /\w/; -> Skips any chunk that does not contain data (which is essentially the first chunk, preceding the first occurrence of the ">" symbol)
chomp;            -> Removes any trailing input record separator from the chunk being processed (-l already does this, so it is a harmless safeguard)
@b = split /\n/;  -> Splits the chunk into an array, at every newline char
$h = shift @b;    -> Extracts first element of array which is the FASTA header
$s = join "", @b; -> Joins the rest of the array elements into a string, which corresponds to the sequence
print "$h\t$s";   -> Prints out the header and the sequence delimited by a tab
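The same record-separator idea can be sketched in awk, in case Perl is not at hand. Setting RS to ">" makes each FASTA entry one record, and setting FS to a newline makes the header the first field and each sequence line a later field. The record names seq1 and seq2 are hypothetical, and the sketch assumes Unix line endings and no ">" inside a header.

```shell
# Sample with a sequence wrapped over two lines, in a temporary file.
tmp=$(mktemp)
printf '>seq1\nACGT\nTTGA\n>seq2\nGGCC\n' > "$tmp"

awk_out=$(awk 'BEGIN { RS = ">"; FS = "\n" }
$1 != "" {                      # skip the empty chunk before the first ">"
    seq = ""
    for (i = 2; i <= NF; i++)   # fields 2..NF are the sequence lines
        seq = seq $i
    print $1 "\t" seq           # header, tab, joined sequence
}' "$tmp")

printf '%s\n' "$awk_out"
rm -f "$tmp"
```

Each output line then holds the header and the full sequence, however many lines the sequence was wrapped over.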

score 2 · Answer 5 · 2014-01-03

2

11.4 years ago

Vivek ★ 2.7k

awk '{if (NR % 2 == 1) printf "%s\t", substr($0, 2); else print $0}' file.fa

Another variation with awk

ADD COMMENT • link 11.4 years ago by Vivek ★ 2.7k

2

And with just a few minor changes (but none to your logic):

awk '{printf(NR%2)?substr($0,2)"\t":$0"\n"}' foo.fa

ADD REPLY • link 11.4 years ago by Kenosis ★ 1.3k