Question

Command To Change Fasta File Format

0

11.4 years ago

biolab ★ 1.4k

Hi, everyone,

I want to convert format 1 into format 2, as shown below. The command I used is:

cut -f 1 inputfile | sed -e 's/\n/\t/g' -e 's/>/\n>' > outputfil.

However, it doesn't work. What's wrong with this command?

FORMAT 1

>gene1 ATPaseII
CTGATGCA
>gene2 Actin1
CGGGCGGTA
>gene3 pesudogene
ATGACTGACTG

FORMAT 2

>gene1  CTGATGCA
>gene2  CGGGCGGTA
>gene3  ATGACTGACTG

Thank you very much.

linux command-line • 3.7k views

ADD COMMENT • link updated 9.8 years ago by Biostar 20 • written 11.4 years ago by biolab ★ 1.4k

Eric Normandeau · Answer 1 · 2013-11-28

3

11.4 years ago

Eric Normandeau 11k

Your attempt is fairly close. You may want to try the following command. The parts of the pipeline are split across several lines to make the whole easier to read. The \ characters tell the shell that the command continues on the next line.

cut -d " " -f 1 f1 | \
    perl -pe 's/>/_newline_>/; s/\n/\t/' | \
    perl -pe 's/_newline_//' | \
    perl -pe 's/_newline_/\n/g' | \
    perl -pe 's/\t$//' > f2

Here are some details about the steps:

1. cut -d " " -f 1, where -d " " specifies that the delimiter is a space. This removes anything after the first space on each line.
2. perl -pe is used much like sed -e. I sometimes find perl the better tool, so rather than learning both sed and perl, I suggest learning only perl.
3. s/>/_newline_>/ adds a unique marker string so that the line breaks can be recreated later.
4. s/\n/\t/ replaces the newlines with tabs. At this point, the whole file is a single line.
5. perl -pe 's/_newline_//' removes the first occurrence of _newline_ in the file, to avoid starting the file with an empty line later.
6. perl -pe 's/_newline_/\n/g' replaces each remaining _newline_ string with a newline.
7. perl -pe 's/\t$//' removes the tab at the end of each line.

In this example, I use pipes (|) in a few places where the need may not be obvious. Perl processes a file, or the input it receives through a pipe, one line at a time, as delimited by a newline character (\n or similar). Thus, for example, once I replace all the newline characters in step 4, I create one long line and must use a pipe so that the next transformation is applied to the whole file, not only to the line currently being processed. This permits the trick in step 5, where I remove only the first occurrence of _newline_ in the whole file, which is now on one line.

ADD COMMENT • link 11.4 years ago by Eric Normandeau 11k

0

Hi Eric, could you please briefly tell me what the difference is between _newline_ and \n? Thanks a lot!

ADD REPLY • link updated 11.4 years ago by Eric Normandeau 11k • written 11.4 years ago by biolab ★ 1.4k

1

As a side note, I edited your comment to use full English. "Could u pls" is just as easy to write as "Could you please". The latter is more polite and also more pleasant to read for a person who spent a few minutes helping you and future users ;)

ADD REPLY • link 11.4 years ago by Eric Normandeau 11k

0

Yes, you are right. As a Perl beginner, I think I really learned something about this language, especially from the three _newline_ substitutions in your code, which look alike but differ. THANK YOU.

ADD REPLY • link 11.4 years ago by biolab ★ 1.4k

0

_newline_ is just a string I decided to use to mark the positions where I will later put a \n back. I could just as well have used any string, like INSERT_NEWLINE_HERE :)

ADD REPLY • link 11.4 years ago by Eric Normandeau 11k

0

Really helpful and informative, THANKS!!

ADD REPLY • link 11.4 years ago by biolab ★ 1.4k

Frédéric Mahé · Answer 2 · 2013-11-29

3

11.4 years ago

Frédéric Mahé ★ 3.2k

Here is a solution with Awk:

awk 'BEGIN {RS = ">"} NR > 1 {print ">"$1"\t"$NF}' inputfile

It uses > as the record separator, skips the empty first record with NR > 1, and prints the first field ($1) and the last field ($NF) of each record.
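
As a rough illustration of how Awk splits the input here, the sketch below first reports the number of fields in each record and then runs the command on the sample from the question. The scratch directory and the file names inside it are assumptions made for this example.

```shell
# Hypothetical scratch directory holding the sample data from the question.
demo=$(mktemp -d)
printf '>gene1 ATPaseII\nCTGATGCA\n>gene2 Actin1\nCGGGCGGTA\n>gene3 pesudogene\nATGACTGACTG\n' > "$demo/inputfile"

# With RS=">" the text before the first ">" is an empty record (NR == 1).
# With the default field separator, newlines also separate fields, so each
# later record holds three fields: name, description, sequence.
awk 'BEGIN {RS = ">"} {printf "record %d has %d fields\n", NR, NF}' "$demo/inputfile"

# The command from the answer: keep the first and last field of each record.
awk 'BEGIN {RS = ">"} NR > 1 {print ">"$1"\t"$NF}' "$demo/inputfile" > "$demo/outputfile"
cat "$demo/outputfile"
```

Because $NF is simply the last whitespace-separated token of the record, this relies on each sequence sitting on a single line and on headers that contain no further > character.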

ADD COMMENT • link 11.4 years ago by Frédéric Mahé ★ 3.2k

0

Will this work if the sequences are long and span multiple lines?

ADD REPLY • link 11.4 years ago by Eric Normandeau 11k

1

No, it will not. If the sequences span multiple lines, one may first linearize the FASTA file (so that each sequence is written on a single line). This can be done with Awk too: awk 'NR==1 {print ; next} {printf (/^>/) ? "\n"$0"\n" : $1}' file.fas
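
The linearize-then-convert idea can be sketched as a two-step pipeline. Note that the first awk call below is a variant written for this example, not the one-liner quoted above: it uses explicit printf formats (so a % in a header is not interpreted as a format specifier) and ends the output with a newline. It assumes the file starts with a header line; the file names multi.fas and format2.tsv are hypothetical.

```shell
# Hypothetical multi-line FASTA input in a scratch directory.
dir=$(mktemp -d)
printf '>gene1 ATPaseII\nCTGA\nTGCA\n>gene2 Actin1\nCGGGC\nGGTA\n' > "$dir/multi.fas"

# Step 1: linearize. Headers are printed on their own line; sequence
# chunks are concatenated until the next header or the end of the file.
# Step 2: convert the linearized records to ">name<TAB>sequence".
awk '/^>/ {if (NR > 1) printf "\n"; print; next}
     {printf "%s", $1}
     END {printf "\n"}' "$dir/multi.fas" |
    awk 'BEGIN {RS = ">"} NR > 1 {print ">"$1"\t"$NF}' > "$dir/format2.tsv"
cat "$dir/format2.tsv"
```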

ADD REPLY • link 11.4 years ago by Frédéric Mahé ★ 3.2k

0

It's great to know these awk commands. I did not expect to learn so many commands when I originally posted this question. THANKS.

ADD REPLY • link 11.4 years ago by biolab ★ 1.4k