Removing (stubborn) new line from Fasta file sequence?
23 months ago
Eliveri ▴ 350

I have a fasta file in this format:

>WP_003850266.1 toxin [Corynebacterium diphtheriae]
MSRKLFASILIGALLGIGAPPSAHAGADDV
EQVGTEEFIKRFGDGASRVVLSLPFAEGS
AVHHNT

I want it to look like this:

>WP_003850266.1 toxin [Corynebacterium diphtheriae]
MSRKLFASILIGALLGIGAPPSAHAGADDVEQVGTEEFIKRFGDGASRVVLSLPFAEGSAVHHNT

However, for this particular fasta file, the newlines are not removed no matter what I try.

I have already tried

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < test.fasta > output.fasta

But the newlines remain.


Try bioawk or something similar, for example:

bioawk -cfastx '{print ">"$name"\n"$seq}' test.fasta > out.fasta
23 months ago
seidel 11k

Your file has some lines with carriage returns (\r or ^M), but not all:

tail -2 test.fasta | od -c
0000000    S   T   N   S   R   L   C   A   V   F   V   R   S   G   Q   P
0000020    V   I   G   A   C   T   S   P   Y   D   G   K   Y   W   S   M
0000040    Y   S   R   L   R   K   M   L   Y   L   I   Y   V   A   G   I
0000060    S   V   R   V   H   V   S   K   E   E   Q   Y   Y   D   Y   E
0000100    D   A   T   F   E   T  \r  \n   Y   A   L   T   G   I   S   I
0000120    C   N   P   G   S   S   L   C  \n
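If reading od output by eye is tedious, a quick count of carriage-return bytes answers the same question. This is only a sketch: demo_crlf.fasta is a made-up file created here for illustration, not the original test.fasta.

```shell
# Build a small demo file in which only one line ends in \r\n (hypothetical data).
printf '>seq1 demo\nMSRKLF\r\nASILIG\n' > demo_crlf.fasta

# Delete every byte except \r, then count what is left.
# A non-zero count means the file contains carriage returns.
cr_count=$(tr -cd '\r' < demo_crlf.fasta | wc -c | tr -d ' ')
echo "carriage returns found: $cr_count"
```

A count of zero would mean the line endings are plain \n and the problem lies elsewhere.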

One easy solution is to simply preface your command with sed to replace the carriage returns with nothing:

sed -e 's/\r//g' test.fasta | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}'

The sed part can be read as s/pattern/replacement/g: substitute every match of the pattern (here \r) with the replacement (here nothing), globally on each line.
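A possible variant of the same idea, sketched on a made-up file (demo_mixed.fasta and demo_unwrapped.fasta are hypothetical names). It uses tr -d '\r' instead of sed, because POSIX sed does not guarantee that \r is understood in a pattern. The awk part prints the separating newline only from the second header onward, so the output does not start with an empty line.

```shell
# Demo input with mixed line endings (hypothetical data).
printf '>seq1 demo\nMSRKLF\r\nASILIG\n>seq2 demo\nAVHHNT\r\n' > demo_mixed.fasta

# Strip carriage returns, then join the sequence lines of each record.
tr -d '\r' < demo_mixed.fasta |
  awk '/^>/ { if (NR > 1) printf "\n"; print; next }  # header: end the previous sequence, print header
       { printf "%s", $0 }                            # sequence line: append without a newline
       END { printf "\n" }' > demo_unwrapped.fasta    # terminate the last sequence

cat demo_unwrapped.fasta
```

Each record in demo_unwrapped.fasta should then be one header line followed by one sequence line.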

23 months ago
Carambakaracho ★ 3.3k

I still love to solve these things with Perl one-liners.

perl -nwe 'if(s/^>/\n>/){s/\r?\n$/\n/;}else{s/\r?\n$//};print $_' test.fasta | tail -n +2

Explanation: if the line starts with >, replace that with a newline plus > (\n>), then replace the optional carriage return \r? and newline \n at the end of the line with a plain \n. Otherwise, replace the optional carriage return \r? and newline \n with nothing. Then print the default variable $_. The tail is required because I didn't include a check for the first line, so the output starts with an empty line.

Previously I was convinced that Perl regex one-liners were much better than awk, as I never cared to learn awk. After more and more time without active Perl development, I think I have come to acknowledge Perl's picket fencing.
