Question

How To Transform An Interleaved Fasta File To A Sequential Fasta File

11.4 years ago

samlambrechts299 ▴ 170

Hi everybody,

I am trying to convert the sequences in a FASTA file from the interleaved format to a sequential format.

My input:

>gi|161085638|dbj|AB305033.1| 
        ATATGCCTGAAAGTGGCGGACGGGTGAGTAACACGTGGGTGACCTGCCTCGGAGTGGGGGATAACCATGG
        GAAACTGTGGCTAATACCGCATGGGCTTGTTGGCTTTGGCGGCCAACGAGTAAAGCTTTAGTGCTTCGAG
        AGGGGCCTGCGTCCGATTAGGTAGTTGGTGAGGTAATGGCTCACCAAGCCGATGATCGGTAGCTGGTCTG
>gi|161085638|dbj|AB305644.1| 
        ATATGCCTGAAAGTGGCGGACGGGTGAGTAACACGTGGGTGACCTGCCTCGGAGTGGGGGATAACCATGG
        GAAACTGTGGCTAATACCGCATGGGCTTGTTGGCTTTGGCGGCCAACGAGTAAAGCTTTAGTGCTTCGAG
        AGGGGCCTGCGTCCGATTAGGTAGTTGGTGAGGTAATGGCTCACCAAGCCGATGATCGGTAGCTGGTCTG

Desired output:

 >gi|161085638|dbj|AB305033.1| 
    ATATGCCTGAAAGTGGCGGACGGGTGAGTAACACGTGGGTGACCTGCCTCGGAGTGGGGGATAACCATGGGAAACTGCGGCCAACGAGTAAAGCTTTAGTGCTTC...
 >gi|161085638|dbj|AB305644.1| 
    ATATGCCTGAAAGTGGCGGACGGGTGAGTAACACGTGGGTGACCTGCCTCGGAGTGGGGGATAACCATGGGGCTAATACCGCATGGGCTTGTTGGCTTTGGCGGC...

After unsuccessfully trying to compile a script myself, I have found the following on http://phototrophic.net/node/37:

#!/usr/bin/perl
$in = open(IN,"<file.fasta");while ($in=<IN>){chomp $in;if ($in=~m/^>/) { print "\n",$in,"\n";}else{print $in;}}

But when I try to use this I get bashed:

bash: syntax error near unexpected token `in'

If anyone can explain why I get this syntax error, or can help me with a script to convert interleaved files to sequential files, that would be greatly appreciated.

Best,

Sam

fasta perl • 8.3k views

Updated 4.6 years ago by Biostar 20 • written 11.4 years ago by samlambrechts299 ▴ 170

score 5 · Answer 1 · 2013-07-10

11.4 years ago

SES 8.6k

There are a number of things wrong with that Perl code. You have to be careful taking code samples from the internet because sometimes you find really useful information on blogs, and other times you find code like this. I don't know why something is trying to interpret it as a Bash script but it could be multiple things. For just removing the line-wrapping you can use a BioPerl one-liner:

perl -MBio::SeqIO -e 'my $seqin = Bio::SeqIO->new(-fh => \*STDIN, -format => "fasta"); while (my $seq = $seqin->next_seq) { print ">",$seq->id,"\n",$seq->seq,"\n"; }' < seqs.fasta > seqs_nowrap.fasta

I'll bet you can also do this with seqret or some other program from EMBOSS.
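
If neither BioPerl nor EMBOSS is installed, the unwrapping itself needs only a standard library. The following is a minimal Python sketch, not part of the original answer. It assumes every header line starts with ">" and every other non-empty line is sequence; the function name `unwrap_fasta` and the sample records are made up for illustration.

```python
def unwrap_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()  # drop indentation and the trailing newline
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)  # sequence line belonging to the current record
    if header is not None:
        yield header, "".join(chunks)  # emit the last record


# Hypothetical sample input; any iterable of lines works, including an open file.
sample = [
    ">seq1 example\n",
    "        ATATGCCTG\n",
    "        GAAACTGTG\n",
    ">seq2 example\n",
    "        AGGGGCCTG\n",
]
for header, seq in unwrap_fasta(sample):
    print(header)
    print(seq)
```

Because the function accepts any iterable of lines, passing `sys.stdin` or an open file handle instead of `sample` processes a real file one line at a time, without loading it into memory.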

Thanks, will try this

Reply • 11.4 years ago by samlambrechts299 ▴ 170

Excellent use of Bio::SeqIO.

Reply • 11.4 years ago by Kenosis ★ 1.3k

score 3 · Answer 2 · 2013-07-10

11.4 years ago

Ashutosh Pandey 12k

See this other question on Biostar:

multiline fasta to single line fasta

Updated 11.4 years ago by Eric Normandeau 11k • written 11.4 years ago by Ashutosh Pandey 12k

Aha, OK, that explains why I didn't find an existing question or topic covering this problem: it was a matter of vocabulary. Thank you for posting this!

Reply • 11.4 years ago by samlambrechts299 ▴ 170

score 0 · Answer 3 · 2013-07-11

Here's another option:

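use strict;
use warnings;

# Read one whole record at a time by using '>' as the input record separator.
$/ = '>';
while (<>) {
    chomp;    # remove the trailing '>' (if present) that ends each chunk
    # $1 = header line, $2 = wrapped sequence; strip all whitespace from the
    # sequence and skip chunks that do not match (such as the empty first chunk).
    s/(.+?\n)(.+)/my $x = $2; $x =~ s|\s+||g; $_ = $x/se or next;
    print ">$1   $_\n";
}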

Usage: perl script.pl inFile.fasta [>outFile.fasta]

The optional shell redirection sends the output to a file; without it, the result is printed to the terminal.

This uses ">" as the FASTA record separator. A regex captures the ID and the sequence and removes all whitespace within the sequence. The record is then printed.
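
For comparison, the same record-separator idea can be sketched in Python. This is an illustration rather than a translation of the Perl above: the function name `unwrap_by_record` and the sample data are hypothetical, it trims whitespace around the header, and it does not indent the sequence line. Like any approach that splits on ">", it assumes ">" never appears inside a header, and it holds the whole input in memory.

```python
def unwrap_by_record(text):
    """Return unwrapped FASTA text, treating '>' as the record separator."""
    out = []
    for record in text.split(">"):
        # The first line of each chunk is the header; the rest is sequence.
        header, _, body = record.partition("\n")
        seq = "".join(body.split())  # drop newlines, spaces, and tabs
        if not header.strip() or not seq:
            continue  # skip the empty chunk before the first '>' and empty records
        out.append(f">{header.strip()}\n{seq}\n")
    return "".join(out)


# Hypothetical sample input with indented, wrapped sequence lines.
sample = ">seq1 \n    ATAT\n    GCGC\n>seq2\n    TTTT\n"
print(unwrap_by_record(sample), end="")
```

Splitting the whole text at once keeps the code short. For very large files, a line-by-line reader is the safer choice.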

Hope this helps!