How To Transform An Interleaved Fasta File To A Sequential Fasta File
11.4 years ago

Hi everybody,

I am trying to convert the sequences in a FASTA file from the interleaved format to a sequential format.

My input:

>gi|161085638|dbj|AB305033.1| 
        ATATGCCTGAAAGTGGCGGACGGGTGAGTAACACGTGGGTGACCTGCCTCGGAGTGGGGGATAACCATGG
        GAAACTGTGGCTAATACCGCATGGGCTTGTTGGCTTTGGCGGCCAACGAGTAAAGCTTTAGTGCTTCGAG
        AGGGGCCTGCGTCCGATTAGGTAGTTGGTGAGGTAATGGCTCACCAAGCCGATGATCGGTAGCTGGTCTG
>gi|161085638|dbj|AB305644.1| 
        ATATGCCTGAAAGTGGCGGACGGGTGAGTAACACGTGGGTGACCTGCCTCGGAGTGGGGGATAACCATGG
        GAAACTGTGGCTAATACCGCATGGGCTTGTTGGCTTTGGCGGCCAACGAGTAAAGCTTTAGTGCTTCGAG
        AGGGGCCTGCGTCCGATTAGGTAGTTGGTGAGGTAATGGCTCACCAAGCCGATGATCGGTAGCTGGTCTG

Desired output:

 >gi|161085638|dbj|AB305033.1| 
    ATATGCCTGAAAGTGGCGGACGGGTGAGTAACACGTGGGTGACCTGCCTCGGAGTGGGGGATAACCATGGGAAACTGCGGCCAACGAGTAAAGCTTTAGTGCTTC...
 >gi|161085638|dbj|AB305644.1| 
    ATATGCCTGAAAGTGGCGGACGGGTGAGTAACACGTGGGTGACCTGCCTCGGAGTGGGGGATAACCATGGGGCTAATACCGCATGGGCTTGTTGGCTTTGGCGGC...

After unsuccessfully trying to compile a script myself, I have found the following on http://phototrophic.net/node/37:

#!/usr/bin/perl                                                                                                                                                                     
$in = open(IN,"<file.fasta");while ($in=<IN>){chomp $in;if ($in=~m/^>/) { print "\n",$in,"\n";}else{print $in;}}

But when I try to use this I get bashed:

bash: syntax error near unexpected token `in'

If anyone can explain why I get this syntax error, or can help me with a script to convert interleaved files to sequential files, that would be greatly appreciated.

Best,

Sam

fasta perl • 8.3k views
11.4 years ago
SES 8.6k

There are a number of things wrong with that Perl code. Be careful when taking code samples from the internet: sometimes you find really useful information on blogs, and other times you find code like this. I don't know why something is trying to interpret it as a Bash script, but there are several possible causes. To just remove the line wrapping, you can use a BioPerl one-liner:

perl -MBio::SeqIO -e 'my $seqin = Bio::SeqIO->new(-fh => \*STDIN, -format => "fasta"); while (my $seq = $seqin->next_seq) { print ">",$seq->id,"\n",$seq->seq,"\n"; }' < seqs.fasta > seqs_nowrap.fasta

I'll bet you can also do this with seqret or some other program from EMBOSS.
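
For reference, the line-by-line approach that the quoted script attempts can also be sketched in plain Perl, without BioPerl. This is only a minimal sketch, not the original author's code: the sub name unwrap_fasta and the in-memory demo data are made up for illustration, and it assumes that every header line starts with ">" (optionally after whitespace) and that every other line is sequence. Unlike the one-liner above, it keeps the whole header line rather than just the ID.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Unwrap FASTA records read from a filehandle and return them as one string.
# (unwrap_fasta is a hypothetical helper name used for this sketch.)
sub unwrap_fasta {
    my ($fh) = @_;
    my $out      = '';
    my $have_seq = 0;    # true once sequence text was written for the current record
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;              # drop the line ending (LF or CRLF)
        if ($line =~ /^\s*>/) {
            $line =~ s/^\s+|\s+$//g;       # trim whitespace around the header
            $out .= "\n" if $have_seq;     # terminate the previous sequence line
            $out .= "$line\n";
            $have_seq = 0;
        }
        else {
            $line =~ s/\s+//g;             # remove indentation and stray spaces
            next unless length $line;      # ignore blank lines
            $out .= $line;
            $have_seq = 1;
        }
    }
    $out .= "\n" if $have_seq;             # terminate the last sequence line
    return $out;
}

# Demo on an in-memory file. For a real file, open it like this instead:
# open(my $fh, '<', 'file.fasta') or die "Cannot open file.fasta: $!";
my $demo = ">seq1 \n    ATAT\n    GGCC\n>seq2\n    TTAA\n";
open(my $fh, '<', \$demo) or die "Cannot open in-memory file: $!";
print unwrap_fasta($fh);
close $fh;
```

The demo prints ">seq1", "ATATGGCC", ">seq2" and "TTAA" on four separate lines. To use it on a real file, save it as, say, unwrap.pl, swap in the commented-out open, and run it with perl unwrap.pl > out.fasta. Calling perl explicitly means the shebang line is not needed to select the interpreter.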

Thanks, will try this

Excellent use of Bio::SeqIO.

11.4 years ago

See this other question on Biostar:

multiline fasta to single line fasta

Aha, OK, that explains why I didn't find an existing question covering this problem: it was a matter of vocabulary. Thank you for posting this!

11.4 years ago
Kenosis ★ 1.3k

Here's another option:

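use strict;
use warnings;

# Read one FASTA record at a time by using ">" as the input record separator.
$/ = '>';
while (<>) {
    chomp;    # strip the trailing ">" that ends each chunk
    # $1 = header line (with its newline), $2 = the wrapped sequence lines.
    # Remove all whitespace from the sequence; skip chunks that do not match,
    # such as the empty chunk before the first ">".
    s/(.+?\n)(.+)/my $x = $2; $x =~ s|\s+||g; $_ = $x/se or next;
    print ">$1   $_\n";
}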

Usage: perl script.pl inFile.fasta [>outFile.fasta]

The optional redirection at the end (>outFile.fasta) writes the output to a file instead of the terminal.

This uses ">" as the FASTA record separator. A regex captures the ID and the sequence and removes all whitespace within the sequence. The result is then printed.
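
The same record-separator idea can be spelled out more explicitly, which may be easier to adapt. This is a rough sketch rather than a drop-in replacement for the script above: the sub name unwrap_records is invented for illustration, it takes the whole FASTA text as a string, and it assumes that ">" never appears inside a header description (such a header would be split into two records).

```perl
use strict;
use warnings;

# Split FASTA text into records on ">" and join each record's sequence lines.
# (unwrap_records is a hypothetical helper name used for this sketch.)
sub unwrap_records {
    my ($text) = @_;
    my $out = '';
    open(my $in, '<', \$text) or die "Cannot open in-memory file: $!";
    local $/ = '>';                      # read one chunk per FASTA record
    while (my $record = <$in>) {
        chomp $record;                   # remove the trailing ">"
        next unless $record =~ /\S/;     # skip the empty chunk before the first ">"
        my ($header, @lines) = split /\n/, $record;
        $header =~ s/^\s+|\s+$//g;       # trim whitespace around the header
        my $seq = join '', @lines;
        $seq =~ s/\s+//g;                # drop indentation, spaces and carriage returns
        $out .= ">$header\n$seq\n";
    }
    close $in;
    return $out;
}

print unwrap_records(">seq1 \n    ATAT\n    GGCC\n>seq2\n    TTAA\n");
```

Using local keeps the change to $/ confined to the sub, so any code that reads line by line elsewhere in the program is unaffected.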

Hope this helps!
