Question

Make multi-fasta using perl

0

Entering edit mode

9.6 years ago

sun.nation ▴ 140

Header and sequence are coming in loop:

$header = "seq1";

$sequence = "GTGGTCAGAAAAGTGGCCTCAGCCCATTTCCAGCATCCTATGGCTATCTAAGTAGGAAGACGATGAGCTATGGAA";

Sequence is single line without newline.

I want output multi-fasta like this below. Sequence should have newline after each 60 bases.

>seq1
GTGGTCAGAAAAGTGGCCTCAGCCCATTTCCAG
CATCCTATGGCTATCTAAGTAGGAAGACGATGA
GCTATGGAA

>seq2
AAGTGGCCTCAGCCCATTTCCAGCATCCTATGGC
TATCTAAGTAGGAAGACGATGAGCTATGGAA

Thanks

perl multi-fasta fasta • 4.6k views

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.6 years ago by sun.nation ▴ 140

1

Entering edit mode

This was asked before here and the solution (pre-v5.10) is:

$sequence =~ s/(.{60})/$1\n/gs;

That formats the sequence, then you just need to join the id and sequence with a newline when you print.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by SES 8.6k

0

Entering edit mode

Thanks a lot SES.

It was a fantastic solution with a line code. It worked good when I added this in my script.

ADD REPLY • link 9.6 years ago by sun.nation ▴ 140

0

Entering edit mode

Can you please give me some hint to create loop in perl? I need that at the end of my perl code.

Thanks Ram and Dylan

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by sun.nation ▴ 140

0

Entering edit mode

Look at the Text::Wrap module if you need a perl specific and easy method.

ADD REPLY • link updated 9.6 years ago by Ram 44k • written 9.6 years ago by dylan.storey ▴ 60

0

Entering edit mode

Thanks. I will try that

ADD REPLY • link 9.6 years ago by sun.nation ▴ 140

0

Entering edit mode

I think the following lines could be an alternative solution.

print ">$header\n";
$seq_len = length($sequence);

for ($i = 0; $i < $seq_len; $i + 60) {
    $segment = substr($sequence, $i, 60);
    print "$segment\n";
}

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by biolab ★ 1.4k

0

Entering edit mode

Yes, the loop is a good basic alternative, but the RegEx is better, this being Perl.

ADD REPLY • link 9.6 years ago by Ram 44k

3

Entering edit mode

The problem with the regex is that it is limited to a max sequence length of ~32k, which gives you trouble with contigs, etc. Here is a loop solution that should be faster than regex.

my $single_line_seq = "ATTGC"x1000;
my $line_width = 80;
print $_,"\n" for unpack("(A$line_width)*", $single_line_seq);

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by thackl ★ 3.0k

0

Entering edit mode

9.6 years ago

Ram 44k

Either use a loop and introduce a new line characters after every 60 characters in the sequence, or use Picard's NormalizeFasta if you'd like a tried and tested method.

ADD COMMENT • link 22 months ago by Ram 44k

0

Entering edit mode

You can also try cat *.fasta | fold > all.fasta if you know that the headers are less than 60 bp.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by dylan.storey ▴ 60

0

Entering edit mode

9.6 years ago

Prakki Rama ★ 2.7k

fasta_formatter -i input.fasta -o output.fasta -w 60

fasta_formatter: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fasta_formatter_usage

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Prakki Rama ★ 2.7k

Ram · Accepted Answer · 2015-05-01

3

Entering edit mode

9.6 years ago

SES 8.6k

Here is a comparison of the two methods:

Here are the results:

$ perl biostars140255.pl 2>/dev/null
Benchmark: timing 10 iterations of regex, unpack...
     regex: 17 wallclock secs (15.48 usr +  1.44 sys = 16.92 CPU) @  0.59/s (n=10)
    unpack: 66 wallclock secs (29.63 usr + 34.99 sys = 64.62 CPU) @  0.15/s (n=10)
       s/iter unpack  regex
unpack   6.46     --   -74%
regex    1.69   282%     --

The benchmark shows that regex is about 6X faster than unpack. Also, there is no length limit (the input in the script is a 100 Mb string). Are you referring to a 32-bit system with those limits?

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.6 years ago by SES 8.6k

1

Entering edit mode

How I see SES now:

image: dwight meme

ADD REPLY • link 22 months ago by Ram 44k

0

Entering edit mode

I was looking for a mic drop gif but just gave up myself.

ADD REPLY • link 9.6 years ago by dylan.storey ▴ 60

0

Entering edit mode

For your viewing pleasure: http://giphy.com/search/mic-drop/3

EDIT: Removing GIF so I don't encourage image uploads just for the sake of uploads.

ADD REPLY • link 22 months ago by Ram 44k

0

Entering edit mode

Ha! I might have been a bit terse but I honestly wasn't sure until I did the benchmark so I wanted to share. Also, I use 5.20 in the script because there is an awesome optimization (called "copy on write") that is lightening fast for copying strings. Here is an example script:

my $x = "A" x 1_000_000;
my $y = $x for 1...1_000_000;

That makes a long string and copies it a bunch of times in memory. Super useful for genome analysis. With Perl version 5.10.1 that takes 250.8 seconds on my machine. With Perl 5.20.2, that takes 0.15 seconds!

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by SES 8.6k

0

Entering edit mode

I had more than 200k in a single fasta, regex worked good. Thanks

ADD REPLY • link 9.6 years ago by sun.nation ▴ 140

0

Entering edit mode

Okay, my bad - I shouldn't have mentioned speed in the first place. unpack can be faster, especially if I'm not too lazy to put print outside the loop. I am also sure, one could optimize things further...

'unpack' => sub {
    say STDERR join("\n", unpack("(A60)*", $seq));
}

         Rate  regex unpack
regex  1.74/s     --   -47%
unpack 3.27/s    88%     --

Regarding the limit of 32766, I got confused there. It only applies to a maximum size of the capture group and hasn't any effect here, unless you want your lines to be longer than 32 kbp ;)

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.6 years ago by thackl ★ 3.0k