From time to time it is necessary to linearize a multiline FASTA file.
>in1
AAAA
AAAA
AA
>in2
BBBB
BBBB
B
becomes
>in1
AAAAAAAAAA
>in2
BBBBBBBBB
The fastest way I have come up with is to use seqkit for it:
$ seqkit seq -w 0 input.fa
What is the fastest scripting method you are aware of? This is what I've found:
$ LC_ALL=C awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" '$0 {$1=">"$1"\n"; print}' input.fa
Essentially I'm using awk to change the field and record separators. It's fast, but still slower than seqkit for large files. Interestingly, changing the separators is something where mawk seems to be very, very slow :(
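To illustrate what the separator trick does, here is the same awk command run on the toy example from above (a minimal, self-contained sketch; the input is written with printf):

```shell
#!/bin/sh
# Write the multiline example to a temporary file.
printf '>in1\nAAAA\nAAAA\nAA\n>in2\nBBBB\nBBBB\nB\n' > toy.fa

# RS=">" makes each FASTA entry one record; FS="\n" makes the header
# field $1 and each sequence line a further field. Prefixing $1 with ">"
# and a trailing "\n", then printing with OFS="", joins all sequence
# lines into a single line. The "$0" condition skips the empty record
# before the first ">".
LC_ALL=C awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" \
    '$0 {$1=">"$1"\n"; print}' toy.fa
# prints:
# >in1
# AAAAAAAAAA
# >in2
# BBBBBBBBB
```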
fin swimmer
EDIT:
For those who would like to compare their solutions: I've used Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz from Ensembl as the input.
Here are some benchmarking results, including seqtk as suggested by shenwei356:
$ time (seqkit seq -w 0 Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
( seqkit seq -w 0 > /dev/null; ) 1,57s user 0,82s system 109% cpu 2,175 total
$ time (seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
( seqtk seq > /dev/null; ) 2,22s user 0,59s system 99% cpu 2,817 total
$ time (LC_ALL=C awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" '$0 {$1=">"$1"\n"; print}' Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
( LC_ALL=C awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" > /dev/null; ) 5,58s user 1,58s system 99% cpu 7,180 total
$ time (awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" '$0 {$1=">"$1"\n"; print}' Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
( awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" '$0 {$1=">"$1"\n"; print}' ) 47,43s user 1,62s system 99% cpu 49,099 total
The speed of that awk command is very impressive. It beats bioawk (
bioawk -c fastx '{print ">"$name; print $seq}'
) by a clear margin. I very much doubt that a faster "scripting" solution exists. Maybe some Perl one-liner...
Edit: a comparison of vanilla seqtk and seqtk built with
-O3
("optimized") instead of -O2
. "Optimized" is faster every time.