Question

Multiline Fasta To Single Line Fasta

23

13.5 years ago

Palu ▴ 250

I have a fasta file with the following format:

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFY
RTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNE
ECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHL
DVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAAD
EEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAK
QLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKE
PAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYR
TIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDE
CKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLD
KILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEAR
RLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSS
KPSTPSTPASKRKVGCIIYLFLYF

but I want to see each sequence on a single line, not split across many lines as they are now. Is there a quick method?

fasta • 103k views

Updated 15 months ago by Ram 44k • written 13.5 years ago by Palu ▴ 250

Ram · Answer 1 · 2011-06-16

75

13.5 years ago

Pierre Lindenbaum 164k

Using awk:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF
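
For anyone without awk at hand (on Windows, for example), the same line-by-line logic can be sketched in plain Python. This is only a sketch: the function name unwrap_fasta and the script name unwrap_fasta.py are made up for illustration, and it assumes every header line starts with ">". It buffers one record at a time and flushes the joined sequence when the next header arrives, so no empty first line is produced.

```python
from typing import Iterable, Iterator


def unwrap_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with each record's sequence joined onto one line."""
    seq_parts = []                          # sequence chunks of the current record
    for raw in lines:
        line = raw.rstrip("\r\n")           # handles both Unix and Windows line endings
        if line.startswith(">"):
            if seq_parts:                   # flush the previous record's sequence
                yield "".join(seq_parts)
                seq_parts = []
            yield line                      # header lines pass through unchanged
        elif line:                          # blank lines are skipped
            seq_parts.append(line)
    if seq_parts:                           # flush the last record
        yield "".join(seq_parts)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python unwrap_fasta.py file.fa > out.fa
    with open(sys.argv[1]) as handle:
        for out_line in unwrap_fasta(handle):
            print(out_line)
```

Because it is a generator over lines, it should cope with large files without loading them into memory.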

Edit: for Window$.

Download ubuntu http://www.ubuntu.com/download
burn a cd with ubuntu
reboot your computer with this CD
install ubuntu

:-)

Updated 15 months ago by Ram 44k • written 13.5 years ago by Pierre Lindenbaum 164k

16

Suggestion for Windows --> Switch to Linux :p

Reply by Eric Normandeau 11k, 13.5 years ago

11

Good solution but be careful. If you redirect the result to a file,

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa > out.fa

the first line is left empty.

Reply by joreamayarom ▴ 140, 8.3 years ago

4

+1 for the windows fix :-)

Reply by Michael Schubert ★ 7.1k, 13.5 years ago

4

There will be an empty line at the beginning; it should be removed, e.g.: tail -n +2 filein.fa > fileout.fa

Reply by Medhat 9.8k, 5.1 years ago

0

yeah.. that was ~6 years ago. Now: http://stackoverflow.com/documentation/bioinformatics/4194

Reply by Pierre Lindenbaum 164k, 7.7 years ago

0

my bad I just saw it :)

Reply by Medhat 9.8k, 7.7 years ago

0

...the link is dead and I still get the first line empty after redirecting the output into a file - could you update the answer if there is a more elegant way (not piping through tail) to avoid this?

Reply by al-ash ▴ 210, 5.8 years ago

4

Reply by Pierre Lindenbaum 164k, 5.8 years ago

1

I modified the original awk code from @Pierre Lindenbaum as follows:

awk '/^>/ { if(NR>1) print "";  printf("%s\n",$0); next; } { printf("%s",$0);}  END {printf("\n");}'

It uses NR (awk's count of input records read so far) to print the separating newline only before headers that are not on the first line. Not the most elegant, but I hope this will help someone out there.

Reply by ljq ▴ 30, 5.6 years ago

0

you can use the answer provided by Jorge Amigo below. (it's not using tail)

Reply by lieven.sterck 15k, 5.8 years ago

0

Unfortunately I am a Windows lover. Please suggest something for Windows... please!

Reply by Palu ▴ 250, 13.5 years ago

score 10 · Answer 2 · 2014-11-05

10

10.1 years ago

Jorge Amigo 14k

here is a quick and simple perl one-liner:

perl -pe '/^>/ ? print "\n" : chomp' in.fasta > out.fasta

which will output an empty first line. You could use tail (which is faster than sed) to remove it:

perl -pe '/^>/ ? print "\n" : chomp' in.fasta | tail -n +2 > out.fasta

EDIT: even easier (do not use!):

perl -pe 'chomp unless /^>/' in.fasta > out.fasta

EDIT2: this last one-liner does not work as expected. Use this one instead; it does everything in a single perl call, using perl's internal line counter variable $. to apply the logic to every line except the first:

perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' in.fasta > out.fasta

Updated 5.8 years ago by Jorge Amigo 14k

0

Thank you!! This worked perfectly!

Reply by rllombardi • 0, 6.4 years ago

0

Is there a simple way to embed this one-liner into a script that just takes a fasta file as input? I tried doing this myself but I have no idea how to actually write perl scripts. Normally I would just embed this in a system call via an R script, but all the quotes are throwing me off.

Reply by caverill ▴ 40, 6.4 years ago

0

just create a simple script.pl file containing this (do not use!)

while (<>) { chomp unless /^>/; print }

EDIT: the previous code is wrong. Put this one into script.pl instead:

while (<>) { $. > 1 and /^>/ ? print "\n" : chomp; print }

and run it

perl script.pl <in.fasta >out.fasta

Reply by Jorge Amigo 14k, 5.8 years ago

0

(for future reference:) Sorry to say, but the 'EDIT' version does not work as expected.

The problem is that it chomps the sequence line that precedes each /^>/ line, so the header ends up appended to the previous sequence line. The other version works perfectly though.
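
To make the failure concrete, here is a small Python imitation of the two behaviours. It is only an illustration, not a translation checked against perl itself; the names naive_unwrap and fixed_unwrap are invented for this sketch.

```python
def naive_unwrap(text: str) -> str:
    """Imitate 'chomp unless header': strip the newline from every sequence line."""
    out = []
    for line in text.splitlines(keepends=True):
        if line.startswith(">"):
            out.append(line)                 # header keeps its newline
        else:
            out.append(line.rstrip("\n"))    # sequence line loses its newline
    return "".join(out)


def fixed_unwrap(text: str) -> str:
    """Imitate the corrected one-liner: emit a newline before every header but the first."""
    out = []
    for number, line in enumerate(text.splitlines(keepends=True), start=1):
        if number > 1:
            if line.startswith(">"):
                out.append("\n")             # terminate the previous sequence
            else:
                line = line.rstrip("\n")     # join sequence chunks
        out.append(line)
    return "".join(out)


sample = ">a\nAC\nGT\n>b\nTT\nAA\n"
print(repr(naive_unwrap(sample)))   # second header is glued onto the first sequence
print(repr(fixed_unwrap(sample)))   # each header starts on its own line
```

In the naive version nothing ever puts a newline back after the last sequence chunk, which is why the next header lands on the same line.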

Reply by lieven.sterck 15k, 5.8 years ago

1

thanks for pointing it out. I've corrected my previous answer and tested thoroughly the new one.

Reply by Jorge Amigo 14k, 5.8 years ago

Ram · Answer 3 · 2011-06-16

Also a quick & dirty solution with Perl... (fa2oneline.pl)

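#!/usr/bin/perl
use strict;
use warnings;

my $input_fasta = $ARGV[0];
open(my $in, '<', $input_fasta) || die("Error opening $input_fasta: $!");

# Print the first header as-is so the output does not start with an empty line
my $line = <$in>;
print $line;

while ($line = <$in>)
{
    chomp $line;
    # Match any FASTA header, not only those starting with ">gi"
    if ($line =~ m/^>/) { print "\n", $line, "\n"; }
    else { print $line; }
}

print "\n";
close($in);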

then run:

perl fa2oneline.pl sample.fa > out.fa

Result :

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF

Ram · Answer 4 · 2011-06-16

5

13.5 years ago

Martin A Hansen 3.0k

Biopieces (www.biopieces.org) is another way:

read_fasta -i file.fna | write_fasta -x

Cheers,
Martin

Updated 15 months ago by Ram 44k • written 13.5 years ago by Martin A Hansen 3.0k

Ram · Answer 5 · 2016-11-04

Python version to convert multi-line to two-line fasta format. It also converts multiple files. Output directory will have all the files with _twoline.fasta as suffix.

from Bio import SeqIO
import os
import re
import argparse

def multi2linefasta(indir, outdir, filelist):
    for item in filelist:
        # Drop everything from the first dot onwards to build the output name
        mfasta = os.path.join(outdir, re.sub(r'\..*', '', item) + '_twoline.fasta')
        # The 'rU' mode was removed in Python 3.11; plain 'r' already handles newlines
        with open(os.path.join(indir, item), 'r') as ifile, open(mfasta, 'w') as ofile:
            for record in SeqIO.parse(ifile, "fasta"):
                sequence = str(record.seq)
                # record.description holds the full header line (ID plus description)
                ofile.write('>' + record.description + '\n' + sequence + '\n')


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert multiple files of multi-line fasta into two-line fasta format')
    parser.add_argument('-i', type=str, dest='ind', required=True, help="Input directory where all the fasta files are present")
    parser.add_argument('-o', type=str, dest='outd', required=True, help="Output directory")
    parser.add_argument('-f', type=str, dest='ffiles', required=True, help="Comma-separated fasta file names without spaces")
    args = parser.parse_args()
    print(args)
    filelist = args.ffiles.split(',')
    if not os.path.exists(args.outd):
        os.makedirs(args.outd)
    multi2linefasta(args.ind, args.outd, filelist)

To run, save the above code to convertfasta.py file

python convertfasta.py -i "/path/to/inputfolder/" -o "/path/to/outputfolder/" -f "file1.fasta,file2.fasta,file3.fasta"

score 3 · Answer 6 · 2018-01-22

It seems that seqtk (https://github.com/lh3/seqtk) can be used for this task, although the webpage only mentions multiline fastq. The command to give is simply (I am using Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz, but one can use any fasta or gzipped fasta):

seqtk seq -l0 Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz | gzip > Homo_sapiens.GRCh37.dna.primary_assembly.singleLines.fa.gz

For a quick proof that this works, try:

zcat Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz | head -n10 | seqtk seq -l0 | cat -A
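
A similar check can be scripted. The sketch below (function name is_single_line_fasta is made up for this post) returns True only when headers and sequence lines strictly alternate. It assumes every record has a non-empty sequence, so a header followed directly by another header counts as a failure.

```python
from typing import Iterable


def is_single_line_fasta(lines: Iterable[str]) -> bool:
    """Return True if every header is followed by exactly one sequence line."""
    expect_header = True                     # a valid file starts with a header
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue                         # blank lines are ignored
        is_header = line.startswith(">")
        if is_header != expect_header:
            return False                     # two headers or two sequence lines in a row
        expect_header = not expect_header
    return expect_header                     # False if the input ended on a header
```

For a gzipped file the lines could be supplied with gzip.open(path, "rt") from the standard library.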

Ram · Answer 7 · 2011-06-16

1

13.5 years ago

Aaronquinlan 12k

Kent source's faToTab. I'm sure EMBOSS has something for this as well. If you are on a Windows machine, I'd just use Galaxy's fasta2tab.

Updated 15 months ago by Ram 44k • written 13.5 years ago by Aaronquinlan 12k

Ram · Answer 8 · 2011-06-16

use strict;
use warnings;
use Bio::SeqIO;

my $in  = Bio::SeqIO->new(-file => "myseq.fasta" , '-format' => 'Fasta');

while ( my $seq = $in->next_seq ) {
    print ">",$seq->id()," ",$seq->desc(),"\n",$seq->seq(),"\n";
}

Output:

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK

>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF

Ram · Answer 9 · 2014-11-06

You can open the merged sequences in the UGENE Sequence View. To do it:

Select "File>Open as" in the main UGENE menu.
Select the "FASTA" format.
Select "Merge sequences into a single sequence to show in the sequence viewer".

By default, UGENE will show the sequence itself, the complementary sequence and translations.

If required, you can export the merged sequence into a new file.

Ram · Answer 10 · 2014-11-06

0

10.1 years ago

sayuj.koyyappurath • 0

Hi

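Hope this works (note that it prints each record on one line, with the header and the sequence separated by a tab):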

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file
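
The record-splitting idea (treat ">" as the record separator instead of working line by line) can also be sketched in Python, here writing ordinary two-line FASTA. The function name unwrap_by_records is hypothetical, and the sketch assumes the whole file fits in memory and that a ">" at the start of a line always begins a new record.

```python
def unwrap_by_records(text: str) -> str:
    """Split FASTA text at each '>' that starts a line, then join the sequence lines."""
    text = text.replace("\r\n", "\n")        # normalise Windows line endings
    out = []
    # Prepend a newline so the first header is split off like all the others.
    for record in ("\n" + text).split("\n>")[1:]:
        header, _, body = record.partition("\n")
        sequence = "".join(body.split())     # drop newlines and stray whitespace
        out.append(f">{header}\n{sequence}\n")
    return "".join(out)
```

Splitting on a newline followed by ">" rather than on a bare ">" keeps headers intact even if they contain a ">" character further along the line.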

Updated 15 months ago by Ram 44k • written 10.1 years ago by sayuj.koyyappurath • 0

score 0 · Answer 11 · 2022-02-26

Hi

Short answer:

Use seqtk as follows:

$ seqtk seq multi-line.fasta > single-line.fasta

Explained:

A toy example in multi-line format:

$ head -n5 celegans_chr1.fa
>gi|449020133|emb|BX284601.5| Caenorhabditis elegans Bristol N2 genomic chromosome, I
GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT
AAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGC
CTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA
GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT

How to easily convert to single-line fasta (I only print head -c 170 as it would print the whole chromosome otherwise):

$ seqtk seq celegans_chr1.fa | head -c 170
>gi|449020133|emb|BX284601.5| Caenorhabditis elegans Bristol N2 genomic chromosome, I
GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA

So you can simply redirect the stdout to a file as follows:

$ seqtk seq celegans_chr1.fa > celegans_chr1_single-line.fa

You may also use a file compressor like gzip to compress the file.
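
If both the input and the output should stay gzipped and seqtk is not available, unwrapping and compression can be combined in one pass. This is a minimal sketch using only Python's standard gzip module; the function name unwrap_fasta_gz and the file paths are hypothetical.

```python
import gzip


def unwrap_fasta_gz(in_path: str, out_path: str) -> None:
    """Read a gzipped multi-line FASTA and write a gzipped single-line FASTA."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        in_sequence = False                  # True while a sequence line is unfinished
        for raw in src:
            line = raw.rstrip("\r\n")
            if line.startswith(">"):
                if in_sequence:
                    dst.write("\n")          # end the previous record's sequence
                    in_sequence = False
                dst.write(line + "\n")       # header on its own line
            elif line:
                dst.write(line)              # append sequence chunk without a newline
                in_sequence = True
        if in_sequence:
            dst.write("\n")                  # final newline after the last sequence
```

It streams line by line, so memory use should stay small even for whole chromosomes.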