Multiline Fasta To Single Line Fasta

13.8 years ago

Palu ▴ 250

I have a fasta file with the following format:

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFY
RTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNE
ECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHL
DVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAAD
EEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAK
QLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKE
PAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYR
TIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDE
CKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLD
KILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEAR
RLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSS
KPSTPSTPASKRKVGCIIYLFLYF

but I want to see each sequence on a single line, not split over many lines as they are now. Is there a quick method?

fasta • 106k views

ADD COMMENT • link updated 18 months ago by Ram 45k • written 13.8 years ago by Palu ▴ 250

13.8 years ago

Pierre Lindenbaum 165k

Using awk:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF

Edit: for Window$.

Download ubuntu http://www.ubuntu.com/download
burn a cd with ubuntu
reboot your computer with this CD
install ubuntu

:-)

ADD COMMENT • link updated 18 months ago by Ram 45k • written 13.8 years ago by Pierre Lindenbaum 165k

Suggestion for Windows --> Switch to Linux :p

ADD REPLY • link 13.8 years ago by Eric Normandeau 11k

Good solution but be careful. If you redirect the result to a file,

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa > out.fa

the first line is left empty.

ADD REPLY • link 8.5 years ago by joreamayarom ▴ 140

+1 for the windows fix :-)

ADD REPLY • link 13.8 years ago by Michael Schubert ★ 7.1k

There will be an empty line at the beginning; it can be removed like this: tail -n +2 filein.fa > fileout.fa

ADD REPLY • link 5.4 years ago by Medhat 9.8k

yeah.. that was ~6 years ago. Now: http://stackoverflow.com/documentation/bioinformatics/4194

ADD REPLY • link 8.0 years ago by Pierre Lindenbaum 165k

my bad I just saw it :)

ADD REPLY • link 8.0 years ago by Medhat 9.8k

...the link is dead and I still get the first line empty after redirecting the output into a file - could you update the answer if there is a more elegant way (not piping through tail) to avoid this?

ADD REPLY • link 6.0 years ago by al-ash ▴ 210

Linearize a fasta sequence

awk -f linearizefasta.awk < input.fa

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa

Format back to fasta

tr "\t" "\n" < linearized.tsv

if you know your fasta headers have a length < 60

tr "\t" "\n" < linearized.tsv | fold -w 60

linearizefasta.awk:

	/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;}
	{printf("%s",$0);}
	END {printf("\n");}
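Note that fold -w 60 wraps every line, headers included, which is why the header-length caveat matters. A minimal Python sketch that wraps only the sequence column is shown below. The names `tsv_to_fasta` and `width` are hypothetical, and the sketch assumes one `header<TAB>sequence` record per line, as produced by the awk command above.

```python
def tsv_to_fasta(lines, width=60):
    """Convert 'header<TAB>sequence' lines back to FASTA, wrapping only the sequence."""
    out = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        header, _, seq = line.partition("\t")
        out.append(header)  # the header is never wrapped, whatever its length
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width])  # one chunk of at most `width` residues
    return "\n".join(out) + "\n" if out else ""
```

It could be called as `tsv_to_fasta(open("linearized.tsv"))`, with the result written to a new file.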

ADD REPLY • link 6.0 years ago by Pierre Lindenbaum 165k

Modified the original awk code from @Pierre Lindenbaum to below

awk '/^>/ { if(NR>1) print "";  printf("%s\n",$0); next; } { printf("%s",$0);}  END {printf("\n");}'

This uses NR (awk's count of input records read so far) to print the separating newline only before headers that are not on the first line. Not the most elegant, but I hope this will help someone out there.
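For anyone who prefers Python, the same idea (emit the separating newline only before headers that are not on the first line) could be sketched as below. The function name `linearize` is hypothetical, and the sketch assumes the input starts with a header line.

```python
def linearize(lines):
    """Yield output chunks: each header on its own line, each sequence joined on one line."""
    first = True
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # the leading newline ends the previous sequence; skip it on the first line
            yield ("" if first else "\n") + line + "\n"
        else:
            yield line  # sequence chunk, no newline yet
        first = False
    if not first:
        yield "\n"  # terminate the last sequence line
```

A typical call would be `sys.stdout.writelines(linearize(open("in.fa")))` after `import sys`.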

ADD REPLY • link 5.9 years ago by ljq ▴ 30

you can use the answer provided by Jorge Amigo below. (it's not using tail)

ADD REPLY • link 6.0 years ago by lieven.sterck 15k

Unfortunately I am a Windows lover. Please suggest something for Windows... please!

ADD REPLY • link 13.8 years ago by Palu ▴ 250

10.4 years ago

Jorge Amigo 14k

here is a quick and simple perl one-liner:

perl -pe '/^>/ ? print "\n" : chomp' in.fasta > out.fasta

which will output an empty first line. You could use tail (which is faster than sed) to remove it:

perl -pe '/^>/ ? print "\n" : chomp' in.fasta | tail -n +2 > out.fasta

EDIT: even easier (do not use!):

perl -pe 'chomp unless /^>/' in.fasta > out.fasta

EDIT2: this last one-liner does not work as expected. Use this one instead: it does everything in a single perl call, using perl's $. line counter to apply the newline logic to every line except the first:

perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' in.fasta > out.fasta

ADD COMMENT • link 6.1 years ago by Jorge Amigo 14k

Thank you!! This worked perfectly!

ADD REPLY • link 6.7 years ago by rllombardi • 0

Is there a simple way to embed this one-liner into a script that just takes a fasta file as input? I tried doing this myself but I have no idea how to actually write perl scripts. Normally I would just embed this into a system call via an R script, but all the quotes are throwing me off.

ADD REPLY • link 6.6 years ago by caverill ▴ 40

just create a simple script.pl file containing this (do not use!)

while (<>) { chomp unless /^>/; print }

EDIT: the previous code is wrong. use this one into script.pl instead:

while (<>) { $. > 1 and /^>/ ? print "\n" : chomp; print }

and run it

perl script.pl <in.fasta >out.fasta

ADD REPLY • link 6.1 years ago by Jorge Amigo 14k

(for future reference:) Sorry to tell but the 'EDIT' version does not work as expected.

Problem is that it will have chomped the line previous to /^>/ and as such will add the header line to the previous sequence line. the other version works perfectly though.
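To make the failure concrete, here is a small Python sketch that mimics `chomp unless /^>/` on a two-record input. The function name `chomp_non_headers` and the toy sequences are made up for illustration.

```python
def chomp_non_headers(text):
    """Mimic `perl -pe 'chomp unless /^>/'`: strip the newline from every non-header line."""
    out = []
    for line in text.splitlines(keepends=True):
        if line.startswith(">"):
            out.append(line)               # header keeps its newline
        else:
            out.append(line.rstrip("\n"))  # sequence line loses its newline
    return "".join(out)

result = chomp_non_headers(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n")
# The second header ends up glued to the end of the first sequence,
# because nothing re-inserts a newline before it.
print(repr(result))
```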

ADD REPLY • link 6.1 years ago by lieven.sterck 15k

thanks for pointing it out. I've corrected my previous answer and tested thoroughly the new one.

ADD REPLY • link 6.1 years ago by Jorge Amigo 14k

13.8 years ago

toni ★ 2.2k

Also a quick & dirty solution with Perl... (fa2oneline.pl)

#!/usr/bin/perl -w
use strict;

my $input_fasta=$ARGV[0];
open(IN,"<$input_fasta") || die ("Error opening $input_fasta $!");

my $line = <IN>; 
print $line;

while ($line = <IN>)
{
chomp $line;
if ($line=~m/^>gi/) { print "\n",$line,"\n"; }
else { print $line; }
}

print "\n";

then run:

perl fa2oneline.pl sample.fa > out.fa

Result :

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF

ADD COMMENT • link updated 18 months ago by Ram 45k • written 13.8 years ago by toni ★ 2.2k

The regex needs to be changed?

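$line=~m/^>gi/) should be $line=~m/^>/) so that any header matches, not only those starting with >gi (adding the g and i modifiers, as in m/^>/gi, is not needed)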

ADD REPLY • link 12.4 years ago by stoker.neil ▴ 70

sorry tony, thank you for this great help

ADD REPLY • link 13.8 years ago by Palu ▴ 250

this still doesnt seem to work, even with the regex change.

ADD REPLY • link 6.6 years ago by caverill ▴ 40

13.8 years ago

Martin A Hansen 3.0k

Biopieces (www.biopieces.org) is another way:

read_fasta -i file.fna | write_fasta -x

Cheers,
Martin

ADD COMMENT • link updated 18 months ago by Ram 45k • written 13.8 years ago by Martin A Hansen 3.0k

8.4 years ago

adhil.md ▴ 40

Python version to convert multi-line to two-line fasta format. It also converts multiple files. Output directory will have all the files with _twoline.fasta as suffix.

from Bio import SeqIO
import os
import re
import argparse

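def multi2linefasta(indir,outdir,filelist):
    for items in filelist:
        # strip everything from the first dot onwards to build the output name
        mfasta = outdir +"/"+re.sub(r'\..*','',items)+'_twoline.fasta'
        # 'rU' mode was removed in Python 3.11; plain 'r' already handles universal newlines
        with open(indir+'/'+items,'r') as ifile, open(mfasta, 'w') as ofile:
            for record in SeqIO.parse(ifile, "fasta"):
                sequence = str(record.seq)
                # note: record.id is only the first word of the header, the rest of the description is dropped
                ofile.write('>'+record.id+'\n'+sequence+'\n')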


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert multiple files of multi-line fasta into two-line fasta format')
    parser.add_argument('-i',type=str,dest='ind',required=True,help="Input directory where all the fasta files are present")
    parser.add_argument('-o',type=str,dest='outd',required=True,help="Output directory")
    parser.add_argument('-f',type=str,dest='ffiles',required=True,help="Comma separated fasta file names without spaces")
    args = parser.parse_args()
    print (args)
    filelist = args.ffiles.split(',')
    if not os.path.exists(args.outd):
        os.makedirs(args.outd)
    multi2linefasta(args.ind,args.outd,filelist)

To run, save the above code to convertfasta.py file

python convertfasta.py -i "/path/to/inputfolder/" -o "/path/to/outputfolder/" -f "file1.fasta,file2.fasta,file3.fasta"

ADD COMMENT • link updated 18 months ago by Ram 45k • written 8.4 years ago by adhil.md ▴ 40

I would like to give you some suggestions or comments...

Use Biopython for parsing fasta files to avoid assumptions about the format. When a parser exists, use it. It will make your code quicker and shorter.
Use the os module for handling directories and paths
Use the sys module to get input as a python script rather than running this function interactively
Use the with open(file) as ofile syntax to handle opening of files
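As an illustration of the path-handling suggestion, the output file name could be built with `os.path` instead of string concatenation and a regex. This is only a sketch: `twoline_path` is a hypothetical helper, and unlike stripping everything after the first dot, `os.path.splitext` removes only the final extension.

```python
import os

def twoline_path(infile, outdir):
    """Return the output path for `infile` inside `outdir`, with a _twoline.fasta suffix."""
    base = os.path.basename(infile)      # drop any leading directories
    stem, _ext = os.path.splitext(base)  # 'sample.v2.fasta' -> ('sample.v2', '.fasta')
    return os.path.join(outdir, stem + "_twoline.fasta")
```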

ADD REPLY • link 8.4 years ago by WouterDeCoster 47k

Thank You ...................................

ADD REPLY • link 8.4 years ago by adhil.md ▴ 40

Thank you, it also worked for me. Best,

ADD REPLY • link 2.2 years ago by Onur • 0

7.2 years ago

teckpor ▴ 30

It seems that seqtk (https://github.com/lh3/seqtk) can be used for this task, although the webpage only mentions multiline fastq. The command to give is simply (I am using Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz, but one can use any fasta or gzipped fasta):

seqtk seq -l0 Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz | gzip > Homo_sapiens.GRCh37.dna.primary_assembly.singleLines.fa.gz

For a quick proof that this works, try:

zcat Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz | head -n10 | seqtk seq -l0 | cat -A

ADD COMMENT • link 7.2 years ago by teckpor ▴ 30

13.8 years ago

Aaronquinlan 12k

Kent source's faToTab. I'm sure EMBOSS has something for this as well. If you are on a Windows machine, I'd just use Galaxy's fasta2tab.

ADD COMMENT • link updated 18 months ago by Ram 45k • written 13.8 years ago by Aaronquinlan 12k

13.8 years ago

Woa ★ 2.9k

use strict;
use warnings;
use Bio::SeqIO;

my $in  = Bio::SeqIO->new(-file => "myseq.fasta" , '-format' => 'Fasta');

while ( my $seq = $in->next_seq ) {
    print ">",$seq->id()," ",$seq->desc(),"\n",$seq->seq(),"\n";
}

Output:

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK

>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF

ADD COMMENT • link updated 18 months ago by Ram 45k • written 13.8 years ago by Woa ★ 2.9k

I messed up the formatting. The ">" symbols at the beginning of the FASTA headers are not shown for some reason.

ADD REPLY • link 13.8 years ago by Woa ★ 2.9k

Just indent with 4 spaces, otherwise ">" is interpreted as blockquote.

ADD REPLY • link 13.8 years ago by Neilfws 49k

10.4 years ago

oigl ▴ 60

You can open the merged sequences in the UGENE Sequence View. To do it:

Select "File>Open as" in the main UGENE menu.
Select the "FASTA" format.
Select "Merge sequences into a single sequence to show in the sequence viewer".

By default, UGENE will show the sequence itself, the complementary sequence and translations. It looks like here.

If required, you can export the merged sequence into a new file.

ADD COMMENT • link updated 18 months ago by Ram 45k • written 10.4 years ago by oigl ▴ 60

10.4 years ago

sayuj.koyyappurath • 0

Hope this works

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file
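The one-liner treats ">" as the record separator, so each record is a header followed by its sequence lines. The first newline becomes a tab and the rest are deleted, which gives `header<TAB>sequence` output rather than two-line FASTA. A rough Python equivalent, with the hypothetical name `records_to_tsv`, might look like this. Like the awk version, it assumes ">" never appears inside a header.

```python
def records_to_tsv(text):
    """Split FASTA text on '>' and return one 'header<TAB>sequence' line per record."""
    rows = []
    for record in text.split(">")[1:]:     # [1:] skips the empty chunk before the first '>'
        header, _, body = record.partition("\n")
        sequence = body.replace("\n", "")  # join the wrapped sequence lines
        rows.append(">" + header + "\t" + sequence)
    return "\n".join(rows) + "\n" if rows else ""
```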

ADD COMMENT • link updated 18 months ago by Ram 45k • written 10.4 years ago by sayuj.koyyappurath • 0

3.1 years ago

Amirosein ▴ 70

Short answer:

Use seqtk as follows:

$ seqtk seq multi-line.fasta > single-line.fasta

Explained:

A toy example in multi-line format:

$ head -n5 celegans_chr1.fa
>gi|449020133|emb|BX284601.5| Caenorhabditis elegans Bristol N2 genomic chromosome, I
GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT
AAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGC
CTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA
GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT

How to easily convert to single-line fasta (I only print head -c 170 as it would print the whole chromosome otherwise):

$ seqtk seq celegans_chr1.fa | head -c 170
>gi|449020133|emb|BX284601.5| Caenorhabditis elegans Bristol N2 genomic chromosome, I
GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA

So you can simply redirect the stdout to a file as follows:

$ seqtk seq celegans_chr1.fa > celegans_chr1_single-line.fa

You may also use a file compressor like gzip to compress the file.
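If you would rather handle the compression step in Python, the standard-library `gzip` module can read and write such files directly. The sketch below is only an illustration: `write_gz` and `read_gz` are hypothetical helpers, not part of seqtk.

```python
import gzip

def write_gz(path, lines):
    """Write an iterable of text lines to a gzip-compressed file."""
    with gzip.open(path, "wt") as handle:  # "wt" = text mode, compressed on the fly
        handle.writelines(lines)

def read_gz(path):
    """Read a gzip-compressed text file back into a single string."""
    with gzip.open(path, "rt") as handle:
        return handle.read()
```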

ADD COMMENT • link 3.1 years ago by Amirosein ▴ 70