single line fasta to mult line fasta
8
1
Entering edit mode
8.4 years ago
User 6777 ▴ 20

I know for the reverse procedure, there are lots of methods available. But how can I change my fasta file with sequences like:

>gi|189094002
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSRQQTKITFNRKNIAADDDEDDDDFFD
>gi|68485955
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSRVKIVRRDSQQKQQTKITFNRKNIAGDDEDDDDFFD

to like:

>gi|189094002
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSR
QQTKITFNRKNIAADDDEDDDDFFD
>gi|68485955
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSR
VKIVRRDSQQKQQTKITFNRKNIAGDDEDDDDFFD

I know its just about inserting line breaks in each sequence, but is there any way that I can check whether the fasta file is single line fasta or not, and if is, add line breaks in each sequence..

Ps. I am using windows..

Thanks for your consideration

fasta • 23k views
ADD COMMENT
4
Entering edit mode

If you plan to do more bioinformatics, do yourself (and us) a favor a switch to linux. Things will get so much easier then.

ADD REPLY
0
Entering edit mode

fasta_formatter from fastx tookit can do this (http://hannonlab.cshl.edu/fastx_toolkit/) -w specifies the width

ADD REPLY
0
Entering edit mode

Venu, I want exactly the equivalent of fold command for windows.. may be in perl or python script

ADD REPLY
3
Entering edit mode

Is this a homework question (since you keep asking for perl or python solution)? If all you need is a viable solution then you have plenty to choose from already.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Maybe, but not guaranteed: it might also 'fold' the description/fasta identifier and that's definitely not what you want.

See this reaction in this same thread: C: single line fasta to mult line fasta

ADD REPLY
9
Entering edit mode
8.4 years ago
GenoMax 148k

Since the OP has asked for a solution that will work on windows, reformat.sh from BBMap can be used like so (NN = number of characters you want on one line)

reformat.sh in=your_file.fa out=new_file.fa fastawrap=NN

On windows you would use it like so (adjust windows file path for BBMap install)

java -ea -Xmx200m cp=C:\bbmap\current\ jgi.ReformatReads in=x.fa out=y.fa fastawrap=NN
ADD COMMENT
5
Entering edit mode
8.4 years ago
Medhat 9.8k

from this

https://github.com/jimhester/fasta_utilities

you can use wrap.pl "limits FASTA lines to 80 characters"

Or you can use from here

http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage

fasta_formatter

fasta_formatter -i yourfile.fasta -w 80 -o yourout.fasta

[-w N] = max. sequence line width for output FASTA file. When ZERO (the default), sequence lines will NOT be wrapped - all nucleotides of each sequences will appear on a single line (good for scripting).

advice: If you will work in bioinformatics try any Linux flavor

ADD COMMENT
0
Entering edit mode

but both fasta_utilities and fastx_toolkit are large packages that need to be installed for this. I will prefer a small perl/ python code..

ADD REPLY
5
Entering edit mode
8.4 years ago
st.ph.n ★ 2.7k

Here's a quick one in python:

import sys

inputfile = sys.argv[1]
length = int(sys.argv[2])

print '\nConverting ' + inputfile + ' to ' + str(length) + ' chars per line...'

with open(inputfile, 'r') as f:

     h = []                                             #headers list
     s = []                                             #sequence list
     for line in f:
           if line.startswith(">"):                      #grab headers
               h.append(line.strip().split(">")[1])
          else:
               s.append(line.strip())                   #grab sequences

myseqs = dict(zip(h,s))                                 #map to dictionary, header:sequence

outfile = open(inputfile.split(".fasta")[0] + '_multi-line.fasta', 'w') #open outfile for writing

def print_seqs(header, sequence):
        print >> outfile, '>' + header
        while len(sequence) > 0:
                print >> outfile, sequence[:length]
                sequence = sequence[length:]

for i in myseqs:
        print_seqs(i, myseqs[i])
print 'Converted ' + str(len(myseqs)) + ' sequences.'
  print 'Done.\n'

outfile.close()

Run as 'python your_input_fasta Nchars'

EDIT

A cleaner, more memory efficient solution, per WouterDeCoster comment:

import sys

inputfile = sys.argv[1]
length = int(sys.argv[2])

outfile = open(inputfile.split(".fasta")[0] + '_multi-line.fasta', 'w') #open outfile for writing

with open(inputfile, 'r') as f:
     for line in f:
        if line.startswith(">"):
                print >> outfile, line.strip()
        else:
                sequence = line.strip()
                while len(sequence) > 0:
                        print >>outfile, sequence[:length]
                        sequence = sequence[length:]
ADD COMMENT
1
Entering edit mode

Your code is not really bad, but there is plenty of opportunity for improvement. To start, there is no need to keep everything in memory. That's a bad practice, imagine doing this on a smaller computer with a huge fasta file... Streaming is better. Definitely more memory efficient, very likely quicker than your implementation. You could just read a line, if it's a header write to output file. If it's a sequence, write in chunks of n characters.

ADD REPLY
1
Entering edit mode

@WouterDeCoster, I wrote it the way it is so that the OP can understand what's going on (maybe learn something). I doubt he's using genome size fa files, or if he was, needn't be asking the question in the first place. Otherwise, everything would be contained in the first 'with open' statement of the input file, and not hold everything in memory in both lists, and a dict.

ADD REPLY
1
Entering edit mode

I definitely like your second script. I would say this is as easy to follow as the first (without using a dict etc) for an inexperienced user, but I'm obviously biased.

Just two more comments, if you don't mind:

  • os.path.splitext(inputfile)[0] would be a cleaner way to get the name of the outputfile, or alternatively inputfile.replace('.fasta', '_multiline.fasta'). os.path.splitext(inputfile)[0] has the additional advantage that if the file is .fa instead of .fasta things will still be alright.
  • The print >> synthax is not portable to python 3. You should write outfile.write(line). While python 2 is still around, it will not be supported forever and if you plan to write more scripts, try to write in python3. The >> synthax is even no longer described in the official python documentation of both python2.7 and python3.
ADD REPLY
4
Entering edit mode
8.4 years ago

if your data can fit in memory, here is a simple HTML form that should do the job:

ADD COMMENT
0
Entering edit mode

This is pretty amazing, would never have thought about using html 0.o

ADD REPLY
1
Entering edit mode

Yes. Pierre always could provide some amazing solution. The most smart guy in the world.

ADD REPLY
0
Entering edit mode

Hi Pierre Lindenbaum,

This is amazing script. Thanks for the sharing.

Gopal

ADD REPLY
2
Entering edit mode
8.4 years ago
Guangchuang Yu ★ 2.6k

seqmagick:

library(seqmagick)
fa_read(INFILE) %>% fa_write(OUTFILE, type = "interleaved")
ADD COMMENT
0
Entering edit mode

Is there any perl/python equivalent code?

ADD REPLY
12
Entering edit mode
8.4 years ago
venu 7.1k

I would use fold command

fold -w 60 file.fa > out.fa

Check man fold

ADD COMMENT
4
Entering edit mode

fold will wrongly fold the header line when it's too long, for example:

$ fold -w 60 dataset_A.fa | cat -A | more
>ADMP01000001 Bacteroides xylanisolvens SD CC 2a contig00137$
, whole genome shotgun sequence. [Bacteroides xylanisolvens $
SD CC 2a]$
GCATGGACCCTTAAACCTACTATATAGTCTATACTAATAATACTCTCTGTTTAACCCATA$
ACAATGCGACAGACTTAAATCCCATATAAATAAAAAAAACGAGCTAATTCTTTTTATAGA$
ACCACTCGCCACCGCATCAGAAAAGCCTAAAGGCTTTTCCATAGTACCTCAAATTTCCTG$
ADD REPLY
0
Entering edit mode

He is using Windows™ from Microsoft Corp...

ADD REPLY
0
Entering edit mode

This answer is perfect and simplest as long as the header is not too long.

ADD REPLY
5
Entering edit mode
8.4 years ago

Since you are using Windows, I highly recommend seqkit, which is cross-platform and ultrafast!

It's light weight (~8M) and out-of-the-box, no dependencies, no compilation, no configuration. Just download compressed executable file of your operating system, and uncompress it and then use in command line terminal.

For your case, just:

seqkit seq -w 60 seqs.fa > seqs2.fa
ADD COMMENT
0
Entering edit mode

Thank you so much, I've tried all the other tools in this page (i needed a linux solution) and this is the easiest and the only one that gave me what i needed. Thanks!

ADD REPLY
0
Entering edit mode
6.9 years ago
conchoecia • 0

A solution using bioawk and fold. This avoids fold's incorrect behavior to fold long fasta headers. This should work with fasta files that contain multiple sequences.

Awk people - I had to use echo twice since printf didn't place the newline in the correct position after the seqname...

bioawk -cfastx '{system("printf \">%s\"" $name " >> multiline.fasta"); system("echo '' >> multiline.fasta"); system("echo " $seq " | fold -w 60 - >> multiline.fasta") }' singleline.fasta
ADD COMMENT

Login before adding your answer.

Traffic: 4330 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6