single line fasta to mult line fasta

1

Entering edit mode

8.8 years ago

User 6777 ▴ 20

I know for the reverse procedure, there are lots of methods available. But how can I change my fasta file with sequences like:

>gi|189094002
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSRQQTKITFNRKNIAADDDEDDDDFFD
>gi|68485955
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSRVKIVRRDSQQKQQTKITFNRKNIAGDDEDDDDFFD

to like:

>gi|189094002
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSR
QQTKITFNRKNIAADDDEDDDDFFD
>gi|68485955
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSR
VKIVRRDSQQKQQTKITFNRKNIAGDDEDDDDFFD

I know its just about inserting line breaks in each sequence, but is there any way that I can check whether the fasta file is single line fasta or not, and if is, add line breaks in each sequence..

Ps. I am using windows..

Thanks for your consideration

fasta • 24k views

ADD COMMENT • link updated 7.4 years ago by conchoecia • 0 • written 8.8 years ago by User 6777 ▴ 20

4

Entering edit mode

If you plan to do more bioinformatics, do yourself (and us) a favor a switch to linux. Things will get so much easier then.

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

0

Entering edit mode

fasta_formatter from fastx tookit can do this (http://hannonlab.cshl.edu/fastx_toolkit/) -w specifies the width

ADD REPLY • link 8.8 years ago by microfuge ★ 2.0k

0

Entering edit mode

Venu, I want exactly the equivalent of fold command for windows.. may be in perl or python script

ADD REPLY • link 8.8 years ago by User 6777 ▴ 20

3

Entering edit mode

Is this a homework question (since you keep asking for perl or python solution)? If all you need is a viable solution then you have plenty to choose from already.

ADD REPLY • link 8.8 years ago by GenoMax 151k

0

Entering edit mode

Will this work?

https://en.wikipedia.org/wiki/Fold_(Unix)

ADD REPLY • link 7.7 years ago by Akliao • 0

0

Entering edit mode

Maybe, but not guaranteed: it might also 'fold' the description/fasta identifier and that's definitely not what you want.

See this reaction in this same thread: C: single line fasta to mult line fasta

ADD REPLY • link 7.7 years ago by WouterDeCoster 48k

9

Entering edit mode

8.8 years ago

GenoMax 151k

Since the OP has asked for a solution that will work on windows, reformat.sh from BBMap can be used like so (NN = number of characters you want on one line)

reformat.sh in=your_file.fa out=new_file.fa fastawrap=NN

On windows you would use it like so (adjust windows file path for BBMap install)

java -ea -Xmx200m cp=C:\bbmap\current\ jgi.ReformatReads in=x.fa out=y.fa fastawrap=NN

ADD COMMENT • link updated 8.8 years ago by Brian Bushnell 20k • written 8.8 years ago by GenoMax 151k

5

Entering edit mode

8.8 years ago

Medhat 9.8k

from this

https://github.com/jimhester/fasta_utilities

you can use wrap.pl "limits FASTA lines to 80 characters"

Or you can use from here

http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage

fasta_formatter

fasta_formatter -i yourfile.fasta -w 80 -o yourout.fasta

[-w N] = max. sequence line width for output FASTA file. When ZERO (the default), sequence lines will NOT be wrapped - all nucleotides of each sequences will appear on a single line (good for scripting).

advice: If you will work in bioinformatics try any Linux flavor

ADD COMMENT • link 8.8 years ago by Medhat 9.8k

0

Entering edit mode

but both fasta_utilities and fastx_toolkit are large packages that need to be installed for this. I will prefer a small perl/ python code..

ADD REPLY • link 8.8 years ago by User 6777 ▴ 20

5

Entering edit mode

8.8 years ago

st.ph.n ★ 2.7k

Here's a quick one in python:

import sys

inputfile = sys.argv[1]
length = int(sys.argv[2])

print '\nConverting ' + inputfile + ' to ' + str(length) + ' chars per line...'

with open(inputfile, 'r') as f:

     h = []                                             #headers list
     s = []                                             #sequence list
     for line in f:
           if line.startswith(">"):                      #grab headers
               h.append(line.strip().split(">")[1])
          else:
               s.append(line.strip())                   #grab sequences

myseqs = dict(zip(h,s))                                 #map to dictionary, header:sequence

outfile = open(inputfile.split(".fasta")[0] + '_multi-line.fasta', 'w') #open outfile for writing

def print_seqs(header, sequence):
        print >> outfile, '>' + header
        while len(sequence) > 0:
                print >> outfile, sequence[:length]
                sequence = sequence[length:]

for i in myseqs:
        print_seqs(i, myseqs[i])
print 'Converted ' + str(len(myseqs)) + ' sequences.'
  print 'Done.\n'

outfile.close()

Run as 'python your_input_fasta Nchars'

EDIT

A cleaner, more memory efficient solution, per WouterDeCoster comment:

import sys

inputfile = sys.argv[1]
length = int(sys.argv[2])

outfile = open(inputfile.split(".fasta")[0] + '_multi-line.fasta', 'w') #open outfile for writing

with open(inputfile, 'r') as f:
     for line in f:
        if line.startswith(">"):
                print >> outfile, line.strip()
        else:
                sequence = line.strip()
                while len(sequence) > 0:
                        print >>outfile, sequence[:length]
                        sequence = sequence[length:]

ADD COMMENT • link 8.8 years ago by st.ph.n ★ 2.7k

1

Entering edit mode

Your code is not really bad, but there is plenty of opportunity for improvement. To start, there is no need to keep everything in memory. That's a bad practice, imagine doing this on a smaller computer with a huge fasta file... Streaming is better. Definitely more memory efficient, very likely quicker than your implementation. You could just read a line, if it's a header write to output file. If it's a sequence, write in chunks of n characters.

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

1

Entering edit mode

@WouterDeCoster, I wrote it the way it is so that the OP can understand what's going on (maybe learn something). I doubt he's using genome size fa files, or if he was, needn't be asking the question in the first place. Otherwise, everything would be contained in the first 'with open' statement of the input file, and not hold everything in memory in both lists, and a dict.

ADD REPLY • link 8.8 years ago by st.ph.n ★ 2.7k

1

Entering edit mode

I definitely like your second script. I would say this is as easy to follow as the first (without using a dict etc) for an inexperienced user, but I'm obviously biased.

Just two more comments, if you don't mind:

os.path.splitext(inputfile)[0] would be a cleaner way to get the name of the outputfile, or alternatively inputfile.replace('.fasta', '_multiline.fasta'). os.path.splitext(inputfile)[0] has the additional advantage that if the file is .fa instead of .fasta things will still be alright.
The print >> synthax is not portable to python 3. You should write outfile.write(line). While python 2 is still around, it will not be supported forever and if you plan to write more scripts, try to write in python3. The >> synthax is even no longer described in the official python documentation of both python2.7 and python3.

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

4

Entering edit mode

8.8 years ago

Pierre Lindenbaum 166k

if your data can fit in memory, here is a simple HTML form that should do the job:

	<html xmlns='http://www.w3.org/1999/xhtml'>

	<body>
	<form>
	<label for="ZmlsZXMKfiles">Select a FASTA file</label>:<input type="file" id="ZmlsZXMKfiles" multiple="true"/>
	</form>
	<pre id="ac921c19f1aff"></pre>
	<script type="application/ecmascript">


	function handleFileSelectFasta(evt)
	{

	var files = evt.target.files; // FileList object

	if(files.length==0) return;
	var pre= document.getElementById("ac921c19f1aff");
	while(pre.hasChildNodes()) pre.removeChild( pre.firstChild );

	for(var i=0;i < files.length;++i)
	{
	var reader = new FileReader();
	reader.onprogress=function(evt) {}
	reader.onloadend=function(e)
	{
	if(e.target.result==null) return;
	var lines=e.target.result.split(/[\n\r]/);

	var dna="";
	var title="";
	var i=0;
	var line;
	for(;;)
	{
	if(i==lines.length \|\| (line=lines[i])[0]=='>')
	{
	if(dna.length != 0)
	{
	pre.appendChild(document.createTextNode(title+"\n"));
	while(dna.length != 0)
	{
	var n=Math.min(dna.length,50);
	pre.appendChild(document.createTextNode(dna.substring(0,n)+"\n"));
	dna= dna.substring(n);
	}
	}
	if(i===lines.length) break;
	title=line;
	dna="";
	}
	else
	{
	dna+=line.trim();
	}
	++i;
	}
	};
	reader.readAsText(files[i]);
	}
	}

	document.getElementById('ZmlsZXMKfiles').addEventListener('change', handleFileSelectFasta, false);
	</script>

	</body>
	</html>

view raw biostar205745.html hosted with ❤ by GitHub

ADD COMMENT • link 8.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

This is pretty amazing, would never have thought about using html 0.o

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

1

Entering edit mode

Yes. Pierre always could provide some amazing solution. The most smart guy in the world.

ADD REPLY • link 8.8 years ago by Shicheng Guo ★ 9.6k

0

Entering edit mode

Hi Pierre Lindenbaum,

This is amazing script. Thanks for the sharing.

Gopal

ADD REPLY • link 7.1 years ago by Gopal Joshi • 0

2

Entering edit mode

8.8 years ago

Guangchuang Yu ★ 2.6k

seqmagick:

library(seqmagick)
fa_read(INFILE) %>% fa_write(OUTFILE, type = "interleaved")

ADD COMMENT • link 8.8 years ago by Guangchuang Yu ★ 2.6k

0

Entering edit mode

Is there any perl/python equivalent code?

ADD REPLY • link 8.8 years ago by User 6777 ▴ 20

12

Entering edit mode

8.8 years ago

venu 7.1k

I would use fold command

fold -w 60 file.fa > out.fa

Check man fold

ADD COMMENT • link 8.8 years ago by venu 7.1k

4

Entering edit mode

fold will wrongly fold the header line when it's too long, for example:

$ fold -w 60 dataset_A.fa | cat -A | more
>ADMP01000001 Bacteroides xylanisolvens SD CC 2a contig00137$
, whole genome shotgun sequence. [Bacteroides xylanisolvens $
SD CC 2a]$
GCATGGACCCTTAAACCTACTATATAGTCTATACTAATAATACTCTCTGTTTAACCCATA$
ACAATGCGACAGACTTAAATCCCATATAAATAAAAAAAACGAGCTAATTCTTTTTATAGA$
ACCACTCGCCACCGCATCAGAAAAGCCTAAAGGCTTTTCCATAGTACCTCAAATTTCCTG$

ADD REPLY • link 8.8 years ago by shenwei356 8.7k

0

Entering edit mode

He is using Windows™ from Microsoft Corp...

ADD REPLY • link 8.8 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

This answer is perfect and simplest as long as the header is not too long.

ADD REPLY • link 5.3 years ago by scchess ▴ 640

5

Entering edit mode

8.8 years ago

shenwei356 8.7k

Since you are using Windows, I highly recommend seqkit, which is cross-platform and ultrafast!

It's light weight (~8M) and out-of-the-box, no dependencies, no compilation, no configuration. Just download compressed executable file of your operating system, and uncompress it and then use in command line terminal.

For your case, just:

seqkit seq -w 60 seqs.fa > seqs2.fa

ADD COMMENT • link 8.8 years ago by shenwei356 8.7k

0

Entering edit mode

Thank you so much, I've tried all the other tools in this page (i needed a linux solution) and this is the easiest and the only one that gave me what i needed. Thanks!

ADD REPLY • link 6.6 years ago by deboraguerrapeixe • 0

0

Entering edit mode

7.4 years ago

conchoecia • 0

A solution using bioawk and fold. This avoids fold's incorrect behavior to fold long fasta headers. This should work with fasta files that contain multiple sequences.

Awk people - I had to use echo twice since printf didn't place the newline in the correct position after the seqname...

bioawk -cfastx '{system("printf \">%s\"" $name " >> multiline.fasta"); system("echo '' >> multiline.fasta"); system("echo " $seq " | fold -w 60 - >> multiline.fasta") }' singleline.fasta

ADD COMMENT • link 7.4 years ago by conchoecia • 0

Login before adding your answer.