Question

Convert multiple-sequence fasta file to single long sequence

0

9.8 years ago

aberry814 ▴ 80

I have a FASTA file containing millions of sequences and I want a simple script to convert this file into one long sequence, i.e., delete all headers and remove any spaces and line breaks. I can always add a ">seq_name" line at the top afterwards, so keeping the first header is not necessary.

I've searched the forums but can only find scripts that do the reverse. I'm using millions of reads as a substitute for a complete genome, and my current pipeline cannot reconcile this, so I want to trick it into thinking that this is one long genome sequence.

Thanks for any help!!!

sequence • 14k views

ADD COMMENT • link updated 7.4 years ago by 201314918 • 0 • written 9.8 years ago by aberry814 ▴ 80

3

Homework?

grep -v "^>" test.fasta | awk 'BEGIN { ORS=""; print ">Sequence_name\n" } { print }' > new.fasta

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.8 years ago by GouthamAtla 12k

0

Haha, not homework. Actual work being attempted by a below-average programmer (me).

This appears to work perfectly, thanks!

ADD REPLY • link 9.8 years ago by aberry814 ▴ 80

1

This removes all line breaks as well.

ADD REPLY • link 9.8 years ago by GouthamAtla 12k
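
To make the effect of the grep/awk one-liner above concrete, here is a minimal sketch on a made-up two-record file. The file contents are invented for illustration, and the `END` block is an addition that terminates the sequence line with a newline, which the original command does not do.

```shell
# Build a tiny two-record FASTA file (made-up data for illustration).
printf '>read1\nACGT\nAC\n>read2\nGGTT\n' > test.fasta

# Drop the header lines, then print the remaining lines with no separator
# between them (ORS=""). The END block is an addition: it ends the single
# sequence line with a newline.
grep -v '^>' test.fasta |
  awk 'BEGIN { ORS = ""; print ">Sequence_name\n" } { print } END { print "\n" }' > new.fasta

cat new.fasta
```

The result is a two-line file: the new header, followed by all residues from both records on one line.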

0

Hi guys, I also want to remove the line breaks in a multiline FASTA file, but I can't get it to work. Can anyone clarify this for me? I am very new to bioinformatics. Thanks in advance.

ADD REPLY • link 7.4 years ago by 201314918 • 0

0

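Using seqkit (-w0 sets the output line width to 0, i.e. no wrapping, so each record's sequence ends up on a single line):

$ seqkit seq -w0 input.fa

Please ask this as a new question, and try one or all of the solutions provided above before posting.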

ADD REPLY • link 7.4 years ago by cpad0112 21k
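
If seqkit is not at hand, the same kind of unwrapping (one sequence line per record, headers kept) can be sketched in plain awk. This is an illustrative stand-in rather than what seqkit does internally; the file names and data are made up, and Windows line endings are not handled.

```shell
# Made-up wrapped FASTA with two records.
printf '>a desc\nACGT\nAC\n>b\nGG\nTT\n' > input.fa

awk '
  # Header line: end the previous sequence line (if any), then print the header.
  /^>/ { if (NR > 1) printf "\n"; print; next }
  # Sequence line: append it to the current line without a newline.
  { printf "%s", $0 }
  # Finish the last sequence line.
  END { if (NR > 0) printf "\n" }
' input.fa > unwrapped.fa

cat unwrapped.fa
```

Unlike the concatenation asked for in the original question, this keeps the records separate and only joins the lines within each record.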

2

9.8 years ago

Brian Bushnell 20k

An alternative, from BBTools:

fuse.sh in=sequences.fa out=fused.fa pad=0 fastawrap=2000000000

Note that this will fail when the length of the output sequence approaches 2 billion bases. But most programs will fail on unwrapped FASTA lines exceeding 2 Gbp anyway, so I don't really care about that. "pad" sets how many Ns are inserted between the original sequences, so pad=0 inserts none.

It's generally better practice to write or use programs that can handle wrapped FASTA than to convert FASTA to unwrapped before loading it. But there are always exceptions.

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 9.8 years ago by Brian Bushnell 20k
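
To illustrate what padding with Ns between sequences means, here is a rough awk sketch. It is not fuse.sh and makes no claim to match its behaviour in edge cases; the `pad` awk variable, the `>fused` header and the input data are all assumptions made for this example.

```shell
# Made-up input with three records.
printf '>r1\nACGT\n>r2\nGG\nTT\n>r3\nCC\n' > sequences.fa

awk -v pad=3 '
  # Build the spacer of pad Ns once and emit a single (hypothetical) header.
  BEGIN { for (i = 0; i < pad; i++) spacer = spacer "N"; print ">fused" }
  # At every header after the first, emit the spacer instead of the header.
  /^>/  { if (seen) printf "%s", spacer; seen = 1; next }
  # Sequence lines are appended without newlines.
        { printf "%s", $0 }
  END   { printf "\n" }
' sequences.fa > fused.fa

cat fused.fa
```

With pad=3 the three records are joined with NNN between them; with pad=0 the spacer is empty and the sequences are simply concatenated.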

0

9.8 years ago

Malcolm.Cook ★ 1.5k

Here's a perl one-liner:

perl -n -e 'print if 1 == $. || ! m/^>/' test.fa > out.fa

or, to stream edit destructively in-place:

perl -n -i -e 'print if 1 == $. || ! m/^>/' test.fa

Note that this keeps the first line (the top header) and does not delete newlines or whitespace, but is that really needed by your downstream process?

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 9.8 years ago by Malcolm.Cook ★ 1.5k
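
For readers who prefer awk to perl, the same filter (keep line 1, drop every other header line) can be sketched as follows. The file contents are made up, and like the perl version it assumes the first line of the file is a header.

```shell
# Made-up input with two records.
printf '>first\nACGT\n>second\nGGTT\n' > test.fa

# NR == 1 keeps the very first line; !/^>/ keeps every non-header line.
awk 'NR == 1 || !/^>/' test.fa > out.fa

cat out.fa
```

The sequence lines stay wrapped as they were; only the second and later headers disappear.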

Ram · Accepted Answer · 2015-08-03

7

9.8 years ago

kloetzl ★ 1.1k

$ cat multi.fasta | grep -v '^>' | grep '^.' | tr -d '[:blank:]' | cat <( echo '>seq_name') - > all.fasta

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 9.8 years ago by kloetzl ★ 1.1k

1

Thanks! This works well, except that it doesn't delete the line breaks (easy enough to do after the fact).

ADD REPLY • link 9.8 years ago by aberry814 ▴ 80

0

This way it deletes the newlines as well:

cat multi.fasta | grep -v '^>' | grep '^.' | tr -d '[:blank:]' | tr -d '\n' | cat <( echo '>seq_name') - > multi_concat.fasta

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 5.6 years ago by chefarov ▴ 170
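
The `<( ... )` process substitution above needs bash or a similar shell. A variant that should also run under a plain POSIX sh can be sketched with a command group instead. The input data here is made up; `[:space:]` is used so that carriage returns from Windows-style files are removed too, and the final `echo` ends the sequence line with a newline.

```shell
# Made-up input containing a space, a blank line, a tab and a carriage return.
printf '>r1\nAC GT\n\n>r2\nGG\tTT\r\n' > multi.fasta

{
  echo '>seq_name'                                 # the single new header
  grep -v '^>' multi.fasta | tr -d '[:space:]'     # all residues, no whitespace
  echo                                             # terminate the sequence line
} > multi_concat.fasta

# Sanity check: count the residues in the concatenated sequence.
grep -v '^>' multi_concat.fasta | tr -d '\n' | wc -c
```

Comparing that count with the total residue count of the input is a quick way to confirm that nothing was lost in the conversion.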