Convert multiple-sequence fasta file to single long sequence
3
0
9.3 years ago
aberry814 ▴ 80

I have a FASTA file containing millions of sequences and I want a simple script to convert this file into one long sequence, i.e. delete all headers and remove any spaces and line breaks. I can always add a ">seq_name" to the first line afterwards, so keeping the top header is not necessary.

I've searched the forums but can only find scripts that do the reverse. I'm using millions of reads as a substitute for a complete genome, and my current pipeline cannot handle that, so I want to trick it into thinking that this is one long genome sequence.

Thanks for any help!!!

sequence • 13k views
3

Homework?

grep -v "^>" test.fasta | awk 'BEGIN { ORS=""; print ">Sequence_name\n" } { print }' > new.fasta
0

Haha, not homework. Actual work being attempted by a below-average programmer (me).

This appears to work perfectly, thanks!

1

This removes all line breaks as well.

0

Hi guys, I also want to remove the line breaks in a multiline FASTA file, but I can't get it to work. Can anyone clarify this for me? I am very new to bioinformatics. Thanks in advance.

0

using seqkit:

$ seqkit seq -w0 input.fa

Please post this as a new question, and try one or all of the solutions provided above before posting.

7
9.3 years ago
kloetzl ★ 1.1k
$ cat multi.fasta | grep -v '^>' | grep '^.' | tr -d '[:blank:]' | cat <( echo '>seq_name') - > all.fasta
1

Thanks! This works well, except that it doesn't delete the line breaks (easy enough to do after the fact).

0

This way it deletes the newlines as well:

cat multi.fasta | grep -v '^>' | grep '^.' | tr -d '[:blank:]' | tr -d '\n' | cat <( echo '>seq_name') - > multi_concat.fasta
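
For anyone who prefers a single pass, here is a minimal sketch of the same idea in one awk call. It is not from the posts above: the function name `fuse_fasta` and the demo file names are made up for illustration, and it assumes the fused sequence fits on one line for whatever tool reads it. Unlike the pipeline above, it ends the sequence with a newline.

```shell
# Sketch only: fuse_fasta and the demo_* file names are hypothetical.
fuse_fasta() {
    # $1 = input multi-FASTA, $2 = name for the single output header
    awk -v name="$2" '
        BEGIN { print ">" name }                        # write the one header line
        /^>/  { next }                                  # skip every original header
        { gsub(/[ \t\r]/, ""); printf "%s", $0 }        # strip blanks/CRs, join lines
        END   { print "" }                              # finish with a single newline
    ' "$1"
}

# Tiny demo input with a space and an empty line in it
printf '>r1\nACGT\nAC GT\n\n>r2\nTTAA\n' > demo_multi.fasta
fuse_fasta demo_multi.fasta seq_name > demo_fused.fasta
cat demo_fused.fasta
```

On the demo input this should print the header `>seq_name` followed by one line, `ACGTACGTTTAA`.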
2
9.3 years ago

An alternative, from BBTools:

fuse.sh in=sequences.fa out=fused.fa pad=0 fastawrap=2000000000

Note that this will fail when the length of the output sequence approaches 2 billion bases. But most programs will fail on unwrapped FASTA lines exceeding 2 Gbp anyway, so I don't really care about that. "pad" sets how many Ns are put in between the original sequences.

It's generally better practice to write or use programs that can handle wrapped FASTA than to convert FASTA to unwrapped before loading it. But there are always exceptions.
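
To make the effect of "pad" concrete, here is a rough awk sketch of what padding with Ns between records looks like. It only imitates the behaviour described above and is not how fuse.sh is implemented; the function name `pad_fuse`, the header `>fused`, and the demo file are assumptions for illustration.

```shell
# Sketch only: pad_fuse, ">fused" and demo_pad.fasta are hypothetical names.
pad_fuse() {
    # $1 = input multi-FASTA, $2 = number of Ns to insert between records
    awk -v pad="$2" '
        BEGIN {
            print ">fused"
            for (i = 0; i < pad; i++) gap = gap "N"     # build the spacer once
        }
        /^>/ {
            if (seen) printf "%s", gap                  # spacer before every record but the first
            seen = 1
            next
        }
        { gsub(/[ \t\r]/, ""); printf "%s", $0 }        # append sequence without newlines
        END { print "" }
    ' "$1"
}

printf '>r1\nACGT\nAC\n>r2\nTTAA\n>r3\nGG\n' > demo_pad.fasta
pad_fuse demo_pad.fasta 3    # sequence line: ACGTACNNNTTAANNNGG
pad_fuse demo_pad.fasta 0    # sequence line: ACGTACTTAAGG
```

A non-zero pad keeps read boundaries visible to downstream tools, which may matter if the pipeline would otherwise report matches spanning two unrelated reads.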

0
9.3 years ago
Malcolm.Cook ★ 1.5k

Here's a perl one-liner:

perl -n -e 'print if 1 == $. || ! m/^>/' test.fa > out.fa

or, to stream edit destructively in-place:

perl -n -i -e 'print if 1 == $. || ! m/^>/' test.fa

This keeps the first header and drops all the others. It also does not delete newlines or whitespace, but is that really needed by your downstream process?
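
Whichever approach you use, a quick sanity check is to compare the header count and total base count before and after. The sketch below is an illustration only: the helper names and demo files are made up, and the awk filter stands in for the Perl one-liner above (same logic: keep line 1, drop every other header line) so the example runs without Perl.

```shell
# Sketch only: count_headers, count_bases and the demo_* files are hypothetical.
count_headers() { grep -c '^>' "$1"; }
count_bases()   { grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '; }

printf '>r1\nACGT\nACGT\n>r2\nTTAA\n' > demo_in.fa

# Stand-in for: perl -n -e 'print if 1 == $. || ! m/^>/' demo_in.fa > demo_out.fa
awk 'NR == 1 || !/^>/' demo_in.fa > demo_out.fa

echo "headers: $(count_headers demo_in.fa) -> $(count_headers demo_out.fa)"
echo "bases:   $(count_bases demo_in.fa) -> $(count_bases demo_out.fa)"
```

The header count should drop to 1 while the base count stays the same; if the base count changes, something other than headers was removed.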
